Details on searchStorage

The searchStorage tool provides a command line interface for launching similarity searches. Its recommended use case is searching against precomputed descriptors (prepared with buildStorage; detailed in Basic similarity search workflow); however, on-the-fly descriptor generation and custom descriptor import are also supported.

High level overview

Descriptors can be acquired from various sources. These sources are mutually exclusive: currently an in-memory descriptor storage can be read from only one source. Note that a storage with IDs can also be loaded or imported; it is used for formatting the search results.

Highest level dataflow

The task of searchStorage is to collect query and target descriptors, search the queries against the targets, and finally process the search results. The in-memory descriptors can be read or imported from various sources.

High level data flow of tool searchStorage

Descriptors from binary file

In-memory descriptors can be deserialized from binary files prepared by buildStorage or importStorage. These binary files also store the context that was used, which is needed for the search. IDs can also be read from binary files; they are used for printing the results. When no ID source is specified, a generated ID storage that represents the indices as IDs is used. Currently queries cannot be read from a binary file.

Reading descriptors from binary files

Descriptors calculated for molecules

Molecules can be parsed and descriptors calculated for them. IDs can be extracted from the molecule name or from an SD property. When no ID is specified, a generated ID storage with indices as IDs is used. A context is needed to generate descriptors from molecules; in this example the context is specified with command line options. Note that when target descriptors are read from a binary file, the stored context is used to parse the query descriptors. Molecules can be read from a file or specified inline as command line arguments.

Calculating descriptors from molecules

Descriptors imported from text

Descriptors can be imported from a text source. In this case a context describing the textual format must be specified. The text source can be stored in an input file or specified inline as command line arguments. Each text line is parsed into a descriptor using the specified context. When an ID splitter is specified, a part of each input text line can be used as the ID.
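The idea of splitting an ID off each text line can be sketched in Python. This is only an illustration: the tab-separated layout, the bit string descriptor format and the function name parse_line are assumptions made for this sketch, not the formats or APIs used by searchStorage.

```python
def parse_line(line, sep="\t"):
    """Split one text line into (id, descriptor bits).

    Assumed (hypothetical) layout: "<ID><sep><bit string>", e.g. "mol1\t1011".
    Lines without the separator carry a descriptor only, so the ID is None.
    """
    ident, found, text = line.rstrip("\n").partition(sep)
    if not found:
        # No separator: the whole line is the descriptor text.
        ident, text = None, ident
    bits = [ch == "1" for ch in text]
    return ident, bits


lines = ["mol1\t1011", "0110"]
parsed = [parse_line(line) for line in lines]
for ident, bits in parsed:
    print(ident, bits)
```

A caller would replace missing IDs (None here) with generated ones, in the same way the tool falls back to indices.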

Import descriptors from text

Detailed data flow

The following diagram gives an overview of the tool's internal data flow, composed from the details above, together with the relevant command line options.

Internal data flow of tool searchStorage

This chart shows the main data paths for collecting the queries, the targets and their IDs for searches. The command line options relevant for each data path segment are marked. Note that query and target IDs are always used for printing the results; when they are not specified, the query and target indices are used instead.
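The fallback from explicit IDs to indices can be sketched as follows. The function name resolve_ids is hypothetical, and the zero-based numbering is an assumption taken from the example outputs in this document, where the first query and target are printed as 0.

```python
def resolve_ids(descriptor_count, ids=None):
    """Return the IDs used for printing results.

    When no IDs are given, the 0-based descriptor indices are used instead.
    """
    if ids is None:
        return [str(i) for i in range(descriptor_count)]
    if len(ids) != descriptor_count:
        raise ValueError("exactly one ID is expected per descriptor")
    return list(ids)


generated = resolve_ids(3)
named = resolve_ids(1, ["cyclohexane"])
print(generated, named)
```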

Example: comparing inline molecules

The following command compares molecules specified as inline arguments:

bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -qm "C1CCCCC1 cyclohexane" \
    -tm "C1CCCCC1CC ethylcyclohexane"

The output:

Query   Target  Dissimilarity
0       0       0.2

Data flow of this example with unused paths removed:

Internal data flow of the simple inline molecule comparison

Example: comparing inline molecules with IDs

IDs can be imported and printed. Note that the tab size of the terminal is adjusted to keep the columns aligned:

tabs 20
bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -qm "C1CCCCC1 cyclohexane" -qidname \
    -tm "C1CCCCC1CC ethylcyclohexane" -tidname

The output:

Query              Target              Dissimilarity
cyclohexane        ethylcyclohexane    0.2

Data flow of this example with unused paths removed:

Internal data flow of the simple inline molecule comparison with IDs

Example: searching against on the fly computed targets

Note that compressed structure files are currently not supported, so gzip is used to decompress the structures and the targets are read from stdin. The tab size of the terminal is adjusted.

tabs 40
gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -qmf data/molecules/vitamins/vitamins.smi \
    -qidname \
    -tmf - \
    -tidprop COMMON_NAME

The output:

Query                                   Target                                  Dissimilarity
Vitamin A - Retinol                     Vitamin A                               0.0
Vitamin A - Retinal                     Alitretinoin                            0.14814814814814814
Vitamin A - beta-Carotene               1,3,3-trimethyl-2-[(1E,3E)-3-methylpenta-1,3-dien-1-yl]cyclohexene              0.02631578947368421
Vitamin B1 - Thiamine                   Thiamine                                0.0
Vitamin B2 - Riboflavin                 Riboflavin                              0.0
Vitamin B3 - Niacin                     Niacin                                  0.0
Vitamin B3 - Nicotinamide               Nicotinamide                            0.0
Vitamin B5 - Pantothenic acid           Pantothenic acid                        0.0
Vitamin B6 - Pyridoxine                 Pyridoxine                              0.0
Vitamin B6 - Pyridoxal                  Pyridoxal                               0.0
Vitamin B7 - Biotin                     Biotin                                  0.0
Vitamin B9 - Folic acid                 Folic Acid                              0.0
Vitamin B9 - Folinic acid               Leucovorin                              0.0
Vitamin B12 - Cyanocobalamin            Hydroxocobalamin                        0.10152284263959391
Vitamin B12 - Hydroxocobalamin          Hydroxocobalamin                        0.08740359897172237
Vitamin B12 - Methylcobalamin           Hydroxocobalamin                        0.08740359897172237
Vitamin C - Ascorbic acid               Vitamin C                               0.0
Vitamin D3 - Cholecalciferol            Cholecalciferol                         0.0
Vitamin D3 - Ergocalciferol             Ergocalciferol                          0.0
Vitamin E - alpha-Tocopherol            Vitamin E                               0.0
Vitamin E - beta-Tocopherol             Vitamin E                               0.0
Vitamin E - gamma-Tocopherol            Vitamin E                               0.04424778761061947
Vitamin E - delta-Tocopherol            Vitamin E                               0.09734513274336283
Vitamin E - alpha-Tocotrienol           Vitamin E                               0.22627737226277372
Vitamin E - beta-Trocotrienol           Vitamin E                               0.22627737226277372
Vitamin E - gamma-Trocotrienol          Vitamin E                               0.26277372262773724
Vitamin E - delta-Trocotrienol          Vitamin E                               0.30656934306569344
Vitamin K1 - Phylloquinone              Phylloquinone                           0.0
Vitamin K2 - Menatetrenone              Phylloquinone                           0.08181818181818182
Vitamin K2 - Menaquinone-7              Phylloquinone                           0.08181818181818182

Data flow of this example with unused paths removed:

Internal data flow for on the fly target descriptor calculation with IDs

Search modes

Search mode can be selected by option -mode <MODE>. The following examples use 6 target and 3 query molecules from the vitamins dataset:

head -6 data/molecules/vitamins/vitamins.smi > targets.smi
head -9 data/molecules/vitamins/vitamins.smi | tail -3 > queries.smi

Most similar search

Mode MOSTSIMILAR searches for the most similar target for each query. This is the default search mode. Option -maxdissim <THRESHOLD> limits the maximal dissimilarity returned.

head -6 data/molecules/vitamins/vitamins.smi > targets.smi
head -9 data/molecules/vitamins/vitamins.smi | tail -3 > queries.smi
tabs 35
bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -tmf targets.smi \
    -tidname \
    -qmf queries.smi \
    -qidname \
    -mode MOSTSIMILAR

The output:

Query                                   Target                                  Dissimilarity
Vitamin B3 - Nicotinamide               Vitamin B3 - Niacin                     0.37037037037037035
Vitamin B5 - Pantothenic acid           Vitamin B2 - Riboflavin                 0.8652849740932642
Vitamin B6 - Pyridoxine                 Vitamin B3 - Niacin                     0.569620253164557
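
The selection performed for a single query in this mode can be sketched in Python. This is an illustration under stated assumptions, not the tool's implementation: the function name most_similar is hypothetical, and treating the threshold as inclusive is an assumption. The dissimilarity values are copied from the FULLMATRIX example output in this document.

```python
def most_similar(row, max_dissim=None):
    """Return (target, dissimilarity) of the closest target, or None.

    row maps target IDs to dissimilarities for one query.
    max_dissim plays the role of -maxdissim: a best hit above it is dropped.
    """
    if not row:
        return None
    target = min(row, key=row.get)
    if max_dissim is not None and row[target] > max_dissim:
        return None
    return target, row[target]


# Dissimilarities of the query "Vitamin B3 - Nicotinamide".
nicotinamide = {
    "Vitamin A - Retinol": 0.9156626506024096,
    "Vitamin A - Retinal": 0.8888888888888888,
    "Vitamin A - beta-Carotene": 0.9210526315789473,
    "Vitamin B1 - Thiamine": 0.8269230769230769,
    "Vitamin B2 - Riboflavin": 0.8415300546448088,
    "Vitamin B3 - Niacin": 0.37037037037037035,
}
best = most_similar(nicotinamide)
rejected = most_similar(nicotinamide, max_dissim=0.3)
print(best, rejected)
```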

Most similars search

Mode MOSTSIMILARS searches for a maximum number of the most similar targets for each query. Option -count <COUNT> specifies the maximum number of targets to return for each query. Option -maxdissim <THRESHOLD> limits the maximal dissimilarity returned.

head -6 data/molecules/vitamins/vitamins.smi > targets.smi
head -9 data/molecules/vitamins/vitamins.smi | tail -3 > queries.smi
tabs 35
bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -tmf targets.smi \
    -tidname \
    -qmf queries.smi \
    -qidname \
    -mode MOSTSIMILARS \
    -count 2

The output:

Query                              Target                             Dissimilarity
Vitamin B3 - Nicotinamide          Vitamin B3 - Niacin                0.37037037037037035
Vitamin B3 - Nicotinamide          Vitamin B1 - Thiamine              0.8269230769230769
Vitamin B5 - Pantothenic acid      Vitamin B2 - Riboflavin            0.8652849740932642
Vitamin B5 - Pantothenic acid      Vitamin B1 - Thiamine              0.8823529411764706
Vitamin B6 - Pyridoxine            Vitamin B3 - Niacin                0.569620253164557
Vitamin B6 - Pyridoxine            Vitamin B1 - Thiamine              0.7976878612716763
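
The combined effect of -count and -maxdissim for one query can be sketched as a top-k selection. The function name most_similars and the inclusive threshold are assumptions of this sketch; the values are copied from the FULLMATRIX example output in this document.

```python
import heapq


def most_similars(row, count, max_dissim=None):
    """Return up to `count` (target, dissimilarity) pairs, closest first."""
    hits = [
        (dissim, target)
        for target, dissim in row.items()
        if max_dissim is None or dissim <= max_dissim
    ]
    return [(target, dissim) for dissim, target in heapq.nsmallest(count, hits)]


# Dissimilarities of the query "Vitamin B6 - Pyridoxine".
pyridoxine = {
    "Vitamin A - Retinol": 0.9047619047619048,
    "Vitamin A - Retinal": 0.9351851851851852,
    "Vitamin A - beta-Carotene": 0.9405940594059405,
    "Vitamin B1 - Thiamine": 0.7976878612716763,
    "Vitamin B2 - Riboflavin": 0.8208955223880597,
    "Vitamin B3 - Niacin": 0.569620253164557,
}
top2 = most_similars(pyridoxine, count=2)
top2_limited = most_similars(pyridoxine, count=2, max_dissim=0.6)
print(top2)
print(top2_limited)
```

With the threshold set, fewer than `count` hits may be returned, which is why the mode is described as returning at most that many targets.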

Fullmatrix with matrix format

Mode FULLMATRIX returns the result of all query-target comparisons. By default the textual output has a matrix format. In this default format option -maxdissim <THRESHOLD> has no effect.

head -6 data/molecules/vitamins/vitamins.smi > targets.smi
head -9 data/molecules/vitamins/vitamins.smi | tail -3 > queries.smi
tabs 30
bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -tmf targets.smi \
    -tidname \
    -qmf queries.smi \
    -qidname \
    -mode FULLMATRIX

The output:

Target                        Query Vitamin B3 - Nicotinamide dissimilarity	Query Vitamin B5 - Pantothenic acid dissimilarity	Query Vitamin B6 - Pyridoxine dissimilarity
Vitamin A - Retinol           0.9156626506024096	0.8850574712643678	0.9047619047619048
Vitamin A - Retinal           0.8888888888888888	0.8977272727272727	0.9351851851851852
Vitamin A - beta-Carotene     0.9210526315789473	0.927710843373494	0.9405940594059405
Vitamin B1 - Thiamine         0.8269230769230769	0.8823529411764706	0.7976878612716763
Vitamin B2 - Riboflavin       0.8415300546448088	0.8652849740932642	0.8208955223880597
Vitamin B3 - Niacin           0.37037037037037035	0.8953488372093024	0.569620253164557

Fullmatrix with list format

Mode FULLMATRIX can be used together with option -out-matrix-as-list to print a query-target-dissimilarity list similar to other search modes. In this case option -maxdissim <THRESHOLD> can be used to specify a dissimilarity threshold.

head -6 data/molecules/vitamins/vitamins.smi > targets.smi
head -9 data/molecules/vitamins/vitamins.smi | tail -3 > queries.smi
tabs 35
bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -tmf targets.smi \
    -tidname \
    -qmf queries.smi \
    -qidname \
    -mode FULLMATRIX \
    -out-matrix-as-list

The output:

Query                              Target                             Dissimilarity
Vitamin B3 - Nicotinamide          Vitamin A - Retinol                0.9156626506024096
Vitamin B3 - Nicotinamide          Vitamin A - Retinal                0.8888888888888888
Vitamin B3 - Nicotinamide          Vitamin A - beta-Carotene          0.9210526315789473
Vitamin B3 - Nicotinamide          Vitamin B1 - Thiamine              0.8269230769230769
Vitamin B3 - Nicotinamide          Vitamin B2 - Riboflavin            0.8415300546448088
Vitamin B3 - Nicotinamide          Vitamin B3 - Niacin                0.37037037037037035
Vitamin B5 - Pantothenic acid      Vitamin A - Retinol                0.8850574712643678
Vitamin B5 - Pantothenic acid      Vitamin A - Retinal                0.8977272727272727
Vitamin B5 - Pantothenic acid      Vitamin A - beta-Carotene          0.927710843373494
Vitamin B5 - Pantothenic acid      Vitamin B1 - Thiamine              0.8823529411764706
Vitamin B5 - Pantothenic acid      Vitamin B2 - Riboflavin            0.8652849740932642
Vitamin B5 - Pantothenic acid      Vitamin B3 - Niacin                0.8953488372093024
Vitamin B6 - Pyridoxine            Vitamin A - Retinol                0.9047619047619048
Vitamin B6 - Pyridoxine            Vitamin A - Retinal                0.9351851851851852
Vitamin B6 - Pyridoxine            Vitamin A - beta-Carotene          0.9405940594059405
Vitamin B6 - Pyridoxine            Vitamin B1 - Thiamine              0.7976878612716763
Vitamin B6 - Pyridoxine            Vitamin B2 - Riboflavin            0.8208955223880597
Vitamin B6 - Pyridoxine            Vitamin B3 - Niacin                0.569620253164557
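
The relation between the matrix and the list format can be sketched in Python. The function name matrix_as_list and the inclusive threshold are assumptions of this sketch; the values are a subset of the example output above.

```python
def matrix_as_list(matrix, max_dissim=None):
    """Flatten {query: {target: dissimilarity}} into (query, target, value) rows.

    Rows are emitted query by query; values above max_dissim are skipped.
    """
    rows = []
    for query, row in matrix.items():
        for target, dissim in row.items():
            if max_dissim is None or dissim <= max_dissim:
                rows.append((query, target, dissim))
    return rows


matrix = {
    "Vitamin B3 - Nicotinamide": {
        "Vitamin B2 - Riboflavin": 0.8415300546448088,
        "Vitamin B3 - Niacin": 0.37037037037037035,
    },
    "Vitamin B6 - Pyridoxine": {
        "Vitamin B2 - Riboflavin": 0.8208955223880597,
        "Vitamin B3 - Niacin": 0.569620253164557,
    },
}
all_rows = matrix_as_list(matrix)
close_rows = matrix_as_list(matrix, max_dissim=0.6)
for query, target, dissim in close_rows:
    print(f"{query}\t{target}\t{dissim}")
```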

Textual search results

The search results are printed to the standard output, or written to a file when option -out <FILE> is used. By default the dissimilarity values use Java Double formatting. A custom format can be specified with option -out-numeric-format <FORMAT>, which delegates to java.text.Format. The following example uses %.3f for a fixed precision of three decimal places:

# Find the 5 most similar structures using asymmetric tversky with on the fly descriptor calculation
tabs 25
gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -metric "tversky,coeffT:0.01,coeffQ:0.99" \
    -tmf - \
    -tidprop COMMON_NAME \
    -mode MOSTSIMILARS \
    -count 5 \
    -qm "O1CC1 epoxy" \
    -qidname \
    -out-numeric-format "%.3f"

The output:

Query                    Target                   Dissimilarity
epoxy                    Sevelamer                0.020
epoxy                    Colestipol               0.028
epoxy                    Fosfomycin               0.033
epoxy                    3-Oxiran-2ylalanine      0.037
epoxy                    R-Styrene Oxide          0.040
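
The effect of an asymmetric Tversky metric and of a fixed numeric format can be sketched in Python. Everything here is an assumption made for illustration: the textbook Tversky index is used, dissimilarity is taken as one minus similarity, coeff_q is taken to weight features found only in the query and coeff_t those found only in the target, and fingerprints are modelled as sets of bit positions. The tool's actual definition may differ.

```python
def tversky_dissimilarity(query_bits, target_bits, coeff_q, coeff_t):
    """Return 1 - Tversky similarity of two feature sets (assumed definition)."""
    q, t = set(query_bits), set(target_bits)
    common = len(q & t)
    denom = common + coeff_q * len(q - t) + coeff_t * len(t - q)
    if denom == 0:
        return 0.0  # both sets empty: treated as identical (assumption)
    return 1.0 - common / denom


# A small query whose features are all contained in a larger target.
query = {1, 2, 3}
target = {1, 2, 3, 4, 5, 6, 7, 8}
dissim = tversky_dissimilarity(query, target, coeff_q=0.99, coeff_t=0.01)
swapped = tversky_dissimilarity(query, target, coeff_q=0.01, coeff_t=0.99)

# Python's printf-style formatting accepts the same %.3f pattern.
formatted = "%.3f" % dissim
print(formatted, "%.3f" % swapped)
```

Under these assumptions, extra features of the target are barely penalized when coeff_t is small, so a small query contained in a much larger target scores close to 0.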

Heatmap image creation

A heatmap image of the search result can be rendered using option -heatmap-image <IMAGE>. In the generated image a cell is associated with every query-target pair. When a dissimilarity value is retrieved for a query-target pair, its cell is colored according to the color scale. Image generation works with all search modes (MOSTSIMILAR, MOSTSIMILARS and FULLMATRIX). The generated image can be customized. For help on the available customization options invoke bin/searchStorage.sh -h.

Self overlap of the vitamins dataset

bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -qmf data/molecules/vitamins/vitamins.smi \
    -qidname \
    -tmf data/molecules/vitamins/vitamins.smi \
    -tidname \
    -mode FULLMATRIX \
    -out vitamins-fullmatrix.txt \
    -heatmap-image vitamins-fullmatrix.png \
    -heatmap-image-cellsize 15 \
    -heatmap-image-query-ids-length 250 \
    -heatmap-image-target-ids-length 250

Self overlap of the vitamins dataset

The generated image layout is adjusted to have larger than default cell sizes and enough space to accommodate the long structure ID strings of the dataset.

Breakdown of the arguments

Command line part                          Description
-context createSimpleCfp7Context           Descriptor (fingerprint) to be used. See Basic overview of the concepts of overlap analysis context for details.
-qmf data/molecules/vitamins/vitamins.smi  Read queries from molecule file data/molecules/vitamins/vitamins.smi, parse them and calculate descriptors according to the context set.
-qidname                                   Use the molecule name field of the queries as IDs.
-tmf data/molecules/vitamins/vitamins.smi  Read targets from molecule file data/molecules/vitamins/vitamins.smi, parse them and calculate descriptors according to the context set.
-tidname                                   Use the molecule name field of the targets as IDs.
-mode FULLMATRIX                           Calculate and store the result of every query-target comparison.
-out vitamins-fullmatrix.txt               Write textual results (dissimilarity matrix) to file vitamins-fullmatrix.txt.
-heatmap-image vitamins-fullmatrix.png     Create a heatmap image from the search results and write it to file vitamins-fullmatrix.png.
-heatmap-image-cellsize 15                 Size the cells of the heatmap to 15 by 15 pixels. This is an optional parameter.
-heatmap-image-query-ids-length 250        Allow 250 pixels to print query IDs (the vitamins dataset contains relatively long molecule names). This is an optional parameter.
-heatmap-image-target-ids-length 250       Allow 250 pixels to print target IDs. This is an optional parameter.

Overlap of the antibiotics dataset with the essential medicines dataset

bin/searchStorage.sh \
    -qmf data/molecules/antibiotics/antibiotics.smi \
    -qidname \
    -tmf data/molecules/who-essential-medicines/who-essential-medicines.smi \
    -tidname \
    -context createSimpleCfp7Context \
    -mode MOSTSIMILARS \
    -count 100 \
    -maxdissim 0.15 \
    -out antibiotics-vs-essentials-mostsimilars.txt \
    -heatmap-image antibiotics-vs-essentials-mostsimilars.png \
    -heatmap-image-cellsize 10 \
    -heatmap-image-title-text "Antibiotics in the WHO Model List of Essential Medicines dataset" \
    -heatmap-image-query-ids-length 100 \
    -heatmap-image-query-label-text "Queries: List of antibiotics dataset" \
    -heatmap-image-target-ids-length 200 \
    -heatmap-image-target-label-text "Targets: WHO Model List of Essential Medicines dataset"

Overlap of the antibiotics dataset with the essential medicines dataset

Breakdown of the arguments not used in the previous example

Command line part                       Description
-mode MOSTSIMILARS                      Most similars search mode. At most the n most similar targets for each query are retrieved.
-count 100                              Set the maximum number of hits for each query to 100.
-maxdissim 0.15                         Targets with a dissimilarity greater than 0.15 (that is, a smaller similarity) are rejected.
-heatmap-image-title-text "..."         Set the chart title.
-heatmap-image-query-label-text "..."   Set the query label.
-heatmap-image-target-label-text "..."  Set the target label.

Self overlap of the drugbank-all dataset

gzip -dc data/molecules/drugbank/drugbank-common_name.smi.gz > drugbank.smi
bin/searchStorage.sh \
    -qmf drugbank.smi \
    -tmf drugbank.smi \
    -context createSimpleCfp7Context \
    -mode FULLMATRIX \
    -out "" \
    -heatmap-image drugbank-fullmatrix.png \
    -heatmap-image-cellsize 1 \
    -heatmap-image-title-text "Self overlap of the Drugbank-all dataset" \
    -heatmap-image-query-label-text "" \
    -heatmap-image-target-label-text ""

This dataset contains ~7k molecules, so the resulting image, which uses 1 pixel by 1 pixel cells, is larger than 50 megapixels and 50 megabytes in size. The textual output is not written (option -out "" is used). The execution time of the command is expected to be around a minute. The output image is not available in this documentation.