Basic similarity search workflow

This is an example of using the supplied command line tools to generate descriptors for molecule sets and to run similarity searches on them with molecule queries. Some of the steps described below are implemented in the script search-workflow.sh in the examples/ directory. The basic workflow consists of the following steps:

Overview of the search workflow

  1. Import molecules and IDs from structure file (creating master molecule storage)
  2. Calculate molecular descriptors to be used as targets for the search
  3. Invoke similarity search on prepared storages
  4. Diagnostic dump of prepared storages (optional)

For more details on the command line scripts involved, see their descriptions. For more details on performance, see the document Performance. An example that skips the preparation steps and calculates descriptors on the fly is also given. See also the document Details on searchStorage.

Create master molecule storage

The master molecule storage is used by other scripts (such as search) to retrieve structure sources and IDs. Structure IDs are also stored in a similar data structure.

Commands

# Retrieve IDs from SDF properties
gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | bin/createMms.sh \
    -in - \
    -out drugbank-all-mms.bin \
    -prop COMMON_NAME:drugbank-all-commonname.bin

# Retrieve IDs from molecule name
gzip -dc data/molecules/nci/nci-250k.smi.gz | bin/createMms.sh \
    -in - \
    -out nci-250k-mms.bin \
    -name nci-250k-name.bin

Breakdown of the invocations

Command line part           Description
gzip -dc <GZFILE>           Decompress the content of the gzip-encoded file <GZFILE> and print it to the standard output.
|                           Pipe the standard output of the previous command into the standard input of the following command. See http://www.tldp.org/LDP/abs/html/io-redirection.html for details.
bin/createMms.sh            Shipped tool that processes the input file and stores structures, and optionally IDs, in a proprietary binary file readable by other tools.
\                           Indicates that the command continues on the following line.
-in <INPUT>                 Specify the location of input structures to process.
-                           Specify standard input.
-out <BINFILE>              Specify the binary file for the master molecule storage to write.
-prop <PROPNAME>:<BINFILE>  Specify an SDF property <PROPNAME> to be extracted and stored in a binary file <BINFILE>.
-name <BINFILE>             Extract and store molecule names in a binary file <BINFILE>.
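
As a sanity check before the import, it can help to know how many structures an input holds. The following is a minimal sketch and not part of the shipped tools: the function name count_structures is hypothetical, and it assumes either a SMILES file with one structure per non-empty line and no header line, or an SDF whose records each end with a "$$$$" line.

```shell
# count_structures FORMAT: count the structures read from standard input.
# FORMAT is "smi" (one structure per non-empty line, no header line assumed)
# or "sdf" (each record terminated by a line starting with "$$$$").
count_structures() {
    case "$1" in
        smi) awk 'NF > 0 { n++ } END { print n + 0 }' ;;
        sdf) awk '/^\$\$\$\$/ { n++ } END { print n + 0 }' ;;
        *)   echo "unknown format: $1" >&2; return 2 ;;
    esac
}

# Usage with the datasets from the commands above:
#   gzip -dc data/molecules/nci/nci-250k.smi.gz | count_structures smi
#   gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | count_structures sdf
```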

Expected performance

Preprocessing the nci-250k dataset on a recent desktop machine is expected to take under a minute.

Calculate fingerprints

The generated descriptors (fingerprints) are stored in a binary file, which is later read by the search tool.

Commands

gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | bin/buildStorage.sh \
    -context createSimpleCfp7Context \
    -in - \
    -out drugbank-all-cfp7.bin

gzip -dc data/molecules/nci/nci-250k.smi.gz | bin/buildStorage.sh \
    -context createSimpleCfp7Context \
    -in - \
    -out nci-250k-cfp7.bin

Expected performance

Fingerprint calculation for the nci-250k dataset on a recent desktop machine is expected to take well under a minute.

Breakdown of the invocations

Command line part    Description
bin/buildStorage.sh  Tool that processes structure file input, calculates molecular descriptors (fingerprints) and stores them in a binary file.
-context <CONTEXT>   Specify the molecular descriptor, default comparison metric and other parameters to be used during calculation and later search. For details see the document Basic overview of the concepts of overlap analysis context.
-in <INFILE>         Structure file to process.
-out <BINFILE>       Binary file to write, containing the calculated descriptors.

Invoke similarity search from command line

By default, targets are identified by their master index (the 0-based index in the input structure file). If a serialized ID or name storage (created together with the master molecule storage) is specified, the stored ID (or name) is retrieved and printed instead.

Commands

# Launch a simple search against the NCI database with a query molecule specified as a SMILES string
bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -qm "C1CCCC1"

# Use previously extracted IDs
bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -idstorage nci-250k-name.bin \
    -qm "C1CCCC1"

# Find the 10 most similar structures
bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -idstorage nci-250k-name.bin \
    -mode MOSTSIMILARS \
    -count 10 \
    -qm "C1CCCC1"

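# Search the NCI database for the most similar structure to each of the first 10 Drugbank structures.
# SDF records end with a "$$$$" line, so awk keeps 10 whole records ("head -10" would keep only 10 lines).
gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz \
    | awk '{ print } /^\$\$\$\$/ && ++n == 10 { exit }' \
    | bin/searchStorage.sh \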
    -frombytes nci-250k-cfp7.bin \
    -idstorage nci-250k-name.bin \
    -qf -

Breakdown of the invocations

Command line part     Description
bin/searchStorage.sh  Tool to invoke similarity searches against molecular descriptors stored in a binary file previously generated by buildStorage.
-frombytes <BINFILE>  Binary file containing molecular descriptors.
-idstorage <BINFILE>  Read target IDs from specified location.
-qm <QMOLSOURCE>      Import molecule from source <QMOLSOURCE> and use it as a query.
-qf <QUERY>           Import query molecules from specified location.
-qf -                 Import query molecules from standard input.
-mode MOSTSIMILARS    Find the n most similar molecules for each query.
-count 10             Specify the max number of most similar structures to find.
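
When no ID storage is given, hits are reported by their 0-based master index, so mapping a hit back to its structure means looking up that position in the original input. The following is a minimal sketch for SMILES input. The function name structure_at_index is hypothetical, and it assumes one structure per line with no header line, so that master index i corresponds to line i + 1.

```shell
# structure_at_index INDEX: print the line holding the structure with the
# given 0-based master index from a SMILES stream on standard input.
# Assumes one structure per line and no header line.
structure_at_index() {
    awk -v idx="$1" 'NR == idx + 1 { print; exit }'
}

# Usage (41 is an arbitrary example index):
#   gzip -dc data/molecules/nci/nci-250k.smi.gz | structure_at_index 41
```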

Expected performance

Each of the above runs is expected to finish within a few seconds.

Diagnostics: dump contents of the serialized storages

The dumpStorage tool reads the specified binary files and prints an overview of their contents. Note that each given storage is fully read into memory, regardless of the number of lines printed.

Command

bin/dumpStorage.sh -in drugbank-all-cfp7.bin -in drugbank-all-mms.bin -in drugbank-all-commonname.bin

On the fly descriptor calculations

The examples above use the tools createMms and buildStorage to prepare descriptors and IDs for later searches or for exposing them through the Web UI / REST API. It is possible to skip these preparation steps and let searchStorage do the calculation. For further information on the parametrization of searchStorage see Details on searchStorage.

# Find the 10 most similar structures using asymmetric tversky with on the fly descriptor calculation
tabs 40
gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -metric "tversky,coeffT:0.01,coeffQ:0.99" \
    -tmf - \
    -tidprop COMMON_NAME \
    -mode MOSTSIMILARS \
    -count 10 \
    -qm "O1CC1 epoxy" \
    -qidname

The output:

Query                                   Target                                  Dissimilarity
epoxy                                   Sevelamer                               0.019607843137254943
epoxy                                   Colestipol                              0.027946537059538312
epoxy                                   Fosfomycin                              0.03264812575574361
epoxy                                   3-Oxiran-2ylalanine                     0.03730445246690739
epoxy                                   R-Styrene Oxide                         0.03961584633853543
epoxy                                   Oxiranpseudoglucose                     0.04534606205250602
epoxy                                   D-Limonene 1,2-Epoxide                  0.04534606205250602
epoxy                                   3,4-Epoxybutyl-Alpha-D-Glucopyranoside  0.06432748538011701
epoxy                                   (R)-4-Nitrostyrene oxide                0.06868451688009314
epoxy                                   (S)-4-Nitrostyrene oxide                0.06868451688009314

Breakdown of the invocations

Command line part                Description
tabs 40                          Set the tab stops of the terminal to 40 characters. This ensures that the columns of the output are visually aligned. See https://linux.die.net/man/1/tabs.
gzip -dc <GZFILE>                Decompress the content of the gzip-encoded file <GZFILE> and print it to the standard output.
|                                Pipe the standard output of the previous command into the standard input of the following command. See http://www.tldp.org/LDP/abs/html/io-redirection.html for details.
bin/searchStorage.sh             Tool to invoke similarity searches against molecular descriptors that are either stored in a binary file previously generated by buildStorage or generated on the fly.
\                                Indicates that the command continues on the following line.
-context <CONTEXT>               Specify the molecular descriptor, default comparison metric and other parameters to be used during calculation and later search. For details see the document Basic overview of the concepts of overlap analysis context.
-metric <METRIC>                 Customize the comparison metric.
tversky,coeffT:0.01,coeffQ:0.99  Asymmetric Tversky metric parameterized so that query-only features are heavily penalized, while target-only features are only slightly penalized.
-tmf <MOLFILE>                   Read and parse targets from a molecule file.
-tmf -                           Read the target molecules from standard input.
-tidprop <PROPNAME>              Extract target IDs from the given property of the parsed target molecules.
-tidprop COMMON_NAME             Property name to use for target IDs.
-mode MOSTSIMILARS               Find the n most similar molecules for each query.
-count 10                        Specify the maximum number of most similar structures to find.
-qm <QMOLSOURCE>                 Import molecule from source <QMOLSOURCE> and use it as a query.
-qm "O1CC1 epoxy"                SMILES structure source with the molecule name specified.
-qidname                         Use the molecule name of the query molecule(s) as query IDs.
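
To make the effect of the two coefficients concrete, the sketch below evaluates the textbook Tversky index and prints one minus that value. Whether searchStorage computes its dissimilarity in exactly this way is an assumption. The function name tversky_dissimilarity and the feature counts are hypothetical and chosen only for illustration.

```shell
# tversky_dissimilarity COMMON QUERY_ONLY TARGET_ONLY COEFF_Q COEFF_T
# Textbook Tversky index:
#   common / (common + coeffQ * queryOnly + coeffT * targetOnly)
# The printed value is 1 - index. That the tool uses exactly this
# definition is an assumption.
tversky_dissimilarity() {
    awk -v c="$1" -v q="$2" -v t="$3" -v wq="$4" -v wt="$5" \
        'BEGIN { printf "%.6f\n", 1 - c / (c + wq * q + wt * t) }'
}

# Hypothetical counts: 10 common features, 10 extra features on one side only.
tversky_dissimilarity 10 0 10 0.99 0.01   # extras in the target: prints 0.009901
tversky_dissimilarity 10 10 0 0.99 0.01   # extras in the query:  prints 0.497487
```

With these coefficients, a target that contains every query feature plus many others still scores as very similar, while a target missing query features is penalized strongly.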

Output formatting

By default, dissimilarity values are printed using Java Double formatting. A custom format can be specified with the option -out-numeric-format <FORMAT>, which delegates to Java java.text.Format. The following example uses %.3f for a fixed precision of 3 decimal digits:

# Find the 5 most similar structures using asymmetric tversky with on the fly descriptor calculation
tabs 25
gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -metric "tversky,coeffT:0.01,coeffQ:0.99" \
    -tmf - \
    -tidprop COMMON_NAME \
    -mode MOSTSIMILARS \
    -count 5 \
    -qm "O1CC1 epoxy" \
    -qidname \
    -out-numeric-format "%.3f"

The output:

Query                    Target                   Dissimilarity
epoxy                    Sevelamer                0.020
epoxy                    Colestipol               0.028
epoxy                    Fosfomycin               0.033
epoxy                    3-Oxiran-2ylalanine      0.037
epoxy                    R-Styrene Oxide          0.040
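
The search output can be post-processed with standard text tools. The sketch below keeps only the hits whose dissimilarity is at or below a threshold. It assumes that the output is tab-separated with a single header line, as the use of tabs above suggests. The function name filter_hits is hypothetical.

```shell
# filter_hits MAX: read search output from standard input and print the header
# line plus every row whose third column (dissimilarity) is <= MAX.
# Assumes tab-separated columns and exactly one header line.
filter_hits() {
    awk -F '\t' -v max="$1" 'NR == 1 || $3 + 0 <= max + 0'
}

# Usage: append "| filter_hits 0.03" to a searchStorage invocation.
```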

Heatmap visualization

As an experimental feature, a heatmap of the search results can be rendered using the options -heatmap-image <FILE> and -heatmap-image-cellsize <CELLSIZE>. The search modes MOSTSIMILAR, MOSTSIMILARS and FULLMATRIX are all supported. Note that heatmap rendering is not recommended for very large datasets. The approximate pixel count of the resulting image is <QUERIES> * <TARGETS> * <CELLSIZE> * <CELLSIZE>, which should be kept below a few tens of megapixels.
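
The pixel count estimate is easy to check before launching a run. The following is a minimal sketch of that arithmetic. The function name heatmap_pixels and the dataset sizes are hypothetical.

```shell
# heatmap_pixels QUERIES TARGETS CELLSIZE: approximate pixel count of the
# heatmap image, QUERIES * TARGETS * CELLSIZE * CELLSIZE.
heatmap_pixels() {
    echo $(( $1 * $2 * $3 * $3 ))
}

# Hypothetical sizes: 1000 queries against 1000 targets with 5 pixel cells.
heatmap_pixels 1000 1000 5    # prints 25000000, that is 25 megapixels
```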

Self overlap of the vitamins dataset

bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -qmf data/molecules/vitamins/vitamins.smi \
    -qidname \
    -tmf data/molecules/vitamins/vitamins.smi \
    -tidname \
    -mode FULLMATRIX \
    -out vitamins-fullmatrix.txt \
    -heatmap-image vitamins-fullmatrix.png \
    -heatmap-image-cellsize 15 \
    -heatmap-image-query-ids-length 250 \
    -heatmap-image-target-ids-length 250

The layout of the generated image is adjusted to use cell sizes larger than the default and to leave enough space to accommodate the long structure ID strings of the dataset. For details on heatmap image generation see the document Details on searchStorage.