Using custom binary descriptors

Custom binary fingerprints and float vector descriptors can also be handled. Note that custom descriptors expose only the serialization mechanisms of the underlying representations. No descriptor generation (from Molecules) is available in this case, so queries must also be given as custom descriptors. Some of the steps described below are implemented in the self-contained example scripts custom-binaryfp-workflow-vitamins.sh and custom-binaryfp-workflow-nci250k.sh found in the examples directory. Note that these example scripts currently fail under Windows + Cygwin.

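The basic workflow described below consists of the following steps: creating an input file, importing the custom descriptors, importing the associated master molecule storage, dumping the created storages for diagnostics, creating a descriptor for querying, and querying the descriptor storage.

The examples use the vitamins dataset containing 30 molecules. Performance data for larger sets can be found below.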

Create an input file

Use diagnostic tool stdg.sh to generate sample binary descriptor representations by creating a binary string based input file using chemical fingerprints in format 010101....01010 <ID>. This tool uses the legacy descriptors API (chemaxon.descriptors) to generate descriptors.

cat data/vitamins.smi | bin/stdg.sh \
    -in - \
    -cfg data/cfp-7-1.xml \
    -desc com.chemaxon.descriptors.alternates.CfpBsWrapper \
    -idloc TRAILING \
    -idsrc MOLNAME \
    -escape false \
    -processdesc "d.replace(/\\|/g, '')" \
    -v \
    -out vitamins-custom-binstring.txt

Note that the exposed descriptor generator writes bits organized into groups of 8, separated by | characters, in the form 01001101|01111000|000.... The desired output format is a plain bitstring (0100110101111000000...) containing no such separators. Option -processdesc <SCRIPT> provides a JavaScript hook to transform the descriptor String representation. In the passed script, the reference d contains the original String representation (before escaping/id appending) and the expected return value is the representation to be used. More information on JavaScript's replace() function and on the regular expressions passed to it can be found in the JavaScript String Reference, in the JavaScript RegExp reference and in MDN's JavaScript Guide.

Breakdown of the -processdesc option:

Expression                             Details
"d.replace(/\\|/g, '')"                Command line argument passed to the executable. This is escaped for the shell.
d.replace(<PATTERN>, <REPLACEMENT>)    JS function to replace <PATTERN> with <REPLACEMENT> in string d.
d.replace(/\|/g, '')                   The value after the shell processes the escaped \\ character. This value is passed to the JavaScript hook.
/\|/g                                  The regular expression processed by the scripting hook.
/.../g                                 Regular expression literal (/.../) and flag to indicate global search (g).
\|                                     Escaped | character which is matched.

So the given JS hook deletes all occurrences of the | character by replacing them with an empty string.

An alternative approach to removing the separator character is to write to standard output (-out -) and filter it with the tr command (... -out - | tr -d "|" > vitamins-custom-binstring.txt).
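The escaping steps described above can be reproduced without the tool itself. The following is a minimal sketch: it prints the argument as the shell would hand it to the executable, and it emulates the effect of the hook with tr on a sample string. The sample value and the variable names are made up for illustration.

```shell
# The double-quoted argument as written on the command line.
# Inside double quotes the shell turns \\ into a single \ character.
arg="d.replace(/\\|/g, '')"
printf 'argument seen by the tool: %s\n' "$arg"

# Hypothetical grouped representation, shaped like the generator output.
sample='01001101|01111000|00010010'

# Emulate the hook: delete every | character.
stripped=$(printf '%s' "$sample" | tr -d '|')
printf 'plain bitstring: %s\n' "$stripped"
```

The first line printed shows the single backslash that reaches the JavaScript hook, and the second shows the separator-free bitstring.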

IDs created from input names are also written.

Note that this tool uses single-threaded execution.
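Before importing, it can be worth checking that every line of the generated file really has the form <bits> <ID>. The sketch below is one possible sanity check using awk. It runs on a made-up sample file with 8-bit descriptors; for a real file, the expected length would be the fingerprint length in use, and the file name would be vitamins-custom-binstring.txt.

```shell
# Made-up sample input: line 2 is too short, line 3 contains an invalid character.
tmp=$(mktemp)
printf '%s\n' '01001101 Vitamin C' '0111100 Vitamin A' '01x01101 Biotin' > "$tmp"

# Assumed descriptor length for this sample only.
expected_len=8

# Report line numbers whose first token is not a 0/1 string of the expected length.
bad=$(awk -v n="$expected_len" \
    '$1 !~ /^[01]+$/ || length($1) != n { print NR }' "$tmp" | tr '\n' ' ')

echo "malformed lines: $bad"
rm -f "$tmp"
```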

Import custom descriptors

Note that the underlying context must be composed using a JavaScript hook (specified by -contextjs <SCRIPT>). The script must be valid JavaScript code which returns the OverlapAnalysisContext instance to be used (as the value of the last expression). Many initialized references and helper functions are available; use option -h to print the command line help for details. Since the input is an arbitrary line-oriented text file which might contain additional data, the methods used for accessing the descriptor part and, optionally, the ID part must be specified explicitly. This is done by using splitters.

bin/importStorage.sh \
    -in vitamins-custom-binstring.txt \
    -splitter com.chemaxon.overlap.splits.FirstToken \
    -idsplitter com.chemaxon.overlap.splits.AllButFirstToken \
    -out vitamins-custom-fp.bin \
    -id vitamins-custom-id.bin \
    -contextjs "ctx_from_descpb(bld_bv.length(1024).endianness(en_BIG_ENDIAN).stringFormat(sf_STRICT_BINARY_STRING))"

Note that writing IDs (using options -id and -idsplitter) is optional. Note that IDs might contain whitespace (like Vitamin C), so using the splitter SecondToken instead of AllButFirstToken would truncate them (selecting only Vitamin instead of the full remaining part Vitamin C).
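The difference between the splitters can be illustrated on a single input line. The sketch below only emulates the behavior described above with shell parameter expansion and awk; it is not the implementation of the splitter classes, which may treat repeated or leading whitespace differently. The sample line is made up.

```shell
# Hypothetical input line: descriptor, one space, then an ID containing a space.
line='01001101 Vitamin C'

# Like FirstToken: everything before the first space.
first_token=${line%% *}

# Like AllButFirstToken: everything after the first space.
all_but_first=${line#* }

# Like SecondToken: only the second whitespace-separated token.
second_token=$(printf '%s\n' "$line" | awk '{ print $2 }')

printf 'descriptor: %s\n' "$first_token"
printf 'full ID:    %s\n' "$all_but_first"
printf 'truncated:  %s\n' "$second_token"
```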

Breakdown of the contents of the passed JavaScript fragment creating the OverlapAnalysisContext used:

Script part                Description
ctx_from_descpb(..)        Helper function which creates a default OverlapAnalysisContext from the associated DescriptorParameters builder.
bld_bv                     A builder instance for BvParameters in default state.
.length(..)                Update builder with length parameter (see apidoc).
.endianness(..)            Update builder with endianness parameter (see apidoc).
en_BIG_ENDIAN              Constant which can be passed to .endianness(..) (see apidoc).
.stringFormat(..)          Update builder with string format parameter (see apidoc).
sf_STRICT_BINARY_STRING    Constant which can be passed to .stringFormat(..) (see apidoc).

Import associated master molecule storage

Master molecule storage can be created when structures are available with tool createMms. Note that currently the order of custom descriptors and the order of molecules must match; ID based matching is not available.

cat data/vitamins.smi | bin/createMms.sh -in - -out vitamins-mms.bin
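Because matching is purely positional, a quick check that both inputs list their records in the same order can catch mistakes early. The following sketch assumes that each line of the structure file is a SMILES string followed by a space and the molecule name, and that the descriptor IDs were derived from those names. It uses made-up two-record files; the file names are hypothetical.

```shell
workdir=$(mktemp -d)

# Made-up stand-ins for the structure file and the descriptor input file.
printf '%s\n' 'CCO ethanol' 'CC(=O)O acetic acid' > "$workdir/mols.smi"
printf '%s\n' '01001101 ethanol' '01111000 acetic acid' > "$workdir/desc.txt"

# Keep everything after the first space: the name or ID part of each line.
cut -d ' ' -f 2- "$workdir/mols.smi" > "$workdir/mol-names.txt"
cut -d ' ' -f 2- "$workdir/desc.txt" > "$workdir/desc-ids.txt"

# Identical files mean the same records in the same order.
if cmp -s "$workdir/mol-names.txt" "$workdir/desc-ids.txt"; then
    order_ok=yes
else
    order_ok=no
fi

echo "same record order: $order_ok"
rm -rf "$workdir"
```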

Diagnostic dump storages

Peek into the contents of created storages.

bin/dumpStorage.sh \
    -in vitamins-custom-fp.bin \
    -in vitamins-custom-id.bin \
    -in vitamins-mms.bin

Create descriptor for querying

Query descriptors have no associated IDs. We use the "Vitamin E - alpha-Tocopherol" structure, slightly modified (an additional carbon atom is attached):

echo "Oc2c(c(c1O[[email protected]](CCc1c2C)(C)CCC[[email protected]](C)CCC[[email protected]](C)CCCC(C)C)C)CC" | bin/stdg.sh \
    -in - \
    -cfg data/cfp-7-1.xml \
    -desc com.chemaxon.descriptors.alternates.CfpBsWrapper \
    -idloc NONE \
    -escape false \
    -processdesc "d.replace(/\\|/g, '')" \
    -v \
    -out query-desc.txt
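Since the query file is written without IDs, it is expected to hold a single plain bitstring. A minimal sketch of a check for that follows, run on a made-up 16-bit sample; for the real workflow the file would be query-desc.txt, and its length should equal the length of the stored descriptors.

```shell
# Made-up query descriptor file with one 16-bit line.
qfile=$(mktemp)
printf '%s\n' '0100110101111000' > "$qfile"

# Count lines; tr removes the padding some wc implementations add.
lines=$(wc -l < "$qfile" | tr -d ' ')

# Accept only a single line made up entirely of 0 and 1 characters.
if [ "$lines" -eq 1 ] && grep -Eq '^[01]+$' "$qfile"; then
    query_ok=yes
else
    query_ok=no
fi

# Length of the bitstring, for comparison with the stored descriptors.
query_len=$(awk '{ print length($0) }' "$qfile")

echo "query file ok: $query_ok (length $query_len)"
rm -f "$qfile"
```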

Query descriptor storage

Inline query descriptors are set using parameter -qd. Query descriptors stored in a file can be read using -qdf. Note that query molecules (-qm or -qmf) cannot be used, since no descriptor generation is available for them.

bin/searchStorage.sh \
    -frombytes vitamins-custom-fp.bin \
    -qd `cat query-desc.txt`

bin/searchStorage.sh \
    -frombytes vitamins-custom-fp.bin \
    -qdf query-desc.txt

searchStorage can use IDs instead of plain structure indices. Parameter -idstorage can specify the associated ID storage.

bin/searchStorage.sh \
    -frombytes vitamins-custom-fp.bin \
    -idstorage vitamins-custom-id.bin \
    -qdf query-desc.txt