Using custom binary descriptors
Custom binary fingerprints and float vector descriptors can also be handled. Note that custom descriptors expose only the serialization mechanisms of the underlying representations. No descriptor generation (from Molecules) is available in this case, so queries must also be supplied as custom descriptors. Parts of the steps described below are implemented in self-contained example scripts custom-binaryfp-workflow-nci250k.sh found in the examples directory. Note that these example scripts currently fail under Windows + Cygwin.
The basic workflow described below contains the following steps:
- Create textual representations of descriptors generated by the diagnostic tool
- Also create textual descriptor representation to be used as queries
- Import custom descriptors
- Search custom descriptors
The examples use the vitamins dataset containing 30 molecules. Performance data for larger sets can be found below.
Create an input file
Use the diagnostic tool stdg.sh to generate sample binary descriptor representations. It writes a binary string based input file from chemical fingerprints, using the line format 010101....01010 <ID>. This tool uses the legacy descriptors API (chemaxon.descriptors) to generate descriptors.
cat data/vitamins.smi | bin/stdg.sh \
    -in - \
    -cfg data/cfp-7-1.xml \
    -desc com.chemaxon.descriptors.alternates.CfpBsWrapper \
    -idloc TRAILING \
    -idsrc MOLNAME \
    -escape false \
    -processdesc "d.replace(/\\|/g, '')" \
    -v \
    -out vitamins-custom-binstring.txt
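Note that the exposed descriptor generator writes bits organized into groups of 8, separated by the character | (in the form 01001101|01111000|000....). The desired output format is a plain bitstring (0100110101111000000...) containing no such separators. Option -processdesc supplies a JS hook that removes them.
Breakdown of the -processdesc argument:
| "d.replace(/\\|/g, '')" | Command line argument passed to the executable. This is escaped for the shell. |
| replace | JS function to replace pattern matches in a string. |
| d.replace(/\|/g, '') | The value after the shell processes the escaped argument (inside double quotes \\ becomes \). |
| /\|/g | The regular expression processed by the scripting hook. |
| /.../g | Regular expression literal (/.../) with the g (global) flag; the pattern \| matches a literal | character. |
So the given JS hook deletes all occurrences of the | character by replacing them with an empty string.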
An alternative approach to removing the separator character is to write to standard output (-out -) and filter the result with tr (... -out - | tr -d "|" > vitamins-custom-binstring.txt).
IDs created from input names are also written.
Note that this tool uses single-threaded execution.
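The separator removal described above can be tried on a single sample line without running stdg.sh. The following is a minimal sketch; the function name strip_separators and the sample line are made up for illustration, and both filters also delete | characters that happen to occur in the ID part.

```shell
# Delete every '|' group separator from the lines read on standard input.
strip_separators() {
    tr -d '|'
}

# Hypothetical sample line in the "<bits> <ID>" layout.
sample='01001101|01111000|00000001 Vitamin C'

printf '%s\n' "$sample" | strip_separators

# sed equivalent of the JS hook d.replace(/\|/g, ''):
printf '%s\n' "$sample" | sed 's/|//g'
```

Both commands print 010011010111100000000001 Vitamin C.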
Import custom descriptors
Use tool importStorage.sh to import the textual descriptors. Its -contextjs option takes a JavaScript snippet which provides the OverlapAnalysisContext instance to be used (as the value of the last expression). Many initialized references and helper functions are available; use option -h to print command line help for details. Since the input is an arbitrary line-oriented text file which might contain additional data, the methods used for accessing the descriptor and, optionally, the ID parts must be specified explicitly. This is done using splitters.
bin/importStorage.sh \
    -in vitamins-custom-binstring.txt \
    -splitter com.chemaxon.overlap.splits.FirstToken \
    -idsplitter com.chemaxon.overlap.splits.AllButFirstToken \
    -out vitamins-custom-fp.bin \
    -id vitamins-custom-id.bin \
    -contextjs "ctx_from_descpb(bld_bv.length(1024).endianness(en_BIG_ENDIAN).stringFormat(sf_STRICT_BINARY_STRING))"
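Since the import above declares a fixed length (1024) and a strict binary string format, a quick sanity check of the text file before importing may save a failed run. The sketch below is not part of the toolkit; check_bitstrings is a hypothetical helper which assumes that the descriptor is the first whitespace-delimited token of each line.

```shell
# Report every line whose first token is not a 0/1 string of the expected length.
# Usage: check_bitstrings FILE [LENGTH]   (LENGTH defaults to 1024)
# Exit status is 1 if at least one line was reported, 0 otherwise.
check_bitstrings() {
    awk -v len="${2:-1024}" '
        $1 !~ /^[01]+$/ || length($1) != len {
            printf "line %d: unexpected descriptor (length %d)\n", NR, length($1)
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For the file created earlier this could be invoked as check_bitstrings vitamins-custom-binstring.txt 1024; leftover | separators or truncated lines are reported with their line number.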
Note that writing IDs (using options -id and -idsplitter) is optional. IDs might contain white spaces (like Vitamin C), so using splitter SecondToken instead of AllButFirstToken would truncate them (selecting only Vitamin instead of the full remaining part Vitamin C).
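Breakdown of the -contextjs expression:
| ctx_from_descpb(...) | Helper function which creates a default OverlapAnalysisContext from the builder passed to it. |
| bld_bv | A builder instance for the binary descriptor parameters. |
| length(1024) | Update builder with length parameter (see apidoc). |
| endianness(...) | Update builder with endianness parameter (see apidoc). |
| en_BIG_ENDIAN | Constant which can be passed to endianness(...). |
| stringFormat(...) | Update builder with string format parameter (see apidoc). |
| sf_STRICT_BINARY_STRING | Constant which can be passed to stringFormat(...). |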
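The difference between the splitters named above can be illustrated with plain shell parameter expansion. This only mimics what the class names suggest on a made-up sample line; the actual splitter classes may treat tabs or repeated whitespace differently.

```shell
# Hypothetical input line: descriptor, one space, then an ID containing a space.
line='0100110101111000 Vitamin C'

first_token=${line%% *}            # like FirstToken: text before the first space
all_but_first=${line#* }           # like AllButFirstToken: everything after it
second_token=${all_but_first%% *}  # like SecondToken: only the next token

printf 'descriptor:       %s\n' "$first_token"
printf 'AllButFirstToken: %s\n' "$all_but_first"
printf 'SecondToken:      %s\n' "$second_token"
```

The last two lines print Vitamin C and Vitamin respectively, which shows how a second-token split would cut the ID short.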
Import associated master molecule storage
When structures are available, a master molecule storage can be created with tool createMms. Note that currently the order of custom descriptors and the order of molecules must match; ID based matching is not available.
cat data/vitamins.smi | bin/createMms.sh -in - -out vitamins-mms.bin
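Because matching is purely positional, a cheap guard is to compare the number of records in the structure file and in the descriptor text file. The helper below (same_record_count, a hypothetical name) is a sketch under the assumption that both files hold one record per non-empty line; equal counts are necessary but do not prove that the order is identical.

```shell
# Succeed only if both files contain the same number of non-empty lines.
# Usage: same_record_count FILE_A FILE_B
same_record_count() {
    count_a=$(awk 'NF { n++ } END { print n + 0 }' "$1")
    count_b=$(awk 'NF { n++ } END { print n + 0 }' "$2")
    [ "$count_a" -eq "$count_b" ]
}
```

A possible invocation for this walkthrough is same_record_count data/vitamins.smi vitamins-custom-binstring.txt || echo "record counts differ".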
Diagnostic dump storages
Peek into the contents of created storages.
bin/dumpStorage.sh \
    -in vitamins-custom-fp.bin \
    -in vitamins-custom-id.bin \
    -in vitamins-mms.bin
Create descriptor for querying
Query descriptors have no associated IDs. We use the "Vitamin E - alpha-Tocopherol" structure, slightly modified (an additional carbon atom is attached):
echo "Oc2c(c(c1O[[email protected]](CCc1c2C)(C)CCC[[email protected]](C)CCC[[email protected]](C)CCCC(C)C)C)CC" | bin/stdg.sh \
    -in - \
    -cfg data/cfp-7-1.xml \
    -desc com.chemaxon.descriptors.alternates.CfpBsWrapper \
    -idloc NONE \
    -escape false \
    -processdesc "d.replace(/\\|/g, '')" \
    -v \
    -out query-desc.txt
Query descriptor storage
Inline query descriptors are set using parameter -qd. Query descriptors stored in a file can be read using -qdf. Note that query molecules (-qmf) cannot be used, since no descriptor generator is available for custom descriptors.
bin/searchStorage.sh \
    -frombytes vitamins-custom-fp.bin \
    -qd $(cat query-desc.txt)

bin/searchStorage.sh \
    -frombytes vitamins-custom-fp.bin \
    -qdf query-desc.txt
searchStorage can report IDs instead of plain structure indices. Parameter -idstorage specifies the associated ID storage.
bin/searchStorage.sh \
    -frombytes vitamins-custom-fp.bin \
    -idstorage vitamins-custom-id.bin \
    -qdf query-desc.txt