ChemAxon’s Trainer Engine

Version 1.0-beta6

This documentation gives a short introduction to ChemAxon’s Trainer Engine.

What is Trainer Engine?

ChemAxon’s Trainer Engine is a tool that allows you to create new training models based on your own experimental data and use the created models for prediction.

The Trainer Engine workflow can be separated into three steps:

  1. Descriptor generation
  2. Training (model building)
  3. Prediction

In the first step, descriptors are generated for the training set and written to a serialised descriptor file. In the second step, the actual training is done and a serialised predictor (model) is built from the descriptor file. In the third step, predictions can be made for the test set using the built model.
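
Schematically, the three steps map onto three commands (file names are illustrative; see the usage examples below for the full option lists):

    ./trainer-engine.sh generate-descriptors ... --output descriptors.ser         # 1. descriptor generation
    ./trainer-engine.sh train --training-data descriptors.ser ... --output model.ser    # 2. training
    ./trainer-engine.sh predict --serialized-predictor model.ser ... --output result.sdf # 3. prediction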

Usage examples

The examples below work on Linux/Mac operating systems. On Windows machines, please use the trainer-engine.bat command instead of trainer-engine.sh.

  1. Getting information on using the trainer-engine.sh script with the -h option, which prints a short help message to standard output:

    $ ./trainer-engine.sh -h
    Usage:
        trainer-engine.sh [command] [command options]
    Commands:
        split                 Random splitter.
        generate-descriptors  Generates descriptors for input structures.
        train                 Train a model from descriptors and the specified configuration.
        predict               Predictor.

    Run 'trainer-engine.sh COMMAND --help' for more information on a command.
    

    Note 1: For listing all available parameters of a command, use the -h or the --help option,
    e.g.: ./trainer-engine.sh predict -h
    Note 2: You can specify Java options before commands, e.g. allocating 8 GB of memory for training: ./trainer-engine.sh -Xmx8g train [options]

  2. Generating descriptors for hERG prediction:

    $ ./trainer-engine.sh generate-descriptors \
        --descriptors descriptor-config.json \
        --sdf-input molecules.sdf \
        --sdf-tag "pAct(hERG)" \
        --output descriptors.ser
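
    The training property is read from the SD tag given by --sdf-tag. A record of molecules.sdf might look like this (structure block abbreviated, activity value illustrative):

        ...molfile block...
        M  END
        > <pAct(hERG)>
        6.2

        $$$$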
    
  3. Creating a training model from the generated descriptors:

    $ ./trainer-engine.sh train \
        --training-data descriptors.ser \
        --training-model training-model-config.json \
        --output hERG-predictor-v1.ser
    
  4. Using the created training model for prediction and calculating statistical parameters (if the pAct(hERG) SDF tag contains the observed data):

    $ ./trainer-engine.sh predict \
        --serialized-predictor hERG-predictor-v1.ser \
        --input test.sdf \
        --exp-tag pAct(hERG) \
        --sdf-tag predicted_hERG \
        --output result.sdf
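
    Each record of result.sdf then carries the prediction in the predicted_hERG SD tag; assuming input tags are carried over (typical for SD file processing tools), a record would contain (values illustrative):

        > <pAct(hERG)>
        6.2

        > <predicted_hERG>
        5.9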
    
  5. Prediction and finding the most similar structures from the training set:

    $ ./trainer-engine.sh predict \
        --serialized-predictor hERG-predictor-v1.ser \
        --input test.sdf \
        --most-similars \
        --sdf-tag predicted_hERG \
        --output result.sdf
    

How to configure Trainer Engine?

This section provides details on the configuration of Trainer Engine.

Descriptor generation

Descriptors can be generated using the trainer-engine.sh generate-descriptors command. The descriptors to be generated have to be specified in a configuration file.

The descriptor configuration file is a JSON file that defines the set of descriptors to be generated as an array of objects, each containing a "type" key and, for some descriptor types, a "descriptors" array.

The following configuration file is a template that shows such an example set of descriptors.

[
    {
        "type": "PHYSCHEM"
    },
    {
        // Any topology descriptor having a scalar value from the
        // TopologyAnalyserPlugin can be used.
        // https://apidocs.chemaxon.com/jchem/doc/dev/java/api/chemaxon/marvin/calculations/TopologyAnalyserPlugin.html
        "type": "TOPOLOGY",
        "descriptors": [
            "atomCount",
            "fsp3",
            "heteroRingCount",
            "stericEffectIndex"
        ]
    },
    {
        // Any Chemical Terms function that returns a scalar can be used as a
        // descriptor.
        // https://docs.chemaxon.com/display/docs/chemical-terms-functions-in-alphabetic-order.md
        "type": "CHEMTERM",
        "descriptors": [
            "atomCount('6')",
            "atomCount('7')",
            "formalCharge(majorms('7.4'))",
            "max(pka())",
            "min(pka())",
            "logP()"
        ]
    },
    {
        // The default ECFP fingerprint is 1024 bits long and has a diameter of 4.
        // The following ECFP fingerprints are also available:
        //    ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
        //    ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
        "type": "ECFP4_1024"
    },
    {
        "type": "MACCS"
    },
    {
        // Any property defined in an SD file tag of the training file can be
        // used as a descriptor.
        "type": "SDFTAG",
        "descriptors": [
            "DESCRIPTOR1",
            "DESCRIPTOR2"
        ]
    },
    {
        // The kNN (k Nearest Neighbour) descriptor is the weighted average of
        // the training property (e.g. hERG) of the 5 most similar molecules in
        // the training set. The weights are the similarity values between the
        // given molecule and the training set molecules.
        "type": "KNN"
    }
]
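
As a concrete example, the descriptor-config.json file used in the hERG example above could be as small as the following (a minimal sketch; any combination of the types shown in the template works):

[
    {
        "type": "PHYSCHEM"
    },
    {
        "type": "ECFP4_1024"
    }
]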

Training (model building)

Training models can be built using the trainer-engine.sh train command. The settings of a model have to be specified in a configuration file.

The trainer configuration file is a JSON file that defines a trainer object with trainerWrapper, type, method and params keys.

Below are example configuration files for the currently available training models.

Random Forest Classification

{
    "trainer": {

        // This is an optional key which defines an outer wrapper for the
        // training type. Currently only CONFORMAL_PREDICTION is available,
        // which allows Error Bound prediction.
        "trainerWrapper": "CONFORMAL_PREDICTION",

        "type": "CLASSIFICATION",

        // For CLASSIFICATION, RANDOM_FOREST and LOGISTIC_REGRESSION are
        // available (see below).
        "method": "RANDOM_FOREST",

        "params": {
            // The number of trees.
            "ntrees": 300,

            // The number of input variables to be used to determine the
            // decision at a node of the tree. If p is the number of variables,
            // floor(sqrt(p)) generally gives good performance.
            "mtry": 0,

            // The maximum depth of the tree.
            "maxDepth": 50,

            // The maximum number of leaf nodes of the tree.
            // Default, if not specified: data size / 5
            // "maxNodes": 50,

            // The number of instances in a node below which the tree will not
            // split, nodeSize = 5 generally gives good results.
            "nodeSize": 1,

            // The sampling rate for the training tree. 1.0 means sampling with
            // replacement, while < 1.0 means sampling without replacement.
            "subSample": 1.0

            // Priors of the classes. The weight of each class is roughly the
            // ratio of samples in each class. For example, if there are 400
            // positive samples and 100 negative samples, the classWeight should
            // be [1, 4] (assuming label 0 is negative and label 1 is
            // positive).
            // "weights": [1, 4]
        }
    }
}

Logistic Regression

{
    "trainer": {
        "type": "CLASSIFICATION",
        "method": "LOGISTIC_REGRESSION",
        "params": {
            // lambda > 0 gives a "regularized" estimate of linear weights which
            // often has superior generalization performance, especially when
            // the dimensionality is high.
            "lambda": 0.1,

            // The tolerance for stopping iterations.
            "tol": 1e-5,

            // The maximum number of iterations.
            "maxIter": 500,

            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            //    "Scaler"       - Scales all numeric variables into the range [0, 1]
            //    "Standardizer" - Standardizes numeric feature to 0 mean and unit variance
            //    "MaxAbsScaler" - scales each feature by its maximum absolute value
            "featureTransformer": "None"
        }
    }
}

Random Forest Regression

{
    "trainer": {

        // This is an optional key which defines an outer wrapper for the
        // training type. Currently only CONFORMAL_PREDICTION is available,
        // which allows Error Bound prediction.
        "trainerWrapper": "CONFORMAL_PREDICTION",

        "type": "REGRESSION",

        "method": "RANDOM_FOREST",

        "params": {
            // The number of trees.
            "ntrees": 300,

            // The number of input variables to be used to determine the
            // decision at a node of the tree. If p is the number of variables,
            // p / 3 usually gives good performance.
            "mtry": 0,

            // The maximum depth of the tree.
            "maxDepth": 50,

            // The maximum number of leaf nodes of the tree.
            // Default, if not specified: data size / 5
            // "maxNodes": 50,

            // The number of instances in a node below which the tree will not
            // split, nodeSize = 5 generally gives good results.
            "nodeSize": 1,

            // The sampling rate for the training tree. 1.0 means sampling with
            // replacement, while < 1.0 means sampling without replacement.
            "subSample": 1.0
        }
    }
}

Support Vector Regression

{
    "trainer": {
        "type": "REGRESSION",
        "method": "SUPPORT_VECTOR_REGRESSION",
        "params": {

            // Threshold parameter. There is no penalty associated with samples
            // which are predicted within distance epsilon from the actual
            // value. Decreasing epsilon forces closer fitting to the
            // calibration/training data.
            "eps": 1.0,

            // The soft margin penalty parameter.
            "C": 0.5,

            // The tolerance of convergence test.
            "tol": 0.1,

            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            //    "Scaler"       - Scales all numeric variables into the range [0, 1]
            //    "Standardizer" - Standardizes numeric feature to 0 mean and unit variance
            //    "MaxAbsScaler" - scales each feature by its maximum absolute value
            //    "None"
            "featureTransformer": "Scaler"
        }
    }
}

Linear Regression

{
    "trainer": {
        "type": "REGRESSION",
        "method": "LINEAR_REGRESSION",
        "params": {
            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            //    "Scaler"       - Scales all numeric variables into the range [0, 1]
            //    "Standardizer" - Standardizes numeric feature to 0 mean and unit variance
            //    "MaxAbsScaler" - scales each feature by its maximum absolute value
            //    "None"
            "featureTransformer": "None"
        }
    }
}

Warning: The original JSON specification doesn’t allow comments in a JSON file. However, the trainer script uses a parser that can handle JSON files with comments.
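
If you want to validate a configuration with a strict JSON tool, you can strip the comments first; a minimal sketch using standard command-line tools (file name illustrative; assumes no string values contain //):

    $ sed 's|//.*||' trainer-config.json | python3 -m json.tool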

Installation guide

To install the Trainer Engine from the ZIP package, please follow the instructions below:

  1. Download the Trainer Engine ZIP package to your computer.
  2. Unzip the package and run the trainer-engine.sh script in your command line environment.

Note 1: Java 1.8 or above is required.
Note 2: You might need to add execution permission to the script to be able to run it on your OS.
Note 3: Before running the script, it is recommended to check the validity of the relevant ChemAxon licenses.
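
On Linux/Mac, the installation typically looks like the following (archive and directory names are illustrative):

    $ unzip trainer-engine.zip
    $ cd trainer-engine
    $ chmod +x trainer-engine.sh    # add execution permission (see Note 2)
    $ ./trainer-engine.sh -h        # verify that the script runs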

Licensing

To use the Trainer Engine you need a valid Trainer Plugin license.

Note: The license file (license.cxl) must be placed in the .chemaxon (Unix) or chemaxon (Windows) sub-directory of the user’s HOME directory.
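
On Unix, for example (source path of the license file is illustrative):

    $ mkdir -p ~/.chemaxon
    $ cp ~/Downloads/license.cxl ~/.chemaxon/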

Known limitations

  1. The kNN descriptor can be used with regression-type training methods only.
  2. Avoid using arguments that contain whitespace (e.g. "exp data"), as some commands do not handle them correctly.

Plans

  1. Allow using the kNN descriptor with classification-type training as well.