ChemAxon’s Trainer Engine

Version 1.0-beta-11

This documentation gives a short introduction to ChemAxon’s Trainer Engine.

What is Trainer Engine?

ChemAxon’s Trainer Engine is a tool that allows you to create new training models based on your own experimental data and use the created models for prediction.

The workflow of the Trainer Engine can be divided into three steps:

  1. Descriptor generation
  2. Training (model building)
  3. Prediction

In the first step, descriptors are generated for the training set, creating a serialised descriptor file. In the second step, the actual training is done, and a serialised predictor (model) is built from the descriptor file. In the third step, predictions can be made for the test set using the model built in the previous step.

Usage examples

The following examples were made for a UNIX/Linux operating system. On Windows machines please use the trainer-engine.bat file instead of the trainer-engine.sh script.

$ trainer-engine.sh -h
    Usage:
        trainer-engine.sh [command] [command options]
    Commands:
        split                  Random splitter.
        generate-descriptors   Generates descriptors for input structures.
        train                  Train a model from descriptors and the specified configuration.
        predict                Predictor.
        gui                    Start web application on http://localhost:8080

    Run 'trainer-engine.sh COMMAND --help' for more information on a command.

Note 1: To list all available parameters of a command, use the -h or --help option, e.g. trainer-engine.sh predict -h.

Note 2: You can specify Java options before commands, e.g. to allocate 8 GB of memory for training, use trainer-engine.sh -Xmx8g train [options].

$ trainer-engine.sh generate-descriptors \
    --descriptors descriptor-config.hjson \
    --sdf-input molecules.sdf \
    --sdf-tag "pAct(hERG)" \
    --output descriptors.ser
$ trainer-engine.sh train \
    --training-data descriptors.ser \
    --training-model training-model-config.hjson \
    --output hERG-predictor-v1.ser
$ trainer-engine.sh predict \
    --serialized-predictor hERG-predictor-v1.ser \
    --input test.sdf \
    --sdf-tag HERG \
    --output result.sdf
$ trainer-engine.sh predict \
    --serialized-predictor hERG-predictor-v1.ser \
    --input test.sdf \
    --most-similars \
    --sdf-tag predicted_hERG \
    --output result.sdf
$ trainer-engine.sh gui

The following image shows the creation of a new hERG model after starting the Trainer Engine GUI.

How to configure Trainer Engine?

This section provides details on the configuration of Trainer Engine.

Descriptor generation

Descriptors can be generated using the trainer-engine.sh generate-descriptors command. The descriptors to be generated by the script have to be specified in a configuration file.

The descriptor configuration file is an hJSON file that defines the set of descriptors to be generated: an array of blocks, each containing a type key and, for some descriptor types, a descriptors array.

The following configuration file is a template that shows such an example set of descriptors.

{
    descriptorGenerator: [
    {
        type: PHYSCHEM
    }
    {
        // Any topology descriptor having a scalar value from the TopologyAnalyserPlugin can be used.
        // https://apidocs.chemaxon.com/jchem/doc/dev/java/api/chemaxon/marvin/calculations/TopologyAnalyserPlugin.html
        type: TOPOLOGY
        descriptors: [
            atomCount
            fsp3
            heteroRingCount
            stericEffectIndex
        ]
    }
    {
        // Any Chemical Terms function that returns a scalar can be used as a descriptor.
        // https://docs.chemaxon.com/display/docs/chemical-terms-functions-in-alphabetic-order.md
        type: CHEMTERM
        descriptors: [
            atomCount('6')
            atomCount('7')
            formalCharge(majorms('7.4'))
            max(pka())
            min(pka())
            logP()
        ]
    }
    {
        // The default ECFP fingerprint is 1024 bits long and has a diameter of 4.
        // The following ECFP fingerprints are also available:
        //    ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
        //    ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
        type: ECFP4_1024
    }
    {
        type: MACCS
    }
    {
        // Any property defined in an SD file tag of the training file can be
        // used as a descriptor.
        type: SDFTAG
        descriptors: [
            DESCRIPTOR1
            DESCRIPTOR2
        ]
    }
    {
        // The kNN (k Nearest Neighbour) Regression descriptor is the weighted
        // average of the training property (e.g. hERG) of the 5 most similar
        // molecules in the training set. The weights are the similarity values
        // between the molecule and those training set molecules.
        type: KNN_REGRESSION
    }]
}

Note: A classification-type kNN descriptor can also be used by putting type: KNN_CLASSIFICATION in the config file, as illustrated below.
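
For illustration, the classification variant can be requested the same way as the regression variant in the template above (a minimal sketch; the description in the comment is the natural classification analogue of KNN_REGRESSION and is an assumption, not taken from this document):

{
    descriptorGenerator: [
    {
        // Assumed classification counterpart of KNN_REGRESSION: derived from
        // the training labels of the most similar training set molecules.
        type: KNN_CLASSIFICATION
    }]
}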

Training set standardization

Molecules of the training set can be standardized before descriptor generation using ChemAxon’s Standardizer. This can be done by inserting the standardizer: Std_Action_String line before the descriptorGenerator: block in the hJSON file, where Std_Action_String defines the Standardizer action string.

The action string contains the sequence of Standardizer actions to be performed. In the action string, consecutive Standardizer actions are separated from each other by ".." (two dots).

Example

In this example we define a standardization step which neutralizes, aromatizes and finally tautomerizes the molecules of the training set. The corresponding action string is neutralize..aromatize..tautomerize.

The string then can be put into the following example hJSON config file:

{
    standardizer: neutralize..aromatize..tautomerize
    descriptorGenerator: [
    {
        type: PHYSCHEM
    }
    {
        // The default ECFP fingerprint is 1024 bits long and has a diameter of 4.
        // The following ECFP fingerprints are also available:
        //    ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
        //    ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
        type: ECFP6_1024
    }
    {
        type: MACCS
    }]
}

Training (model building)

Training models can be built using the trainer-engine.sh train command. The settings of a model have to be specified in a configuration file.

The trainer configuration file is an hJSON file that defines a trainer object with method, algorithm, params and (optionally) trainerWrapper keys.

Below are some example hJSON configuration files of the currently available training models.

Linear Regression

{
    trainer: {
        method: REGRESSION
        algorithm: LINEAR_REGRESSION
        params: {
            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            //  Scaler       - Scales all numeric variables into the range [0, 1]
            //  Standardizer - Standardizes numeric feature to 0 mean and unit variance
            //  MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: None
        }
    }
}
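
To train a model with this configuration, save it to a file and pass it to the train command (a minimal sketch using only the options from the usage examples above; the file names are placeholders):

$ trainer-engine.sh train \
    --training-data descriptors.ser \
    --training-model linear-regression-config.hjson \
    --output linear-model.ser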

Logistic Regression

{
    trainer: {
        method: CLASSIFICATION
        algorithm: LOGISTIC_REGRESSION
        params: {
            // lambda > 0 gives a regularized estimate of linear weights which
            // often has superior generalization performance, especially when
            // the dimensionality is high.
            lambda: 0.1

            // The tolerance for stopping iterations.
            tol: 1e-5

            // The maximum number of iterations.
            maxIter: 500

            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            //    "Scaler"       - Scales all numeric variables into the range [0, 1]
            //    "Standardizer" - Standardizes numeric feature to 0 mean and unit variance
            //    "MaxAbsScaler" - scales each feature by its maximum absolute value
            featureTransformer: None
        }
    }
}

Random Forest Classification

{
    trainer: {
        // This is an optional key which defines an outer wrapper for the
        // training type. Currently only CONFORMAL_PREDICTION is available,
        // which allows Error Bound prediction.
        // trainerWrapper: CONFORMAL_PREDICTION

        method: CLASSIFICATION

        // With the CONFORMAL_PREDICTION wrapper, only RANDOM_FOREST is supported.
        algorithm: RANDOM_FOREST

        params: {
            // The number of trees.
            ntrees: 300

            // The number of input variables to be used to determine the
            // decision at a node of the tree. If p is the number of variables,
            // floor(sqrt(p)) generally gives good performance.
            mtry: 0

            // The ratio of input variables to be used to determine the
            // decision at a node of the tree. If p is the number of variables,
            // p / 3 usually gives good performance.
            // mtryRatio: 0.35

            // The maximum depth of the tree.
            maxDepth: 50

            // The maximum number of leaf nodes of the tree.
            // Default, if not specified: data size / 5
            // maxNodes: 50

            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 1

            // The sampling rate for the training tree. 1.0 means sampling with
            // replacement, while < 1.0 means sampling without replacement.
            subSample: 1.0

            // Priors of the classes. The weight of each class is roughly the
            // ratio of samples in each class. For example, if there are 400
            // positive samples and 100 negative samples, weights should be
            // set to [1, 4] (assuming label 0 is negative and label 1 is
            // positive).
            // weights: [1, 4]
        }
    }
}

Random Forest Regression

{
    trainer: {

        // This is an optional key which defines an outer wrapper for the
        // training type. Currently only CONFORMAL_PREDICTION is available,
        // which allows Error Bound prediction.
        // trainerWrapper: CONFORMAL_PREDICTION
        method: REGRESSION
        algorithm: RANDOM_FOREST
        params: {
            // The number of trees.
            ntrees: 300

            // The number of input variables to be used to determine the
            // decision at a node of the tree. If p is the number of variables,
            // p / 3 usually gives good performance.
            mtry: 0

            // The ratio of input variables to be used to determine the
            // decision at a node of the tree. If p is the number of variables,
            // p / 3 usually gives good performance.
            // mtryRatio: 0.35

            // The maximum depth of the tree.
            maxDepth: 50

            // The maximum number of leaf nodes of the tree.
            // Default, if not specified: data size / 5
            // maxNodes: 50

            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 1

            // The sampling rate for the training tree. 1.0 means sampling with
            // replacement, while < 1.0 means sampling without replacement.
            subSample: 1.0
        }
    }
}
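
To enable Error Bound prediction, the optional trainerWrapper key shown above can be uncommented (a minimal sketch, assuming parameters omitted from params fall back to their defaults):

{
    trainer: {
        // Wrap the Random Forest regression in a conformal predictor to
        // allow Error Bound prediction.
        trainerWrapper: CONFORMAL_PREDICTION
        method: REGRESSION
        algorithm: RANDOM_FOREST
        params: {
            ntrees: 300
        }
    }
}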

Note: Only one of the mtry and mtryRatio parameters can be used in a given config file. Setting both parameters at the same time results in an error.
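
For example, to use the ratio-based parameter, mtry is simply left out of the params block (a sketch; the value is taken from the commented examples above):

{
    trainer: {
        method: REGRESSION
        algorithm: RANDOM_FOREST
        params: {
            ntrees: 300
            // mtry is omitted; mtryRatio is used in its place.
            mtryRatio: 0.35
        }
    }
}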

Support Vector Machine Classification

{
    trainer: {
        method: CLASSIFICATION
        algorithm: SUPPORT_VECTOR_MACHINE
        params: {

            // The soft margin penalty parameter.
            c: 0.5

            // The tolerance of convergence test.
            tol: 0.1

            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            //  Scaler       - Scales all numeric variables into the range [0, 1]
            //  Standardizer - Standardizes numeric feature to 0 mean and unit variance
            //  MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: Scaler
        }
    }
}

Support Vector Machine Regression

{
    trainer: {
        method: REGRESSION
        algorithm: SUPPORT_VECTOR_REGRESSION
        params: {

            // Threshold parameter. There is no penalty associated with samples
            // which are predicted within distance epsilon from the actual
            // value. Decreasing epsilon forces closer fitting to the
            // calibration/training data.
            eps: 1.0

            // The soft margin penalty parameter.
            c: 0.5

            // The tolerance of convergence test.
            tol: 0.1

            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            //  Scaler       - Scales all numeric variables into the range [0, 1]
            //  Standardizer - Standardizes numeric feature to 0 mean and unit variance
            //  MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: Scaler
        }
    }
}

Gradient Tree Boost Classification

{
    trainer: {

        method: CLASSIFICATION

        algorithm: GRADIENT_TREE_BOOST

        params: {

            // The number of trees.
            ntrees: 500

            // The maximum depth of the tree.
            maxDepth: 20

            // The maximum number of leaf nodes of the tree.
            maxNodes: 6

            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 5

            // The shrinkage parameter in (0, 1] controls the learning rate of
            // the procedure.
            shrinkage: 0.05

            // The sampling fraction for stochastic tree boosting.
            subSample: 0.7

        }
    }
}

Gradient Tree Boost Regression

{
    trainer: {

        method: REGRESSION

        algorithm: GRADIENT_TREE_BOOST

        params: {

            // Loss function for regression.
            // Available functions:
            //     LeastSquares
            //     LeastAbsoluteDeviation
            //     Quantile(p)              p ∈ [0, 1]
            //     Huber(p)                 p ∈ [0, 1]
            lossFunction: LeastAbsoluteDeviation

            // The number of trees.
            ntrees: 500

            // The maximum depth of the tree.
            maxDepth: 20

            // The maximum number of leaf nodes of the tree.
            maxNodes: 6

            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 5

            // The shrinkage parameter in (0, 1] controls the learning rate of
            // the procedure.
            shrinkage: 0.05

            // The sampling fraction for stochastic tree boosting.
            subSample: 0.7

        }
    }
}

Note: The parser of the Trainer Engine supports configuration files in hJSON format, which is a convenient extension of the original JSON format. Configuration files in plain JSON format are accepted as well.
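
For illustration, the Linear Regression configuration above written in strict JSON would look like this (a sketch, assuming the unquoted hJSON values map to JSON strings):

{
    "trainer": {
        "method": "REGRESSION",
        "algorithm": "LINEAR_REGRESSION",
        "params": {
            "featureTransformer": "None"
        }
    }
}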

Installation guide

Installing Trainer Engine

To install the Trainer Engine via a ZIP package, please follow the instructions below:

  1. Download the Trainer Engine ZIP package to your computer.
  2. Unzip the package and run the trainer-engine.sh script in your command line environment.

Note 1: Java 1.8 or above is required.
Note 2: You might need to add execution permission to the script to be able to run it on your OS.
Note 3: Before running the script, it is recommended to check the validity of the relevant ChemAxon licenses.
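
On a typical Linux shell the installation might look like this (a sketch; the package and directory names are placeholders):

$ unzip trainer-engine.zip
$ cd trainer-engine
$ chmod +x trainer-engine.sh   # add execution permission if needed (Note 2)
$ ./trainer-engine.sh -h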

Integrating Trainer Engine and Playground

To use your training models created with Trainer Engine for prediction in Playground, please follow the instructions below:

  1. Download and install Docker on your computer.
  2. Create a ChemAxon account and generate an API key.
  3. Once Docker is installed on your computer, log in to the ChemAxon Hub using the following command:
     echo <API-KEY> | docker login -u <E-MAIL> --password-stdin hub.chemaxon.com
  4. Once you are logged in, download the ZIP package containing the necessary YML configuration file to run both applications.
  5. Unzip the package and run Docker Compose with the following command to start the joint application:
     docker compose -f docker-compose-trainplay.yml up
  6. After starting Docker you can reach Trainer Engine under the localhost/trainer/ path and Playground under the localhost/playground/ path.

Note 1: The YML file is an example configuration of Docker Compose integrating Trainer Engine into the Playground application. Docker downloads the images of the two applications when it is first run.

Note 2: The ZIP package contains an empty chemaxon-trainer-data directory where training models are placed during runs.

Note 3: You need Trainer and MarvinJS licenses to run both applications. The two licenses can be placed in one license.cxl file.

Note 4: Since Docker v. 4.3.0, compose can be used as a subcommand of docker (docker compose) instead of joining the two words with a hyphen (docker-compose).

Licensing

To use the Trainer Engine you need a valid Trainer Plugin license.

Note: The license file (license.cxl) must be placed in the .chemaxon (Unix) or chemaxon (Windows) sub-directory of the user’s HOME directory.
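
For example, on a Unix system the license file can be put in place as follows:

$ mkdir -p ~/.chemaxon
$ cp license.cxl ~/.chemaxon/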

Known limitations

  1. Avoid using arguments that contain whitespace (e.g. "exp data"), as some commands do not recognise them.

Release notes (History of changes)

Trainer Engine CLI v. 1.0-beta-10:

Trainer Engine CLI v. 1.0-beta-11: