Version 1.0-beta-11
This documentation gives a short introduction to ChemAxon’s Trainer Engine.
ChemAxon’s Trainer Engine is a tool that allows you to create new training models based on your own experimental data and use the created models for prediction.
The workflow of the Trainer Engine can be separated into three steps:
1. Descriptor generation: descriptors are generated for the training set, creating a serialised descriptor file.
2. Training: the actual training is done, and a serialised predictor (model) is built based on the descriptor file.
3. Prediction: predictions are made for a test set using the model built in the previous step.
The following examples were made for a UNIX/Linux operating system. On Windows machines, please use the trainer-engine.bat file instead of the trainer-engine.sh script.
You can list the available commands by running the trainer-engine.sh script with the -h option, which prints a short help message to the standard output:
$ trainer-engine.sh -h
Usage:
trainer-engine.sh [command] [command options]
Commands:
split Random splitter.
generate-descriptors Generates descriptors for input structures.
train Train a model from descriptors and the specified configuration.
predict Predictor.
gui Start web application on http://localhost:8080
Run 'trainer-engine.sh COMMAND --help' for more information on a command.
Note 1: For listing all available parameters of a command, use the -h or the --help option, e.g. trainer-engine.sh predict -h.
Note 2: You can specify Java options before commands, e.g. to allocate 8 GB of memory for training, use the trainer-engine.sh -Xmx8g train [options] command.
The following command generates descriptors for the training set in molecules.sdf, using the property stored in the pAct(hERG) SD file tag as the training property:
$ trainer-engine.sh generate-descriptors \
--descriptors descriptor-config.hjson \
--sdf-input molecules.sdf \
--sdf-tag "pAct(hERG)" \
--output descriptors.ser
The actual training is done with the train command, which builds a model from the descriptor file:
$ trainer-engine.sh train \
--training-data descriptors.ser \
--training-model training-model-config.hjson \
--output hERG-predictor-v1.ser
Prediction for a test set is done with the predict command, using the model built in the previous step:
$ trainer-engine.sh predict \
--serialized-predictor hERG-predictor-v1.ser \
--input test.sdf \
--sdf-tag HERG \
--output result.sdf
The same prediction can also be run with the --most-similars option:
$ trainer-engine.sh predict \
--serialized-predictor hERG-predictor-v1.ser \
--input test.sdf \
--most-similars \
--sdf-tag predicted_hERG \
--output result.sdf
The web application (GUI) can be started with the gui command:
$ trainer-engine.sh gui
The following image shows the creation of a new hERG model after starting the Trainer Engine GUI.
This section provides details on the configuration of Trainer Engine.
Descriptors can be generated using the trainer-engine.sh generate-descriptors command. The descriptors to be generated by the script have to be specified in a configuration file.
The descriptor configuration file is a hJSON file that defines a set of descriptors for all available descriptor types as an array of type and descriptors key-array pairs.
The following configuration file is a template that shows such an example set of descriptors.
{
descriptorGenerator: [
{
type: PHYSCHEM
}
{
// Any topology descriptor having a scalar value from the TopologyAnalyserPlugin can be used.
//https://apidocs.chemaxon.com/jchem/doc/dev/java/api/chemaxon/marvin/calculations/TopologyAnalyserPlugin.html
type: TOPOLOGY
descriptors: [
atomCount
fsp3
heteroRingCount
stericEffectIndex
]
}
{
// Any Chemical Terms function that returns a scalar can be used as a descriptor.
// https://docs.chemaxon.com/display/docs/chemical-terms-functions-in-alphabetic-order.md
type: CHEMTERM
descriptors: [
atomCount('6')
atomCount('7')
formalCharge(majorms('7.4'))
max(pka())
min(pka())
logP()
]
}
{
// The default ECFP fingerprint is 1024 bits long and has a diameter of 4.
// The following ECFP fingerprints are also available:
// ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
// ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
type: ECFP4_1024
}
{
type: MACCS
}
{
// Any property defined in an SD file tag of the training file can be
// used as a descriptor.
type: SDFTAG
descriptors: [
DESCRIPTOR1
DESCRIPTOR2
]
}
{
// The kNN (k Nearest Neighbour) Regression descriptor is the weighted average of
// the training property (e.g. hERG) of the 5 most similar molecules in
// the training set. The weights are the similarity values between the
// input molecule and the training set molecules.
type: KNN_REGRESSION
}]
}
Note: A classification-type kNN descriptor can also be used by putting type: KNN_CLASSIFICATION in the config file.
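A minimal descriptor configuration using it could look like this sketch:
{
descriptorGenerator: [
{
// Classification variant of the kNN descriptor (cf. KNN_REGRESSION above).
type: KNN_CLASSIFICATION
}]
}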
Molecules of the training set can be standardized before descriptor generation using ChemAxon’s Standardizer. This can be done by inserting the standardizer: Std_Action_String line before the descriptorGenerator: block in the hJSON file, where Std_Action_String defines the Standardizer action string.
The action string contains the sequence of Standardizer actions to be performed. The format of the action string requires the Standardizer actions to be separated from each other by "..".
In this example we define a standardization step which neutralizes, aromatizes and finally tautomerizes the molecules of the training set. Its action string is neutralize..aromatize..tautomerize.
The string then can be put into the following example hJSON config file:
{
standardizer: neutralize..aromatize..tautomerize
descriptorGenerator: [
{
type: PHYSCHEM
}
{
// The default ECFP fingerprint is 1024 bits long and has a diameter of 4.
// The following ECFP fingerprints are also available:
// ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
// ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
type: ECFP6_1024
}
{
type: MACCS
}]
}
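Saved as e.g. standardized-descriptors.hjson (an illustrative file name), this configuration is used exactly like the earlier one:
$ trainer-engine.sh generate-descriptors \
--descriptors standardized-descriptors.hjson \
--sdf-input molecules.sdf \
--sdf-tag "pAct(hERG)" \
--output descriptors.ser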
Training models can be built using the ./trainer-engine.sh train command. The settings of a model have to be specified in a configuration file.
The trainer configuration file is a hJSON file that defines a trainer object with trainerWrapper, method, algorithm and params keys.
Below are some example hJSON configuration files of the currently available training models.
{
trainer: {
method: REGRESSION
algorithm: LINEAR_REGRESSION
params: {
// Feature transformation. In general, learning algorithms benefit
// from standardization of the data set.
// Available functions:
// Scaler - Scales all numeric variables into the range [0, 1]
// Standardizer - Standardizes numeric feature to 0 mean and unit variance
// MaxAbsScaler - scales each feature by its maximum absolute value
featureTransformer: None
}
}
}
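Assuming this configuration is saved as linear-regression.hjson (an illustrative file name), a model could be built with the train command shown earlier:
$ trainer-engine.sh train \
--training-data descriptors.ser \
--training-model linear-regression.hjson \
--output linear-model.ser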
{
trainer: {
method: CLASSIFICATION
algorithm: LOGISTIC_REGRESSION
params: {
// lambda > 0 gives a regularized estimate of linear weights which
// often has superior generalization performance, especially when
// the dimensionality is high.
lambda: 0.1
// The tolerance for stopping iterations.
tol: 1e-5
// The maximum number of iterations.
maxIter: 500
// Feature transformation. In general, learning algorithms benefit
// from standardization of the data set.
// Available functions:
// "Scaler" - Scales all numeric variables into the range [0, 1]
// "Standardizer" - Standardizes numeric feature to 0 mean and unit variance
// "MaxAbsScaler" - scales each feature by its maximum absolute value
featureTransformer: None
}
}
}
{
trainer: {
// This is an optional key which defines an outer wrapper for the
// training type. Currently only CONFORMAL_PREDICTION is available
// which allows Error Bound prediction.
// trainerWrapper: CONFORMAL_PREDICTION
method: CLASSIFICATION
// With the CONFORMAL_PREDICTION wrapper, only RANDOM_FOREST is
// supported for CLASSIFICATION.
algorithm: RANDOM_FOREST
params: {
// The number of trees.
ntrees: 300
// The number of input variables to be used to determine the
// decision at a node of the tree. If p is the number of variables
// floor(sqrt(p)) generally gives good performance.
mtry: 0
// The ratio of input variables to be used to determine the
// decision at a node of the tree. If p is the number of variables
// p / 3 usually gives good performance.
// mtryRatio: 0.35
// The maximum depth of the tree.
maxDepth: 50
// The maximum number of leaf nodes of the tree.
// Default, if not specified: data size / 5
// maxNodes: 50
// The number of instances in a node below which the tree will not
// split, nodeSize = 5 generally gives good results.
nodeSize: 1
// The sampling rate for the training tree. 1.0 means sampling with
// replacement, while < 1.0 means sampling without replacement.
subSample: 1.0
// Priors of the classes. The weight of each class is roughly the
// ratio of samples in each class. For example, if there are 400
// positive samples and 100 negative samples, the classWeight should
// be [1, 4] (assuming label 0 is of negative, label 1 is of
// positive).
// weights: [1, 4]
}
}
}
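To enable Error Bound prediction, the optional wrapper key is simply uncommented; a sketch showing only the relevant keys:
trainer: {
trainerWrapper: CONFORMAL_PREDICTION
method: CLASSIFICATION
algorithm: RANDOM_FOREST
// params block as in the example above
}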
{
trainer: {
// This is an optional key which defines an outer wrapper for the
// training type. Currently only CONFORMAL_PREDICTION is available
// which allows Error Bound prediction.
// trainerWrapper: CONFORMAL_PREDICTION
method: REGRESSION
algorithm: RANDOM_FOREST
params: {
// The number of trees.
ntrees: 300
// The number of input variables to be used to determine the
// decision at a node of the tree. If p is the number of variables
// p / 3 usually gives good performance.
mtry: 0
// The ratio of input variables to be used to determine the
// decision at a node of the tree. If p is the number of variables
// p / 3 usually gives good performance.
// mtryRatio: 0.35
// The maximum depth of the tree.
maxDepth: 50
// The maximum number of leaf nodes of the tree.
// Default, if not specified: data size / 5
// maxNodes: 50
// The number of instances in a node below which the tree will not
// split, nodeSize = 5 generally gives good results.
nodeSize: 1
// The sampling rate for the training tree. 1.0 means sampling with
// replacement, while < 1.0 means sampling without replacement.
subSample: 1.0
}
}
}
Note: Only one of the mtry and mtryRatio parameters can be used in a given config file. Setting both parameters at the same time results in an error.
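For example, a random forest params block using the ratio-based setting instead could look like this sketch (mtry is omitted because mtryRatio is set; the values are illustrative):
params: {
ntrees: 300
// The ratio of input variables used at a node, replacing mtry.
mtryRatio: 0.35
maxDepth: 50
nodeSize: 1
subSample: 1.0
}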
{
trainer: {
method: CLASSIFICATION
algorithm: SUPPORT_VECTOR_MACHINE
params: {
// The soft margin penalty parameter.
c: 0.5
// The tolerance of convergence test.
tol: 0.1
// Feature transformation. In general, learning algorithms benefit
// from standardization of the data set.
// Available functions:
// Scaler - Scales all numeric variables into the range [0, 1]
// Standardizer - Standardizes numeric feature to 0 mean and unit variance
// MaxAbsScaler - scales each feature by its maximum absolute value
featureTransformer: Scaler
}
}
}
{
trainer: {
method: REGRESSION
algorithm: SUPPORT_VECTOR_REGRESSION
params: {
// Threshold parameter. There is no penalty associated with samples
// which are predicted within distance epsilon from the actual
// value. Decreasing epsilon forces closer fitting to the
// calibration/training data.
eps: 1.0
// The soft margin penalty parameter.
c: 0.5
// The tolerance of convergence test.
tol: 0.1
// Feature transformation. In general, learning algorithms benefit
// from standardization of the data set.
// Available functions:
// Scaler - Scales all numeric variables into the range [0, 1]
// Standardizer - Standardizes numeric feature to 0 mean and unit variance
// MaxAbsScaler - scales each feature by its maximum absolute value
featureTransformer: Scaler
}
}
}
{
trainer: {
method: CLASSIFICATION
algorithm: GRADIENT_TREE_BOOST
params: {
// The number of trees.
ntrees: 500
// The maximum depth of the tree.
maxDepth: 20
// The maximum number of leaf nodes of the tree.
maxNodes: 6
// The number of instances in a node below which the tree will not
// split, nodeSize = 5 generally gives good results.
nodeSize: 5
// The shrinkage parameter in (0, 1] controls the learning rate of
// the procedure.
shrinkage: 0.05
// The sampling fraction for stochastic tree boosting.
subSample: 0.7
}
}
}
{
trainer: {
method: REGRESSION
algorithm: GRADIENT_TREE_BOOST
params: {
// Loss function for regression.
// Available functions:
// LeastSquares
// LeastAbsoluteDeviation
// Quantile(p), p in [0, 1]
// Huber(p), p in [0, 1]
lossFunction: LeastAbsoluteDeviation
// The number of trees.
ntrees: 500
// The maximum depth of the tree.
maxDepth: 20
// The maximum number of leaf nodes of the tree.
maxNodes: 6
// The number of instances in a node below which the tree will not
// split, nodeSize = 5 generally gives good results.
nodeSize: 5
// The shrinkage parameter in (0, 1] controls the learning rate of
// the procedure.
shrinkage: 0.05
// The sampling fraction for stochastic tree boosting.
subSample: 0.7
}
}
}
Note: The parser of the Trainer Engine supports configuration files in hJSON format, which is a convenient extension of the original JSON format. However, the parser is able to read configuration files in JSON format, too.
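For instance, the linear regression trainer configuration above rewritten in strict JSON (quoted keys and values, commas, no comments) would look like this:
{
"trainer": {
"method": "REGRESSION",
"algorithm": "LINEAR_REGRESSION",
"params": {
"featureTransformer": "None"
}
}
}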
To install the Trainer Engine via a ZIP package, please follow the instructions below:
1. Download and extract the Trainer Engine ZIP package.
2. Run the trainer-engine.sh script in your command line environment.
Note 1: The minimum Java requirement is version 1.8.
Note 2: You might need to add execution permission to the script to be able to run it on your OS.
Note 3: Before running the script it is recommended to check the validity of the relevant ChemAxon licenses.
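A typical installation on Linux could thus look like the following sketch (the archive and directory names are illustrative):
$ unzip trainer-engine.zip
$ cd trainer-engine
$ chmod +x trainer-engine.sh
$ ./trainer-engine.sh -h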
To use your training models created with Trainer Engine for prediction in Playground, please follow the instructions below:
1. Log in to the ChemAxon Docker registry:
echo <API-KEY> | docker login -u <E-MAIL> --password-stdin hub.chemaxon.com
2. Start the applications with Docker Compose:
docker compose -f docker-compose-trainplay.yml up
3. After startup, Trainer Engine is available under the localhost/trainer/ path and Playground under the localhost/playground/ path.
Note 1: The YML file is an example configuration of Docker Compose integrating Trainer Engine into the Playground application. Docker downloads the images of the two applications when it is first run.
Note 2: The ZIP package contains an empty chemaxon-trainer-data directory where training models are placed during runs.
Note 3: You need Trainer and MarvinJS licenses to run both applications. The two licenses can be placed in one license.cxl file.
Note 4: Since Docker v. 4.3.0 you can use compose as a separate command with docker, without using "-" to concatenate them. See the Docker documentation for the details of this change.
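On Docker installations older than that, the equivalent hyphenated legacy command would be:
docker-compose -f docker-compose-trainplay.yml up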
To use the Trainer Engine you need a valid Trainer Plugin license.
Note: The license file (license.cxl) must be placed in the .chemaxon (Unix) or chemaxon (Windows) sub-directory of the user’s HOME directory.
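On Unix, for example, the license file can be put in place like this (assuming license.cxl is in the current directory):
$ mkdir -p ~/.chemaxon
$ cp license.cxl ~/.chemaxon/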
"exp data"
), some commands do not recognise it.Trainer Engine CLI v. 1.0-beta-10:
Trainer Engine CLI v. 1.0-beta-11: