Profiling and execution statistics

Please note that this document is in a work in progress state. Plese note that the JSON formats of profiling data/execution statistics might change in the future releases.

Some of the command line interfaces support the collection of profiling and execution statistics data. The collected information can be used for benchmarking, debugging and to fine tune application settings. The collected data is stored in the specified text file in JSON format. The collected data can be load and visualized by the embedded server. Note that self contained example scripts rest-api-XXX collect and expose profiling and execution statistics described below.

Please note that experimental REST API endpoint statistics/profiling-snapshot can supply a garbage collector / memory pool snapshot similar to the ones conteined in the profiling data detailed below.

Example use case

In this example the performance of descriptor generation and the scaling of search is measured and visualized. The nci-250k dataset is used in this example as search targets. Subsets of the pubchem-compound-rnd-1k dataset are used for queries. These datasets are included in the distribution.

Collect data

Collect data from various tools. The workflow is similar to the one detailed in document Basic similarity search workflow. Additionally options -stat and -prof are used to write execution statistics and profiling data.

# Create master molecule storage and master id storage
gzip -dc data/molecules/nci/nci-250k.smi.gz | bin/createMms.sh \
    -in - \
    -out nci-250k-mms.bin \
    -name nci-250k-name.bin \
    -stat nci-250k-mms-stat.txt \
    -prof nci-250k-mms-prof.txt

# Generate cfp7 end ecfp4 descriptors
gzip -dc data/molecules/nci/nci-250k.smi.gz | bin/buildStorage.sh \
    -context createSimpleCfp7Context \
    -in - \
    -out nci-250k-cfp7.bin \
    -stat nci-250k-cfp7-stat.txt \
    -prof nci-250k-cfp7-prof.txt
gzip -dc data/molecules/nci/nci-250k.smi.gz | bin/buildStorage.sh \
    -context createSimpleEcfp4Context \
    -in - \
    -out nci-250k-ecfp4.bin \
    -stat nci-250k-ecfp4-stat.txt \
    -prof nci-250k-ecfp4-prof.txt

# Launch most similar searches for 1, 10, 100, 1000 queries
# Queries are selected from the pubchem-compound-rnd dataset
# For search scaling benchmark we use the cfp7 descriptors only
gzip -dc data/molecules/pubchem-compound/pubchem-compound-rnd-1k.smi.gz | head -1 | bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -qf - \
    -out hits-q-1.txt \
    -stat search-q-1-t-250k-stat.txt \
    -prof search-q-1-t-250k-prof.txt
gzip -dc data/molecules/pubchem-compound/pubchem-compound-rnd-1k.smi.gz | head -10 | bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -qf - \
    -out hits-q-10.txt \
    -stat search-q-10-t-250k-stat.txt \
    -prof search-q-10-t-250k-prof.txt
gzip -dc data/molecules/pubchem-compound/pubchem-compound-rnd-1k.smi.gz | head -100 |bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -qf - \
    -out hits-q-100.txt \
    -stat search-q-100-t-250k-stat.txt \
    -prof search-q-100-t-250k-prof.txt
gzip -dc data/molecules/pubchem-compound/pubchem-compound-rnd-1k.smi.gz | head -1000 | bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -qf - \
    -out hits-q-1000.txt \
    -stat search-q-1000-t-250k-stat.txt \
    -prof search-q-1000-t-250k-prof.txt

Sanitize/compact profiling data

Execution statistics contains the values of system properties and environmental variables. It is usually a good practice to not expose these data through the Web UI. Script data/sanitize-prof.js removes most of the stored such properties and variables keeping only a few whitelisted (see script for details) ones. This script is invoked using tool jseval:

# Sanitize/compact profiling data exposed by the Web UI
bin/jseval.sh -df inp=nci-250k-mms-prof.txt -jsfile data/sanitize-prof.js -out nci-250k-mms-prof-sanitized.txt
bin/jseval.sh -df inp=nci-250k-cfp7-prof.txt -jsfile data/sanitize-prof.js -out nci-250k-cfp7-prof-sanitized.txt
bin/jseval.sh -df inp=nci-250k-ecfp4-prof.txt -jsfile data/sanitize-prof.js -out nci-250k-ecfp4-prof-sanitized.txt
bin/jseval.sh -df inp=search-q-1-t-250k-prof.txt -jsfile data/sanitize-prof.js -out search-q-1-t-250k-prof-sanitized.txt
bin/jseval.sh -df inp=search-q-10-t-250k-prof.txt -jsfile data/sanitize-prof.js -out search-q-10-t-250k-prof-sanitized.txt
bin/jseval.sh -df inp=search-q-100-t-250k-prof.txt -jsfile data/sanitize-prof.js -out search-q-100-t-250k-prof-sanitized.txt
bin/jseval.sh -df inp=search-q-1000-t-250k-prof.txt -jsfile data/sanitize-prof.js -out search-q-1000-t-250k-prof-sanitized.txt

Display results using Web UI

# Load results into embedded server; listen on port 8081; launch browser
# Note that generated descriptors, imported molecules are not read by the server
# Use option -nobrowse to disable browser launch
bin/gui.sh \
    -port 8081 \
    -profres nci-250k-mms-prof-sanitized.txt \
    -profres nci-250k-cfp7-prof-sanitized.txt \
    -profres nci-250k-ecfp4-prof-sanitized.txt \
    -profres search-q-1-t-250k-prof-sanitized.txt \
    -profres search-q-10-t-250k-prof-sanitized.txt \
    -profres search-q-100-t-250k-prof-sanitized.txt \
    -profres search-q-1000-t-250k-prof-sanitized.txt

Since -profres is a multi arity argument you can use

bin/gui.sh \
    -port 8081 \
    -profres *-prof-sanitized.txt \

Visualization of profiling data and execution statistics

A dynamic HTML based visualization is available in the distribution. Use tool gui.sh with option -profres <FILENAME> to read the results JSON file as above. The launched server shows the read profiling results file(s) on the overview page. Aggregation of the execution statistics data from the individual profiling result files is also accessible with a visualization. See rest api example for more details on tool gui.sh. Current visualization is not complete, it will be cleaned up in a later release.

Index page

Following the links lead to the visualization pages. Profiling data visualization displays the value of various measures versus execution time.

Profiling data visualization shows the value of various measured parameter values during the execution time. Charts can be moved, scaled. Additional charts can be added.

Single profiling data

Aggregation of execution statistics collected from the exposed profiling data is also exposed. The distribution of various measured parameters for executions is displayed, can be filtered and plotted on a scatter plot. The individual profiling data display is also linked.

Aggregated statistics

Technical details of the visualization

Web content lib/webapp.war contains display files showProf.html, showStat.html and their dependent libraries (most important for the data analysis are d3.js and crossfilter; for the UI composition additionally jquery, jquery-ui, bootstrap and font-awesome). Display file expects an URL parameter ref to be present in the browser address bar pointing to the location of the results JSON to display, served by the server.

Supported command line interfaces

The following command line interfaces support the collection of profiling and execution statistics:

calculateOverlap.sh Calculate similarity based overlap analysis. For details see Introduction to overlap analysis.
buildStorage.sh Read molecules, generate descriptors and ID mapping and write them to a serialized file.
createMms.sh Read molecules and store them in a serialized file. Also molecule names or SD properties can be retrieved and stored in separate files.
searchStorage.sh Read serialized descriptor storage file and search descriptors (either parsed from String form or generated using the specified context).
stdg.sh Single threaded descriptor generator; create string representations capable of comparing reference string descriptor representations for testing. Also can be used to create sample input for custom descriptor import.

For details on the command line interfaces see documentation.

Execution statistics details

Execution statistics contains various data collected during execution in JSON format. Execution statistics is written to a file when option -stat is specified for supported tools.

Common contents:

tool: Name of the tool.
version: Version info from class OverlapVersionInfo
env: Environmental variables.
props: System properties.
cliParameters: Parameters object created from command line arguments. JVM options (like -Xmx...) are not reflected.
cliArguments: Command line arguments array. JVM options (like -Xmx...) are not contained.
memstatInitial, memstatFinal: Memory info at the beginning and ending of the execution.
profSnapshotInitial, profSnapshotFinal: More detailed VM state at the beginning and ending of the execution.
startTime, stopTime, elapsedTime: System times in ms.

Statistics of the most important progress observers as well important metrics (like target counts) are also stored. For more details see the APIDOC of the classes representing the written statistics:

Tool	Execution statistics DTO class
`buildStorage`	`com.chemaxon.overlap.cli.BuildStorageStat`
`calculateOverlap`	`com.chemaxon.overlap.cli.CalculateOverlapStat`
`createMms`	`com.chemaxon.overlap.cli.CreateMmsStat`
`searchStorage`	`com.chemaxon.overlap.cli.SearchStorageStat`
`stdg`	`com.chemaxon.descriptors.alternates.cli.SingleThreadedDescriptorGeneratorStat`

Further attributes

The visualization page checks for specific system properties to use as additional dimensions:

Property name	Typical usage
`overlap-benchmark.cputype`	CPU model name fetched from `/proc/cpuinfo` or from `sysctl -n machdep.cpu.brand_string` on OS X.
`overlap-benchmark.cpucount`	CPU count fetched from `nproc --all` or from `sysctl -n hw.ncpu` on OS X.
`overlap-benchmark.memory`	Total memory fetched from `/proc/meminfo` or from `sysctl -n hw.memsize` on OS X.
`overlap-benchmark.input-set`	Textual description of input set; set by example scripts
`overlap-benchmark.fingerprint`	Textual description of the fingerprint used; set by example scripts
`overlap-benchmark.test-name`	Differentiate test suites; not used by example scripts
`overlap-benchmark.garbage-collector`	Differentiate garbage collector settings; not used by example scripts
`overlap-benchmark.machine`	Differentiate machines; not used by example scripts
`overlap-benchmark.xmx`	Differentiate max memory settings; not used by example scripts

To set these system properties for command line scripts use -D<propname>=<value> JVM argument like "-Doverlap-benchmark.fingerprint=cfp7". Note that system properties starting with overlap-benchmark are whitelisted by sanitization script data/sanitize-prof.js.

Profiling data details

Profiling data is a JSON object containing periodic snapshots from various JMX beans providing instrumentation of the running Java VM and execution data detailed above.

Contents:

prof: array of periodic JVM instrumentation snapshots. Array elements are serialized from instances of class com.chemaxon.overlap.cli.prof.ProfSnapshot. For details see its javadoc. These snapshots contain information of the running progress observer(s), serialized from instances of class com.chemaxon.calculations.common.TimerStatistics. For details see javadoc.
stat: execution statistics detailed above.

Summary of snapshot data

Each element in the JSON array referenced with key prof is a snapshot, serialized from class com.chemaxon.overlap.cli.prof.ProfSnapshot containing the following data:

timestamp Time stamp of the snapshot.
memoryPools[] Info on memory pools (name, type, usage, collection usage).
garbageCollectors[] Info on garbage collectors (name, collection count, collection time, handled memory pools).
heapMemoryUsage Info on heap memory usage.
nonHeapMemoryUsage Info on non-heap memory usage.
systemLoadAverage System load average if availabe.
objectPendingFinalizationCount Approximate number of objects pending finalization.
important flag is set for the first and last snapshots. It is also set when the snapshot was triggered by an external source. First and last snapshots for progress observers are marked as important. This flag can be considered when the array is decimated to reduce its size.
ext additional data, like running progress observer statuses.

Setup profiling data collection

The following command line options are used for setting up profiling data collection:

-prof <FILENAME> If specified profiling data will be collected and written to the specified file.
-profres <TIME> Optional parameter specifying the frequency of profiling data collection in miliseconds.

For example when -prof proffile.txt -profres 500 is specified then profiling data will be collected and written into file proffile.txt. Snapshot will be made approximately twice per every second.

Alternative tools

Note that tools jvisualvm (see documentation) and jstat (see documentation) shipped with the JDK also provide VM instrumentation and can extract/visualize most of the data above with the exception of the progress observer statuses.