Profiling and execution statistics

Please note that this document is in a work in progress state. Plese note that the JSON formats of profiling data/execution statistics might change in the future releases.

Some of the command line interfaces support the collection of profiling and execution statistics data. The collected information can be used for benchmarking, debugging and to fine tune application settings. The collected data is stored in the specified text file in JSON format. The collected data can be load and visualized by the embedded server. Note that self contained example scripts rest-api-XXX collect and expose profiling and execution statistics described below.

Please note that experimental REST API endpoint statistics/profiling-snapshot can supply a garbage collector / memory pool snapshot similar to the ones conteined in the profiling data detailed below.

Example use case

In this example the performance of descriptor generation and the scaling of search is measured and visualized. The nci-250k dataset is used in this example as search targets. Subsets of the pubchem-compound-rnd-1k dataset are used for queries. These datasets are included in the distribution.

Collect data

Collect data from various tools. The workflow is similar to the one detailed in document Basic similarity search workflow. Additionally options -stat and -prof are used to write execution statistics and profiling data.

# Create master molecule storage and master id storage
gzip -dc data/molecules/nci/nci-250k.smi.gz | bin/createMms.sh \
    -in - \
    -out nci-250k-mms.bin \
    -name nci-250k-name.bin \
    -stat nci-250k-mms-stat.txt \
    -prof nci-250k-mms-prof.txt

# Generate cfp7 end ecfp4 descriptors
gzip -dc data/molecules/nci/nci-250k.smi.gz | bin/buildStorage.sh \
    -context createSimpleCfp7Context \
    -in - \
    -out nci-250k-cfp7.bin \
    -stat nci-250k-cfp7-stat.txt \
    -prof nci-250k-cfp7-prof.txt
gzip -dc data/molecules/nci/nci-250k.smi.gz | bin/buildStorage.sh \
    -context createSimpleEcfp4Context \
    -in - \
    -out nci-250k-ecfp4.bin \
    -stat nci-250k-ecfp4-stat.txt \
    -prof nci-250k-ecfp4-prof.txt

# Launch most similar searches for 1, 10, 100, 1000 queries
# Queries are selected from the pubchem-compound-rnd dataset
# For search scaling benchmark we use the cfp7 descriptors only
gzip -dc data/molecules/pubchem-compound/pubchem-compound-rnd-1k.smi.gz | head -1 | bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -qf - \
    -out hits-q-1.txt \
    -stat search-q-1-t-250k-stat.txt \
    -prof search-q-1-t-250k-prof.txt
gzip -dc data/molecules/pubchem-compound/pubchem-compound-rnd-1k.smi.gz | head -10 | bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -qf - \
    -out hits-q-10.txt \
    -stat search-q-10-t-250k-stat.txt \
    -prof search-q-10-t-250k-prof.txt
gzip -dc data/molecules/pubchem-compound/pubchem-compound-rnd-1k.smi.gz | head -100 |bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -qf - \
    -out hits-q-100.txt \
    -stat search-q-100-t-250k-stat.txt \
    -prof search-q-100-t-250k-prof.txt
gzip -dc data/molecules/pubchem-compound/pubchem-compound-rnd-1k.smi.gz | head -1000 | bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -qf - \
    -out hits-q-1000.txt \
    -stat search-q-1000-t-250k-stat.txt \
    -prof search-q-1000-t-250k-prof.txt

Sanitize/compact profiling data

Execution statistics contains the values of system properties and environmental variables. It is usually a good practice to not expose these data through the Web UI. Script data/sanitize-prof.js removes most of the stored such properties and variables keeping only a few whitelisted (see script for details) ones. This script is invoked using tool jseval:

# Sanitize/compact profiling data exposed by the Web UI
bin/jseval.sh -df inp=nci-250k-mms-prof.txt -jsfile data/sanitize-prof.js -out nci-250k-mms-prof-sanitized.txt
bin/jseval.sh -df inp=nci-250k-cfp7-prof.txt -jsfile data/sanitize-prof.js -out nci-250k-cfp7-prof-sanitized.txt
bin/jseval.sh -df inp=nci-250k-ecfp4-prof.txt -jsfile data/sanitize-prof.js -out nci-250k-ecfp4-prof-sanitized.txt
bin/jseval.sh -df inp=search-q-1-t-250k-prof.txt -jsfile data/sanitize-prof.js -out search-q-1-t-250k-prof-sanitized.txt
bin/jseval.sh -df inp=search-q-10-t-250k-prof.txt -jsfile data/sanitize-prof.js -out search-q-10-t-250k-prof-sanitized.txt
bin/jseval.sh -df inp=search-q-100-t-250k-prof.txt -jsfile data/sanitize-prof.js -out search-q-100-t-250k-prof-sanitized.txt
bin/jseval.sh -df inp=search-q-1000-t-250k-prof.txt -jsfile data/sanitize-prof.js -out search-q-1000-t-250k-prof-sanitized.txt

Display results using Web UI

# Load results into embedded server; listen on port 8081; launch browser
# Note that generated descriptors, imported molecules are not read by the server
# Use option -nobrowse to disable browser launch
bin/gui.sh \
    -port 8081 \
    -profres nci-250k-mms-prof-sanitized.txt \
    -profres nci-250k-cfp7-prof-sanitized.txt \
    -profres nci-250k-ecfp4-prof-sanitized.txt \
    -profres search-q-1-t-250k-prof-sanitized.txt \
    -profres search-q-10-t-250k-prof-sanitized.txt \
    -profres search-q-100-t-250k-prof-sanitized.txt \
    -profres search-q-1000-t-250k-prof-sanitized.txt

Since -profres is a multi arity argument you can use

bin/gui.sh \
    -port 8081 \
    -profres *-prof-sanitized.txt \

Visualization of profiling data and execution statistics

A dynamic HTML based visualization is available in the distribution. Use tool gui.sh with option -profres <FILENAME> to read the results JSON file as above. The launched server shows the read profiling results file(s) on the overview page. Aggregation of the execution statistics data from the individual profiling result files is also accessible with a visualization. See rest api example for more details on tool gui.sh. Current visualization is not complete, it will be cleaned up in a later release.

Index page

Following the links lead to the visualization pages. Profiling data visualization displays the value of various measures versus execution time.

Profiling data visualization shows the value of various measured parameter values during the execution time. Charts can be moved, scaled. Additional charts can be added.

Single profiling data

Aggregation of execution statistics collected from the exposed profiling data is also exposed. The distribution of various measured parameters for executions is displayed, can be filtered and plotted on a scatter plot. The individual profiling data display is also linked.

Aggregated statistics

Technical details of the visualization

Web content lib/webapp.war contains display files showProf.html, showStat.html and their dependent libraries (most important for the data analysis are d3.js and crossfilter; for the UI composition additionally jquery, jquery-ui, bootstrap and font-awesome). Display file expects an URL parameter ref to be present in the browser address bar pointing to the location of the results JSON to display, served by the server.

Supported command line interfaces

The following command line interfaces support the collection of profiling and execution statistics:

For details on the command line interfaces see documentation.

Execution statistics details

Execution statistics contains various data collected during execution in JSON format. Execution statistics is written to a file when option -stat is specified for supported tools.

Common contents:

Statistics of the most important progress observers as well important metrics (like target counts) are also stored. For more details see the APIDOC of the classes representing the written statistics:

Tool Execution statistics DTO class
buildStorage com.chemaxon.overlap.cli.BuildStorageStat
calculateOverlap com.chemaxon.overlap.cli.CalculateOverlapStat
createMms com.chemaxon.overlap.cli.CreateMmsStat
searchStorage com.chemaxon.overlap.cli.SearchStorageStat
stdg com.chemaxon.descriptors.alternates.cli.SingleThreadedDescriptorGeneratorStat

Further attributes

The visualization page checks for specific system properties to use as additional dimensions:

Property name Typical usage
overlap-benchmark.cputype CPU model name fetched from /proc/cpuinfo
overlap-benchmark.memory Total memory fetched from /proc/meminfo
overlap-benchmark.input-set Textual description of input set; set by example scripts
overlap-benchmark.fingerprint Textual description of the fingerprint used; set by example scripts
overlap-benchmark.test-name Differentiate test suites; not used by example scripts
overlap-benchmark.garbage-collector Differentiate garbage collector settings; not used by example scripts
overlap-benchmark.machine Differentiate machines; not used by example scripts
overlap-benchmark.xmx Differentiate max memory settings; not used by example scripts

To set these system properties for command line scripts use -D<propname>=<value> JVM argument like "-Doverlap-benchmark.fingerprint=cfp7". Note that system properties starting with overlap-benchmark are whitelisted by sanitization script data/sanitize-prof.js.

Profiling data details

Profiling data is a JSON object containing periodic snapshots from various JMX beans providing instrumentation of the running Java VM and execution data detailed above.

Contents:

Summary of snapshot data

Each element in the JSON array referenced with key prof is a snapshot, serialized from class com.chemaxon.overlap.cli.prof.ProfSnapshot containing the following data:

Setup profiling data collection

The following command line options are used for setting up profiling data collection:

For example when -prof proffile.txt -profres 500 is specified then profiling data will be collected and written into file proffile.txt. Snapshot will be made approximately twice per every second.

Alternative tools

Note that tools jvisualvm (see documentation) and jstat (see documentation) shipped with the JDK also provide VM instrumentation and can extract/visualize most of the data above with the exception of the progress observer statuses.