Prepare example molecule sets


Several large molecule sets are available from public sources, and some of them are used as input in the examples. This document details the download and preparation of the following sets (for sets included in this distribution, the path of the structure file is shown):

| Name | Processed file | Molecule count | Download size | Processed size |
| --- | --- | --- | --- | --- |
| vitamins | data/molecules/vitamins/vitamins.smi | 30 | N/A | 3 k |
| antibiotics | data/molecules/antibiotics/antibiotics.smi | 146 | N/A | 16 k |
| who-essential-medicines | data/molecules/who-essential-medicines/who-essential-medicines.smi | 342 | N/A | 26 k |
| drugbank-all | data/molecules/drugbank/drugbank-all.sdf.gz | 7127 | 2.3 M | 2.3 M |
| nci-250k | data/molecules/nci/nci-250k.smi.gz | 249 k | 3.2 M | 2.8 M |
| chembl | data/molecules/chembl/chembl-21.smi.gz | 1.5 M | 507 M | 24 M |
| chebi | chebi.smi.gz | 97 k | 65 M | 1.3 M |
| emolecules-plus | emolecules-plus.smi.gz | 17.8 M | 192 M | 197 M |
| surechembl | surechembl.smi.gz | 18 M | 1.4 G | 208 M |
| zinc-all | zinc-all.smi.gz | 16.6 M | 142 M | 142 M |
| pubchem-compound | pubchem-compound.smi.gz | 96.3 M | 68 G | 867 M |
| pubchem-compound-rnd | pubchem-compound-rnd.smi.gz | 96.3 M | N/A | 1.8 G |
| pubchem-compound-rnd-1k | data/molecules/pubchem-compound/pubchem-compound-rnd-1k.smi.gz | 1 k | N/A | 20 k |
| pubchem-compound-rnd-10k | data/molecules/pubchem-compound/pubchem-compound-rnd-10k.smi.gz | 10 k | N/A | 193 k |
| pubchem-compound-rnd-100k | data/molecules/pubchem-compound/pubchem-compound-rnd-100k.smi.gz | 100 k | N/A | 1.9 M |
| pubchem-compound-rnd-1000k | pubchem-compound-rnd-1000k.smi.gz | 1 M | N/A | 19 M |
| gdb-13 | gdb-13.smi.gz | 977 M | 2.7 G | 2.7 G |
| gdb-12 | gdb-12.smi.gz | 123 M | N/A | 334 M |

Download script

Script examples/ can download and prepare the molecule sets described here. Launch the script with option -h to access usage help.


Vitamins

File data/molecules/vitamins/vitamins.smi contains 30 molecules in <SMILES> <NAME> format. It needs no further preparation. A gzipped version (data/molecules/vitamins/vitamins.smi.gz) is also available. The contents of the file are based on page


Antibiotics

File data/molecules/antibiotics/antibiotics.smi contains 146 molecules in <SMILES> <NAME> format. It needs no further preparation. A gzipped version (data/molecules/antibiotics/antibiotics.smi.gz) is also available. The contents of the file are based on page

WHO Model List of Essential Medicines

File data/molecules/who-essential-medicines/who-essential-medicines.smi contains 342 structures from the WHO Model List of Essential Medicines (*adult list* of 19th edition, April 2015), based on, created using ChemAxon ChemCurator.

DrugBank all

The DrugBank "Open Data dataset" is available as a zipped SDF file; details and the download page are on the DrugBank site. Preparation involves repacking the .zip archive into gzipped SDF format. Two gzipped SMILES versions are also created, with the structure name taken from field COMMON_NAME and field DRUGBANK_ID, respectively.

Please note that the SMILES versions currently miss a few structures.

wget -O
unzip -p | gzip > drugbank-all.sdf.gz
gzip -dc drugbank-all.sdf.gz | bin/ -in - -out - -namefromprop COMMON_NAME | gzip -9 > drugbank-common_name.smi.gz
gzip -dc drugbank-all.sdf.gz | bin/ -in - -out - -namefromprop DRUGBANK_ID | gzip -9 > drugbank-drugbank_id.smi.gz
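
Since the SMILES versions may miss a few structures, it can be useful to compare record counts between the SDF file and a derived SMILES file. The following is a minimal sketch under stated assumptions: the helper names count_sdf_records and count_smi_records are hypothetical (they are not part of this distribution), and the tiny sample files only stand in for drugbank-all.sdf.gz and the .smi.gz files created above.

```shell
# Hypothetical helpers (not shipped with this distribution).
# An SD record ends with a line consisting of "$$$$".
count_sdf_records() { gzip -dc "$1" | grep -c '^\$\$\$\$$'; }
# Every non-empty line of a SMILES file is one structure.
count_smi_records() { gzip -dc "$1" | grep -c '.'; }

# Made-up sample data standing in for the real files.
tmp=$(mktemp -d)
printf '%s\n' 'record 1' '$$$$' 'record 2' '$$$$' 'record 3' '$$$$' | gzip > "$tmp/sample.sdf.gz"
printf '%s\n' 'CCO first' 'CCN second' | gzip > "$tmp/sample.smi.gz"

sdf_count=$(count_sdf_records "$tmp/sample.sdf.gz")
smi_count=$(count_smi_records "$tmp/sample.smi.gz")
echo "SDF records: $sdf_count, SMILES records: $smi_count, missing: $((sdf_count - smi_count))"
```

On the real data, the two helpers would be called with drugbank-all.sdf.gz and one of the generated .smi.gz files.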

According to page

The DrugBank Open Data datasets are public domain datasets that can be used freely in your application or project (including commercial use). It is released under a Creative Common’s CC0 International License. To the extent possible under law, the person who associated CC0 with the DrugBank Open Data has waived all copyright and related or neighboring rights to the DrugBank Open Data. This work is published from: Canada.

In accordance with this license, a repackaged (.sdf.gz) version of the downloaded DrugBank Open Data dataset is currently available in directory data/molecules/drugbank/.


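NCI

NCI Release 1 contains ~250k structures in gzipped SMILES; details are on the source page. The command below reorders each record to <SMILES> <NAME>, prefixes the ID with NCI, and removes the brackets around bare B, C, N, O, P, S and F atoms: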

gzip -dc NCISMA99.sdz | awk '{print $2 " NCI" $1}' | sed "s/\[\([BCNOPSF]\)\]/\1/g" | gzip > nci-250k.smi.gz

According to the publisher of this dataset the structures are in the public domain. The structures in fixed SMILES format are available in directory data/molecules/nci.
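
The effect of the awk and sed steps in the NCI command above can be tried on a couple of lines before running it on the whole file. This is a small sketch: the function name nci_to_smiles is hypothetical and the two input records are made up, assuming the same "<ID> <SMILES>" column order that the command implies.

```shell
# Hypothetical helper wrapping the awk and sed steps of the NCI command.
nci_to_smiles() {
    # Swap the columns and prefix the ID, then drop the brackets around
    # bare single-letter B, C, N, O, P, S and F atoms.
    awk '{ print $2 " NCI" $1 }' | sed 's/\[\([BCNOPSF]\)\]/\1/g'
}

# Two made-up input records in "<ID> <SMILES>" order.
printf '%s\n' '1 [C][C]O' '2 C[N+](C)(C)C' | nci_to_smiles
```

Bracket atoms carrying a charge or any other annotation, such as [N+] in the second record, do not match the pattern and are left untouched.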


ChEMBL

Structures from ChEMBLdb release 21 are available as a gzipped SDF file on the ChEMBL FTP site. Structures are converted to SMILES with the included tool prepareMolecules, preserving the ChEMBL IDs. Please note that the SMILES version might miss a few structures.

wget -O chembl-21.sdf.gz
gzip -dc chembl-21.sdf.gz | bin/ -in - -out - -namefromprop chembl_id | gzip -9 > chembl-21.smi.gz

According to the files accompanying the release, the data is covered by the Creative Commons Attribution-ShareAlike 3.0 Unported license. The downloaded and SMILES-converted structure data is available in directory data/molecules/chembl. According to the accompanying information, the following attributions are required:


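ChEBI

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on small chemical compounds. Structures in SDF format can be downloaded from the ChEBI FTP site. The SMILES and ChEBI ID data items of each record are then collected with an awk script: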

gzip -dc ChEBI_complete.sdf.gz | awk '{
    if ($0 == "> <SMILES>") { getline ; SMI = $0; }
    else if ($0 == "> <ChEBI ID>") { getline ; CID = $0; }
    else if ($0 == "$$$$") { print SMI " " CID; SMI = ""; CID = "" }
}' | gzip > chebi.smi.gz
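
The same extraction pattern is used again for PubChem further down, so it can be convenient to wrap it in a function that takes the two data item names as arguments. The following is only a sketch: sdf_to_smiles is a hypothetical name, the sample records are made up, and, like the script above, it assumes that each data header is exactly "> <NAME>" and each value fits on one line.

```shell
# Hypothetical helper: print "<SMILES> <ID>" for each SD record on stdin.
# $1 = name of the SMILES data item, $2 = name of the ID data item.
sdf_to_smiles() {
    awk -v smi_tag="> <$1>" -v id_tag="> <$2>" '
        $0 == smi_tag { getline; smi = $0; next }
        $0 == id_tag  { getline; id = $0; next }
        $0 == "$$$$"  { print smi " " id; smi = ""; id = "" }
    '
}

# Made-up two-record fragment (data items only, no molfile blocks).
printf '%s\n' \
    '> <ChEBI ID>' 'CHEBI:1' '' '> <SMILES>' 'CCO' '' '$$$$' \
    '> <ChEBI ID>' 'CHEBI:2' '' '> <SMILES>' 'CC(=O)O' '' '$$$$' |
    sdf_to_smiles 'SMILES' 'ChEBI ID'
```

The order of the data items within a record does not matter, because the line is printed only when the record terminator is reached.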

eMolecules Plus

The free version of the eMolecules Plus Database can be downloaded from the eMolecules site. The first line of the gzipped file is a header, and each record carries two IDs (version_id and parent_id) after the SMILES. We remove the header line and join the two IDs with a - character to form the structure name.

gzip -dc version.smi.gz | tail -n +2 | awk '{ print $1 " " $2 "-" $3 }' | gzip -9 > emolecules-plus.smi.gz
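
To see what the conversion does to individual records, the same pipeline can be run on a few hand-written lines. In this sketch the function name emolecules_to_smiles is hypothetical, and both the header text and the ID values are made-up placeholders rather than real eMolecules data.

```shell
# Hypothetical helper wrapping the tail and awk steps of the command above.
emolecules_to_smiles() {
    # Skip the header line, then print "<SMILES> <version_id>-<parent_id>".
    tail -n +2 | awk '{ print $1 " " $2 "-" $3 }'
}

# Made-up sample: one header line followed by two records.
printf '%s\n' 'smiles version_id parent_id' 'CCO 100 200' 'c1ccccc1 101 201' |
    emolecules_to_smiles
```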


SureChEMBL

The SureChEMBL compound data dump can be downloaded in txt and sdf formats from the SureChEMBL FTP site; details are available in the README file there. The txt format is a tab-separated file containing ID, SMILES, InChI and InChIKey information. During processing, the first line of each file is dropped and the first two fields are used to construct the desired output in the form of <SMILES> <ID> lines. Use the FTP directory listing to get the list of files to download and process.

rm -f surechembl.smi.gz
wget -qO- | \
    tr "\"" "\n" | \
    grep "^ftp://ftp\.ebi\.ac\.uk.*\.txt\.gz$" | \
    while read -r url
    do
        # The local file name is the last path component of the URL.
        file="${url##*/}"
        echo "Download/process $file"
        # Skip the download if the file is already present.
        if [ ! -e "$file" ]
        then
            wget "$url"
        fi
        gzip -dc "$file" | tail -n +2 | awk '{ print $2 " " $1 }' | gzip -9 >> surechembl.smi.gz
    done
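
The loop above appends one compressed chunk per input file to surechembl.smi.gz. This works because a gzip file may consist of several concatenated members, which gzip -dc decompresses as one continuous stream. The sketch below illustrates this on two made-up dump files; the file names, IDs and structures are placeholders, not real SureChEMBL data.

```shell
tmp=$(mktemp -d)
# Two made-up tab-separated dump files, each with a header line.
printf 'ID\tSMILES\tInChI\tInChIKey\nID-1\tCCO\tinchi-1\tkey-1\n' | gzip > "$tmp/part1.txt.gz"
printf 'ID\tSMILES\tInChI\tInChIKey\nID-2\tCCN\tinchi-2\tkey-2\n' | gzip > "$tmp/part2.txt.gz"

for file in "$tmp"/part*.txt.gz
do
    # Drop the header, keep "<SMILES> <ID>", append as a new gzip member.
    gzip -dc "$file" | tail -n +2 | awk -F '\t' '{ print $2 " " $1 }' | gzip -9 >> "$tmp/combined.smi.gz"
done

# All members are read back as a single stream.
gzip -dc "$tmp/combined.smi.gz"
```

Because new members are always appended, the output file has to be removed before a rerun, as the rm -f line of the original command does.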

According to the README file at

The data content in SureChEMBL is licensed under a highly permissive Creative Commons license - specifically the "CC Attribution-ShareAlike 3.0 Unported license", see LICENSE file. The required attribution should contain the url of the SureChEMBL resource ( and should be visible on the entry portal for a web resource in which SureChEMBL is integrated, or contained with the documentation for any further distribution.

The required attributions according to file are:

The data in SureChEMBL is covered by the licence in the file LICENSE.

Under the -BY clause, we request attribution for subsequent use of SureChEMBL data.

For publications using SureChEMBL data, the primary current citation is:

  1. G. Papadatos, M. Davies, N. Dedman, J. Chambers, A. Gaulton, J. Siddle, R. Koks, S. A. Irvine, J. Pettersson, N. Goncharoff, A. Hersey, J. P. Overington (2016). SureChEMBL: a large-scale, chemically annotated patent document database. Nucleic Acids Research Database Issue, 44, D1220-D1228, DOI:10.1093/nar/gkv1253, PMID:26582922.

If SureChEMBL is incorporated into other works, we ask that the SureChEMBL IDs are preserved, and that the release date of SureChEMBL is clearly displayed.


ZINC

The ZINC All Purchasable subset (Reference pH 7 set) is available in gzipped SMILES; the download link is on the Downloads tab of the ZINC site.

wget -O zinc-all.smi.gz

PubChem Compound

PubChem Compound can be downloaded in gzipped SDF format from the PubChem FTP site; see the PubChem documentation for the specifications. To get a gzipped SMILES file, fields PUBCHEM_OPENEYE_ISO_SMILES and PUBCHEM_COMPOUND_CID are collected from the SDF files using an awk script. Note that this set contains ~68M structures, so the download size is over 50 GB (over 4000 files) and the SMILES extraction can take more than two hours. For the sake of simplicity, the awk script used below does not comply with the full SDfile specification of data headers and data values.

wget -nd -np -r*
gzip -dc *.sdf.gz | awk '{
    if ($0 == "> <PUBCHEM_OPENEYE_ISO_SMILES>") { getline ; SMI = $0; }
    else if ($0 == "> <PUBCHEM_COMPOUND_CID>") { getline ; CID = $0; }
    else if ($0 == "$$$$") { print SMI " " CID; SMI = ""; CID = "" }
}' | gzip -9 > pubchem-compound.smi.gz
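
The script above compares whole lines, so it only recognizes data headers written exactly as "> <NAME>". As a hedged illustration of a slightly more tolerant variant, the sketch below matches any line that starts with ">" and contains the tag, which also accepts headers carrying extra items. The function name pubchem_sdf_to_smiles is hypothetical, the sample record is hand-written, and multi-line data values are still not handled.

```shell
# Hypothetical helper: like the script above, but matches any data header
# line that starts with ">" and contains the tag, e.g. ">  <TAG>  (1)".
pubchem_sdf_to_smiles() {
    awk '
        /^>.*<PUBCHEM_OPENEYE_ISO_SMILES>/ { getline; smi = $0; next }
        /^>.*<PUBCHEM_COMPOUND_CID>/       { getline; cid = $0; next }
        $0 == "$$$$" { print smi " " cid; smi = ""; cid = "" }
    '
}

# Hand-written sample record with one decorated and one plain header.
printf '%s\n' \
    '>  <PUBCHEM_COMPOUND_CID>  (1)' '702' '' \
    '> <PUBCHEM_OPENEYE_ISO_SMILES>' 'CCO' '' \
    '$$$$' |
    pubchem_sdf_to_smiles
```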

PubChem Compound random ordering

A randomized ordering of the PubChem Compound structures can be created by shuffling the extracted SMILES file. Note that the execution time of the following script can be more than an hour.

gzip -dc pubchem-compound.smi.gz | \
    awk 'BEGIN { srand(0) ; }{ printf "%f%f %s\n", rand(), rand(), $0 }' | \
    sort | \
    sed -r 's/^[01]\.[0-9]+[01]\.[0-9]+ //' | \
    gzip -9 > pubchem-compound-rnd.smi.gz

Because the random number generator (rand()) is seeded with a fixed value (srand(0)), the shuffling script above is expected to produce the same ordering for the same input across multiple runs, provided the same awk implementation is used. Any prefix of the resulting file is a random subset of the input set.
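
Both properties of the shuffle, repeatability and the fact that no line is lost or duplicated, can be checked on a small input before spending an hour on the full set. This is a sketch only: shuffle_lines is a hypothetical wrapper around the pipeline above, it uses sed -E where the original uses the GNU-specific sed -r, and the numbers from seq stand in for SMILES lines.

```shell
# Hypothetical wrapper around the shuffling pipeline shown above.
shuffle_lines() {
    awk 'BEGIN { srand(0) } { printf "%f%f %s\n", rand(), rand(), $0 }' |
        sort |
        sed -E 's/^[01]\.[0-9]+[01]\.[0-9]+ //'
}

first=$(seq 1 20 | shuffle_lines)
second=$(seq 1 20 | shuffle_lines)

# Same seed and same input: both runs should give the same ordering.
[ "$first" = "$second" ] && echo "same ordering on both runs"

# Sorting the shuffled lines numerically should give back the input.
[ "$(printf '%s\n' "$first" | sort -n | tr '\n' ' ')" = "$(seq 1 20 | tr '\n' ' ')" ] &&
    echo "no line lost or duplicated"
```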

PubChem Compound random subsets

Derive random subsets by taking prefixes of the randomly ordered version described above. These random subsets can be used for benchmarking and verification.

gzip -dc pubchem-compound-rnd.smi.gz | head -1000 | gzip -9 > pubchem-compound-rnd-1k.smi.gz
gzip -dc pubchem-compound-rnd.smi.gz | head -10000 | gzip -9 > pubchem-compound-rnd-10k.smi.gz
gzip -dc pubchem-compound-rnd.smi.gz | head -100000 | gzip -9 > pubchem-compound-rnd-100k.smi.gz
gzip -dc pubchem-compound-rnd.smi.gz | head -1000000 | gzip -9 > pubchem-compound-rnd-1000k.smi.gz
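
Since each subset is a prefix of the randomly ordered file, every smaller subset should also be a prefix of every larger one. The sketch below shows one way to check this; is_prefix_of is a hypothetical helper, and the numbered lines are made-up stand-ins for the real SMILES files.

```shell
# Hypothetical helper: succeeds if gzipped file $1 equals the first $3
# lines of gzipped file $2.
is_prefix_of() {
    expected=$(mktemp)
    gzip -dc "$2" | head -n "$3" > "$expected"
    gzip -dc "$1" | cmp -s - "$expected"
    status=$?
    rm -f "$expected"
    return "$status"
}

# Made-up stand-ins for the randomly ordered file and one subset of it.
tmp=$(mktemp -d)
seq 1 100 | gzip > "$tmp/all.smi.gz"
gzip -dc "$tmp/all.smi.gz" | head -n 10 | gzip -9 > "$tmp/first-10.smi.gz"

is_prefix_of "$tmp/first-10.smi.gz" "$tmp/all.smi.gz" 10 && echo "first-10 is a prefix of all"
```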

According to "Fair Use Disclaimer":

Databases of molecular data on the NCBI FTP site include such examples as nucleotide sequences (GenBank), protein sequences, macromolecular structures, molecular variation, gene expression, and mapping data. They are designed to provide and encourage access within the scientific community to sources of current and comprehensive information. Therefore, NCBI itself places no restrictions on the use or distribution of the data contained therein. However, some submitters of the original data may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims and, therefore, cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in the molecular databases.

Random subsets 1k, 10k and 100k are available in directory data/molecules/pubchem-compound.

PubChem Compound random subsets SDF source

The SDF source (including SDF properties) of the 1k random subset is retrieved from the PubChem Power User Gateway, one compound ID at a time with a one-second pause between requests:

gzip -dc pubchem-compound-rnd-1k.smi.gz | \
    awk '{ print $2 }' | \
    while read -r cid
    do
        curl "${cid}/record/SDF" >> pubchem-compound-rnd-1k.sdf
        sleep 1
    done
gzip pubchem-compound-rnd-1k.sdf

The SDF source of random subset 1k is available in directory data/molecules/pubchem-compound.

GDB-13 and GDB-12

According to

GDB-13 enumerates small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules. With 977 468 314 structures, GDB-13 is the largest publicly available small organic molecule database to date.

970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Blum L. C.; Reymond J.-L. J. Am. Chem. Soc., 2009, 131, 8732-8733.

After accepting the Terms and Conditions, download "Entire GDB-13 (including all C/N/O/Cl/S molecules)". The archive contains several individual files, one for each atom count. By excluding 13.smi we get the enumeration up to 12 atoms. Note that converting to the flat smi.gz file takes several minutes.

tar xvzfO gdb13.tgz --exclude 13.smi | gzip > gdb-12.smi.gz
tar xvzfO gdb13.tgz | gzip > gdb-13.smi.gz
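
The effect of extracting to standard output with and without --exclude can be tried on a miniature archive first. The sketch below builds such an archive from made-up files named after atom counts, as in the GDB-13 archive; the file contents and the archive name mini.tgz are placeholders.

```shell
tmp=$(mktemp -d)
# Made-up miniature archive: one .smi file per atom count.
printf 'C\n' > "$tmp/1.smi"
printf 'CC\nCO\n' > "$tmp/2.smi"
printf 'CCC\nCCO\nCOC\n' > "$tmp/3.smi"
tar -czf "$tmp/mini.tgz" -C "$tmp" 1.smi 2.smi 3.smi

# Concatenate all members, then all members except the largest atom count.
all=$(tar -xzOf "$tmp/mini.tgz" | wc -l)
smaller=$(tar --exclude='3.smi' -xzOf "$tmp/mini.tgz" | wc -l)
echo "all: $((all)), without 3.smi: $((smaller))"
```

The arithmetic expansion in the last line only strips the padding that some wc implementations add to their output.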