rdkit.Chem.BuildFragmentCatalog module

command line utility for working with FragmentCatalogs (CASE-type analysis)


BuildFragmentCatalog [optional args] <filename>

filename, the name of a delimited text file containing InData, is required for some modes of operation (see below)

Command Line Arguments

  • -n maxNumMols: specify the maximum number of molecules to be processed

  • -b: build the catalog and OnBitLists

    requires InData

  • -s: score compounds

    requires InData and a Catalog, can use OnBitLists

  • -g: calculate info gains

    requires Scores

  • -d: show details about high-ranking fragments

    requires a Catalog and Gains

  • --catalog=*filename*: filename with the pickled catalog.

    If -b is provided, this file will be overwritten.

  • --onbits=*filename*: filename to hold the pickled OnBitLists. If -b is provided, this file will be overwritten.

  • --scores=*filename*: filename to hold the text score data. If -s is provided, this file will be overwritten.

  • --gains=*filename*: filename to hold the text gains data. If -g is provided, this file will be overwritten.

  • --details=*filename*: filename to hold the text details data. If -d is provided, this file will be overwritten.

  • --minPath=2: specify the minimum length for a path

  • --maxPath=6: specify the maximum length for a path

  • --smiCol=1: specify which column in the input data file contains SMILES

  • --actCol=-1: specify which column in the input data file contains activities

  • --nActs=2: specify the number of possible activity values

  • --nBits=-1: specify the maximum number of bits to show details for
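
To illustrate how the mode flags and long options combine, the sketch below assembles an argument list for a single build-and-score run. It assumes the module can be launched with `python -m rdkit.Chem.BuildFragmentCatalog`; the helper name `fragment_catalog_command` and every file name are hypothetical. The helper only builds the list and does not execute anything.

```python
import shlex
import sys


def fragment_catalog_command(in_file, modes, **long_opts):
    """Assemble an argument list for the BuildFragmentCatalog utility.

    modes     -- single-letter mode flags, e.g. "bs" for -b -s
    long_opts -- long options, e.g. catalog="cat.pkl" becomes --catalog=cat.pkl
    """
    # Assumption: the module is runnable via "python -m".
    cmd = [sys.executable, "-m", "rdkit.Chem.BuildFragmentCatalog"]
    cmd += [f"-{flag}" for flag in modes]
    cmd += [f"--{name}={value}" for name, value in long_opts.items()]
    cmd.append(in_file)  # the input data file goes last
    return cmd


# Hypothetical file names: build the catalog and on-bit lists, then score.
cmd = fragment_catalog_command(
    "compounds.csv", "bs",
    catalog="cat.pkl", onbits="onbits.pkl", scores="scores.txt",
    smiCol=1, actCol=2, maxPath=6,
)
print(shlex.join(cmd[1:]))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` once the input file exists. The later modes (-g, -d) would reuse the same helper with the files written by the earlier ones.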

rdkit.Chem.BuildFragmentCatalog.BuildCatalog(suppl, maxPts=-1, groupFileName=None, minPath=2, maxPath=6, reportFreq=10)

builds a fragment catalog from a set of molecules

Arguments

  • suppl: a mol supplier

  • maxPts: (optional) if provided, this will set an upper bound on the number of points to be considered

  • groupFileName: (optional) name of the file containing functional group information

  • minPath, maxPath: (optional) the minimum and maximum path lengths to be considered

  • reportFreq: (optional) how often to display status information


Returns a FragmentCatalog

rdkit.Chem.BuildFragmentCatalog.CalcGains(suppl, catalog, topN=-1, actName='', acts=None, nActs=2, reportFreq=10, biasList=None, collectFps=0)

calculates info gains by constructing fingerprints

Returns a 2-tuple:
  1. gains matrix

  2. list of fingerprints
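
For readers unfamiliar with the quantity being computed, here is a minimal pure-Python sketch of information gain for one bit, assuming the standard Shannon-entropy definition and a 2 x nActs table of counts (one row per bit state). The function names are illustrative only, and RDKit's own routine may differ in details.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def info_gain(table):
    """Information gain for one bit.

    table -- 2 x nActs counts; each row holds the per-activity molecule
             counts for one state of the bit (off or on).
    """
    n_acts = len(table[0])
    class_totals = [sum(row[a] for row in table) for a in range(n_acts)]
    total = sum(class_totals)
    if total == 0:
        return 0.0
    # Entropy of the activity labels after splitting on the bit,
    # weighted by the fraction of molecules falling in each row.
    conditional = sum((sum(row) / total) * entropy(row) for row in table)
    return entropy(class_totals) - conditional


# Made-up counts: a bit that separates the classes perfectly,
# and a bit that carries no information about activity.
perfect = [[10, 0], [0, 10]]
useless = [[5, 5], [5, 5]]
print(info_gain(perfect), info_gain(useless))
```

A bit whose on/off state tracks activity closely gets a gain near the entropy of the activity labels. A bit distributed identically across activity classes gets a gain of zero.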

rdkit.Chem.BuildFragmentCatalog.CalcGainsFromFps(suppl, fps, topN=-1, actName='', acts=None, nActs=2, reportFreq=10, biasList=None)

calculates info gains from a set of fingerprints


rdkit.Chem.BuildFragmentCatalog.OutputGainsData(outF, gains, cat, nActs=2)

rdkit.Chem.BuildFragmentCatalog.ProcessGainsData(inF, delim=',', idCol=0, gainCol=1)

reads a list of ids and info gains out of an input file
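
The kind of parsing described here can be sketched without RDKit. The stand-alone reader below is an approximation, not the library's implementation: it assumes the first line of the file is a header and that each later line holds an ID in one column and a numeric gain in another. The name `read_gains` and the sample data are made up for illustration.

```python
import io


def read_gains(in_file, delim=",", id_col=0, gain_col=1):
    """Read (id, gain) pairs from a delimited text stream.

    Assumes the first line is a header and skips it.
    """
    rows = []
    next(in_file, None)  # skip the assumed header line
    for line in in_file:
        fields = line.strip().split(delim)
        if len(fields) <= max(id_col, gain_col):
            continue  # ignore blank or short lines
        rows.append((fields[id_col], float(fields[gain_col])))
    return rows


# Made-up gains data for illustration only.
text = "id,Gain\n12,0.35\n7,0.10\n"
gains = read_gains(io.StringIO(text))
print(gains)
```

Each entry is a short sequence whose `id_col` element identifies a catalog entry, which is the shape that ShowDetails() expects for its gains argument.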

class rdkit.Chem.BuildFragmentCatalog.RunDetails

Bases: object

actCol = -1
biasList = None
catalogName = None
dbName = ''
delim = ','
detailsName = None
doBuild = 0
doDetails = 0
doGains = 0
doScore = 0
doSigs = 0
fpName = None
gainsName = None
hasTitle = 1
inFileName = None
maxPath = 6
minPath = 2
nActs = 2
nBits = -1
nameCol = -1
numMols = -1
onBitsName = None
scoresName = None
smiCol = 1
tableName = None
topN = -1

rdkit.Chem.BuildFragmentCatalog.ScoreFromLists(bitLists, suppl, catalog, maxPts=-1, actName='', acts=None, nActs=2, reportFreq=10)

similar to _ScoreMolecules()_, but uses pre-calculated bit lists for the molecules, which is considerably faster

Arguments

  • bitLists: sequence of on bit sequences for the input molecules

  • suppl: the input supplier (we read activities from here)

  • catalog: the FragmentCatalog

  • maxPts: (optional) the maximum number of molecules to be considered

  • actName: (optional) the name of the molecule’s activity property. If this is not provided, the molecule’s last property will be used.

  • nActs: (optional) number of possible activity values

  • reportFreq: (optional) how often to display status information


Returns the results table (a 3D array of ints, nBits x 2 x nActs)

rdkit.Chem.BuildFragmentCatalog.ScoreMolecules(suppl, catalog, maxPts=-1, actName='', acts=None, nActs=2, reportFreq=10)

scores the compounds in a supplier using a catalog

Arguments

  • suppl: a mol supplier

  • catalog: the FragmentCatalog

  • maxPts: (optional) the maximum number of molecules to be considered

  • actName: (optional) the name of the molecule’s activity property. If this is not provided, the molecule’s last property will be used.

  • acts: (optional) a sequence of activity values (integers). If not provided, the activities will be read from the molecules.

  • nActs: (optional) number of possible activity values

  • reportFreq: (optional) how often to display status information


Returns a 2-tuple:

  1. the results table (a 3D array of ints nBits x 2 x nActs)

  2. a list containing the on bit lists for each molecule
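
The shape of the results table may be easier to see with a small pure-Python tally. This is only a sketch under stated assumptions: index 0 along the second axis is taken to count molecules with the bit off and index 1 those with the bit on, bit IDs are used directly as zero-based row indices, and nested lists stand in for the array. The function name `tally_bits` and the data are hypothetical.

```python
def tally_bits(on_bit_lists, acts, n_bits, n_acts=2):
    """Build an n_bits x 2 x n_acts table of molecule counts.

    Assumed layout: table[bit][0][act] counts molecules with the bit off,
    table[bit][1][act] counts molecules with the bit on.
    """
    table = [[[0] * n_acts for _ in range(2)] for _ in range(n_bits)]
    for on_bits, act in zip(on_bit_lists, acts):
        on = set(on_bits)
        for bit in range(n_bits):
            table[bit][1 if bit in on else 0][act] += 1
    return table


# Three made-up molecules, four bits, two activity classes.
on_bit_lists = [[0, 2], [2], [1, 3]]
acts = [1, 1, 0]
table = tally_bits(on_bit_lists, acts, n_bits=4)
print(table[2])
```

Each `table[bit]` is a 2 x nActs contingency table relating one fragment bit to the activity classes, which is the form an information-gain calculation consumes.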

rdkit.Chem.BuildFragmentCatalog.ShowDetails(catalog, gains, nToDo=-1, outF=sys.stdout, idCol=0, gainCol=1, outDelim=', ')

gains should be a sequence of sequences. The idCol entry of each sub-sequence should be a catalog ID. _ProcessGainsData()_ provides suitable input.

rdkit.Chem.BuildFragmentCatalog.message(msg, dest=sys.stdout)