rdkit.ML.Data.DataUtils module

Utilities for data manipulation


  • .qdat files contain quantized data suitable for feeding to learning algorithms.

The .qdat file, written by _DecTreeGui_, is structured as follows:

  1. Any number of lines which are ignored.

  2. A line containing the string ‘Variable Table’, followed by any number of variable definitions in the format:

    ‘# Variable_name [quant_bounds]’

    where ‘[quant_bounds]’ is a list of the boundaries used for quantizing that variable. If the variable is inherently integral (i.e. not quantized), this can be an empty list.

  3. A line beginning with ‘# ----’, which signals the end of the variable list

  4. Any number of lines containing data points, in the format:

    ‘Name_of_point var1 var2 var3 ... varN’

    All variable values should be integers.

Throughout, it is assumed that varN is the result.
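The layout above can be illustrated with a small stand-alone parser. This is only a sketch of the format as described here, not the module's own reader: the helper name `parse_qdat`, the sample file contents, and the assumption that the terminator line starts with `# ----` and that bounds are written as Python list literals are all illustrative.

```python
import ast
import io


def parse_qdat(lines):
    """Parse .qdat-style text following the layout described above (illustrative only)."""
    it = iter(lines)
    # 1. Ignore everything up to the line containing 'Variable Table'.
    for line in it:
        if 'Variable Table' in line:
            break
    var_names, q_bounds = [], []
    # 2./3. Read '# name [bounds]' definitions until the '# ----' terminator.
    for line in it:
        if line.startswith('# ----'):
            break
        body = line.lstrip('#').strip()
        if not body:
            continue
        name, _, bounds = body.partition(' ')
        var_names.append(name)
        # Assumption: bounds are written as a Python list literal; empty means "not quantized".
        q_bounds.append(list(ast.literal_eval(bounds.strip() or '[]')))
    # 4. Every remaining non-empty line is 'Name_of_point var1 ... varN' with integer values.
    names, examples = [], []
    for line in it:
        fields = line.split()
        if not fields or fields[0].startswith('#'):
            continue
        names.append(fields[0])
        examples.append([int(x) for x in fields[1:]])
    return var_names, q_bounds, names, examples


# Hypothetical file contents used only for this example.
sample = """\
free-form header, ignored
# Variable Table
# var1 [0.5, 1.5]
# var2 []
# act [2.0]
# ----
mol-1 0 3 1
mol-2 2 1 0
"""
var_names, q_bounds, names, examples = parse_qdat(io.StringIO(sample))
```

In each parsed example the last value corresponds to varN, i.e. the result.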

  • .dat files contain the same information as .qdat files, but the variable values can be anything (floats, ints, strings). These files should still contain quant_bounds!

  • .qdat.pkl files contain a pickled (binary) representation of the data read in. They store, in order:

    1. A python list of the variable names

    2. A python list of lists with the quantization bounds

    3. A python list of the point names

    4. A python list of lists with the data points
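As a rough illustration of that ordering, the sketch below writes four objects to a binary stream and reads them back in the same order. It assumes the objects are stored as four successive pickles in one stream, which the description above suggests but does not state explicitly; the data values are made up.

```python
import io
import pickle

# Hypothetical contents, in the order listed above.
var_names = ['var1', 'act']
q_bounds = [[0.5, 1.5], []]
pt_names = ['mol-1', 'mol-2']
examples = [[0, 1], [2, 0]]

# Write the four objects one after another (assumed layout).
buf = io.BytesIO()
for obj in (var_names, q_bounds, pt_names, examples):
    pickle.dump(obj, buf)

# Read them back in the same order.
buf.seek(0)
loaded = [pickle.load(buf) for _ in range(4)]
```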


builds a data set from a .dat file


  • fileName: the name of the .dat file


an _MLData.MLDataSet_


builds a data set from a .qdat file


  • fileName: the name of the .qdat file


an _MLData.MLQuantDataSet_

rdkit.ML.Data.DataUtils.CalcNPossibleUsingMap(data, order, qBounds, nQBounds=None, silent=True)

calculates the number of possible values for each variable in a data set


  • data: a list of examples

  • order: the ordering map between the variables in _data_ and _qBounds_

  • qBounds: the quantization bounds for the variables


a list with the number of possible values each variable takes on in the data set


  • variables present in _qBounds_ will have their _nPossible_ number read from _qBounds_

  • _nPossible_ for other numeric variables will be calculated

rdkit.ML.Data.DataUtils.CountResults(inData, col=-1, bounds=None)


rdkit.ML.Data.DataUtils.DBToData(dbName, tableName, user='sysdba', password='masterkey', dupCol=-1, what='*', where='', join='', pickleCol=-1, pickleClass=None, ensembleIds=None)

constructs an _MLData.MLDataSet_ from a database


  • dbName: the name of the database to be opened

  • tableName: the table name containing the data in the database

  • user: the user name to be used to connect to the database

  • password: the password to be used to connect to the database

  • dupCol: if nonzero, specifies which column should be used to recognize duplicates.


an _MLData.MLDataSet_


  • this uses Dbase.DataUtils functionality

rdkit.ML.Data.DataUtils.FilterData(inData, val, frac, col=-1, indicesToUse=None, indicesOnly=0)



Seeds the random number generators


  • seed: a 2-tuple containing integers to be used as the random number seeds


this seeds both the RDRandom generator and the one in the standard Python _random_ module

rdkit.ML.Data.DataUtils.RandomizeActivities(dataSet, shuffle=0, runDetails=None)

randomizes the activity values of a dataset


  • dataSet: an _ML.Data.MLQuantDataSet_ whose activities will be randomized

  • shuffle: an optional toggle. If this is set, the activity values will be shuffled (so the number in each class remains constant)

  • runDetails: an optional CompositeRun object


  • _examples_ are randomized in place
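The effect of the _shuffle_ option can be sketched in plain Python. This is not the library's implementation: `shuffle_activities` is a hypothetical helper, and the example assumes that each example is a list whose last element is the activity.

```python
import random
from collections import Counter


def shuffle_activities(examples, rng):
    """Permute the last column (the activity) across examples, in place."""
    acts = [ex[-1] for ex in examples]
    rng.shuffle(acts)
    for ex, act in zip(examples, acts):
        ex[-1] = act


# Hypothetical examples: name, two descriptors, activity.
examples = [['m1', 0, 1, 0], ['m2', 1, 1, 1], ['m3', 2, 0, 1], ['m4', 0, 0, 0]]
before = Counter(ex[-1] for ex in examples)
shuffle_activities(examples, random.Random(23))
after = Counter(ex[-1] for ex in examples)
```

Because the activities are only permuted, `before` and `after` hold the same class counts, and the other columns are untouched.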


reads the examples from a .dat file


  • inFile: a file object


a 2-tuple containing:

  1. the names of the examples

  2. a list of lists containing the examples themselves


  • this attempts to convert variable values to ints, then floats. If both conversions fail, the values are left as strings
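The int, then float, then string fallback described in the note can be sketched as follows. The helper name `convert_value` is hypothetical and is not part of the module.

```python
def convert_value(text):
    """Try int first, then float; fall back to the original string."""
    for caster in (int, float):
        try:
            return caster(text)
        except ValueError:
            continue
    return text


row = [convert_value(tok) for tok in '3 2.5 1e-3 high'.split()]
# row is [3, 2.5, 0.001, 'high']
```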


reads the examples from a .qdat file


  • inFile: a file object


a 2-tuple containing:

  1. the names of the examples

  2. a list of lists containing the examples themselves


because this is reading a .qdat file, it is assumed that all variable values are integers


reads the variables and quantization bounds from a .qdat or .dat file


  • inFile: a file object


a 2-tuple containing:

  1. varNames: a list of the variable names

  2. qbounds: the list of quantization bounds for each variable

rdkit.ML.Data.DataUtils.TakeEnsemble(vect, ensembleIds, isDataVect=False)

>>> v = [10,20,30,40,50]
>>> TakeEnsemble(v,(1,2,3))
[20, 30, 40]
>>> v = ['foo',10,20,30,40,50,1]
>>> TakeEnsemble(v,(1,2,3),isDataVect=True)
['foo', 20, 30, 40, 1]

rdkit.ML.Data.DataUtils.TextFileToData(fName, onlyCols=None)


rdkit.ML.Data.DataUtils.TextToData(reader, ignoreCols=[], onlyCols=None)

constructs an _MLData.MLDataSet_ from text data

  • reader needs to be iterable and return lists of elements (like a csv.reader)


an _MLData.MLDataSet_
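A reader of the kind described above can be built with the standard library. The sketch below only shows the reader side; the column names and values are made up, and how _TextToData_ interprets the first row is not stated here, so treating it as a header would be an assumption.

```python
import csv
import io

# Hypothetical comma-separated text; in practice this might be an open file.
text = "Name,var1,act\nmol-1,0.5,1\nmol-2,1.5,0\n"

# csv.reader is iterable and yields each row as a list of strings.
reader = csv.reader(io.StringIO(text))
rows = list(reader)
```

Note that listing the rows consumes the reader; a fresh `csv.reader` would be needed before handing it to another consumer.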

rdkit.ML.Data.DataUtils.WriteData(outFile, varNames, qBounds, examples)

writes out a .qdat file


  • outFile: a file object

  • varNames: a list of variable names

  • qBounds: the list of quantization bounds (should be the same length as _varNames_)

  • examples: the data to be written
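To make the relationship between these arguments concrete, here is a stand-alone sketch that writes text in the .qdat layout described at the top of this page. It is not the library's _WriteData_: the helper `write_qdat` is hypothetical, and it assumes each example is a list starting with the point name and that the terminator line is `# ----`.

```python
import io


def write_qdat(out_file, var_names, q_bounds, examples):
    """Write the variable table, the terminator line, then one line per example."""
    out_file.write('# Variable Table\n')
    # One definition per variable; q_bounds is assumed to align with var_names.
    for name, bounds in zip(var_names, q_bounds):
        out_file.write('# %s %s\n' % (name, list(bounds)))
    out_file.write('# ----\n')
    for ex in examples:
        out_file.write(' '.join(str(v) for v in ex) + '\n')


buf = io.StringIO()
write_qdat(buf, ['var1', 'act'], [[0.5, 1.5], []],
           [['mol-1', 2, 1], ['mol-2', 0, 0]])
text = buf.getvalue()
```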

rdkit.ML.Data.DataUtils.WritePickledData(outName, data)

writes either a .qdat.pkl or a .dat.pkl file


  • outName: the name of the file to be used

  • data: either an _MLData.MLDataSet_ or an _MLData.MLQuantDataSet_
