rdkit.ML.InfoTheory.rdInfoTheory module¶
Module containing a collection of functions for information metrics and a ranker for ranking bits
- class rdkit.ML.InfoTheory.rdInfoTheory.BitCorrMatGenerator((object)arg1)¶
Bases:
instance
A class to generate a pairwise correlation matrix for a list of bits. The mode of operation for this class is something like this:
>>> cmg = BitCorrMatGenerator()
>>> cmg.SetBitList(blist)
>>> for fp in fpList:
...     cmg.CollectVotes(fp)
>>> corrMat = cmg.GetCorrMatrix()
The resulting correlation matrix is a one-dimensional numeric array containing the lower-triangle elements
- C++ signature :
void __init__(_object*)
- CollectVotes((BitCorrMatGenerator)self, (AtomPairsParameters)bitVect) None : ¶
For each pair of on bits (bi, bj) in fp, increase the correlation count for the pair by 1.
ARGUMENTS:
fp : the bit vector (fingerprint) to collect votes from
- C++ signature :
void CollectVotes(RDInfoTheory::BitCorrMatGenerator*,boost::python::api::object)
- GetCorrMatrix((BitCorrMatGenerator)self) object : ¶
Get the correlation matrix following the collection of votes from a bunch of fingerprints
- C++ signature :
_object* GetCorrMatrix(RDInfoTheory::BitCorrMatGenerator*)
- SetBitList((BitCorrMatGenerator)self, (AtomPairsParameters)bitList) None : ¶
Set the list of bits that need to be correlated
These may, for example, be the top-ranking ensemble bits
ARGUMENTS:
bitList : an integer list of bit IDs
- C++ signature :
void SetBitList(RDInfoTheory::BitCorrMatGenerator*,boost::python::api::object)
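A minimal, self-contained sketch of this workflow; the fingerprints and bit list below are made up for illustration:

from rdkit import DataStructs
from rdkit.ML.InfoTheory import rdInfoTheory

# Build a few small example fingerprints; in practice these would come
# from real molecules.
fpList = []
for onBits in ((0, 2), (0, 2, 3), (1, 3)):
    fp = DataStructs.ExplicitBitVect(4)
    for bit in onBits:
        fp.SetBit(bit)
    fpList.append(fp)

cmg = rdInfoTheory.BitCorrMatGenerator()
cmg.SetBitList([0, 2, 3])        # the bits whose pairwise correlations we want
for fp in fpList:
    cmg.CollectVotes(fp)

# For n bits, the result is a 1D array holding the n*(n-1)/2 lower-triangle counts.
corrMat = cmg.GetCorrMatrix()
print(corrMat)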
- rdkit.ML.InfoTheory.rdInfoTheory.ChiSquare((AtomPairsParameters)resArr) float : ¶
Calculates the chi squared value for a variable
ARGUMENTS:
varMat : a numeric array containing the number of occurrences of each result for each possible value of the given variable.
- So, for a variable that adopts 4 possible values and a result that has 3 possible values, varMat would be 4x3
RETURNS:
a Python float object
- C++ signature :
double ChiSquare(boost::python::api::object)
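A minimal sketch, assuming the function accepts a two-dimensional NumPy integer array of counts laid out as described above (the numbers are made up, and the dtype may need to match the platform's C long):

import numpy as np
from rdkit.ML.InfoTheory import rdInfoTheory

# Rows correspond to the values of the variable, columns to the possible results.
varMat = np.array([[10, 2],
                   [3, 9]], dtype=np.int64)
print(rdInfoTheory.ChiSquare(varMat))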
- class rdkit.ML.InfoTheory.rdInfoTheory.InfoBitRanker((object)self, (int)nBits, (int)nClasses)¶
Bases:
instance
A class to rank the bits from a series of labelled fingerprints. A simple demonstration may help clarify what this class does. Here's a small set of vectors:
>>> for i,bv in enumerate(bvs): print(bv.ToBitString(),acts[i])
...
0001 0
0101 0
0010 1
1110 1
Default ranker, using infogain:
>>> ranker = InfoBitRanker(4,2)
>>> for i,bv in enumerate(bvs): ranker.AccumulateVotes(bv,acts[i])
...
>>> for bit,gain,n0,n1 in ranker.GetTopN(3): print(int(bit),'%.3f'%gain,int(n0),int(n1))
...
3 1.000 2 0
2 1.000 0 2
0 0.311 0 1
Using the biased infogain:
>>> ranker = InfoBitRanker(4,2,InfoTheory.InfoType.BIASENTROPY)
>>> ranker.SetBiasList((1,))
>>> for i,bv in enumerate(bvs): ranker.AccumulateVotes(bv,acts[i])
...
>>> for bit,gain,n0,n1 in ranker.GetTopN(3): print(int(bit),'%.3f'%gain,int(n0),int(n1))
...
2 1.000 0 2
0 0.311 0 1
1 0.000 1 1
A chi squared ranker is also available:
>>> ranker = InfoBitRanker(4,2,InfoTheory.InfoType.CHISQUARE)
>>> for i,bv in enumerate(bvs): ranker.AccumulateVotes(bv,acts[i])
...
>>> for bit,gain,n0,n1 in ranker.GetTopN(3): print(int(bit),'%.3f'%gain,int(n0),int(n1))
...
3 4.000 2 0
2 4.000 0 2
0 1.333 0 1
As is a biased chi squared:
>>> ranker = InfoBitRanker(4,2,InfoTheory.InfoType.BIASCHISQUARE)
>>> ranker.SetBiasList((1,))
>>> for i,bv in enumerate(bvs): ranker.AccumulateVotes(bv,acts[i])
...
>>> for bit,gain,n0,n1 in ranker.GetTopN(3): print(int(bit),'%.3f'%gain,int(n0),int(n1))
...
2 4.000 0 2
0 1.333 0 1
1 0.000 1 1
- C++ signature :
void __init__(_object*,int,int)
__init__( (object)self, (int)nBits, (int)nClasses, (InfoType)infoType) -> None :
- C++ signature :
void __init__(_object*,int,int,RDInfoTheory::InfoBitRanker::InfoType)
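The demonstrations above assume the labelled vectors bvs and acts already exist. A minimal sketch of one way to build them with ExplicitBitVect (variable names are illustrative) and run the default ranker:

from rdkit import DataStructs
from rdkit.ML.InfoTheory import rdInfoTheory

# Rebuild the four labelled fingerprints shown above
# (bit strings '0001', '0101', '0010', '1110'; activities 0, 0, 1, 1).
patterns = ['0001', '0101', '0010', '1110']
acts = [0, 0, 1, 1]
bvs = []
for pattern in patterns:
    bv = DataStructs.ExplicitBitVect(len(pattern))
    for i, c in enumerate(pattern):
        if c == '1':
            bv.SetBit(i)
    bvs.append(bv)

ranker = rdInfoTheory.InfoBitRanker(4, 2)      # 4 bits, 2 classes, default infogain metric
for bv, act in zip(bvs, acts):
    ranker.AccumulateVotes(bv, act)
for bit, gain, n0, n1 in ranker.GetTopN(3):    # each row: bit id, metric value, per-class counts
    print(int(bit), '%.3f' % gain, int(n0), int(n1))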
- AccumulateVotes((InfoBitRanker)self, (AtomPairsParameters)bitVect, (int)label) None : ¶
Accumulate the votes for all the bits turned on in a bit vector
ARGUMENTS:
bv : the bit vector, either an ExplicitBitVect or a SparseBitVect
label : the class label for the bit vector. It is assumed that 0 <= label < nClasses
- C++ signature :
void AccumulateVotes(RDInfoTheory::InfoBitRanker*,boost::python::api::object,int)
- GetTopN((InfoBitRanker)self, (int)num) object : ¶
Returns the top n bits ranked by the information metric. This is the function where most of the ranking work actually happens.
ARGUMENTS:
num : the number of top ranked bits that are required
- C++ signature :
_object* GetTopN(RDInfoTheory::InfoBitRanker*,int)
- SetBiasList((InfoBitRanker)self, (AtomPairsParameters)classList) None : ¶
Set the classes to which the entropy calculation should be biased
This list contains a set of class ids used when ranking bits in the BIASENTROPY mode. In this mode, a bit must be more highly correlated with one of the biased classes than with all of the other classes. For example, in a two-class problem with actives and inactives, the fraction of actives that hit the bit has to be greater than the fraction of inactives that hit the bit.
ARGUMENTS:
classList : list of class ids that we want a bias towards
- C++ signature :
void SetBiasList(RDInfoTheory::InfoBitRanker*,boost::python::api::object)
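A short sketch biasing a two-class ranker toward class 1 (e.g. the "actives"); the biased-infogain demonstration in the class docstring above shows the same calls in context:

from rdkit.ML.InfoTheory import rdInfoTheory

ranker = rdInfoTheory.InfoBitRanker(4, 2, rdInfoTheory.InfoType.BIASENTROPY)
ranker.SetBiasList((1,))   # only bits more strongly correlated with class 1 should rank highly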
- SetMaskBits((InfoBitRanker)self, (AtomPairsParameters)maskBits) None : ¶
Set the mask bits for the calculation
ARGUMENTS:
maskBits : list of mask bits to use
- C++ signature :
void SetMaskBits(RDInfoTheory::InfoBitRanker*,boost::python::api::object)
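A short sketch; the assumption here, based on the argument description, is that only the listed bit IDs are considered when the bits are ranked:

from rdkit.ML.InfoTheory import rdInfoTheory

ranker = rdInfoTheory.InfoBitRanker(8, 2)
ranker.SetMaskBits([0, 2, 5, 7])   # assumed: restrict the ranking to these bit IDs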
- Tester((InfoBitRanker)self, (AtomPairsParameters)bitVect) None : ¶
- C++ signature :
void Tester(RDInfoTheory::InfoBitRanker*,boost::python::api::object)
- WriteTopBitsToFile((InfoBitRanker)self, (str)fileName) None : ¶
Write the bits that have been ranked to a file
- C++ signature :
void WriteTopBitsToFile(RDInfoTheory::InfoBitRanker {lvalue},std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
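A short, self-contained sketch (the filename is just an example); GetTopN is called first so that there is a ranking to write:

from rdkit import DataStructs
from rdkit.ML.InfoTheory import rdInfoTheory

ranker = rdInfoTheory.InfoBitRanker(4, 2)
bv0 = DataStructs.ExplicitBitVect(4)
bv0.SetBit(0)
bv1 = DataStructs.ExplicitBitVect(4)
bv1.SetBit(2)
ranker.AccumulateVotes(bv0, 0)
ranker.AccumulateVotes(bv1, 1)
ranker.GetTopN(2)                          # rank the bits
ranker.WriteTopBitsToFile('top_bits.txt')  # write the ranked bits out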
- rdkit.ML.InfoTheory.rdInfoTheory.InfoEntropy((AtomPairsParameters)resArr) float : ¶
Calculates the informational entropy of the values in an array
ARGUMENTS:
resArr : a one-dimensional array of integer counts (the underlying C++ function takes a long int array and its length)
RETURNS:
a double
- C++ signature :
double InfoEntropy(boost::python::api::object)
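A minimal sketch, assuming the function accepts a one-dimensional NumPy integer array of counts (the numbers are made up, and the dtype may need to match the platform's C long):

import numpy as np
from rdkit.ML.InfoTheory import rdInfoTheory

counts = np.array([20, 5, 10, 5], dtype=np.int64)   # occurrence counts for each value
print(rdInfoTheory.InfoEntropy(counts))             # entropy of the distribution, in bits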
- rdkit.ML.InfoTheory.rdInfoTheory.InfoGain((AtomPairsParameters)resArr) float : ¶
Calculates the information gain for a variable
ARGUMENTS:
varMat : a numeric array containing the number of occurrences of each result for each possible value of the given variable.
- So, for a variable that adopts 4 possible values and a result that has 3 possible values, varMat would be 4x3
RETURNS:
a Python float object
NOTES
this is a drop-in replacement for _PyInfoGain()_ in entropy.py
- C++ signature :
double InfoGain(boost::python::api::object)
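A minimal sketch, assuming a two-dimensional NumPy integer array of counts laid out as described above (numbers made up; the dtype may need to match the platform's C long):

import numpy as np
from rdkit.ML.InfoTheory import rdInfoTheory

# 2 possible values of the variable x 2 possible results
varMat = np.array([[6, 2],
                   [1, 7]], dtype=np.int64)
print(rdInfoTheory.InfoGain(varMat))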
- class rdkit.ML.InfoTheory.rdInfoTheory.InfoType¶
Bases:
enum
- BIASCHISQUARE = rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASCHISQUARE¶
- BIASENTROPY = rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASENTROPY¶
- CHISQUARE = rdkit.ML.InfoTheory.rdInfoTheory.InfoType.CHISQUARE¶
- ENTROPY = rdkit.ML.InfoTheory.rdInfoTheory.InfoType.ENTROPY¶
- names = {'BIASCHISQUARE': rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASCHISQUARE, 'BIASENTROPY': rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASENTROPY, 'CHISQUARE': rdkit.ML.InfoTheory.rdInfoTheory.InfoType.CHISQUARE, 'ENTROPY': rdkit.ML.InfoTheory.rdInfoTheory.InfoType.ENTROPY}¶
- values = {1: rdkit.ML.InfoTheory.rdInfoTheory.InfoType.ENTROPY, 2: rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASENTROPY, 3: rdkit.ML.InfoTheory.rdInfoTheory.InfoType.CHISQUARE, 4: rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASCHISQUARE}¶