rdkit.ML.SLT.Risk module

Code for calculating empirical risk bounds

rdkit.ML.SLT.Risk.BurgesRiskBound(VCDim, nData, nWrong, conf)

Calculates Burges’s formulation of the risk bound

The formulation is from Eqn. 3 of Burges’s review article “A Tutorial on Support Vector Machines for Pattern Recognition”, _Data Mining and Knowledge Discovery_, Vol. 2, Kluwer Academic Publishers (1998).
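
For reference, Eqn. 3 of that paper bounds the true risk R(α), with probability 1 − η, as

```latex
R(\alpha) \le R_{\mathrm{emp}}(\alpha)
          + \sqrt{\frac{h\,(\log(2l/h) + 1) - \log(\eta/4)}{l}}
```

where h is the VC dimension (VCDim), l is the number of data points (nData), and R_emp(α) is the empirical risk (nWrong/nData). I believe conf plays the role of η here, but that mapping is an assumption rather than something stated explicitly.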

Arguments

  • VCDim: the VC dimension of the system

  • nData: the number of data points used

  • nWrong: the number of data points misclassified

  • conf: the confidence to be used for this risk bound

Returns

  • a float

Notes

  • This has been validated against the Burges paper

  • I believe that this is only technically valid for binary classification
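
A minimal usage sketch, with hypothetical numbers chosen only to illustrate the documented signature:

```python
from rdkit.ML.SLT.Risk import BurgesRiskBound

# Hypothetical inputs: a classifier with VC dimension 10, trained on
# 1000 points of which 50 were misclassified, with a confidence
# parameter of 0.05.
bound = BurgesRiskBound(VCDim=10, nData=1000, nWrong=50, conf=0.05)
print(bound)  # a float: the bound on the true risk
```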

rdkit.ML.SLT.Risk.CherkasskyRiskBound(VCDim, nData, nWrong, conf, a1=1.0, a2=2.0)

Calculates Cherkassky and Mulier’s formulation of the risk bound

The formulation is from Eqns. 4.22 and 4.23 on pg 108 of Cherkassky and Mulier’s book “Learning From Data”, Wiley (1998).

Arguments

  • VCDim: the VC dimension of the system

  • nData: the number of data points used

  • nWrong: the number of data points misclassified

  • conf: the confidence to be used for this risk bound

  • a1, a2: constants in the risk equation. Restrictions on these values:

    • 0 <= a1 <= 4

    • 0 <= a2 <= 2

Returns

  • a float

Notes

  • This appears to behave reasonably

  • The default a1=1.0 is chosen by analogy to Burges’s paper.
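
A minimal usage sketch with hypothetical numbers, first with the default constants and then with explicit values inside the documented ranges:

```python
from rdkit.ML.SLT.Risk import CherkasskyRiskBound

# Same hypothetical inputs as above, using the default constants
# a1=1.0, a2=2.0.
bound = CherkasskyRiskBound(VCDim=10, nData=1000, nWrong=50, conf=0.05)

# Explicit constants, staying within the documented restrictions
# 0 <= a1 <= 4 and 0 <= a2 <= 2.
bound2 = CherkasskyRiskBound(VCDim=10, nData=1000, nWrong=50, conf=0.05,
                             a1=4.0, a2=2.0)
print(bound, bound2)
```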

rdkit.ML.SLT.Risk.CristianiRiskBound(VCDim, nData, nWrong, conf)

Calculates Cristianini and Shawe-Taylor’s formulation of the risk bound

The formulation is from Theorem 4.6 on pg 58 of the book “An Introduction to Support Vector Machines” by Cristianini and Shawe-Taylor, Cambridge University Press (2000).

Arguments

  • VCDim: the VC dimension of the system

  • nData: the number of data points used

  • nWrong: the number of data points misclassified

  • conf: the confidence to be used for this risk bound

Returns

  • a float

Notes

  • This generates odd values that do not match those in the reference
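
A minimal usage sketch with hypothetical numbers (bearing in mind the caveat above about the values this produces):

```python
from rdkit.ML.SLT.Risk import CristianiRiskBound

# Same hypothetical inputs as above.
bound = CristianiRiskBound(VCDim=10, nData=1000, nWrong=50, conf=0.05)
print(bound)  # a float
```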

rdkit.ML.SLT.Risk.log2(x)

Returns the base-2 logarithm of x.
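
A one-line usage sketch, assuming base-2 semantics as the name suggests:

```python
from rdkit.ML.SLT.Risk import log2

print(log2(8.0))  # expected: 3.0 (log base 2 of 8)
```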