""" command line utility for building composite models

#DOC

**Usage**

  BuildComposite [optional args] filename

Unless indicated otherwise (via command line arguments), _filename_ is
a QDAT file.

**Command Line Arguments**

  - -o *filename*: name of the output file for the pickled composite

  - -n *num*: number of separate models to add to the composite

  - -p *tablename*: store persistence data in the database
     in table *tablename*

  - -N *note*: attach some arbitrary text to the persistence data

  - -b *filename*: name of the text file to hold examples from the
     hold-out set which are misclassified

  - -s: split the data into training and hold-out sets before building
     the composite

  - -f *frac*: the fraction of data to use in the training set when the
     data is split

  - -r: randomize the activities (for testing purposes). This ignores
     the initial distribution of activity values and produces each
     possible activity value with equal likelihood.

  - -S: shuffle the activities (for testing purposes). This produces
     a permutation of the input activity values.

  - -l: lock the random number generator to give consistent sets
     of training and hold-out data. This is primarily intended
     for testing purposes.

  - -B: use a so-called Bayesian composite model.

  - -d *database name*: instead of reading the data from a QDAT file,
     pull it from a database. In this case, the _filename_ argument
     provides the name of the database table containing the data set.

  - -D: show a detailed breakdown of the composite model performance
     across the training and, when appropriate, hold-out sets.

  - -P *pickle file name*: write the pickled data set to this file

  - -F *filter frac*: filters the data before training to change the
     distribution of activity values in the training set. *filter
     frac* is the fraction of the training set that should have the
     target value. **See note below on data filtering.**

  - -v *filter value*: filters the data before training to change the
     distribution of activity values in the training set. *filter
     value* is the target value to use in filtering. **See note below
     on data filtering.**

  - --modelFiltFrac *model filter frac*: similar to *filter frac* above,
     but the data is filtered for each model in the composite
     rather than once for the whole composite. *model
     filter frac* is the fraction of the training set for each model
     that should have the target value (*model filter value*).

  - --modelFiltVal *model filter value*: target value to use for
     filtering data before training each model in the composite.

  - -t *threshold value*: use *threshold value* as the confidence
     threshold, so that only high-confidence predictions are used for
     the final analysis of the hold-out data.

  - -Q *list string*: the values of quantization bounds for the
     activity value. See the _-q_ argument for the format of *list
     string*.

  - --nRuns *count*: build *count* composite models

  - --prune: prune any models built

  - -h: print a usage message and exit.

  - -V: print the version number and exit

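The difference between -r and -S is easy to miss, so here is a minimal
sketch of the two behaviors. It is an illustration only, not the code this
module runs: the function names, the fixed seed and the use of the standard
_random_ module are assumptions.

```python
import random


def shuffle_activities(acts, rng):
  # -S style: return a permutation of the input activity values,
  # so the distribution of values is unchanged
  res = list(acts)
  rng.shuffle(res)
  return res


def randomize_activities(acts, n_possible, rng):
  # -r style: ignore the input distribution and draw each activity
  # uniformly from the n_possible allowed values
  return [rng.randrange(n_possible) for _ in acts]


rng = random.Random(42)  # fixed seed, chosen arbitrarily for the example
acts = [0] * 95 + [1] * 5
shuffled = shuffle_activities(acts, rng)
randomized = randomize_activities(acts, 2, rng)
```

Shuffling keeps the 95/5 balance of the input; randomizing produces the two
values in roughly equal numbers.
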
*-*-*-*-*-*-*-*- Tree-Related Options -*-*-*-*-*-*-*-*

  - -g: be less greedy when training the models.

  - -G *number*: force trees to be rooted at descriptor *number*.

  - -L *limit*: provide an (integer) limit on individual model
     complexity

  - -q *list string*: add QuantTrees to the composite and use the list
     specified in *list string* as the number of target quantization
     bounds for each descriptor. Don't forget to include 0's at the
     beginning and end of *list string* for the name and value fields.
     For example, if there are 4 descriptors and you want 2 quant
     bounds apiece, you would use _-q "[0,2,2,2,2,0]"_.
     Two special cases:
       1) If you would like to ignore a descriptor in the model
          building, use '-1' for its number of quant bounds.
       2) If you have integer valued data that should not be quantized
          further, enter 0 for that descriptor.

  - --recycle: allow descriptors to be used more than once in a tree

  - --randomDescriptors=val: toggles growing random forests with val
     randomly-selected descriptors available at each node.


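The _-q_ list string can be assembled mechanically from the per-descriptor
counts. The helper below is a hypothetical convenience (it is not part of
this module) that pads the counts with the two 0's for the name and value
fields.

```python
def quant_bounds_arg(n_bounds_per_descriptor):
  # one entry per descriptor: a positive count asks for that many
  # quantization bounds, -1 ignores the descriptor, 0 leaves integer
  # valued data unquantized
  counts = [0] + list(n_bounds_per_descriptor) + [0]  # pad name and value fields
  return '[' + ','.join(str(c) for c in counts) + ']'


# 4 descriptors with 2 quant bounds apiece, as in the example above
arg = quant_bounds_arg([2, 2, 2, 2])
# 3 descriptors: quantize the first, ignore the second, keep the third as is
special = quant_bounds_arg([2, -1, 0])
```

The first call reproduces the "[0,2,2,2,2,0]" string from the _-q_
description.
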
*-*-*-*-*-*-*-*- KNN-Related Options -*-*-*-*-*-*-*-*

  - --doKnn: use K-Nearest Neighbors models

  - --knnK=*value*: the value of K to use in the KNN models

  - --knnTanimoto: use the Tanimoto metric in KNN models

  - --knnEuclid: use a Euclidean metric in KNN models

*-*-*-*-*-*-*-*- Naive Bayes Classifier Options -*-*-*-*-*-*-*-*

  - --doNaiveBayes: use Naive Bayes classifiers

  - --mEstimateVal: the value to be used in the m-estimate formula.
     If this is greater than 0.0, we use it to compute the conditional
     probabilities by the m-estimate

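For reference, the textbook m-estimate of a conditional probability is
(n_match + m*prior)/(n_class + m). The sketch below evaluates that formula;
the function name, its arguments and the choice of prior are assumptions,
and the classifier's own implementation may differ in detail.

```python
def m_estimate(n_match, n_class, m, prior):
  # n_match: training points of the class that have the feature value
  # n_class: training points of the class
  # m: equivalent sample size (the role played by --mEstimateVal)
  # prior: prior estimate of the probability, e.g. 1/k for k possible values
  return (n_match + m * prior) / float(n_class + m)


# a feature value never seen among 10 points of a class
raw = m_estimate(0, 10, 0.0, 0.5)
smoothed = m_estimate(0, 10, 2.0, 0.5)
```

With m = 0 the unseen value gets probability 0; with m = 2 and a prior of
0.5 it gets 1/12 instead.
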
*-*-*-*-*-*-*-*- SVM-Related Options -*-*-*-*-*-*-*-*

**** NOTE: THESE ARE DISABLED ****

##   - --doSVM: use Support-vector machines

##   - --svmKernel=*kernel*: choose the type of kernel to be used for
##     the SVMs. Options are:
##     The default is:

##   - --svmType=*type*: choose the type of support-vector machine
##     to be used. Options are:
##     The default is:

##   - --svmGamma=*gamma*: provide the gamma value for the SVMs. If this
##     is not provided, a grid search will be carried out to determine an
##     optimal *gamma* value for each SVM.

##   - --svmCost=*cost*: provide the cost value for the SVMs. If this is
##     not provided, a grid search will be carried out to determine an
##     optimal *cost* value for each SVM.

##   - --svmWeights=*weights*: provide the weight values for the
##     activities. If provided, this should be a sequence of (label,
##     weight) 2-tuples *nActs* long. If not provided, a weight of 1
##     will be used for each activity.

##   - --svmEps=*epsilon*: provide the epsilon value used to determine
##     when the SVM has converged. Defaults to 0.001

##   - --svmDegree=*degree*: provide the degree of the kernel (when
##     sensible). Defaults to 3

##   - --svmCoeff=*coeff*: provide the coefficient for the kernel (when
##     sensible). Defaults to 0

##   - --svmNu=*nu*: provide the nu value for the kernel (when sensible).
##     Defaults to 0.5

##   - --svmDataType=*float*: if the data contains only 1s and 0s, specify
##     binary. Defaults to float

##   - --svmCache=*cache*: provide the size of the memory cache (in MB)
##     to be used while building the SVM. Defaults to 40

**Notes**

  - *Data filtering*: When there is a large disparity between the
    numbers of points with various activity levels present in the
    training set, it is sometimes desirable to train on a more
    homogeneous data set. This can be accomplished using filtering.
    The filtering process works by selecting a particular target
    fraction and target value. For example, in a case where 95% of
    the original training set has activity 0 and only 5% activity 1, we
    could filter (by randomly removing points with activity 0) so that
    30% of the data set used to build the composite has activity 1.


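The filtering idea from the note above can be sketched in a few lines. This
is a simplified stand-in, not the filtering routine this module calls: the
function name and the seed are assumptions, and it only handles the case in
the note, where the target value is the under-represented one.

```python
import random


def filter_to_fraction(examples, target_val, target_frac, rng):
  # keep every example with the target activity and randomly drop
  # the others until the targets make up about target_frac of the result
  targets = [ex for ex in examples if ex[-1] == target_val]
  others = [ex for ex in examples if ex[-1] != target_val]
  n_others = int(len(targets) * (1.0 - target_frac) / target_frac)
  n_others = min(n_others, len(others))
  kept = rng.sample(others, n_others)
  return targets + kept


rng = random.Random(23)  # fixed seed, chosen arbitrarily for the example
# 95 points with activity 0 and 5 with activity 1; the activity is the last field
data = [['pt%d' % i, 0] for i in range(95)] + [['pt%d' % i, 1] for i in range(95, 100)]
filtered = filter_to_fraction(data, 1, 0.30, rng)
frac = sum(1 for ex in filtered if ex[-1] == 1) / float(len(filtered))
```

With the 95/5 split above this keeps all 5 active points and 11 inactive
ones, so about 31% of the filtered set has activity 1.
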
"""
from rdkit import RDConfig
from rdkit.utils import listutils
from rdkit.ML.Composite import Composite,BayesComposite

import numpy
import math
from rdkit.ML.Data import DataUtils,SplitData
from rdkit.ML import ScreenComposite
from rdkit.Dbase import DbModule
from rdkit.Dbase.DbConnection import DbConnect
from rdkit.ML import CompositeRun
import sys,cPickle,time
from rdkit import DataStructs

_runDetails = CompositeRun.CompositeRun()

__VERSION_STRING="3.2.3"

_verbose = 1
def message(msg):
  """ emits messages to _sys.stdout_
    override this in modules which import this one to redirect output

    **Arguments**

      - msg: the string to be displayed

  """
  if _verbose: sys.stdout.write('%s\n'%(msg))


def testall(composite,examples,badExamples=None):
  """ screens a number of examples past a composite

    **Arguments**

      - composite: a composite model

      - examples: a list of examples (with results) to be screened

      - badExamples: (optional) a list to which misclassified examples
        are appended

    **Returns**

      a list of 2-tuples containing:

        1) a vote

        2) a confidence

      these are the votes and confidence levels for **misclassified** examples

  """
  # a fresh list per call; a mutable default would be shared between calls
  if badExamples is None:
    badExamples = []
  wrong = []
  for example in examples:
    if composite.GetActivityQuantBounds():
      answer = composite.QuantizeActivity(example)[-1]
    else:
      answer = example[-1]
    res,conf = composite.ClassifyExample(example)
    if res != answer:
      wrong.append((res,conf))
      badExamples.append(example)

  return wrong

def GetCommandLine(details):
  """ builds the command line that corresponds to a details object

    **Arguments**

      - details: a _CompositeRun.CompositeRun_ object

    **Returns**

      the command line, as a string

  """
  args = ['BuildComposite']
  args.append('-n %d'%(details.nModels))
  if details.filterFrac != 0.0: args.append('-F %.3f -v %d'%(details.filterFrac,details.filterVal))
  if details.modelFilterFrac != 0.0: args.append('--modelFiltFrac=%.3f --modelFiltVal=%d'%(details.modelFilterFrac,
                                                                                     details.modelFilterVal))
  if details.splitRun: args.append('-s -f %.3f'%(details.splitFrac))
  if details.shuffleActivities: args.append('-S')
  if details.randomActivities: args.append('-r')
  if details.threshold > 0.0: args.append('-t %.3f'%(details.threshold))
  if details.activityBounds: args.append('-Q "%s"'%(details.activityBoundsVals))
  if details.dbName: args.append('-d %s'%(details.dbName))
  if details.detailedRes: args.append('-D')
  if hasattr(details,'noScreen') and details.noScreen: args.append('--noScreen')
  if details.persistTblName and details.dbName:
    args.append('-p %s'%(details.persistTblName))
  if details.note:
    args.append('-N %s'%(details.note))
  if details.useTrees:
    if details.limitDepth>0: args.append('-L %d'%(details.limitDepth))
    if details.lessGreedy: args.append('-g')
    if details.qBounds:
      shortBounds = listutils.CompactListRepr(details.qBounds)
      if details.qBounds: args.append('-q "%s"'%(shortBounds))
    else:
      if details.qBounds: args.append('-q "%s"'%(details.qBoundCount))

    if details.pruneIt: args.append('--prune')
    if details.startAt: args.append('-G %d'%details.startAt)
    if details.recycleVars: args.append('--recycle')
    if details.randomDescriptors: args.append('--randomDescriptors=%d'%details.randomDescriptors)
  if details.useSigTrees:
    args.append('--doSigTree')
    if details.limitDepth>0: args.append('-L %d'%(details.limitDepth))
    if details.randomDescriptors:
      args.append('--randomDescriptors=%d'%details.randomDescriptors)

  if details.useKNN:
    args.append('--doKnn --knnK %d'%(details.knnNeighs))
    if details.knnDistFunc=='Tanimoto':
      args.append('--knnTanimoto')
    else:
      args.append('--knnEuclid')

  if details.useNaiveBayes:
    args.append('--doNaiveBayes')
    if details.mEstimateVal >= 0.0 :
      args.append('--mEstimateVal=%.3f'%details.mEstimateVal)

  if details.replacementSelection: args.append('--replacementSelection')

  # the data set name is the single positional argument, so it goes last
  if details.tableName: args.append(details.tableName)

  return ' '.join(args)

def RunOnData(details,data,progressCallback=None,saveIt=1,setDescNames=0):
  nExamples = data.GetNPts()
  if details.lockRandom:
    seed = details.randomSeed
  else:
    import random
    seed = (random.randint(0,1e6),random.randint(0,1e6))
  DataUtils.InitRandomNumbers(seed)
  testExamples = []
  if details.shuffleActivities == 1:
    DataUtils.RandomizeActivities(data,shuffle=1,runDetails=details)
  elif details.randomActivities == 1:
    DataUtils.RandomizeActivities(data,shuffle=0,runDetails=details)

  namedExamples = data.GetNamedData()
  if details.splitRun == 1:
    trainIdx,testIdx = SplitData.SplitIndices(len(namedExamples),details.splitFrac,
                                              silent=not _verbose)

    trainExamples = [namedExamples[x] for x in trainIdx]
    testExamples = [namedExamples[x] for x in testIdx]
  else:
    testExamples = []
    testIdx = []
    trainIdx = range(len(namedExamples))
    trainExamples = namedExamples

  if details.filterFrac != 0.0:
    # if the activities are quantized on the fly, filter on the quantized values
    if hasattr(details,'activityBounds') and details.activityBounds:
      tExamples = []
      bounds = details.activityBounds
      for pt in trainExamples:
        pt = pt[:]
        act = pt[-1]
        placed=0
        bound=0
        # the quantized activity is the index of the first bound above act
        while not placed and bound < len(bounds):
          if act < bounds[bound]:
            pt[-1] = bound
            placed = 1
          else:
            bound += 1
        if not placed:
          pt[-1] = bound
        tExamples.append(pt)
    else:
      bounds = None
      tExamples = trainExamples
    trainIdx,temp = DataUtils.FilterData(tExamples,details.filterVal,
                                         details.filterFrac,-1,
                                         indicesOnly=1)
    # points removed by the filter are added to the test set
    tmp = [trainExamples[x] for x in trainIdx]
    testExamples += [trainExamples[x] for x in temp]
    trainExamples = tmp

    counts = DataUtils.CountResults(trainExamples,bounds=bounds)
    ks = counts.keys()
    ks.sort()
    message('Result Counts in training set:')
    for k in ks:
      message(str((k, counts[k])))
    counts = DataUtils.CountResults(testExamples,bounds=bounds)
    ks = counts.keys()
    ks.sort()
    message('Result Counts in test set:')
    for k in ks:
      message(str((k, counts[k])))
  nExamples = len(trainExamples)
  message('Training with %d examples'%(nExamples))

  nVars = data.GetNVars()
  attrs = range(1,nVars+1)
  nPossibleVals = data.GetNPossibleVals()
  for i in range(1,len(nPossibleVals)):
    if nPossibleVals[i-1] == -1:
      attrs.remove(i)

  if details.pickleDataFileName != '':
    pickleDataFile = open(details.pickleDataFileName,'wb+')
    cPickle.dump(trainExamples,pickleDataFile)
    cPickle.dump(testExamples,pickleDataFile)
    pickleDataFile.close()

  if details.bayesModel:
    composite = BayesComposite.BayesComposite()
  else:
    composite = Composite.Composite()

  composite._randomSeed = seed
  composite._splitFrac = details.splitFrac
  composite._shuffleActivities = details.shuffleActivities
  composite._randomizeActivities = details.randomActivities

  if hasattr(details,'filterFrac'):
    composite._filterFrac = details.filterFrac
  if hasattr(details,'filterVal'):
    composite._filterVal = details.filterVal

  composite.SetModelFilterData(details.modelFilterFrac, details.modelFilterVal)

  composite.SetActivityQuantBounds(details.activityBounds)
  nPossibleVals = data.GetNPossibleVals()
  if details.activityBounds:
    nPossibleVals[-1] = len(details.activityBounds)+1


  if setDescNames:
    composite.SetInputOrder(data.GetVarNames())
    composite.SetDescriptorNames(details._descNames)
  else:
    composite.SetDescriptorNames(data.GetVarNames())
  composite.SetActivityQuantBounds(details.activityBounds)
  if details.nModels==1:
    details.internalHoldoutFrac=0.0
  if details.useTrees:
    from rdkit.ML.DecTree import CrossValidate,PruneTree
    if details.qBounds != []:
      from rdkit.ML.DecTree import BuildQuantTree
      builder = BuildQuantTree.QuantTreeBoot
    else:
      from rdkit.ML.DecTree import ID3
      builder = ID3.ID3Boot
    driver = CrossValidate.CrossValidationDriver
    pruner = PruneTree.PruneTree

    composite.SetQuantBounds(details.qBounds)
    nPossibleVals = data.GetNPossibleVals()
    if details.activityBounds:
      nPossibleVals[-1] = len(details.activityBounds)+1
    composite.Grow(trainExamples,attrs,nPossibleVals=[0]+nPossibleVals,
                   buildDriver=driver,
                   pruner=pruner,
                   nTries=details.nModels,pruneIt=details.pruneIt,
                   lessGreedy=details.lessGreedy,needsQuantization=0,
                   treeBuilder=builder,nQuantBounds=details.qBounds,
                   startAt=details.startAt,
                   maxDepth=details.limitDepth,
                   progressCallback=progressCallback,
                   holdOutFrac=details.internalHoldoutFrac,
                   replacementSelection=details.replacementSelection,
                   recycleVars=details.recycleVars,
                   randomDescriptors=details.randomDescriptors,
                   silent=not _verbose)

  elif details.useSigTrees:
    from rdkit.ML.DecTree import CrossValidate
    from rdkit.ML.DecTree import BuildSigTree
    builder = BuildSigTree.SigTreeBuilder
    driver = CrossValidate.CrossValidationDriver
    nPossibleVals = data.GetNPossibleVals()
    if details.activityBounds:
      nPossibleVals[-1] = len(details.activityBounds)+1
    if hasattr(details,'sigTreeBiasList'):
      biasList = details.sigTreeBiasList
    else:
      biasList=None
    if hasattr(details,'useCMIM'):
      useCMIM=details.useCMIM
    else:
      useCMIM=0
    if hasattr(details,'allowCollections'):
      allowCollections = details.allowCollections
    else:
      allowCollections=False
    composite.Grow(trainExamples,attrs,nPossibleVals=[0]+nPossibleVals,
                   buildDriver=driver,
                   nTries=details.nModels,
                   needsQuantization=0,
                   treeBuilder=builder,
                   maxDepth=details.limitDepth,
                   progressCallback=progressCallback,
                   holdOutFrac=details.internalHoldoutFrac,
                   replacementSelection=details.replacementSelection,
                   recycleVars=details.recycleVars,
                   randomDescriptors=details.randomDescriptors,
                   biasList=biasList,
                   useCMIM=useCMIM,
                   allowCollection=allowCollections,
                   silent=not _verbose)

  elif details.useKNN:
    from rdkit.ML.KNN import CrossValidate
    from rdkit.ML.KNN import DistFunctions

    driver = CrossValidate.CrossValidationDriver
    dfunc = ''
    if (details.knnDistFunc == "Euclidean") :
      dfunc = DistFunctions.EuclideanDist
    elif (details.knnDistFunc == "Tanimoto"):
      dfunc = DistFunctions.TanimotoDist
    else:
      assert 0,"Bad KNN distance metric value"


    composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
                   buildDriver=driver, nTries=details.nModels,
                   needsQuantization=0,
                   numNeigh=details.knnNeighs,
                   holdOutFrac=details.internalHoldoutFrac,
                   distFunc=dfunc)

  elif details.useNaiveBayes or details.useSigBayes:
    from rdkit.ML.NaiveBayes import CrossValidate
    driver = CrossValidate.CrossValidationDriver
    if not (hasattr(details,'useSigBayes') and details.useSigBayes):
      composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
                     buildDriver=driver, nTries=details.nModels,
                     needsQuantization=0, nQuantBounds=details.qBounds,
                     holdOutFrac=details.internalHoldoutFrac,
                     replacementSelection=details.replacementSelection,
                     mEstimateVal=details.mEstimateVal,
                     silent=not _verbose)
    else:
      if hasattr(details,'useCMIM'):
        useCMIM=details.useCMIM
      else:
        useCMIM=0

      composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
                     buildDriver=driver, nTries=details.nModels,
                     needsQuantization=0, nQuantBounds=details.qBounds,
                     mEstimateVal=details.mEstimateVal,
                     useSigs=True,useCMIM=useCMIM,
                     holdOutFrac=details.internalHoldoutFrac,
                     replacementSelection=details.replacementSelection,
                     silent=not _verbose)

  else:
    # none of the other model types was requested: fall back to neural nets
    from rdkit.ML.Neural import CrossValidate
    driver = CrossValidate.CrossValidationDriver
    composite.Grow(trainExamples,attrs,[0]+nPossibleVals,nTries=details.nModels,
                   buildDriver=driver,needsQuantization=0)

  composite.AverageErrors()
  composite.SortModels()
  modelList,counts,avgErrs = composite.GetAllData()
  counts = numpy.array(counts)
  avgErrs = numpy.array(avgErrs)
  composite._varNames = data.GetVarNames()

  for i in xrange(len(modelList)):
    modelList[i].NameModel(composite._varNames)

  # count-weighted average error and average absolute deviation of the models
  weightedErrs = counts*avgErrs
  averageErr = sum(weightedErrs)/sum(counts)
  devs = (avgErrs - averageErr)
  devs = devs * counts
  devs = numpy.sqrt(devs*devs)
  avgDev = sum(devs)/sum(counts)
  message('# Overall Average Error: %%% 5.2f, Average Deviation: %%% 6.2f'%(100.*averageErr,100.*avgDev))

  if details.bayesModel:
    composite.Train(trainExamples,verbose=0)

  # drop the stored examples before the composite is pickled
  composite.ClearModelExamples()
  if saveIt:
    composite.Pickle(details.outName)
  details.model = DbModule.binaryHolder(cPickle.dumps(composite))

  badExamples = []
  if not details.detailedRes and (not hasattr(details,'noScreen') or not details.noScreen):
    if details.splitRun:
      message('Testing all hold-out examples')
      wrong = testall(composite,testExamples,badExamples)
      message('%d examples (%% %5.2f) were misclassified'%(len(wrong),
                                                          100.*float(len(wrong))/float(len(testExamples))))
      _runDetails.holdout_error = float(len(wrong))/len(testExamples)
    else:
      message('Testing all examples')
      wrong = testall(composite,namedExamples,badExamples)
      message('%d examples (%% %5.2f) were misclassified'%(len(wrong),
                                                          100.*float(len(wrong))/float(len(namedExamples))))
      _runDetails.overall_error = float(len(wrong))/len(namedExamples)

  if details.detailedRes:
    message('\nEntire data set:')
    resTup = ScreenComposite.ShowVoteResults(range(data.GetNPts()),data,composite,
                                             nPossibleVals[-1],details.threshold)
    nGood,nBad,nSkip,avgGood,avgBad,avgSkip,voteTab = resTup
    nPts = len(namedExamples)
    nClass = nGood+nBad
    _runDetails.overall_error = float(nBad) / nClass
    _runDetails.overall_correct_conf = avgGood
    _runDetails.overall_incorrect_conf = avgBad
    _runDetails.overall_result_matrix = repr(voteTab)
    nRej = nClass-nPts
    if nRej > 0:
      _runDetails.overall_fraction_dropped = float(nRej)/nPts

    if details.splitRun:
      message('\nHold-out data:')
      resTup = ScreenComposite.ShowVoteResults(range(len(testExamples)),testExamples,
                                               composite,
                                               nPossibleVals[-1],details.threshold)
      nGood,nBad,nSkip,avgGood,avgBad,avgSkip,voteTab = resTup
      nPts = len(testExamples)
      nClass = nGood+nBad
      _runDetails.holdout_error = float(nBad) / nClass
      _runDetails.holdout_correct_conf = avgGood
      _runDetails.holdout_incorrect_conf = avgBad
      _runDetails.holdout_result_matrix = repr(voteTab)
      nRej = nClass-nPts
      if nRej > 0:
        _runDetails.holdout_fraction_dropped = float(nRej)/nPts


  if details.persistTblName and details.dbName:
    message('Updating results table %s:%s'%(details.dbName,details.persistTblName))
    details.Store(db=details.dbName,table=details.persistTblName)

  if details.badName != '':
    # one line per misclassified example: the example, then its (vote, confidence)
    badFile = open(details.badName,'w+')
    for i in xrange(len(badExamples)):
      ex = badExamples[i]
      vote = wrong[i]
      outStr = '%s\t%s\n'%(ex,vote)
      badFile.write(outStr)
    badFile.close()

  composite.ClearModelExamples()
  return composite

def RunIt(details,progressCallback=None,saveIt=1,setDescNames=0):
  """ does the actual work of building a composite model

    **Arguments**

      - details: a _CompositeRun.CompositeRun_ object containing details
        (options, parameters, etc.) about the run

      - progressCallback: (optional) a function which is called with a single
        argument (the number of models built so far) after each model is built.

      - saveIt: (optional) if this is nonzero, the resulting model will be pickled
        and dumped to the filename specified in _details.outName_

      - setDescNames: (optional) if nonzero, the composite's _SetInputOrder()_ method
        will be called using the results of the data set's _GetVarNames()_ method;
        it is assumed that the details object has a _descNames attribute which
        is passed to the composite's _SetDescriptorNames()_ method. Otherwise
        (the default), _SetDescriptorNames()_ gets the results of _GetVarNames()_.

    **Returns**

      the composite model constructed


  """
  details.rundate = time.asctime()

  fName = details.tableName.strip()
  if details.outName == '':
    details.outName = fName + '.pkl'
  if not details.dbName:
    if details.qBounds != []:
      data = DataUtils.TextFileToData(fName)
    else:
      data = DataUtils.BuildQuantDataSet(fName)
  elif details.useSigTrees or details.useSigBayes:
    details.tableName = fName
    data = details.GetDataSet(pickleCol=0,pickleClass=DataStructs.ExplicitBitVect)
  elif details.qBounds != [] or not details.useTrees:
    details.tableName = fName
    data = details.GetDataSet()
  else:
    data = DataUtils.DBToQuantData(details.dbName,fName,quantName=details.qTableName,
                                   user=details.dbUser,password=details.dbPassword)

  composite = RunOnData(details,data,progressCallback=progressCallback,
                        saveIt=saveIt,setDescNames=setDescNames)
  return composite


def ShowVersion(includeArgs=0):
  """ prints the version number

    **Arguments**

      - includeArgs: (optional) if nonzero, the command line is printed too

  """
  print 'This is BuildComposite.py version %s'%(__VERSION_STRING)
  if includeArgs:
    import sys
    print 'command line was:'
    print ' '.join(sys.argv)

def Usage():
  """ prints the list of command line arguments (the module docstring)
    and exits

  """
  import sys
  print __doc__
  sys.exit(-1)

def SetDefaults(runDetails=None):
  """ initializes a details object with default values

    **Arguments**

      - runDetails: (optional) a _CompositeRun.CompositeRun_ object.
        If this is not provided, the global _runDetails will be used.

    **Returns**

      the initialized _CompositeRun_ object.


  """
  if runDetails is None: runDetails = _runDetails
  return CompositeRun.SetDefaults(runDetails)

def ParseArgs(runDetails):
  """ parses command line arguments and updates _runDetails_

    **Arguments**

      - runDetails: a _CompositeRun.CompositeRun_ object.

  """
  import getopt
  args,extra = getopt.getopt(sys.argv[1:],'P:o:n:p:b:sf:F:v:hlgd:rSTt:BQ:q:DVG:N:L:',
                             ['nRuns=','prune','profile',
                              'seed=','noScreen',

                              'modelFiltFrac=', 'modelFiltVal=',

                              'recycle','randomDescriptors=',

                              'doKnn','knnK=','knnTanimoto','knnEuclid',

                              'doSigTree','doCMIM=','allowCollections',

                              'doNaiveBayes', 'mEstimateVal=',
                              'doSigBayes',

                              'replacementSelection',

                              ])
  runDetails.profileIt=0
  for arg,val in args:
    if arg == '-n':
      runDetails.nModels = int(val)
    elif arg == '-N':
      runDetails.note=val
    elif arg == '-o':
      runDetails.outName = val
    elif arg == '-Q':
      qBounds = eval(val)
      assert type(qBounds) in [type([]),type(())],'bad argument type for -Q, specify a list as a string'
      runDetails.activityBounds=qBounds
      runDetails.activityBoundsVals=val
    elif arg == '-p':
      runDetails.persistTblName=val
    elif arg == '-P':
      runDetails.pickleDataFileName= val
    elif arg == '-r':
      runDetails.randomActivities = 1
    elif arg == '-S':
      runDetails.shuffleActivities = 1
    elif arg == '-b':
      runDetails.badName = val
    elif arg == '-B':
      # RunOnData() reads details.bayesModel, so that is the attribute to set
      runDetails.bayesModel=1
    elif arg == '-s':
      runDetails.splitRun = 1
    elif arg == '-f':
      runDetails.splitFrac=float(val)
    elif arg == '-F':
      runDetails.filterFrac=float(val)
    elif arg == '-v':
      runDetails.filterVal=float(val)
    elif arg == '-l':
      runDetails.lockRandom = 1
    elif arg == '-g':
      runDetails.lessGreedy=1
    elif arg == '-G':
      runDetails.startAt = int(val)
    elif arg == '-d':
      runDetails.dbName=val
    elif arg == '-T':
      runDetails.useTrees = 0
    elif arg == '-t':
      runDetails.threshold=float(val)
    elif arg == '-D':
      runDetails.detailedRes = 1
    elif arg == '-L':
      runDetails.limitDepth = int(val)
    elif arg == '-q':
      qBounds = eval(val)
      assert type(qBounds) in [type([]),type(())],'bad argument type for -q, specify a list as a string'
      runDetails.qBoundCount=val
      runDetails.qBounds = qBounds
    elif arg == '-V':
      ShowVersion()
      sys.exit(0)
    elif arg == '--nRuns':
      runDetails.nRuns = int(val)
    elif arg == '--modelFiltFrac':
      runDetails.modelFilterFrac=float(val)
    elif arg == '--modelFiltVal':
      runDetails.modelFilterVal=float(val)
    elif arg == '--prune':
      runDetails.pruneIt=1
    elif arg == '--profile':
      runDetails.profileIt=1

    elif arg == '--recycle':
      runDetails.recycleVars=1
    elif arg == '--randomDescriptors':
      runDetails.randomDescriptors=int(val)

    # the model-type options are mutually exclusive: each one switches the others off
    elif arg == '--doKnn':
      runDetails.useKNN=1
      runDetails.useTrees=0

      runDetails.useNaiveBayes=0
    elif arg == '--knnK':
      runDetails.knnNeighs = int(val)
    elif arg == '--knnTanimoto':
      runDetails.knnDistFunc="Tanimoto"
    elif arg == '--knnEuclid':
      runDetails.knnDistFunc="Euclidean"

    elif arg == '--doSigTree':

      runDetails.useKNN=0
      runDetails.useTrees=0
      runDetails.useNaiveBayes=0
      runDetails.useSigTrees=1
    elif arg == '--doCMIM':
      runDetails.useCMIM=int(val)
    elif arg == '--allowCollections':
      runDetails.allowCollections=True

    elif arg == '--doNaiveBayes':
      runDetails.useNaiveBayes=1

      runDetails.useKNN=0
      runDetails.useTrees=0
      runDetails.useSigBayes=0
    elif arg == '--doSigBayes':
      runDetails.useSigBayes=1
      runDetails.useNaiveBayes=0

      runDetails.useKNN=0
      runDetails.useTrees=0
    elif arg == '--mEstimateVal':
      runDetails.mEstimateVal=float(val)

    elif arg== '--seed':
      # the seed is given as a Python expression, e.g. a 2-tuple of integers
      runDetails.randomSeed = eval(val)

    elif arg== '--noScreen':
      runDetails.noScreen=1

    elif arg== '--replacementSelection':
      runDetails.replacementSelection = 1

    elif arg == '-h':
      Usage()

    else:
      Usage()
  # the single positional argument names the data file (or database table)
  runDetails.tableName=extra[0]

if __name__ == '__main__':
  if len(sys.argv) < 2:
    Usage()

  _runDetails.cmd = ' '.join(sys.argv)
  SetDefaults(_runDetails)
  ParseArgs(_runDetails)


  ShowVersion(includeArgs=1)

  if _runDetails.nRuns > 1:
    for i in range(_runDetails.nRuns):
      sys.stderr.write('---------------------------------\n\tDoing %d of %d\n---------------------------------\n'%(i+1,_runDetails.nRuns))
      RunIt(_runDetails)
  else:
    if _runDetails.profileIt:
      import hotshot,hotshot.stats
      prof=hotshot.Profile('prof.dat')
      prof.runcall(RunIt,_runDetails)
      stats = hotshot.stats.load('prof.dat')
      stats.strip_dirs()
      stats.sort_stats('time','calls')
      stats.print_stats(30)
    else:
      RunIt(_runDetails)