Source Code for Module ML.BuildComposite

# $Id: BuildComposite.py 2 2006-05-06 22:54:39Z glandrum $
#
#  Copyright (C) 2000-2006  greg Landrum and Rational Discovery LLC
#
#   @@ All Rights Reserved  @@
#
""" command line utility for building composite models

#DOC

**Usage**

  BuildComposite [optional args] filename

Unless indicated otherwise (via command line arguments), _filename_ is
a QDAT file.

**Command Line Arguments**

  - -o *filename*: name of the output file for the pickled composite

  - -n *num*: number of separate models to add to the composite

  - -p *tablename*: store persistence data in the database
     in table *tablename*

  - -N *note*: attach some arbitrary text to the persistence data

  - -b *filename*: name of the text file to hold examples from the
     holdout set which are misclassified

  - -s: split the data into training and hold-out sets before building
     the composite

  - -f *frac*: the fraction of data to use in the training set when the
     data is split

  - -r: randomize the activities (for testing purposes).  This ignores
     the initial distribution of activity values and produces each
     possible activity value with equal likelihood.

  - -S: shuffle the activities (for testing purposes).  This produces
     a permutation of the input activity values.

  - -l: locks the random number generator to give consistent sets
     of training and hold-out data.  This is primarily intended
     for testing purposes.

  - -B: use a so-called Bayesian composite model.

  - -d *database name*: instead of reading the data from a QDAT file,
     pull it from a database.  In this case, the _filename_ argument
     provides the name of the database table containing the data set.

  - -D: show a detailed breakdown of the composite model performance
     across the training and, when appropriate, hold-out sets.

  - -P *pickle file name*: write out the pickled data set to the file

  - -F *filter frac*: filters the data before training to change the
     distribution of activity values in the training set.  *filter
     frac* is the fraction of the training set that should have the
     target value.  **See note below on data filtering.**

  - -v *filter value*: filters the data before training to change the
     distribution of activity values in the training set. *filter
     value* is the target value to use in filtering.  **See note below
     on data filtering.**

  - --modelFiltFrac *model filter frac*: Similar to filter frac above,
     in this case the data is filtered for each model in the composite
     rather than a single overall filter for a composite. *model
     filter frac* is the fraction of the training set for each model
     that should have the target value (*model filter value*).

  - --modelFiltVal *model filter value*: target value to use for
     filtering data before training each model in the composite.

  - -t *threshold value*: use high-confidence predictions for the
     final analysis of the hold-out data.

  - -Q *list string*: the values of quantization bounds for the
     activity value.  See the _-q_ argument for the format of *list
     string*.

  - --nRuns *count*: build *count* composite models

  - --prune: prune any models built

  - -h: print a usage message and exit.

  - -V: print the version number and exit

  *-*-*-*-*-*-*-*- Tree-Related Options -*-*-*-*-*-*-*-*

  - -g: be less greedy when training the models.

  - -G *number*: force trees to be rooted at descriptor *number*.

  - -L *limit*: provide an (integer) limit on individual model
     complexity

  - -q *list string*: Add QuantTrees to the composite and use the list
     specified in *list string* as the number of target quantization
     bounds for each descriptor.  Don't forget to include 0's at the
     beginning and end of *list string* for the name and value fields.
     For example, if there are 4 descriptors and you want 2 quant
     bounds apiece, you would use _-q "[0,2,2,2,2,0]"_.
     Two special cases:
       1) If you would like to ignore a descriptor in the model
          building, use '-1' for its number of quant bounds.
       2) If you have integer valued data that should not be quantized
          further, enter 0 for that descriptor.

  - --recycle: allow descriptors to be used more than once in a tree

  - --randomDescriptors=val: toggles growing random forests with val
      randomly-selected descriptors available at each node.


  *-*-*-*-*-*-*-*- KNN-Related Options -*-*-*-*-*-*-*-*

  - --doKnn: use K-Nearest Neighbors models

  - --knnK=*value*: the value of K to use in the KNN models

  - --knnTanimoto: use the Tanimoto metric in KNN models

  - --knnEuclid: use a Euclidean metric in KNN models

  *-*-*-*-*-*-*- Naive Bayes Classifier Options -*-*-*-*-*-*-*-*
  - --doNaiveBayes : use Naive Bayes classifiers

  - --mEstimateVal : the value to be used in the m-estimate formula
      If this is greater than 0.0, we use it to compute the conditional
      probabilities by the m-estimate

  *-*-*-*-*-*-*-*- SVM-Related Options -*-*-*-*-*-*-*-*

  **** NOTE: THESE ARE DISABLED ****

##   - --doSVM: use Support-vector machines

##   - --svmKernel=*kernel*: choose the type of kernel to be used for
##     the SVMs.  Options are:
##     The default is:

##   - --svmType=*type*: choose the type of support-vector machine
##     to be used.  Options are:
##     The default is:

##   - --svmGamma=*gamma*: provide the gamma value for the SVMs.  If this
##     is not provided, a grid search will be carried out to determine an
##     optimal *gamma* value for each SVM.

##   - --svmCost=*cost*: provide the cost value for the SVMs.  If this is
##     not provided, a grid search will be carried out to determine an
##     optimal *cost* value for each SVM.

##   - --svmWeights=*weights*: provide the weight values for the
##     activities.  If provided this should be a sequence of (label,
##     weight) 2-tuples *nActs* long.  If not provided, a weight of 1
##     will be used for each activity.

##   - --svmEps=*epsilon*: provide the epsilon value used to determine
##     when the SVM has converged.  Defaults to 0.001

##   - --svmDegree=*degree*: provide the degree of the kernel (when
##     sensible) Defaults to 3

##   - --svmCoeff=*coeff*: provide the coefficient for the kernel (when
##     sensible) Defaults to 0

##   - --svmNu=*nu*: provide the nu value for the kernel (when sensible)
##     Defaults to 0.5

##   - --svmDataType=*float*: if the data contains only 1s and 0s, specify
##     this by using binary. Defaults to float

##   - --svmCache=*cache*: provide the size of the memory cache (in MB)
##     to be used while building the SVM.  Defaults to 40

**Notes**

  - *Data filtering*: When there is a large disparity between the
    numbers of points with various activity levels present in the
    training set it is sometimes desirable to train on a more
    homogeneous data set.  This can be accomplished using filtering.
    The filtering process works by selecting a particular target
    fraction and target value.  For example, in a case where 95% of
    the original training set has activity 0 and only 5% activity 1, we
    could filter (by randomly removing points with activity 0) so that
    30% of the data set used to build the composite has activity 1.


"""
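# Illustrative sketch (not part of the original module): the arithmetic behind
# the "Data filtering" note in the docstring above.  The helper names
# n_others_to_keep and filter_examples are hypothetical, and this is not
# claimed to be how DataUtils.FilterData is implemented; it only reproduces
# the 95%/5% -> 30% example under the assumption that every target-valued
# point is kept and the remaining points are randomly subsampled.

```python
import random


def n_others_to_keep(n_target, n_other, target_frac):
    """How many non-target points to keep so that target points make up
    roughly `target_frac` of the filtered set (hypothetical helper)."""
    if not 0.0 < target_frac <= 1.0:
        raise ValueError('target_frac must be in (0, 1]')
    wanted = int(round(n_target * (1.0 - target_frac) / target_frac))
    return min(n_other, wanted)


def filter_examples(examples, target_val, target_frac, rng):
    """Keep all points whose activity (last field) equals `target_val` and a
    random subset of the others."""
    targets = [ex for ex in examples if ex[-1] == target_val]
    others = [ex for ex in examples if ex[-1] != target_val]
    n_keep = n_others_to_keep(len(targets), len(others), target_frac)
    return targets + rng.sample(others, n_keep)


# 1000 points: 95% with activity 0, 5% with activity 1
data = [[i, 0] for i in range(950)] + [[i, 1] for i in range(50)]
kept = filter_examples(data, 1, 0.30, random.Random(42))
n_active = sum(1 for ex in kept if ex[-1] == 1)
frac_active = float(n_active) / len(kept)
```

# In this sketch the removed points are simply dropped; RunOnData, by
# contrast, appends the points rejected by the filter to the test examples.
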
import RDConfig
from utils import listutils
from ML.Composite import Composite,BayesComposite
#from ML.SVM import SVMClassificationModel as SVM
from Numeric import *
from ML.Data import DataUtils,SplitData
from ML import ScreenComposite
from Dbase import DbModule
from Dbase.DbConnection import DbConnect
from ML import CompositeRun
import sys,cPickle,time
import DataStructs

_runDetails = CompositeRun.CompositeRun()

__VERSION_STRING="3.2.3"

_verbose = 1
def message(msg):
  """ emits messages to _sys.stdout_
    override this in modules which import this one to redirect output

    **Arguments**

      - msg: the string to be displayed

  """
  if _verbose: sys.stdout.write('%s\n'%(msg))

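# Illustrative sketch (not part of the original module): one way to "override
# this in modules which import this one".  A small stand-in module is built
# from a source string so the example runs without the ML package; the
# stand-in, its report() function and quiet_message() are all hypothetical.
# The point is that functions inside a module look up message() in the module
# namespace at call time, so rebinding the attribute redirects their output.

```python
import types

# Stand-in for BuildComposite: message() as above plus one caller.
_SRC = (
    "import sys\n"
    "_verbose = 1\n"
    "def message(msg):\n"
    "  if _verbose: sys.stdout.write('%s\\n' % (msg))\n"
    "def report(n):\n"
    "  message('Training with %d examples' % (n))\n"
)
BuildComposite = types.ModuleType('BuildComposite')
exec(_SRC, BuildComposite.__dict__)

captured = []


def quiet_message(msg):
    """Collect messages in a list instead of writing them to stdout."""
    captured.append(msg)


# Rebind the module-level name; report() now routes through quiet_message.
BuildComposite.message = quiet_message
BuildComposite.report(12)
```
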
def testall(composite,examples,badExamples=None):
  """ screens a number of examples past a composite

    **Arguments**

      - composite: a composite model

      - examples: a list of examples (with results) to be screened

      - badExamples: a list to which misclassified examples are appended

    **Returns**

      a list of 2-tuples containing:

        1) a vote

        2) a confidence

      these are the votes and confidence levels for **misclassified** examples

  """
  # a mutable default ([]) would be shared between calls, so create the
  # list here when the caller does not supply one
  if badExamples is None: badExamples = []
  wrong = []
  for example in examples:
    if composite.GetActivityQuantBounds():
      answer = composite.QuantizeActivity(example)[-1]
    else:
      answer = example[-1]
    res,conf = composite.ClassifyExample(example)
    if res != answer:
      wrong.append((res,conf))
      badExamples.append(example)

  return wrong

def GetCommandLine(details):
  """ #DOC

  """
  args = ['BuildComposite']
  args.append('-n %d'%(details.nModels))
  if details.filterFrac != 0.0: args.append('-F %.3f -v %d'%(details.filterFrac,details.filterVal))
  if details.modelFilterFrac != 0.0: args.append('--modelFiltFrac=%.3f --modelFiltVal=%d'%(details.modelFilterFrac,
                                                                                          details.modelFilterVal))
  if details.splitRun: args.append('-s -f %.3f'%(details.splitFrac))
  if details.shuffleActivities: args.append('-S')
  if details.randomActivities: args.append('-r')
  if details.threshold > 0.0: args.append('-t %.3f'%(details.threshold))
  if details.activityBounds: args.append('-Q "%s"'%(details.activityBoundsVals))
  if details.dbName: args.append('-d %s'%(details.dbName))
  if details.detailedRes: args.append('-D')
  if hasattr(details,'noScreen') and details.noScreen: args.append('--noScreen')
  if details.persistTblName and details.dbName:
    args.append('-p %s'%(details.persistTblName))
  if details.note:
    args.append('-N %s'%(details.note))
  if details.useTrees:
    if details.limitDepth>0: args.append('-L %d'%(details.limitDepth))
    if details.lessGreedy: args.append('-g')
    if details.qBounds:
      shortBounds = listutils.CompactListRepr(details.qBounds)
      if details.qBounds: args.append('-q "%s"'%(shortBounds))
    else:
      if details.qBounds: args.append('-q "%s"'%(details.qBoundCount))

    if details.pruneIt: args.append('--prune')
    if details.startAt: args.append('-G %d'%details.startAt)
    if details.recycleVars: args.append('--recycle')
    if details.randomDescriptors: args.append('--randomDescriptors=%d'%details.randomDescriptors)
  if details.useSigTrees:
    args.append('--doSigTree')
    if details.limitDepth>0: args.append('-L %d'%(details.limitDepth))
    if details.randomDescriptors:
      args.append('--randomDescriptors=%d'%details.randomDescriptors)

  if details.useKNN:
    args.append('--doKnn --knnK %d'%(details.knnNeighs))
    if details.knnDistFunc=='Tanimoto':
      args.append('--knnTanimoto')
    else:
      args.append('--knnEuclid')

  if details.useNaiveBayes:
    args.append('--doNaiveBayes')
    if details.mEstimateVal >= 0.0 :
      args.append('--mEstimateVal=%.3f'%details.mEstimateVal)

##   if details.useSVM:
##     args.append('--doSVM')
##     if details.svmKernel:
##       for k in SVM.kernels.keys():
##         if SVM.kernels[k]==details.svmKernel:
##           args.append('--svmKernel=%s'%k)
##           break
##     if details.svmType:
##       for k in SVM.machineTypes.keys():
##         if SVM.machineTypes[k]==details.svmType:
##           args.append('--svmType=%s'%k)
##           break
##     if details.svmGamma:
##       args.append('--svmGamma=%f'%details.svmGamma)
##     if details.svmCost:
##       args.append('--svmCost=%f'%details.svmCost)
##     if details.svmWeights:
##       args.append("--svmWeights='%s'"%str(details.svmWeights))
##     if details.svmDegree:
##       args.append('--svmDegree=%d'%details.svmDegree)
##     if details.svmCoeff:
##       args.append('--svmCoeff=%d'%details.svmCoeff)
##     if details.svmEps:
##       args.append('--svmEps=%f'%details.svmEps)
##     if details.svmNu:
##       args.append('--svmNu=%f'%details.svmNu)
##     if details.svmCache:
##       args.append('--svmCache=%d'%details.svmCache)
##     if details.svmDataType:
##       args.append('--svmDataType=%s'%details.svmDataType)
##     if not details.svmShrink:
##       args.append('--svmShrink')

  if details.replacementSelection: args.append('--replacementSelection')


  # this should always be last:
  if details.tableName: args.append(details.tableName)

  return ' '.join(args)

def RunOnData(details,data,progressCallback=None,saveIt=1,setDescNames=0):
  nExamples = data.GetNPts()
  if details.lockRandom:
    seed = details.randomSeed
  else:
    import random
    seed = (random.randint(0,1e6),random.randint(0,1e6))
  DataUtils.InitRandomNumbers(seed)
  testExamples = []
  if details.shuffleActivities == 1:
    DataUtils.RandomizeActivities(data,shuffle=1,runDetails=details)
  elif details.randomActivities == 1:
    DataUtils.RandomizeActivities(data,shuffle=0,runDetails=details)

  namedExamples = data.GetNamedData()
  if details.splitRun == 1:
    trainIdx,testIdx = SplitData.SplitIndices(len(namedExamples),details.splitFrac,
                                              silent=not _verbose)

    trainExamples = [namedExamples[x] for x in trainIdx]
    testExamples = [namedExamples[x] for x in testIdx]
  else:
    testExamples = []
    testIdx = []
    trainIdx = range(len(namedExamples))
    trainExamples = namedExamples

  if details.filterFrac != 0.0:
    # if we're doing quantization on the fly, we need to handle that here:
    if hasattr(details,'activityBounds') and details.activityBounds:
      tExamples = []
      bounds = details.activityBounds
      for pt in trainExamples:
        pt = pt[:]
        act = pt[-1]
        placed=0
        bound=0
        while not placed and bound < len(bounds):
          if act < bounds[bound]:
            pt[-1] = bound
            placed = 1
          else:
            bound += 1
        if not placed:
          pt[-1] = bound
        tExamples.append(pt)
    else:
      bounds = None
      tExamples = trainExamples
    trainIdx,temp = DataUtils.FilterData(tExamples,details.filterVal,
                                         details.filterFrac,-1,
                                         indicesOnly=1)
    tmp = [trainExamples[x] for x in trainIdx]
    testExamples += [trainExamples[x] for x in temp]
    trainExamples = tmp

    counts = DataUtils.CountResults(trainExamples,bounds=bounds)
    ks = counts.keys()
    ks.sort()
    message('Result Counts in training set:')
    for k in ks:
      message(str((k, counts[k])))
    counts = DataUtils.CountResults(testExamples,bounds=bounds)
    ks = counts.keys()
    ks.sort()
    message('Result Counts in test set:')
    for k in ks:
      message(str((k, counts[k])))
  nExamples = len(trainExamples)
  message('Training with %d examples'%(nExamples))

  nVars = data.GetNVars()
  attrs = range(1,nVars+1)
  nPossibleVals = data.GetNPossibleVals()
  for i in range(1,len(nPossibleVals)):
    if nPossibleVals[i-1] == -1:
      attrs.remove(i)

  if details.pickleDataFileName != '':
    pickleDataFile = open(details.pickleDataFileName,'wb+')
    cPickle.dump(trainExamples,pickleDataFile)
    cPickle.dump(testExamples,pickleDataFile)
    pickleDataFile.close()

  if details.bayesModel:
    composite = BayesComposite.BayesComposite()
  else:
    composite = Composite.Composite()

  composite._randomSeed = seed
  composite._splitFrac = details.splitFrac
  composite._shuffleActivities = details.shuffleActivities
  composite._randomizeActivities = details.randomActivities

  if hasattr(details,'filterFrac'):
    composite._filterFrac = details.filterFrac
  if hasattr(details,'filterVal'):
    composite._filterVal = details.filterVal

  composite.SetModelFilterData(details.modelFilterFrac, details.modelFilterVal)

  composite.SetActivityQuantBounds(details.activityBounds)
  nPossibleVals = data.GetNPossibleVals()
  if details.activityBounds:
    nPossibleVals[-1] = len(details.activityBounds)+1


  if setDescNames:
    composite.SetInputOrder(data.GetVarNames())
    composite.SetDescriptorNames(details._descNames)
  else:
    composite.SetDescriptorNames(data.GetVarNames())
  composite.SetActivityQuantBounds(details.activityBounds)
  if details.nModels==1:
    details.internalHoldoutFrac=0.0
  if details.useTrees:
    from ML.DecTree import CrossValidate,PruneTree
    if details.qBounds != []:
      from ML.DecTree import BuildQuantTree
      builder = BuildQuantTree.QuantTreeBoot
    else:
      from ML.DecTree import ID3
      builder = ID3.ID3Boot
    driver = CrossValidate.CrossValidationDriver
    pruner = PruneTree.PruneTree

    composite.SetQuantBounds(details.qBounds)
    nPossibleVals = data.GetNPossibleVals()
    if details.activityBounds:
      nPossibleVals[-1] = len(details.activityBounds)+1
    composite.Grow(trainExamples,attrs,nPossibleVals=[0]+nPossibleVals,
                   buildDriver=driver,
                   pruner=pruner,
                   nTries=details.nModels,pruneIt=details.pruneIt,
                   lessGreedy=details.lessGreedy,needsQuantization=0,
                   treeBuilder=builder,nQuantBounds=details.qBounds,
                   startAt=details.startAt,
                   maxDepth=details.limitDepth,
                   progressCallback=progressCallback,
                   holdOutFrac=details.internalHoldoutFrac,
                   replacementSelection=details.replacementSelection,
                   recycleVars=details.recycleVars,
                   randomDescriptors=details.randomDescriptors,
                   silent=not _verbose)

  elif details.useSigTrees:
    from ML.DecTree import CrossValidate
    from ML.DecTree import BuildSigTree
    builder = BuildSigTree.SigTreeBuilder
    driver = CrossValidate.CrossValidationDriver
    nPossibleVals = data.GetNPossibleVals()
    if details.activityBounds:
      nPossibleVals[-1] = len(details.activityBounds)+1
    if hasattr(details,'sigTreeBiasList'):
      biasList = details.sigTreeBiasList
    else:
      biasList=None
    if hasattr(details,'useCMIM'):
      useCMIM=details.useCMIM
    else:
      useCMIM=0
    if hasattr(details,'allowCollections'):
      allowCollections = details.allowCollections
    else:
      allowCollections=False
    composite.Grow(trainExamples,attrs,nPossibleVals=[0]+nPossibleVals,
                   buildDriver=driver,
                   nTries=details.nModels,
                   needsQuantization=0,
                   treeBuilder=builder,
                   maxDepth=details.limitDepth,
                   progressCallback=progressCallback,
                   holdOutFrac=details.internalHoldoutFrac,
                   replacementSelection=details.replacementSelection,
                   recycleVars=details.recycleVars,
                   randomDescriptors=details.randomDescriptors,
                   biasList=biasList,
                   useCMIM=useCMIM,
                   allowCollection=allowCollections,
                   silent=not _verbose)

  elif details.useKNN:
    from ML.KNN import CrossValidate
    from ML.KNN import DistFunctions

    driver = CrossValidate.CrossValidationDriver
    dfunc = ''
    if (details.knnDistFunc == "Euclidean") :
      dfunc = DistFunctions.EuclideanDist
    elif (details.knnDistFunc == "Tanimoto"):
      dfunc = DistFunctions.TanimotoDist
    else:
      assert 0,"Bad KNN distance metric value"


    composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
                   buildDriver=driver, nTries=details.nModels,
                   needsQuantization=0,
                   numNeigh=details.knnNeighs,
                   holdOutFrac=details.internalHoldoutFrac,
                   distFunc=dfunc)

  elif details.useNaiveBayes or details.useSigBayes:
    from ML.NaiveBayes import CrossValidate
    driver = CrossValidate.CrossValidationDriver
    if not (hasattr(details,'useSigBayes') and details.useSigBayes):
      composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
                     buildDriver=driver, nTries=details.nModels,
                     needsQuantization=0, nQuantBounds=details.qBounds,
                     holdOutFrac=details.internalHoldoutFrac,
                     replacementSelection=details.replacementSelection,
                     mEstimateVal=details.mEstimateVal,
                     silent=not _verbose)
    else:
      if hasattr(details,'useCMIM'):
        useCMIM=details.useCMIM
      else:
        useCMIM=0

      composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
                     buildDriver=driver, nTries=details.nModels,
                     needsQuantization=0, nQuantBounds=details.qBounds,
                     mEstimateVal=details.mEstimateVal,
                     useSigs=True,useCMIM=useCMIM,
                     holdOutFrac=details.internalHoldoutFrac,
                     replacementSelection=details.replacementSelection,
                     silent=not _verbose)



##   elif details.useSVM:
##     from ML.SVM import CrossValidate
##     driver = CrossValidate.CrossValidationDriver
##     composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
##                    buildDriver=driver, nTries=details.nModels,
##                    needsQuantization=0,
##                    cost=details.svmCost,gamma=details.svmGamma,
##                    weights=details.svmWeights,degree=details.svmDegree,
##                    type=details.svmType,kernelType=details.svmKernel,
##                    coef0=details.svmCoeff,eps=details.svmEps,nu=details.svmNu,
##                    cache_size=details.svmCache,shrinking=details.svmShrink,
##                    dataType=details.svmDataType,
##                    holdOutFrac=details.internalHoldoutFrac,
##                    replacementSelection=details.replacementSelection,
##                    silent=not _verbose)

  else:
    from ML.Neural import CrossValidate
    driver = CrossValidate.CrossValidationDriver
    composite.Grow(trainExamples,attrs,[0]+nPossibleVals,nTries=details.nModels,
                   buildDriver=driver,needsQuantization=0)

  composite.AverageErrors()
  composite.SortModels()
  modelList,counts,avgErrs = composite.GetAllData()
  counts = array(counts)
  avgErrs = array(avgErrs)
  composite._varNames = data.GetVarNames()

  for i in xrange(len(modelList)):
    modelList[i].NameModel(composite._varNames)

  # do final statistics
  weightedErrs = counts*avgErrs
  averageErr = sum(weightedErrs)/sum(counts)
  devs = (avgErrs - averageErr)
  devs = devs * counts
  devs = sqrt(devs*devs)
  avgDev = sum(devs)/sum(counts)
  message('# Overall Average Error: %%% 5.2f, Average Deviation: %%% 6.2f'%(100.*averageErr,100.*avgDev))

  if details.bayesModel:
    composite.Train(trainExamples,verbose=0)

  # blow out the saved examples and then save the composite:
  composite.ClearModelExamples()
  if saveIt:
    composite.Pickle(details.outName)
  details.model = DbModule.binaryHolder(cPickle.dumps(composite))

  badExamples = []
  if not details.detailedRes and (not hasattr(details,'noScreen') or not details.noScreen):
    if details.splitRun:
      message('Testing all hold-out examples')
      wrong = testall(composite,testExamples,badExamples)
      message('%d examples (%% %5.2f) were misclassified'%(len(wrong),
                                                          100.*float(len(wrong))/float(len(testExamples))))
      _runDetails.holdout_error = float(len(wrong))/len(testExamples)
    else:
      message('Testing all examples')
      wrong = testall(composite,namedExamples,badExamples)
      message('%d examples (%% %5.2f) were misclassified'%(len(wrong),
                                                          100.*float(len(wrong))/float(len(namedExamples))))
      _runDetails.overall_error = float(len(wrong))/len(namedExamples)

  if details.detailedRes:
    message('\nEntire data set:')
    resTup = ScreenComposite.ShowVoteResults(range(data.GetNPts()),data,composite,
                                             nPossibleVals[-1],details.threshold)
    nGood,nBad,nSkip,avgGood,avgBad,avgSkip,voteTab = resTup
    nPts = len(namedExamples)
    nClass = nGood+nBad
    _runDetails.overall_error = float(nBad) / nClass
    _runDetails.overall_correct_conf = avgGood
    _runDetails.overall_incorrect_conf = avgBad
    _runDetails.overall_result_matrix = repr(voteTab)
    nRej = nClass-nPts
    if nRej > 0:
      _runDetails.overall_fraction_dropped = float(nRej)/nPts

    if details.splitRun:
      message('\nHold-out data:')
      resTup = ScreenComposite.ShowVoteResults(range(len(testExamples)),testExamples,
                                               composite,
                                               nPossibleVals[-1],details.threshold)
      nGood,nBad,nSkip,avgGood,avgBad,avgSkip,voteTab = resTup
      nPts = len(testExamples)
      nClass = nGood+nBad
      _runDetails.holdout_error = float(nBad) / nClass
      _runDetails.holdout_correct_conf = avgGood
      _runDetails.holdout_incorrect_conf = avgBad
      _runDetails.holdout_result_matrix = repr(voteTab)
      nRej = nClass-nPts
      if nRej > 0:
        _runDetails.holdout_fraction_dropped = float(nRej)/nPts


  if details.persistTblName and details.dbName:
    message('Updating results table %s:%s'%(details.dbName,details.persistTblName))
    details.Store(db=details.dbName,table=details.persistTblName)

  if details.badName != '':
    badFile = open(details.badName,'w+')
    for i in xrange(len(badExamples)):
      ex = badExamples[i]
      vote = wrong[i]
      outStr = '%s\t%s\n'%(ex,vote)
      badFile.write(outStr)
    badFile.close()

  composite.ClearModelExamples()
  return composite

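# Illustrative sketch (not part of the original module): the on-the-fly
# activity quantization that RunOnData performs before filtering.  The while
# loop there assigns each activity the index of the first bound it is
# strictly below, or len(bounds) if there is none.  Assuming the bounds are
# sorted in ascending order (an assumption of this sketch), that is the same
# index bisect_right returns.  Both function names below are hypothetical.

```python
from bisect import bisect_right


def quantize_activity(act, bounds):
    """Bin index for `act`, given ascending quantization bounds."""
    return bisect_right(bounds, act)


def quantize_activity_loop(act, bounds):
    """Same result, written to mirror the loop in RunOnData."""
    for i, bound in enumerate(bounds):
        if act < bound:
            return i
    return len(bounds)


bounds = [0.5, 2.0]
activities = (0.1, 0.5, 1.9, 2.0, 7.3)
bins = [quantize_activity(a, bounds) for a in activities]
bins_loop = [quantize_activity_loop(a, bounds) for a in activities]
```

# Two bounds give three bins, which is why RunOnData sets the number of
# possible activity values to len(details.activityBounds)+1.  A value equal
# to a bound falls into the higher bin.
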
def RunIt(details,progressCallback=None,saveIt=1,setDescNames=0):
  """ does the actual work of building a composite model

    **Arguments**

      - details:  a _CompositeRun.CompositeRun_ object containing details
        (options, parameters, etc.) about the run

      - progressCallback: (optional) a function which is called with a single
        argument (the number of models built so far) after each model is built.

      - saveIt: (optional) if this is nonzero, the resulting model will be pickled
        and dumped to the filename specified in _details.outName_

      - setDescNames: (optional) if nonzero, the composite's _SetInputOrder()_ method
        will be called using the results of the data set's _GetVarNames()_ method;
        it is assumed that the details object has a _descNames attribute which
        is passed to the composite's _SetDescriptorNames()_ method.  Otherwise
        (the default), _SetDescriptorNames()_ gets the results of _GetVarNames()_.

    **Returns**

      the composite model constructed


  """
  details.rundate = time.asctime()

  fName = details.tableName.strip()
  if details.outName == '':
    details.outName = fName + '.pkl'
  if not details.dbName:
    if details.qBounds != []:
      data = DataUtils.TextFileToData(fName)
    else:
      data = DataUtils.BuildQuantDataSet(fName)
  elif details.useSigTrees or details.useSigBayes:
    details.tableName = fName
    data = details.GetDataSet(pickleCol=0,pickleClass=DataStructs.ExplicitBitVect)
  elif details.qBounds != [] or not details.useTrees:
    details.tableName = fName
    data = details.GetDataSet()
  else:
    data = DataUtils.DBToQuantData(details.dbName,fName,quantName=details.qTableName,
                                   user=details.dbUser,password=details.dbPassword)

  composite = RunOnData(details,data,progressCallback=progressCallback,
                        saveIt=saveIt,setDescNames=setDescNames)
  return composite

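# Illustrative sketch (not part of the original module): the "final
# statistics" that RunOnData reports once the composite has been grown.
# The original uses Numeric arrays; this version uses plain lists so it runs
# without extra packages.  composite_error_stats is a hypothetical name.
# Note that sqrt(devs*devs) in the original is simply an absolute value.

```python
def composite_error_stats(counts, avg_errs):
    """Count-weighted average error and count-weighted mean absolute
    deviation of the per-model errors from that average."""
    total = float(sum(counts))
    average_err = sum(c * e for c, e in zip(counts, avg_errs)) / total
    avg_dev = sum(abs((e - average_err) * c)
                  for c, e in zip(counts, avg_errs)) / total
    return average_err, avg_dev


# one model that appears 3 times with error 0.10, one that appears once
# with error 0.30
average_err, avg_dev = composite_error_stats([3, 1], [0.10, 0.30])
```

# For this input the weighted average error is 0.15 and the average
# deviation 0.075; RunOnData prints both values multiplied by 100.
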
def ShowVersion(includeArgs=0):
  """ prints the version number

  """
  print 'This is BuildComposite.py version %s'%(__VERSION_STRING)
  if includeArgs:
    import sys
    print 'command line was:'
    print ' '.join(sys.argv)

def Usage():
  """ provides a list of arguments for when this is used from the command line

  """
  import sys
  print __doc__
  sys.exit(-1)

def SetDefaults(runDetails=None):
  """  initializes a details object with default values

      **Arguments**

        - runDetails:  (optional) a _CompositeRun.CompositeRun_ object.
          If this is not provided, the global _runDetails will be used.

      **Returns**

        the initialized _CompositeRun_ object.


  """
  if runDetails is None: runDetails = _runDetails
  return CompositeRun.SetDefaults(runDetails)

def ParseArgs(runDetails):
  """ parses command line arguments and updates _runDetails_

      **Arguments**

        - runDetails:  a _CompositeRun.CompositeRun_ object.

  """
  import getopt
  args,extra = getopt.getopt(sys.argv[1:],'P:o:n:p:b:sf:F:v:hlgd:rSTt:BQ:q:DVG:N:L:',
                             ['nRuns=','prune','profile',
                              'seed=','noScreen',

                              'modelFiltFrac=', 'modelFiltVal=',

                              'recycle','randomDescriptors=',

                              'doKnn','knnK=','knnTanimoto','knnEuclid',

                              'doSigTree','doCMIM=','allowCollections',

                              'doNaiveBayes', 'mEstimateVal=',
                              'doSigBayes',

##                              'doSVM','svmKernel=','svmType=','svmGamma=',
##                              'svmCost=','svmWeights=','svmDegree=',
##                              'svmCoeff=','svmEps=','svmNu=','svmCache=',
##                              'svmShrink','svmDataType=',

                              'replacementSelection',

                              ])
  runDetails.profileIt=0
  for arg,val in args:
    if arg == '-n':
      runDetails.nModels = int(val)
    elif arg == '-N':
      runDetails.note=val
    elif arg == '-o':
      runDetails.outName = val
    elif arg == '-Q':
      qBounds = eval(val)
      assert type(qBounds) in [type([]),type(())],'bad argument type for -Q, specify a list as a string'
      runDetails.activityBounds=qBounds
      runDetails.activityBoundsVals=val
    elif arg == '-p':
      runDetails.persistTblName=val
    elif arg == '-P':
      runDetails.pickleDataFileName= val
    elif arg == '-r':
      runDetails.randomActivities = 1
    elif arg == '-S':
      runDetails.shuffleActivities = 1
    elif arg == '-b':
      runDetails.badName = val
    elif arg == '-B':
      # the listing set "bayesModels" here, but RunOnData tests
      # details.bayesModel, so -B would otherwise have no effect
      runDetails.bayesModel=1
    elif arg == '-s':
      runDetails.splitRun = 1
    elif arg == '-f':
      runDetails.splitFrac=float(val)
    elif arg == '-F':
      runDetails.filterFrac=float(val)
    elif arg == '-v':
      runDetails.filterVal=float(val)
    elif arg == '-l':
      runDetails.lockRandom = 1
    elif arg == '-g':
      runDetails.lessGreedy=1
    elif arg == '-G':
      runDetails.startAt = int(val)
    elif arg == '-d':
      runDetails.dbName=val
    elif arg == '-T':
      runDetails.useTrees = 0
    elif arg == '-t':
      runDetails.threshold=float(val)
    elif arg == '-D':
      runDetails.detailedRes = 1
    elif arg == '-L':
      runDetails.limitDepth = int(val)
    elif arg == '-q':
      qBounds = eval(val)
      assert type(qBounds) in [type([]),type(())],'bad argument type for -q, specify a list as a string'
      runDetails.qBoundCount=val
      runDetails.qBounds = qBounds
    elif arg == '-V':
      ShowVersion()
      sys.exit(0)
    elif arg == '--nRuns':
      runDetails.nRuns = int(val)
    elif arg == '--modelFiltFrac':
      runDetails.modelFilterFrac=float(val)
    elif arg == '--modelFiltVal':
      runDetails.modelFilterVal=float(val)
    elif arg == '--prune':
      runDetails.pruneIt=1
    elif arg == '--profile':
      runDetails.profileIt=1

    elif arg == '--recycle':
      runDetails.recycleVars=1
    elif arg == '--randomDescriptors':
      runDetails.randomDescriptors=int(val)

    elif arg == '--doKnn':
      runDetails.useKNN=1
      runDetails.useTrees=0
      ## runDetails.useSVM=0
      runDetails.useNaiveBayes=0
    elif arg == '--knnK':
      runDetails.knnNeighs = int(val)
    elif arg == '--knnTanimoto':
      runDetails.knnDistFunc="Tanimoto"
    elif arg == '--knnEuclid':
      runDetails.knnDistFunc="Euclidean"

    elif arg == '--doSigTree':
      ## runDetails.useSVM=0
      runDetails.useKNN=0
      runDetails.useTrees=0
      runDetails.useNaiveBayes=0
      runDetails.useSigTrees=1
    elif arg == '--doCMIM':
      runDetails.useCMIM=int(val)
    elif arg == '--allowCollections':
      runDetails.allowCollections=True

    elif arg == '--doNaiveBayes':
      runDetails.useNaiveBayes=1
      ## runDetails.useSVM=0
      runDetails.useKNN=0
      runDetails.useTrees=0
      runDetails.useSigBayes=0
    elif arg == '--doSigBayes':
      runDetails.useSigBayes=1
      runDetails.useNaiveBayes=0
      ## runDetails.useSVM=0
      runDetails.useKNN=0
      runDetails.useTrees=0
    elif arg == '--mEstimateVal':
      runDetails.mEstimateVal=float(val)

##     elif arg == '--doSVM':
##       runDetails.useSVM=1
##       runDetails.useKNN=0
##       runDetails.useTrees=0
##       runDetails.useNaiveBayes=0
##     elif arg == '--svmKernel':
##       if val not in SVM.kernels.keys():
##         message('kernel %s not in list of available kernels:\n%s\n'%(val,SVM.kernels.keys()))
##         sys.exit(-1)
##       else:
##         runDetails.svmKernel=SVM.kernels[val]
##     elif arg == '--svmType':
##       if val not in SVM.machineTypes.keys():
##         message('type %s not in list of available machines:\n%s\n'%(val,SVM.machineTypes.keys()))
##         sys.exit(-1)
##       else:
##         runDetails.svmType=SVM.machineTypes[val]
##     elif arg == '--svmGamma':
##       runDetails.svmGamma = float(val)
##     elif arg == '--svmCost':
##       runDetails.svmCost = float(val)
##     elif arg == '--svmWeights':
##       # FIX: this is dangerous
##       runDetails.svmWeights = eval(val)
##     elif arg == '--svmDegree':
##       runDetails.svmDegree = int(val)
##     elif arg == '--svmCoeff':
##       runDetails.svmCoeff = float(val)
##     elif arg == '--svmEps':
##       runDetails.svmEps = float(val)
##     elif arg == '--svmNu':
##       runDetails.svmNu = float(val)
##     elif arg == '--svmCache':
##       runDetails.svmCache = int(val)
##     elif arg == '--svmShrink':
##       runDetails.svmShrink = 0
##     elif arg == '--svmDataType':
##       runDetails.svmDataType=val

    elif arg== '--seed':
      # FIX: dangerous
      runDetails.randomSeed = eval(val)

    elif arg== '--noScreen':
      runDetails.noScreen=1

    elif arg== '--replacementSelection':
      runDetails.replacementSelection = 1

    elif arg == '-h':
      Usage()

    else:
      Usage()
  runDetails.tableName=extra[0]
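
# Illustrative sketch (not part of the original module): ParseArgs passes the
# -Q, -q and --seed values to eval(), which the "FIX: dangerous" comments
# already flag.  One possible alternative, assuming the values are always
# plain list or tuple literals, is ast.literal_eval from the standard
# library, which accepts literals only.  parse_list_arg is a hypothetical
# helper name.

```python
import ast


def parse_list_arg(text, flag):
    """Parse a command line value such as "[0,2,2,2,2,0]" without eval()."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        raise ValueError('bad argument for %s, specify a list as a string' % flag)
    if not isinstance(value, (list, tuple)):
        raise ValueError('bad argument type for %s, specify a list as a string' % flag)
    return value


q_bounds = parse_list_arg('[0,2,2,2,2,0]', '-q')
seed = parse_list_arg('(23, 42)', '--seed')

# an expression that eval() would happily execute is rejected here
try:
    parse_list_arg('__import__("os").getcwd()', '-Q')
    rejected = False
except ValueError:
    rejected = True
```
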

if __name__ == '__main__':
  if len(sys.argv) < 2:
    Usage()

  _runDetails.cmd = ' '.join(sys.argv)
  SetDefaults(_runDetails)
  ParseArgs(_runDetails)


  ShowVersion(includeArgs=1)

  if _runDetails.nRuns > 1:
    for i in range(_runDetails.nRuns):
      sys.stderr.write('---------------------------------\n\tDoing %d of %d\n---------------------------------\n'%(i+1,_runDetails.nRuns))
      RunIt(_runDetails)
  else:
    if _runDetails.profileIt:
      import hotshot,hotshot.stats
      prof=hotshot.Profile('prof.dat')
      prof.runcall(RunIt,_runDetails)
      stats = hotshot.stats.load('prof.dat')
      stats.strip_dirs()
      stats.sort_stats('time','calls')
      stats.print_stats(30)
    else:
      RunIt(_runDetails)