| Trees | Indices | Help |
|
|---|
|
|
1 # This work was funded by Roche and generously donated to the free
2 # and open source cheminformatics community.
3 import warnings
4 warnings.simplefilter('default', DeprecationWarning)
5 warnings.warn("The rdkit.Chem.MCS module is deprecated; please use rdkit.Chem.rdFMCS instead.",
6 DeprecationWarning,stacklevel=2)
7 ## Copyright (c) 2012 Andrew Dalke Scientific AB
8 ## Andrew Dalke <dalke@dalkescientific.com>
9 ##
10 ## All rights reserved.
11 ##
12 ## Redistribution and use in source and binary forms, with or without
13 ## modification, are permitted provided that the following conditions are
14 ## met:
15 ##
16 ## * Redistributions of source code must retain the above copyright
17 ## notice, this list of conditions and the following disclaimer.
18 ##
19 ## * Redistributions in binary form must reproduce the above copyright
20 ## notice, this list of conditions and the following disclaimer in
21 ## the documentation and/or other materials provided with the
22 ## distribution.
23 ##
24 ## THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
25 ## "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
26 ## LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
27 ## A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
28 ## HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
29 ## SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
30 ## LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
31 ## DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
32 ## THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
33 ## (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
34 ## OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
35 from rdkit.Chem import fmcs
36 from rdkit.Chem.fmcs import Default
37 """MCS - find a Maximum Common Substructure
38
39 This software finds the maximum common substructure of a set of
40 structures and reports it as a SMARTS string.
41
42 The SMARTS string depends on the desired match properties. For
43 example, if ring atoms are only allowed to match ring atoms then an
44 aliphatic ring carbon in the query is converted to the SMARTS "[C;R]",
45 and the double-bond ring bond converted to "=;@" while the respective
46 chain-only version are "[C;!R]" and "=;!@".
47
48 """
49
50 # The simplified algorithm description is:
51 #
52 # best_substructure = None
53 # pick one structure as the query, and other as the targets
54 # for each substructure in the query graph:
55 # convert it to a SMARTS string based on the desired match properties
56 # if the SMARTS pattern exists in all of the targets:
57 # then this is a common substructure
58 # keep track of the maximum such common structure,
59 #
60 # The algorithm will usually take a long time. There are several
61 # ways to speed it up.
62 #
63 # == Bond elimination ==
64 #
65 # As the first step, remove bonds which obviously cannot be part of the
66 # MCS.
67 #
68 # This requires atom and bond type information, which I store as SMARTS
69 # patterns. A bond can only be in the MCS if its canonical bond type is
70 # present in all of the structures. A bond type is string made of the
71 # SMARTS for one atom, the SMARTS for the bond, and the SMARTS for the
72 # other atom. The canonical bond type is the lexographically smaller of
73 # the two possible bond types for a bond.
74 #
75 # The atom and bond SMARTS depend on the type comparison used.
76 #
77 # The "ring-matches-ring-only" option adds an "@" or "!@" to the bond
78 # SMARTS, so that the canonical bondtype for "C-C" becomes [#6]-@[#6] or
79 # [#6]-!@[#6] if the bond is in a ring or not in a ring, and if atoms
80 # are compared by element and bonds are compared by bondtype. (This
81 # option does not add "R" or "!R" to the atom SMARTS because there
82 # should be a single bond in the MCS of c1ccccc1O and CO.)
83 #
84 # The result of all of this atom and bond typing is a "TypedMolecule"
85 # for each input structure.
86 #
87 # I then find which canonical bondtypes are present in all of the
88 # structures. I convert each TypedMolecule into a
89 # FragmentedTypedMolecule which has the same atom information but only
90 # those bonds whose bondtypes are in all of the structures. This can
91 # break a structure into multiple, disconnected fragments, hence the
92 # name.
93 #
94 # (BTW, I would like to use the fragmented molecules as the targets
95 # because I think the SMARTS match would go faster, but the RDKit SMARTS
96 # matcher doesn't like them. I think it's because the new molecule
97 # hasn't been sanitized and the underlying data structure the ring
98 # information doesn't exist. Instead, I use the input structures for the
99 # SMARTS match.)
100 #
101 # == Use the structure with the smallest largest fragment as the query ==
102 # == and sort the targets by the smallest largest fragment ==
103 #
104 # I pick one of the FragmentedTypedMolecule instances as the source of
105 # substructure enumeration. Which one?
106 #
107 # My heuristic is to use the one with the smallest largest fragment.
108 # Hopefully it produces the least number of subgraphs, but that's also
109 # related to the number of rings, so a large linear graph will product
110 # fewer subgraphs than a small fused ring system. I don't know how to
111 # quantify that.
112 #
113 # For each of the fragmented structures, I find the number of atoms in
114 # the fragment with the most atoms, and I find the number of bonds in
115 # the fragment with the most bonds. These might not be the same
116 # fragment.
117 #
118 # I sort the input structures by the number of bonds in the largest
119 # fragment, with ties broken first on the number of atoms, and then on
120 # the input order. The smallest such structure is the query structure,
121 # and the remaining are the targets.
122 #
123 # == Use a breadth-first search and a priority queue to ==
124 # == enumerate the fragment subgraphs ==
125 #
126 # I extract each of the fragments from the FragmentedTypedMolecule into
127 # a TypedFragment, which I use to make an EnumerationMolecule. An
128 # enumeration molecule contains a pair of directed edges for each atom,
129 # which simplifies the enumeration algorithm.
130 #
131 # The enumeration algorithm is based around growing a seed. A seed
132 # contains the current subgraph atoms and bonds as well as an exclusion
133 # set of bonds which cannot be used for future grown. The initial seed
134 # is the first bond in the fragment, which may potentially grow to use
135 # the entire fragment. The second seed is the second bond in the
136 # fragment, which is excluded from using the first bond in future
137 # growth. The third seed starts from the third bond, which may not use
138 # the first or second bonds during growth, and so on.
139 #
140 #
141 # A seed can grow along bonds connected to an atom in the seed but which
142 # aren't already in the seed and aren't in the set of excluded bonds for
143 # the seed. If there are no such bonds then subgraph enumeration ends
144 # for this fragment. Given N bonds there are 2**N-1 possible ways to
145 # grow, which is just the powerset of the available bonds, excluding the
146 # no-growth case.
147 #
148 # This breadth-first growth takes into account all possibilties of using
149 # the available N bonds so all of those bonds are added to the exclusion
150 # set of the newly expanded subgraphs.
151 #
152 # For performance reasons, the bonds used for growth are separated into
153 # 'internal' bonds, which connect two atoms already in the subgraph, and
154 # 'external' bonds, which lead outwards to an atom not already in the
155 # subgraph.
156 #
157 # Each seed growth can add from 0 to N new atoms and bonds. The goal is
158 # to maximize the subgraph size so the seeds are stored in a priority
159 # queue, ranked so the seed with the most bonds is processed first. This
160 # turns the enumeration into something more like a depth-first search.
161 #
162 #
163 # == Prune seeds which aren't found in all of the structures ==
164 #
165 # At each stage of seed growth I check that the new seed exists in all
166 # of the original structures. (Well, all except the one which I
167 # enumerate over in the first place; by definition that one will match.)
168 # If it doesn't match then there's no reason to include this seed or any
169 # larger seeds made from it.
170 #
171 # The check is easy; I turn the subgraph into its corresponding SMARTS
172 # string and use RDKit's normal SMARTS matcher to test for a match.
173 #
174 # There are three ways to generate a SMARTS string: 1) arbitrary, 2)
175 # canonical, 3) hybrid.
176 #
177 # I have not tested #1. During most of the development I assumed that
178 # SMARTS matches across a few hundred structures would be slow, so that
179 # the best solution is to generate a *canonical* SMARTS and cache the
180 # match information.
181 #
182 # Well, it turns out that my canonical SMARTS match code takes up most
183 # of the MCS run-time. If I drop the canonicalization step then the
184 # code averages about 5-10% faster. This isn't the same as #1 - I still
185 # do the initial atom assignment based on its neighborhood, which is
186 # like a circular fingerprint of size 2 and *usually* gives a consistent
187 # SMARTS pattern, which I can then cache.
188 #
189 # However, there are times when the non-canonical SMARTS code is slower.
190 # Obviously one is if there are a lot of structures, and another if is
191 # there is a lot of symmetry. I'm still working on characterizing this.
192 #
193 #
194 # == Maximize atoms? or bonds? ==
195 #
196 # The above algorithm enumerates all subgraphs of the query and
197 # identifies those subgraphs which are common to all input structures.
198 #
199 # It's trivial then to keep track of the current "best" subgraph, which
200 # can defined as having the subgraph with the most atoms, or the most
201 # bonds. Both of those options are implemented.
202 #
203 # It would not be hard to keep track of all other subgraphs which are
204 # the same size.
205 #
206 # == complete_ring_only implementation ==
207 #
208 # The "complete ring only" option is implemented by first enabling the
209 # "ring-matches-ring-only" option, as otherwise it doesn't make sense.
210 #
211 # Second, in order to be a "best" subgraph, all bonds in the subgraph
212 # which are ring bonds in the original molecule must also be in a ring
213 # in the subgraph. This is handled as a post-processing step.
214 #
215 # (Note: some possible optimizations, like removing ring bonds from
216 # structure fragments which are not in a ring, are not yet implemented.)
217 #
218 #
219 # == Prune seeds which have no potential for growing large enough ==
220 #
221 # Given a seed, its set of edges available for growth, and the set of
222 # excluded bonds, figure out the maximum possible growth for the seed.
223 # If this maximum possible is less than the current best subgraph then
224 # prune.
225 #
226 # This requires a graph search, currently done in Python, which is a bit
227 # expensive. To speed things up, I precompute some edge information.
228 # That is, if I know that a given bond is a chain bond (not in a ring)
229 # then I can calculate the maximum number of atoms and bonds for seed
230 # growth along that bond, in either direction. However, precomputation
231 # doesn't take into account the excluded bonds, so after a while the
232 # predicted value is too high.
233 #
234 # Again, I'm still working on characterizing this, and an implementation
235 # in C++ would have different tradeoffs.
236
237 __all__ = ["FindMCS"]
238
239 ########## Main driver for the MCS code
240
241
243
245 self.numAtoms = obj.num_atoms
246 self.numBonds = obj.num_bonds
247 self.smarts = obj.smarts
248 self.completed = obj.completed
249
252
254 return "MCSResult(numAtoms=%d, numBonds=%d, smarts=%r, completed=%d)" % (
255 self.numAtoms, self.numBonds, self.smarts, self.completed)
256
262
263
264 -def FindMCS(mols,
265 minNumAtoms=2,
266 maximize=Default.maximize,
267 atomCompare=Default.atomCompare,
268 bondCompare=Default.bondCompare,
269 matchValences=Default.matchValences,
270 ringMatchesRingOnly=False,
271 completeRingsOnly=False,
272 timeout=Default.timeout,
273 threshold=None, ):
274 """Find the maximum common substructure of a set of molecules
275
276 In the simplest case, pass in a list of molecules and get back
277 an MCSResult object which describes the MCS:
278
279 >>> from rdkit import Chem
280 >>> mols = [Chem.MolFromSmiles("C#CCP"), Chem.MolFromSmiles("C=CCO")]
281 >>> from rdkit.Chem import MCS
282 >>> MCS.FindMCS(mols)
283 MCSResult(numAtoms=2, numBonds=1, smarts='[#6]-[#6]', completed=1)
284
285 The SMARTS '[#6]-[#6]' matches the largest common substructure of
286 the input structures. It has 2 atoms and 1 bond. If there is no
287 MCS which is at least `minNumAtoms` in size then the result will set
288 numAtoms and numBonds to -1 and set smarts to None.
289
290 By default, two atoms match if they are the same element and two
291 bonds match if they have the same bond type. Specify `atomCompare`
292 and `bondCompare` to use different comparison functions, as in:
293
294 >>> MCS.FindMCS(mols, atomCompare="any")
295 MCSResult(numAtoms=3, numBonds=2, smarts='[*]-[*]-[*]', completed=1)
296 >>> MCS.FindMCS(mols, bondCompare="any")
297 MCSResult(numAtoms=3, numBonds=2, smarts='[#6]~[#6]~[#6]', completed=1)
298
299 An atomCompare of "any" says that any atom matches any other atom,
300 "elements" compares by element type, and "isotopes" matches based on
301 the isotope label. Isotope labels can be used to implement user-defined
302 atom types. A bondCompare of "any" says that any bond matches any
303 other bond, and "bondtypes" says bonds are equivalent if and only if
304 they have the same bond type.
305
306 A substructure has both atoms and bonds. The default `maximize`
307 setting of "atoms" finds a common substructure with the most number
308 of atoms. Use maximize="bonds" to maximize the number of bonds.
309 Maximizing the number of bonds tends to maximize the number of rings,
310 although two small rings may have fewer bonds than one large ring.
311
312 You might not want a 3-valent nitrogen to match one which is 5-valent.
313 The default `matchValences` value of False ignores valence information.
314 When True, the atomCompare setting is modified to also require that
315 the two atoms have the same valency.
316
317 >>> MCS.FindMCS(mols, matchValences=True)
318 MCSResult(numAtoms=2, numBonds=1, smarts='[#6v4]-[#6v4]', completed=1)
319
320 It can be strange to see a linear carbon chain match a carbon ring,
321 which is what the `ringMatchesRingOnly` default of False does. If
322 you set it to True then ring bonds will only match ring bonds.
323
324 >>> mols = [Chem.MolFromSmiles("C1CCC1CCC"), Chem.MolFromSmiles("C1CCCCCC1")]
325 >>> MCS.FindMCS(mols)
326 MCSResult(numAtoms=7, numBonds=6, smarts='[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]', completed=1)
327 >>> MCS.FindMCS(mols, ringMatchesRingOnly=True)
328 MCSResult(numAtoms=4, numBonds=3, smarts='[#6](-@[#6])-@[#6]-@[#6]', completed=1)
329
330 You can further restrict things and require that partial rings
331 (as in this case) are not allowed. That is, if an atom is part of
332 the MCS and the atom is in a ring of the entire molecule then
333 that atom is also in a ring of the MCS. Set `completeRingsOnly`
334 to True to toggle this requirement and also sets ringMatchesRingOnly
335 to True.
336
337 >>> mols = [Chem.MolFromSmiles("CCC1CC2C1CN2"), Chem.MolFromSmiles("C1CC2C1CC2")]
338 >>> MCS.FindMCS(mols)
339 MCSResult(numAtoms=6, numBonds=6, smarts='[#6]-1-[#6]-[#6](-[#6])-[#6]-1-[#6]', completed=1)
340 >>> MCS.FindMCS(mols, ringMatchesRingOnly=True)
341 MCSResult(numAtoms=5, numBonds=5, smarts='[#6]-@1-@[#6]-@[#6](-@[#6])-@[#6]-@1', completed=1)
342 >>> MCS.FindMCS(mols, completeRingsOnly=True)
343 MCSResult(numAtoms=4, numBonds=4, smarts='[#6]-@1-@[#6]-@[#6]-@[#6]-@1', completed=1)
344
345 The MCS algorithm will exhaustively search for a maximum common substructure.
346 Typically this takes a fraction of a second, but for some comparisons this
347 can take minutes or longer. Use the `timeout` parameter to stop the search
348 after the given number of seconds (wall-clock seconds, not CPU seconds) and
349 return the best match found in that time. If timeout is reached then the
350 `completed` property of the MCSResult will be 0 instead of 1.
351
352 >>> mols = [Chem.MolFromSmiles("Nc1ccccc1"*100), Chem.MolFromSmiles("Nc1ccccccccc1"*100)]
353 >>> MCS.FindMCS(mols, timeout=0.1)
354 MCSResult(..., completed=0)
355
356 (The MCS after 50 seconds contained 511 atoms.)
357 """
358 warnings.warn("The rdkit.Chem.MCS module is deprecated; please use rdkit.Chem.rdFMCS instead.",
359 DeprecationWarning,stacklevel=2)
360
361 ores = fmcs.fmcs(mols,
362 minNumAtoms=minNumAtoms,
363 maximize=maximize,
364 atomCompare=atomCompare,
365 bondCompare=bondCompare,
366 threshold=threshold,
367 matchValences=matchValences,
368 ringMatchesRingOnly=ringMatchesRingOnly,
369 completeRingsOnly=completeRingsOnly,
370 timeout=timeout, )
371 return MCSResult(ores)
372
373
374 #------------------------------------
375 #
376 # doctest boilerplate
377 #
382
383
384 if __name__ == '__main__':
385 import sys
386 failed, tried = _test()
387 sys.exit(failed)
388
| Trees | Indices | Help |
|
|---|
| Generated by Epydoc 3.0.1 on Sun Oct 8 11:32:04 2017 | http://epydoc.sourceforge.net |