Package rdkit :: Package Chem :: Package fmcs :: Module fmcs'
[hide private]
[frames] | no frames]

Module fmcs'

source code

FMCS - Find Maximum Common Substructure

This software finds the maximum common substructure of a set of
structures and reports it as a SMARTS strings.

This implements what I think is a new algorithm for the MCS problem.
The core description is:

  best_substructure = None
  pick one structure as the query, and other as the targets
  for each substructure in the query graph:
    convert it to a SMARTS string based on the desired match properties
    if the SMARTS pattern exists in all of the targets:
       then this is a common substructure
       keep track of the maximum such common structure,

The SMARTS string depends on the desired match properties. For
example, if ring atoms are only allowed to match ring atoms then an
aliphatic ring carbon in the query is converted to the SMARTS "[C;R]",
and the double-bond ring bond converted to "=;@" while the respectice
chain-only version are "[C;!R]" and "=;!@".

The algorithm I outlined earlier will usually take a long time. There
are several ways to speed it up.

== Bond elimination ==

As the first step, remove bonds which obviously cannot be part of the
MCS.

This requires atom and bond type information, which I store as SMARTS
patterns. A bond can only be in the MCS if its canonical bond type is
present in all of the structures. A bond type is string made of the
SMARTS for one atom, the SMARTS for the bond, and the SMARTS for the
other atom. The canonical bond type is the lexographically smaller of
the two possible bond types for a bond.

The atom and bond SMARTS depend on the type comparison used.

The "ring-matches-ring-only" option adds an "@" or "!@" to the bond
SMARTS, so that the canonical bondtype for "C-C" becomes [#6]-@[#6] or
[#6]-!@[#6] if the bond is in a ring or not in a ring, and if atoms
are compared by element and bonds are compared by bondtype. (This
option does not add "R" or "!R" to the atom SMARTS because there
should be a single bond in the MCS of c1ccccc1O and CO.)

The result of all of this atom and bond typing is a "TypedMolecule"
for each input structure.

I then find which canonical bondtypes are present in all of the
structures. I convert each TypedMolecule into a
FragmentedTypedMolecule which has the same atom information but only
those bonds whose bondtypes are in all of the structures. This can
break a structure into multiple, disconnected fragments, hence the
name.

(BTW, I would like to use the fragmented molecules as the targets
because I think the SMARTS match would go faster, but the RDKit SMARTS
matcher doesn't like them. I think it's because the new molecule
hasn't been sanitized and the underlying data structure the ring
information doesn't exist. Instead, I use the input structures for the
SMARTS match.)

== Use the structure with the smallest largest fragment as the query ==
== and sort the targets by the smallest largest fragment             ==

I pick one of the FragmentedTypedMolecule instances as the source of
substructure enumeration. Which one?

My heuristic is to use the one with the smallest largest fragment.
Hopefully it produces the least number of subgraphs, but that's also
related to the number of rings, so a large linear graph will product
fewer subgraphs than a small fused ring system. I don't know how to
quantify that.

For each of the fragmented structures, I find the number of atoms in
the fragment with the most atoms, and I find the number of bonds in
the fragment with the most bonds. These might not be the same
fragment.

I sort the input structures by the number of bonds in the largest
fragment, with ties broken first on the number of atoms, and then on
the input order. The smallest such structure is the query structure,
and the remaining are the targets.

== Use a breadth-first search and a priority queue to    ==
== enumerate the fragment subgraphs                      ==

I extract each of the fragments from the FragmentedTypedMolecule into
a TypedFragment, which I use to make an EnumerationMolecule. An
enumeration molecule contains a pair of directed edges for each atom,
which simplifies the enumeration algorithm.

The enumeration algorithm is based around growing a seed. A seed
contains the current subgraph atoms and bonds as well as an exclusion
set of bonds which cannot be used for future grown. The initial seed
is the first bond in the fragment, which may potentially grow to use
the entire fragment. The second seed is the second bond in the
fragment, which is excluded from using the first bond in future
growth. The third seed starts from the third bond, which may not use
the first or second bonds during growth, and so on.


A seed can grow along bonds connected to an atom in the seed but which
aren't already in the seed and aren't in the set of excluded bonds for
the seed. If there are no such bonds then subgraph enumeration ends
for this fragment. Given N bonds there are 2**N-1 possible ways to
grow, which is just the powerset of the available bonds, excluding the
no-growth case.

This breadth-first growth takes into account all possibilties of using
the available N bonds so all of those bonds are added to the exclusion
set of the newly expanded subgraphs.

For performance reasons, the bonds used for growth are separated into
'internal' bonds, which connect two atoms already in the subgraph, and
'external' bonds, which lead outwards to an atom not already in the
subgraph.

Each seed growth can add from 0 to N new atoms and bonds. The goal is
to maximize the subgraph size so the seeds are stored in a priority
queue, ranked so the seed with the most bonds is processed first. This
turns the enumeration into something more like a depth-first search.


== Prune seeds which aren't found in all of the structures ==

At each stage of seed growth I check that the new seed exists in all
of the original structures. (Well, all except the one which I
enumerate over in the first place; by definition that one will match.)
If it doesn't match then there's no reason to include this seed or any
larger seeds made from it.

The check is easy; I turn the subgraph into its corresponding SMARTS
string and use RDKit's normal SMARTS matcher to test for a match.

There are three ways to generate a SMARTS string: 1) arbitrary, 2)
canonical, 3) hybrid.

I have not tested #1. During most of the development I assumed that
SMARTS matches across a few hundred structures would be slow, so that
the best solution is to generate a *canonical* SMARTS and cache the
match information.

Well, it turns out that my canonical SMARTS match code takes up most
of the FMCS run-time. If I drop the canonicalization step then the
code averages about 5-10% faster. This isn't the same as #1 - I still
do the initial atom assignment based on its neighborhood, which is
like a circular fingerprint of size 2 and *usually* gives a consistent
SMARTS pattern, which I can then cache.

However, there are times when the non-canonical SMARTS code is slower.
Obviously one is if there are a lot of structures, and another if is
there is a lot of symmetry. I'm still working on characterizing this.


== Maximize atoms? or bonds? ==

The above algorithm enumerates all subgraphs of the query and
identifies those subgraphs which are common to all input structures.

It's trivial then to keep track of the current "best" subgraph, which
can defined as having the subgraph with the most atoms, or the most
bonds. Both of those options are implemented.

It would not be hard to keep track of all other subgraphs which are
the same size.

== --complete-ring-only implementation ==

The "complete ring only" option is implemented by first enabling the
"ring-matches-ring-only" option, as otherwise it doesn't make sense.

Second, in order to be a "best" subgraph, all bonds in the subgraph
which are ring bonds in the original molecule must also be in a ring
in the subgraph. This is handled as a post-processing step.

(Note: some possible optimizations, like removing ring bonds from
structure fragments which are not in a ring, are not yet implemented.)


== Prune seeds which have no potential for growing large enough  ==

Given a seed, its set of edges available for growth, and the set of
excluded bonds, figure out the maximum possible growth for the seed.
If this maximum possible is less than the current best subgraph then
prune.

This requires a graph search, currently done in Python, which is a bit
expensive. To speed things up, I precompute some edge information.
That is, if I know that a given bond is a chain bond (not in a ring)
then I can calculate the maximum number of atoms and bonds for seed
growth along that bond, in either direction. However, precomputation
doesn't take into account the excluded bonds, so after a while the
predicted value is too high.

Again, I'm still working on characterizing this, and an implementation
in C++ would have different tradeoffs.


Version: 1.1

Classes [hide private]
  Atom
Atom(real_atom, atom_smarts, bond_indices, is_in_ring)
  AtomSmartsNoAromaticity
  Bond
Bond(real_bond, bond_smarts, canonical_bondtype, atom_indices, is_in_ring)
  CachingTargetsMatcher
  CangenNode
  Default
  DirectedEdge
DirectedEdge(bond_index, end_atom_index)
  EnumerationMolecule
Molecule(rdmol, atoms, bonds, directed_edges)
  FragmentedTypedMolecule
  MCSResult
  OutgoingEdge
OutgoingEdge(from_atom_index, bond_index, bond_smarts, other_node_idx, other_node)
  SingleBestAtoms
  SingleBestAtomsCompleteRingsOnly
  SingleBestBonds
  SingleBestBondsCompleteRingsOnly
  Subgraph
Subgraph(atom_indices, bond_indices)
  Timer
  TypedFragment
  TypedMolecule
  Uniquer
  VerboseCachingTargetsMatcher
  VerboseHeapOps
  _SingleBest
  starting_from
Functions [hide private]
 
MATCH(mol, pat) source code
 
_Counter() source code
 
_MolToSDBlock(mol) source code
 
_check_atom_classes(molno, num_atoms, atom_classes) source code
 
_copy_sd_tags(mol, fragment) source code
 
_get_match_bond_indices(pat, mol, match_atom_indices) source code
 
_get_nth_prime(n) source code
 
_get_threshold_count(num_mols, threshold) source code
 
_save_other_tags(mol, fragment, mcs, orig_mol, subgraph, args) source code
 
_update_times(timer, times) source code
 
all_subgraph_extensions(enumeration_mol, subgraph, visited_bond_indices, internal_bonds, external_edges) source code
 
assign_isotopes_from_class_tag(mol, atom_class_tag) source code
 
atom_typer_any(atoms) source code
 
atom_typer_elements(atoms) source code
 
atom_typer_isotopes(atoms) source code
 
bond_typer_any(bonds) source code
 
bond_typer_bondtypes(bonds) source code
 
canon(cangen_nodes) source code
 
check_completeRingsOnly(smarts, subgraph, enumeration_mol) source code
 
compute_mcs(fragmented_mols, typed_mols, minNumAtoms, threshold_count=None, maximize='bonds', completeRingsOnly=False, timeout=None, timer=None, verbose=False, verboseDelay=1.0) source code
 
convert_input_to_typed_molecules(mols, atom_typer, bond_typer, matchValences, ringMatchesRingOnly) source code
 
default_atom_typer(atoms) source code
 
default_bond_typer(bonds) source code
 
enumerate_subgraphs(enumeration_mols, prune, atom_assignment, matches_all_targets, hits, timeout, heappush, heappop) source code
 
find_duplicates(cangen_nodes, start, end) source code
 
find_extension_size(enumeration_mol, known_atoms, exclude_bonds, directed_edges) source code
 
find_extensions(atom_indices, visited_bond_indices, directed_edges) source code
 
find_upper_fragment_size_limits(rdmol, atoms) source code
 
fmcs(mols, minNumAtoms=2, maximize='bonds', atomCompare='elements', bondCompare='bondtypes', threshold=1.0, matchValences=False, ringMatchesRingOnly=False, completeRingsOnly=False, timeout=None, times=None, verbose=False, verboseDelay=1.0) source code
 
fragmented_mol_to_enumeration_mols(typed_mol, minNumAtoms=2) source code
 
gen_primes() source code
 
generate_smarts(cangen_nodes) source code
 
get_canonical_bondtype_counts(typed_mols) source code
 
get_canonical_bondtypes(rdmol, bonds, atom_smarts_types, bond_smarts_types) source code
 
get_closure_label(bond_smarts, closure) source code
 
get_counts(it) source code
 
get_initial_cangen_nodes(subgraph, enumeration_mol, atom_assignment, do_initial_assignment=True) source code
 
get_isotopes(mol) source code
 
get_selected_atom_classes(mol, atom_indices) source code
 
get_specified_types(rdmol, atom_types, ringMatchesRingOnly) source code
 
get_typed_fragment(typed_mol, atom_indices) source code
 
get_typed_molecule(rdmol, atom_typer, bond_typer, matchValences=False, ringMatchesRingOnly=False) source code
 
intersect_counts(counts1, counts2) source code
 
main(args=None) source code
 
make_arbitrary_smarts(subgraph, enumeration_mol, atom_assignment) source code
 
make_canonical_smarts(subgraph, enumeration_mol, atom_assignment) source code
 
make_complete_sdf(mcs, mol, subgraph, args) source code
 
make_fragment_sdf(mcs, mol, subgraph, args) source code
 
make_fragment_smiles(mcs, mol, subgraph, args=None) source code
 
make_structure_format(format_name, mcs, mol, subgraph, args) source code
 
nonempty_powerset(iterable)
nonempty_powerset([1,2,3]) --> (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
source code
 
parse_num_atoms(s) source code
 
parse_select(s) source code
 
parse_threshold(s) source code
 
parse_timeout(s) source code
 
powerset(iterable)
powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
source code
 
prune_maximize_atoms(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes) source code
 
prune_maximize_bonds(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes) source code
 
remove_unknown_bondtypes(typed_mol, supported_canonical_bondtypes) source code
 
rerank(cangen_nodes) source code
 
restore_isotopes(mol) source code
 
save_atom_classes(mol, atom_classes) source code
 
save_isotopes(mol, isotopes) source code
 
set_isotopes(mol, isotopes) source code
 
subgraph_to_fragment(mol, subgraph) source code
 
tiebreaker() source code
Variables [hide private]
  __package__ = 'rdkit.Chem.fmcs'
  __version_info = (1, 1, 0)
  _atom_class_dict = <WeakKeyDictionary at 140158553110288>
  _atom_smarts_no_aromaticity = {1: '#1', 34: '#34', 5: '#5', 6:...
  _available_closures = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
  _isotope_dict = <WeakKeyDictionary at 140158553110360>
  _maximize_options = {('atoms', False): (<function prune_maximi...
  _prime_stream = <generator object gen_primes at 0x7f7934c4f370>
  _primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,...
  atom_typers = {'any': <function atom_typer_any at 0x7f7934b4a7...
  bond_typers = {'any': <function bond_typer_any at 0x7f7934b4a9...
  compare_shortcuts = {'elements': ('elements', 'any'), 'topolog...
  eleno = 52
  range_pat = re.compile(r'(\d+)-(\d*)')
  structure_format_functions = {'complete-sdf': <function make_c...
  value_pat = re.compile(r'(\d+)')

Imports: Chem, _get_symbol, chain, collections, combinations, copy, defaultdict, heapify, heappop, heappush, itertools, next, product, range, re, sys, time, weakref


Variables Details [hide private]

_atom_smarts_no_aromaticity

Value:
{1: '#1', 34: '#34', 5: '#5', 6: '#6', 7: '#7', 8: '#8', 2: 'He', 15: \
'#15', 16: '#16', 52: '#52', 33: '#33'}

_available_closures

Value:
[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
...

_maximize_options

Value:
{('atoms', False): (<function prune_maximize_atoms at 0x7f7933fa6938>,
                    <class 'rdkit.Chem.fmcs.fmcs.SingleBestAtoms'>),
 ('atoms', True): (<function prune_maximize_atoms at 0x7f7933fa6938>,
                   <class 'rdkit.Chem.fmcs.fmcs.SingleBestAtomsComplet\
eRingsOnly'>),
 ('bonds', False): (<function prune_maximize_bonds at 0x7f7933fa6578>,
                    <class 'rdkit.Chem.fmcs.fmcs.SingleBestBonds'>),
 ('bonds', True): (<function prune_maximize_bonds at 0x7f7933fa6578>, \
...

_primes

Value:
[2,
 3,
 5,
 7,
 11,
 13,
 17,
 19,
...

atom_typers

Value:
{'any': <function atom_typer_any at 0x7f7934b4a7d0>,
 'elements': <function atom_typer_elements at 0x7f7934b4a8c0>,
 'isotopes': <function atom_typer_isotopes at 0x7f7934b4a938>}

bond_typers

Value:
{'any': <function bond_typer_any at 0x7f7934b4a9b0>,
 'bondtypes': <function bond_typer_bondtypes at 0x7f7934b4aa28>}

compare_shortcuts

Value:
{'elements': ('elements', 'any'),
 'topology': ('any', 'any'),
 'types': ('elements', 'bondtypes')}

structure_format_functions

Value:
{'complete-sdf': <function make_complete_sdf at 0x7f7933fa88c0>,
 'fragment-sdf': <function make_fragment_sdf at 0x7f7933fa8848>,
 'fragment-smiles': <function make_fragment_smiles at 0x7f7933fa8668>}