rdkit.Chem.fmcs.fmcs module

FMCS - Find Maximum Common Substructure

This software finds the maximum common substructure of a set of structures and reports it as a SMARTS strings.

This implements what I think is a new algorithm for the MCS problem. The core description is:

best_substructure = None pick one structure as the query, and other as the targets for each substructure in the query graph:

convert it to a SMARTS string based on the desired match properties if the SMARTS pattern exists in all of the targets:

then this is a common substructure keep track of the maximum such common structure,

The SMARTS string depends on the desired match properties. For example, if ring atoms are only allowed to match ring atoms then an aliphatic ring carbon in the query is converted to the SMARTS “[C;R]”, and the double-bond ring bond converted to “=;@” while the respectice chain-only version are “[C;!R]” and “=;!@”.

The algorithm I outlined earlier will usually take a long time. There are several ways to speed it up.

== Bond elimination ==

As the first step, remove bonds which obviously cannot be part of the MCS.

This requires atom and bond type information, which I store as SMARTS patterns. A bond can only be in the MCS if its canonical bond type is present in all of the structures. A bond type is string made of the SMARTS for one atom, the SMARTS for the bond, and the SMARTS for the other atom. The canonical bond type is the lexographically smaller of the two possible bond types for a bond.

The atom and bond SMARTS depend on the type comparison used.

The “ring-matches-ring-only” option adds an “@” or “!@” to the bond SMARTS, so that the canonical bondtype for “C-C” becomes [#6]-@[#6] or [#6]-!@[#6] if the bond is in a ring or not in a ring, and if atoms are compared by element and bonds are compared by bondtype. (This option does not add “R” or “!R” to the atom SMARTS because there should be a single bond in the MCS of c1ccccc1O and CO.)

The result of all of this atom and bond typing is a “TypedMolecule” for each input structure.

I then find which canonical bondtypes are present in all of the structures. I convert each TypedMolecule into a FragmentedTypedMolecule which has the same atom information but only those bonds whose bondtypes are in all of the structures. This can break a structure into multiple, disconnected fragments, hence the name.

(BTW, I would like to use the fragmented molecules as the targets because I think the SMARTS match would go faster, but the RDKit SMARTS matcher doesn’t like them. I think it’s because the new molecule hasn’t been sanitized and the underlying data structure the ring information doesn’t exist. Instead, I use the input structures for the SMARTS match.)

== Use the structure with the smallest largest fragment as the query == == and sort the targets by the smallest largest fragment ==

I pick one of the FragmentedTypedMolecule instances as the source of substructure enumeration. Which one?

My heuristic is to use the one with the smallest largest fragment. Hopefully it produces the least number of subgraphs, but that’s also related to the number of rings, so a large linear graph will product fewer subgraphs than a small fused ring system. I don’t know how to quantify that.

For each of the fragmented structures, I find the number of atoms in the fragment with the most atoms, and I find the number of bonds in the fragment with the most bonds. These might not be the same fragment.

I sort the input structures by the number of bonds in the largest fragment, with ties broken first on the number of atoms, and then on the input order. The smallest such structure is the query structure, and the remaining are the targets.

== Use a breadth-first search and a priority queue to == == enumerate the fragment subgraphs ==

I extract each of the fragments from the FragmentedTypedMolecule into a TypedFragment, which I use to make an EnumerationMolecule. An enumeration molecule contains a pair of directed edges for each atom, which simplifies the enumeration algorithm.

The enumeration algorithm is based around growing a seed. A seed contains the current subgraph atoms and bonds as well as an exclusion set of bonds which cannot be used for future grown. The initial seed is the first bond in the fragment, which may potentially grow to use the entire fragment. The second seed is the second bond in the fragment, which is excluded from using the first bond in future growth. The third seed starts from the third bond, which may not use the first or second bonds during growth, and so on.

A seed can grow along bonds connected to an atom in the seed but which aren’t already in the seed and aren’t in the set of excluded bonds for the seed. If there are no such bonds then subgraph enumeration ends for this fragment. Given N bonds there are 2**N-1 possible ways to grow, which is just the powerset of the available bonds, excluding the no-growth case.

This breadth-first growth takes into account all possibilties of using the available N bonds so all of those bonds are added to the exclusion set of the newly expanded subgraphs.

For performance reasons, the bonds used for growth are separated into ‘internal’ bonds, which connect two atoms already in the subgraph, and ‘external’ bonds, which lead outwards to an atom not already in the subgraph.

Each seed growth can add from 0 to N new atoms and bonds. The goal is to maximize the subgraph size so the seeds are stored in a priority queue, ranked so the seed with the most bonds is processed first. This turns the enumeration into something more like a depth-first search.

== Prune seeds which aren’t found in all of the structures ==

At each stage of seed growth I check that the new seed exists in all of the original structures. (Well, all except the one which I enumerate over in the first place; by definition that one will match.) If it doesn’t match then there’s no reason to include this seed or any larger seeds made from it.

The check is easy; I turn the subgraph into its corresponding SMARTS string and use RDKit’s normal SMARTS matcher to test for a match.

There are three ways to generate a SMARTS string: 1) arbitrary, 2) canonical, 3) hybrid.

I have not tested #1. During most of the development I assumed that SMARTS matches across a few hundred structures would be slow, so that the best solution is to generate a canonical SMARTS and cache the match information.

Well, it turns out that my canonical SMARTS match code takes up most of the FMCS run-time. If I drop the canonicalization step then the code averages about 5-10% faster. This isn’t the same as #1 - I still do the initial atom assignment based on its neighborhood, which is like a circular fingerprint of size 2 and usually gives a consistent SMARTS pattern, which I can then cache.

However, there are times when the non-canonical SMARTS code is slower. Obviously one is if there are a lot of structures, and another if is there is a lot of symmetry. I’m still working on characterizing this.

== Maximize atoms? or bonds? ==

The above algorithm enumerates all subgraphs of the query and identifies those subgraphs which are common to all input structures.

It’s trivial then to keep track of the current “best” subgraph, which can defined as having the subgraph with the most atoms, or the most bonds. Both of those options are implemented.

It would not be hard to keep track of all other subgraphs which are the same size.

== –complete-ring-only implementation ==

The “complete ring only” option is implemented by first enabling the “ring-matches-ring-only” option, as otherwise it doesn’t make sense.

Second, in order to be a “best” subgraph, all bonds in the subgraph which are ring bonds in the original molecule must also be in a ring in the subgraph. This is handled as a post-processing step.

(Note: some possible optimizations, like removing ring bonds from structure fragments which are not in a ring, are not yet implemented.)

== Prune seeds which have no potential for growing large enough ==

Given a seed, its set of edges available for growth, and the set of excluded bonds, figure out the maximum possible growth for the seed. If this maximum possible is less than the current best subgraph then prune.

This requires a graph search, currently done in Python, which is a bit expensive. To speed things up, I precompute some edge information. That is, if I know that a given bond is a chain bond (not in a ring) then I can calculate the maximum number of atoms and bonds for seed growth along that bond, in either direction. However, precomputation doesn’t take into account the excluded bonds, so after a while the predicted value is too high.

Again, I’m still working on characterizing this, and an implementation in C++ would have different tradeoffs.

class rdkit.Chem.fmcs.fmcs.Atom(real_atom, atom_smarts, bond_indices, is_in_ring)

Bases: tuple

Create new instance of Atom(real_atom, atom_smarts, bond_indices, is_in_ring)

atom_smarts

Alias for field number 1

bond_indices

Alias for field number 2

is_in_ring

Alias for field number 3

real_atom

Alias for field number 0

class rdkit.Chem.fmcs.fmcs.AtomSmartsNoAromaticity

Bases: dict

class rdkit.Chem.fmcs.fmcs.Bond(real_bond, bond_smarts, canonical_bondtype, atom_indices, is_in_ring)

Bases: tuple

Create new instance of Bond(real_bond, bond_smarts, canonical_bondtype, atom_indices, is_in_ring)

atom_indices

Alias for field number 3

bond_smarts

Alias for field number 1

canonical_bondtype

Alias for field number 2

is_in_ring

Alias for field number 4

real_bond

Alias for field number 0

class rdkit.Chem.fmcs.fmcs.CachingTargetsMatcher(targets, required_match_count=None)

Bases: dict

shift_targets()
class rdkit.Chem.fmcs.fmcs.CangenNode(index, atom_smarts)

Bases: object

atom_smarts
index
neighbors
outgoing_edges
rank
value
class rdkit.Chem.fmcs.fmcs.Default

Bases: object

atomCompare = 'elements'
bondCompare = 'bondtypes'
completeRingsOnly = False
matchValences = False
maximize = 'bonds'
ringMatchesRingOnly = False
timeout = None
timeoutString = 'none'
class rdkit.Chem.fmcs.fmcs.DirectedEdge(bond_index, end_atom_index)

Bases: tuple

Create new instance of DirectedEdge(bond_index, end_atom_index)

bond_index

Alias for field number 0

end_atom_index

Alias for field number 1

rdkit.Chem.fmcs.fmcs.EnumerationMolecule

alias of Molecule

class rdkit.Chem.fmcs.fmcs.FragmentedTypedMolecule(rdmol, rdmol_atoms, orig_atoms, orig_bonds, atom_smarts_types, bond_smarts_types, canonical_bondtypes)

Bases: object

rdkit.Chem.fmcs.fmcs.MATCH(mol, pat)
class rdkit.Chem.fmcs.fmcs.MCSResult(num_atoms, num_bonds, smarts, completed)

Bases: object

class rdkit.Chem.fmcs.fmcs.OutgoingEdge(from_atom_index, bond_index, bond_smarts, other_node_idx, other_node)

Bases: tuple

Create new instance of OutgoingEdge(from_atom_index, bond_index, bond_smarts, other_node_idx, other_node)

bond_index

Alias for field number 1

bond_smarts

Alias for field number 2

from_atom_index

Alias for field number 0

other_node

Alias for field number 4

other_node_idx

Alias for field number 3

class rdkit.Chem.fmcs.fmcs.SingleBestAtoms(timer, verbose)

Bases: _SingleBest

add_new_match(subgraph, mol, smarts)
class rdkit.Chem.fmcs.fmcs.SingleBestAtomsCompleteRingsOnly(timer, verbose)

Bases: _SingleBest

add_new_match(subgraph, mol, smarts)
class rdkit.Chem.fmcs.fmcs.SingleBestBonds(timer, verbose)

Bases: _SingleBest

add_new_match(subgraph, mol, smarts)
class rdkit.Chem.fmcs.fmcs.SingleBestBondsCompleteRingsOnly(timer, verbose)

Bases: _SingleBest

add_new_match(subgraph, mol, smarts)
class rdkit.Chem.fmcs.fmcs.Subgraph(atom_indices, bond_indices)

Bases: tuple

Create new instance of Subgraph(atom_indices, bond_indices)

atom_indices

Alias for field number 0

bond_indices

Alias for field number 1

class rdkit.Chem.fmcs.fmcs.Timer

Bases: object

mark(name)
class rdkit.Chem.fmcs.fmcs.TypedFragment(rdmol, orig_atoms, orig_bonds, atom_smarts_types, bond_smarts_types, canonical_bondtypes)

Bases: object

class rdkit.Chem.fmcs.fmcs.TypedMolecule(rdmol, rdmol_atoms, rdmol_bonds, atom_smarts_types, bond_smarts_types, canonical_bondtypes)

Bases: object

class rdkit.Chem.fmcs.fmcs.Uniquer

Bases: dict

class rdkit.Chem.fmcs.fmcs.VerboseCachingTargetsMatcher(targets, required_match_count=None)

Bases: object

report()
shift_targets()
class rdkit.Chem.fmcs.fmcs.VerboseHeapOps(trigger, verboseDelay)

Bases: object

heappop(seeds)
heappush(seeds, item)
report()
trigger_report()
rdkit.Chem.fmcs.fmcs.all_subgraph_extensions(enumeration_mol, subgraph, visited_bond_indices, internal_bonds, external_edges)
rdkit.Chem.fmcs.fmcs.assign_isotopes_from_class_tag(mol, atom_class_tag)
rdkit.Chem.fmcs.fmcs.atom_typer_any(atoms)
rdkit.Chem.fmcs.fmcs.atom_typer_elements(atoms)
rdkit.Chem.fmcs.fmcs.atom_typer_isotopes(atoms)
rdkit.Chem.fmcs.fmcs.bond_typer_any(bonds)
rdkit.Chem.fmcs.fmcs.bond_typer_bondtypes(bonds)
rdkit.Chem.fmcs.fmcs.canon(cangen_nodes)
rdkit.Chem.fmcs.fmcs.check_completeRingsOnly(smarts, subgraph, enumeration_mol)
rdkit.Chem.fmcs.fmcs.compute_mcs(fragmented_mols, typed_mols, minNumAtoms, threshold_count=None, maximize='bonds', completeRingsOnly=False, timeout=None, timer=None, verbose=False, verboseDelay=1.0)
rdkit.Chem.fmcs.fmcs.convert_input_to_typed_molecules(mols, atom_typer, bond_typer, matchValences, ringMatchesRingOnly)
rdkit.Chem.fmcs.fmcs.default_atom_typer(atoms)
rdkit.Chem.fmcs.fmcs.default_bond_typer(bonds)
rdkit.Chem.fmcs.fmcs.enumerate_subgraphs(enumeration_mols, prune, atom_assignment, matches_all_targets, hits, timeout, heappush, heappop)
rdkit.Chem.fmcs.fmcs.find_duplicates(cangen_nodes, start, end)
rdkit.Chem.fmcs.fmcs.find_extension_size(enumeration_mol, known_atoms, exclude_bonds, directed_edges)
rdkit.Chem.fmcs.fmcs.find_extensions(atom_indices, visited_bond_indices, directed_edges)
rdkit.Chem.fmcs.fmcs.find_upper_fragment_size_limits(rdmol, atoms)
rdkit.Chem.fmcs.fmcs.fmcs(mols, minNumAtoms=2, maximize='bonds', atomCompare='elements', bondCompare='bondtypes', threshold=1.0, matchValences=False, ringMatchesRingOnly=False, completeRingsOnly=False, timeout=None, times=None, verbose=False, verboseDelay=1.0)
rdkit.Chem.fmcs.fmcs.fragmented_mol_to_enumeration_mols(typed_mol, minNumAtoms=2)
rdkit.Chem.fmcs.fmcs.gen_primes()
rdkit.Chem.fmcs.fmcs.generate_smarts(cangen_nodes)
rdkit.Chem.fmcs.fmcs.get_canonical_bondtype_counts(typed_mols)
rdkit.Chem.fmcs.fmcs.get_canonical_bondtypes(rdmol, bonds, atom_smarts_types, bond_smarts_types)
rdkit.Chem.fmcs.fmcs.get_closure_label(bond_smarts, closure)
rdkit.Chem.fmcs.fmcs.get_counts(it)
rdkit.Chem.fmcs.fmcs.get_initial_cangen_nodes(subgraph, enumeration_mol, atom_assignment, do_initial_assignment=True)
rdkit.Chem.fmcs.fmcs.get_isotopes(mol)
rdkit.Chem.fmcs.fmcs.get_selected_atom_classes(mol, atom_indices)
rdkit.Chem.fmcs.fmcs.get_specified_types(rdmol, atom_types, ringMatchesRingOnly)
rdkit.Chem.fmcs.fmcs.get_typed_fragment(typed_mol, atom_indices)
rdkit.Chem.fmcs.fmcs.get_typed_molecule(rdmol, atom_typer, bond_typer, matchValences=False, ringMatchesRingOnly=False)
rdkit.Chem.fmcs.fmcs.intersect_counts(counts1, counts2)
rdkit.Chem.fmcs.fmcs.main(args=None)
rdkit.Chem.fmcs.fmcs.make_arbitrary_smarts(subgraph, enumeration_mol, atom_assignment)
rdkit.Chem.fmcs.fmcs.make_canonical_smarts(subgraph, enumeration_mol, atom_assignment)
rdkit.Chem.fmcs.fmcs.make_complete_sdf(mcs, mol, subgraph, args)
rdkit.Chem.fmcs.fmcs.make_fragment_sdf(mcs, mol, subgraph, args)
rdkit.Chem.fmcs.fmcs.make_fragment_smiles(mcs, mol, subgraph, args=None)
rdkit.Chem.fmcs.fmcs.make_structure_format(format_name, mcs, mol, subgraph, args)
rdkit.Chem.fmcs.fmcs.nonempty_powerset([1,2,3]) --> (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
rdkit.Chem.fmcs.fmcs.parse_num_atoms(s)
rdkit.Chem.fmcs.fmcs.parse_select(s)
rdkit.Chem.fmcs.fmcs.parse_threshold(s)
rdkit.Chem.fmcs.fmcs.parse_timeout(s)
rdkit.Chem.fmcs.fmcs.powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
rdkit.Chem.fmcs.fmcs.prune_maximize_atoms(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes)
rdkit.Chem.fmcs.fmcs.prune_maximize_bonds(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes)
rdkit.Chem.fmcs.fmcs.remove_unknown_bondtypes(typed_mol, supported_canonical_bondtypes)
rdkit.Chem.fmcs.fmcs.rerank(cangen_nodes)
rdkit.Chem.fmcs.fmcs.restore_isotopes(mol)
rdkit.Chem.fmcs.fmcs.save_atom_classes(mol, atom_classes)
rdkit.Chem.fmcs.fmcs.save_isotopes(mol, isotopes)
rdkit.Chem.fmcs.fmcs.set_isotopes(mol, isotopes)
class rdkit.Chem.fmcs.fmcs.starting_from(left)

Bases: object

rdkit.Chem.fmcs.fmcs.subgraph_to_fragment(mol, subgraph)
rdkit.Chem.fmcs.fmcs.tiebreaker()