rdkit.Chem.RegistrationHash module

Generate a unique hash code for a molecule based on chemistry. If two molecules are chemically “the same”, they should have the same hash.

Using molhash adds value beyond using SMILES because it:

  • Ignores SMILES features that are not chemically meaningful

(e.g. atom map numbers) * Canonicalizes enhanced stereochemistry groups. For example C[C@H](O)CC |&1:1| and C[C@@H](O)CC |&1:1| have the same molhash * Canonicalizes S group data (for example, polymer data)

There are two hash schemes, the default, and one in which tautomers are considered equivalent.

class rdkit.Chem.RegistrationHash.EnhancedStereoUpdateMode(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

ADD_WEIGHTS = 1
REMOVE_WEIGHTS = 2
rdkit.Chem.RegistrationHash.GetMolHash(all_layers, hash_scheme: HashScheme = HashScheme.ALL_LAYERS) str

Generate a molecular hash using a specified set of layers.

Parameters:
  • all_layers – a dictionary of layers

  • hash_scheme – enum encoding information layers for the hash

Returns:

hash for the given scheme constructed from the input layers

rdkit.Chem.RegistrationHash.GetMolLayers(original_molecule: Mol, data_field_names: Iterable | None = None, escape: str | None = None, cxflag=1089, enable_tautomer_hash_v2=False) {<HashLayer.SGROUP_DATA: 6>, <HashLayer.NO_STEREO_SMILES: 4>, <HashLayer.NO_STEREO_TAUTOMER_HASH: 5>, <HashLayer.CANONICAL_SMILES: 1>, <HashLayer.FORMULA: 3>, <HashLayer.TAUTOMER_HASH: 7>, <HashLayer.ESCAPE: 2>}

Generate layers of data about that could be used to identify a molecule

Parameters:
  • original_molecule – molecule to obtain canonicalization layers from

  • data_field_names – optional sequence of names of SGroup DAT fields which will be included in the hash.

  • escape – optional field which can contain arbitrary information

  • enable_tautomer_hash_v2 – use v2 of the tautomer hash

Returns:

dictionary of HashLayer enum to calculated hash

rdkit.Chem.RegistrationHash.GetNoStereoLayers(mol, enable_tautomer_hash_v2=False)
rdkit.Chem.RegistrationHash.GetStereoTautomerHash(molecule, cxflag=1089, enable_tautomer_hash_v2=False)
class rdkit.Chem.RegistrationHash.HashLayer(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Variables:
  • CANONICAL_SMILES – RDKit canonical SMILES (excluding enhanced stereo)

  • ESCAPE – arbitrary other information to be incorporated

  • FORMULA – a simple molecular formula for the molecule

  • NO_STEREO_SMILES – RDKit canonical SMILES with all stereo removed

  • SGROUP_DATA – canonicalization of all SGroups data present

  • TAUTOMER_HASH – SMILES-like representation for a generic tautomer form

  • NO_STEREO_TAUTOMER_HASH – the above tautomer hash lacking all stereo

CANONICAL_SMILES = 1
ESCAPE = 2
FORMULA = 3
NO_STEREO_SMILES = 4
NO_STEREO_TAUTOMER_HASH = 5
SGROUP_DATA = 6
TAUTOMER_HASH = 7
class rdkit.Chem.RegistrationHash.HashScheme(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Which hash layers to use to when deduplicating molecules

Typically the “ALL_LAYERS” scheme is used, but some users may want the “TAUTOMER_INSENSITIVE_LAYERS” scheme.

Variables:
  • ALL_LAYERS – most strict hash scheme utilizing all layers

  • STEREO_INSENSITIVE_LAYERS – excludes stereo sensitive layers

  • TAUTOMER_INSENSITIVE_LAYERS – excludes tautomer sensitive layers

ALL_LAYERS = (HashLayer.CANONICAL_SMILES, HashLayer.ESCAPE, HashLayer.FORMULA, HashLayer.NO_STEREO_SMILES, HashLayer.NO_STEREO_TAUTOMER_HASH, HashLayer.SGROUP_DATA, HashLayer.TAUTOMER_HASH)
STEREO_INSENSITIVE_LAYERS = (HashLayer.ESCAPE, HashLayer.FORMULA, HashLayer.NO_STEREO_SMILES, HashLayer.NO_STEREO_TAUTOMER_HASH, HashLayer.SGROUP_DATA)
TAUTOMER_INSENSITIVE_LAYERS = (HashLayer.ESCAPE, HashLayer.FORMULA, HashLayer.NO_STEREO_TAUTOMER_HASH, HashLayer.SGROUP_DATA, HashLayer.TAUTOMER_HASH)