rdkit.Chem.RegistrationHash module¶
Generate a unique hash code for a molecule based on chemistry. If two molecules are chemically “the same”, they should have the same hash.
Using molhash adds value beyond using SMILES because it:
Ignores SMILES features that are not chemically meaningful
(e.g. atom map numbers) * Canonicalizes enhanced stereochemistry groups. For example C[C@H](O)CC |&1:1| and C[C@@H](O)CC |&1:1| have the same molhash * Canonicalizes S group data (for example, polymer data)
There are two hash schemes, the default, and one in which tautomers are considered equivalent.
- class rdkit.Chem.RegistrationHash.EnhancedStereoUpdateMode(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)¶
Bases:
Enum
- ADD_WEIGHTS = 1¶
- REMOVE_WEIGHTS = 2¶
- rdkit.Chem.RegistrationHash.GetMolHash(all_layers, hash_scheme: HashScheme = HashScheme.ALL_LAYERS) str ¶
Generate a molecular hash using a specified set of layers.
- Parameters:
all_layers – a dictionary of layers
hash_scheme – enum encoding information layers for the hash
- Returns:
hash for the given scheme constructed from the input layers
- rdkit.Chem.RegistrationHash.GetMolLayers(original_molecule: Mol, data_field_names: Iterable | None = None, escape: str | None = None, cxflag=1089, enable_tautomer_hash_v2=False) {<HashLayer.NO_STEREO_TAUTOMER_HASH: 5>, <HashLayer.CANONICAL_SMILES: 1>, <HashLayer.ESCAPE: 2>, <HashLayer.SGROUP_DATA: 6>, <HashLayer.TAUTOMER_HASH: 7>, <HashLayer.NO_STEREO_SMILES: 4>, <HashLayer.FORMULA: 3>} ¶
Generate layers of data about that could be used to identify a molecule
- Parameters:
original_molecule – molecule to obtain canonicalization layers from
data_field_names – optional sequence of names of SGroup DAT fields which will be included in the hash.
escape – optional field which can contain arbitrary information
enable_tautomer_hash_v2 – use v2 of the tautomer hash
- Returns:
dictionary of HashLayer enum to calculated hash
- rdkit.Chem.RegistrationHash.GetNoStereoLayers(mol, enable_tautomer_hash_v2=False)¶
- rdkit.Chem.RegistrationHash.GetStereoTautomerHash(molecule, cxflag=1089, enable_tautomer_hash_v2=False)¶
- class rdkit.Chem.RegistrationHash.HashLayer(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)¶
Bases:
Enum
- Variables:
CANONICAL_SMILES – RDKit canonical SMILES (excluding enhanced stereo)
ESCAPE – arbitrary other information to be incorporated
FORMULA – a simple molecular formula for the molecule
NO_STEREO_SMILES – RDKit canonical SMILES with all stereo removed
SGROUP_DATA – canonicalization of all SGroups data present
TAUTOMER_HASH – SMILES-like representation for a generic tautomer form
NO_STEREO_TAUTOMER_HASH – the above tautomer hash lacking all stereo
- CANONICAL_SMILES = 1¶
- ESCAPE = 2¶
- FORMULA = 3¶
- NO_STEREO_SMILES = 4¶
- NO_STEREO_TAUTOMER_HASH = 5¶
- SGROUP_DATA = 6¶
- TAUTOMER_HASH = 7¶
- class rdkit.Chem.RegistrationHash.HashScheme(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)¶
Bases:
Enum
Which hash layers to use to when deduplicating molecules
Typically the “ALL_LAYERS” scheme is used, but some users may want the “TAUTOMER_INSENSITIVE_LAYERS” scheme.
- Variables:
ALL_LAYERS – most strict hash scheme utilizing all layers
STEREO_INSENSITIVE_LAYERS – excludes stereo sensitive layers
TAUTOMER_INSENSITIVE_LAYERS – excludes tautomer sensitive layers
- ALL_LAYERS = (HashLayer.CANONICAL_SMILES, HashLayer.ESCAPE, HashLayer.FORMULA, HashLayer.NO_STEREO_SMILES, HashLayer.NO_STEREO_TAUTOMER_HASH, HashLayer.SGROUP_DATA, HashLayer.TAUTOMER_HASH)¶
- STEREO_INSENSITIVE_LAYERS = (HashLayer.ESCAPE, HashLayer.FORMULA, HashLayer.NO_STEREO_SMILES, HashLayer.NO_STEREO_TAUTOMER_HASH, HashLayer.SGROUP_DATA)¶
- TAUTOMER_INSENSITIVE_LAYERS = (HashLayer.ESCAPE, HashLayer.FORMULA, HashLayer.NO_STEREO_TAUTOMER_HASH, HashLayer.SGROUP_DATA, HashLayer.TAUTOMER_HASH)¶