pytoda.smiles.processing module¶
SMILES processing utilities.
Summary¶
Functions:
- kmer_smiles_tokenizer: K-Mer SMILES tokenization following SMILES PE (Li et al. 2020).
- spe_smiles_tokenizer: Pretrained SMILES Pair Encoding tokenizer following (Li et al. 2020).
- Tokenize SELFIES, wrapping the generator as a list.
- Tokenize SELFIES.
- tokenize_smiles: Tokenize a character-level SMILES string.
Reference¶
-
tokenize_smiles
(smiles, regexp=re.compile('(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>|\\*|\\$|\\%[0-9]{2}|[0-9])'), *args, **kwargs)[source]¶ Tokenize a character-level SMILES string.
- Parameters
smiles (str) – a SMILES representation.
regexp (re.Pattern) – optionally pass a regexp for the tokenization. Defaults to SMILES_TOKENIZER.
args – ignored, for backwards compatibility.
kwargs – ignored, for backwards compatibility.
- Returns
the tokenized SMILES.
- Return type
Tokens
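Since the default regexp is given in the signature above, the behavior can be sketched with a minimal, self-contained re-implementation (assumed equivalent, not the library's actual code): the pattern is used as a capturing group with `re.split`, so every match becomes a token and empty fragments are dropped. Multi-character tokens such as bracket atoms, `Br`/`Cl`, and two-digit ring closures (`%NN`) are listed before single characters so they match first.

```python
import re

# Default character-level SMILES pattern from the signature above.
SMILES_TOKENIZER = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+'
    r'|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)

def tokenize_smiles(smiles: str, regexp=SMILES_TOKENIZER) -> list:
    """Split a SMILES string into character-level tokens."""
    # re.split with a capturing group keeps the matches; drop the
    # empty strings between adjacent matches.
    return [token for token in regexp.split(smiles) if token]

print(tokenize_smiles('c1ccccc1'))    # aromatic benzene ring
print(tokenize_smiles('C(=O)[O-]'))   # bracket atom stays one token
print(tokenize_smiles('CCl'))         # 'Cl' is not split into 'C', 'l'
```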
-
kmer_smiles_tokenizer
(smiles, k=2, stride=1, *args, **kwargs)[source]¶ - K-Mer SMILES tokenization following SMILES PE (Li et al. 2020):
Li, Xinhao, and Denis Fourches. “SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning.” (2020).
- Parameters
smiles (str) – SMILES string to be tokenized.
k (int) – Positive integer denoting the tuple/k-gram lengths. Defaults to 2 (bigrams).
stride (int, optional) – Stride used for k-mer generation. Higher values produce fewer tokens. Defaults to 1 (densely overlapping).
args – Optional positional arguments for kmer_tokenizer.
kwargs – Optional keyword arguments for kmer_tokenizer.
- Returns
Tokenized SMILES sequence (list of str).
- Return type
Tokens
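The actual implementation delegates to SmilesPE's kmer_tokenizer; the following is a self-contained sketch of the idea, assuming k-mers are formed by joining every k consecutive character-level tokens and advancing by stride between k-mers:

```python
import re

# Character-level splitter (same default pattern as tokenize_smiles).
SMILES_TOKENIZER = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+'
    r'|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)

def kmer_smiles_tokenizer(smiles: str, k: int = 2, stride: int = 1) -> list:
    """Join every k consecutive character-level tokens into one k-mer."""
    tokens = [t for t in SMILES_TOKENIZER.split(smiles) if t]
    return [
        ''.join(tokens[i:i + k])
        for i in range(0, len(tokens) - k + 1, stride)
    ]

print(kmer_smiles_tokenizer('c1ccccc1'))            # dense bigrams
print(kmer_smiles_tokenizer('c1ccccc1', stride=2))  # non-overlapping
```

With stride=1 each token starts a new bigram, so adjacent k-mers overlap; with stride=k the k-mers tile the sequence without overlap.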
-
spe_smiles_tokenizer
(smiles)[source]¶ - Pretrained SMILES Pair Encoding tokenizer following (Li et al. 2020).
Splits a SMILES string into substructure tokens of varying lengths, depending on how frequently those substructures occur in the ChEMBL dataset.
Li, Xinhao, and Denis Fourches. “SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning.” (2020).
- Parameters
smiles (str) – SMILES string to be tokenized.
- Returns
SMILES tokenized into substructures (list of str).
- Return type
Tokens
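The pretrained tokenizer itself depends on a merge table learned from ChEMBL (via the SmilesPE package), so it cannot be reproduced here. The toy sketch below only illustrates the pair-encoding mechanism: starting from character-level tokens, the highest-priority adjacent pair from a merge table is repeatedly fused into a single substructure token. The function name `spe_like_tokenizer` and the tiny merge table are illustrative, not part of the library.

```python
import re

SMILES_TOKENIZER = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+'
    r'|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)

def spe_like_tokenizer(smiles: str, merges: list) -> list:
    """Greedy pair-merge sketch of SMILES Pair Encoding.

    merges is an ordered list of token pairs; earlier pairs have
    higher priority, mimicking a learned merge table.
    """
    tokens = [t for t in SMILES_TOKENIZER.split(smiles) if t]
    ranks = {pair: i for i, pair in enumerate(merges)}
    while True:
        best, best_rank = None, None
        for i in range(len(tokens) - 1):
            rank = ranks.get((tokens[i], tokens[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best, best_rank = i, rank
        if best is None:
            return tokens
        # Fuse the winning pair into one substructure token.
        tokens[best:best + 2] = [tokens[best] + tokens[best + 1]]

# Toy merge table; the real tokenizer loads merges learned on ChEMBL.
print(spe_like_tokenizer('c1ccccc1', [('c', '1')]))
```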