pytoda.smiles.processing module

SMILES processing utilities.

Summary

Functions:

kmer_smiles_tokenizer

K-Mer SMILES tokenization following SMILES PE (Li et al. 2020).

spe_smiles_tokenizer

Pretrained SMILES Pair Encoding tokenizer following Li et al. (2020).

split_selfies

Tokenize SELFIES, wrapping the generator as a list.

tokenize_selfies

Tokenize SELFIES.

tokenize_smiles

Tokenize a character-level SMILES string.

Reference

tokenize_smiles(smiles, regexp=re.compile(r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'), *args, **kwargs)[source]

Tokenize a character-level SMILES string.

Parameters
  • smiles (str) – a SMILES representation.

  • regexp (re.Pattern) – optionally pass a regexp for the tokenization. Defaults to SMILES_TOKENIZER.

  • *args – ignored, for backwards compatibility.

  • **kwargs – ignored, for backwards compatibility.

Returns

the tokenized SMILES.

Return type

Tokens
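
A minimal usage sketch (doctest-style; the output follows from the default SMILES_TOKENIZER regexp shown above, using aspirin as an illustrative input):

>>> from pytoda.smiles.processing import tokenize_smiles
>>> tokenize_smiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin
['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'C', '(', '=', 'O', ')', 'O']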

kmer_smiles_tokenizer(smiles, k=2, stride=1, *args, **kwargs)[source]
K-Mer SMILES tokenization following SMILES PE (Li et al. 2020):

Li, Xinhao, and Denis Fourches. “SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning.” (2020).

Parameters
  • smiles (str) – SMILES string to be tokenized.

  • k (int) – Positive integer denoting the k-gram length. Defaults to 2 (bigrams).

  • stride (int, optional) – Stride used for k-mer generation. Higher values result in fewer tokens. Defaults to 1 (densely overlapping).

  • *args – Optional positional arguments passed to kmer_tokenizer.

  • **kwargs – Optional keyword arguments passed to kmer_tokenizer.

Returns

Tokenized SMILES sequence (list of str).

Return type

Tokens
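
A hedged sketch: assuming the wrapped kmer_tokenizer from the SMILES PE package concatenates k adjacent atom-level tokens, acetic acid should split into the bigrams below. With stride=2, every other window would be skipped, roughly halving the token count.

>>> from pytoda.smiles.processing import kmer_smiles_tokenizer
>>> kmer_smiles_tokenizer('CC(=O)O', k=2, stride=1)  # acetic acid
['CC', 'C(', '(=', '=O', 'O)', ')O']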

spe_smiles_tokenizer(smiles)[source]
Pretrained SMILES Pair Encoding tokenizer following Li et al. (2020).

Splits a SMILES string into substructure tokens of varying length, depending on the frequency of each token in the ChEMBL dataset.

Li, Xinhao, and Denis Fourches. “SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning.” (2020).

Parameters

smiles (str) – SMILES string to be tokenized.

Returns

SMILES tokenized into substructures (list of str).

Return type

Tokens
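
Since the exact splits depend on token frequencies in the pretrained ChEMBL vocabulary, this sketch only asserts the segmentation property (tokens concatenate back to the input) rather than a specific output:

>>> from pytoda.smiles.processing import spe_smiles_tokenizer
>>> tokens = spe_smiles_tokenizer('CC(=O)Oc1ccccc1C(=O)O')  # aspirin
>>> ''.join(tokens) == 'CC(=O)Oc1ccccc1C(=O)O'
True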

tokenize_selfies(selfies)[source]

Tokenize SELFIES.

NOTE: Code adapted from the selfies package (selfies_to_hot):

https://github.com/aspuru-guzik-group/selfies

Parameters

selfies (str) – a SELFIES representation (character-level).

Returns

the tokenized SELFIES.

Return type

Tokens
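
A minimal sketch, assuming each bracketed SELFIES symbol becomes one token ('[C][C][O]' is the SELFIES encoding of ethanol):

>>> from pytoda.smiles.processing import tokenize_selfies
>>> tokenize_selfies('[C][C][O]')
['[C]', '[C]', '[O]']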

split_selfies(selfies)[source]

Tokenize SELFIES, wrapping the generator as a list.

Parameters

selfies (str) – a SELFIES representation (character-level).

Returns

the tokenized SELFIES.

Return type

Tokens
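
This mirrors tokenize_selfies but collects the tokens from a generator (presumably selfies.split_selfies from the selfies package) into a list; the split is purely lexical on bracketed symbols:

>>> from pytoda.smiles.processing import split_selfies
>>> split_selfies('[C][=C][F]')
['[C]', '[=C]', '[F]']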