pytoda.smiles.smiles_language module

SMILES language handling.

Summary

Exceptions:

UnknownMaxLengthError

Classes:

SELFIESLanguage

SELFIESLanguage is a SMILESLanguage with a different default tokenizer, transforming SMILES to SELFIES.

SMILESLanguage

SMILESLanguage class.

SMILESTokenizer

SMILESTokenizer class, based on SMILESLanguage, that applies transforms and encodes a SMILES string to a sequence of token indexes.

Reference

exception UnknownMaxLengthError

Bases: RuntimeError

class SMILESLanguage(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0)

Bases: object

SMILESLanguage class.

SMILESLanguage handles SMILES data, defining the vocabulary and the utilities to manipulate it, including encoding to token indexes.

vocab_files_names = {'vocab_file': 'vocab.json'}

__init__(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0)

Initialize SMILES language.

Parameters
  • name (str) – name of the SMILESLanguage.

  • smiles_tokenizer (Tokenizer) – optional SMILES tokenization function. Defaults to tokenize_smiles, but tokenizer_name takes precedence when found in available TOKENIZER_FUNCTIONS.

  • tokenizer_name (str) – name that maps to a Tokenizer, used to save and restore the object from text files. Defaults to None, i.e. the default smiles_tokenizer is used. Examples of available names are ‘smiles’, ‘selfies’ and ‘spe_smiles’.

  • vocab_file (str) – optional filepath to vocab json or directory containing it.

  • max_token_sequence_length (int) – initial value for keeping track of longest sequence. Defaults to 0.

setup_vocab()

Sets up the vocab by generating the special tokens.

Return type

None

static load(filepath)

Static method to load a SMILESLanguage object.

Parameters

filepath (str) – path to the file.

Returns

the loaded SMILES language object.

Return type

SMILESLanguage

static dump(smiles_language, filepath)

Static method to save a smiles_language object to disk.

Parameters
  • smiles_language (SMILESLanguage) – a SMILESLanguage object.

  • filepath (str) – path where to dump the SMILESLanguage.

save(filepath)

Instance method to save/dump smiles language object.

Parameters

filepath (str) – path where to save the SMILESLanguage.

load_vocabulary(vocab_file)

Load a vocabulary mapping from token to token indexes.

Parameters

vocab_file (str) – a .json file mapping tokens to indexes. Can also be the path to a directory containing it.

classmethod from_pretrained(pretrained_path, *init_inputs, **kwargs)

Re-load a tokenizer from files written by save_pretrained.

save_vocabulary(vocab_file)

Save the vocabulary mapping tokens to indexes to file.

Parameters

vocab_file (str) – a .json file in which to save the token-to-index mapping. Can also be the path to a directory.

Return type

Tuple[str]
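The vocabulary round trip described for load_vocabulary and save_vocabulary can be sketched with the standard library alone. This is a minimal illustration, not pytoda's implementation: the helper names (resolve_vocab_path, save_vocab, load_vocab) are hypothetical, and resolving a directory to the vocab.json named in vocab_files_names is an assumption.

```python
import json
import os
import tempfile

VOCAB_FILE_NAME = "vocab.json"  # file name listed in vocab_files_names


def resolve_vocab_path(path):
    # Assumption: a directory resolves to the vocab.json inside it.
    if os.path.isdir(path):
        return os.path.join(path, VOCAB_FILE_NAME)
    return path


def save_vocab(token_to_index, path):
    # Write the token-to-index mapping as JSON and return the written
    # path in a one-element tuple, mirroring the Tuple[str] return type.
    target = resolve_vocab_path(path)
    with open(target, "w", encoding="utf-8") as handle:
        json.dump(token_to_index, handle, indent=2)
    return (target,)


def load_vocab(path):
    # Read the token-to-index mapping back from a file or directory.
    with open(resolve_vocab_path(path), encoding="utf-8") as handle:
        return json.load(handle)


vocab = {"<PAD>": 0, "C": 1, "O": 2}  # toy vocabulary for illustration
with tempfile.TemporaryDirectory() as directory:
    (written,) = save_vocab(vocab, directory)
    restored = load_vocab(directory)
```

After the round trip, restored holds the same mapping as vocab, and written ends in vocab.json.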

save_pretrained(save_directory)

Save the tokenizer vocabulary files together with the tokenizer's positional and keyword instantiation inputs.

This method makes sure the full tokenizer can then be re-loaded using the from_pretrained class method.

add_smis(smi_filepaths, index_col=1, chunk_size=10000, name='SMILES', names=None)

Add a set of SMILES from a list of .smi files, applying transform_smiles.

Parameters
  • smi_filepaths (Files) – a list of paths to .smi files.

  • index_col (int) – Data column used for indexing, defaults to 1.

  • chunk_size (int) – size of the chunks. Defaults to 10000.

  • name (str) – type of dataset, used to index columns in smi, and must be in names. Defaults to ‘SMILES’.

  • names (Sequence[str]) – User-assigned names given to the columns. Defaults to [name].

Return type

None

add_smi(smi_filepath, index_col=1, chunk_size=10000, name='SMILES', names=None)

Add a set of SMILES from a .smi file, applying transform_smiles.

Parameters
  • smi_filepath (str) – path to the .smi file.

  • index_col (int) – Data column used for indexing, defaults to 1.

  • chunk_size (int) – number of rows to read in a chunk. Defaults to 10000.

  • name (str) – type of dataset, used to index columns in smi, and must be in names. Defaults to ‘SMILES’.

  • names (Sequence[str]) – User-assigned names given to the columns. Defaults to [name].

Return type

None
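Reading a .smi file in chunks, as the chunk_size parameter suggests, might look like the sketch below. It uses only the standard library and is not how pytoda reads files. The file layout (tab-separated, SMILES in the first column, identifier in the second) and the helper name iter_smiles_chunks are assumptions made for illustration.

```python
import csv
import itertools
import os
import tempfile


def iter_smiles_chunks(smi_filepath, smiles_col=0, chunk_size=10000):
    """Yield lists of SMILES strings, at most chunk_size rows at a time."""
    with open(smi_filepath, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        while True:
            # Pull the next chunk_size rows without loading the whole file.
            rows = list(itertools.islice(reader, chunk_size))
            if not rows:
                break
            yield [row[smiles_col] for row in rows]


# Write a tiny example file: SMILES, tab, identifier (assumed layout).
with tempfile.NamedTemporaryFile(
    "w", suffix=".smi", delete=False, encoding="utf-8"
) as handle:
    handle.write("CCO\tethanol\nCC(=O)O\tacetic_acid\nc1ccccc1\tbenzene\n")
    path = handle.name

chunks = list(iter_smiles_chunks(path, chunk_size=2))
os.remove(path)
```

With chunk_size=2, the three rows arrive as one chunk of two SMILES followed by one chunk of one, so memory use is bounded by the chunk size rather than the file size.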

add_dataset(dataset)

Add a set of SMILES from an iterable, applying transform_smiles.

Collects and warns about invalid SMILES, and warns on finding new tokens.

Parameters

dataset (Iterable) – an iterable yielding SMILES strings.

add_smiles(smiles)

Add a SMILES to the language.

Updates max_token_sequence_length. Adds missing tokens to the language.

Parameters

smiles (str) – a SMILES representation.

Return type

None

add_token(token)

Add a token to the language.

Parameters

token (str) – a token.

Return type

None
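The bookkeeping described for add_smiles and add_token (new tokens get the next free index, and the longest sequence seen is tracked) can be sketched as follows. ToyLanguage, toy_tokenize, the regular expression, and the special token strings are all assumptions made for illustration. They are not pytoda's tokenize_smiles or its actual special tokens.

```python
import re

# Simplified SMILES token pattern (an assumption for illustration):
# bracket atoms, two-letter halogens, two-digit ring closures,
# then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def toy_tokenize(smiles):
    return TOKEN_PATTERN.findall(smiles)


class ToyLanguage:
    """Hypothetical stand-in for a token vocabulary."""

    def __init__(self):
        self.token_to_index = {}
        self.max_token_sequence_length = 0
        # Hypothetical special tokens, registered first.
        for token in ("<PAD>", "<UNK>", "<START>", "<STOP>"):
            self.add_token(token)

    def add_token(self, token):
        # A token already in the vocabulary keeps its index.
        if token not in self.token_to_index:
            self.token_to_index[token] = len(self.token_to_index)

    def add_smiles(self, smiles):
        tokens = toy_tokenize(smiles)
        # Track the longest token sequence seen so far.
        self.max_token_sequence_length = max(
            self.max_token_sequence_length, len(tokens)
        )
        for token in tokens:
            self.add_token(token)


language = ToyLanguage()
language.add_smiles("CCO")
language.add_smiles("c1ccccc1Cl")
```

Note that "Cl" is kept as a single token, so the second SMILES contributes nine tokens rather than ten.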

smiles_to_token_indexes(smiles)

Transform character-level SMILES into a sequence of token indexes.

Parameters

smiles (str) – a SMILES (or SELFIES) representation.

Returns

indexes representation for the SMILES/SELFIES provided.

Return type

Union[Indexes, Tensor]

token_indexes_to_smiles(token_indexes)

Transform a sequence of token indexes into SMILES, ignoring special tokens.

Parameters

token_indexes (Union[Indexes, Tensor]) – Sequence of integers representing tokens in vocabulary.

Returns

a SMILES (or SELFIES) representation.

Return type

str
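The encode and decode directions of smiles_to_token_indexes and token_indexes_to_smiles can be illustrated with a toy vocabulary. The sketch below is standalone and hypothetical: the vocabulary, the special token indexes, and the choice to map unknown characters to an unknown token are assumptions, not pytoda behaviour.

```python
# Hypothetical vocabulary; indexes 0-3 stand in for special tokens.
token_to_index = {
    "<PAD>": 0, "<UNK>": 1, "<START>": 2, "<STOP>": 3,
    "C": 4, "O": 5, "=": 6,
}
index_to_token = {index: token for token, index in token_to_index.items()}
special_indexes = {0, 1, 2, 3}


def encode(smiles):
    # Character-level tokenization; unknown characters map to <UNK>
    # (an assumption made for this sketch).
    unknown = token_to_index["<UNK>"]
    return [token_to_index.get(char, unknown) for char in smiles]


def decode(token_indexes):
    # Skip special tokens, then join the remaining tokens.
    return "".join(
        index_to_token[index]
        for index in token_indexes
        if index not in special_indexes
    )


# Padding, start and stop indexes are dropped when decoding.
padded = [0, 0, 2] + encode("CC=O") + [3]
recovered = decode(padded)
```

Because special tokens are ignored on the way back, recovered equals the original "CC=O" even though the encoded sequence carried padding and start/stop markers.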

static tensor_to_indexes(token_indexes)

Utility to get Indexes from Tensors.

Parameters

token_indexes (Union[Indexes, Tensor]) – from single SMILES.

Raises

ValueError – in case the Tensor is not shaped correctly

Returns

list from Tensor or else the initial token_indexes.

Return type

Indexes

selfies_to_smiles(selfies)

SELFIES to SMILES converter method. Based on: https://arxiv.org/abs/1905.13741

Parameters

selfies (str) – SELFIES representation

Returns

A SMILES string

Return type

str

smiles_to_selfies(smiles)

SMILES to SELFIES converter method. Based on: https://arxiv.org/abs/1905.13741

Parameters

smiles (str) – SMILES representation

Returns

A SELFIES string

Return type

str

class SELFIESLanguage(name='selfies-language', vocab_file=None, max_token_sequence_length=0)

Bases: pytoda.smiles.smiles_language.SMILESLanguage

SELFIESLanguage is a SMILESLanguage with a different default tokenizer, transforming SMILES to SELFIES.

__init__(name='selfies-language', vocab_file=None, max_token_sequence_length=0)

Initialize SMILES language.

Parameters
  • name (str) – name of the SMILESLanguage.

  • vocab_file (str) – optional filepath to vocab json or directory containing it.

  • max_token_sequence_length (int) – initial value for keeping track of longest sequence. Defaults to 0.

class SMILESTokenizer(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, randomize=False, add_start_and_stop=False, padding=False, padding_length=None, device=None)

Bases: pytoda.smiles.smiles_language.SMILESLanguage

SMILESTokenizer class, based on SMILESLanguage, that applies transforms and encodes a SMILES string to a sequence of token indexes.

__init__(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, randomize=False, add_start_and_stop=False, padding=False, padding_length=None, device=None)

Initialize SMILES language.

Parameters
  • name (str) – name of the SMILESLanguage.

  • smiles_tokenizer (Tokenizer) – optional SMILES tokenization function. Defaults to tokenize_smiles, but tokenizer_name takes precedence when found in available TOKENIZER_FUNCTIONS.

  • tokenizer_name (str) – optional name mapping to Tokenizer. Defaults to None, i.e. using default smiles_tokenizer.

  • vocab_file (str) – optional filepath to vocab json or directory containing it.

  • max_token_sequence_length (int) – initial value for keeping track of longest sequence. Defaults to 0.

  • canonical (bool) – perform canonicalization of SMILES (one original string for one molecule). If True, the other transformations (augment etc., see below) do not apply. Defaults to False.

  • augment (bool) – perform SMILES augmentation. Defaults to False.

  • kekulize (bool) – kekulizes SMILES (implicit aromaticity only). Defaults to False.

  • all_bonds_explicit (bool) – Makes all bonds explicit. Defaults to False, only applies if kekulize = True.

  • all_hs_explicit (bool) – Makes all hydrogens explicit. Defaults to False, only applies if kekulize = True.

  • randomize (bool) – perform a true randomization of SMILES tokens. Defaults to False.

  • remove_bonddir (bool) – Remove directional info of bonds. Defaults to False.

  • remove_chirality (bool) – Remove chirality information. Defaults to False.

  • selfies (bool) – Whether selfies is used instead of smiles, defaults to False.

  • sanitize (bool) – Sanitize SMILES. Defaults to True.

  • add_start_and_stop (bool) – add start and stop token indexes. Defaults to False.

  • padding (bool) – pad sequences from the left to matching length. Defaults to False.

  • padding_length (int) – common length of all token sequences, applies only if padding is True. See set_max_padding to set it to the longest token sequence the smiles language has encountered. Defaults to None.

  • device (Any) – Deprecated argument that will be removed in the future.
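The effect of the add_start_and_stop, padding and padding_length parameters can be sketched on plain lists of indexes. The function name, the special token indexes, and the order of the two steps (start/stop first, then left padding) are assumptions made for illustration. The sketch also raises on sequences longer than padding_length, which is a choice made here and not necessarily what the library does.

```python
PAD, START, STOP = 0, 2, 3  # hypothetical special token indexes


def encode_with_transforms(token_indexes, add_start_and_stop=False,
                           padding=False, padding_length=None):
    indexes = list(token_indexes)
    if add_start_and_stop:
        # Wrap the sequence in start and stop indexes.
        indexes = [START] + indexes + [STOP]
    if padding:
        if padding_length is None:
            raise ValueError("padding_length is required when padding=True")
        if len(indexes) > padding_length:
            # Sketch-only choice: refuse instead of truncating.
            raise ValueError("sequence longer than padding_length")
        # Pad from the left up to the common length.
        indexes = [PAD] * (padding_length - len(indexes)) + indexes
    return indexes


plain = encode_with_transforms([4, 4, 5])
padded = encode_with_transforms(
    [4, 4, 5], add_start_and_stop=True, padding=True, padding_length=8
)
```

Left padding keeps the end of every sequence aligned at the last position, so padded here carries three padding indexes in front of the five real ones.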

Note

See set_smiles_transforms and set_encoding_transforms to change the transforms temporarily and reset with reset_initial_transforms. Assignment of class attributes in the parameter list will trigger such a reset.

set_max_padding()

Set padding_length to a value that does not truncate any sequence. Requires an up-to-date max_token_sequence_length.

Raises

UnknownMaxLengthError – When max_token_sequence_length is 0 because no SMILES were added to the language.
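The guard described above can be sketched with a local stand-in class. ToyTokenizer and its observe helper are hypothetical, the exception is redefined locally so the example is self-contained, and any extra positions that start and stop tokens might need are ignored here.

```python
class UnknownMaxLengthError(RuntimeError):
    """Local stand-in mirroring the documented exception."""


class ToyTokenizer:
    """Hypothetical object tracking the longest token sequence seen."""

    def __init__(self):
        self.max_token_sequence_length = 0
        self.padding_length = None

    def observe(self, tokens):
        # Hypothetical helper: record the length of a token sequence.
        self.max_token_sequence_length = max(
            self.max_token_sequence_length, len(tokens)
        )

    def set_max_padding(self):
        # Without any observed sequence the maximum length is unknown.
        if self.max_token_sequence_length == 0:
            raise UnknownMaxLengthError(
                "no SMILES were added, max_token_sequence_length is 0"
            )
        self.padding_length = self.max_token_sequence_length


tokenizer = ToyTokenizer()
try:
    tokenizer.set_max_padding()
    failed = False
except UnknownMaxLengthError:
    failed = True

tokenizer.observe(["C", "C", "O"])
tokenizer.set_max_padding()
```

Calling set_max_padding before any sequence is observed fails loudly, which avoids silently padding to a length of zero.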

reset_initial_transforms()

Reset smiles and token indexes transforms as on initialization.

set_smiles_transforms(canonical=None, augment=None, kekulize=None, all_bonds_explicit=None, all_hs_explicit=None, remove_bonddir=None, remove_chirality=None, selfies=None, sanitize=None)

Helper function to reversibly change steps of the SMILES transforms. Use reset_initial_transforms to restore the initial settings.

set_encoding_transforms(randomize=None, add_start_and_stop=None, padding=None, padding_length=None)

Helper function to reversibly change steps of the encoding transforms. Use reset_initial_transforms to restore the initial settings.
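The pattern of temporarily overriding settings and later resetting them can be sketched with a small standalone class. ToySettings and its set_transforms method are hypothetical and reduced to two flags. Treating None as "keep the current value" is an assumption suggested by the None defaults in the signatures above.

```python
class ToySettings:
    """Hypothetical holder of transform flags with a reset."""

    def __init__(self, canonical=False, padding=False):
        # Remember the values given at initialization.
        self._initial = {"canonical": canonical, "padding": padding}
        self.reset_initial_transforms()

    def reset_initial_transforms(self):
        # Restore the flags as they were on initialization.
        self.current = dict(self._initial)

    def set_transforms(self, canonical=None, padding=None):
        # None is assumed to mean "keep the current value".
        for key, value in (("canonical", canonical), ("padding", padding)):
            if value is not None:
                self.current[key] = value


settings = ToySettings()
settings.set_transforms(canonical=True)
changed = dict(settings.current)
settings.reset_initial_transforms()
restored = dict(settings.current)
```

Only the flag passed explicitly changes, and the reset brings every flag back to its initial value.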