pytoda.smiles.smiles_language module

SMILES language handling.

Summary

Exceptions:

UnknownMaxLengthError

Classes:

`SELFIESLanguage`	SELFIESLanguage is a SMILESLanguage with a different default tokenizer, transforming SMILES to SELFIES.
`SMILESLanguage`	SMILESLanguage class.
`SMILESTokenizer`	SMILESTokenizer class, based on SMILESLanguage, applying transforms and encoding of a SMILES string to a sequence of token indexes.

Reference

exception UnknownMaxLengthError

Bases: RuntimeError

class SMILESLanguage(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0)

Bases: object

SMILESLanguage class.

SMILESLanguage handles SMILES data, defining the vocabulary and the utilities to manipulate it, including encoding to token indexes.

vocab_files_names = {'vocab_file': 'vocab.json'}

__init__(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0)

Initialize SMILES language.

Parameters

name (str) – name of the SMILESLanguage.
smiles_tokenizer (Tokenizer) – optional SMILES tokenization function. Defaults to tokenize_smiles, but tokenizer_name takes precedence when found in available TOKENIZER_FUNCTIONS.
tokenizer_name (str) – name mapping to a Tokenizer, used to save and restore the object from text files. Defaults to None, i.e. using the default smiles_tokenizer. Examples of available names are ‘smiles’, ‘selfies’ or ‘spe_smiles’.
vocab_file (str) – optional filepath to vocab json or directory containing it.
max_token_sequence_length (int) – initial value for keeping track of longest sequence. Defaults to 0.
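
The precedence between smiles_tokenizer and tokenizer_name described above can be pictured with a minimal sketch. This is not pytoda's implementation: TOKENIZER_FUNCTIONS_SKETCH, resolve_tokenizer and the two toy tokenizers are hypothetical stand-ins that only mirror the stated rule (a known name wins, otherwise the passed function is used).

```python
from typing import Callable, Dict, List, Optional

Tokenizer = Callable[[str], List[str]]


def char_tokenizer(smiles: str) -> List[str]:
    """Toy tokenizer (hypothetical): one token per character."""
    return list(smiles)


def whole_tokenizer(smiles: str) -> List[str]:
    """Toy tokenizer (hypothetical): the whole string as a single token."""
    return [smiles]


# Hypothetical registry standing in for TOKENIZER_FUNCTIONS.
TOKENIZER_FUNCTIONS_SKETCH: Dict[str, Tokenizer] = {'chars': char_tokenizer}


def resolve_tokenizer(
    smiles_tokenizer: Tokenizer, tokenizer_name: Optional[str] = None
) -> Tokenizer:
    """Return the registered tokenizer if the name is known, else the function."""
    if tokenizer_name is not None and tokenizer_name in TOKENIZER_FUNCTIONS_SKETCH:
        return TOKENIZER_FUNCTIONS_SKETCH[tokenizer_name]
    return smiles_tokenizer
```

With this sketch, resolve_tokenizer(whole_tokenizer, 'chars') returns char_tokenizer, while an unknown or missing name falls back to whole_tokenizer.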

setup_vocab()

Sets up the vocab by generating the special tokens.

Return type: None

static load(filepath)

Static method to load a SMILESLanguage object.

Parameters: filepath (str) – path to the file.
Returns: the loaded SMILES language object.
Return type: SMILESLanguage

static dump(smiles_language, filepath)

Static method to save a smiles_language object to disk.

Parameters

smiles_language (SMILESLanguage) – a SMILESLanguage object.
filepath (str) – path where to dump the SMILESLanguage.

save(filepath)

Instance method to save/dump smiles language object.

Parameters: filepath (str) – path where to save the SMILESLanguage.

load_vocabulary(vocab_file)

Load a vocabulary mapping from token to token indexes.

Parameters: vocab_file (str) – a .json with tokens mapping to index. Can also be path to directory.

classmethod from_pretrained(pretrained_path, *init_inputs, **kwargs)

save_vocabulary(vocab_file)

Save the vocabulary mapping tokens to indexes to file.

Parameters: vocab_file (str) – a .json to save tokens mapping to index. Can also be path to directory.
Return type: Tuple[str]
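
As a rough illustration of the vocabulary file handling described for load_vocabulary and save_vocabulary, the sketch below writes and reads a token-to-index mapping as JSON, accepting either a file path or a directory. It is an assumption-laden stand-in, not the library code: the helper names are hypothetical, the tokens are illustrative, and only the file name vocab.json is taken from vocab_files_names.

```python
import json
import os
import tempfile
from typing import Dict, Tuple

VOCAB_FILE_NAME = 'vocab.json'  # taken from vocab_files_names above


def resolve_vocab_path(vocab_file: str) -> str:
    """Accept either a .json path or a directory that holds vocab.json."""
    if os.path.isdir(vocab_file):
        return os.path.join(vocab_file, VOCAB_FILE_NAME)
    return vocab_file


def save_vocab_sketch(token_to_index: Dict[str, int], vocab_file: str) -> Tuple[str]:
    """Write the mapping as JSON and return the written path in a tuple."""
    path = resolve_vocab_path(vocab_file)
    with open(path, 'w', encoding='utf-8') as fp:
        json.dump(token_to_index, fp, indent=4)
    return (path,)


def load_vocab_sketch(vocab_file: str) -> Dict[str, int]:
    """Read the token-to-index mapping back from JSON."""
    with open(resolve_vocab_path(vocab_file), encoding='utf-8') as fp:
        return json.load(fp)


with tempfile.TemporaryDirectory() as directory:
    vocab = {'<PAD>': 0, 'C': 1, 'O': 2}  # illustrative tokens only
    (written_path,) = save_vocab_sketch(vocab, directory)
    restored = load_vocab_sketch(directory)
    written_name = os.path.basename(written_path)
```

Passing the directory to both helpers resolves to the same vocab.json file, so the mapping survives the round trip unchanged.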

save_pretrained(save_directory)

Save the tokenizer vocabulary files together with tokenizer instantiation positional and keywords inputs.

This method makes sure the full tokenizer can then be re-loaded using the from_pretrained class method.

add_smis(smi_filepaths, index_col=1, chunk_size=10000, name='SMILES', names=None)

Add a set of SMILES from a list of .smi files, applying transform_smiles.

Parameters

smi_filepaths (Files) – a list of paths to .smi files.
index_col (int) – Data column used for indexing, defaults to 1.
chunk_size (int) – size of the chunks. Defaults to 10000.
name (str) – type of dataset, used to index columns in smi, and must be in names. Defaults to ‘SMILES’.
names (Sequence[str]) – User-assigned names given to the columns. Defaults to [name].

Return type

None

add_smi(smi_filepath, index_col=1, chunk_size=10000, name='SMILES', names=None)

Add a set of SMILES from a .smi file, applying transform_smiles.

Parameters

smi_filepath (str) – path to the .smi file.
index_col (int) – Data column used for indexing, defaults to 1.
chunk_size (int) – number of rows to read in a chunk. Defaults to 10000.
name (str) – type of dataset, used to index columns in smi, and must be in names. Defaults to ‘SMILES’.
names (Sequence[str]) – User-assigned names given to the columns. Defaults to [name].

Return type

None

add_dataset(dataset)

Add a set of SMILES from an iterable, applying transform_smiles.

Collects and warns about invalid SMILES, and warns on finding new tokens.

Parameters: dataset (Iterable) – returning SMILES strings.

add_smiles(smiles)

Add a SMILES to the language.

Updates max_token_sequence_length. Adds missing tokens to the language.

Parameters: smiles (str) – a SMILES representation.
Return type: None
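
The bookkeeping described for add_smiles (tracking the longest token sequence and adding unseen tokens) might look roughly like the sketch below. ToyLanguage, toy_tokenize and the simplified regular expression are hypothetical; the real class uses its configured tokenizer and also manages special tokens, which are omitted here.

```python
import re
from typing import Dict, List

# Simplified, hypothetical SMILES token pattern: bracket atoms, two-letter
# halogens, two-digit ring closures, then any single character.
TOKEN_PATTERN = re.compile(r'\[[^\]]+\]|Br|Cl|%\d{2}|.')


def toy_tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens with the simplified pattern."""
    return TOKEN_PATTERN.findall(smiles)


class ToyLanguage:
    """Hypothetical stand-in that only tracks tokens and the longest sequence."""

    def __init__(self) -> None:
        self.token_to_index: Dict[str, int] = {}
        self.max_token_sequence_length = 0

    def add_token(self, token: str) -> None:
        # Assign the next free index to tokens not seen before.
        if token not in self.token_to_index:
            self.token_to_index[token] = len(self.token_to_index)

    def add_smiles(self, smiles: str) -> None:
        tokens = toy_tokenize(smiles)
        # Keep track of the longest token sequence encountered so far.
        self.max_token_sequence_length = max(
            self.max_token_sequence_length, len(tokens)
        )
        for token in tokens:
            self.add_token(token)


language = ToyLanguage()
language.add_smiles('CCO')
language.add_smiles('ClC(Br)=O')
```

After both calls the toy vocabulary holds seven distinct tokens, and max_token_sequence_length equals the token count of the longer string.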

add_token(token)

Add a token to the language.

Parameters: token (str) – a token.
Return type: None

smiles_to_token_indexes(smiles)

Transform character-level SMILES into a sequence of token indexes.

Parameters

smiles (str) – a SMILES (or SELFIES) representation.

Returns

indexes representation for the SMILES/SELFIES provided.

Return type

Union[Indexes, Tensor]

token_indexes_to_smiles(token_indexes)

Transform a sequence of token indexes into SMILES, ignoring special tokens.

Parameters: token_indexes (Union[Indexes, Tensor]) – Sequence of integers representing tokens in vocabulary.
Returns: a SMILES (or SELFIES) representation.
Return type: str
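
To make the encode and decode directions concrete, here is a small self-contained sketch of mapping tokens to indexes and back while skipping special tokens. The vocabulary, the special token names and their indexes are assumptions chosen for illustration and are not read from the library.

```python
from typing import Dict, List

# Illustrative vocabulary; special token names and indexes are assumptions.
TOKEN_TO_INDEX: Dict[str, int] = {
    '<PAD>': 0, '<START>': 1, '<STOP>': 2, 'C': 3, 'O': 4, '=': 5,
}
INDEX_TO_TOKEN = {index: token for token, index in TOKEN_TO_INDEX.items()}
SPECIAL_INDEXES = {0, 1, 2}


def encode(tokens: List[str]) -> List[int]:
    """Map each token to its index in the vocabulary."""
    return [TOKEN_TO_INDEX[token] for token in tokens]


def decode(token_indexes: List[int]) -> str:
    """Join tokens back into a string, ignoring special tokens."""
    return ''.join(
        INDEX_TO_TOKEN[index]
        for index in token_indexes
        if index not in SPECIAL_INDEXES
    )


# A left-padded sequence with start and stop markers around 'C=O'.
padded = [0, 0, 1] + encode(['C', '=', 'O']) + [2]
decoded = decode(padded)
```

Decoding drops the padding, start and stop indexes, so only the tokens of the molecule remain in the returned string.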

static tensor_to_indexes(token_indexes)

Utility to get Indexes from Tensors.

Parameters: token_indexes (Union[Indexes, Tensor]) – from single SMILES.
Raises: ValueError – in case the Tensor is not shaped correctly
Returns: list from Tensor or else the initial token_indexes.
Return type: Indexes

selfies_to_smiles(selfies)

SELFIES to SMILES converter method. Based on: https://arxiv.org/abs/1905.13741

Parameters: selfies (str) – SELFIES representation
Returns: A SMILES string
Return type: str

smiles_to_selfies(smiles)

SMILES to SELFIES converter method. Based on: https://arxiv.org/abs/1905.13741

Parameters: smiles (str) – SMILES representation
Returns: A SELFIES string
Return type: str

class SELFIESLanguage(name='selfies-language', vocab_file=None, max_token_sequence_length=0)

Bases: pytoda.smiles.smiles_language.SMILESLanguage

SELFIESLanguage is a SMILESLanguage with a different default tokenizer, transforming SMILES to SELFIES.

__init__(name='selfies-language', vocab_file=None, max_token_sequence_length=0)

Initialize SMILES language.

Parameters

name (str) – name of the SMILESLanguage.
vocab_file (str) – optional filepath to vocab json or directory containing it.
max_token_sequence_length (int) – initial value for keeping track of longest sequence. Defaults to 0.

class SMILESTokenizer(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, randomize=False, add_start_and_stop=False, padding=False, padding_length=None, device=None)

Bases: pytoda.smiles.smiles_language.SMILESLanguage

SMILESTokenizer class, based on SMILESLanguage, applying transforms and encoding of a SMILES string to a sequence of token indexes.

__init__(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, randomize=False, add_start_and_stop=False, padding=False, padding_length=None, device=None)

Initialize SMILES language.

Parameters

name (str) – name of the SMILESLanguage.
smiles_tokenizer (Tokenizer) – optional SMILES tokenization function. Defaults to tokenize_smiles, but tokenizer_name takes precedence when found in available TOKENIZER_FUNCTIONS.
tokenizer_name (str) – optional name mapping to Tokenizer. Defaults to None, i.e. using default smiles_tokenizer.
vocab_file (str) – optional filepath to vocab json or directory containing it.
max_token_sequence_length (int) – initial value for keeping track of longest sequence. Defaults to 0.
canonical (bool) – performs canonicalization of SMILES (one original string for one molecule). If True, the other transformations (augment etc., see below) do not apply. Defaults to False.
augment (bool) – perform SMILES augmentation. Defaults to False.
kekulize (bool) – kekulizes SMILES (implicit aromaticity only). Defaults to False.
all_bonds_explicit (bool) – Makes all bonds explicit. Only applies if kekulize is True. Defaults to False.
all_hs_explicit (bool) – Makes all hydrogens explicit. Only applies if kekulize is True. Defaults to False.
randomize (bool) – perform a true randomization of SMILES tokens. Defaults to False.
remove_bonddir (bool) – Remove directional info of bonds. Defaults to False.
remove_chirality (bool) – Remove chirality information. Defaults to False.
selfies (bool) – Whether selfies is used instead of smiles, defaults to False.
sanitize (bool) – Sanitize SMILES. Defaults to True.
add_start_and_stop (bool) – add start and stop token indexes. Defaults to False.
padding (bool) – pad sequences from the left to matching length. Defaults to False.
padding_length (int) – common length of all token sequences, applies only if padding is True. See set_max_padding to set it to longest token sequence the smiles language encountered. Defaults to None.
device (Any) – Deprecated argument that will be removed in the future.

Note

See set_smiles_transforms and set_encoding_transforms to change the transforms temporarily and reset with reset_initial_transforms. Assignment of class attributes in the parameter list will trigger such a reset.

set_max_padding()

Set padding_length to a value that does not truncate any sequence. Requires an up-to-date max_token_sequence_length.

Raises: UnknownMaxLengthError – When max_token_sequence_length is 0 because no SMILES were added to the language.
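
The interplay of padding, padding_length and set_max_padding can be sketched as follows. All names here are hypothetical stand-ins, the padding index 0 is an assumption, and the sketch ignores start and stop tokens as well as any truncation the library may apply; it only shows left padding to a common length and the error raised when no longest sequence is known.

```python
from typing import List


class UnknownMaxLengthSketch(RuntimeError):
    """Stand-in for UnknownMaxLengthError."""


def choose_padding_length(max_token_sequence_length: int) -> int:
    """Pick a padding length that does not truncate any sequence seen so far."""
    if max_token_sequence_length == 0:
        raise UnknownMaxLengthSketch(
            'no SMILES were added, so the longest sequence is unknown'
        )
    return max_token_sequence_length


def left_pad(
    token_indexes: List[int], padding_length: int, pad_index: int = 0
) -> List[int]:
    """Pad from the left; sequences already long enough are returned as they are."""
    missing = padding_length - len(token_indexes)
    return [pad_index] * max(missing, 0) + token_indexes
```

For example, left_pad([3, 4], choose_padding_length(4)) prepends two padding indexes, while choose_padding_length(0) raises the stand-in exception.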

reset_initial_transforms()

Reset SMILES and token index transforms to their state at initialization.

set_smiles_transforms(canonical=None, augment=None, kekulize=None, all_bonds_explicit=None, all_hs_explicit=None, remove_bonddir=None, remove_chirality=None, selfies=None, sanitize=None)

Helper function to reversibly change steps of the transforms.

set_encoding_transforms(randomize=None, add_start_and_stop=None, padding=None, padding_length=None)

Helper function to reversibly change steps of the transforms.
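
One way to picture the reversible set and reset pattern of these helpers is the sketch below. It is a hypothetical model, not the library code: TransformSettingsSketch covers only two options, and treating None as "leave this setting unchanged" is an assumption inferred from the None defaults in the signatures.

```python
from typing import Dict, Optional


class TransformSettingsSketch:
    """Hypothetical holder of initial and current transform settings."""

    def __init__(self, padding: bool = False, add_start_and_stop: bool = False) -> None:
        # Remember the values given at initialization so they can be restored.
        self._initial: Dict[str, bool] = {
            'padding': padding,
            'add_start_and_stop': add_start_and_stop,
        }
        self.current: Dict[str, bool] = dict(self._initial)

    def set_encoding_transforms(
        self,
        padding: Optional[bool] = None,
        add_start_and_stop: Optional[bool] = None,
    ) -> None:
        """Override only the options passed; None leaves a setting unchanged."""
        overrides = (('padding', padding), ('add_start_and_stop', add_start_and_stop))
        for key, value in overrides:
            if value is not None:
                self.current[key] = value

    def reset_initial_transforms(self) -> None:
        """Restore the settings given at initialization."""
        self.current = dict(self._initial)


settings = TransformSettingsSketch(padding=True)
settings.set_encoding_transforms(add_start_and_stop=True)
changed = dict(settings.current)
settings.reset_initial_transforms()
```

The temporary override changes only add_start_and_stop, and the reset brings back the values passed at construction.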