pytoda.smiles.smiles_language module
SMILES language handling.
Summary
Exceptions:
Classes:
SELFIESLanguage: a SMILESLanguage with a different default tokenizer, transforming SMILES to SELFIES.
SMILESLanguage: SMILESLanguage class.
SMILESTokenizer: SMILESTokenizer class, based on SMILESLanguage, applying transforms and encoding SMILES strings to sequences of token indexes.
Reference
-
class
SMILESLanguage
(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0)[source]¶ Bases:
object
SMILESLanguage class.
SMILESLanguage handles SMILES data, defining the vocabulary and the utilities to manipulate it, including encoding to token indexes.
-
vocab_files_names
= {'vocab_file': 'vocab.json'}¶
-
__init__
(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0)[source]¶ Initialize SMILES language.
- Parameters
name (str) – name of the SMILESLanguage.
smiles_tokenizer (Tokenizer) – optional SMILES tokenization function. Defaults to tokenize_smiles, but tokenizer_name takes precedence when found in available TOKENIZER_FUNCTIONS.
tokenizer_name (str) – name, mapping to Tokenizer used to save and restore object from text files. Defaults to None, i.e. using default smiles_tokenizer. Examples for available names are ‘smiles’, ‘selfies’ or ‘spe_smiles’.
vocab_file (str) – optional filepath to vocab json or directory containing it.
max_token_sequence_length (int) – initial value for keeping track of longest sequence. Defaults to 0.
-
static
load
(filepath)[source]¶ Static method to load a SMILESLanguage object.
- Parameters
filepath (str) – path to the file.
- Returns
the loaded SMILES language object.
- Return type
SMILESLanguage
-
static
dump
(smiles_language, filepath)[source]¶ Static method to save a smiles_language object to disk.
- Parameters
smiles_language (SMILESLanguage) – a SMILESLanguage object.
filepath (str) – path where to dump the SMILESLanguage.
-
save
(filepath)[source]¶ Instance method to save/dump smiles language object.
- Parameters
filepath (str) – path where to save the SMILESLanguage.
-
load_vocabulary
(vocab_file)[source]¶ Load a vocabulary mapping from token to token indexes.
- Parameters
vocab_file (str) – a .json with tokens mapping to index. Can also be path to directory.
-
save_vocabulary
(vocab_file)[source]¶ Save the vocabulary mapping tokens to indexes to file.
- Parameters
vocab_file (str) – a .json to save tokens mapping to index. Can also be path to directory.
- Return type
Tuple[str]
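The vocabulary round trip described by load_vocabulary and save_vocabulary can be illustrated with a minimal standard-library sketch. It is not pytoda's implementation: the helper names (resolve_vocab_path, save_vocab, load_vocab) and the special token names in the toy vocabulary are hypothetical. The default file name is taken from vocab_files_names above, and returning a one-element tuple mirrors the documented Tuple[str] return type.

```python
import json
import os
import tempfile

# File name taken from vocab_files_names above.
VOCAB_FILE_NAME = "vocab.json"


def resolve_vocab_path(path):
    """Return the JSON path, appending the default file name for directories."""
    if os.path.isdir(path):
        return os.path.join(path, VOCAB_FILE_NAME)
    return path


def save_vocab(token_to_index, path):
    """Write the token -> index mapping as JSON and return the written path(s)."""
    target = resolve_vocab_path(path)
    with open(target, "w", encoding="utf-8") as handle:
        json.dump(token_to_index, handle, ensure_ascii=False)
    return (target,)


def load_vocab(path):
    """Read a token -> index mapping back from a JSON file or its directory."""
    with open(resolve_vocab_path(path), encoding="utf-8") as handle:
        return json.load(handle)


# Toy vocabulary; the special token names are assumptions for illustration.
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "C": 2, "O": 3}
with tempfile.TemporaryDirectory() as directory:
    written = save_vocab(toy_vocab, directory)
    restored = load_vocab(directory)
```

Passing a directory rather than a file path exercises the "can also be path to directory" case from the parameter descriptions.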
-
save_pretrained
(save_directory)[source]¶ Save the tokenizer vocabulary files together with tokenizer instantiation positional and keywords inputs.
This method makes sure the full tokenizer can then be re-loaded using the from_pretrained class method.
-
add_smis
(smi_filepaths, index_col=1, chunk_size=10000, name='SMILES', names=None)[source]¶ Add a set of SMILES from a list of .smi files, applying transform_smiles.
- Parameters
smi_filepaths (Files) – a list of paths to .smi files.
index_col (int) – Data column used for indexing, defaults to 1.
chunk_size (int) – size of the chunks. Defaults to 10000.
name (str) – type of dataset, used to index columns in smi, and must be in names. Defaults to ‘SMILES’.
names (Sequence[str]) – User-assigned names given to the columns. Defaults to [name].
- Return type
None
-
add_smi
(smi_filepath, index_col=1, chunk_size=10000, name='SMILES', names=None)[source]¶ Add a set of SMILES from a .smi file, applying transform_smiles.
- Parameters
smi_filepath (str) – path to the .smi file.
index_col (int) – Data column used for indexing, defaults to 1.
chunk_size (int) – number of rows to read in a chunk. Defaults to 10000.
name (str) – type of dataset, used to index columns in smi, and must be in names. Defaults to ‘SMILES’.
names (Sequence[str]) – User-assigned names given to the columns. Defaults to [name].
- Return type
None
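Reading a .smi file in chunks, as the chunk_size parameter suggests, might look like the following standard-library sketch. It assumes a tab-separated layout with the SMILES string in column 0 and an identifier in column 1; that layout, the helper name iter_smiles_chunks and its smiles_col parameter are assumptions made for illustration, and pytoda's own reader may work differently.

```python
import csv
import itertools
import os
import tempfile


def iter_smiles_chunks(smi_filepath, chunk_size=10000, smiles_col=0):
    """Yield lists of SMILES strings, reading at most chunk_size rows at a time."""
    with open(smi_filepath, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        while True:
            rows = list(itertools.islice(reader, chunk_size))
            if not rows:
                break
            # Skip blank lines, keep only the SMILES column.
            yield [row[smiles_col] for row in rows if row]


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "toy.smi")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("CCO\tethanol\nC=O\tformaldehyde\nCC(=O)O\tacetic_acid\n")
    chunks = list(iter_smiles_chunks(path, chunk_size=2))
```

Each yielded chunk could then be handed to something like add_dataset, so that memory use stays bounded by the chunk size rather than the file size.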
-
add_dataset
(dataset)[source]¶ Add a set of SMILES from an iterable, applying transform_smiles.
Collects and warns about invalid SMILES, and warns on finding new tokens.
- Parameters
dataset (Iterable) – returning SMILES strings.
-
add_smiles
(smiles)[source]¶ Add a SMILES to the language.
Updates max_token_sequence_length. Adds missing tokens to the language.
- Parameters
smiles (str) – a SMILES representation.
- Return type
None
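The bookkeeping described for add_smiles and add_token can be sketched with a toy class. ToyLanguage is a hypothetical stand-in, not pytoda's SMILESLanguage, and it splits SMILES character by character, whereas a real SMILES tokenizer would treat multi-character atoms such as Cl or bracket atoms as single tokens.

```python
class ToyLanguage:
    """Illustrative stand-in for the vocabulary bookkeeping, not pytoda code."""

    def __init__(self, tokenizer=list):
        # list("CCO") -> ["C", "C", "O"]: a naive character-level tokenizer.
        self.tokenizer = tokenizer
        self.token_to_index = {}
        self.max_token_sequence_length = 0

    def add_token(self, token):
        """Assign the next free index to a token that is not yet known."""
        if token not in self.token_to_index:
            self.token_to_index[token] = len(self.token_to_index)

    def add_smiles(self, smiles):
        """Track the longest token sequence and add any missing tokens."""
        tokens = self.tokenizer(smiles)
        self.max_token_sequence_length = max(
            self.max_token_sequence_length, len(tokens)
        )
        for token in tokens:
            self.add_token(token)


language = ToyLanguage()
for smiles in ["CCO", "C=O", "CC(=O)O"]:
    language.add_smiles(smiles)
```

After the loop, the toy vocabulary holds five tokens and the longest sequence seen has seven tokens (from "CC(=O)O").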
-
add_token
(token)[source]¶ Add a token to the language.
- Parameters
token (str) – a token.
- Return type
None
-
smiles_to_token_indexes
(smiles)[source]¶ Transform character-level SMILES into a sequence of token indexes.
- Parameters
smiles (str) – a SMILES (or SELFIES) representation.
- Returns
indexes representation for the SMILES/SELFIES provided.
- Return type
Union[Indexes, Tensor]
-
token_indexes_to_smiles
(token_indexes)[source]¶ Transform a sequence of token indexes into SMILES, ignoring special tokens.
- Parameters
token_indexes (Union[Indexes, Tensor]) – Sequence of integers representing tokens in vocabulary.
- Returns
a SMILES (or SELFIES) representation.
- Return type
str
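Encoding to token indexes and decoding back while ignoring special tokens can be sketched as follows. The special token names, their index positions and the character-level splitting are assumptions for illustration only; pytoda's actual vocabulary and tokenizer may differ.

```python
# Hypothetical special tokens occupying the first indexes of the vocabulary.
SPECIAL_TOKENS = ("<PAD>", "<UNK>", "<START>", "<STOP>")
REGULAR_TOKENS = ("C", "O", "=", "(", ")")

vocab = {token: index for index, token in enumerate(SPECIAL_TOKENS + REGULAR_TOKENS)}
index_to_token = {index: token for token, index in vocab.items()}
special_indexes = {vocab[token] for token in SPECIAL_TOKENS}


def encode(smiles, add_start_and_stop=False):
    """Map each character to its index, falling back to the unknown token."""
    indexes = [vocab.get(char, vocab["<UNK>"]) for char in smiles]
    if add_start_and_stop:
        indexes = [vocab["<START>"]] + indexes + [vocab["<STOP>"]]
    return indexes


def decode(token_indexes):
    """Join tokens back into a string, skipping special tokens."""
    return "".join(
        index_to_token[index]
        for index in token_indexes
        if index not in special_indexes
    )


encoded = encode("CC(=O)O", add_start_and_stop=True)
decoded = decode(encoded)
```

Because decoding drops the start and stop indexes, the round trip returns the original string for any input made of known tokens.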
-
static
tensor_to_indexes
(token_indexes)[source]¶ Utility to get Indexes from Tensors.
- Parameters
token_indexes (Union[Indexes, Tensor]) – from single SMILES.
- Raises
ValueError – in case the Tensor is not shaped correctly
- Returns
list from Tensor or else the initial token_indexes.
- Return type
Indexes
-
selfies_to_smiles
(selfies)[source]¶ SELFIES to SMILES converter method. Based on: https://arxiv.org/abs/1905.13741
- Parameters
selfies (str) – SELFIES representation
- Returns
A SMILES string
- Return type
str
-
smiles_to_selfies
(smiles)[source]¶ SMILES to SELFIES converter method. Based on: https://arxiv.org/abs/1905.13741
- Parameters
smiles (str) – SMILES representation
- Returns
A SELFIES string
- Return type
str
-
-
class
SELFIESLanguage
(name='selfies-language', vocab_file=None, max_token_sequence_length=0)[source]¶ Bases:
pytoda.smiles.smiles_language.SMILESLanguage
SELFIESLanguage is a SMILESLanguage with a different default tokenizer, transforming SMILES to SELFIES.
-
__init__
(name='selfies-language', vocab_file=None, max_token_sequence_length=0)[source]¶ Initialize SMILES language.
- Parameters
name (str) – name of the SMILESLanguage.
vocab_file (str) – optional filepath to vocab json or directory containing it.
max_token_sequence_length (int) – initial value for keeping track of longest sequence. Defaults to 0.
-
-
class
SMILESTokenizer
(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, randomize=False, add_start_and_stop=False, padding=False, padding_length=None, device=None)[source]¶ Bases:
pytoda.smiles.smiles_language.SMILESLanguage
SMILESTokenizer class, based on SMILESLanguage, applying transforms and encoding SMILES strings to sequences of token indexes.
-
__init__
(name='smiles-language', smiles_tokenizer=<function tokenize_smiles>, tokenizer_name=None, vocab_file=None, max_token_sequence_length=0, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, randomize=False, add_start_and_stop=False, padding=False, padding_length=None, device=None)[source]¶ Initialize SMILES language.
- Parameters
name (str) – name of the SMILESLanguage.
smiles_tokenizer (Tokenizer) – optional SMILES tokenization function. Defaults to tokenize_smiles, but tokenizer_name takes precedence when found in available TOKENIZER_FUNCTIONS.
tokenizer_name (str) – optional name mapping to Tokenizer. Defaults to None, i.e. using default smiles_tokenizer.
vocab_file (str) – optional filepath to vocab json or directory containing it.
max_token_sequence_length (int) – initial value for keeping track of longest sequence. Defaults to 0.
canonical (bool) – performs canonicalization of SMILES (one original string for one molecule). If True, the other transformations (augment etc., see below) do not apply. Defaults to False.
augment (bool) – perform SMILES augmentation. Defaults to False.
kekulize (bool) – kekulizes SMILES (implicit aromaticity only). Defaults to False.
all_bonds_explicit (bool) – Makes all bonds explicit. Defaults to False, only applies if kekulize = True.
all_hs_explicit (bool) – Makes all hydrogens explicit. Defaults to False, only applies if kekulize = True.
randomize (bool) – perform a true randomization of SMILES tokens. Defaults to False.
remove_bonddir (bool) – Remove directional info of bonds. Defaults to False.
remove_chirality (bool) – Remove chirality information. Defaults to False.
selfies (bool) – Whether selfies is used instead of smiles, defaults to False.
sanitize (bool) – Sanitize SMILES. Defaults to True.
add_start_and_stop (bool) – add start and stop token indexes. Defaults to False.
padding (bool) – pad sequences from the left to matching length. Defaults to False.
padding_length (int) – common length of all token sequences, applies only if padding is True. See set_max_padding to set it to longest token sequence the smiles language encountered. Defaults to None.
device (Any) – Deprecated argument that will be removed in the future.
Note
See set_smiles_transforms and set_encoding_transforms to change the transforms temporarily and reset with reset_initial_transforms. Assignment of class attributes in the parameter list will trigger such a reset.
-
set_max_padding
()[source]¶ Set padding_length that does not truncate any sequence. Requires updated max_token_sequence_length.
- Raises
UnknownMaxLengthError – When max_token_sequence_length is 0 because no SMILES were added to the language.
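The interplay between padding from the left, padding_length and set_max_padding can be illustrated with a small sketch. ToyUnknownMaxLengthError is a local stand-in for the documented exception, and PAD_INDEX, resolve_padding_length and pad_left are hypothetical names. How pytoda treats sequences longer than padding_length is not covered by this sketch, which simply rejects them.

```python
class ToyUnknownMaxLengthError(RuntimeError):
    """Local stand-in for the documented UnknownMaxLengthError."""


PAD_INDEX = 0  # hypothetical index of the padding token


def resolve_padding_length(max_token_sequence_length):
    """Pick a padding length that fits the longest sequence seen so far."""
    if max_token_sequence_length == 0:
        raise ToyUnknownMaxLengthError(
            "max_token_sequence_length is 0; add SMILES to the language first."
        )
    return max_token_sequence_length


def pad_left(token_indexes, padding_length):
    """Prepend padding indexes until the sequence reaches padding_length."""
    missing = padding_length - len(token_indexes)
    if missing < 0:
        raise ValueError("sequence is longer than padding_length")
    return [PAD_INDEX] * missing + list(token_indexes)


sequences = [[4, 4, 5], [4, 6, 5, 5, 4]]
padding_length = resolve_padding_length(max(len(seq) for seq in sequences))
padded = [pad_left(seq, padding_length) for seq in sequences]
```

Deriving the padding length from the longest known sequence is what guarantees that no sequence needs to be shortened.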
-