pytoda.datasets.smiles_dataset module

Datasets for SMILES and transformations of SMILES.

Summary

Classes:

SMILESDataset

Dataset of SMILES.

SMILESTokenizerDataset

Dataset of token indexes from SMILES.

Reference

class SMILESDataset(*smi_filepaths, backend='eager', name='smiles-dataset', device=None, **kwargs)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Dataset of SMILES.

__init__(*smi_filepaths, backend='eager', name='smiles-dataset', device=None, **kwargs)[source]

Initialize a SMILES dataset.

Parameters
  • smi_filepaths (Files) – paths to .smi files.

  • name (str) – name of the SMILESDataset.

  • backend (str) – memory management backend. Defaults to 'eager', preferring speed over memory consumption.

  • device (torch.device) – DEPRECATED

  • kwargs (dict) – additional arguments for dataset constructor.
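As a sketch of how such a dataset might be set up: a minimal .smi file holds one SMILES string per line, optionally followed by a tab-separated molecule ID (a common .smi convention). The SMILESDataset call below is hypothetical usage based on the signature above, and is left commented out since it requires pytoda to be installed.

```python
import os
import tempfile

# A minimal .smi file: one SMILES string per line, optionally followed
# by a tab-separated molecule ID (a common .smi convention).
path = os.path.join(tempfile.mkdtemp(), "molecules.smi")
with open(path, "w") as f:
    f.write("CCO\tethanol\n")
    f.write("c1ccccc1\tbenzene\n")
    f.write("CC(=O)O\tacetic_acid\n")

# Hypothetical usage, based only on the signature documented above
# (commented out so the sketch runs without pytoda installed):
# from pytoda.datasets import SMILESDataset
# dataset = SMILESDataset(path, backend="eager", name="demo-smiles")
# dataset[0] would then return the first SMILES string.
```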

class SMILESTokenizerDataset(*smi_filepaths, smiles_language=None, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, randomize=False, add_start_and_stop=False, padding=True, padding_length=None, vocab_file=None, iterate_dataset=True, backend='eager', device=None, name='smiles-encoder-dataset', **kwargs)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Dataset of token indexes from SMILES.

__init__(*smi_filepaths, smiles_language=None, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, randomize=False, add_start_and_stop=False, padding=True, padding_length=None, vocab_file=None, iterate_dataset=True, backend='eager', device=None, name='smiles-encoder-dataset', **kwargs)[source]

Initialize a dataset providing token indexes from source SMILES.

The dataset's transformations of SMILES and their encodings can be adapted, depending on the smiles_language used (see SMILESTokenizer).

Parameters
  • smi_filepaths (Files) – paths to .smi files.

  • smiles_language (SMILESLanguage) – a SMILES language that transforms and encodes SMILES to token indexes. Defaults to None, in which case a SMILESTokenizer is instantiated with the following arguments.

  • canonical (bool) – perform canonicalization of SMILES (a single unique string per molecule). If True, the other transformations (augment etc., see below) do not apply. Defaults to False.

  • augment (bool) – perform SMILES augmentation. Defaults to False.

  • kekulize (bool) – kekulizes SMILES (implicit aromaticity only). Defaults to False.

  • all_bonds_explicit (bool) – makes all bonds explicit. Defaults to False; applies only if kekulize is True.

  • all_hs_explicit (bool) – makes all hydrogens explicit. Defaults to False; applies only if kekulize is True.

  • randomize (bool) – perform a true randomization of SMILES tokens. Defaults to False.

  • remove_bonddir (bool) – Remove directional info of bonds. Defaults to False.

  • remove_chirality (bool) – Remove chirality information. Defaults to False.

  • selfies (bool) – whether SELFIES are used instead of SMILES. Defaults to False.

  • sanitize (bool) – RDKit sanitization of the molecule. Defaults to True.

  • add_start_and_stop (bool) – add start and stop token indexes. Defaults to False.

  • padding (bool) – pad sequences to the longest sequence in the SMILES language. Defaults to True.

  • padding_length (int) – pad to a manually set length; applies only if padding is True. Defaults to None.

  • vocab_file (str) – optional .json file to load the vocabulary from. Defaults to None.

  • iterate_dataset (bool) – whether to iterate over all SMILES in the dataset to extend/build the vocabulary, find the longest sequence, and check the passed padding length, if applicable. Defaults to True.

  • backend (str) – memory management backend. Defaults to 'eager', preferring speed over memory consumption.

  • name (str) – name of the SMILESTokenizerDataset.

  • device (torch.device) – DEPRECATED

  • kwargs (dict) – additional arguments for dataset constructor.
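The parameters above can be combined in one construction call. The following sketch writes a tiny .smi file with the stdlib and shows a hypothetical SMILESTokenizerDataset construction mirroring the documented signature; the pytoda call is commented out since it requires the library to be installed, and the chosen argument values are illustrative only.

```python
import os
import tempfile

# Two molecules in .smi format (SMILES, tab, ID per line).
path = os.path.join(tempfile.mkdtemp(), "demo.smi")
with open(path, "w") as f:
    f.write("CCO\tethanol\n")
    f.write("c1ccccc1\tbenzene\n")

# Hypothetical construction, mirroring the signature documented above
# (commented out so the sketch runs without pytoda installed):
# from pytoda.datasets import SMILESTokenizerDataset
# dataset = SMILESTokenizerDataset(
#     path,
#     augment=True,             # a new random SMILES permutation per access
#     add_start_and_stop=True,  # prepend/append start and stop tokens
#     padding=True,
#     padding_length=16,        # fixed-length sequences of token indexes
#     iterate_dataset=True,     # build the vocabulary from the file
#     name="demo-tokenizer-dataset",
# )
# dataset[0] would then be a sequence of token indexes for "CCO".
```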