pytoda.datasets.smiles_dataset module

Datasets for SMILES and transformations of SMILES.

Summary

Classes:

SMILESDataset

Dataset of SMILES.

SMILESTokenizerDataset

Dataset of token indexes from SMILES.

Reference

class SMILESDataset(*smi_filepaths, backend='eager', name='smiles-dataset', device=None, **kwargs)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Dataset of SMILES.

__init__(*smi_filepaths, backend='eager', name='smiles-dataset', device=None, **kwargs)[source]

Initialize a SMILES dataset.

Parameters
  • smi_filepaths (Files) – paths to .smi files.

  • name (str) – name of the SMILESDataset.

  • backend (str) – memory management backend. Defaults to 'eager', preferring speed over memory consumption.

  • device (torch.device) – DEPRECATED

  • kwargs (dict) – additional arguments for dataset constructor.
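As a sketch of how such a dataset might be set up: a minimal .smi file holds one SMILES string per line, optionally followed by a tab-separated molecule ID (a common .smi convention). The SMILESDataset call below is hypothetical usage based on the signature above, and is left commented out since it requires pytoda to be installed.

```python
import os
import tempfile

# A minimal .smi file: one SMILES string per line, optionally followed
# by a tab-separated molecule ID (a common .smi convention).
path = os.path.join(tempfile.mkdtemp(), "molecules.smi")
with open(path, "w") as f:
    f.write("CCO\tethanol\n")
    f.write("c1ccccc1\tbenzene\n")
    f.write("CC(=O)O\tacetic_acid\n")

# Hypothetical usage, based only on the signature documented above
# (commented out so the sketch runs without pytoda installed):
# from pytoda.datasets import SMILESDataset
# dataset = SMILESDataset(path, backend="eager", name="demo-smiles")
# dataset[0] would then return the first SMILES string.
```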

class SMILESTokenizerDataset(*smi_filepaths, smiles_language=None, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, randomize=False, add_start_and_stop=False, padding=True, padding_length=None, vocab_file=None, iterate_dataset=True, backend='eager', device=None, name='smiles-encoder-dataset', **kwargs)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Dataset of token indexes from SMILES.

__init__(*smi_filepaths, smiles_language=None, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, randomize=False, add_start_and_stop=False, padding=True, padding_length=None, vocab_file=None, iterate_dataset=True, backend='eager', device=None, name='smiles-encoder-dataset', **kwargs)[source]

Initialize a dataset providing token indexes from source SMILES.

The dataset's transformations of SMILES and their encodings can be adapted, depending on the smiles_language used (see SMILESTokenizer).

Parameters
  • smi_filepaths (Files) – paths to .smi files.

  • smiles_language (SMILESLanguage) – a SMILES language that transforms and encodes SMILES to token indexes. Defaults to None, in which case a SMILESTokenizer is instantiated with the following arguments.

  • canonical (bool) – perform canonicalization of SMILES (a single unique string per molecule). If True, the other transformations (augment etc., see below) do not apply. Defaults to False.

  • augment (bool) – perform SMILES augmentation. Defaults to False.

  • kekulize (bool) – kekulizes SMILES (implicit aromaticity only). Defaults to False.

  • all_bonds_explicit (bool) – makes all bonds explicit. Defaults to False; applies only if kekulize is True.

  • all_hs_explicit (bool) – makes all hydrogens explicit. Defaults to False; applies only if kekulize is True.

  • randomize (bool) – perform a true randomization of SMILES tokens. Defaults to False.

  • remove_bonddir (bool) – Remove directional info of bonds. Defaults to False.

  • remove_chirality (bool) – Remove chirality information. Defaults to False.

  • selfies (bool) – whether SELFIES are used instead of SMILES. Defaults to False.

  • sanitize (bool) – RDKit sanitization of the molecule. Defaults to True.

  • add_start_and_stop (bool) – add start and stop token indexes. Defaults to False.

  • padding (bool) – pad sequences to the longest sequence in the SMILES language. Defaults to True.

  • padding_length (int) – pad to a manually set length; applies only if padding is True. Defaults to None.

  • vocab_file (str) – optional .json file to load the vocabulary from. Defaults to None.

  • iterate_dataset (bool) – whether to iterate over all SMILES in the dataset to extend/build the vocabulary, find the longest sequence, and check the passed padding length, if applicable. Defaults to True.

  • backend (str) – memory management backend. Defaults to 'eager', preferring speed over memory consumption.

  • name (str) – name of the SMILESTokenizerDataset.

  • device (torch.device) – DEPRECATED

  • kwargs (dict) – additional arguments for dataset constructor.
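The parameters above can be combined in one construction call. The following sketch writes a tiny .smi file with the stdlib and shows a hypothetical SMILESTokenizerDataset construction mirroring the documented signature; the pytoda call is commented out since it requires the library to be installed, and the chosen argument values are illustrative only.

```python
import os
import tempfile

# Two molecules in .smi format (SMILES, tab, ID per line).
path = os.path.join(tempfile.mkdtemp(), "demo.smi")
with open(path, "w") as f:
    f.write("CCO\tethanol\n")
    f.write("c1ccccc1\tbenzene\n")

# Hypothetical construction, mirroring the signature documented above
# (commented out so the sketch runs without pytoda installed):
# from pytoda.datasets import SMILESTokenizerDataset
# dataset = SMILESTokenizerDataset(
#     path,
#     augment=True,             # a new random SMILES permutation per access
#     add_start_and_stop=True,  # prepend/append start and stop tokens
#     padding=True,
#     padding_length=16,        # fixed-length sequences of token indexes
#     iterate_dataset=True,     # build the vocabulary from the file
#     name="demo-tokenizer-dataset",
# )
# dataset[0] would then be a sequence of token indexes for "CCO".
```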