pytoda.datasets.polymer_dataset module
PolymerTokenizerDataset module.
Reference
class PolymerTokenizerDataset(*smi_filepaths, entity_names, annotations_filepath, annotations_column_names=None, smiles_language=None, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, randomize=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, padding=True, padding_length=None, iterate_dataset=True, backend='eager', device=None, **kwargs)
Bases: Generic[torch.utils.data.dataset.T_co]
Dataset of SMILES from multiple entities encoded as token indexes.
Creates a tuple of SMILES datasets, one per given entity (i.e. molecule class, e.g. monomers and catalysts). The annotations dataframe needs one column per entity, named identically to that entity, whose values map to SMILES in the corresponding dataset.
Uses a PolymerTokenizer.
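The relationship between the entity files and the annotations can be sketched with the standard library alone. This illustrates the data layout only and is not pytoda's loading code: the entity names, IDs, SMILES strings, the `yield` column, the helper `load_id_to_smiles`, and the tab-separated `SMILES<TAB>ID` layout of the .smi content are all assumptions made for the example.

```python
import csv
import io

# Hypothetical .smi contents, one per entity (assumed layout: SMILES<TAB>ID).
monomer_smi = "C=C\tethylene\nC=CC\tpropylene\n"
catalyst_smi = "Cl[Ti](Cl)(Cl)Cl\tTiCl4\n"

# Hypothetical annotations .csv: one column per entity name plus label columns.
annotations_csv = (
    "monomer,catalyst,yield\n"
    "ethylene,TiCl4,0.82\n"
    "propylene,TiCl4,0.64\n"
)

entity_names = ["monomer", "catalyst"]


def load_id_to_smiles(text):
    """Map each ID to its SMILES string (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return {row[1]: row[0] for row in reader}


# One ID -> SMILES lookup per entity, in the same order as entity_names.
lookups = dict(
    zip(entity_names, [load_id_to_smiles(monomer_smi), load_id_to_smiles(catalyst_smi)])
)
rows = list(csv.DictReader(io.StringIO(annotations_csv)))

# Mirrors annotations_column_names=None: every non-entity column is a label.
label_columns = [column for column in rows[0] if column not in entity_names]

# Each sample pairs one SMILES per entity with the row's label values.
samples = [
    (
        tuple(lookups[name][row[name]] for name in entity_names),
        tuple(float(row[column]) for column in label_columns),
    )
    for row in rows
]
print(label_columns)
print(samples[0])
```

In the dataset itself the SMILES strings are additionally transformed and encoded as token indexes; the sketch stops at the lookup step to keep the mapping visible.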
__init__(*smi_filepaths, entity_names, annotations_filepath, annotations_column_names=None, smiles_language=None, canonical=False, augment=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, randomize=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, padding=True, padding_length=None, iterate_dataset=True, backend='eager', device=None, **kwargs)
Initialize a Polymer dataset.
All SMILES dataset parameters can be controlled either separately for each dataset (by an iterable of the correct length) or globally (bool/int).
- Parameters
smi_filepaths (Files) – paths to .smi files, one per entity
entity_names (Sequence[str]) – List of chemical entities.
annotations_filepath (str) – Path to .csv with the IDs of the chemical entities and their properties. Needs to have one column per entity name.
annotations_column_names (Union[List[int], List[str]]) – indexes (positional or strings) for the annotations. Defaults to None, in which case all columns except the entity_names are used as annotation labels.
smiles_language (PolymerTokenizer) – a polymer language. Defaults to None, in which case a new object is created.
padding (Union[Sequence[bool], bool]) – pad sequences to the longest sequence in the SMILES language. Defaults to True. Controlled either for each dataset separately (by iterable) or globally (bool).
padding_length (Union[Sequence[int], int]) – manually sets the number of applied paddings; applies only if padding is True. Defaults to None. Controlled either for each dataset separately (by iterable) or globally (int).
canonical (Union[Sequence[bool], bool]) – performs canonicalization of SMILES (one original string for one molecule). If True, the other transformations (augment etc., see below) do not apply. Defaults to False.
augment (Union[Sequence[bool], bool]) – perform SMILES augmentation. Defaults to False.
kekulize (Union[Sequence[bool], bool]) – kekulizes SMILES (implicit aromaticity only). Defaults to False.
all_bonds_explicit (Union[Sequence[bool], bool]) – Makes all bonds explicit. Defaults to False. Only applies if kekulize is True.
all_hs_explicit (Union[Sequence[bool], bool]) – Makes all hydrogens explicit. Defaults to False. Only applies if kekulize is True.
randomize (Union[Sequence[bool], bool]) – perform a true randomization of SMILES tokens. Defaults to False.
remove_bonddir (Union[Sequence[bool], bool]) – Remove directional info of bonds. Defaults to False.
remove_chirality (Union[Sequence[bool], bool]) – Remove chirality information. Defaults to False.
selfies (Union[Sequence[bool], bool]) – Whether SELFIES is used instead of SMILES. Defaults to False.
sanitize (Union[Sequence[bool], bool]) – RDKit sanitization of the molecule. Defaults to True.
iterate_dataset (bool) – whether to go through all SMILES in the dataset to build/extend the vocabulary, find the longest sequence, and check the passed padding length if applicable. Defaults to True.
backend (str) – memory management backend. Defaults to 'eager', which favors speed over memory consumption.
device (torch.device) – DEPRECATED
kwargs (dict) – additional arguments for dataset constructor.
NOTE: If a parameter that can be given as Union[Sequence[bool], bool] is given as Sequence[bool] of wrong length (!= len(entity_names)), the first list item is used for all datasets.
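The per-dataset versus global rule, including the fallback described in the note, could look like the sketch below. `broadcast_parameter` is a hypothetical helper written for illustration; it is not part of pytoda's API and does not claim to reproduce its internals.

```python
from collections.abc import Sequence


def broadcast_parameter(value, n_entities):
    """Expand a global or per-entity setting to one value per entity.

    Hypothetical helper illustrating the documented rule, not pytoda's code.
    An empty sequence is not handled in this sketch.
    """
    # Strings are sequences too, but here they count as single values.
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        if len(value) == n_entities:
            # Correct length: one setting per dataset.
            return list(value)
        # Wrong length: the first item is applied to every dataset.
        return [value[0]] * n_entities
    # A plain bool/int (or None) applies globally.
    return [value] * n_entities


entity_names = ["monomer", "catalyst"]  # hypothetical entities
n = len(entity_names)

print(broadcast_parameter(True, n))                 # global setting
print(broadcast_parameter([True, False], n))        # one setting per entity
print(broadcast_parameter([False, True, True], n))  # wrong length, first item wins
```

Because a sequence of the wrong length falls back to its first item rather than raising an error, it is worth checking that each per-entity sequence has exactly len(entity_names) items before passing it.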
-