pytoda.datasets.drug_affinity_dataset module

Implementation of DrugAffinityDataset.

Summary

Classes:

DrugAffinityDataset

Drug affinity dataset implementation.

Reference

class DrugAffinityDataset(drug_affinity_filepath, smi_filepath, protein_filepath, column_names=['ligand_name', 'sequence_id', 'label'], drug_affinity_dtype=torch.int32, smiles_language=None, smiles_padding=True, smiles_padding_length=None, smiles_add_start_and_stop=False, smiles_augment=False, smiles_canonical=False, smiles_kekulize=False, smiles_all_bonds_explicit=False, smiles_all_hs_explicit=False, smiles_randomize=False, smiles_remove_bonddir=False, smiles_remove_chirality=False, smiles_vocab_file=None, smiles_selfies=False, smiles_sanitize=True, protein_language=None, protein_amino_acid_dict='iupac', protein_padding=True, protein_padding_length=None, protein_add_start_and_stop=False, protein_augment_by_revert=False, protein_sequence_augment={}, protein_randomize=False, iterate_dataset=True, backend='eager', device=None)

Bases: Generic[torch.utils.data.dataset.T_co]

Drug affinity dataset implementation.

__init__(drug_affinity_filepath, smi_filepath, protein_filepath, column_names=['ligand_name', 'sequence_id', 'label'], drug_affinity_dtype=torch.int32, smiles_language=None, smiles_padding=True, smiles_padding_length=None, smiles_add_start_and_stop=False, smiles_augment=False, smiles_canonical=False, smiles_kekulize=False, smiles_all_bonds_explicit=False, smiles_all_hs_explicit=False, smiles_randomize=False, smiles_remove_bonddir=False, smiles_remove_chirality=False, smiles_vocab_file=None, smiles_selfies=False, smiles_sanitize=True, protein_language=None, protein_amino_acid_dict='iupac', protein_padding=True, protein_padding_length=None, protein_add_start_and_stop=False, protein_augment_by_revert=False, protein_sequence_augment={}, protein_randomize=False, iterate_dataset=True, backend='eager', device=None)

Initialize a drug affinity dataset.

Parameters
  • drug_affinity_filepath (str) – path to drug affinity .csv file. Currently, the only supported format is .csv, with an index and three header columns named as specified in column_names.

  • smi_filepath (str) – path to .smi file.

  • protein_filepath (str) – path to .smi or .fasta file.

  • column_names (Tuple[str]) – Names of columns in data files to retrieve ligand names, protein names and labels, respectively (the order used by the default). Defaults to ['ligand_name', 'sequence_id', 'label'].

  • drug_affinity_dtype (torch.dtype) – drug affinity data type. Defaults to torch.int32.

  • smiles_language (SMILESLanguage) – a SMILES language. Defaults to None.

  • smiles_vocab_file (str) – Optional .json to load vocabulary. Tries to load metadata if iterate_dataset is False. Defaults to None.

  • smiles_padding (bool) – pad sequences to the longest in the SMILES language. Defaults to True.

  • smiles_padding_length (int) – manually set the padding length. Applies only if smiles_padding is True. Defaults to None.

  • smiles_add_start_and_stop (bool) – add start and stop token indexes. Defaults to False.

  • smiles_canonical (bool) – canonicalize SMILES (one canonical string per molecule). If True, the other transformations (augment etc., see below) do not apply. Defaults to False.

  • smiles_augment (bool) – perform SMILES augmentation. Defaults to False.

  • smiles_kekulize (bool) – kekulizes SMILES (implicit aromaticity only). Defaults to False.

  • smiles_all_bonds_explicit (bool) – make all bonds explicit. Only applies if smiles_kekulize is True. Defaults to False.

  • smiles_all_hs_explicit (bool) – make all hydrogens explicit. Only applies if smiles_kekulize is True. Defaults to False.

  • smiles_randomize (bool) – perform a true randomization of SMILES tokens. Defaults to False.

  • smiles_remove_bonddir (bool) – Remove directional info of bonds. Defaults to False.

  • smiles_remove_chirality (bool) – Remove chirality information. Defaults to False.

  • smiles_selfies (bool) – whether SELFIES is used instead of SMILES. Defaults to False.

  • smiles_sanitize (bool) – RDKit sanitization of the molecule. Defaults to True.

  • protein_language (ProteinLanguage) – protein language. Defaults to None, in which case it is built from scratch.

  • protein_amino_acid_dict (str) – Amino acid dictionary. Defaults to 'iupac'.

  • protein_padding (bool) – pad sequences to the longest in the protein language. Defaults to True.

  • protein_padding_length (int) – manually set the padding. Defaults to None.

  • protein_add_start_and_stop (bool) – add start and stop token indexes. Defaults to False.

  • protein_augment_by_revert (bool) – augment data by reversing the sequence. Defaults to False.

  • protein_sequence_augment (Dict) – a dictionary to specify additional sequence augmentation. Defaults to {}. NOTE: For details please see ProteinSequenceDataset.

  • protein_randomize (bool) – perform a randomization of the protein sequence tokens. Defaults to False.

  • protein_vocab_file (str) – Optional .json to load vocabulary. Tries to load metadata if iterate_dataset is False. Defaults to None.

  • iterate_dataset (bool) – whether to go through all items in the dataset to extend/build the vocabulary, find the longest sequence, and check the passed padding length if applicable. Defaults to True.

  • backend (str) – memory management backend. Defaults to 'eager', which prefers speed over memory consumption. Note that at the moment only the SMILES dataset implements both backends. The drug affinity data and the protein dataset are loaded in memory.

  • device (torch.device) – DEPRECATED
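
Example

The parameters above can be tied together with a minimal sketch that writes three toy input files and passes them to the constructor. The helper names write_toy_inputs and build_dataset, the file names, and all molecule and protein entries are made up for illustration. The sketch also assumes that .smi files are header-less, tab-separated lines of the form "<sequence><TAB><name>", and that the affinity .csv has an unnamed index as its first column; neither detail is stated in the parameter descriptions, so check them against your own data.

```python
import csv
import tempfile
from pathlib import Path

# Column names in the order of the default `column_names` argument.
COLUMN_NAMES = ["ligand_name", "sequence_id", "label"]


def write_toy_inputs(directory):
    """Write tiny, made-up input files and return their paths (hypothetical helper)."""
    directory = Path(directory)
    affinity_path = directory / "affinity.csv"
    smi_path = directory / "ligands.smi"
    protein_path = directory / "proteins.smi"

    # Affinity table: an index column followed by the three named columns.
    # ASSUMPTION: the index is the first, unnamed column.
    with open(affinity_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([""] + COLUMN_NAMES)
        writer.writerow([0, "ethanol", "prot_a", 1])
        writer.writerow([1, "benzene", "prot_b", 0])

    # ASSUMPTION: .smi files hold "<sequence>\t<name>" lines without a header.
    smi_path.write_text("CCO\tethanol\nc1ccccc1\tbenzene\n")
    protein_path.write_text("MKTAYIAK\tprot_a\nGSHMLEDP\tprot_b\n")
    return affinity_path, smi_path, protein_path


def build_dataset(affinity_path, smi_path, protein_path):
    """Construct the dataset from the three files; requires pytoda (hypothetical helper)."""
    # Imported lazily so that the file-writing helper runs without pytoda.
    from pytoda.datasets.drug_affinity_dataset import DrugAffinityDataset

    return DrugAffinityDataset(
        drug_affinity_filepath=str(affinity_path),
        smi_filepath=str(smi_path),
        protein_filepath=str(protein_path),
        column_names=COLUMN_NAMES,
        smiles_padding=True,
        protein_padding=True,
    )
```

With pytoda installed, calling build_dataset(*write_toy_inputs(some_directory)) should construct the dataset from the toy files. The remaining keyword arguments keep the defaults listed in the signature.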