pytoda.datasets.drug_sensitivity_dataset module
Implementation of DrugSensitivityDataset.
Reference
-
class DrugSensitivityDataset(drug_sensitivity_filepath, smi_filepath, gene_expression_filepath, column_names=['drug', 'cell_line', 'IC50'], drug_sensitivity_dtype=torch.float32, drug_sensitivity_min_max=True, drug_sensitivity_processing_parameters={}, smiles_language=None, padding=True, padding_length=None, add_start_and_stop=False, augment=False, canonical=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, randomize=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, vocab_file=None, iterate_dataset=True, gene_list=None, gene_expression_standardize=True, gene_expression_min_max=False, gene_expression_processing_parameters={}, gene_expression_dtype=torch.float32, gene_expression_kwargs={}, backend='eager', device=None)
Bases: Generic[torch.utils.data.dataset.T_co]
Drug sensitivity dataset implementation.
-
__init__(drug_sensitivity_filepath, smi_filepath, gene_expression_filepath, column_names=['drug', 'cell_line', 'IC50'], drug_sensitivity_dtype=torch.float32, drug_sensitivity_min_max=True, drug_sensitivity_processing_parameters={}, smiles_language=None, padding=True, padding_length=None, add_start_and_stop=False, augment=False, canonical=False, kekulize=False, all_bonds_explicit=False, all_hs_explicit=False, randomize=False, remove_bonddir=False, remove_chirality=False, selfies=False, sanitize=True, vocab_file=None, iterate_dataset=True, gene_list=None, gene_expression_standardize=True, gene_expression_min_max=False, gene_expression_processing_parameters={}, gene_expression_dtype=torch.float32, gene_expression_kwargs={}, backend='eager', device=None)
Initialize a drug sensitivity dataset.
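As a rough illustration of how the constructor might be called, the sketch below wraps it in a small helper. The helper name build_drug_sensitivity_dataset and the choice of non-default arguments are assumptions made for this example only; the keyword names and the module path are taken from the signature above. The import is deferred into the function so that the snippet can be loaded without pytoda installed.

```python
def build_drug_sensitivity_dataset(sensitivity_csv, smi_file, expression_csv):
    """Construct a DrugSensitivityDataset from three input files (illustrative helper)."""
    # Deferred import: pytoda is only required when the helper is actually called.
    from pytoda.datasets.drug_sensitivity_dataset import DrugSensitivityDataset

    return DrugSensitivityDataset(
        drug_sensitivity_filepath=sensitivity_csv,  # .csv with an index and three header columns
        smi_filepath=smi_file,                      # .smi file with the drug SMILES
        gene_expression_filepath=expression_csv,    # .csv with an index and gene-name columns
        column_names=['drug', 'cell_line', 'IC50'],  # the documented default, spelled out
        add_start_and_stop=True,                    # illustrative choice; the default is False
        backend='eager',                            # the documented default
    )
```

The three paths are placeholders to be supplied by the caller; every argument that is not passed keeps the default listed in the signature.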
- Parameters
drug_sensitivity_filepath (str) – path to drug sensitivity .csv file. Currently, the only supported format is .csv, with an index and three header columns named as specified in the column_names argument.
smi_filepath (str) – path to .smi file.
gene_expression_filepath (str) – path to gene expression .csv file. Currently, the only supported format is .csv, with an index and header columns containing gene names.
column_names (Tuple[str]) – names of the columns in the drug sensitivity file that hold the drug name, the cell line name and the sensitivity label, respectively. Defaults to ['drug', 'cell_line', 'IC50'].
drug_sensitivity_dtype (torch.dtype) – drug sensitivity data type. Defaults to torch.float32.
drug_sensitivity_min_max (bool) – min-max scale drug sensitivity data. Defaults to True.
drug_sensitivity_processing_parameters (dict) – transformation parameters for drug sensitivity data, e.g. for min-max scaling. Defaults to {}.
smiles_language (SMILESLanguage) – a smiles language. Defaults to None.
padding (bool) – pad sequences to longest in the smiles language. Defaults to True.
padding_length (int) – manually set the length to which sequences are padded; applies only if padding is True. Defaults to None.
add_start_and_stop (bool) – add start and stop token indexes. Defaults to False.
canonical (bool) – canonicalize SMILES (a single canonical string per molecule). If True, the other transformations (augment etc., see below) do not apply. Defaults to False.
augment (bool) – perform SMILES augmentation. Defaults to False.
kekulize (bool) – kekulizes SMILES (implicit aromaticity only). Defaults to False.
all_bonds_explicit (bool) – Makes all bonds explicit. Defaults to False, only applies if kekulize = True.
all_hs_explicit (bool) – Makes all hydrogens explicit. Defaults to False, only applies if kekulize = True.
randomize (bool) – perform a true randomization of SMILES tokens. Defaults to False.
remove_bonddir (bool) – Remove directional info of bonds. Defaults to False.
remove_chirality (bool) – Remove chirality information. Defaults to False.
selfies (bool) – whether SELFIES is used instead of SMILES. Defaults to False.
sanitize (bool) – RDKit sanitization of the molecule. Defaults to True.
vocab_file (str) – Optional .json to load vocabulary. Tries to load metadata if iterate_dataset is False. Defaults to None.
iterate_dataset (bool) – whether to go through all SMILES in the dataset to extend/build the vocabulary, find the longest sequence, and check the passed padding length if applicable. Defaults to True.
gene_list (GeneList) – a list of genes. Defaults to None.
gene_expression_standardize (bool) – perform gene expression data standardization. Defaults to True.
gene_expression_min_max (bool) – perform min-max scaling on gene expression data. Defaults to False.
gene_expression_processing_parameters (dict) – transformation parameters for gene expression, e.g. for min-max scaling. Defaults to {}.
gene_expression_dtype (torch.dtype) – gene expression data type. Defaults to torch.float32.
gene_expression_kwargs (dict) – additional parameters for GeneExpressionDataset. Defaults to {}.
backend (str) – memory management backend. Defaults to 'eager', which favors speed over memory consumption. Note that at the moment only the gene expression and the SMILES datasets implement both backends; the drug sensitivity data are always loaded in memory.
device (torch.device) – DEPRECATED
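The min-max options and their processing-parameter dictionaries can be pictured with a small, library-independent sketch. It assumes that the parameter dictionary stores the scaling bounds under the keys 'min' and 'max'; those key names, and the function min_max_scale itself, are assumptions for illustration and not pytoda's actual implementation. The idea shown is that parameters computed on one split (e.g. training) can be passed in again so that another split is scaled with the same bounds.

```python
def min_max_scale(values, parameters=None):
    """Scale values to [0, 1] using stored bounds if given, else bounds of the data."""
    parameters = dict(parameters or {})
    # Fall back to the data's own range when no bounds were supplied.
    minimum = parameters.get('min', min(values))
    maximum = parameters.get('max', max(values))
    span = maximum - minimum
    if span == 0:
        # Degenerate case: all values identical, map everything to 0.0.
        scaled = [0.0 for _ in values]
    else:
        scaled = [(value - minimum) / span for value in values]
    return scaled, {'min': minimum, 'max': maximum}


# Fit the bounds on "training" values ...
train_scaled, train_parameters = min_max_scale([1.0, 3.0, 5.0])
# train_scaled == [0.0, 0.5, 1.0], train_parameters == {'min': 1.0, 'max': 5.0}

# ... and reuse them for "test" values, which may then fall outside [0, 1].
test_scaled, _ = min_max_scale([2.0, 6.0], train_parameters)
# test_scaled == [0.25, 1.25]
```

Reusing the training bounds keeps the two splits on a common scale, which is the usual reason for passing transformation parameters explicitly rather than recomputing them per dataset.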
-