pytoda.datasets.drug_sensitivity_dose_dataset module

Implementation of DrugSensitivityDoseDataset.

Summary

Classes:

DrugSensitivityDoseDataset

Drug sensitivity dose dataset implementation.

Reference

class DrugSensitivityDoseDataset(drug_sensitivity_filepath, smi_filepath, gene_expression_filepath, smiles_language, column_names=['drug', 'cell_line', 'dose', 'viability'], dose_transform=<ufunc 'log10'>, iterate_dataset=False, gene_list=None, gene_expression_standardize=True, gene_expression_min_max=False, gene_expression_processing_parameters={}, gene_expression_dtype=torch.float32, gene_expression_kwargs={}, backend='eager', device=None, **kwargs)

Bases: Generic[torch.utils.data.dataset.T_co]

Drug sensitivity dose dataset implementation.

__init__(drug_sensitivity_filepath, smi_filepath, gene_expression_filepath, smiles_language, column_names=['drug', 'cell_line', 'dose', 'viability'], dose_transform=<ufunc 'log10'>, iterate_dataset=False, gene_list=None, gene_expression_standardize=True, gene_expression_min_max=False, gene_expression_processing_parameters={}, gene_expression_dtype=torch.float32, gene_expression_kwargs={}, backend='eager', device=None, **kwargs)

Initialize a drug sensitivity dose dataset.

Parameters
  • drug_sensitivity_filepath (str) – path to the drug sensitivity .csv file. Currently, the only supported format is .csv, with an index and four header columns named as specified in the column_names argument.

  • smi_filepath (str) – path to .smi file.

  • gene_expression_filepath (str) – path to gene expression .csv file. Currently, the only supported format is .csv, with an index and header columns containing gene names.

  • smiles_language (SMILESTokenizer) – a smiles language/tokenizer must be passed. Specifies tokens and all transforms for SMILES conversion.

  • column_names (Tuple[str]) – Names of the columns in the data files from which to retrieve molecules, cell-line data, drug dose and viability (label). Defaults to ['drug', 'cell_line', 'dose', 'viability']. All but the second-to-last (dose) are passed to drug_sensitivity_dataset.

  • dose_transform (Callable[[float], float]) – A callable that converts the raw concentration into an input for the model. E.g. if the raw concentration is in uMol, a log10 transform could make sense. Defaults to log10 (the signature shows the NumPy ufunc 'log10'). NOTE: To switch it off, pass lambda x: x.

  • iterate_dataset (bool) – whether to go through all SMILES in the dataset to extend/build the vocabulary, find the longest sequence, and check the passed padding length if applicable. Defaults to False.

  • gene_list (GeneList) – a list of genes.

  • gene_expression_standardize (bool) – perform gene expression data standardization. Defaults to True.

  • gene_expression_min_max (bool) – perform min-max scaling on gene expression data. Defaults to False.

  • gene_expression_processing_parameters (dict) – transformation parameters for gene expression, e.g. for min-max scaling. Defaults to {}.

  • backend (str) – memory management backend. Defaults to 'eager', which favors speed over memory consumption. Note that at the moment only the gene expression and the SMILES datasets implement both backends. The drug sensitivity data are always loaded in memory.

  • device (torch.device) – DEPRECATED

  • **kwargs – Additional keyword arguments for parent class (DrugSensitivityDataset).
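Example

The following is a minimal sketch of the input layout and the dose transform described above, not a definitive recipe. The drug names, cell-line names, values and the helper functions write_sensitivity_csv and build_dataset are hypothetical and only serve as illustration. The class import path is taken from the module name at the top of this page. build_dataset is never called here, since it needs pytoda, a .smi file, a gene expression .csv file and a ready-made SMILESTokenizer.

```python
import csv
import os
import tempfile

import numpy as np


def write_sensitivity_csv(path, rows):
    """Write rows as a .csv with an index column and the four default headers."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # The leading empty header cell belongs to the index column.
        writer.writerow(["", "drug", "cell_line", "dose", "viability"])
        for index, row in enumerate(rows):
            writer.writerow([index, *row])


def build_dataset(sensitivity_path, smi_path, expression_path, smiles_language,
                  dose_transform=np.log10):
    """Hypothetical helper: construct the dataset from the documented arguments."""
    # Imported lazily so the rest of this sketch runs without pytoda installed.
    from pytoda.datasets.drug_sensitivity_dose_dataset import (
        DrugSensitivityDoseDataset,
    )

    return DrugSensitivityDoseDataset(
        drug_sensitivity_filepath=sensitivity_path,
        smi_filepath=smi_path,
        gene_expression_filepath=expression_path,
        smiles_language=smiles_language,
        column_names=["drug", "cell_line", "dose", "viability"],
        dose_transform=dose_transform,
    )


# Hypothetical example rows: (drug, cell_line, dose, viability).
rows = [
    ("drug_a", "cell_1", 0.1, 0.95),
    ("drug_a", "cell_1", 10.0, 0.40),
    ("drug_b", "cell_2", 1000.0, 0.05),
]
csv_path = os.path.join(tempfile.mkdtemp(), "sensitivity.csv")
write_sensitivity_csv(csv_path, rows)

raw_doses = np.array([row[2] for row in rows])

# Default behaviour per the signature: a log10 transform of the raw dose.
log_doses = np.log10(raw_doses)

# Passing an identity function switches the transform off.
identity = lambda x: x
same_doses = identity(raw_doses)
```

A log transform spreads doses that span several orders of magnitude onto a roughly even scale, which is presumably why it is the default. Passing the identity function keeps the raw concentrations unchanged.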