pytoda.datasets.gene_expression_dataset module

GeneExpressionDataset module.

Summary

Classes:

GeneExpressionDataset

Gene expression dataset implementation.

Reference

class GeneExpressionDataset(*gene_expression_filepaths, gene_list=None, standardize=True, min_max=False, processing_parameters={}, impute=0.0, dtype=torch.float32, backend='eager', chunk_size=10000, device=None, **kwargs)

Bases: Generic[torch.utils.data.dataset.T_co]

Gene expression dataset implementation.

__init__(*gene_expression_filepaths, gene_list=None, standardize=True, min_max=False, processing_parameters={}, impute=0.0, dtype=torch.float32, backend='eager', chunk_size=10000, device=None, **kwargs)

Initialize a gene expression dataset.

Parameters
  • gene_expression_filepaths (Files) – paths to the gene expression files. Currently, the only supported format is .csv, with one gene expression profile per row and gene names as column headers.

  • gene_list (GeneList) – a list of genes. Defaults to None.

  • standardize (bool) – perform data standardization. Defaults to True.

  • min_max (bool) – perform min-max scaling. Defaults to False.

  • processing_parameters (dict) – processing parameters. Keys are ‘min’ and ‘max’ for min-max scaling, or ‘mean’ and ‘std’ for standardization. Values must be readable by np.array, and the order and subset of features must match those determined by the dataset setup (see self.gene_list after initialization). Defaults to {}.

  • impute (Optional[float]) – if given, NaN values are replaced with this value. Defaults to 0.0.

  • dtype (torch.dtype) – data type. Defaults to torch.float32.

  • backend (str) – memory management backend. Defaults to ‘eager’, which favors speed over memory consumption.

  • chunk_size (int) – size of the chunks used for lazy reading; ignored with the ‘eager’ backend. Defaults to 10000.

  • device (torch.device) – DEPRECATED

  • kwargs (dict) – additional keyword arguments for pd.read_csv.
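
The interplay of standardize, min_max and processing_parameters can be illustrated with a small standalone sketch. It does not call pytoda; it only mimics, with the Python standard library, the expected .csv layout (profiles on rows, gene names as columns) and the two scalings. The helper names (read_profiles, column_stats, standardize_rows, min_max_rows), the sample index column, the toy data and the use of the population standard deviation are all assumptions made for illustration, not the library's implementation.

```python
import csv
import io
import statistics

# Hypothetical file content: one expression profile per row, one gene per column.
# The leading "sample" index column is an assumption of this sketch.
CSV_TEXT = """sample,GENE_A,GENE_B
s1,1.0,10.0
s2,3.0,20.0
s3,5.0,30.0
"""


def read_profiles(text):
    """Return the gene names (column order) and the profiles as lists of floats."""
    reader = csv.DictReader(io.StringIO(text))
    genes = [name for name in reader.fieldnames if name != "sample"]
    rows = [[float(record[gene]) for gene in genes] for record in reader]
    return genes, rows


def column_stats(rows):
    """Per-gene statistics, keyed like the documented processing_parameters."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.mean(c) for c in columns],
        # Population standard deviation; which estimator pytoda uses is an assumption.
        "std": [statistics.pstdev(c) for c in columns],
        "min": [min(c) for c in columns],
        "max": [max(c) for c in columns],
    }


def standardize_rows(rows, mean, std):
    """Subtract the per-gene mean and divide by the per-gene standard deviation."""
    return [[(v - m) / s for v, m, s in zip(row, mean, std)] for row in rows]


def min_max_rows(rows, minimum, maximum):
    """Rescale each gene so that `minimum` maps to 0 and `maximum` maps to 1."""
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, minimum, maximum)]
        for row in rows
    ]


genes, train_rows = read_profiles(CSV_TEXT)
params = column_stats(train_rows)

standardized = standardize_rows(train_rows, params["mean"], params["std"])
scaled = min_max_rows(train_rows, params["min"], params["max"])

# Reusing the training statistics on new profiles is the role that explicitly
# passed parameters would play; the values must follow the gene order in `genes`.
test_rows = [[2.0, 25.0]]
test_scaled = min_max_rows(test_rows, params["min"], params["max"])
```

In this sketch the per-gene lists in params line up with the column order in genes, which mirrors the documented requirement that the values in processing_parameters match the order and subset of features in self.gene_list.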