pytoda.datasets.gene_expression_dataset module

GeneExpressionDataset module.

Summary

Classes:

GeneExpressionDataset

Gene expression dataset implementation.

Reference

class GeneExpressionDataset(*gene_expression_filepaths, gene_list=None, standardize=True, min_max=False, processing_parameters={}, impute=0.0, dtype=torch.float32, backend='eager', chunk_size=10000, device=None, **kwargs)

Bases: Generic[torch.utils.data.dataset.T_co]

Gene expression dataset implementation.

__init__(*gene_expression_filepaths, gene_list=None, standardize=True, min_max=False, processing_parameters={}, impute=0.0, dtype=torch.float32, backend='eager', chunk_size=10000, device=None, **kwargs)

Initialize a gene expression dataset.

Parameters

gene_expression_filepaths (Files) – paths to .csv files, currently the only supported format. Each file holds gene expression profiles as rows and gene names as columns.
gene_list (GeneList) – a list of genes. Defaults to None.
standardize (bool) – perform data standardization. Defaults to True.
min_max (bool) – perform min-max scaling. Defaults to False.
processing_parameters (dict) – processing parameters. Keys can be ‘min’ and ‘max’ (for min-max scaling) or ‘mean’ and ‘std’ (for standardization). Values must be readable by np.array, and their feature order and subset must match those determined by the dataset setup (see self.gene_list after initialization). Defaults to {}.
impute (Optional[float]) – value used to replace NaN entries, if given. Defaults to 0.0.
dtype (torch.dtype) – data type. Defaults to torch.float32.
backend (str) – memory management backend. Defaults to ‘eager’, which favors speed over memory consumption.
chunk_size (int) – size of the chunks used for lazy reading; ignored with the ‘eager’ backend. Defaults to 10000.
device (torch.device) – DEPRECATED
kwargs (dict) – additional parameters for pd.read_csv.
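
The parameters above can be illustrated with a minimal sketch. It writes a toy .csv in the documented layout (profiles on rows, gene names as columns), derives per-gene ‘mean’ and ‘std’ values as plain lists (which np.array can read), and passes them to the constructor. The gene names, sample names, helper functions and the use of a population standard deviation are assumptions made for illustration, as is index_col=0, which is forwarded to pd.read_csv on the assumption that the first column holds sample identifiers. The order of the supplied statistics is assumed to match the dataset's gene order and should be checked against self.gene_list.

```python
import csv
import os
import statistics
import tempfile


def write_toy_csv(directory=None):
    """Write a toy expression table: one profile per row, one gene per column."""
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, "toy_expression.csv")
    genes = ["GENE_A", "GENE_B", "GENE_C"]  # hypothetical gene names
    profiles = {  # hypothetical sample identifiers and values
        "sample_1": [1.0, 4.0, 0.5],
        "sample_2": [3.0, 8.0, 1.5],
    }
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([""] + genes)  # header row: empty index cell, then genes
        for name, values in profiles.items():
            writer.writerow([name] + values)
    return path, genes, list(profiles.values())


def reference_statistics(profiles):
    """Per-gene statistics keyed as 'mean' and 'std'.

    Assumption: population standard deviation; the library may use another
    convention when it computes these values itself.
    """
    columns = list(zip(*profiles))  # transpose: one tuple of values per gene
    return {
        "mean": [statistics.fmean(column) for column in columns],
        "std": [statistics.pstdev(column) for column in columns],
    }


def build_dataset(csv_path, processing_parameters):
    """Construct the dataset with externally supplied statistics."""
    # Imported lazily so the helpers above work without pytoda installed.
    from pytoda.datasets.gene_expression_dataset import GeneExpressionDataset

    return GeneExpressionDataset(
        csv_path,
        standardize=True,
        min_max=False,
        processing_parameters=processing_parameters,
        impute=0.0,
        backend="eager",
        index_col=0,  # assumption: first column is the sample index (pd.read_csv kwarg)
    )
```

Calling write_toy_csv() and then build_dataset(path, reference_statistics(profiles)) would be one way to reuse statistics computed elsewhere, for example applying training-set statistics to a held-out file. Omitting processing_parameters leaves the default {}.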