pytoda.datasets.protein_protein_interaction_dataset module

Implementation of ProteinProteinInteractionDataset.

Summary

Classes:

ProteinProteinInteractionDataset

PPI Dataset implementation.

Reference

class ProteinProteinInteractionDataset(sequence_filepaths, entity_names, labels_filepath, sequence_filetypes='infer', annotations_column_names=None, protein_languages=None, paddings=True, padding_lengths=None, add_start_and_stops=False, augment_by_reverts=False, randomizes=False, iterate_datasets=False, device=None)

Bases: Generic[torch.utils.data.dataset.T_co]

PPI Dataset implementation. Designed for two sources of protein sequences and one source of discrete labels. NOTE: Only classification (possibly multitask) is supported; regression tasks are not.
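A minimal construction sketch, using only the keyword arguments listed in the signature above. The file names and the build_dataset helper are hypothetical placeholders and not part of pytoda. The class is imported inside the helper so that the snippet can be read and adapted without the package installed.

```python
# Keyword arguments for a two-entity PPI dataset.
# All file names are hypothetical placeholders; replace them with real paths.
PPI_KWARGS = dict(
    sequence_filepaths=["peptides.smi", "tcr.fasta"],  # one source per entity
    entity_names=["Peptides", "T-Cell-Receptors"],     # same order as the sources
    labels_filepath="labels.csv",                      # classification labels
    sequence_filetypes="infer",                        # documented default
    paddings=True,                                     # documented default
    add_start_and_stops=False,                         # documented default
)


def build_dataset(**overrides):
    """Instantiate the dataset (hypothetical helper).

    Requires pytoda to be installed and the referenced files to exist.
    """
    from pytoda.datasets.protein_protein_interaction_dataset import (
        ProteinProteinInteractionDataset,
    )

    return ProteinProteinInteractionDataset(**{**PPI_KWARGS, **overrides})
```

Calling build_dataset(add_start_and_stops=True) would override a single setting while keeping the rest. Since the class derives from the PyTorch dataset base, the result can presumably be wrapped in a torch.utils.data.DataLoader like any other dataset.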

__init__(sequence_filepaths, entity_names, labels_filepath, sequence_filetypes='infer', annotations_column_names=None, protein_languages=None, paddings=True, padding_lengths=None, add_start_and_stops=False, augment_by_reverts=False, randomizes=False, iterate_datasets=False, device=None)

Initialize a protein-protein interaction dataset.

Parameters
  • sequence_filepaths (Union[Files, Sequence[Files]]) – paths to .smi (also as .csv) or .fasta (.gz) files with protein sequences. For each item in the iterable, one protein sequence dataset is created. Sequences can be nested, i.e. each protein sequence dataset can be created from an iterable of filepaths of the same type, see sequence_filetypes.

  • entity_names (Sequence[str]) – List of protein sequence entities, e.g. ['Peptides', 'T-Cell-Receptors']. These names should be column names in the file at labels_filepath, given in the same order as sequence_filepaths.

  • labels_filepath (str) – path to .csv file with classification labels.

  • sequence_filetypes (Union[str, List[str]]) – the filetypes of the sequence files. Can either be a str if all files have identical types or a Sequence if different entities have different types. Different types within the same entity are not supported. Supported formats are {.smi, .csv, .fasta, .fasta.gz}. Defaults to 'infer', i.e. filetypes are inferred automatically.

  • annotations_column_names (Union[List[int], List[str]]) – indexes (positional or strings) for the annotations. Defaults to None, i.e. all columns except the entity_names are treated as annotation labels.

  • protein_languages (Union[ProteinLanguage, List[ProteinLanguage]]) – one or multiple ProteinLanguages. If multiple are provided, exactly one should be given for each entity. If only one is provided, the same language will be used for all entities. You can also pass child instances like ProteinFeatureLanguage. Defaults to None, i.e., creating a single protein language with iupac dictionary for all entities.

  • paddings (Union[bool, Sequence[bool]]) – pad sequences to longest in the protein language. Defaults to True.

  • padding_lengths (Union[int, Sequence[int]]) – manually sets the number of applied paddings (only used if paddings is True). Defaults to None.

  • add_start_and_stops (Union[bool, Sequence[bool]]) – add start and stop token indexes. Defaults to False.

  • augment_by_reverts (Union[bool, Sequence[bool]]) – perform a stochastic reversion of the amino acid sequence. Defaults to False.

  • randomizes (Union[bool, Sequence[bool]]) – perform a true randomization of the amino acid sequences. Defaults to False.

  • iterate_datasets (Union[bool, Sequence[bool]]) – whether to go through all items in the datasets to detect unknown characters, find the longest sequence and check the passed padding length, if applicable. Defaults to False.

  • device (torch.device) – DEPRECATED
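Two conventions in the parameter list above can be illustrated with a short, self-contained sketch. It is written for explanation only and is not pytoda's internal code. Many parameters accept either a single value or one value per entity, and annotations_column_names=None means every column other than the entity columns is a label. The helper names per_entity and annotation_columns, as well as the label column names 'binds' and 'cytotoxic', are assumptions made up for this example.

```python
from typing import List, Optional, Sequence, TypeVar, Union

T = TypeVar("T")


def per_entity(value: Union[T, Sequence[T]], n_entities: int) -> List[T]:
    """Expand a single setting to one value per entity (hypothetical helper)."""
    # Only lists and tuples count as per-entity input; a str such as
    # 'infer' is treated as a single setting shared by all entities.
    if isinstance(value, (list, tuple)):
        if len(value) != n_entities:
            raise ValueError(f"Expected {n_entities} values, got {len(value)}.")
        return list(value)
    return [value] * n_entities


def annotation_columns(
    columns: Sequence[str],
    entity_names: Sequence[str],
    annotations_column_names: Optional[Sequence[Union[int, str]]] = None,
) -> List[str]:
    """Resolve which label-file columns hold annotations (hypothetical helper)."""
    if annotations_column_names is None:
        # Default: every column that is not an entity column.
        return [c for c in columns if c not in entity_names]
    # Positional indexes are looked up in the header; strings are used as-is.
    return [columns[c] if isinstance(c, int) else c for c in annotations_column_names]


entities = ["Peptides", "T-Cell-Receptors"]
# Hypothetical header of the labels .csv file.
header = ["Peptides", "T-Cell-Receptors", "binds", "cytotoxic"]

paddings = per_entity(True, len(entities))              # same setting for both
padding_lengths = per_entity([30, 500], len(entities))  # one value per entity
labels = annotation_columns(header, entities)           # default: non-entity columns
```

With these inputs, a single True is shared by both entities, the two padding lengths are kept as given, and the two non-entity columns are selected as labels.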