pytoda.datasets.protein_sequence_dataset module

Implementation of ProteinSequenceDataset.

Summary

Classes:

ProteinSequenceDataset

Protein Sequence dataset using a Language to transform sequences.

Functions:

protein_sequence_dataset

Return a dataset of protein sequences.

Reference

protein_sequence_dataset(*filepaths, filetype, backend, **kwargs)[source]

Return a dataset of protein sequences.

Parameters
  • filepaths (Iterable[str]) – Paths to the files containing the protein sequences.

  • filetype (str) – The filetype of the protein sequences.

  • backend (str) – The backend to use for the dataset.

  • kwargs – Keyword arguments for the dataset.

Return type

KeyDataset

class ProteinSequenceDataset(*filepaths, filetype='.smi', protein_language=None, amino_acid_dict='iupac', padding=True, padding_length=None, add_start_and_stop=False, augment_by_revert=False, sequence_augment={}, protein_keep_only_uppercase=False, randomize=False, backend='eager', iterate_dataset=False, name='protein-sequences', device=None, **kwargs)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Protein Sequence dataset using a Language to transform sequences.

__init__(*filepaths, filetype='.smi', protein_language=None, amino_acid_dict='iupac', padding=True, padding_length=None, add_start_and_stop=False, augment_by_revert=False, sequence_augment={}, protein_keep_only_uppercase=False, randomize=False, backend='eager', iterate_dataset=False, name='protein-sequences', device=None, **kwargs)[source]

Initialize a Protein Sequence dataset.

Parameters
  • *filepaths (Files) – paths to the .smi, .csv, .fasta or .fasta.gz files with the sequences.

  • filetype (str) – From {.smi, .csv, .fasta, .fasta.gz}.

  • protein_language (ProteinLanguage) – a ProteinLanguage (or child) instance, e.g. ProteinFeatureLanguage. Defaults to None, creating a default instance.

  • amino_acid_dict (str) – Type of dictionary used for amino acid sequences. Defaults to ‘iupac’, alternatives are ‘unirep’ and ‘human-kinase-alignment’.

  • padding (bool) – pad sequences to longest in the protein language. Defaults to True.

  • padding_length (int) – manually sets number of applied paddings, applies only if padding is True. Defaults to None.

  • add_start_and_stop (bool) – add start and stop token indexes. Defaults to False.

  • augment_by_revert (bool) – perform protein augmentation by reversing sequences. Defaults to False.

  • randomize (bool) – perform a true randomization of Protein tokens. Defaults to False.

  • sequence_augment (Dict) – a dictionary to specify additional sequence augmentation. Defaults to {}. Items can be:

    alignment_path: A path (str) to a .smi (or .tsv) file with alignment information that specifies which residues of the sequence are to be used, e.g., to extract active site sequences from full proteins. Do not use a header in the file.

      1. column has to be the full protein sequence (use upper case only for residues to be used), e.g., ggABCggDEFgg.

      2. column has to be the condensed sequence (ABCDEF).

      3. column has to be the protein identifier.

      NOTE: Such a file is necessary to apply all augmentation types specified in this dictionary (sequence_augment).

      NOTE: Unless specified, this defaults to kinase_activesite_alignment.smi, a file in pytoda.proteins.metadata that can ONLY be used to extract active site sequences of human kinases (based on the active site definition in Sheridan et al. (2009, JCIM) and Martin & Mukherjee (2012, JCIM)).

    discard_lowercase: A (bool) specifying whether all lowercase characters (residues) in the sequence should be discarded. NOTE: This defaults to True.

    flip_substrings: A probability (float) to flip each contiguous upper-case substring (e.g., an active site substring that lies contiguously in the original sequence). Defaults to 0.0, i.e., no flipping. E.g., ABCDEF could become CBADEF or CBAFED or ABCFED if the original sequence is ggABCggDEFgg.

    swap_substrings: A probability (float) to swap neighboring substrings. Defaults to 0.0, i.e., no swapping. E.g., ABCDEF could become DEFABC if the original sequence is ggABCggDEFgg.

    noise: A 2-Tuple of (float, float) that specifies the probability for a random, single-residue mutation inside and outside the relevant part. Defaults to (0.0, 0.0), i.e., no noise. E.g., with (0.0, 0.5), ggABCggDEFgg could become hgABCgbDEFgg.

  • iterate_dataset (bool) – whether to go through all items in the dataset to detect unknown characters, find the longest sequence and check the passed padding length, if applicable. Defaults to False.

  • backend (str) – memory management backend. Defaults to eager, which prefers speed over memory consumption.

  • name (str) – name of the ProteinSequenceDataset.

  • device (torch.device) – DEPRECATED

  • kwargs (dict) – additional arguments for dataset constructor.
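
The augmentation options described for sequence_augment can be illustrated on the ggABCggDEFgg example from the parameter description. The following is a minimal, standalone sketch of that behaviour, not the pytoda implementation: all function names, the way neighbouring substrings are paired for swapping, and the 20-letter residue alphabet used for noise are assumptions made for illustration.

```python
import random
import re


def upper_substrings(aligned):
    """Return the contiguous upper-case substrings of an aligned sequence."""
    return re.findall(r"[A-Z]+", aligned)


def discard_lowercase(aligned):
    """Keep only the upper-case residues (the 'condensed' sequence)."""
    return "".join(upper_substrings(aligned))


def flip_substrings(substrings, p, rng):
    """Reverse each substring independently with probability p."""
    return [s[::-1] if rng.random() < p else s for s in substrings]


def swap_substrings(substrings, p, rng):
    """Swap each pair of neighbouring substrings with probability p.

    Walking left to right over adjacent pairs is an assumption of this sketch.
    """
    out = list(substrings)
    for i in range(len(out) - 1):
        if rng.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def add_noise(aligned, p_inside, p_outside, rng,
              alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Mutate single residues; upper-case counts as inside, lower-case as outside.

    The replacement is drawn uniformly from `alphabet` (an assumption) and may
    coincide with the original residue. Case is preserved.
    """
    out = []
    for ch in aligned:
        p = p_inside if ch.isupper() else p_outside
        if rng.random() < p:
            new = rng.choice(alphabet)
            out.append(new if ch.isupper() else new.lower())
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(0)
aligned = "ggABCggDEFgg"
parts = upper_substrings(aligned)                    # ['ABC', 'DEF']

print(discard_lowercase(aligned))                    # ABCDEF
print("".join(flip_substrings(parts, 1.0, rng)))     # CBAFED
print("".join(swap_substrings(parts, 1.0, rng)))     # DEFABC
print(add_noise(aligned, 0.0, 0.5, rng))             # upper-case part unchanged
```

The probabilities 1.0 and 0.0 make flipping and swapping deterministic here; with intermediate values each substring (or pair) is treated independently, which is why ABCDEF can become CBADEF, CBAFED or ABCFED.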

setup_sequence_augmentation(sequence_augment)[source]

Set up the sequence augmentation.

Parameters

sequence_augment (Dict) – A dictionary to specify the sequence augmentation strategies. For details see the constructor docs.
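
As a hedged illustration of the expected input, the sketch below writes a small alignment file in the three-column, header-free layout described in the constructor docs and builds a matching sequence_augment dictionary. The tab delimiter, the file name alignment.smi, the identifier protein_1 and the helper is_consistent are assumptions for this example; only the dictionary keys and their documented defaults come from the reference above.

```python
import csv
import os
import tempfile

# One row per protein: full sequence, condensed sequence, identifier.
# Upper case marks the residues to be used; no header row is written.
rows = [("ggABCggDEFgg", "ABCDEF", "protein_1")]  # hypothetical entry

alignment_path = os.path.join(tempfile.mkdtemp(), "alignment.smi")
with open(alignment_path, "w", newline="") as handle:
    # Tab-separated columns are an assumption of this sketch.
    csv.writer(handle, delimiter="\t").writerows(rows)


def is_consistent(full, condensed):
    """Check that the upper-case residues of `full` spell out `condensed`."""
    return "".join(ch for ch in full if ch.isupper()) == condensed


# Keys as listed in the constructor docs, set to their documented defaults.
sequence_augment = {
    "alignment_path": alignment_path,
    "discard_lowercase": True,
    "flip_substrings": 0.0,
    "swap_substrings": 0.0,
    "noise": (0.0, 0.0),
}

with open(alignment_path, newline="") as handle:
    for full, condensed, identifier in csv.reader(handle, delimiter="\t"):
        print(identifier, is_consistent(full, condensed))
```

A dictionary of this shape is what the constructor and setup_sequence_augmentation receive as sequence_augment; checking the first two columns against each other beforehand is an optional sanity check, not something the documentation states the library requires.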