pytoda.proteins.transforms module

Amino Acid Sequence transforms.

Summary

Classes:

`MutateResidues`	Augment a protein sequence by injecting (possibly different) noise to residues inside and outside the relevant part (e.g., active site).
`ProteinAugmentFlipSubstrs`	Augment a protein sequence by randomly flipping each contiguous subsequence.
`ProteinAugmentSwapSubstrs`	Augment a protein sequence by randomly swapping neighboring subsequences.
`ReplaceByFullProteinSequence`	A transform to replace short amino acid sequences with the full protein sequence.
`SequenceToTokenIndexes`	Transform Sequence to token indexes using Sequence language.

Functions:

`extract_active_sites_info`	Processes and extracts useful information from an aligned protein sequence.
`verify_aligned_info`	Verify that the sequence is aligned.

Reference

class SequenceToTokenIndexes(protein_language)

Bases: pytoda.transforms.Transform

Transform Sequence to token indexes using Sequence language.

__init__(protein_language)

Initialize a Sequence to token indexes object.

Parameters: protein_language (ProteinLanguage) – a Protein language.
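
As a rough illustration of what mapping a sequence to token indexes means, the sketch below uses a plain dictionary in place of a ProteinLanguage. The vocabulary `TOY_VOCAB`, the `<UNK>` token, and the function `sequence_to_token_indexes` are hypothetical names chosen for this example; they are not part of pytoda.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real protein language defines its own mapping.
TOY_VOCAB: Dict[str, int] = {"<UNK>": 0, "A": 1, "C": 2, "D": 3, "E": 4}


def sequence_to_token_indexes(sequence: str, vocab: Dict[str, int]) -> List[int]:
    """Map each residue to its index, falling back to the unknown token."""
    unk = vocab["<UNK>"]
    return [vocab.get(residue, unk) for residue in sequence]


print(sequence_to_token_indexes("ACDX", TOY_VOCAB))  # [1, 2, 3, 0]
```

Here `X` is not in the toy vocabulary, so it maps to the unknown index.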

class ReplaceByFullProteinSequence(alignment_path)

Bases: pytoda.transforms.Transform

A transform to replace short amino acid sequences with the full protein sequence. For example, replace active site sequence of a kinase with its full sequence.

__init__(alignment_path)

Loads alignment info with two “classes” (or types) of residues.

Parameters

alignment_path (str) –

Path to a .smi or .tsv file that maps between shortened and full, aligned sequences. Do not use a header in the file.

NOTE: By convention, residues in upper case are important and will be kept, while residues in lower case are less important and are (usually) discarded.
NOTE: The first column has to be the full protein sequence (use upper case only for residues to be used), e.g., ggABCggDEFgg.
NOTE: The second column has to be the condensed sequence (ABCDEF).
NOTE: The third column has to be a protein id (can be duplicated).
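
The three-column layout can be illustrated with a small reader. This is a sketch under stated assumptions, not the class's actual loading code: it assumes the columns are tab-separated, and `load_alignment` and the protein id `P00001` are hypothetical.

```python
import csv
import io
from typing import Dict, TextIO


def load_alignment(handle: TextIO) -> Dict[str, str]:
    """Map each condensed sequence to its full, aligned sequence."""
    mapping: Dict[str, str] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        full_seq, condensed, protein_id = row[:3]
        # Convention check: the upper-case residues of the full sequence,
        # read in order, should spell out the condensed sequence.
        kept = "".join(ch for ch in full_seq if ch.isupper())
        if kept != condensed:
            raise ValueError(f"Columns disagree for protein id {protein_id!r}")
        mapping[condensed] = full_seq
    return mapping


# No header row, as required by the file format described above.
example_file = io.StringIO("ggABCggDEFgg\tABCDEF\tP00001\n")
alignment = load_alignment(example_file)
print(alignment["ABCDEF"])  # ggABCggDEFgg
```

The consistency check is an addition for this example; it makes a mismatch between the first two columns fail early.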

extract_active_sites_info(aligned_seq)

Processes and extracts useful information from an aligned protein sequence. Expects lower case amino acids to be outside of the relevant area (e.g., active site) and upper case amino acids to be inside it.

Parameters

aligned_seq (str) – A string containing the aligned protein sequence, with lower case amino acids outside the relevant area and upper case amino acids inside it.

Returns

aligned_seq (str): The input sequence.
non_active_sites (List[str]): A list of strings, one item for each contiguous subsequence NOT belonging to the active site.
active_sites (List[str]): A list of strings, one item for each contiguous subsequence belonging to the active site.
all_seqs (List[str]): A list of strings, one item for each contiguous subsequence that either belongs to the active site or not.

Return type

4-tuple of (str, List[str], List[str], List[str])
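
The splitting by case can be sketched with `itertools.groupby`. This is an illustrative reimplementation, not the library's source; the name `split_by_case` is hypothetical, and characters that are not upper case (including any non-letters) are treated as lying outside the active site.

```python
from itertools import groupby
from typing import List, Tuple


def split_by_case(aligned_seq: str) -> Tuple[str, List[str], List[str], List[str]]:
    """Split an aligned sequence into maximal runs of same-case residues."""
    non_active_sites: List[str] = []
    active_sites: List[str] = []
    all_seqs: List[str] = []
    # groupby yields maximal runs of consecutive characters sharing a case.
    for is_active, run in groupby(aligned_seq, key=str.isupper):
        chunk = "".join(run)
        all_seqs.append(chunk)
        (active_sites if is_active else non_active_sites).append(chunk)
    return aligned_seq, non_active_sites, active_sites, all_seqs


result = split_by_case("ggABCggDEFgg")
print(result[1])  # ['gg', 'gg', 'gg']
print(result[2])  # ['ABC', 'DEF']
print(result[3])  # ['gg', 'ABC', 'gg', 'DEF', 'gg']
```

Joining the items of the last list in order reproduces the input sequence.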

verify_aligned_info(sequence)

Verify that the sequence is aligned.

Parameters: sequence (str) – An amino acid sequence.
Raises: Exception – If alignment could not be detected.
Return type: None

class ProteinAugmentFlipSubstrs(p=0.5)

Bases: pytoda.transforms.Transform

Augment a protein sequence by randomly flipping each contiguous subsequence.

__init__(p=0.5)

Parameters: p (float) – Probability that a contiguous subsequence is flipped (reversed).
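
The flipping augmentation can be sketched as follows. This is a minimal illustration rather than pytoda's implementation: it assumes a contiguous subsequence is a maximal run of same-case residues, and `flip_substrs` and its `rng` argument are hypothetical.

```python
import random
from itertools import groupby
from typing import Optional


def flip_substrs(
    aligned_seq: str, p: float = 0.5, rng: Optional[random.Random] = None
) -> str:
    """Reverse each same-case run independently with probability p."""
    rng = rng or random.Random()
    pieces = []
    for _, run in groupby(aligned_seq, key=str.isupper):
        chunk = "".join(run)
        if rng.random() < p:
            chunk = chunk[::-1]  # flip this subsequence in place
        pieces.append(chunk)
    return "".join(pieces)


print(flip_substrs("abCDEfg", p=1.0))  # baEDCgf
print(flip_substrs("abCDEfg", p=0.0))  # abCDEfg
```

The order of the subsequences is preserved; only the residues within a flipped subsequence are reversed.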

class MutateResidues(mutate_upper=0.01, mutate_lower=0.1)

Bases: pytoda.transforms.Transform

Augment a protein sequence by injecting (possibly different) noise to residues inside and outside the relevant part (e.g., active site). NOTE: Noise means single-residue point mutations.

__init__(mutate_upper=0.01, mutate_lower=0.1)

Parameters

mutate_upper (float) – probability of mutating uppercase residues.
mutate_lower (float) – probability of mutating lowercase residues.
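
A possible reading of the two probabilities is sketched below. It is not the library's code: it assumes each residue is mutated independently, that a mutation always picks a different one of the 20 standard amino acids, and that the case of the residue is preserved. `AMINO_ACIDS`, `mutate_residues`, and `rng` are hypothetical names.

```python
import random
from typing import Optional

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_residues(
    seq: str,
    mutate_upper: float = 0.01,
    mutate_lower: float = 0.1,
    rng: Optional[random.Random] = None,
) -> str:
    """Apply single-residue point mutations with case-dependent rates."""
    rng = rng or random.Random()
    out = []
    for residue in seq:
        is_upper = residue.isupper()
        p = mutate_upper if is_upper else mutate_lower
        # Characters outside the standard alphabet are left unchanged.
        if residue.upper() in AMINO_ACIDS and rng.random() < p:
            choices = [aa for aa in AMINO_ACIDS if aa != residue.upper()]
            new = rng.choice(choices)
            residue = new if is_upper else new.lower()
        out.append(residue)
    return "".join(out)


print(mutate_residues("ggACDgg", mutate_upper=0.0, mutate_lower=0.0))  # ggACDgg
```

With the default values, residues outside the relevant part (lower case) are mutated ten times as often as those inside it.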

class ProteinAugmentSwapSubstrs(p=0.2)

Bases: pytoda.transforms.Transform

Augment a protein sequence by randomly swapping neighboring subsequences.

__init__(p=0.2)

Parameters: p (float) – Probability that any subsequence switches places with its “neighbour”.
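
One way to picture the swapping is the sketch below. It is an assumption-laden illustration, not the library's implementation: subsequences are taken to be maximal same-case runs, each subsequence may swap with its right-hand neighbour, and a swapped pair is skipped so that no subsequence moves twice. `swap_substrs` and `rng` are hypothetical names.

```python
import random
from itertools import groupby
from typing import Optional


def swap_substrs(
    aligned_seq: str, p: float = 0.2, rng: Optional[random.Random] = None
) -> str:
    """Swap neighbouring same-case runs, each pair with probability p."""
    rng = rng or random.Random()
    chunks = ["".join(run) for _, run in groupby(aligned_seq, key=str.isupper)]
    i = 0
    while i < len(chunks) - 1:
        if rng.random() < p:
            chunks[i], chunks[i + 1] = chunks[i + 1], chunks[i]
            i += 2  # skip the swapped pair so a chunk moves at most once
        else:
            i += 1
    return "".join(chunks)


print(swap_substrs("aaBBccDD", p=1.0))  # BBaaDDcc
print(swap_substrs("aaBBccDD", p=0.0))  # aaBBccDD
```

Unlike flipping, swapping changes the order of the subsequences while leaving the residues inside each one untouched.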