pytoda.proteins.transforms module
Amino Acid Sequence transforms.
Summary
Classes:
MutateResidues – Augment a protein sequence by injecting (possibly different) noise to residues inside and outside the relevant part (e.g., active site).
ProteinAugmentFlipSubstrs – Augment a protein sequence by randomly flipping each contiguous subsequence.
ProteinAugmentSwapSubstrs – Augment a protein sequence by randomly swapping neighboring subsequences.
ReplaceByFullProteinSequence – A transform to replace short amino acid sequences with the full protein sequence.
SequenceToTokenIndexes – Transform Sequence to token indexes using Sequence language.
Functions:
extract_active_sites_info – Processes and extracts useful information from an aligned protein sequence.
verify_aligned_info – Verify that the sequence is aligned.
Reference
class SequenceToTokenIndexes(protein_language)
Bases: pytoda.transforms.Transform
Transform Sequence to token indexes using Sequence language.
__init__(protein_language)
Initialize a Sequence to token indexes object.
Parameters
protein_language (ProteinLanguage) – a Protein language.
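As a rough illustration of what a sequence-to-token-index transform does, the sketch below maps each residue to an integer through a vocabulary. It does not use pytoda: `TOY_VOCAB`, `UNK_INDEX`, and `sequence_to_token_indexes` are hypothetical names, and the actual ProteinLanguage vocabulary, its index values, and any start, stop, or padding handling may differ.

```python
from typing import Dict, List

# Hypothetical toy vocabulary (an assumption, not pytoda's vocabulary).
# Index 0 is reserved here for residues that are not in the vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOY_VOCAB: Dict[str, int] = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
UNK_INDEX = 0


def sequence_to_token_indexes(
    sequence: str, vocab: Dict[str, int] = TOY_VOCAB
) -> List[int]:
    """Map each residue of `sequence` to its integer index in `vocab`.

    Case is ignored in this sketch, so 'a' and 'A' share an index.
    """
    return [vocab.get(residue, UNK_INDEX) for residue in sequence.upper()]


print(sequence_to_token_indexes("ACDX"))  # [1, 2, 3, 0]
```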
class ReplaceByFullProteinSequence(alignment_path)
Bases: pytoda.transforms.Transform
A transform to replace short amino acid sequences with the full protein sequence. For example, replace the active site sequence of a kinase with its full sequence.
__init__(alignment_path)
Loads alignment info with two “classes” (or types) of residues.
Parameters
alignment_path (str) – path to a .smi or .tsv file that maps between shortened and full, aligned sequences. Do not use a header in the file.
NOTE: By convention, residues in upper case are important and will be kept, while residues in lower case are less important and are (usually) discarded.
NOTE: The first column has to be the full protein sequence (use upper case only for residues to be used), e.g., ggABCggDEFgg.
NOTE: The second column has to be the condensed sequence (ABCDEF).
NOTE: The third column has to be a protein id (can be duplicated).
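The file layout described in the notes above can be sketched as a small loader. This is a minimal illustration that does not call pytoda. It assumes tab-separated columns, and `load_alignment` is a hypothetical helper, not the library's loader. Whether the real transform returns the aligned sequence with its case preserved is not stated in the source.

```python
import csv
import io
from typing import Dict, TextIO


def load_alignment(handle: TextIO) -> Dict[str, str]:
    """Build a condensed -> full-sequence lookup from a header-less file.

    Assumed columns: full aligned sequence, condensed sequence, protein id.
    """
    mapping: Dict[str, str] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        full_aligned, condensed, _protein_id = row
        mapping[condensed] = full_aligned
    return mapping


# In-memory stand-in for an alignment file with a single row.
example_file = io.StringIO("ggABCggDEFgg\tABCDEF\tprotein_1\n")
alignment = load_alignment(example_file)
print(alignment["ABCDEF"])  # ggABCggDEFgg
```

The protein id is read but not used as a key here, since the source notes that ids can be duplicated.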
extract_active_sites_info(aligned_seq)
Processes and extracts useful information from an aligned protein sequence. Expects lower case amino acids to be outside of the relevant area (e.g., active site) and upper case amino acids to be inside it.
Parameters
aligned_seq (str) – A string containing the aligned protein sequence, including lower case and upper case amino acids.
Returns
A 4-tuple of:
- aligned_seq (str): The input sequence.
- non_active_sites (List[str]): A list of strings, one item for each contiguous subsequence NOT belonging to the active site.
- active_sites (List[str]): A list of strings, one item for each contiguous subsequence belonging to the active site.
- all_seqs (List[str]): A list of strings, one item for each contiguous subsequence, whether or not it belongs to the active site.
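The splitting described above can be sketched with `itertools.groupby`, grouping consecutive residues by case. This is a minimal stand-in written from the description, not pytoda's implementation, and `split_aligned_sequence` is a hypothetical name.

```python
from itertools import groupby
from typing import List, Tuple


def split_aligned_sequence(
    aligned_seq: str,
) -> Tuple[str, List[str], List[str], List[str]]:
    """Split an aligned sequence into contiguous upper/lower case runs."""
    non_active_sites: List[str] = []
    active_sites: List[str] = []
    all_seqs: List[str] = []
    # groupby yields one group per run of consecutive same-case residues.
    for is_upper, run in groupby(aligned_seq, key=str.isupper):
        chunk = "".join(run)
        all_seqs.append(chunk)
        (active_sites if is_upper else non_active_sites).append(chunk)
    return aligned_seq, non_active_sites, active_sites, all_seqs


print(split_aligned_sequence("ggABCggDEFgg"))
# ('ggABCggDEFgg', ['gg', 'gg', 'gg'], ['ABC', 'DEF'], ['gg', 'ABC', 'gg', 'DEF', 'gg'])
```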
verify_aligned_info(sequence)
Verify that the sequence is aligned.
Parameters
sequence (str) – An amino acid sequence.
Raises
Exception – If alignment could not be detected.
Return type
None
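The source does not say how alignment is detected. One plausible check, assumed here purely for illustration, is that an aligned sequence contains both upper case and lower case residues. `check_alignment_info` is a hypothetical name, and the sketch raises `ValueError` (a subclass of `Exception`) rather than reproducing the library's exact behavior.

```python
def check_alignment_info(sequence: str) -> None:
    """Raise if `sequence` does not mix upper and lower case residues.

    Assumption: alignment information is encoded only through letter case.
    """
    has_upper = any(residue.isupper() for residue in sequence)
    has_lower = any(residue.islower() for residue in sequence)
    if not (has_upper and has_lower):
        raise ValueError(f"No alignment information detected in: {sequence!r}")


check_alignment_info("ggABCggDEFgg")  # passes silently

try:
    check_alignment_info("ABCDEF")  # only upper case residues
except ValueError as error:
    print(error)
```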
class ProteinAugmentFlipSubstrs(p=0.5)
Bases: pytoda.transforms.Transform
Augment a protein sequence by randomly flipping each contiguous subsequence.
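A minimal sketch of this kind of augmentation, written without pytoda: the sequence is split into contiguous same-case runs and each run is reversed with probability `p`. Treating `p` as a per-subsequence flip probability is an assumption, and `flip_substrings` and its `rng` argument are hypothetical.

```python
import random
from itertools import groupby
from typing import Optional


def flip_substrings(
    aligned_seq: str, p: float = 0.5, rng: Optional[random.Random] = None
) -> str:
    """Reverse each contiguous same-case run independently with probability p."""
    if rng is None:
        rng = random.Random()
    chunks = ["".join(run) for _, run in groupby(aligned_seq, key=str.isupper)]
    # rng.random() is in [0, 1), so p=0 never flips and p=1 always flips.
    return "".join(chunk[::-1] if rng.random() < p else chunk for chunk in chunks)


print(flip_substrings("ggABCggDEFgg", p=1.0))  # ggCBAggFEDgg
```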
class MutateResidues(mutate_upper=0.01, mutate_lower=0.1)
Bases: pytoda.transforms.Transform
Augment a protein sequence by injecting (possibly different) noise to residues inside and outside the relevant part (e.g., active site).
NOTE: Noise means single-residue point mutations.
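The idea of case-dependent point mutations can be sketched as follows. This is an illustration under assumptions, not pytoda's code: it treats `mutate_upper` and `mutate_lower` as per-residue mutation probabilities for upper case and lower case residues respectively, and draws replacements uniformly from the 20 standard amino acids. `mutate_residues` and `rng` are hypothetical names.

```python
import random
from typing import Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_residues(
    aligned_seq: str,
    mutate_upper: float = 0.01,
    mutate_lower: float = 0.1,
    rng: Optional[random.Random] = None,
) -> str:
    """Apply point mutations with a case-dependent probability per residue."""
    if rng is None:
        rng = random.Random()
    mutated = []
    for residue in aligned_seq:
        rate = mutate_upper if residue.isupper() else mutate_lower
        if rng.random() < rate:
            replacement = rng.choice(AMINO_ACIDS)
            # Preserve case so the alignment information survives the mutation.
            residue = replacement if residue.isupper() else replacement.lower()
        mutated.append(residue)
    return "".join(mutated)


print(mutate_residues("ggABCggDEFgg", rng=random.Random(0)))
```

In this sketch a replacement can coincide with the original residue, so the effective mutation rate is slightly lower than the nominal one.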
class ProteinAugmentSwapSubstrs(p=0.2)
Bases: pytoda.transforms.Transform
Augment a protein sequence by randomly swapping neighboring subsequences.
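One way to picture this augmentation, sketched without pytoda: split the sequence into contiguous same-case runs, walk over adjacent pairs, and swap a pair with probability `p`. The pairing strategy (skipping past a pair once it has been swapped) and the names `swap_substrings` and `rng` are assumptions made for illustration.

```python
import random
from itertools import groupby
from typing import Optional


def swap_substrings(
    aligned_seq: str, p: float = 0.2, rng: Optional[random.Random] = None
) -> str:
    """Swap neighboring contiguous runs, each pair with probability p."""
    if rng is None:
        rng = random.Random()
    chunks = ["".join(run) for _, run in groupby(aligned_seq, key=str.isupper)]
    i = 0
    while i < len(chunks) - 1:
        if rng.random() < p:
            chunks[i], chunks[i + 1] = chunks[i + 1], chunks[i]
            i += 2  # move past the pair that was just swapped
        else:
            i += 1
    return "".join(chunks)


print(swap_substrings("ggABCggDEFgg", p=1.0))  # ABCggDEFgggg
```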