pytoda.proteins.transforms module

Amino Acid Sequence transforms.

Summary

Classes:

MutateResidues

Augment a protein sequence by injecting (possibly different) noise into residues inside and outside the relevant part (e.g., active site).

ProteinAugmentFlipSubstrs

Augment a protein sequence by randomly flipping each contiguous subsequence.

ProteinAugmentSwapSubstrs

Augment a protein sequence by randomly swapping neighboring subsequences.

ReplaceByFullProteinSequence

A transform to replace short amino acid sequences with the full protein sequence.

SequenceToTokenIndexes

Transform Sequence to token indexes using Sequence language.

Functions:

extract_active_sites_info

Processes and extracts useful information from an aligned protein sequence.

verify_aligned_info

Verify that the sequence is aligned.

Reference

class SequenceToTokenIndexes(protein_language)[source]

Bases: pytoda.transforms.Transform

Transform Sequence to token indexes using Sequence language.

__init__(protein_language)[source]

Initialize a Sequence to token indexes object.

Parameters

protein_language (ProteinLanguage) – a Protein language.
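As a rough illustration of what mapping a sequence to token indexes involves, the sketch below uses a plain dictionary as the vocabulary. It does not use ProteinLanguage itself; the names `build_vocabulary`, `sequence_to_token_indexes`, and the `<UNK>` token are hypothetical and chosen only for this example.

```python
from typing import Dict, List

UNKNOWN_TOKEN = "<UNK>"  # hypothetical placeholder for residues outside the vocabulary
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_vocabulary(tokens: str) -> Dict[str, int]:
    """Map each token to an integer index, reserving index 0 for unknowns."""
    vocabulary = {UNKNOWN_TOKEN: 0}
    for token in tokens:
        vocabulary[token] = len(vocabulary)
    return vocabulary


def sequence_to_token_indexes(sequence: str, vocabulary: Dict[str, int]) -> List[int]:
    """Replace every residue with its index; unseen residues map to the unknown index."""
    unknown_index = vocabulary[UNKNOWN_TOKEN]
    return [vocabulary.get(residue, unknown_index) for residue in sequence]


vocabulary = build_vocabulary(AMINO_ACIDS)
indexes = sequence_to_token_indexes("ACDX", vocabulary)
```

The real transform delegates the vocabulary to the protein language passed to the constructor, so the indexes it produces will generally differ from the ones in this sketch.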

class ReplaceByFullProteinSequence(alignment_path)[source]

Bases: pytoda.transforms.Transform

A transform to replace short amino acid sequences with the full protein sequence. For example, replace active site sequence of a kinase with its full sequence.

__init__(alignment_path)[source]

Loads alignment info with two “classes” (or types) of residues.

Parameters

alignment_path (str) –

Path to a .smi or .tsv file that maps between shortened and full, aligned sequences. Do not use a header in the file.

NOTE: By convention, residues in upper case are important and will be kept, and residues in lower case are less important and are (usually) discarded.

NOTE: The first column has to be the full protein sequence (use upper case only for residues to be used). E.g., ggABCggDEFgg

NOTE: The second column has to be the condensed sequence (ABCDEF).

NOTE: The third column has to be a protein id (can be duplicated).
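The three-column layout described in the notes can be produced with the standard library. This is a minimal sketch that assumes the columns are tab-separated and that the condensed sequence is exactly the upper-case residues of the full sequence, as the ggABCggDEFgg example suggests. The helper names and the protein ids are hypothetical.

```python
import csv
import os
import tempfile


def condensed_from_aligned(aligned_seq):
    """Keep only the upper-case residues of an aligned sequence."""
    return "".join(residue for residue in aligned_seq if residue.isupper())


def write_alignment_file(rows, path):
    """Write (full sequence, condensed sequence, protein id) rows without a header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for aligned_seq, condensed, protein_id in rows:
            # Assumption: the condensed column equals the upper-case residues.
            if condensed_from_aligned(aligned_seq) != condensed:
                raise ValueError(f"Columns disagree for {protein_id}")
            writer.writerow([aligned_seq, condensed, protein_id])


# Hypothetical rows; the first one reuses the example sequence from the notes.
rows = [
    ("ggABCggDEFgg", "ABCDEF", "protein_1"),
    ("kkLMNkkkPQkk", "LMNPQ", "protein_2"),
]
path = os.path.join(tempfile.mkdtemp(), "alignment.tsv")
write_alignment_file(rows, path)
```

The resulting path could then be passed as alignment_path. Checking the two sequence columns against each other before writing catches rows where the case convention was not applied.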

extract_active_sites_info(aligned_seq)[source]

Processes and extracts useful information from an aligned protein sequence. Expects lower case amino acids to be outside of the relevant area (e.g., active site) and upper case amino acids to be inside it.

Parameters

aligned_seq (str) – A string containing the aligned protein sequence, including lower case and upper case amino acids.

Returns

A 4-tuple of:

aligned_seq (str): The input sequence.

non_active_sites (List[str]): A list of strings, one item for each contiguous subsequence NOT belonging to the active site.

active_sites (List[str]): A list of strings, one item for each contiguous subsequence belonging to the active site.

all_seqs (List[str]): A list of strings, one item for each contiguous subsequence, whether or not it belongs to the active site.

Return type

Tuple[str, List[str], List[str], List[str]]
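To make the four return values concrete, the following sketch splits a sequence into runs of equal case with itertools.groupby. It is an independent illustration of the documented behaviour, not the library's implementation, and the function name `split_by_case` is hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def split_by_case(aligned_seq: str) -> Tuple[str, List[str], List[str], List[str]]:
    """Split an aligned sequence into contiguous upper-case and lower-case runs."""
    non_active_sites: List[str] = []
    active_sites: List[str] = []
    all_seqs: List[str] = []
    # groupby yields one group per run of consecutive characters with the same case.
    for is_upper, group in groupby(aligned_seq, key=str.isupper):
        chunk = "".join(group)
        all_seqs.append(chunk)
        if is_upper:
            active_sites.append(chunk)
        else:
            non_active_sites.append(chunk)
    return aligned_seq, non_active_sites, active_sites, all_seqs


result = split_by_case("ggABCggDEFgg")
# result == ("ggABCggDEFgg", ["gg", "gg", "gg"], ["ABC", "DEF"],
#            ["gg", "ABC", "gg", "DEF", "gg"])
```

Joining all_seqs in order reproduces the input sequence, which is a convenient sanity check when building augmentations on top of the split.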

verify_aligned_info(sequence)[source]

Verify that the sequence is aligned.

Parameters

sequence (str) – An amino acid sequence.

Raises

Exception – If alignment could not be detected.

Return type

None

class ProteinAugmentFlipSubstrs(p=0.5)[source]

Bases: pytoda.transforms.Transform

Augment a protein sequence by randomly flipping each contiguous subsequence.

__init__(p=0.5)[source]

Parameters

p (float) – Probability that a subsequence is reversed.
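The effect of p can be illustrated with a small standalone sketch. It assumes that a "contiguous subsequence" is a run of residues of the same case and that every run is reversed independently with probability p; the library may select subsequences differently. The function name `flip_substrs` and the `rng` argument are hypothetical.

```python
import random
from itertools import groupby


def flip_substrs(aligned_seq, p=0.5, rng=None):
    """Reverse each same-case run of the sequence independently with probability p."""
    if rng is None:
        rng = random.Random()
    # Assumption: subsequences are runs of consecutive residues with the same case.
    chunks = ["".join(group) for _, group in groupby(aligned_seq, key=str.isupper)]
    flipped = [chunk[::-1] if rng.random() < p else chunk for chunk in chunks]
    return "".join(flipped)


unchanged = flip_substrs("ggABCggDEFgg", p=0.0)    # no run is reversed
all_flipped = flip_substrs("ggABCggDEFgg", p=1.0)  # every run is reversed
```

With p=0.0 the input comes back unchanged, and with p=1.0 every run is reversed in place, so the positions of the runs within the sequence never change.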

class MutateResidues(mutate_upper=0.01, mutate_lower=0.1)[source]

Bases: pytoda.transforms.Transform

Augment a protein sequence by injecting (possibly different) noise into residues inside and outside the relevant part (e.g., active site). NOTE: Noise means single-residue point mutations.

__init__(mutate_upper=0.01, mutate_lower=0.1)[source]

Parameters
  • mutate_lower (float) – probability for mutating lowercase residues.

  • mutate_upper (float) – probability for mutating uppercase residues.
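A minimal sketch of case-dependent point mutation is shown below. It assumes that a mutation replaces a residue with one drawn uniformly from the 20 standard amino acids and that the original case is kept; how the library picks the replacement is not stated here. The function name `mutate_residues`, the `AMINO_ACIDS` constant, and the `rng` argument are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mutate_residues(aligned_seq, mutate_upper=0.01, mutate_lower=0.1, rng=None):
    """Apply single-residue point mutations with a case-dependent probability."""
    if rng is None:
        rng = random.Random()
    mutated = []
    for residue in aligned_seq:
        probability = mutate_upper if residue.isupper() else mutate_lower
        if residue.isalpha() and rng.random() < probability:
            # Assumption: uniform replacement; it may coincide with the original residue.
            replacement = rng.choice(AMINO_ACIDS)
            residue = replacement if residue.isupper() else replacement.lower()
        mutated.append(residue)
    return "".join(mutated)


untouched = mutate_residues("ggABCggDEFgg", mutate_upper=0.0, mutate_lower=0.0)
noisy = mutate_residues("ggABCggDEFgg", mutate_upper=1.0, mutate_lower=1.0)
```

Because each residue is replaced one-for-one and keeps its case, the length of the sequence and the boundaries between upper-case and lower-case regions are preserved.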

class ProteinAugmentSwapSubstrs(p=0.2)[source]

Bases: pytoda.transforms.Transform

Augment a protein sequence by randomly swapping neighboring subsequences.

__init__(p=0.2)[source]

Parameters

p (float) – Probability that any subsequence swaps places with its neighbour.
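One way to picture neighbour swapping is the sketch below. It assumes that subsequences are runs of residues of the same case, that each run is swapped with its right-hand neighbour with probability p, and that a swapped pair is not swapped again; the library may define neighbours differently. The function name `swap_substrs` and the `rng` argument are hypothetical.

```python
import random
from itertools import groupby


def swap_substrs(aligned_seq, p=0.2, rng=None):
    """Swap neighbouring same-case runs of the sequence with probability p."""
    if rng is None:
        rng = random.Random()
    # Assumption: subsequences are runs of consecutive residues with the same case.
    chunks = ["".join(group) for _, group in groupby(aligned_seq, key=str.isupper)]
    index = 0
    while index < len(chunks) - 1:
        if rng.random() < p:
            chunks[index], chunks[index + 1] = chunks[index + 1], chunks[index]
            index += 2  # skip the pair that was just swapped
        else:
            index += 1
    return "".join(chunks)


same = swap_substrs("ggABCggDEFgg", p=0.0)     # no pair is swapped
swapped = swap_substrs("ggABCggDEFgg", p=1.0)  # every disjoint pair is swapped
```

With p=1.0 the runs gg, ABC, gg, DEF, gg become ABC, gg, DEF, gg, gg, so the example input turns into ABCggDEFgggg. Swapping only reorders whole runs, so the residue content of the sequence is unchanged.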