pytoda.proteins.protein_language module
Protein language handling.
Reference
-
class ProteinLanguage(name='protein-language', amino_acid_dict='iupac', tokenizer=<class 'list'>, add_start_and_stop=True)
Bases: object
ProteinLanguage class.
ProteinLanguage handles protein data, defining the vocabulary and the utilities to manipulate it.
-
unknown_token = '<UNK>'
-
__init__(name='protein-language', amino_acid_dict='iupac', tokenizer=<class 'list'>, add_start_and_stop=True)
Initialize Protein language.
- Parameters
name (str) – name of the ProteinLanguage.
amino_acid_dict (str) – Tokenization regime for the amino acid sequence. Defaults to ‘iupac’; alternatives are ‘unirep’ and ‘human-kinase-alignment’.
tokenizer (Tokenizer) – A function used to tokenize the amino acid sequences. Defaults to list, which splits the sequence character by character.
add_start_and_stop (bool) – whether to add <START> and <STOP> to the sequence of tokens. Defaults to True.
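To illustrate how the tokenizer and add_start_and_stop parameters interact, here is a minimal standalone sketch. It does not import pytoda; the helper name tokenize and the constants START_TOKEN and STOP_TOKEN are hypothetical, with the token spellings taken from the parameter description above.

```python
from typing import Callable, List

# Token spellings as given in the add_start_and_stop description;
# the constant names themselves are hypothetical.
START_TOKEN = "<START>"
STOP_TOKEN = "<STOP>"


def tokenize(
    sequence: str,
    tokenizer: Callable[[str], List[str]] = list,
    add_start_and_stop: bool = True,
) -> List[str]:
    """Split a sequence into tokens and optionally wrap it in boundary tokens."""
    # The default tokenizer, list, splits the string character by character.
    tokens = list(tokenizer(sequence))
    if add_start_and_stop:
        tokens = [START_TOKEN] + tokens + [STOP_TOKEN]
    return tokens


tokens = tokenize("MKV")
plain_tokens = tokenize("MKV", add_start_and_stop=False)
```

Any callable that maps a string to a list of tokens could be passed in place of list, for example a tokenizer that groups residues into k-mers.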
-
static load(filepath)
Static method to load a ProteinLanguage object.
- Parameters
filepath (str) – path to the file.
- Returns
the loaded Protein language object.
- Return type
ProteinLanguage
-
static dump(protein_language, filepath)
Static method to save a ProteinLanguage object to disk.
- Parameters
protein_language (ProteinLanguage) – a ProteinLanguage object.
filepath (str) – path where to dump the ProteinLanguage.
-
save(filepath)
Instance method to save/dump the Protein language object.
- Parameters
filepath (str) – path where to save the ProteinLanguage.
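The relationship between load, dump and save can be sketched on a toy class. This is only an illustration of the static-method/instance-method pattern: ToyLanguage is hypothetical, and JSON is used here as a stand-in because this page does not state which on-disk format pytoda uses.

```python
import json
import os
import tempfile
from typing import Dict


class ToyLanguage:
    """Hypothetical stand-in for illustration; not pytoda's ProteinLanguage."""

    def __init__(self, name: str = "toy-language") -> None:
        self.name = name
        self.token_to_index: Dict[str, int] = {"<UNK>": 0}

    @staticmethod
    def load(filepath: str) -> "ToyLanguage":
        """Rebuild an object from a file written by dump."""
        with open(filepath, "r", encoding="utf-8") as handle:
            state = json.load(handle)
        language = ToyLanguage(name=state["name"])
        language.token_to_index = state["token_to_index"]
        return language

    @staticmethod
    def dump(language: "ToyLanguage", filepath: str) -> None:
        """Write the object's state to disk (JSON is an assumption of this sketch)."""
        state = {"name": language.name, "token_to_index": language.token_to_index}
        with open(filepath, "w", encoding="utf-8") as handle:
            json.dump(state, handle)

    def save(self, filepath: str) -> None:
        """Instance-level convenience wrapper around the static dump."""
        ToyLanguage.dump(self, filepath)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_language.json")
    original = ToyLanguage(name="demo")
    original.token_to_index["A"] = 1
    original.save(path)
    restored = ToyLanguage.load(path)
```

In this sketch save simply delegates to dump, so both entry points write the same file.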
-
add_file(filepath, file_type='.smi', index_col=1, chunk_size=100000)
Add a set of protein sequences from a file.
- Parameters
filepath (str) – path to the file.
file_type (str) – Type of file, from {‘.smi’, ‘.csv’, ‘.fasta’, ‘.fasta.gz’}. If ‘.csv’ is selected, the file is assumed to be tab-separated.
chunk_size (int) – number of rows to read in a chunk. Defaults to 100000. Does not apply to fasta files.
index_col (int) – Data column used for indexing. Defaults to 1. Does not apply to fasta files.
- Return type
None
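The chunk_size and index_col parameters describe chunked reading of a tab-separated file. The sketch below shows that idea with the standard library only. It is not pytoda's implementation: the helper read_in_chunks is hypothetical, and the two-column layout (one sequence column, one index column) is an assumption made for the example.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO, Tuple


def read_in_chunks(
    handle: TextIO, index_col: int = 1, chunk_size: int = 100000
) -> Iterator[List[Tuple[str, str]]]:
    """Yield lists of (index, sequence) pairs, at most chunk_size rows each."""
    # Assumption of this sketch: exactly two columns, so the sequence
    # lives in whichever column is not used for indexing.
    sequence_col = 1 - index_col
    reader = csv.reader(handle, delimiter="\t")
    while True:
        rows = list(islice(reader, chunk_size))
        if not rows:
            break
        yield [(row[index_col], row[sequence_col]) for row in rows]


# In-memory stand-in for a small tab-separated file.
example_file = io.StringIO("MKV\tprot_1\nACD\tprot_2\nGHI\tprot_3\n")
chunks = list(read_in_chunks(example_file, index_col=1, chunk_size=2))
```

Reading in chunks keeps memory use bounded by chunk_size rows rather than by the size of the whole file.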
-
add_sequence(sequence)
Add an amino acid sequence to the language.
- Parameters
sequence (str) – a sequence of amino acids.
- Return type
None
-
sequence_to_token_indexes_generator(sequence)
Transform tokens into indexes using a generator.
- Parameters
sequence (str) – an amino acid sequence (AAS) representation.
- Yields
Generator[int] – The generator of token indexes.
- Return type
Iterator[int]
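A generator-based lookup of this kind might look as follows. This is a standalone sketch under stated assumptions: TOY_VOCABULARY and token_indexes_generator are hypothetical, the index values are made up, and mapping unseen tokens to the index of the unknown token is assumed behaviour rather than something this page confirms.

```python
from typing import Dict, Iterator

UNKNOWN_TOKEN = "<UNK>"  # same value as the documented unknown_token attribute

# Hypothetical toy vocabulary; real indexes depend on the chosen amino_acid_dict.
TOY_VOCABULARY: Dict[str, int] = {UNKNOWN_TOKEN: 0, "A": 1, "C": 2, "D": 3}


def token_indexes_generator(
    sequence: str, vocabulary: Dict[str, int]
) -> Iterator[int]:
    """Lazily yield one index per character of the sequence."""
    unknown_index = vocabulary[UNKNOWN_TOKEN]
    for token in list(sequence):
        # Assumption: tokens missing from the vocabulary fall back to <UNK>.
        yield vocabulary.get(token, unknown_index)


generator = token_indexes_generator("ACXD", TOY_VOCABULARY)
first_index = next(generator)   # indexes are produced on demand
remaining = list(generator)     # consume the rest
```

Because the indexes are produced lazily, a caller can stop early or stream very long sequences without building the full list.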
-
sequence_to_token_indexes(sequence)
Transform character-level amino acid sequence (AAS) into a sequence of token indexes.
- Parameters
sequence (str) – an AAS representation.
- Returns
indexes representation for the AAS provided.
- Return type
Indexes
-
token_indexes_to_sequence(token_indexes)
Transform a sequence of token indexes into an amino acid sequence.
- Parameters
token_indexes (Indexes) – a sequence of token indexes.
- Returns
an amino acid sequence representation.
- Return type
str
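Decoding is the inverse lookup: each index is mapped back to its token and the tokens are joined into a string. The sketch below uses a hypothetical toy vocabulary and a hypothetical helper indexes_to_sequence; how pytoda itself treats special tokens during decoding is not stated on this page.

```python
from typing import Dict, Sequence

# Hypothetical toy vocabulary and its inverse mapping.
TOY_TOKEN_TO_INDEX: Dict[str, int] = {"<UNK>": 0, "A": 1, "C": 2, "D": 3}
TOY_INDEX_TO_TOKEN: Dict[int, str] = {
    index: token for token, index in TOY_TOKEN_TO_INDEX.items()
}


def indexes_to_sequence(
    token_indexes: Sequence[int], index_to_token: Dict[int, str]
) -> str:
    """Map each index back to its token and concatenate the result."""
    return "".join(index_to_token[index] for index in token_indexes)


decoded = indexes_to_sequence([1, 2, 3], TOY_INDEX_TO_TOKEN)
# An unknown index decodes to the placeholder token, so the round trip
# is lossy for residues that were not in the vocabulary.
lossy = indexes_to_sequence([1, 0, 3], TOY_INDEX_TO_TOKEN)
```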
-
property method
A string denoting the language method.
- Return type
str