pytoda.proteins.protein_language module¶

Protein language handling.

Summary¶

Classes:

ProteinLanguage

ProteinLanguage class.

Reference¶

class ProteinLanguage(name='protein-language', amino_acid_dict='iupac', tokenizer=<class 'list'>, add_start_and_stop=True)[source]¶

Bases: object

ProteinLanguage class.

ProteinLanguage handles protein data, defining the vocabulary and the utilities to manipulate it.

unknown_token = '<UNK>'¶

__init__(name='protein-language', amino_acid_dict='iupac', tokenizer=<class 'list'>, add_start_and_stop=True)[source]¶

Initialize Protein language.

Parameters

name (str) – name of the ProteinLanguage.
amino_acid_dict (str) – Tokenization regime for the amino acid sequence. Defaults to ‘iupac’; alternatives are ‘unirep’ and ‘human-kinase-alignment’.
tokenizer (Tokenizer) – A function used to tokenize the amino acid sequences. The default is list, which simply splits the sequence character by character.
add_start_and_stop (bool) – whether to add <START> and <STOP> tokens to the sequence of tokens. Defaults to True.
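To make the tokenizer and add_start_and_stop parameters concrete, here is a minimal standalone sketch. It does not import pytoda and is not the library's implementation; the tokenize helper is hypothetical and only mirrors the behaviour described above (character-level splitting with list, optionally wrapped in <START> and <STOP>).

```python
# Hypothetical sketch, not pytoda code: mimics the documented defaults.
START, STOP = "<START>", "<STOP>"


def tokenize(sequence, tokenizer=list, add_start_and_stop=True):
    """Split a sequence into tokens and optionally wrap it in start/stop tokens."""
    tokens = list(tokenizer(sequence))  # list("MKV") -> ["M", "K", "V"]
    if add_start_and_stop:
        tokens = [START] + tokens + [STOP]
    return tokens


wrapped = tokenize("MKV")
plain = tokenize("MKV", add_start_and_stop=False)
print(wrapped)
print(plain)
```

Any callable that maps a string to a sequence of tokens could be passed as tokenizer in this sketch; list is simply the character-level case.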

setup_dict()[source]¶

Set up the dictionary.

Return type: None

static load(filepath)[source]¶

Static method to load a ProteinLanguage object.

Parameters: filepath (str) – path to the file.
Returns: the loaded Protein language object.
Return type: ProteinLanguage

static dump(protein_language, filepath)[source]¶

Static method to save a ProteinLanguage object to disk.

Parameters

protein_language (ProteinLanguage) – a ProteinLanguage object.
filepath (str) – path where to dump the ProteinLanguage.

save(filepath)[source]¶

Instance method to save/dump Protein language object.

Parameters: filepath (str) – path where to save the ProteinLanguage.
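The relationship between load, dump and save (two static methods plus an instance method that saves the object itself) can be sketched with a stand-in class. This is an illustration under stated assumptions: ToyLanguage is hypothetical, and the on-disk format used here (JSON) is chosen only for the example, since this page does not say how ProteinLanguage is serialized.

```python
import json
import os
import tempfile


class ToyLanguage:
    """Hypothetical stand-in for a language object; not pytoda's class."""

    def __init__(self, name="toy-language", token_to_index=None):
        self.name = name
        self.token_to_index = token_to_index or {"<UNK>": 0}

    @staticmethod
    def load(filepath):
        # Rebuild the object from the file written by dump().
        with open(filepath) as fp:
            state = json.load(fp)
        return ToyLanguage(state["name"], state["token_to_index"])

    @staticmethod
    def dump(language, filepath):
        # Assumption: JSON is used here purely for illustration.
        state = {"name": language.name, "token_to_index": language.token_to_index}
        with open(filepath, "w") as fp:
            json.dump(state, fp)

    def save(self, filepath):
        # The instance method delegates to the static one.
        ToyLanguage.dump(self, filepath)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "language.json")
    ToyLanguage("demo", {"<UNK>": 0, "M": 1}).save(path)
    restored = ToyLanguage.load(path)
```

With the real class, the equivalent calls would be the documented ones: save(filepath) on an instance, or ProteinLanguage.dump(protein_language, filepath), followed by ProteinLanguage.load(filepath).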

add_file(filepath, file_type='.smi', index_col=1, chunk_size=100000)[source]¶

Add a set of protein sequences from a file.

Parameters

filepath (str) – path to the file.
file_type (str) – Type of file, from {‘.smi’, ‘.csv’, ‘.fasta’, ‘.fasta.gz’}. If ‘.csv’ is selected, it is assumed to be tab-separated.
chunk_size (int) – number of rows to read in a chunk. Defaults to 100000. Does not apply for fasta files.
index_col (int) – Data column used for indexing, defaults to 1, does not apply to fasta files.

Return type: None
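For the ‘.fasta’ case, the sequences in a file have to be extracted record by record before they can be added to the language. The sketch below shows one way to do that with the standard library only. The read_fasta_sequences helper is hypothetical and is not how pytoda reads FASTA files; it only illustrates what "a set of protein sequences from a file" means for this format.

```python
import io


def read_fasta_sequences(handle):
    """Yield one sequence string per FASTA record (hypothetical helper)."""
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if chunks:
                yield "".join(chunks)
            chunks = []
        else:
            # Sequence lines of one record may span several lines.
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


# In-memory example with two made-up records.
example = io.StringIO(">record_a\nMKV\nLA\n>record_b\nGG\n")
sequences = list(read_fasta_sequences(example))
print(sequences)
```

Each yielded string could then be passed to add_sequence; with the real library, add_file(filepath, file_type='.fasta') covers this in one call.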

add_sequence(sequence)[source]¶

Add an amino acid sequence to the language.

Parameters: sequence (str) – a sequence of amino acids.
Return type: None

sequence_to_token_indexes_generator(sequence)[source]¶

Transform an amino acid sequence (AAS) into token indexes using a generator.

Parameters: sequence (str) – an AAS representation.
Yields: Generator[int] – The generator of token indexes.
Return type: Iterator[int]
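The generator variant produces indexes lazily, one token at a time, instead of building the whole list up front. The following standalone sketch shows the idea. TOY_VOCAB and its index values are made up for the example, and mapping unseen tokens to <UNK> is an assumption suggested by the unknown_token attribute, not documented behaviour.

```python
# Hypothetical vocabulary; the real indexes depend on amino_acid_dict.
TOY_VOCAB = {"<UNK>": 0, "M": 1, "K": 2, "V": 3}


def token_indexes_generator(sequence, vocab=TOY_VOCAB, tokenizer=list):
    """Lazily yield one index per token (sketch, not pytoda's code)."""
    for token in tokenizer(sequence):
        # Assumption: tokens missing from the vocabulary fall back to <UNK>.
        yield vocab.get(token, vocab["<UNK>"])


indexes = token_indexes_generator("MKX")
first = next(indexes)  # only the first token has been processed so far
rest = list(indexes)   # consume the remaining tokens
```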

sequence_to_token_indexes(sequence)[source]¶

Transform character-level amino acid sequence (AAS) into a sequence of token indexes.

Parameters: sequence (str) – an AAS representation.
Returns: indexes representation for the AAS provided.
Return type: Indexes

token_indexes_to_sequence(token_indexes)[source]¶

Transform a sequence of token indexes into an amino acid sequence.

Parameters: token_indexes (Indexes) – a sequence of token indexes.
Returns: an amino acid sequence representation.
Return type: str
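sequence_to_token_indexes and token_indexes_to_sequence are inverse operations, which a small round trip makes visible. The sketch below is self-contained and hypothetical: the vocabulary, the encode and decode helpers, and the choice to drop <START> and <STOP> when decoding are all assumptions made for the example, not statements about pytoda's behaviour.

```python
# Hypothetical vocabulary and helpers; not pytoda's implementation.
TOKEN_TO_INDEX = {"<UNK>": 0, "<START>": 1, "<STOP>": 2, "M": 3, "K": 4, "V": 5}
INDEX_TO_TOKEN = {index: token for token, index in TOKEN_TO_INDEX.items()}
SPECIAL_TOKENS = {"<START>", "<STOP>"}


def encode(sequence):
    """Map an amino acid sequence to a list of token indexes."""
    tokens = ["<START>"] + list(sequence) + ["<STOP>"]
    return [TOKEN_TO_INDEX.get(t, TOKEN_TO_INDEX["<UNK>"]) for t in tokens]


def decode(token_indexes):
    """Map token indexes back to a sequence string."""
    tokens = (INDEX_TO_TOKEN[i] for i in token_indexes)
    # Assumption: start/stop markers are not part of the decoded sequence.
    return "".join(t for t in tokens if t not in SPECIAL_TOKENS)


encoded = encode("MKV")
decoded = decode(encoded)
```

In this sketch the round trip is lossless only for residues present in the vocabulary; anything mapped to <UNK> cannot be recovered on decoding.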

property method¶

A string denoting the language method.

Return type: str