pytoda.proteins.protein_language module

Protein language handling.

Summary

Classes:

ProteinLanguage

ProteinLanguage class.

Reference

class ProteinLanguage(name='protein-language', amino_acid_dict='iupac', tokenizer=<class 'list'>, add_start_and_stop=True)

Bases: object

ProteinLanguage class.

ProteinLanguage handles protein data, defining the vocabulary and the utilities to manipulate it.

unknown_token = '<UNK>'

__init__(name='protein-language', amino_acid_dict='iupac', tokenizer=<class 'list'>, add_start_and_stop=True)

Initialize the protein language.

Parameters
  • name (str) – name of the ProteinLanguage.

  • amino_acid_dict (str) – tokenization regime for the amino acid sequence. Defaults to ‘iupac’; alternatives are ‘unirep’ and ‘human-kinase-alignment’.

  • tokenizer (Tokenizer) – a function used to tokenize the amino acid sequences. Defaults to list, which splits the sequence character by character.

  • add_start_and_stop (bool) – add <START> and <STOP> tokens to the sequence of tokens. Defaults to True.
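As a rough illustration of how the tokenizer and add_start_and_stop parameters interact, the sketch below splits a sequence with list and wraps the result in start and stop tokens. It does not call pytoda: the helper name tokenize_sequence is hypothetical, and the token strings are taken from the parameter description above.

```python
from typing import Callable, List

# Token strings taken from the add_start_and_stop description.
START_TOKEN = "<START>"
STOP_TOKEN = "<STOP>"


def tokenize_sequence(
    sequence: str,
    tokenizer: Callable[[str], List[str]] = list,
    add_start_and_stop: bool = True,
) -> List[str]:
    """Hypothetical helper: tokenize a sequence and optionally wrap it."""
    tokens = list(tokenizer(sequence))
    if add_start_and_stop:
        tokens = [START_TOKEN] + tokens + [STOP_TOKEN]
    return tokens


tokens = tokenize_sequence("MKV")
print(tokens)  # ['<START>', 'M', 'K', 'V', '<STOP>']
```

Passing add_start_and_stop=False in this sketch returns only the character-level tokens.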

setup_dict()

Set up the dictionary.

Return type

None

static load(filepath)

Static method to load a ProteinLanguage object.

Parameters

filepath (str) – path to the file.

Returns

the loaded ProteinLanguage object.

Return type

ProteinLanguage

static dump(protein_language, filepath)

Static method to save a ProteinLanguage object to disk.

Parameters
  • protein_language (ProteinLanguage) – a ProteinLanguage object.

  • filepath (str) – path where to dump the ProteinLanguage.

save(filepath)

Instance method to save (dump) the ProteinLanguage object.

Parameters

filepath (str) – path where to save the ProteinLanguage.
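The relationship between load, dump and save can be sketched with a toy class in which the instance method delegates to the static one. ToyLanguage is a hypothetical stand-in, not the pytoda class, and the JSON on-disk format is an assumption made for this sketch; the reference above does not state which serialization format is used.

```python
import json
import os
import tempfile
from typing import Dict, Optional


class ToyLanguage:
    """Hypothetical stand-in used only for this sketch; not the pytoda class."""

    def __init__(
        self,
        name: str = "toy-language",
        token_to_index: Optional[Dict[str, int]] = None,
    ) -> None:
        self.name = name
        self.token_to_index = token_to_index or {"<UNK>": 0}

    @staticmethod
    def load(filepath: str) -> "ToyLanguage":
        # Rebuild the object from the state stored on disk.
        with open(filepath) as handle:
            state = json.load(handle)
        return ToyLanguage(state["name"], state["token_to_index"])

    @staticmethod
    def dump(language: "ToyLanguage", filepath: str) -> None:
        # JSON is an assumption of this sketch, not a documented format.
        state = {"name": language.name, "token_to_index": language.token_to_index}
        with open(filepath, "w") as handle:
            json.dump(state, handle)

    def save(self, filepath: str) -> None:
        # The instance method delegates to the static dump.
        ToyLanguage.dump(self, filepath)


language = ToyLanguage()
language.token_to_index["M"] = 1

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "language.json")
    language.save(path)
    restored = ToyLanguage.load(path)

print(restored.token_to_index)  # {'<UNK>': 0, 'M': 1}
```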

add_file(filepath, file_type='.smi', index_col=1, chunk_size=100000)

Add a set of protein sequences from a file.

Parameters
  • filepath (str) – path to the file.

  • file_type (str) – type of file, from {‘.smi’, ‘.csv’, ‘.fasta’, ‘.fasta.gz’}. If ‘.csv’ is selected, the file is assumed to be tab-separated.

  • chunk_size (int) – number of rows to read in a chunk. Defaults to 100000. Does not apply to fasta files.

  • index_col (int) – data column used for indexing. Defaults to 1. Does not apply to fasta files.

Return type

None
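To make the tab-separated case concrete, the following sketch reads sequences from a small temporary file with the standard library and collects the characters it sees. It does not call pytoda. The reader name read_sequences is hypothetical, and the layout (sequence in the first column, identifier in the column matching index_col=1) is an assumption rather than something the reference above states.

```python
import csv
import os
import tempfile
from typing import Iterator


def read_sequences(filepath: str, sequence_col: int = 0) -> Iterator[str]:
    """Hypothetical reader: yield one sequence per tab-separated row."""
    with open(filepath, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row:  # skip empty lines
                yield row[sequence_col]


vocabulary = set()
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "proteins.smi")
    # Assumed layout: sequence, then identifier, separated by a tab.
    with open(path, "w", newline="") as handle:
        handle.write("MKV\tprotein_1\nMKL\tprotein_2\n")
    for sequence in read_sequences(path):
        vocabulary.update(list(sequence))

print(sorted(vocabulary))  # ['K', 'L', 'M', 'V']
```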

add_sequence(sequence)

Add an amino acid sequence to the language.

Parameters

sequence (str) – a sequence of amino acids.

Return type

None

sequence_to_token_indexes_generator(sequence)

Transform an amino acid sequence (AAS) into token indexes using a generator.

Parameters

sequence (str) – an AAS representation.

Yields

Generator[int] – The generator of token indexes.

Return type

Iterator[int]
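A generator of this kind can be sketched as follows. This is not the pytoda implementation: the vocabulary is a hypothetical toy mapping (the real one depends on amino_acid_dict), and mapping unseen tokens to the index of '<UNK>' is an assumption suggested by the unknown_token attribute.

```python
from typing import Dict, Iterator

UNKNOWN_TOKEN = "<UNK>"  # same string as the documented unknown_token
# Hypothetical toy vocabulary; the real mapping depends on amino_acid_dict.
TOKEN_TO_INDEX: Dict[str, int] = {UNKNOWN_TOKEN: 0, "M": 1, "K": 2, "V": 3}


def token_indexes_generator(sequence: str) -> Iterator[int]:
    """Lazily yield one index per character-level token."""
    for token in list(sequence):
        # Assumption: tokens outside the vocabulary fall back to <UNK>.
        yield TOKEN_TO_INDEX.get(token, TOKEN_TO_INDEX[UNKNOWN_TOKEN])


indexes = list(token_indexes_generator("MKVX"))
print(indexes)  # [1, 2, 3, 0]
```

Because the indexes are produced lazily, a caller can consume them one at a time without building the full list first.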

sequence_to_token_indexes(sequence)

Transform a character-level amino acid sequence (AAS) into a sequence of token indexes.

Parameters

sequence (str) – an AAS representation.

Returns

index representation of the provided AAS.

Return type

Indexes

token_indexes_to_sequence(token_indexes)

Transform a sequence of token indexes into an amino acid sequence.

Parameters

token_indexes (Indexes) – a sequence of token indexes.

Returns

an amino acid sequence representation.

Return type

str
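The encode and decode directions are inverses of each other for tokens in the vocabulary, which the sketch below shows with a toy mapping. The names encode and decode and the vocabulary are hypothetical, and the sketch leaves out start and stop tokens because the reference above does not say how decoding treats them.

```python
from typing import Dict, List

# Hypothetical toy vocabulary and its inverse.
TOKEN_TO_INDEX: Dict[str, int] = {"<UNK>": 0, "M": 1, "K": 2, "V": 3}
INDEX_TO_TOKEN: Dict[int, str] = {
    index: token for token, index in TOKEN_TO_INDEX.items()
}


def encode(sequence: str) -> List[int]:
    """Map each character to its index, falling back to <UNK>."""
    return [TOKEN_TO_INDEX.get(token, TOKEN_TO_INDEX["<UNK>"]) for token in sequence]


def decode(token_indexes: List[int]) -> str:
    """Join the tokens looked up for each index."""
    return "".join(INDEX_TO_TOKEN[index] for index in token_indexes)


encoded = encode("MKV")
decoded = decode(encoded)
print(encoded, decoded)  # [1, 2, 3] MKV
```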

property method

A string denoting the language method.

Return type

str