biotite.sequence.align#

This subpackage provides functionality for sequence alignments.

The two central classes involved are SubstitutionMatrix and Alignment:

Every function that performs an alignment requires a SubstitutionMatrix that provides similarity scores for each symbol combination of two alphabets (usually both alphabets are equal). The alphabets in the SubstitutionMatrix must match or extend the alphabets of the sequences to be aligned.

An alignment cannot be directly represented as a list of Sequence objects, since a gap indicates the absence of any symbol. Instead, the aligning functions return one or more Alignment instances. These objects contain the original sequences and a trace that describes which positions (indices) in the sequences are aligned. Optionally, they also contain the similarity score.
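To make the idea of a trace concrete, here is a minimal plain-Python sketch. It does not use Biotite's Alignment class; the helper `render_trace` and the gap marker `GAP = -1` are assumptions made for illustration. Each trace row corresponds to one alignment column and holds one sequence index per sequence.

```python
# Plain-Python illustration of an alignment trace (not Biotite's Alignment class).
GAP = -1  # assumed marker for "no symbol of this sequence in this column"


def render_trace(sequences, trace):
    """Turn a trace into gapped strings, one string per sequence."""
    rows = []
    for seq_index, sequence in enumerate(sequences):
        symbols = []
        for column in trace:
            position = column[seq_index]
            symbols.append("-" if position == GAP else sequence[position])
        rows.append("".join(symbols))
    return rows


sequences = ["BIQTITE", "IQLITE"]
# One row per alignment column, one entry per sequence
trace = [
    (0, GAP),
    (1, 0),
    (2, 1),
    (3, 2),
    (4, 3),
    (5, 4),
    (6, 5),
]
gapped = render_trace(sequences, trace)
print("\n".join(gapped))
```

Because the trace stores indices rather than symbols, the original sequences stay untouched and gaps never have to be encoded as symbols.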

The aligning functions align_optimal() and align_multiple() cover most use cases for pairwise and multiple sequence alignments, respectively.

However, Biotite also provides a modular system for building performant heuristic alignment search methods, e.g. for finding homologies in a sequence database or mapping reads to a genome. The table below summarizes the provided functionalities, arranged from top to bottom in the order of the typical stages of an alignment search.

| Stage | Method | Functionality |
|---|---|---|
| k-mer subset selection | Entire k-mer set | |
| | Minimizers | MinimizerSelector |
| | Mincode | MincodeSelector |
| k-mer indexing and matching | Perfect hashing | KmerTable |
| | Space-efficient hashing | BucketKmerTable, bucket_number() |
| Ungapped seed extension | | align_local_ungapped() |
| Gapped alignment | Banded local/semiglobal alignment | align_banded() |
| | Local alignment (X-drop) | align_local_gapped() |
| Significance evaluation | | EValueEstimator |

Substitution matrices#

SubstitutionMatrix

A SubstitutionMatrix is the foundation for scoring in sequence alignments.
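As a rough illustration of what a substitution matrix does, the following plain-Python sketch builds a toy score table for a nucleotide alphabet and sums the scores of an ungapped alignment. It does not use the SubstitutionMatrix class; the scores `5` and `-4` and the helper `ungapped_score` are hypothetical choices for this example.

```python
# Toy substitution matrix in plain Python (not Biotite's SubstitutionMatrix).
alphabet = ["A", "C", "G", "T"]
match_score, mismatch_score = 5, -4  # hypothetical scores

# score_matrix[i][j] is the similarity score of alphabet[i] vs. alphabet[j]
score_matrix = [
    [match_score if a == b else mismatch_score for b in alphabet]
    for a in alphabet
]
# Map each symbol to its row/column index in the matrix
symbol_to_code = {symbol: code for code, symbol in enumerate(alphabet)}


def ungapped_score(seq1, seq2):
    """Sum the substitution scores of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must have equal length")
    return sum(
        score_matrix[symbol_to_code[a]][symbol_to_code[b]]
        for a, b in zip(seq1, seq2)
    )


total = ungapped_score("ACGT", "ACTT")
print(total)
```

A symbol that is missing from `symbol_to_code` raises a `KeyError` here, which mirrors the requirement that the matrix alphabets must cover the alphabets of the sequences.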

Aligners#

align_ungapped

Align two sequences without insertion of gaps.

align_optimal

Perform an optimal alignment of two sequences based on a dynamic programming algorithm.

align_local_ungapped

Perform a local alignment extending from a given seed position without inserting gaps.

align_local_gapped

Perform a local gapped alignment extending from a given seed position.

align_banded

Perform a local or semi-global alignment within a defined diagonal band.

align_multiple

Perform a multiple sequence alignment using a progressive alignment algorithm.
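The idea behind ungapped seed extension can be sketched in plain Python as follows. This is a simplified illustration with an X-drop style stop criterion and fixed match/mismatch scores, not the implementation of align_local_ungapped(); the function `extend_ungapped` and all of its parameters are hypothetical.

```python
def extend_ungapped(seq1, seq2, seed, x_drop, match=1, mismatch=-1):
    """
    Extend a seed (i, j) along its diagonal in both directions.
    Extension in one direction stops when the running score falls
    more than x_drop below the best score seen in that direction.
    Returns (start1, start2, length, score) of the best ungapped region.
    """
    def extend(i, j, step):
        score = best_score = 0
        length = best_length = 0
        while 0 <= i < len(seq1) and 0 <= j < len(seq2):
            score += match if seq1[i] == seq2[j] else mismatch
            length += 1
            if score > best_score:
                best_score, best_length = score, length
            elif best_score - score > x_drop:
                break
            i += step
            j += step
        return best_score, best_length

    i, j = seed
    # Downstream extension includes the seed position itself
    right_score, right_length = extend(i, j, +1)
    # Upstream extension starts one position before the seed
    left_score, left_length = extend(i - 1, j - 1, -1)
    start1, start2 = i - left_length, j - left_length
    return start1, start2, left_length + right_length, left_score + right_score


seq1 = "GGGACGTACGGG"
seq2 = "TTTACGTACTTT"
start1, start2, length, score = extend_ungapped(seq1, seq2, seed=(5, 5), x_drop=2)
print(seq1[start1:start1 + length], seq2[start2:start2 + length], score)
```

In a seed-and-extend search, a cheap ungapped extension like this is typically used as a filter before the more expensive gapped alignment is attempted.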

Alignments#

Alignment

An Alignment object stores which symbols of n sequences are aligned to each other, along with the corresponding alignment score.

get_codes

Get the sequence codes of the sequences in the alignment.

get_symbols

Similar to get_codes(), but contains the decoded symbols instead of codes.

get_sequence_identity

Calculate the sequence identity for an alignment.

get_pairwise_sequence_identity

Calculate the pairwise sequence identity for an alignment.

score

Calculate the similarity score of an alignment.
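For orientation, sequence identity between two aligned sequences can be sketched in plain Python as below. The normalization used here (identical columns divided by columns without a gap) is only one of several possible conventions and is an assumption of this example; it is not necessarily what get_sequence_identity() computes by default. The helper name `sequence_identity` is hypothetical.

```python
def sequence_identity(gapped1, gapped2, gap="-"):
    """
    Fraction of identical symbols among all columns
    in which neither sequence has a gap.
    """
    if len(gapped1) != len(gapped2):
        raise ValueError("Gapped sequences must have equal length")
    identical = 0
    compared = 0
    for a, b in zip(gapped1, gapped2):
        if a == gap or b == gap:
            continue  # columns containing a gap are not counted
        compared += 1
        if a == b:
            identical += 1
    return identical / compared if compared else 0.0


identity = sequence_identity("BIQTITE", "-IQLITE")
print(f"{identity:.3f}")
```

When comparing identity values across tools, it is worth checking which columns enter the denominator, since that choice can change the result noticeably for alignments with many gaps.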

k-mers#

KmerAlphabet

This type of alphabet uses k-mers as symbols, i.e. all combinations of k symbols from its base alphabet.

KmerTable

This class represents a k-mer index table.

BucketKmerTable

This class represents a k-mer index table.

SimilarityRule

This is the abstract base class for all similarity rules.

ScoreThresholdRule

This similarity rule calculates all k-mers that have a greater or equal similarity score with a given k-mer than a defined threshold score.

bucket_number

Find an appropriate number of buckets for a BucketKmerTable based on the number of elements (i.e. k-mers) that should be stored in the table.
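Conceptually, a k-mer index maps each k-mer to the positions where it occurs, so that matches with a query can be looked up quickly. The plain-Python sketch below uses a dictionary for this purpose; it illustrates the idea only and does not reflect how KmerTable or BucketKmerTable are implemented. The functions `build_kmer_index` and `match_kmers` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map each k-mer to a list of (reference ID, position) tuples."""
    index = defaultdict(list)
    for ref_id, sequence in enumerate(sequences):
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((ref_id, pos))
    return index


def match_kmers(index, query, k):
    """Return (query position, reference ID, reference position) matches."""
    matches = []
    for query_pos in range(len(query) - k + 1):
        kmer = query[query_pos:query_pos + k]
        # .get() avoids creating empty entries for unknown k-mers
        for ref_id, ref_pos in index.get(kmer, []):
            matches.append((query_pos, ref_id, ref_pos))
    return matches


references = ["ACGTACGT", "TTACGA"]
index = build_kmer_index(references, k=4)
matches = match_kmers(index, "TACGT", k=4)
print(matches)
```

Each match can serve as a seed position for a subsequent extension step.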

k-mer subset selections#

MinimizerSelector

Selects the minimizers in sequences.

SyncmerSelector

Selects the syncmers in sequences.

CachedSyncmerSelector

Selects the syncmers in sequences.

MincodeSelector

Selects the \(1/\text{compression}\) smallest k-mers from KmerAlphabet.
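The principle of minimizer selection can be sketched in a few lines of plain Python: in every window of consecutive k-mers, the smallest k-mer is kept. This sketch uses lexicographic order and takes the leftmost k-mer on ties, which are assumptions of this example; it is not the algorithm used by MinimizerSelector, and `select_minimizers` with its `window` parameter (counted in k-mers) is hypothetical.

```python
def select_minimizers(sequence, k, window):
    """
    Return (position, k-mer) tuples of the minimizers of a sequence.
    A minimizer is the smallest k-mer (leftmost on ties) within a
    window of `window` consecutive k-mers.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    selected = []
    for start in range(len(kmers) - window + 1):
        window_kmers = kmers[start:start + window]
        # min() returns the first index on ties, i.e. the leftmost k-mer
        offset = min(range(window), key=lambda i: window_kmers[i])
        pos = start + offset
        # Consecutive windows often share their minimizer: report it once
        if not selected or selected[-1][0] != pos:
            selected.append((pos, kmers[pos]))
    return selected


minimizers = select_minimizers("GATTACA", k=3, window=3)
print(minimizers)
```

Since two sequences sharing a long enough stretch also share the minimizers of that stretch, indexing only the minimizers reduces the table size while keeping most matches discoverable.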

k-mer permutations#

Permutation

Provides an order for k-mers, usually used by k-mer subset selectors such as MinimizerSelector.

RandomPermutation

Provide a pseudo-randomized order for k-mers.

FrequencyPermutation

Provide an order for k-mers from a given KmerAlphabet, such that less frequent k-mers are smaller than more frequent k-mers.

CIGAR strings#

CigarOp

An enum for the different CIGAR operations.

read_alignment_from_cigar

Create an Alignment from a CIGAR string.

write_alignment_to_cigar

Convert an Alignment into a CIGAR string.
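To illustrate how an alignment relates to a CIGAR string, the plain-Python sketch below converts two gapped strings into run-length encoded operations. It only uses the common operations M (aligned column), I (insertion in the segment) and D (deletion in the segment), and it treats terminal gaps like any other gap, which real tools usually do not. The function `gapped_to_cigar` is hypothetical and is not part of Biotite.

```python
from itertools import groupby


def gapped_to_cigar(reference, segment, gap="-"):
    """Convert a pairwise alignment given as gapped strings into a CIGAR string."""
    if len(reference) != len(segment):
        raise ValueError("Gapped sequences must have equal length")
    operations = []
    for ref_symbol, seg_symbol in zip(reference, segment):
        if ref_symbol == gap and seg_symbol == gap:
            raise ValueError("Alignment column contains only gaps")
        if ref_symbol == gap:
            operations.append("I")  # symbol only present in the segment
        elif seg_symbol == gap:
            operations.append("D")  # symbol only present in the reference
        else:
            operations.append("M")  # aligned column (match or mismatch)
    # Run-length encode consecutive identical operations
    return "".join(
        f"{len(list(group))}{op}" for op, group in groupby(operations)
    )


cigar = gapped_to_cigar("ACGT-ACGT", "ACGTTAC-T")
print(cigar)
```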

Miscellaneous#

EValueEstimator

This class is used to calculate expect values (E-values) for local pairwise sequence alignments.

find_terminal_gaps

Find the slice indices that would remove terminal gaps from an alignment.

remove_terminal_gaps

Remove terminal gaps from an alignment.
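What removing terminal gaps means can be sketched with gapped strings in plain Python: the alignment is cut to the column range in which no sequence is still in its leading or trailing gap run. This is an illustration only; the helper `find_terminal_gap_slice` is hypothetical and operates on strings rather than on Alignment objects.

```python
def find_terminal_gap_slice(gapped_rows, gap="-"):
    """
    Return (start, stop) column indices, so that row[start:stop]
    contains no leading or trailing gap run for any row.
    """
    start = 0
    stop = len(gapped_rows[0])
    for row in gapped_rows:
        leading_gaps = len(row) - len(row.lstrip(gap))
        trailing_gaps = len(row) - len(row.rstrip(gap))
        start = max(start, leading_gaps)
        stop = min(stop, len(row) - trailing_gaps)
    if start >= stop:
        raise ValueError("No column remains after removing terminal gaps")
    return start, stop


rows = ["--ACGT-A", "TTAC-TGA", "TTACGTG-"]
start, stop = find_terminal_gap_slice(rows)
trimmed = [row[start:stop] for row in rows]
print("\n".join(trimmed))
```

Note that a gap directly adjacent to the removed region, such as the last column of the first trimmed row, becomes a terminal gap of the trimmed alignment.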