biotite.sequence.align.KmerTable¶
- class biotite.sequence.align.KmerTable[source]¶
Bases:
object
This class represents a k-mer index table. It maps k-mers (subsequences with length k) to the sequence positions, where the k-mer appears. It is primarily used to find k-mer matches between two sequences. A match is defined as a k-mer that appears in both sequences. Instances of this class are immutable.
There are multiple ways to create a
KmerTable
:from_sequences()
iterates through all overlapping k-mers in a sequence and stores the sequence position of each kmer in the table.from_kmers()
is similar tofrom_sequences()
but directly accepts k-mers as input instead of sequences.from_tables()
merges the entries from multipleKmerTable
objects into a new table.from_positions()
let’s the user provide manual k-mer positions, which can be useful for loading aKmerTable
from file.
The standard constructor merely returns an empty table and is reserved for internal use.
Each indexed k-mer position is represented by a tuple of
a unique reference ID that identifies to which sequence a position refers to and
the zero-based sequence position of the first symbol in the k-mer.
The
match()
method iterates through all overlapping k-mers in another sequence and, for each k-mer, looks up the reference IDs and positions of this k-mer in the table. For each matching position, it adds the k-mer position in this sequence, the matching reference ID and the matching sequence position to the array of matches. Finally these matches are returned to the user. Optionally, aSimilarityRule
can be supplied, to find also matches for similar k-mers. This is especially useful for protein sequences to match two k-mers with a high substitution probability.The positions for a given k-mer code can be obtained via indexing. Iteration over a
KmerTable
yields the k-mers that have at least one associated position. The k-mer code for a k-mer can be calculated withtable.kmer_alphabet.encode()
(seeKmerAlphabet
).Notes
The design of the
KmerTable
is inspired by the MMseqs2 software [1].Memory consumption
For efficient mapping, a
KmerTable
contains a pointer array, that contains one 64-bit pointer for each possible k-mer. If there is at least one position for a k-mer, the corresponding pointer points to a C-array that containsThe length of the C-array (int64)
The reference ID for each position of this k-mer (uint32)
The sequence position for each position of this k-mer (uint32)
Hence, the memory requirements can be quite large for long k-mers or large alphabets. The required memory space \(S\) in byte is within the bounds of
\[8 n^k + 8L \leq S \leq 16 n^k + 8L,\]where \(n\) is the number of symbols in the alphabet and \(L\) is the summed length of all sequences added to the table.
Multiprocessing
KmerTable
objects can be used in multi-processed setups: Adding a large database of sequences to a table can be sped up by splitting the database into smaller chunks and create a separate table for each chunk in separate processes. Eventually, the tables can be merged to one large table usingfrom_tables()
.Since
KmerTable
supports the pickle protocol, the matching step can also be divided into multiple processes, if multiple sequences need to be matched.Storage on hard drive
The most time efficient way to read/write a
KmerTable
is the pickle format. If a custom format is desired, the user needs to extract the reference IDs and position for each k-mer. To restrict this task to all k-mer that have at least one matchget_kmers()
can be used. Conversely, the reference IDs and positions can be restored viafrom_positions()
.References
Examples
Create a 2-mer index table for some nucleotide sequences:
>>> table = KmerTable.from_sequences( ... k = 2, ... sequences = [NucleotideSequence("TTATA"), NucleotideSequence("CTAG")], ... ref_ids = [0, 1] ... )
Display the contents of the table as (reference ID, sequence position) tuples:
>>> print(table) TA: (0, 1), (0, 3), (1, 1) AG: (1, 2) AT: (0, 2) CT: (1, 0) TT: (0, 0)
Find matches of the table with a sequence:
>>> query = NucleotideSequence("TAG") >>> matches = table.match(query) >>> for query_pos, table_ref_id, table_pos in matches: ... print("Query sequence position:", query_pos) ... print("Table reference ID: ", table_ref_id) ... print("Table sequence position:", table_pos) ... print() Query sequence position: 0 Table reference ID: 0 Table sequence position: 1 Query sequence position: 0 Table reference ID: 0 Table sequence position: 3 Query sequence position: 0 Table reference ID: 1 Table sequence position: 1 Query sequence position: 1 Table reference ID: 1 Table sequence position: 2
Get all reference IDs and positions for a given k-mer:
>>> kmer_code = table.kmer_alphabet.encode("TA") >>> print(table[kmer_code]) [[0 1] [0 3] [1 1]]
- Attributes:
- kmer_alphabetKmerAlphabet
The internal
KmerAlphabet
, that is used to encode all overlapping k-mers of an input sequence.- alphabetAlphabet
The base alphabet, from which this
KmerTable
was created.- kint
The length of the k-mers.
- static from_kmers(kmer_alphabet, kmers, ref_ids=None, masks=None)¶
Create a
KmerTable
by storing the positions of all input k-mers.- Parameters:
- kmer_alphabetKmerAlphabet
The
KmerAlphabet
to use for the new table. Should be the same alphabet that was used to calculate the input kmers.- kmersiterable object of (ndarray, dtype=np.int64)
Arrays containing the symbol codes for k-mers from a sequence. For each array the index of the k-mer code in the array, is stored in the table.
- ref_idsiterable object of int, optional
The reference IDs for the sequences. These are used to identify the corresponding sequence for a k-mer match. By default the IDs are counted from 0 to n for n input kmers.
- masksiterable object of (ndarray, dtype=bool), optional
A k-mer code at a position, where the corresponding mask is false, is not added to the table. By default, all positions are added.
- Returns:
- tableKmerTable
The newly created table.
Examples
>>> sequences = [ProteinSequence("BIQTITE"), ProteinSequence("NIQBITE")] >>> kmer_alphabet = KmerAlphabet(ProteinSequence.alphabet, 3) >>> kmer_codes = [kmer_alphabet.create_kmers(s.code) for s in sequences] >>> for code in kmer_codes: ... print(code) [7676 9535 4429 9400 2119] [ 7667 11839 4525 9404 2119] >>> table = KmerTable.from_kmers( ... kmer_alphabet, kmer_codes ... ) >>> print(table) ITE: (0, 4), (1, 4) QTI: (0, 2) QBI: (1, 2) NIQ: (1, 0) BIQ: (0, 0) TIT: (0, 3) BIT: (1, 3) IQT: (0, 1) IQB: (1, 1)
- static from_positions(kmer_alphabet kmer_positions)¶
Create a
KmerTable
from k-mer reference IDs and positions. This constructor is especially useful for restoring a table from previously serialized data.- Parameters:
- kmer_alphabetKmerAlphabet
The
KmerAlphabet
to use for the new table- kmer_positionsdict of (int -> ndarray, shape=(n,2), dtype=int)
A dictionary representing the k-mer reference IDs and positions to be stored in the newly created table. It maps a k-mer code to a
ndarray
. To achieve a high performance the data typeuint32
is preferred for the arrays.
- Returns:
- tableKmerTable
The newly created table.
Examples
>>> sequence = ProteinSequence("BIQTITE") >>> table = KmerTable.from_sequences(3, [sequence], ref_ids=[100]) >>> print(table) ITE: (100, 4) QTI: (100, 2) BIQ: (100, 0) TIT: (100, 3) IQT: (100, 1) >>> data = {kmer: table[kmer] for kmer in table} >>> print(data) {2119: array([[100, 4]], dtype=uint32), 4429: array([[100, 2]], dtype=uint32), 7676: array([[100, 0]], dtype=uint32), 9400: array([[100, 3]], dtype=uint32), 9535: array([[100, 1]], dtype=uint32)} >>> restored_table = KmerTable.from_positions(table.kmer_alphabet, data) >>> print(restored_table) ITE: (100, 4) QTI: (100, 2) BIQ: (100, 0) TIT: (100, 3) IQT: (100, 1)
- static from_sequences(k, sequences, ref_ids=None, ignore_masks=None, alphabet=None, spacing=None)¶
Create a
KmerTable
by storing the positions of all overlapping k-mers from the input sequences.- Parameters:
- kint
The length of the k-mers.
- sequencesiterable object of Sequence
The sequences to get the k-mer positions from. These sequences must have equal alphabets, or one of these sequences must have an alphabet that extends the alphabets of all other sequences.
- ref_idsiterable object of int, optional
The reference IDs for the given sequences. These are used to identify the corresponding sequence for a k-mer match. By default the IDs are counted from 0 to n for n input sequences.
- ignore_masksiterable object of (ndarray, dtype=bool), optional
Sequence positions to ignore. k-mers that involve these sequence positions are not added to the table. This is used e.g. to skip repeat regions. If provided, the list must contain one boolean mask (or
None
) for each sequence, and each bolean mask must have the same length as the sequence. By default, no sequence position is ignored.- alphabetAlphabet, optional
The alphabet to use for this table. It must extend the alphabets of the input sequences. By default, an appropriate alphabet is inferred from the input sequences. This option is usually used for compatibility with another sequence/table in the matching step.
- spacingNone or str or list or ndarray, dtype=int, shape=(k,)
If provided, spaced k-mers are used instead of continuous ones. The value contains the informative positions relative to the start of the k-mer, also called the model. The number of informative positions must equal k. Refer to
KmerAlphabet
for more details.
- Returns:
- tableKmerTable
The newly created table.
Examples
>>> sequences = [NucleotideSequence("TTATA"), NucleotideSequence("CTAG")] >>> table = KmerTable.from_sequences( ... 2, sequences, ref_ids=[100, 101] ... ) >>> print(table) TA: (100, 1), (100, 3), (101, 1) AG: (101, 2) AT: (100, 2) CT: (101, 0) TT: (100, 0)
Give an explicit compatible alphabet:
>>> table = KmerTable.from_sequences( ... 2, sequences, ref_ids=[100, 101], ... alphabet=NucleotideSequence.ambiguous_alphabet() ... )
Ignore all
N
in a sequence:>>> sequence = NucleotideSequence("ACCNTANNG") >>> table = KmerTable.from_sequences( ... 2, [sequence], ignore_masks=[sequence.symbols == "N"] ... ) >>> print(table) TA: (0, 4) AC: (0, 0) CC: (0, 1)
- static from_tables(tables)¶
Create a
KmerTable
by merging the k-mer positions from existing tables.- Parameters:
- tablesiterable object of KmerTable
The tables to be merged. All tables must have equal
KmerAlphabet
objects, i.e. the same k and equal base alphabets.
- Returns:
- tableKmerTable
The newly created table.
Examples
>>> table1 = KmerTable.from_sequences( ... 2, [NucleotideSequence("TTATA")], ref_ids=[100] ... ) >>> table2 = KmerTable.from_sequences( ... 2, [NucleotideSequence("CTAG")], ref_ids=[101] ... ) >>> merged_table = KmerTable.from_tables([table1, table2]) >>> print(merged_table) TA: (100, 1), (100, 3), (101, 1) AG: (101, 2) AT: (100, 2) CT: (101, 0) TT: (100, 0)
- get_kmers()¶
Get the k-mer codes for all k-mers that have at least one position in the table.
- Returns:
- kmersndarray, shape=(n,), dtype=np.int64
The k-mer codes.
Examples
>>> sequence = ProteinSequence("BIQTITE") >>> table = KmerTable.from_sequences(3, [sequence], ref_ids=[100]) >>> print(table) ITE: (100, 4) QTI: (100, 2) BIQ: (100, 0) TIT: (100, 3) IQT: (100, 1) >>> kmer_codes = table.get_kmers() >>> print(kmer_codes) [2119 4429 7676 9400 9535] >>> for code in kmer_codes: ... print(table[code]) [[100 4]] [[100 2]] [[100 0]] [[100 3]] [[100 1]]
- match(self, sequence, similarity_rule=None, ignore_mask=None)¶
Find matches between the k-mers in this table with all overlapping k-mers in the given sequence. k is determined by the table.
- Parameters:
- sequenceSequence
The sequence to be matched. The table’s base alphabet must extend the alphabet of the sequence.
- similarity_ruleSimilarityRule, optional
If this parameter is given, not only exact k-mer matches are considered, but also similar ones according to the given
SimilarityRule
.- ignore_maskndarray, dtype=bool, optional
Boolean mask of sequence positions to ignore. k-mers that involve these sequence positions are not added to the table. This is used e.g. to skip repeat regions. By default, no sequence position is ignored.
- Returns:
- matchesndarray, shape=(n,3), dtype=np.uint32
The k-mer matches. Each row contains one match. Each match has the following columns:
The sequence position in the input sequence
The reference ID of the matched sequence in the table
The sequence position of the matched sequence in the table
Notes
The matches are ordered by the first column.
Examples
>>> sequence1 = ProteinSequence("BIQTITE") >>> table = KmerTable.from_sequences(3, [sequence1], ref_ids=[100]) >>> print(table) ITE: (100, 4) QTI: (100, 2) BIQ: (100, 0) TIT: (100, 3) IQT: (100, 1) >>> sequence2 = ProteinSequence("TITANITE") >>> print(table.match(sequence2)) [[ 0 100 3] [ 5 100 4]]
- match_table(table, similarity_rule=None)¶
Find matches between the k-mers in this table with the k-mers in another table.
This means that for each k-mer the cartesian product between the positions in both tables is added to the matches.
- Parameters:
- tableKmerTable
The table to be matched. Both tables must have equal
KmerAlphabet
objects, i.e. the same k and equal base alphabets.- similarity_ruleSimilarityRule, optional
If this parameter is given, not only exact k-mer matches are considered, but also similar ones according to the given
SimilarityRule
.
- Returns:
- matchesndarray, shape=(n,4), dtype=np.uint32
The k-mer matches. Each row contains one match. Each match has the following columns:
The reference ID of the matched sequence in the other table
The sequence position of the matched sequence in the other table
The reference ID of the matched sequence in this table
The sequence position of the matched sequence in this table
Notes
There is no guaranteed order of the reference IDs or sequence positions in the returned matches.
Examples
>>> sequence1 = ProteinSequence("BIQTITE") >>> table1 = KmerTable.from_sequences(3, [sequence1], ref_ids=[100]) >>> print(table1) ITE: (100, 4) QTI: (100, 2) BIQ: (100, 0) TIT: (100, 3) IQT: (100, 1) >>> sequence2 = ProteinSequence("TITANITE") >>> table2 = KmerTable.from_sequences(3, [sequence2], ref_ids=[101]) >>> print(table2) ITA: (101, 1) ITE: (101, 5) ANI: (101, 3) TAN: (101, 2) NIT: (101, 4) TIT: (101, 0) >>> print(table1.match_table(table2)) [[101 5 100 4] [101 0 100 3]]
Gallery¶

Genome comparison between chloroplasts and cyanobacteria