`KmerTable`#

class biotite.sequence.align.KmerTable[source]#

Bases: object

This class represents a k-mer index table. It maps k-mers (subsequences with length k) to the sequence positions, where the k-mer appears. It is primarily used to find k-mer matches between two sequences. A match is defined as a k-mer that appears in both sequences. Instances of this class are immutable.

There are multiple ways to create a KmerTable:

from_sequences() iterates through all overlapping k-mers in a sequence and stores the sequence position of each kmer in the table.

from_kmers() is similar to from_sequences() but directly accepts k-mers as input instead of sequences.

from_kmer_selection() takes a combination of k-mers and their positions in a sequence, which can be used to apply subset selectors, such as MinimizerSelector.

from_tables() merges the entries from multiple KmerTable objects into a new table.

from_positions() let’s the user provide manual k-mer positions, which can be useful for loading a KmerTable from file.

The standard constructor merely returns an empty table and is reserved for internal use.

Each indexed k-mer position is represented by a tuple of

a unique reference ID that identifies to which sequence a position refers to and

the zero-based sequence position of the first symbol in the k-mer.

The match() method iterates through all overlapping k-mers in another sequence and, for each k-mer, looks up the reference IDs and positions of this k-mer in the table. For each matching position, it adds the k-mer position in this sequence, the matching reference ID and the matching sequence position to the array of matches. Finally these matches are returned to the user. Optionally, a SimilarityRule can be supplied, to find also matches for similar k-mers. This is especially useful for protein sequences to match two k-mers with a high substitution probability.

The positions for a given k-mer code can be obtained via indexing. Iteration over a KmerTable yields the k-mers that have at least one associated position. The k-mer code for a k-mer can be calculated with table.kmer_alphabet.encode() (see KmerAlphabet).

See also

BucketKmerTable

Notes

The design of the KmerTable is inspired by the MMseqs2 software [1].

Memory consumption

For efficient mapping, a KmerTable contains a pointer array, that contains one 64-bit pointer for each possible k-mer. If there is at least one position for a k-mer, the corresponding pointer points to a C-array that contains

The length of the C-array (int64)

The reference ID for each position of this k-mer (uint32)

The sequence position for each position of this k-mer (uint32)

Hence, the memory requirements can be quite large for long k-mers or large alphabets. The required memory space \(S\) in byte is within the bounds of

\[8 n^k + 8L \leq S \leq 16 n^k + 8L,\]

where \(n\) is the number of symbols in the alphabet and \(L\) is the summed length of all sequences added to the table.

Multiprocessing

KmerTable objects can be used in multi-processed setups: Adding a large database of sequences to a table can be sped up by splitting the database into smaller chunks and create a separate table for each chunk in separate processes. Eventually, the tables can be merged to one large table using from_tables().

Since KmerTable supports the pickle protocol, the matching step can also be divided into multiple processes, if multiple sequences need to be matched.

Storage on hard drive

The most time efficient way to read/write a KmerTable is the pickle format. If a custom format is desired, the user needs to extract the reference IDs and position for each k-mer. To restrict this task to all k-mer that have at least one match get_kmers() can be used. Conversely, the reference IDs and positions can be restored via from_positions().

References

Examples

Create a 2-mer index table for some nucleotide sequences:

>>> table = KmerTable.from_sequences(
...     k = 2,
...     sequences = [NucleotideSequence("TTATA"), NucleotideSequence("CTAG")],
...     ref_ids = [0, 1]
... )

Display the contents of the table as (reference ID, sequence position) tuples:

>>> print(table)
AG: (1, 2)
AT: (0, 2)
CT: (1, 0)
TA: (0, 1), (0, 3), (1, 1)
TT: (0, 0)

Find matches of the table with a sequence:

>>> query = NucleotideSequence("TAG")
>>> matches = table.match(query)
>>> for query_pos, table_ref_id, table_pos in matches:
...     print("Query sequence position:", query_pos)
...     print("Table reference ID:  ", table_ref_id)
...     print("Table sequence position:", table_pos)
...     print()
Query sequence position: 0
Table reference ID: 0
Table sequence position: 1

Query sequence position: 0
Table reference ID: 0
Table sequence position: 3

Query sequence position: 0
Table reference ID: 1
Table sequence position: 1

Query sequence position: 1
Table reference ID: 1
Table sequence position: 2

Get all reference IDs and positions for a given k-mer:

>>> kmer_code = table.kmer_alphabet.encode("TA")
>>> print(table[kmer_code])
[[0 1]
 [0 3]
 [1 1]]

Attributes:

kmer_alphabetKmerAlphabet: The internal KmerAlphabet, that is used to encode all overlapping k-mers of an input sequence.
alphabetAlphabet: The base alphabet, from which this KmerTable was created.
kint: The length of the k-mers.

count(kmers=None)#

Count the number of occurences for each k-mer in the table.

Parameters:

kmersndarray, dtype=np.int64, optional: The count is returned for these k-mer codes. By default all k-mers are counted in ascending order, i.e. count_for_kmer = counts[kmer].

Returns:

countsndarray, dtype=np.int64, optional: The counts for each given k-mer.

Examples

>>> table = KmerTable.from_sequences(
...     k = 2,
...     sequences = [NucleotideSequence("TTATA"), NucleotideSequence("CTAG")],
...     ref_ids = [0, 1]
... )
>>> print(table)
AG: (1, 2)
AT: (0, 2)
CT: (1, 0)
TA: (0, 1), (0, 3), (1, 1)
TT: (0, 0)

Count two selected k-mers:

>>> print(table.count(table.kmer_alphabet.encode_multiple(["TA", "AG"])))
[3 1]

Count all k-mers:

>>> counts = table.count()
>>> print(counts)
[0 0 1 1 0 0 0 1 0 0 0 0 3 0 0 1]
>>> for kmer, count in zip(table.kmer_alphabet.get_symbols(), counts):
...     print(kmer, count)
AA 0
AC 0
AG 1
AT 1
CA 0
CC 0
CG 0
CT 1
GA 0
GC 0
GG 0
GT 0
TA 3
TC 0
TG 0
TT 1

static from_kmer_selection(kmer_alphabet, positions, kmers, ref_ids=None)#

Create a KmerTable by storing the positions of a filtered subset of input k-mers.

This can be used to reduce the number of stored k-mers using a k-mer subset selector such as MinimizerSelector.

Parameters:

kmer_alphabetKmerAlphabet: The KmerAlphabet to use for the new table. Should be the same alphabet that was used to calculate the input kmers.
positionssized iterable object of (ndarray, shape=(n,), dtype=uint32), length=m: List where each array contains the sequence positions of the filtered subset of k-mers given in kmers. The list may contain multiple elements for multiple sequences.
kmerssized iterable object of (ndarray, shape=(n,), dtype=np.int64), length=m: List where each array contains the filtered subset of k-mer codes from a sequence. For each array the index of the k-mer code in the array, is stored in the table as sequence position. The list may contain multiple elements for multiple sequences.
ref_idssized iterable object of int, length=m, optional: The reference IDs for the sequences. These are used to identify the corresponding sequence for a k-mer match. By default the IDs are counted from 0 to m.

Returns:

tableKmerTable: The newly created table.

Examples

Reduce the size of sequence data in the table using minimizers:

>>> sequence1 = ProteinSequence("THIS*IS*A*SEQVENCE")
>>> kmer_alph = KmerAlphabet(sequence1.alphabet, k=3)
>>> minimizer = MinimizerSelector(kmer_alph, window=4)
>>> minimizer_pos, minimizers = minimizer.select(sequence1)
>>> kmer_table = KmerTable.from_kmer_selection(
...     kmer_alph, [minimizer_pos], [minimizers]
... )

Use the same MinimizerSelector to select the minimizers from the query sequence and match them against the table. Although the amount of k-mers is reduced, matching is still guanrateed to work, if the two sequences share identity in the given window:

>>> sequence2 = ProteinSequence("ANQTHER*SEQVENCE")
>>> minimizer_pos, minimizers = minimizer.select(sequence2)
>>> matches = kmer_table.match_kmer_selection(minimizer_pos, minimizers)
>>> print(matches)
[[ 9  0 11]
 [12  0 14]]
>>> for query_pos, _, db_pos in matches:
...     print(sequence1)
...     print(" " * (db_pos-1) + "^" * kmer_table.k)
...     print(sequence2)
...     print(" " * (query_pos-1) + "^" * kmer_table.k)
...     print()
THIS*IS*A*SEQVENCE
  ^^^
ANQTHER*SEQVENCE
        ^^^

THIS*IS*A*SEQVENCE
            ^^^
ANQTHER*SEQVENCE
        ^^^

static from_kmers(kmer_alphabet, kmers, ref_ids=None, masks=None)#

Create a KmerTable by storing the positions of all input k-mers.

Parameters:

kmer_alphabetKmerAlphabet: The KmerAlphabet to use for the new table. Should be the same alphabet that was used to calculate the input kmers.
kmerssized iterable object of (ndarray, dtype=np.int64), length=m: List where each array contains the k-mer codes from a sequence. For each array the index of the k-mer code in the array is stored in the table as sequence position.
ref_idssized iterable object of int, length=m, optional: The reference IDs for the sequences. These are used to identify the corresponding sequence for a k-mer match. By default the IDs are counted from 0 to m.
maskssized iterable object of (ndarray, dtype=bool), length=m, optional: A k-mer code at a position, where the corresponding mask is false, is not added to the table. By default, all positions are added.

Returns:

tableKmerTable: The newly created table.

See also

from_kmers: The same functionality based on already created k-mers

Examples

>>> sequences = [NucleotideSequence("TTATA"), NucleotideSequence("CTAG")]
>>> table = KmerTable.from_sequences(
...     2, sequences, ref_ids=[100, 101]
... )
>>> print(table)
AG: (101, 2)
AT: (100, 2)
CT: (101, 0)
TA: (100, 1), (100, 3), (101, 1)
TT: (100, 0)

Give an explicit compatible alphabet:

>>> table = KmerTable.from_sequences(
...     2, sequences, ref_ids=[100, 101],
...     alphabet=NucleotideSequence.ambiguous_alphabet()
... )

Ignore all N in a sequence:

>>> sequence = NucleotideSequence("ACCNTANNG")
>>> table = KmerTable.from_sequences(
...     2, [sequence], ignore_masks=[sequence.symbols == "N"]
... )
>>> print(table)
AC: (0, 0)
CC: (0, 1)
TA: (0, 4)

static from_tables(tables)#

Create a KmerTable by merging the k-mer positions from existing tables.

Parameters:

tablesiterable object of KmerTable: The tables to be merged. All tables must have equal KmerAlphabet objects, i.e. the same k and equal base alphabets.

Returns:

tableKmerTable: The newly created table.

Examples

>>> table1 = KmerTable.from_sequences(
...     2, [NucleotideSequence("TTATA")], ref_ids=[100]
... )
>>> table2 = KmerTable.from_sequences(
...     2, [NucleotideSequence("CTAG")], ref_ids=[101]
... )
>>> merged_table = KmerTable.from_tables([table1, table2])
>>> print(merged_table)
AG: (101, 2)
AT: (100, 2)
CT: (101, 0)
TA: (100, 1), (100, 3), (101, 1)
TT: (100, 0)

get_kmers()#

Get the k-mer codes for all k-mers that have at least one position in the table.

Returns:

kmersndarray, shape=(n,), dtype=np.int64: The k-mer codes.

Examples

>>> sequence = ProteinSequence("BIQTITE")
>>> table = KmerTable.from_sequences(3, [sequence], ref_ids=[100])
>>> print(table)
IQT: (100, 1)
ITE: (100, 4)
QTI: (100, 2)
TIT: (100, 3)
BIQ: (100, 0)
>>> kmer_codes = table.get_kmers()
>>> print(kmer_codes)
[ 4360  4419  7879  9400 11701]
>>> for code in kmer_codes:
...     print(table[code])
[[100   1]]
[[100   4]]
[[100   2]]
[[100   3]]
[[100   0]]

match(sequence, similarity_rule=None, ignore_mask=None)#

Find matches between the k-mers in this table with all overlapping k-mers in the given sequence. k is determined by the table.

Parameters:

sequenceSequence: The sequence to be matched. The table’s base alphabet must extend the alphabet of the sequence.
similarity_ruleSimilarityRule, optional: If this parameter is given, not only exact k-mer matches are considered, but also similar ones according to the given SimilarityRule.
ignore_maskndarray, dtype=bool, optional: Boolean mask of sequence positions to ignore. k-mers that involve these sequence positions are not added to the table. This is used e.g. to skip repeat regions. By default, no sequence position is ignored.

Returns:

matchesndarray, shape=(n,3), dtype=np.uint32

The k-mer matches. Each row contains one match. Each match has the following columns:

The sequence position in the input sequence

The reference ID of the matched sequence in the table

The sequence position of the matched sequence in the table

Notes

The matches are ordered by the first column.

Examples

>>> sequence1 = ProteinSequence("BIQTITE")
>>> table = KmerTable.from_sequences(3, [sequence1], ref_ids=[100])
>>> print(table)
IQT: (100, 1)
ITE: (100, 4)
QTI: (100, 2)
TIT: (100, 3)
BIQ: (100, 0)
>>> sequence2 = ProteinSequence("TITANITE")
>>> print(table.match(sequence2))
[[  0 100   3]
 [  5 100   4]]

match_kmer_selection(positions, kmers)#

Find matches between the k-mers in this table with the given k-mer selection.

It is intended to use this method to find matches in a table that was created using from_kmer_selection().

Parameters:

positionsndarray, shape=(n,), dtype=uint32: Sequence positions of the filtered subset of k-mers given in kmers.
kmersndarray, shape=(n,), dtype=np.int64: Filtered subset of k-mer codes to match against.

Returns:

matchesndarray, shape=(n,3), dtype=np.uint32

The k-mer matches. Each row contains one k-mer match. Each match has the following columns:

The sequence position of the input k-mer, taken from positions

The reference ID of the matched sequence in the table

The sequence position of the matched k-mer in the table

Examples

Reduce the size of sequence data in the table using minimizers:

>>> sequence1 = ProteinSequence("THIS*IS*A*SEQVENCE")
>>> kmer_alph = KmerAlphabet(sequence1.alphabet, k=3)
>>> minimizer = MinimizerSelector(kmer_alph, window=4)
>>> minimizer_pos, minimizers = minimizer.select(sequence1)
>>> kmer_table = KmerTable.from_kmer_selection(
...     kmer_alph, [minimizer_pos], [minimizers]
... )

>>> sequence2 = ProteinSequence("ANQTHER*SEQVENCE")
>>> minimizer_pos, minimizers = minimizer.select(sequence2)
>>> matches = kmer_table.match_kmer_selection(minimizer_pos, minimizers)
>>> print(matches)
[[ 9  0 11]
 [12  0 14]]
>>> for query_pos, _, db_pos in matches:
...     print(sequence1)
...     print(" " * (db_pos-1) + "^" * kmer_table.k)
...     print(sequence2)
...     print(" " * (query_pos-1) + "^" * kmer_table.k)
...     print()
THIS*IS*A*SEQVENCE
  ^^^
ANQTHER*SEQVENCE
        ^^^

THIS*IS*A*SEQVENCE
            ^^^
ANQTHER*SEQVENCE
        ^^^

match_table(table, similarity_rule=None)#

Find matches between the k-mers in this table with the k-mers in another table.

This means that for each k-mer the cartesian product between the positions in both tables is added to the matches.

Parameters:

tableKmerTable: The table to be matched. Both tables must have equal KmerAlphabet objects, i.e. the same k and equal base alphabets.
similarity_ruleSimilarityRule, optional: If this parameter is given, not only exact k-mer matches are considered, but also similar ones according to the given SimilarityRule.

Returns:

matchesndarray, shape=(n,4), dtype=np.uint32

The k-mer matches. Each row contains one match. Each match has the following columns:

The reference ID of the matched sequence in the other table

The sequence position of the matched sequence in the other table

The reference ID of the matched sequence in this table

The sequence position of the matched sequence in this table

Notes

There is no guaranteed order of the reference IDs or sequence positions in the returned matches.

Examples

>>> sequence1 = ProteinSequence("BIQTITE")
>>> table1 = KmerTable.from_sequences(3, [sequence1], ref_ids=[100])
>>> print(table1)
IQT: (100, 1)
ITE: (100, 4)
QTI: (100, 2)
TIT: (100, 3)
BIQ: (100, 0)
>>> sequence2 = ProteinSequence("TITANITE")
>>> table2 = KmerTable.from_sequences(3, [sequence2], ref_ids=[101])
>>> print(table2)
ANI: (101, 3)
ITA: (101, 1)
ITE: (101, 5)
NIT: (101, 4)
TAN: (101, 2)
TIT: (101, 0)
>>> print(table1.match_table(table2))
[[101   5 100   4]
 [101   0 100   3]]

Gallery#

Finding homologous regions in two genomes

Finding homologs of a gene in a genome

KmerTable#

Gallery#

`KmerTable`#