biotite.sequence.align.BucketKmerTable

class biotite.sequence.align.BucketKmerTable[source]

Bases: object

This class represents a k-mer index table. In contrast to KmerTable it does store each unique k-mer in a separate C-array, but limits the number of C-arrays instead to a number of buckets. Hence, different k-mer may be stored in the same bucket, like in a hash table. This approach makes k-mer indices with large k-mer alphabets fit into memory.

Otherwise, the API for creating a BucketKmerTable and matching to it is analogous to KmerTable.

See also

KmerTable

Notes

Memory consumption

For efficient mapping, a BucketKmerTable contains a pointer array, that contains one 64-bit pointer for each bucket. If there is at least one position for a bucket, the corresponding pointer points to a C-array that contains

  1. The length of the C-array (int64)

  2. The k-mers (int64)

  3. The reference ID for each k-mer (uint32)

  4. The sequence position for each k-mer (uint32)

As buckets are used, the memory requirements are limited to the number of buckets instead of scaling with the KmerAlphabet size. If each bucket is used, the required memory space \(S\) in byte is

\[S = 16B + 16L\]

where \(B\) is the number of buckets and \(L\) is the summed length of all sequences added to the table.

Buckets

The ratio \(L/B\) is called load_factor. By default BucketKmerTable uses a load factor of approximately 0.8 to ensure efficient k-mer matching. The number fo buckets can be adjusted by setting the n_buckets parameters on BucketKmerTable creation. It is recommended to use bucket_number() to compute an appropriate number of buckets.

Multiprocessing

BucketKmerTable objects can be used in multi-processed setups: Adding a large database of sequences to a table can be sped up by splitting the database into smaller chunks and create a separate table for each chunk in separate processes. Eventually, the tables can be merged to one large table using from_tables().

Since BucketKmerTable supports the pickle protocol, the matching step can also be divided into multiple processes, if multiple sequences need to be matched.

Storage on hard drive

The most time efficient way to read/write a BucketKmerTable is the pickle format.

Indexing and iteration

Due to the higher complexity in the k-mer lookup compared to KmerTable, this class is still indexable but not iterable.

Examples

Create a 2-mer index table for some nucleotide sequences:

>>> table = BucketKmerTable.from_sequences(
...     k = 2,
...     sequences = [NucleotideSequence("TTATA"), NucleotideSequence("CTAG")],
...     ref_ids = [0, 1]
... )

Display the contents of the table as (reference ID, sequence position) tuples:

>>> print(table)
AG: (1, 2)
AT: (0, 2)
CT: (1, 0)
TA: (0, 1), (0, 3), (1, 1)
TT: (0, 0)

Find matches of the table with a sequence:

>>> query = NucleotideSequence("TAG")
>>> matches = table.match(query)
>>> for query_pos, table_ref_id, table_pos in matches:
...     print("Query sequence position:", query_pos)
...     print("Table reference ID:  ", table_ref_id)
...     print("Table sequence position:", table_pos)
...     print()
Query sequence position: 0
Table reference ID: 0
Table sequence position: 1

Query sequence position: 0
Table reference ID: 0
Table sequence position: 3

Query sequence position: 0
Table reference ID: 1
Table sequence position: 1

Query sequence position: 1
Table reference ID: 1
Table sequence position: 2

Get all reference IDs and positions for a given k-mer:

>>> kmer_code = table.kmer_alphabet.encode("TA")
>>> print(table[kmer_code])
[[0 1]
 [0 3]
 [1 1]]
Attributes
kmer_alphabetKmerAlphabet

The internal KmerAlphabet, that is used to encode all overlapping k-mers of an input sequence.

alphabetAlphabet

The base alphabet, from which this BucketKmerTable was created.

kint

The length of the k-mers.

n_bucketsint

The number of buckets, the k-mers are divided into.

count(kmers=None)

Count the number of occurences for each k-mer in the table.

Parameters
kmersndarray, dtype=np.int64, optional

The count is returned for these k-mer codes. By default all k-mers are counted in ascending order, i.e. count_for_kmer = counts[kmer].

Returns
countsndarray, dtype=np.int64, optional

The counts for each given k-mer.

Notes

As each bucket need to be inspected for the actual k-mer entries, this method requires far more computation time than its KmerTable equivalent.

Examples

>>> table = BucketKmerTable.from_sequences(
...     k = 2,
...     sequences = [NucleotideSequence("TTATA"), NucleotideSequence("CTAG")],
...     ref_ids = [0, 1]
... )
>>> print(table)
AG: (1, 2)
AT: (0, 2)
CT: (1, 0)
TA: (0, 1), (0, 3), (1, 1)
TT: (0, 0)

Count two selected k-mers:

>>> print(table.count(table.kmer_alphabet.encode_multiple(["TA", "AG"])))
[3 1]
static from_kmer_selection(kmer_alphabet, positions, kmers, ref_ids=None, n_buckets=None)

Create a BucketKmerTable by storing the positions of a filtered subset of input k-mers.

This can be used to reduce the number of stored k-mers using a k-mer subset selector such as MinimizerSelector.

Parameters
kmer_alphabetKmerAlphabet

The KmerAlphabet to use for the new table. Should be the same alphabet that was used to calculate the input kmers.

positionssized iterable object of (ndarray, shape=(n,), dtype=uint32), length=m

List where each array contains the sequence positions of the filtered subset of k-mers given in kmers. The list may contain multiple elements for multiple sequences.

kmerssized iterable object of (ndarray, shape=(n,), dtype=np.int64), length=m

List where each array contains the filtered subset of k-mer codes from a sequence. For each array the index of the k-mer code in the array, is stored in the table as sequence position. The list may contain multiple elements for multiple sequences.

ref_idssized iterable object of int, length=m, optional

The reference IDs for the sequences. These are used to identify the corresponding sequence for a k-mer match. By default the IDs are counted from 0 to m.

n_bucketsint, optional

Set the number of buckets in the table, e.g. to use a different load factor. It is recommended to use bucket_number() for this purpose. By default, a load factor of approximately 0.8 is used.

Returns
tableBucketKmerTable

The newly created table.

Examples

Reduce the size of sequence data in the table using minimizers:

>>> sequence1 = ProteinSequence("THIS*IS*A*SEQVENCE")
>>> kmer_alph = KmerAlphabet(sequence1.alphabet, k=3)
>>> minimizer = MinimizerSelector(kmer_alph, window=4)
>>> minimizer_pos, minimizers = minimizer.select(sequence1)
>>> kmer_table = BucketKmerTable.from_kmer_selection(
...     kmer_alph, [minimizer_pos], [minimizers]
... )

Use the same MinimizerSelector to select the minimizers from the query sequence and match them against the table. Although the amount of k-mers is reduced, matching is still guanrateed to work, if the two sequences share identity in the given window:

>>> sequence2 = ProteinSequence("ANQTHER*SEQVENCE")
>>> minimizer_pos, minimizers = minimizer.select(sequence2)
>>> matches = kmer_table.match_kmer_selection(minimizer_pos, minimizers)
>>> print(matches)
[[ 9  0 11]
 [12  0 14]]
>>> for query_pos, _, db_pos in matches:
...     print(sequence1)
...     print(" " * (db_pos-1) + "^" * kmer_table.k)
...     print(sequence2)
...     print(" " * (query_pos-1) + "^" * kmer_table.k)
...     print()
THIS*IS*A*SEQVENCE
  ^^^
ANQTHER*SEQVENCE
        ^^^

THIS*IS*A*SEQVENCE
            ^^^
ANQTHER*SEQVENCE
        ^^^
static from_kmers(kmer_alphabet, kmers, ref_ids=None, masks=None, n_buckets=None)

Create a BucketKmerTable by storing the positions of all input k-mers.

Parameters
kmer_alphabetKmerAlphabet

The KmerAlphabet to use for the new table. Should be the same alphabet that was used to calculate the input kmers.

kmerssized iterable object of (ndarray, dtype=np.int64), length=m

List where each array contains the k-mer codes from a sequence. For each array the index of the k-mer code in the array is stored in the table as sequence position.

ref_idssized iterable object of int, length=m, optional

The reference IDs for the sequences. These are used to identify the corresponding sequence for a k-mer match. By default the IDs are counted from 0 to m.

maskssized iterable object of (ndarray, dtype=bool), length=m, optional

A k-mer code at a position, where the corresponding mask is false, is not added to the table. By default, all positions are added.

n_bucketsint, optional

Set the number of buckets in the table, e.g. to use a different load factor. It is recommended to use bucket_number() for this purpose. By default, a load factor of approximately 0.8 is used.

Returns
tableBucketKmerTable

The newly created table.

See also

from_sequences

The same functionality based on undecomposed sequences

Examples

>>> sequences = [ProteinSequence("BIQTITE"), ProteinSequence("NIQBITE")]
>>> kmer_alphabet = KmerAlphabet(ProteinSequence.alphabet, 3)
>>> kmer_codes = [kmer_alphabet.create_kmers(s.code) for s in sequences]
>>> for code in kmer_codes:
...     print(code)
[11701  4360  7879  9400  4419]
[ 6517  4364  7975 11704  4419]
>>> table = BucketKmerTable.from_kmers(kmer_alphabet, kmer_codes)
>>> print(table)
IQT: (0, 1)
IQB: (1, 1)
ITE: (0, 4), (1, 4)
NIQ: (1, 0)
QTI: (0, 2)
QBI: (1, 2)
TIT: (0, 3)
BIQ: (0, 0)
BIT: (1, 3)
static from_sequences(k, sequences, ref_ids=None, ignore_masks=None, alphabet=None, spacing=None, n_buckets=None)

Create a BucketKmerTable by storing the positions of all overlapping k-mers from the input sequences.

Parameters
kint

The length of the k-mers.

sequencessized iterable object of Sequence, length=m

The sequences to get the k-mer positions from. These sequences must have equal alphabets, or one of these sequences must have an alphabet that extends the alphabets of all other sequences.

ref_idssized iterable object of int, length=m, optional

The reference IDs for the given sequences. These are used to identify the corresponding sequence for a k-mer match. By default the IDs are counted from 0 to m.

ignore_maskssized iterable object of (ndarray, dtype=bool), length=m, optional

Sequence positions to ignore. k-mers that involve these sequence positions are not added to the table. This is used e.g. to skip repeat regions. If provided, the list must contain one boolean mask (or None) for each sequence, and each bolean mask must have the same length as the sequence. By default, no sequence position is ignored.

alphabetAlphabet, optional

The alphabet to use for this table. It must extend the alphabets of the input sequences. By default, an appropriate alphabet is inferred from the input sequences. This option is usually used for compatibility with another sequence/table in the matching step.

spacingNone or str or list or ndarray, dtype=int, shape=(k,)

If provided, spaced k-mers are used instead of continuous ones. The value contains the informative positions relative to the start of the k-mer, also called the model. The number of informative positions must equal k. Refer to KmerAlphabet for more details.

n_bucketsint, optional

Set the number of buckets in the table, e.g. to use a different load factor. It is recommended to use bucket_number() for this purpose. By default, a load factor of approximately 0.8 is used.

Returns
tableBucketKmerTable

The newly created table.

See also

from_kmers

The same functionality based on already created k-mers

Examples

>>> sequences = [NucleotideSequence("TTATA"), NucleotideSequence("CTAG")]
>>> table = BucketKmerTable.from_sequences(
...     2, sequences, ref_ids=[100, 101]
... )
>>> print(table)
AG: (101, 2)
AT: (100, 2)
CT: (101, 0)
TA: (100, 1), (100, 3), (101, 1)
TT: (100, 0)

Give an explicit compatible alphabet:

>>> table = BucketKmerTable.from_sequences(
...     2, sequences, ref_ids=[100, 101],
...     alphabet=NucleotideSequence.ambiguous_alphabet()
... )

Ignore all N in a sequence:

>>> sequence = NucleotideSequence("ACCNTANNG")
>>> table = BucketKmerTable.from_sequences(
...     2, [sequence], ignore_masks=[sequence.symbols == "N"]
... )
>>> print(table)
AC: (0, 0)
CC: (0, 1)
TA: (0, 4)
static from_tables(tables)

Create a BucketKmerTable by merging the k-mer positions from existing tables.

Parameters
tablesiterable object of BucketKmerTable

The tables to be merged. All tables must have equal number of buckets and equal KmerAlphabet objects, i.e. the same k and equal base alphabets.

Returns
tableBucketKmerTable

The newly created table.

Examples

To ensure that all tables have the same number of buckets, n_buckets need to be set on table creation.

>>> # The sequence length is not exactly the length of resulting k-mers,
>>> # but it is close enough for bucket computation
>>> n_buckets = bucket_number(len("TTATA") + len("CTAG"))
>>> table1 = BucketKmerTable.from_sequences(
...     2, [NucleotideSequence("TTATA")], ref_ids=[100],
...     n_buckets=n_buckets
... )
>>> table2 = BucketKmerTable.from_sequences(
...     2, [NucleotideSequence("CTAG")], ref_ids=[101],
...     n_buckets=n_buckets
... )
>>> merged_table = BucketKmerTable.from_tables([table1, table2])
>>> print(merged_table)
AG: (101, 2)
AT: (100, 2)
CT: (101, 0)
TA: (100, 1), (100, 3), (101, 1)
TT: (100, 0)
get_kmers()

Get the k-mer codes for all k-mers that have at least one position in the table.

Returns
kmersndarray, shape=(n,), dtype=np.int64

The k-mer codes.

Notes

As each bucket need to be inspected for the actual k-mer entries, this method requires far more computation time than its KmerTable equivalent.

Examples

>>> sequence = ProteinSequence("BIQTITE")
>>> table = BucketKmerTable.from_sequences(3, [sequence], ref_ids=[100])
>>> print(table)
IQT: (100, 1)
ITE: (100, 4)
QTI: (100, 2)
TIT: (100, 3)
BIQ: (100, 0)
>>> kmer_codes = table.get_kmers()
>>> print(kmer_codes)
[ 4360  4419  7879  9400 11701]
>>> for code in kmer_codes:
...     print(table[code])
[[100   1]]
[[100   4]]
[[100   2]]
[[100   3]]
[[100   0]]
match(sequence, similarity_rule=None, ignore_mask=None)

Find matches between the k-mers in this table with all overlapping k-mers in the given sequence. k is determined by the table.

Parameters
sequenceSequence

The sequence to be matched. The table’s base alphabet must extend the alphabet of the sequence.

similarity_ruleSimilarityRule, optional

If this parameter is given, not only exact k-mer matches are considered, but also similar ones according to the given SimilarityRule.

ignore_maskndarray, dtype=bool, optional

Boolean mask of sequence positions to ignore. k-mers that involve these sequence positions are not added to the table. This is used e.g. to skip repeat regions. By default, no sequence position is ignored.

Returns
matchesndarray, shape=(n,3), dtype=np.uint32

The k-mer matches. Each row contains one match. Each match has the following columns:

  1. The sequence position in the input sequence

  2. The reference ID of the matched sequence in the table

  3. The sequence position of the matched sequence in the table

Notes

The matches are ordered by the first column.

Examples

>>> sequence1 = ProteinSequence("BIQTITE")
>>> table = BucketKmerTable.from_sequences(3, [sequence1], ref_ids=[100])
>>> print(table)
IQT: (100, 1)
ITE: (100, 4)
QTI: (100, 2)
TIT: (100, 3)
BIQ: (100, 0)
>>> sequence2 = ProteinSequence("TITANITE")
>>> print(table.match(sequence2))
[[  0 100   3]
 [  5 100   4]]
match_kmer_selection(positions, kmers)

Find matches between the k-mers in this table with the given k-mer selection.

It is intended to use this method to find matches in a table that was created using from_kmer_selection().

Parameters
positionsndarray, shape=(n,), dtype=uint32

Sequence positions of the filtered subset of k-mers given in kmers.

kmersndarray, shape=(n,), dtype=np.int64

Filtered subset of k-mer codes to match against.

Returns
matchesndarray, shape=(n,3), dtype=np.uint32

The k-mer matches. Each row contains one k-mer match. Each match has the following columns:

  1. The sequence position of the input k-mer, taken from positions

  2. The reference ID of the matched sequence in the table

  3. The sequence position of the matched k-mer in the table

Examples

Reduce the size of sequence data in the table using minimizers:

>>> sequence1 = ProteinSequence("THIS*IS*A*SEQVENCE")
>>> kmer_alph = KmerAlphabet(sequence1.alphabet, k=3)
>>> minimizer = MinimizerSelector(kmer_alph, window=4)
>>> minimizer_pos, minimizers = minimizer.select(sequence1)
>>> kmer_table = BucketKmerTable.from_kmer_selection(
...     kmer_alph, [minimizer_pos], [minimizers]
... )

Use the same MinimizerSelector to select the minimizers from the query sequence and match them against the table. Although the amount of k-mers is reduced, matching is still guanrateed to work, if the two sequences share identity in the given window:

>>> sequence2 = ProteinSequence("ANQTHER*SEQVENCE")
>>> minimizer_pos, minimizers = minimizer.select(sequence2)
>>> matches = kmer_table.match_kmer_selection(minimizer_pos, minimizers)
>>> print(matches)
[[ 9  0 11]
 [12  0 14]]
>>> for query_pos, _, db_pos in matches:
...     print(sequence1)
...     print(" " * (db_pos-1) + "^" * kmer_table.k)
...     print(sequence2)
...     print(" " * (query_pos-1) + "^" * kmer_table.k)
...     print()
THIS*IS*A*SEQVENCE
  ^^^
ANQTHER*SEQVENCE
        ^^^

THIS*IS*A*SEQVENCE
            ^^^
ANQTHER*SEQVENCE
        ^^^
match_table(table, similarity_rule=None)

Find matches between the k-mers in this table with the k-mers in another table.

This means that for each k-mer the cartesian product between the positions in both tables is added to the matches.

Parameters
tableBucketKmerTable

The table to be matched. Both tables must have equal number of buckets and equal KmerAlphabet objects, i.e. the same k and equal base alphabets.

similarity_ruleSimilarityRule, optional

If this parameter is given, not only exact k-mer matches are considered, but also similar ones according to the given SimilarityRule.

Returns
matchesndarray, shape=(n,4), dtype=np.uint32

The k-mer matches. Each row contains one match. Each match has the following columns:

  1. The reference ID of the matched sequence in the other table

  2. The sequence position of the matched sequence in the other table

  3. The reference ID of the matched sequence in this table

  4. The sequence position of the matched sequence in this table

Notes

There is no guaranteed order of the reference IDs or sequence positions in the returned matches.

Examples

To ensure that both tables have the same number of buckets, n_buckets need to be set on table creation.

>>> # The sequence length is not exactly the length of resulting k-mers,
>>> # but it is close enouggh for bucket computation
>>> n_buckets = bucket_number(max(len("BIQTITE"), len("TITANITE")))
>>> sequence1 = ProteinSequence("BIQTITE")
>>> table1 = BucketKmerTable.from_sequences(3, [sequence1], ref_ids=[100])
>>> print(table1)
IQT: (100, 1)
ITE: (100, 4)
QTI: (100, 2)
TIT: (100, 3)
BIQ: (100, 0)
>>> sequence2 = ProteinSequence("TITANITE")
>>> table2 = BucketKmerTable.from_sequences(3, [sequence2], ref_ids=[101])
>>> print(table2)
ANI: (101, 3)
ITA: (101, 1)
ITE: (101, 5)
NIT: (101, 4)
TAN: (101, 2)
TIT: (101, 0)
>>> print(table1.match_table(table2))
[[101   0 100   3]
 [101   5 100   4]]