biotite.sequence.align.bucket_number¶

biotite.sequence.align.bucket_number(n_kmers, load_factor=0.8)[source]¶

Find an appropriate number of buckets for a BucketKmerTable based on the number of elements (i.e. k-mers) that should be stored in the table.

Parameters

n_kmersint: The expected number of k-mers that will be stored in the BucketKmerTable. If this number deviates from the actual number of k-mers that will be stored, the load factor of the table will deviate by the same percentage.
load_factorfloat, optional: The ratio of bucket number to k-mer number. The actual load factor will be lower, as the closest greater prime is returned (see Notes).

Returns

n_bucketsint: The recommended number of buckets to use for a BucketKmerTable, that stores n_kmers at the given load_factor.

Notes

The function returns the closest greater prime number from a precomputed list of primes to use as the number of buckets. The reason is that primer numbers have proven to be good hash table sizes, if the hash function is not randomized.

Let’s take unambiguous nucleotide k-mers as example. If powers of two would be used as table size (another common scheme), taking the modulo operation on the k-mer code would simply erase the upper bits corresponding to the first nucleotide(s) in a k-mer. Hence, all k-mers with the same suffix would be stored in the same bin.