Finding homologous sequences with BLAST#

the biotite.application.blast subpackage provides an interface to NCBI BLAST: the BlastWebApp class. Let’s dive directly into the code, we try to find homologous sequences to the miniprotein TC5b.

import biotite.application.blast as blast
import biotite.sequence as seq

tc5b_seq = seq.ProteinSequence("NLYIQWLKDGGPSSGRPPPS")
app = blast.BlastWebApp("blastp", tc5b_seq)
app.start()
app.join()
alignments = app.get_alignments()
best_ali = alignments[0]
print(best_ali)
print()
print("HSP position in query: ", best_ali.query_interval)
print("HSP position in hit: ", best_ali.hit_interval)
print("Score: ", best_ali.score)
print("E-value: ", best_ali.e_value)
print("Hit UID: ", best_ali.hit_id)
print("Hit name: ", best_ali.hit_definition)
NLYIQWLKDGGPSSGRPPPS
NLYIQWLKDGGPSSGRPPPS

HSP position in query:  (1, 20)
HSP position in hit:  (1, 20)
Score:  101
E-value:  0.000161777
Hit UID:  1L2Y_A
Hit name:  Chain A, TC5b [synthetic construct]

This was too simple for BLAST: As best hit it just found the query sequence itself in the PDB. However, it gives a good impression about how this Application works. Besides some optional parameters, the BlastWebApp requires the BLAST program and the query sequence. After the app has finished, you get a list of alignments with descending score. An alignment is an instance of BlastAlignment, a subclass of the biotite.sequence.align.Alignment encountered in a previous chapter. It contains some additional information as shown above. The hit UID can be used to obtain the complete hit sequence via biotite.database.entrez.

The next alignment should be a bit more challenging. We take a random part of the E. coli BL21 genome and distort it a little bit. Since we still expect a high similarity to the original sequence, we decrease the E-value threshold.

import biotite.application.blast as blast
import biotite.sequence as seq

distorted_bl21_excerpt = seq.NucleotideSequence(
    "CGGAAGCGCTCGGTCTCCTGGCCTTATCAGCCACTGCGCGACGATATGCTCGTCCGTTTCGAAGA"
)
app = blast.BlastWebApp("blastn", distorted_bl21_excerpt)
app.set_max_expect_value(0.1)
app.start()
app.join()
alignments = app.get_alignments()
best_ali = alignments[0]
print(best_ali)
print()
print("HSP position in query: ", best_ali.query_interval)
print("HSP position in hit: ", best_ali.hit_interval)
print("Score: ", best_ali.score)
print("E-value: ", best_ali.e_value)
print("Hit UID: ", best_ali.hit_id)
print("Hit name: ", best_ali.hit_definition)
CGGAAGCGCTCGGTCTCCTGGCC---TTATCAGCCACTGCGCGACGATATGCTCGTCCGTTTCGAAGA
CGGAAGCGCT-GGTC-CCTGCCCGCTTTATCAGGGAATGCGCGACGGCAAAATCGTCCGTTTCGAAGA

HSP position in query:  (1, 65)
HSP position in hit:  (4656858, 4656923)
Score:  56
E-value:  0.0117615
Hit UID:  CP026845
Hit name:  Shigella boydii strain NCTC 9733 chromosome, complete genome