Reading and writing sequences#
Loading sequences from FASTA files#
Biotite enables the user to load and save sequences from/to the
popular FASTA format via the FastaFile class.
As each sequence in a FASTA file is identified by an unique header,
a FastaFile is implemented in a dictionary-like fashion:
The header strings (without the leading '>') are used as keys to access
the respective sequence strings.
Let’s demonstrate this on the hemoglobin example from an earlier chapter.
from tempfile import gettempdir, NamedTemporaryFile
import biotite.sequence as seq
import biotite.sequence.io.fasta as fasta
import biotite.database.entrez as entrez
temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
file_path = entrez.fetch_single_file(
["6BB5_A", "6BB5_B"], temp_file.name,
db_name="protein", ret_type="fasta"
)
with open(file_path) as file:
print(file.read())
>pdb|6BB5|A Chain A, Hemoglobin subunit alpha
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVA
HVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>pdb|6BB5|B Chain B, Hemoglobin subunit beta
HLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF
SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL
AHKYH
After downloading the FASTA file from the NCBI Entrez database, we can load its contents in the following way:
fasta_file = fasta.FastaFile.read(file_path)
for header, seq_string in fasta_file.items():
print("Header:", header)
# For sake of brevity, print only the start of the sequence
print("Sequence:", seq_string[:50], "...")
print()
Header: pdb|6BB5|A Chain A, Hemoglobin subunit alpha
Sequence: LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHG ...
Header: pdb|6BB5|B Chain B, Hemoglobin subunit beta
Sequence: HLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTP ...
Now this string could be used as input parameter for creation of a
NucleotideSequence.
If we want to spare ourselves some unnecessary work, there is already a
convenience function for that:
dna_sequences = fasta.get_sequences(fasta_file)
for header, sequence in dna_sequences.items():
print("Header:", header)
# Note that this time we have a Sequence object instead of a string
print("Sequence:", repr(sequence))
print()
Header: pdb|6BB5|A Chain A, Hemoglobin subunit alpha
Sequence: ProteinSequence("LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY")
Header: pdb|6BB5|B Chain B, Hemoglobin subunit beta
Sequence: ProteinSequence("HLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH")
The function even automatically recognizes, if the file contains a
DNA or protein sequence and returns a NucleotideSequence or
ProteinSequence, instance respectively.
If we would be interested in only a single sequence (or the file contains only
a single sequence anyway) we would use get_sequence() instead.
Sequences can be written into FASTA files in a similar way: either via
dictionary-like access or using the set_sequence()
convenience function.
# Create new empty FASTA file
fasta_file = fasta.FastaFile()
# PROTIP: Let your cat walk over the keyboard
dna_seq1 = seq.NucleotideSequence("ATCGGATCTATCGATGCTAGCTACAGCTAT")
dna_seq2 = seq.NucleotideSequence("ACGATCTACTAGCTGATGTCGTGCATGTACG")
# Append entries to file...
# ... via set_sequence()
fasta.set_sequence(fasta_file, dna_seq1, header="gibberish")
# .. or dictionary style
fasta_file["more gibberish"] = str(dna_seq2)
print(fasta_file)
temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
fasta_file.write(temp_file.name)
temp_file.close()
>gibberish
ATCGGATCTATCGATGCTAGCTACAGCTAT
>more gibberish
ACGATCTACTAGCTGATGTCGTGCATGTACG
Assessing sequencing quality using FASTQ files#
In addition to sequences FASTQ files store the
Phred quality score
for each base call from a sequencing run.
Hence, the FastqFile class works also like a dictionary, but the value
is tuple containing both, the sequence strings and the corresponding score
array.
from io import StringIO
from textwrap import dedent
import biotite.sequence.io.fastq as fastq
# Sample FASTQ file from https://en.wikipedia.org/wiki/FASTQ_format
fastq_content = StringIO(dedent(
"""
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
"""
))
fastq_file = fastq.FastqFile.read(fastq_content, offset="Sanger")
seq_string, scores = fastq_file["SEQ_ID"]
print(seq_string)
print(scores)
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
[ 0 6 6 9 7 7 7 7 9 9 9 10 8 8 4 4 4 10 10 8 7 4 4 4
4 8 13 16 9 9 9 12 10 9 6 6 8 8 9 9 20 20 34 34 37 29 29 29
29 29 29 34 34 34 34 34 34 34 21 20]
Also similar to biotite.sequence.io.fasta, there are convenience
functions for loading Sequence objects from FASTQ files.
seq, scores = fastq.get_sequence(fastq_file)
print(repr(seq))
print(scores)
NucleotideSequence("GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT", ambiguous=False)
[ 0 6 6 9 7 7 7 7 9 9 9 10 8 8 4 4 4 10 10 8 7 4 4 4
4 8 13 16 9 9 9 12 10 9 6 6 8 8 9 9 20 20 34 34 37 29 29 29
29 29 29 34 34 34 34 34 34 34 21 20]