Reading and writing sequences#

Loading sequences from FASTA files#

Biotite enables the user to load and save sequences from/to the popular FASTA format via the FastaFile class. As each sequence in a FASTA file is identified by an unique header, a FastaFile is implemented in a dictionary-like fashion: The header strings (without the leading '>') are used as keys to access the respective sequence strings. Let’s demonstrate this on the hemoglobin example from an earlier chapter.

from tempfile import gettempdir, NamedTemporaryFile
import biotite.sequence as seq
import biotite.sequence.io.fasta as fasta
import biotite.database.entrez as entrez

temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
file_path = entrez.fetch_single_file(
    ["6BB5_A", "6BB5_B"], temp_file.name,
    db_name="protein", ret_type="fasta"
)
with open(file_path) as file:
    print(file.read())

>pdb|6BB5|A Chain A, Hemoglobin subunit alpha
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVA
HVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY

>pdb|6BB5|B Chain B, Hemoglobin subunit beta
HLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF
SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL
AHKYH

After downloading the FASTA file from the NCBI Entrez database, we can load its contents in the following way:

fasta_file = fasta.FastaFile.read(file_path)
for header, seq_string in fasta_file.items():
    print("Header:", header)
    # For sake of brevity, print only the start of the sequence
    print("Sequence:", seq_string[:50], "...")
    print()

Header: pdb|6BB5|A Chain A, Hemoglobin subunit alpha
Sequence: LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHG ...

Header: pdb|6BB5|B Chain B, Hemoglobin subunit beta
Sequence: HLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTP ...

Now this string could be used as input parameter for creation of a NucleotideSequence. If we want to spare ourselves some unnecessary work, there is already a convenience function for that:

dna_sequences = fasta.get_sequences(fasta_file)
for header, sequence in dna_sequences.items():
    print("Header:", header)
    # Note that this time we have a Sequence object instead of a string
    print("Sequence:", repr(sequence))
    print()

Header: pdb|6BB5|A Chain A, Hemoglobin subunit alpha
Sequence: ProteinSequence("LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY")

Header: pdb|6BB5|B Chain B, Hemoglobin subunit beta
Sequence: ProteinSequence("HLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH")

The function even automatically recognizes, if the file contains a DNA or protein sequence and returns a NucleotideSequence or ProteinSequence, instance respectively. If we would be interested in only a single sequence (or the file contains only a single sequence anyway) we would use get_sequence() instead.

Sequences can be written into FASTA files in a similar way: either via dictionary-like access or using the set_sequence() convenience function.

# Create new empty FASTA file
fasta_file = fasta.FastaFile()
# PROTIP: Let your cat walk over the keyboard
dna_seq1 = seq.NucleotideSequence("ATCGGATCTATCGATGCTAGCTACAGCTAT")
dna_seq2 = seq.NucleotideSequence("ACGATCTACTAGCTGATGTCGTGCATGTACG")
# Append entries to file...
# ... via set_sequence()
fasta.set_sequence(fasta_file, dna_seq1, header="gibberish")
# .. or dictionary style
fasta_file["more gibberish"] = str(dna_seq2)
print(fasta_file)
temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
fasta_file.write(temp_file.name)
temp_file.close()

>gibberish
ATCGGATCTATCGATGCTAGCTACAGCTAT
>more gibberish
ACGATCTACTAGCTGATGTCGTGCATGTACG

Assessing sequencing quality using FASTQ files#

In addition to sequences FASTQ files store the Phred quality score for each base call from a sequencing run. Hence, the FastqFile class works also like a dictionary, but the value is tuple containing both, the sequence strings and the corresponding score array.

from io import StringIO
from textwrap import dedent
import biotite.sequence.io.fastq as fastq

# Sample FASTQ file from https://en.wikipedia.org/wiki/FASTQ_format
fastq_content = StringIO(dedent(
    """
    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    """
))

fastq_file = fastq.FastqFile.read(fastq_content, offset="Sanger")
seq_string, scores = fastq_file["SEQ_ID"]
print(seq_string)
print(scores)

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
[ 0  6  6  9  7  7  7  7  9  9  9 10  8  8  4  4  4 10 10  8  7  4  4  4
  4  8 13 16  9  9  9 12 10  9  6  6  8  8  9  9 20 20 34 34 37 29 29 29
 29 29 29 34 34 34 34 34 34 34 21 20]

Also similar to biotite.sequence.io.fasta, there are convenience functions for loading Sequence objects from FASTQ files.

seq, scores = fastq.get_sequence(fastq_file)
print(repr(seq))
print(scores)

NucleotideSequence("GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT", ambiguous=False)
[ 0  6  6  9  7  7  7  7  9  9  9 10  8  8  4  4  4 10 10  8  7  4  4  4
  4  8 13 16  9  9  9 12 10  9  6  6  8  8  9  9 20 20 34 34 37 29 29 29
 29 29 29 34 34 34 34 34 34 34 21 20]