FastqFile
- class biotite.sequence.io.fastq.FastqFile(offset, chars_per_line=None)
Bases: TextFile, MutableMapping
This class represents a file in FASTQ format.
A FASTQ file stores one or multiple sequences (base calls) along with sequencing quality scores. Each sequence is associated with an identifier string, beginning with an @.
The quality scores are encoded as ASCII characters: each score is the ASCII code of the character minus an offset value. The offset is format dependent. As the offset cannot be reliably deduced from the file contents, it must be provided explicitly, either as a number or as a format name (e.g. 'Illumina-1.8').
Similar to the FastaFile class, this class implements the MutableMapping interface: an identifier string (without the leading @) is used as index to get and set the corresponding sequence and quality scores. del removes an entry from the file.
- Parameters:
- offset : int or {‘Sanger’, ‘Solexa’, ‘Illumina-1.3’, ‘Illumina-1.5’, ‘Illumina-1.8’}
This value is added to the quality score to obtain the ASCII code. Can be given either directly as the value or as a string that indicates the score format.
- chars_per_line : int, optional
The number of characters in a line containing sequence data after which a line break is inserted. Only relevant when adding sequences to a file. By default, each sequence (and score string) is put on a single line.
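The effect of chars_per_line can be pictured with a small sketch in plain Python. This is an illustration of the described behavior, not biotite's implementation; the helper name wrap_line is hypothetical.

```python
def wrap_line(text, chars_per_line=None):
    """Split a sequence or score string into lines of limited length.

    With chars_per_line=None the text is kept on a single line,
    mirroring the default described above.
    """
    if chars_per_line is None:
        return [text]
    return [
        text[start:start + chars_per_line]
        for start in range(0, len(text), chars_per_line)
    ]


# A 7-character sequence wrapped after every 3 characters
print(wrap_line("TTGTAGG", chars_per_line=3))  # ['TTG', 'TAG', 'G']
```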
Examples
>>> import os.path
>>> file = FastqFile(offset="Sanger")
>>> file["seq1"] = str(NucleotideSequence("ATACT")), [0,3,10,7,12]
>>> file["seq2"] = str(NucleotideSequence("TTGTAGG")), [15,13,24,21,28,38,35]
>>> print(file)
@seq1
ATACT
+
!$+(-
@seq2
TTGTAGG
+
0.96=GD
>>> sequence, scores = file["seq1"]
>>> print(sequence)
ATACT
>>> print(scores)
[ 0  3 10  7 12]
>>> del file["seq1"]
>>> print(file)
@seq2
TTGTAGG
+
0.96=GD
>>> file.write(os.path.join(path_to_directory, "test.fastq"))
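The relationship between scores and quality characters shown in the example can be checked without the library. The following is a minimal sketch in plain Python, not biotite's implementation. The helper names encode_scores and decode_scores are hypothetical, and the offset of 33 is an assumption inferred from the example, where a score of 0 is written as "!" under offset="Sanger".

```python
# Assumption: inferred from the example above, where score 0 maps to "!" (ASCII 33)
SANGER_OFFSET = 33


def encode_scores(scores, offset=SANGER_OFFSET):
    """Turn integer quality scores into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)


def decode_scores(quality_string, offset=SANGER_OFFSET):
    """Turn a FASTQ quality string back into integer scores."""
    return [ord(char) - offset for char in quality_string]


quality = encode_scores([0, 3, 10, 7, 12])
print(quality)                 # !$+(-
print(decode_scores(quality))  # [0, 3, 10, 7, 12]
```

Decoding the same string with a different offset yields different scores, which is why the offset has to be supplied explicitly.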
- copy()
Create a deep copy of this object.
- Returns:
- copy
A copy of this object.
- get_quality(identifier)
Get the quality scores for the specified identifier.
- Parameters:
- identifier : str
The identifier of the quality scores.
- Returns:
- scores : ndarray, dtype=int
The quality scores corresponding to the identifier.
- get_seq_string(identifier)
Get the string representing the sequence for the specified identifier.
- Parameters:
- identifier : str
The identifier of the sequence.
- Returns:
- sequence : str
The sequence corresponding to the identifier.
- classmethod read(file, offset, chars_per_line=None)
Read a FASTQ file.
- Parameters:
- file : file-like object or str
The file to be read. Alternatively, a file path can be supplied.
- offset : int or {‘Sanger’, ‘Solexa’, ‘Illumina-1.3’, ‘Illumina-1.5’, ‘Illumina-1.8’}
This value is added to the quality score to obtain the ASCII code. Can be given either directly as the value or as a string that indicates the score format.
- chars_per_line : int, optional
The number of characters in a line containing sequence data after which a line break is inserted. Only relevant when adding sequences to a file. By default, each sequence (and score string) is put on a single line.
- Returns:
- file_object : FastqFile
The parsed file.
- static read_iter(file, offset)
Create an iterator over each sequence (and corresponding scores) of the given FASTQ file.
- Parameters:
- file : file-like object or str
The file to be read. Alternatively, a file path can be supplied.
- offset : int or {‘Sanger’, ‘Solexa’, ‘Illumina-1.3’, ‘Illumina-1.5’, ‘Illumina-1.8’}
This value is added to the quality score to obtain the ASCII code. Can be given either directly as the value or as a string that indicates the score format.
- Yields:
- identifier : str
The identifier of the current sequence.
- sequence : tuple(str, ndarray)
The current sequence as string and its corresponding quality scores as ndarray.
Notes
This approach gives the same results as FastqFile.read(file, offset).items(), but is slightly faster and much more memory efficient.
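Why iterating saves memory can be seen in a simplified stand-in written in plain Python. This is not biotite's implementation: the function name iter_fastq is hypothetical, the sketch assumes every record spans exactly four lines (no wrapped sequences), and it returns scores as a list rather than an ndarray. Only one record is held in memory at a time.

```python
import io


def iter_fastq(handle, offset):
    """Yield (identifier, (sequence, scores)) one record at a time.

    Assumes each record spans exactly four lines.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file reached
        if not header.strip():
            continue  # skip blank lines between records
        sequence = handle.readline().strip()
        handle.readline()  # separator line starting with "+"
        quality = handle.readline().strip()
        scores = [ord(char) - offset for char in quality]
        # Drop the leading "@" from the identifier
        yield header.strip()[1:], (sequence, scores)


text = "@seq1\nATACT\n+\n!$+(-\n@seq2\nTTGTAGG\n+\n0.96=GD\n"
for identifier, (sequence, scores) in iter_fastq(io.StringIO(text), offset=33):
    print(identifier, sequence, scores)
```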
- write(file)
Write the contents of this object into a file (or file-like object).
- Parameters:
- file : file-like object or str
The file to be written to. Alternatively, a file path can be supplied.
- static write_iter(file, items, offset, chars_per_line=None)
Iterate over the given items and write each item into the specified file.
In contrast to write(), the lines of text are not stored in an intermediate TextFile, but are written directly to the file. Hence, this static method may save a large amount of memory when a large file is written, especially if the items are provided as a generator.
- Parameters:
- file : file-like object or str
The file to be written to. Alternatively, a file path can be supplied.
- items : generator or array-like of tuple(str, tuple(str, ndarray))
The entries to be written into the file. Each entry consists of an identifier string and a tuple containing a sequence (as string) and a score array.
- offset : int or {‘Sanger’, ‘Solexa’, ‘Illumina-1.3’, ‘Illumina-1.5’, ‘Illumina-1.8’}
This value is added to the quality score to obtain the ASCII code. Can be given either directly as the value or as a string that indicates the score format.
- chars_per_line : int, optional
The number of characters in a line containing sequence data after which a line break is inserted. Only relevant when adding sequences to a file. By default, each sequence (and score string) is put on a single line.
Notes
This method does not check whether the given identifiers are unambiguous.
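Since duplicates are not detected, a caller who needs that guarantee could filter the items before they are written. The following is one possible sketch in plain Python; the helper name unique_items is hypothetical and not part of biotite, and the scores are plain lists here only to keep the example free of dependencies.

```python
def unique_items(items):
    """Pass items through unchanged, raising on a repeated identifier."""
    seen = set()
    for identifier, entry in items:
        if identifier in seen:
            raise ValueError(f"Duplicate identifier: {identifier!r}")
        seen.add(identifier)
        yield identifier, entry


items = [
    ("seq1", ("ATACT", [0, 3, 10, 7, 12])),
    ("seq2", ("TTGTAGG", [15, 13, 24, 21, 28, 38, 35])),
]
# The wrapped generator has the same shape as the original items
checked = list(unique_items(items))
print(len(checked))  # 2
```

Because unique_items is itself a generator, it could be passed as the items argument without materializing all entries. It only keeps the set of identifiers seen so far, which is usually far smaller than the sequence data.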