biotite.sequence.io.fastq.FastqFile

class biotite.sequence.io.fastq.FastqFile(offset, chars_per_line=None)

Bases: TextFile, MutableMapping

This class represents a file in FASTQ format.

A FASTQ file stores one or multiple sequences (base calls) along with sequencing quality scores. Each sequence is associated with an identifier string, beginning with an @.

The quality scores are encoded as ASCII characters: each score is the ASCII code of its character minus an offset value. The offset depends on the format. As the offset cannot be reliably deduced from the file contents, it must be provided explicitly, either as a number or as a format name (e.g. 'Illumina-1.8').
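The offset arithmetic can be sketched in plain Python. This is only an illustration, not biotite's implementation: the helper names encode_scores and decode_scores are hypothetical, and the offset of 33 is assumed here as the value conventionally associated with the Sanger format.

```python
def encode_scores(scores, offset):
    """Turn integer quality scores into a FASTQ quality string."""
    # ASCII code = score + offset
    return "".join(chr(score + offset) for score in scores)


def decode_scores(quality_string, offset):
    """Turn a FASTQ quality string back into integer scores."""
    # score = ASCII code - offset
    return [ord(char) - offset for char in quality_string]


SANGER_OFFSET = 33  # assumption: conventional offset for the Sanger format

encoded = encode_scores([0, 3, 10, 7, 12], SANGER_OFFSET)
print(encoded)                                # !$+(-
print(decode_scores(encoded, SANGER_OFFSET))  # [0, 3, 10, 7, 12]
```

Decoding the same string with a different offset (for example 64) would shift every score by the difference, which is why the offset has to be stated explicitly.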

Similar to the FastaFile class, this class implements the MutableMapping interface: An identifier string (without the leading @) is used as index to get and set the corresponding sequence and quality. del removes an entry in the file.

Parameters
offset : int or {‘Sanger’, ‘Solexa’, ‘Illumina-1.3’, ‘Illumina-1.5’, ‘Illumina-1.8’}

This value is added to the quality score to obtain the ASCII code. It can be given either directly as the value or as a string that indicates the score format.

chars_per_line : int, optional

The number of characters in a line of sequence data after which a line break is inserted. Only relevant when adding sequences to a file. By default, each sequence (and score string) is put on a single line.
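The effect of chars_per_line can be pictured as splitting a string into fixed-width chunks. The following is a minimal sketch in plain Python; wrap_line is a hypothetical helper used for illustration only and is not part of biotite.

```python
def wrap_line(text, chars_per_line=None):
    """Split text into chunks of at most chars_per_line characters."""
    if chars_per_line is None:
        # Default behavior: keep the whole string on a single line
        return [text]
    return [
        text[i : i + chars_per_line]
        for i in range(0, len(text), chars_per_line)
    ]


print(wrap_line("TTGTAGG"))     # ['TTGTAGG']
print(wrap_line("TTGTAGG", 3))  # ['TTG', 'TAG', 'G']
```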

Examples

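>>> import os.path
>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.io.fastq import FastqFile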
>>> file = FastqFile(offset="Sanger")
>>> file["seq1"] = str(NucleotideSequence("ATACT")), [0,3,10,7,12]
>>> file["seq2"] = str(NucleotideSequence("TTGTAGG")), [15,13,24,21,28,38,35]
>>> print(file)
@seq1
ATACT
+
!$+(-
@seq2
TTGTAGG
+
0.96=GD
>>> sequence, scores = file["seq1"]
>>> print(sequence)
ATACT
>>> print(scores)
[ 0  3 10  7 12]
>>> del file["seq1"]
>>> print(file)
@seq2
TTGTAGG
+
0.96=GD
>>> file.write(os.path.join(path_to_directory, "test.fastq"))
clear() -> None.  Remove all items from D.
copy()

Create a deep copy of this object.

Returns
copy

A copy of this object.

get(k[, d]) -> D[k] if k in D, else d.  d defaults to None.
get_quality(identifier)

Get the quality scores for the specified identifier.

Parameters
identifier : str

The identifier of the quality scores.

Returns
scores : ndarray, dtype=int

The quality scores corresponding to the identifier.

get_seq_string(identifier)

Get the string representing the sequence for the specified identifier.

Parameters
identifier : str

The identifier of the sequence.

Returns
sequence : str

The sequence corresponding to the identifier.

get_sequence(identifier)

Get the sequence for the specified identifier.

DEPRECATED: Use get_seq_string() or the module-level function biotite.sequence.io.fastq.get_sequence() instead.

Parameters
identifier : str

The identifier of the sequence.

Returns
sequence : NucleotideSequence

The sequence corresponding to the identifier.

items() -> a set-like object providing a view on D's items
keys() -> a set-like object providing a view on D's keys
pop(k[, d]) -> v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised.

popitem() -> (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.

classmethod read(file, offset, chars_per_line=None)

Read a FASTQ file.

Parameters
file : file-like object or str

The file to be read. Alternatively a file path can be supplied.

offset : int or {‘Sanger’, ‘Solexa’, ‘Illumina-1.3’, ‘Illumina-1.5’, ‘Illumina-1.8’}

This value is added to the quality score to obtain the ASCII code. It can be given either directly as the value or as a string that indicates the score format.

chars_per_line : int, optional

The number of characters in a line of sequence data after which a line break is inserted. Only relevant when adding sequences to a file. By default, each sequence (and score string) is put on a single line.

Returns
file_object : FastqFile

The parsed file.

static read_iter(file, offset)

Create an iterator over each sequence (and corresponding scores) of the given FASTQ file.

Parameters
file : file-like object or str

The file to be read. Alternatively a file path can be supplied.

offset : int or {‘Sanger’, ‘Solexa’, ‘Illumina-1.3’, ‘Illumina-1.5’, ‘Illumina-1.8’}

This value is added to the quality score to obtain the ASCII code. It can be given either directly as the value or as a string that indicates the score format.

Yields
identifier : str

The identifier of the current sequence.

sequence : tuple(str, ndarray)

The current sequence as string and its corresponding quality scores as ndarray.

Notes

This approach gives the same results as FastqFile.read(file, offset).items(), but is slightly faster and much more memory efficient.
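To illustrate why iterating is more memory efficient than reading the whole file, here is a minimal generator sketch in plain Python. It is not biotite's implementation: iter_fastq is a hypothetical name, scores are returned as a list rather than an ndarray, and every record is assumed to span exactly four lines.

```python
import io


def iter_fastq(handle, offset):
    """Lazily yield (identifier, (sequence, scores)) from a file-like object.

    Assumes every record spans exactly four lines.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            # End of file (or an empty line) stops the iteration
            return
        if not header.startswith("@"):
            raise ValueError(f"Expected '@' at record start, got {header!r}")
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        # score = ASCII code - offset
        scores = [ord(char) - offset for char in quality]
        yield header[1:], (sequence, scores)


text = "@seq1\nATACT\n+\n!$+(-\n@seq2\nTTGTAGG\n+\n0.96=GD\n"
for identifier, (sequence, scores) in iter_fastq(io.StringIO(text), offset=33):
    print(identifier, sequence, scores)
```

Only one record is held in memory at a time, whereas a fully parsed file keeps every line until the object is discarded.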

setdefault(k[, d]) -> D.get(k,d), also set D[k]=d if k not in D
update([E, ]**F) -> None.  Update D from mapping/iterable E and F.

If E is present and has a .keys() method, does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, does: for (k, v) in E: D[k] = v
In either case, this is followed by: for k, v in F.items(): D[k] = v
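The three cases can be illustrated with a plain dict, which follows the same update() contract. This sketch deliberately does not use FastqFile; the (sequence, scores) tuples merely mimic the value layout described for this class.

```python
entries = {"seq1": ("ATACT", [0, 3, 10, 7, 12])}

# E has a .keys() method (a mapping)
entries.update({"seq2": ("TTG", [15, 13, 24])})

# E lacks a .keys() method (an iterable of key-value pairs)
entries.update([("seq3", ("GG", [38, 35]))])

# Keyword arguments F are applied last and overwrite existing keys
entries.update(seq1=("A", [40]))

print(list(entries))    # ['seq1', 'seq2', 'seq3']
print(entries["seq1"])  # ('A', [40])
```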

values() -> an object providing a view on D's values
write(file)

Write the contents of this object into a file (or file-like object).

Parameters
file : file-like object or str

The file to be written to. Alternatively a file path can be supplied.

static write_iter(file, items, offset, chars_per_line=None)

Iterate over the given items and write each item into the specified file.

In contrast to write(), the lines of text are not stored in an intermediate TextFile, but are written directly to the file. Hence, this static method may save a large amount of memory when a large file is written, especially if the items are provided as a generator.
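The idea of writing items as they arrive can be sketched in plain Python. This is an illustration under stated assumptions, not biotite's code: write_fastq_iter and generate_items are hypothetical names, and each record is written without line wrapping.

```python
import io


def write_fastq_iter(handle, items, offset):
    """Write each (identifier, (sequence, scores)) item as soon as it arrives."""
    for identifier, (sequence, scores) in items:
        # ASCII code = score + offset
        quality = "".join(chr(score + offset) for score in scores)
        handle.write(f"@{identifier}\n{sequence}\n+\n{quality}\n")


def generate_items():
    # Items are produced one at a time, so no full list is held in memory
    yield "seq1", ("ATACT", [0, 3, 10, 7, 12])
    yield "seq2", ("TTGTAGG", [15, 13, 24, 21, 28, 38, 35])


buffer = io.StringIO()
write_fastq_iter(buffer, generate_items(), offset=33)
print(buffer.getvalue(), end="")
```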

Parameters
file : file-like object or str

The file to be written to. Alternatively a file path can be supplied.

items : generator or array-like of tuple(str, tuple(str, ndarray))

The entries to be written into the file. Each entry consists of an identifier string and a tuple containing a sequence (as string) and a score array.

offset : int or {‘Sanger’, ‘Solexa’, ‘Illumina-1.3’, ‘Illumina-1.5’, ‘Illumina-1.8’}

This value is added to the quality score to obtain the ASCII code. It can be given either directly as the value or as a string that indicates the score format.

chars_per_line : int, optional

The number of characters in a line of sequence data after which a line break is inserted. Only relevant when adding sequences to a file. By default, each sequence (and score string) is put on a single line.

Notes

This method does not check whether the given identifiers are unique.
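If duplicate identifiers are a concern, the items could be passed through a checking generator before they are written. The sketch below is one possible approach, not something biotite provides: check_unique is a hypothetical helper.

```python
def check_unique(items):
    """Pass items through unchanged, raising on a repeated identifier."""
    seen = set()
    for identifier, value in items:
        if identifier in seen:
            raise ValueError(f"Duplicate identifier: {identifier!r}")
        seen.add(identifier)
        yield identifier, value


items = [
    ("seq1", ("ATACT", [0, 3, 10, 7, 12])),
    ("seq1", ("TTGTAGG", [15, 13, 24, 21, 28, 38, 35])),
]
try:
    list(check_unique(items))
except ValueError as error:
    print(error)  # Duplicate identifier: 'seq1'
```

The set of seen identifiers grows with the number of entries, so this check trades some of the memory savings for safety.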