.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "examples/gallery/sequence/sequencing/quality_control.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_examples_gallery_sequence_sequencing_quality_control.py: Quality control of sequencing data ================================== This script performs quality control on sequence reads similar to the `FastQC `_ software. This example is inspired by the `Galaxy tutorial `_. At first we fetch example sequencing data from the *NCBI* *sequence read archive* SRA. In this case we take data from whole genome sequencing of *Escherichia virus phiX174*. For the following script to work it is only important that each read has the exact same length. .. GENERATED FROM PYTHON SOURCE LINES 18-47 .. code-block:: Python # Code source: Patrick Kunzmann # License: BSD 3 clause import matplotlib.pyplot as plt import matplotlib.ticker as ticker import numpy as np from scipy.stats import binom import biotite import biotite.application.sra as sra import biotite.sequence as seq FIG_SIZE = (8.0, 6.0) app = sra.FastqDumpApp("ERR266411") app.start() app.join() # Each run can have multiple reads per spot # by selecting index 0 we take only the first read for every spot sequences_and_scores = app.get_sequences_and_scores()[0] sequence_codes = np.stack( [sequence.code for sequence, _ in sequences_and_scores.values()] ) scores = np.stack([scores for _, scores in sequences_and_scores.values()]) seq_count = scores.shape[0] seq_length = scores.shape[1] positions = np.arange(1, seq_length + 1) .. GENERATED FROM PYTHON SOURCE LINES 49-55 We begin our analysis by investigating how the quality scores develop within the reads. Hence, we create something similar to a box plot for each read position. For the plot we need the first, second (the median) and third quartile for each position. .. GENERATED FROM PYTHON SOURCE LINES 55-74 .. code-block:: Python first_quartile, median, third_quartile = np.quantile(scores, (0.25, 0.5, 0.75), axis=0) fig, ax = plt.subplots(figsize=FIG_SIZE) ax.bar( positions, bottom=first_quartile, height=third_quartile - first_quartile, width=1.0, facecolor=biotite.colors["brightorange"], label="Lower/upper quartile", ) ax.plot(positions, median, color=biotite.colors["dimorange"], label="Median") ax.set_xlim(positions[0], positions[-1]) ax.set_xlabel("Position in read") ax.set_ylabel("Phred score") ax.legend(loc="lower left") fig.tight_layout() .. image-sg:: /examples/gallery/sequence/sequencing/images/sphx_glr_quality_control_001.png :alt: quality control :srcset: /examples/gallery/sequence/sequencing/images/sphx_glr_quality_control_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 75-82 We can see that the Phred scores sharply increases the first few bases and then slowly drops with the length of the read, but overall, the quality is quite good. This behavior is quite common for *Illumina* sequencers. Now we have a look at the other dimension: How is the quality distributed over the reads? .. GENERATED FROM PYTHON SOURCE LINES 82-99 .. code-block:: Python BIN_NUMBER = 500 mean_scores = np.mean(scores, axis=1) fig, ax = plt.subplots(figsize=FIG_SIZE) ax.hist( # Definition range of Sanger Phred scores is 0 to 40 mean_scores, bins=np.linspace(0, 40, BIN_NUMBER), color=biotite.colors["lightorange"], ) ax.set_xlabel("Mean Phred score of sequence") ax.set_ylabel("Sequence count") ax.set_xlim(np.floor(np.min(mean_scores)), np.ceil(np.max(mean_scores))) fig.tight_layout() .. image-sg:: /examples/gallery/sequence/sequencing/images/sphx_glr_quality_control_002.png :alt: quality control :srcset: /examples/gallery/sequence/sequencing/images/sphx_glr_quality_control_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 100-106 This is a typical distribution. Now we want to see the appearance of each base over the length of the sequence reads. In a random library one would expect, that :math:`p(A) \approx p(T)` and :math:`p(G) \approx p(C)` for every position. .. GENERATED FROM PYTHON SOURCE LINES 106-127 .. code-block:: Python # Use ambiguous DNA alphabet, # as ambiguous bases might occur in some sequencing datasets alphabet = seq.NucleotideSequence.alphabet_amb counts = np.stack( [np.bincount(codes, minlength=len(alphabet)) for codes in sequence_codes.T], axis=-1 ) frequencies = counts / seq_count * 100 fig, ax = plt.subplots(figsize=FIG_SIZE) for character, freq in zip(alphabet.get_symbols(), frequencies): if (freq > 0).any(): ax.plot(positions, freq, label=character) ax.set_xlim(positions[0], positions[-1]) ax.xaxis.set_major_locator(ticker.AutoLocator()) ax.set_xlabel("Position in read") ax.set_ylabel("Frequency at position (%)") ax.legend(loc="upper left") fig.tight_layout() .. image-sg:: /examples/gallery/sequence/sequencing/images/sphx_glr_quality_control_003.png :alt: quality control :srcset: /examples/gallery/sequence/sequencing/images/sphx_glr_quality_control_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 128-134 Again we have a look at the other dimension: How is the GC content distributed over the reads? If each position in all reads had a random base, we would expect a binomial distribution with a peak at the overall GC content. We plot our actual distribution along with the idealized binomial distribution. .. GENERATED FROM PYTHON SOURCE LINES 134-166 .. code-block:: Python gc_count = np.count_nonzero( (sequence_codes == alphabet.encode("G")) | (sequence_codes == alphabet.encode("C")), axis=1, ) at_count = np.count_nonzero( (sequence_codes == alphabet.encode("A")) | (sequence_codes == alphabet.encode("T")), axis=1, ) gc_content = gc_count / (gc_count + at_count) # Exclusive range -> 0 to seq_length inclusive number_of_gc = np.arange(seq_length + 1) exp_gc_content = binom.pmf(k=number_of_gc, n=seq_length, p=np.mean(gc_content)) fig, ax = plt.subplots(figsize=FIG_SIZE) # Due to finite sequence length, the distribution is discrete # -> use bar() instead of hist() values, counts = np.unique(gc_content, return_counts=True) bin_width = 100 / seq_length ax.bar(values * 100, counts, width=bin_width, color=biotite.colors["brightorange"]) ax.plot( number_of_gc / seq_length * 100, exp_gc_content * seq_count, color=biotite.colors["dimorange"], linestyle=":", ) ax.set_xlim(0, 100) ax.set_xlabel("Sequence GC content (%)") ax.set_ylabel("Sequence count") fig.tight_layout() .. image-sg:: /examples/gallery/sequence/sequencing/images/sphx_glr_quality_control_004.png :alt: quality control :srcset: /examples/gallery/sequence/sequencing/images/sphx_glr_quality_control_004.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 167-177 In this case the actual data stick relatively good to the idealized curve. Strong deviations may indicate contaminations or a biased library. Finally, we investigate how often reads are duplicated in the dataset. For this purpose we create a dictionary that stores the sequences (or rather their sequence code) as keys and the number of duplications as values. Note that a duplication of number of ``1`` indicates that the sequence appears only once in the dataset. .. GENERATED FROM PYTHON SOURCE LINES 177-208 .. code-block:: Python duplications = {} for code in sequence_codes: # NumPy arrays cannot be hashed -> convert array into tuple code = tuple(code) if code in duplications: duplications[code] += 1 else: duplications[code] = 1 duplication_level_count = np.bincount(list(duplications.values())) duplication_level_freq = ( duplication_level_count * np.arange(len(duplication_level_count)) / seq_count * 100 ) max_duplication = len(duplication_level_count) - 1 print("Maximum duplication number:", max_duplication) fig, ax = plt.subplots(figsize=FIG_SIZE) ax.bar( np.arange(0, len(duplication_level_freq)), duplication_level_freq, width=0.6, color=biotite.colors["dimorange"], ) ax.set_xlim(0.5, len(duplication_level_freq) + 0.5) ax.xaxis.set_major_locator(ticker.MaxNLocator(10)) ax.set_xlabel("Number of duplications") ax.set_ylabel("Sequence percentage (%)") fig.tight_layout() plt.show() .. image-sg:: /examples/gallery/sequence/sequencing/images/sphx_glr_quality_control_005.png :alt: quality control :srcset: /examples/gallery/sequence/sequencing/images/sphx_glr_quality_control_005.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none Maximum duplication number: 187 .. GENERATED FROM PYTHON SOURCE LINES 209-214 The dataset has quite an unusual repetition profile: Usually one would expect, that most sequences occur only once and the following duplication numbers become decreasingly likely. However, in this case we have another peak at around 60 duplications. And one read is even repeated astonishing 161 times! .. _sphx_glr_download_examples_gallery_sequence_sequencing_quality_control.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: quality_control.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: quality_control.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: quality_control.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_