.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/gallery/sequence/genome_assembly.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_gallery_sequence_genome_assembly.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_gallery_sequence_genome_assembly.py:


Comparative genome assembly of SARS-CoV-2 B.1.1.7 variant
=========================================================

.. currentmodule:: biotite.sequence.align

In the following script we will perform a comparative genome assembly of
the emerging SARS-CoV-2 B.1.1.7 variant in a simplified manner.
We will use publicly available sequencing data of the virus variant
produced from a *Oxford Nanopore MinION* and map these sequence snippets
(reads) to the reference SARS-CoV-2 genome.
Then we will create a single consensus sequence from the mapped reads
and analyze where the differences to the reference genome are
(variant calling).
At last, we will focus on the mutations in the *spike protein* sequence.

.. note:: As the methods shown here are focused on simplicity, the
   accuracy of the assembled genome may be lower than the output from
   more sophisticated assembly software/pipelines.

To begin with, we download the relevant sequencing data from the *NCBI*
*sequence read archive* (SRA) using *Biotite's* interface to the
*SRA Toolkit*.
The software stores the downloaded sequence reads in one or multiple
FASTQ files, one for each read per *spot*:
A spot is a 'location' on the sequencing device.
One spot may produce more than one read, e.g. *Illumina* sequencers use
`paired-end sequencing <https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html>`_
to produce a read starting from both ends of the original sequence.
However, the *MinION* technology only creates one read per input
sequence, so we expect only a single FASTQ file.

A FASTQ file provides for each read it contains

    1. the sead sequence,
    2. associated *Phred* quality scores.

Phred scores :math:`Q` describe the accuracy of each base in the read in
terms of base-call error probability :math:`P` :footcite:`Cock2010`:

.. math::

    Q = -10 \log_{10} P

.. GENERATED FROM PYTHON SOURCE LINES 45-82

.. code-block:: Python

    # Code source: Patrick Kunzmann
    # License: BSD 3 clause

    import itertools
    import tempfile
    from concurrent.futures import ProcessPoolExecutor
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.lines import Line2D
    from matplotlib.colors import LinearSegmentedColormap
    import biotite
    import biotite.sequence as seq
    import biotite.sequence.align as align
    import biotite.sequence.io as seqio
    import biotite.sequence.io.fasta as fasta
    import biotite.sequence.io.fastq as fastq
    import biotite.sequence.io.genbank as gb
    import biotite.sequence.graphics as graphics
    import biotite.database.entrez as entrez
    import biotite.application.sra as sra


    # Download the sequencing data
    app = sra.FastqDumpApp("SRR13453793")
    app.start()
    app.join()

    # Load sequences and quality scores from the sequencing data
    # There is only one read per spot
    file_path = app.get_file_paths()[0]
    fastq_file = fastq.FastqFile.read(file_path, offset="Sanger")
    reads = [seq.NucleotideSequence(seq_str)
             for seq_str, score_array in fastq_file.values()]
    score_arrays = [score_array for seq_str, score_array in fastq_file.values()]

    print(f"Number of reads: {len(reads)}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Number of reads: 58147


.. GENERATED FROM PYTHON SOURCE LINES 83-88

General sequencing data analysis
--------------------------------

First we have a first glance on the quality of the sequencing
data: the length of the reads and the Phred scores.

.. GENERATED FROM PYTHON SOURCE LINES 88-113

.. code-block:: Python


    N_BINS = 200


    fig, (length_ax, score_ax) = plt.subplots(nrows=2, figsize=(8.0, 6.0))

    length_ax.hist(
        [len(score_array) for score_array in score_arrays],
        bins=np.logspace(1, 5, N_BINS), color="gray"
    )
    length_ax.set_xlabel("Read length")
    length_ax.set_ylabel("Number of reads")
    length_ax.set_xscale("log")
    length_ax.set_yscale("log")

    score_ax.hist(
        [np.mean(score_array) for score_array in score_arrays],
        bins=N_BINS, color="gray",
    )
    score_ax.set_xlim(0, 30)
    score_ax.set_xlabel("Phred score")
    score_ax.set_ylabel("Number of reads")

    fig.tight_layout()


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_genome_assembly_001.png
   :alt: genome assembly
   :srcset: /examples/gallery/sequence/images/sphx_glr_genome_assembly_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 114-131

We can see the reads in the dataset are rather long, with most reads
longer than 1 kb.
This is one of the big advantages of the employed sequencing
technology.
Especially for *de novo* genome assembly (which we will not do here),
long reads facilitate the process.
However, the sequencing method also comes with a disadvantage:
The base-call accuracy is relatively low, as the Phred scores
indicate.
Keep in mind, that :math:`Q = 10` means that the called base at the
respective position has a probability of 10 % to be wrong!
Hence, a high sequencing depth, i.e. a large number of overlapping
reads at each sequence position, is required to achieve accurate
results.
The partially low accuracy becomes even more visible, when creating
a histogram over quality scores of individual bases, instead of
averaging the scores over each read.

.. GENERATED FROM PYTHON SOURCE LINES 131-149

.. code-block:: Python


    score_histogram = np.bincount(np.concatenate(score_arrays))

    fig, ax = plt.subplots(figsize=(8.0, 4.0))
    ax.fill_between(
        # Value in megabases -> 1e-6
        np.arange(len(score_histogram)), score_histogram * 1e-6,
        linewidth=0, color="gray"
    )
    ax.set_xlim(
        np.min(np.where(score_histogram > 0)[0]),
        np.max(np.where(score_histogram > 0)[0]),
    )
    ax.set_ylim(0, np.max(score_histogram * 1e-6) * 1.05)
    ax.set_xlabel("Phred score")
    ax.set_ylabel("Number of Mb")
    fig.tight_layout()


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_genome_assembly_002.png
   :alt: genome assembly
   :srcset: /examples/gallery/sequence/images/sphx_glr_genome_assembly_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 150-166

Optionally, you could exclude or trim reads with exceptionally low
Phred scores :footcite:`Pabinger2014`.
But instead we rely on a high sequencing depth to filter out
erroneous base calls.

Read mapping
------------

In the next step we map each read to its respective position
in the reference genome.
An additional challenge is to find the correct sense of the read:
In the library preparation both, sense and complementary DNA, is
produced from the virus RNA.
For this reason we need to create a complementary copy for each read
and map both strands to the reference genome.
Later the *wrong* strand is discarded.

.. GENERATED FROM PYTHON SOURCE LINES 166-179

.. code-block:: Python


    # Download and read the reference SARS-CoV-2 genome
    orig_genome_file = entrez.fetch(
        "NC_045512", tempfile.gettempdir(), "gb",
        db_name="Nucleotide", ret_type="gb"
    )
    orig_genome = seqio.load_sequence(orig_genome_file)

    # Create complementary reads
    compl_reads = list(itertools.chain(
        *[(read, read.reverse(False).complement()) for read in reads]
    ))


.. GENERATED FROM PYTHON SOURCE LINES 180-201

To map the reads to their corresponding positions in the reference
genome, we need to align them to it.
Although we could use :func:`align_optimal()`
(Needleman-Wunsch algorithm :footcite:`Needleman1970`) for this
purpose, aligning this large number of reads to even a small virus
genome would take hours.

Instead we choose an heuristic alignment approach, similar to the
method used by software like *BLAST* :footcite:`Altschul1990`:
First we scan each read for *k-mer* matches with the reference
genome.
A *k-mer* match is a subsequence of length *k* that appears in both
the read an the reference genome.
If *k* is chosen sufficiently large, such matches most likely occur
in homologous sequence regions.
Then we can perform an alignment that is restricted to the region of
a match.

The :class:`KmerTable` class allows the fast *k-mer* match scanning
by tabulation of all *k-mer* positions in the reference genome
sequence.

.. GENERATED FROM PYTHON SOURCE LINES 201-214

.. code-block:: Python


    K = 12

    genome_table = align.KmerTable.from_sequences(K, [orig_genome])

    all_matches = []
    for i, read in enumerate(compl_reads):
        all_matches.append(genome_table.match(read))

    # k-mer tables use quite a large amount of RAM
    # and we do not need this object anymore
    del genome_table


.. GENERATED FROM PYTHON SOURCE LINES 215-235

However, we can expect a lot of consecutive *k-mer* match positions
for each read:
For example, for :math:`k = 3`, the nucleotide ``ACATT`` compared to
itself would give matches for the *k-mers* ``ACA``, ``CAT`` and
``ATT``.
The respective match positions would be `(0,0)`, `(1,1)` and `(2,2)`.
However, the diagonal :math:`D = j - i`, where *i* and *j* are
positions in the first and second sequence respectively, is always the
same in this case: It is 0.
The same applies for the read mapping:
For the homologous region between the read and the genome all matches
should be approximately on the same diagonal.
Small deviations may arise from deletions/insertions (indels).
As long as no indel occurs, the match diagonal should
always be the same.
However, we can expect to have some unspecific *k-mer* matches, too.
But the diagonal of unspecific matches differs significantly from the
diagonal of the 'correct' matches.
Therefore, we select the diagonal with the highest frequency as the
'correct' diagonal for each read.

.. GENERATED FROM PYTHON SOURCE LINES 235-274

.. code-block:: Python


    # Pick the matches for the 6th read as example
    INDEX = 5
    matches = all_matches[INDEX]
    read_length = len(compl_reads[INDEX])

    # Find the correct diagonal for the example read
    diagonals = matches[:,2] - matches[:,0]
    diag, counts = np.unique(diagonals, return_counts=True)
    correct_diagonal = diag[np.argmax(counts)]

    # Visualize the matches and the correct diagonal
    fig, ax = plt.subplots(figsize=(8.0, 8.0))
    ax.scatter(
        matches[:,0], matches[:,2],
        s=4, marker="o", color=biotite.colors["dimorange"], label="Match"
    )
    ax.plot(
        [0, read_length], [correct_diagonal, read_length+correct_diagonal],
        linestyle=":", linewidth=1.0, color="black", label="Correct diagonal"
    )
    ax.set_xlim(0, read_length)
    ax.set_xlabel("Read position")
    ax.set_ylabel("Reference genome position")
    ax.legend()
    fig.tight_layout()


    # Find the correct diagonal for all reads
    correct_diagonals = [None] * len(all_matches)
    for i, matches in enumerate(all_matches):
        diagonals = matches[:,2] - matches[:,0]
        unqiue_diag, counts = np.unique(diagonals, return_counts=True)
        if len(unqiue_diag) == 0:
            # If no match is found for this sequence, ignore this sequence
            continue
        correct_diagonals[i] = unqiue_diag[np.argmax(counts)]
    del matches


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_genome_assembly_003.png
   :alt: genome assembly
   :srcset: /examples/gallery/sequence/images/sphx_glr_genome_assembly_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 275-323

As already outlined, we would like to limit the alignment search space
to the matching diagonal of the read and the reference genome
(including some buffer to account for indels) to reduce the
computation time.
Hence, we use :func:`align_banded()` to align the sequences.
This function aligns two sequences within a diagonal band, defined by
a lower diagonal :math:`D_L` and an upper diagonal :math:`D_U`
:footcite:`Chao1992`.
Two symbols at position *i* and *j* can only be
aligned to each other, if :math:`D_L \leq j - i \leq D_U`.
This also means, that the algorithm does not find an alignment,
where due to indels it would leave the defined
band.

We can safely center the band at the correct diagonal we obtained
in the previous step for each read, but how do we choose the maximum
deviation from the center of the band, i.e. the maximum number of
indels we allow in either direction?

Statistics may help us here.
As mentioned above, the utilized sequencing technique is relatively
error-prone.
Hence, let's assume that the number of true indels between the
original SARS-CoV-2 and the B.1.1.7 variant can be ignored compared to
the larger number of indels introduced by sequencing errors.
The indel error rates are approximately known for the
*MinION* :footcite:`Tyler2018`:
insertion rate :math:`p_i = 0.03`, deletion rate :math:`p_d = 0.05`.
Based on these probabilities we can define a band that will most
probably be broad enough to cover the number of appearing read
indels :footcite:`Gibrat2018`.
:math:`\sigma` gives the standard deviation from the correct diagonal
and can be calculated as

.. math::

   \sigma = \sqrt{4 N (p_i + p_d - p_i^2 - p_d^2)}

where :math:`N` is the read length.
We choose :math:`3 \sigma` as the deviation from the center of the
band, resulting in a :math:`< 0.3\%` chance that the optimal alignment
path would leave the band.

Although, the computation time is massively reduced by using
:func:`align_banded()`, the gapped alignment step is still the most
time-consuming one.
Therefore, we use multiprocessing to spread the task to multiple cores
on multi-core architectures.

.. GENERATED FROM PYTHON SOURCE LINES 323-346

.. code-block:: Python


    P_INDEL = 4 * (0.03 + 0.05 - 0.03**2 - 0.05**2)

    matrix = align.SubstitutionMatrix.std_nucleotide_matrix()

    def map_sequence(read, diag):
        deviation = int(3 * np.sqrt(len(read) * P_INDEL))
        if diag is None:
            return None
        else:
            return align.align_banded(
                read, orig_genome, matrix, gap_penalty=-10,
                band = (diag - deviation, diag + deviation),
                max_number = 1
            )[0]

    # Each process can be quite memory consuming
    # -> Cap to two processes to make it work on low-RAM commodity hardware
    with ProcessPoolExecutor(max_workers=2) as executor:
        alignments = list(executor.map(
            map_sequence, compl_reads, correct_diagonals, chunksize=1000
        ))


.. GENERATED FROM PYTHON SOURCE LINES 347-350

Now we have to select for each read, whether the original or
complementary strand is the one homologous to the reference genome.
We simply select the one with the higher score.

.. GENERATED FROM PYTHON SOURCE LINES 350-367

.. code-block:: Python


    for_alignments = [alignments[i] for i in range(0, len(alignments), 2)]
    rev_alignments = [alignments[i] for i in range(1, len(alignments), 2)]

    scores = np.stack((
        [ali.score if ali is not None else 0 for ali in for_alignments],
        [ali.score if ali is not None else 0 for ali in rev_alignments]
    ),axis=-1)

    correct_sense = np.argmax(scores, axis=-1)
    correct_alignments = [for_a if sense == 0 else rev_a for for_a, rev_a, sense
                          in zip(for_alignments, rev_alignments, correct_sense)]
    # If we use a reverse complementary read,
    # we also need to reverse the Phred score arrays
    correct_score_arrays = [score if sense == 0 else score[::-1] for score, sense
                            in zip(score_arrays, correct_sense)]


.. GENERATED FROM PYTHON SOURCE LINES 368-372

Now we know for each read where its corresponding position on the
reference genome is.
The mapping is complete.
Eventually, we visualize the mapping.

.. GENERATED FROM PYTHON SOURCE LINES 372-399

.. code-block:: Python


    # Find genome positions for the starts and ends of all reads
    starts = np.array(
        [ali.trace[ 0, 1] for ali in correct_alignments if ali is not None]
    )
    stops = np.array(
        [ali.trace[-1, 1] for ali in correct_alignments if ali is not None]
    )
    # For a nicer plot sort these by their start position
    order = np.argsort(starts)
    starts = starts[order]
    stops = stops[order]

    fig, ax = plt.subplots(figsize=(8.0, 12.0))
    ax.barh(
        np.arange(len(starts)), left=starts, width=stops-starts, height=1,
        color=biotite.colors["dimgreen"], linewidth=0
    )
    ax.set_ylim(0, len(starts)+1)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.tick_params(left=False, labelleft=False)
    ax.set_xlabel("Sequence position")
    ax.set_title("Read mappings to reference genome")
    fig.tight_layout()


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_genome_assembly_004.png
   :alt: Read mappings to reference genome
   :srcset: /examples/gallery/sequence/images/sphx_glr_genome_assembly_004.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 400-469

Variant calling
---------------

Variant calling is the process of identifying substitutions and indels
in the sequencing data compared to a reference genome.
Generally, this task is not necessarily straight-forward: For example,
the sequencing data might originate from to a diploid genome, so there
might be two variants for each position due to heterozygosity.
In our case we analyze a virus genome, so we expect only a single
variant, which makes the challenge much easier.

Sophisticated variant calling methods may take a lot of factors into
account, e.g. expected GC content, error rates, etc., to tackle the
problem of erroneous base calls from the sequencer.
In this script we take a rather simple approach.

Considering a single sequence location on the genome, we are
interested in finding the most probable base from the sequencing data,
or in other words the base that is least the result of a sequencing
error.
For a symbol (base) :math:`s \in \{ A, C, G, T\}` the probability
:math:`P` of having a genotype :math:`G \neq s` dependent on all base
calls :math:`c_i` is proportional to the product of the error
probabilities for each base call, because each base call is considered
an independent event.

.. math::

  P(G \neq s | c_1, ... c_n) \propto \prod_{i: c_i = s} p(G \neq s | c_i)

The proportionality instead of equality applies here, as this formula
ignores base calls where :math:`c_i \neq s`, because these cases do
not have an impact on which base is most probable.

As we consider the base that is least the result of a sequencing
error as most probable genotype, we need to find :math:`s_G`, where

.. math::

  s_G = {\arg\min}_s \left( \prod_{i: c_i = s} p(G \neq s | c_i) \right).

We can replace the base call error probability
:math:`p(G \neq s | c_i)`, as it is given by the Phred score.

.. math::

  s_G = {\arg\min}_s \left( \prod_{i: c_i = s} 10^{-\frac{Q_i}{10}} \right)

To simplify this equation we can take the logarithm of the product on
the right expression, as the logarithm is a monotonic function.

.. math::

  s_G = {\arg\min}_s \left( \sum_{i: c_i = s} log_{10}\left(10^{-\frac{Q_i}{10}}\right) \right)

  s_G = {\arg\min}_s \left( \sum_{i: c_i = s} -\frac{Q_i}{10} \right)

  s_G = {\arg\min}_s \left( \sum_{i: c_i = s} -Q_i \right)

  s_G = {\arg\max}_s \left( \sum_{i: c_i = s} Q_i \right)

This means we have to find the symbol with the maximum sum of
supporting Phred scores.
This approach is quite intuitive:
The more often a base has been called, weighted with the certainty of
the sequencer, the more likely this base is truly at this position.

.. note:: For the sake of brevity possible insertions into the
   reference genome are not considered in the method shown here.

.. GENERATED FROM PYTHON SOURCE LINES 469-538

.. code-block:: Python


    # There are four possible bases for each genome position
    phred_sum = np.zeros((len(orig_genome), 4), dtype=int)
    # Track the sequencing depth over the genome for visualization purposes
    sequencing_depth = np.zeros(len(orig_genome), dtype=int)
    # Also track how many reads have called a deletion
    # for each genome postion
    deletion_number = np.zeros(len(orig_genome), dtype=int)

    for alignment, score_array in zip(correct_alignments, correct_score_arrays):
        if alignment is not None:
            trace = alignment.trace

            no_gap_trace = trace[(trace[:,0] != -1) & (trace[:,1] != -1)]
            # Get the sequence code for the aligned read symbols
            seq_code = alignment.sequences[0].code[no_gap_trace[:,0]]
            # The sequence code contains the integers 0 - 3;
            # one for each possible base
            # Hence, we can use these integers directly to index the second
            # dimension of the Pred score sum
            # The index for the first dimension contains simply the genome
            # positions taken from the alignment trace
            phred_sum[no_gap_trace[:,1], seq_code] \
                += score_array[no_gap_trace[:,0]]

            sequencing_depth[
                trace[0,1] : trace[-1,1]
            ] += 1

            read_gap_trace = trace[trace[:,0] == -1]
            deletion_number[read_gap_trace[:,1]] += 1

    # Call the most probable base for each genome position according to the
    # formula above
    most_probable_symbol_codes = np.argmax(phred_sum, axis=1)


    # Visualize the sequencing depth and score sum over the genome
    max_phred_sum = phred_sum[
        np.arange(len(phred_sum)), most_probable_symbol_codes
    ]

    def moving_average(data_set, window_size):
        weights = np.full(window_size, 1/window_size)
        return np.convolve(data_set, weights, mode='valid')

    fig, ax = plt.subplots(figsize=(8.0, 4.0))
    ax.plot(
        moving_average(max_phred_sum, 100),
        color="lightgray", linewidth=1.0
    )
    ax2 = ax.twinx()
    ax2.plot(
        moving_average(sequencing_depth, 100),
        color=biotite.colors["dimorange"], linewidth=1.0
    )
    ax.axhline(0, color="silver", linewidth=0.5)
    ax.set_xlim(0, len(orig_genome))
    ax.set_xlabel("Genome postion")
    ax.set_ylabel("Phred score sum")
    ax2.set_ylabel("Sequencing depth")
    ax.legend(
        [Line2D([0], [0], color=c)
         for c in ("lightgray", biotite.colors["dimorange"])],
        ["Phred score sum", "Sequencing depth"],
        loc="upper left"
    )
    fig.tight_layout()


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_genome_assembly_005.png
   :alt: genome assembly
   :srcset: /examples/gallery/sequence/images/sphx_glr_genome_assembly_005.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 539-547

We are finally reaching the last step of the assembly.
Until now we only covered substitutions, but we also need to cover
deletions.
The statistics are more complex here, as a missing base in a read has
of course no assigned Phred score.
For the purpose of this example script we simply define as threshold:
At least 60 % of all reads covering a certain location must call a
deletion for this location, otherwise the deletion is rejected

.. GENERATED FROM PYTHON SOURCE LINES 547-564

.. code-block:: Python


    DELETION_THRESHOLD = 0.6

    var_genome = seq.NucleotideSequence()
    var_genome.code = most_probable_symbol_codes
    # A deletion is called, if either enough reads include this deletion
    # or the sequence position is not covered by any read at all
    deletion_mask = (deletion_number > sequencing_depth * DELETION_THRESHOLD) \
                    | (sequencing_depth == 0)
    var_genome = var_genome[~deletion_mask]
    # Write the assembled genome into a FASTA file
    out_file = fasta.FastaFile()
    fasta.set_sequence(
        out_file, var_genome, header="SARS-CoV-2 B.1.1.7", as_rna=True
    )
    out_file.write(tempfile.NamedTemporaryFile("w"))


.. GENERATED FROM PYTHON SOURCE LINES 565-577

We have done it, the genome of the B.1.1.7 variant is assembled!
Now we would like to have a closer look on the difference between the
original and the B.1.1.7 genome.

Mutations in the B.1.1.7 variant
--------------------------------

To get an rough overview about the overall sequence identity between
the genomes and the locations of mutations in the B.1.1.7 variant,
we need to align the original genome to our assembled one.
As both genomes are expected to be highly similar, we can use a banded
alignment again using a very conservative band width.

.. GENERATED FROM PYTHON SOURCE LINES 577-587

.. code-block:: Python


    BAND_WIDTH = 1000

    genome_alignment = align.align_banded(
        var_genome, orig_genome, matrix,
        band=(-BAND_WIDTH//2, BAND_WIDTH//2), max_number=1
    )[0]
    identity = align.get_sequence_identity(genome_alignment, 'all')
    print(f"Sequence identity: {identity * 100:.2f} %")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sequence identity: 99.76 %


.. GENERATED FROM PYTHON SOURCE LINES 588-593

Now we would like to have a closer look at the mutation locations.
To contextualize the locations we plot the mutation frequency along
with the gene locations.
The genomic coordinates for each gene can be extracted from the
already downloaded *GenBank* file of the reference genome.

.. GENERATED FROM PYTHON SOURCE LINES 593-645

.. code-block:: Python


    N_BINS = 50

    # Get genomic coordinates for all SARS-Cov-2 genes
    gb_file = gb.GenBankFile.read(orig_genome_file)
    annot_seq = gb.get_annotated_sequence(gb_file, include_only=["gene"])

    # Calculate the sequence identity within each bin
    bin_identities = np.zeros(N_BINS)
    edges = np.linspace(0, len(orig_genome), N_BINS+1)
    for i, (bin_start, bin_stop) in enumerate(zip(edges[:-1], edges[1:])):
        orig_genome_trace = genome_alignment.trace[:,1]
        excerpt = genome_alignment[
            (orig_genome_trace >= bin_start) & (orig_genome_trace < bin_stop)
        ]
        bin_identities[i] = align.get_sequence_identity(excerpt, "all")


    fig, (deviation_ax, feature_ax) = plt.subplots(nrows=2, figsize=(8.0, 5.0))

    # Plot the deviation = 1 - sequence identity
    deviation_ax.bar(
        edges[:-1], width=(edges[1:]-edges[:-1]),
        height=(1 - bin_identities),
        color=biotite.colors["dimorange"], align="edge"
    )
    deviation_ax.set_xlim(0, len(orig_genome))
    deviation_ax.set_ylabel("1 - Sequence identity")
    deviation_ax.set_title("Sequence deviation of SARS-CoV-2 B.1.1.7 variant")
    deviation_ax.set_yscale("log")
    deviation_ax.set_ylim(1e-3, 1e-1)

    # Plot genmic coordinates of the genes
    for i, feature in enumerate(sorted(
        annot_seq.annotation,
        key=lambda feature: min([loc.first for loc in feature.locs])
    )):
        for loc in feature.locs:
            feature_ax.barh(
                left=loc.first, width=loc.last-loc.first, y=i, height=1,
                color=biotite.colors["dimgreen"]
            )
            feature_ax.text(
                loc.last + 100, i, feature.qual["gene"],
                fontsize=8, ha="left", va="center"
            )
    feature_ax.set_ylim(i+0.5, -0.5)
    feature_ax.set_xlim(0, len(orig_genome))
    feature_ax.xaxis.set_visible(False)
    feature_ax.yaxis.set_visible(False)
    feature_ax.set_frame_on(False)


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_genome_assembly_006.png
   :alt: Sequence deviation of SARS-CoV-2 B.1.1.7 variant
   :srcset: /examples/gallery/sequence/images/sphx_glr_genome_assembly_006.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 647-667

The *S* gene codes for the infamous *spike protein*: a membrane
protein on the surface of SARS-CoV-2 that drives the infiltration of
the host cell.
Let's have closer look on it.

Differences in the spike protein
--------------------------------

For the investigation of the spike protein differences between the
original and the variant SARS-CoV-2, we need to acquire the
corresponding protein sequences.
The location of the spike protein is annotated in the *GenBank* file
for the reference genome.
The homologous sequence for B.1.1.7 can be obtained by global
sequence alignment of the spike gene sequence with the variant genome.
Eventually, we can translate the gene sequences into protein sequences
and compare them with each other - again by aligning them.
To add meaning to the location of mutations we look at them in the
context of the spike protein features/domains, which are well
known :footcite:`Xia2021`.

.. GENERATED FROM PYTHON SOURCE LINES 667-762

.. code-block:: Python


    SYMBOLS_PER_LINE = 75
    SPACING = 3

    # The locations of some notable spike protein regions
    FEATURES = {
        # Signal peptide
        "SP":  (   1,   12),
        # N-terminal domain
        "NTD": (  14,  303),
        # Receptor binding domain
        "RBD": ( 319,  541),
        # Fusion peptide
        "FP":  ( 788,  806),
        # Transmembrane domain
        "TM":  (1214, 1234),
        # Cytoplasmatic tail
        "CT":  (1269, 1273),
    }

    # Get RNA sequence coding for spike protein from the reference genome
    for feature in annot_seq.annotation:
        if feature.qual["gene"] == "S":
            orig_spike_seq = annot_seq[feature]

    # Align spike protein sequence to variant genome to get the B.1.1.7
    # spike protein sequence
    alignment = align.align_optimal(
        var_genome, orig_spike_seq, matrix, local=True, max_number=1
    )[0]
    var_spike_seq = var_genome[alignment.trace[alignment.trace[:,0] != -1, 0]]

    # Obtain protein sequences from RNA sequences
    orig_spike_prot_seq = orig_spike_seq.translate(complete=True).remove_stops()
    var_spike_prot_seq  =  var_spike_seq.translate(complete=True).remove_stops()

    # Align both protein sequences with each other for later comparison
    blosum_matrix = align.SubstitutionMatrix.std_protein_matrix()
    alignment = align.align_optimal(
        var_spike_prot_seq, orig_spike_prot_seq, blosum_matrix, max_number=1
    )[0]


    fig = plt.figure(figsize=(8.0, 10.0))
    ax = fig.add_subplot(111)

    # Plot alignment
    cmap = LinearSegmentedColormap.from_list(
        "custom", colors=[(1.0, 0.3, 0.3), (1.0, 1.0, 1.0)]
        #                    ^ reddish        ^ white
    )
    graphics.plot_alignment_similarity_based(
        ax, alignment, matrix=blosum_matrix, symbols_per_line=SYMBOLS_PER_LINE,
        labels=["B.1.1.7", "Reference"], show_numbers=True, label_size=9,
        number_size=9, symbol_size=7, spacing=SPACING, cmap=cmap
    )

    ## Add indicator for features to the alignment
    for row in range(1 + len(alignment) // SYMBOLS_PER_LINE):
        col_start = SYMBOLS_PER_LINE * row
        col_stop  = SYMBOLS_PER_LINE * (row + 1)
        if col_stop > len(alignment):
            # This happens in the last line
            col_stop = len(alignment)
        seq_start = alignment.trace[col_start, 1]
        seq_stop  = alignment.trace[col_stop-1,  1] + 1
        n_sequences = len(alignment.sequences)
        y_base = (n_sequences + SPACING) * row + n_sequences

        for feature_name, (first, last) in FEATURES.items():
            # Zero based sequence indexing
            start = first-1
            # Exclusive stop
            stop = last
            if start < seq_stop and stop > seq_start:
                # The feature is found in this line
                x_begin = np.clip(start - seq_start, 0, SYMBOLS_PER_LINE)
                x_end   = np.clip(stop - seq_start,  0, SYMBOLS_PER_LINE)
                x_mean = (x_begin + x_end) / 2
                y_line = y_base + 0.3
                y_text = y_base + 0.6
                ax.plot(
                    [x_begin, x_end], [y_line, y_line],
                    color="black", linewidth=2
                )
                ax.text(
                    x_mean, y_text, feature_name,
                    fontsize=8, va="top", ha="center"
                )
    # Increase y-limit to include the feature indicators in the last line
    ax.set_ylim(y_text, 0)
    fig.tight_layout()

    plt.show()


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_genome_assembly_007.png
   :alt: genome assembly
   :srcset: /examples/gallery/sequence/images/sphx_glr_genome_assembly_007.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 763-778

The most relevant mutations displayed here are the Δ69/70 deletion and
the N501Y and D614G substitutions. Δ69/70 might allosterically change
the protein conformation :footcite:`Xie2021`.
D614G :footcite:`Daniloski2021` and N501Y :footcite:`Tian2021`
increase the efficiency of host cell infection.
For N501Y the reason is apparent:
Being located in the RBD, this residue interacts directly with the
human host cell receptor *angiotensin-converting enzyme 2* (ACE2).
Therefore, by increasing the binding affinity for ACE2, the infection
is also facilitated.

References
----------

.. footbibliography::


.. _sphx_glr_download_examples_gallery_sequence_genome_assembly.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: genome_assembly.ipynb <genome_assembly.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: genome_assembly.py <genome_assembly.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_