.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/gallery/sequence/genome_search.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_gallery_sequence_genome_search.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_gallery_sequence_genome_search.py:


Finding homologs of a gene in a genome
======================================

.. currentmodule:: biotite.sequence.align

In this example we use a combination of a :class:`KmerTable`
and :func:`align_banded()` to search a genome for homologs to a gene
from another organism.
Specifically we take the gene coding for the *M1* RNA from
*Escherichia coli* BL21, an RNA subunit of the *ribonuclease P*,
and try to find the counterpart in the genome of *Salmonella enterica*.

This approach has a high performance compared to :func:`align_optimal()`
and is similar to the method used by software like *BLAST*.

At first the sequences for the *M1* coding gene and the *S. enterica*
genome are downloaded from *NCBI Entrez*. 

.. GENERATED FROM PYTHON SOURCE LINES 20-56

.. code-block:: Python


    # Code source: Patrick Kunzmann
    # License: BSD 3 clause

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.collections import LineCollection
    import biotite
    import biotite.sequence as seq
    import biotite.sequence.io.fasta as fasta
    import biotite.sequence.io.genbank as gb
    import biotite.sequence.graphics as seqgraphics
    import biotite.sequence.align as align
    import biotite.database.entrez as entrez
    import biotite.structure.graphics as strucgraphics
    import biotite.application.viennarna as viennarna


    # Download Escherichia coli BL21 and Salmonella enterica genome
    gb_file = gb.MultiFile.read(entrez.fetch_single_file(
        ["CP001509", "CP019649"], None, "nuccore", "gb"
    ))
    ec_file, se_file = tuple(gb_file)

    annot_seq = gb.get_annotated_sequence(ec_file, include_only=["ncRNA"])
    # Find M1 gene in E. coli genome via its annotation
    for feature in annot_seq.annotation:
        if "product" in feature.qual and "RNase P" in feature.qual["product"]:
            m1_sequence = annot_seq[feature]

    # Get S. enterica genome sequence
    se_genome = gb.get_sequence(se_file)

    # We want to search in the genome sequence and its reverse complement
    genomic_seqs = [se_genome, se_genome.reverse().complement()]


.. GENERATED FROM PYTHON SOURCE LINES 57-73

In an initial fast matching step, we look for matching *k-mers*
between *M1* and the *S. enterica* genome.
A matching *k-mer* is a length *k* subsequence, that appears in both
sequences.
In *Biotite* this task is carried out by a :class:`KmerTable`.

Later we will perform costly gapped alignments at the match positions.
To reduce the number of triggered gapped alignments at unspecific
matches, we add some additional filters:
One extra condition is that two non-overlapping matches must appear
on the same diagonal :math:`D = j - i`, where *i* and *j* are the
sequence positions in the match.
Furthermore, only one match is considered in a defined range of
diagonals, as small deviations from the diagonal appear due to indels.
Without this filter, multiple gapped alignments would be triggered in
basically the same region, leading to redundant alignments.

.. GENERATED FROM PYTHON SOURCE LINES 73-128

.. code-block:: Python


    K = 12
    MAX_MATCH_DISTANCE = 20
    DISCARD_RANGE = 50

    table = align.KmerTable.from_sequences(K, genomic_seqs)
    matches = table.match(m1_sequence)

    fig = plt.figure(figsize=(8.0, 8.0))

    trigger_matches = []
    # 0 represents the original genome sequence, 1 the reverse complement
    for strand in (0, 1):
        matches_for_strand = matches[matches[:,1] == strand]

        # Plot match positions
        ax = fig.add_subplot(1, 2, strand+1)
        ax.scatter(
            matches_for_strand[:,0], matches_for_strand[:,2] / 1e6,
            s=4, marker="o", color=biotite.colors["dimorange"]
        )
        ax.set_xlim(0, len(m1_sequence))
        ax.set_ylim(0, len(se_genome) / 1e6)
        ax.set_xlabel("E. coli M1 position (b)")
        if strand == 0:
            ax.set_ylabel("S. enterica genome position (Mb)")
        else: # strand == 1
            ax.set_ylabel("S. enterica genome position (Mb) (reverse complement)")
    
        # Check if there are two adjacent matches on the same diagonal
        diagonals = matches_for_strand[:,2] - matches_for_strand[:,0]
        unique_diag = np.unique(diagonals)
        trigger_diagonals = np.array([], dtype=int)
        for diag in unique_diag:
            if np.any(np.abs(trigger_diagonals - diag) < DISCARD_RANGE):
                # Discard the matches for this diagonal if this diagonal
                # is too close to a diagonal that already triggers
                # a gapped alignment
                continue
            matches_for_diagonal = matches_for_strand[diagonals == diag]
            m1_positions = matches_for_diagonal[:, 0]
            for pos in m1_positions:
                distances = m1_positions - pos
                # The other match on the same diagonal should not overlap
                # with this match and should be within a cutoff range
                if np.any((distances > K) & (distances < DISCARD_RANGE)):
                    trigger_matches.append((strand, pos, pos+diag))
                    trigger_diagonals = np.append(trigger_diagonals, diag)
                    # Only add one match per diagonal at maximum
                    break

    print("Matches that trigger gapped alignment:")
    print(trigger_matches)
    fig.tight_layout()


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_genome_search_001.png
   :alt: genome search
   :srcset: /examples/gallery/sequence/images/sphx_glr_genome_search_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Matches that trigger gapped alignment:
    [(1, 42, 1563462)]


.. GENERATED FROM PYTHON SOURCE LINES 129-136

From hundreds of initial matches, only a single one remains in this
case.
The diagonal of this match can be seen in the figure:
It is the almost continuous line on the right side.

For the gapped alignment we use :func:`align_banded()`, which reduces
the alignment search space to a narrow diagonal band.

.. GENERATED FROM PYTHON SOURCE LINES 136-154

.. code-block:: Python


    BAND_WIDTH = 1000

    alignments = []
    matrix = align.SubstitutionMatrix.std_nucleotide_matrix()
    for strand, m1_pos, genome_pos in trigger_matches:
        genome = genomic_seqs[strand]
        diagonal = genome_pos - m1_pos
        alignment = align.align_banded(
            m1_sequence, genome, matrix,
            band=(diagonal - BAND_WIDTH, diagonal + BAND_WIDTH), max_number=1
        )[0]
        alignments.append((strand, alignment))
    
    strand, best_alignment = max(
        alignments, key=lambda strand_alignment: alignment[1].score
    )


.. GENERATED FROM PYTHON SOURCE LINES 155-160

For visualization purposes we have to apply a renumbering function
for the genomic sequence,
since the indices in the alignment trace refer to the reverse
complement sequence, but we want the numbers to refer to the original
genomic sequence.

.. GENERATED FROM PYTHON SOURCE LINES 160-174

.. code-block:: Python


    # Reverse sequence numbering for second sequence (genome) in alignment
    number_funcs = [None,   lambda x: len(best_alignment.sequences[1]) - x]
    # Visualize alignment, use custom color
    fig = plt.figure(figsize=(8.0, 4.0))
    ax = fig.add_subplot(111)
    seqgraphics.plot_alignment_similarity_based(
        ax, best_alignment, matrix=matrix,
        labels=["E. coli M1 coding gene", "S. enterica genome"], show_numbers=True,
        number_functions=number_funcs, show_line_position=True,
        color=biotite.colors["brightorange"]
    )
    fig.tight_layout()


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_genome_search_002.png
   :alt: genome search
   :srcset: /examples/gallery/sequence/images/sphx_glr_genome_search_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 176-182

The results show, that *E. coli* and *S. enterica* *M1* are almost
identical.

Finally, we predict and plot the secondary structure of the *M1* RNA
with help from *ViennaRNA* and highlight mismatch position between
*E. coli* and *S. enterica* *M1*.

.. GENERATED FROM PYTHON SOURCE LINES 182-220

.. code-block:: Python


    app = viennarna.RNAfoldApp(m1_sequence)
    app.start()
    app.join()
    base_pairs = app.get_base_pairs()

    app = viennarna.RNAplotApp(base_pairs=base_pairs, length=len(m1_sequence))
    app.start()
    app.join()
    plot_coord = app.get_coordinates()

    codes = align.get_codes(best_alignment)
    m1_no_gap_codes = codes[codes[:, 0] != -1]
    identities = m1_no_gap_codes[0] == m1_no_gap_codes[1]

    fig = plt.figure(figsize=(8.0, 8.0))
    ax = fig.add_subplot(111)
    # Plot base connections
    ax.plot(*plot_coord.T, color="black", linewidth=1, zorder=1)
    # Plot base pairings
    ax.add_collection(LineCollection(
        [(plot_coord[i], plot_coord[j]) for i, j in base_pairs],
        color="silver", linewidth=1, zorder=1
    ))
    # Plot base markers
    ax.scatter(
        *plot_coord.T,
        s = 12,
        # Render markers over lines
        zorder = 2,
        # Display base marker color based on the identity in the alignment 
        color = ["forestgreen" if identity else "firebrick"
                 for identity in identities]
    )
    ax.set_aspect("equal")
    ax.axis("off")
    fig.tight_layout()

    plt.show()


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_genome_search_003.png
   :alt: genome search
   :srcset: /examples/gallery/sequence/images/sphx_glr_genome_search_003.png
   :class: sphx-glr-single-img


.. _sphx_glr_download_examples_gallery_sequence_genome_search.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: genome_search.ipynb <genome_search.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: genome_search.py <genome_search.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_