.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/gallery/sequence/lexa_conservation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_gallery_sequence_lexa_conservation.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_gallery_sequence_lexa_conservation.py:


Conservation of *LexA* DNA-binding site
=======================================

The web page on sequence logos on
`Wikipedia <https://en.wikipedia.org/wiki/Sequence_logo#Consensus_logo>`_
shows the sequence logo of the *LexA*-binding motif of Gram-positive
bacteria.
In this example we look at the other side:
What is the amino acid sequence logo of the DNA-binding site of the LexA
repressor?
What is the consensus sequence?

We start by searching the NCBI Entrez database for *lexA* gene
entries in the UniProtKB database and downloading them afterwards as
GenPept file.
In order to ensure that the file contains the desired entries, we check
the entires for their definition (title) and source (species).

.. GENERATED FROM PYTHON SOURCE LINES 20-55

.. code-block:: Python


    # Code source: Patrick Kunzmann
    # License: BSD 3 clause

    import numpy as np
    import matplotlib.pyplot as plt
    import biotite.sequence as seq
    import biotite.sequence.io.fasta as fasta
    import biotite.sequence.io.genbank as gb
    import biotite.sequence.graphics as graphics
    import biotite.application.clustalo as clustalo
    import biotite.database.entrez as entrez
    # Search for protein products of LexA gene in UniProtKB/Swiss-Prot database
    query =   entrez.SimpleQuery("lexA", "Gene Name") \
            & entrez.SimpleQuery("srcdb_swiss-prot", "Properties")
    # Search for the first 200 hits
    # More than 200 UIDs are not recommended for the EFetch service
    # for a single fetch
    uids = entrez.search(query, db_name="protein", number=200)
    file = entrez.fetch_single_file(
        uids, None, db_name="protein", ret_type="gp"
    )
    # The file contains multiple concatenated GenPept files
    # -> Usage of MultiFile
    multi_file = gb.MultiFile.read(file)
    # Separate MultiFile into single GenBankFile instances
    files = [f for f in multi_file]
    print("Definitions:")
    for file in files[:20]:
        print(gb.get_definition(file))
    print()
    print("Sources:")
    for file in files[:20]:
        print(gb.get_source(file))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Definitions:
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.
    RecName: Full=LexA repressor.

    Sources:
    Clostridioides difficile 630 (Clostridium difficile 630)
    Desulfatibacillum aliphaticivorans
    Clostridium botulinum Ba4 str. 657
    Azotobacter vinelandii DJ
    Desulfitobacterium hafniense DCB-2
    Clostridium botulinum A2 str. Kyoto
    Brucella melitensis ATCC 23457
    Brevibacillus brevis NBRC 100599
    Bifidobacterium longum subsp. infantis ATCC 15697 = JCM 1222 = DSM
    Bacillus cereus Q1
    Anaeromyxobacter dehalogenans 2CP-1
    Burkholderia multivorans ATCC 17616
    Cupriavidus taiwanensis LMG 19424
    Delftia acidovorans SPH-1
    Clostridium botulinum B1 str. Okra
    Bifidobacterium longum DJO10A
    Bordetella petrii DSM 12804
    Burkholderia cenocepacia J2315
    Clostridium botulinum A3 str. Loch Maree
    Brucella abortus S19


.. GENERATED FROM PYTHON SOURCE LINES 56-59

The names of the sources are too long to be properly displayed later
on. Therefore, we write a function that creates a proper abbreviation
for a species name.

.. GENERATED FROM PYTHON SOURCE LINES 59-71

.. code-block:: Python


    def abbreviate(species):
        # Remove possible brackets
        species = species.replace("[","").replace("]","")
        splitted_species= species.split()
        return "{:}. {:}".format(splitted_species[0][0], splitted_species[1])

    print("Sources:")
    all_sources = [abbreviate(gb.get_source(file)) for file in files]
    for source in all_sources[:20]:
        print(source)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sources:
    C. difficile
    D. aliphaticivorans
    C. botulinum
    A. vinelandii
    D. hafniense
    C. botulinum
    B. melitensis
    B. brevis
    B. longum
    B. cereus
    A. dehalogenans
    B. multivorans
    C. taiwanensis
    D. acidovorans
    C. botulinum
    B. longum
    B. petrii
    B. cenocepacia
    C. botulinum
    B. abortus


.. GENERATED FROM PYTHON SOURCE LINES 72-88

Much better.
For the alignment (required for sequence logo) we need to extract the
slice of the sequence, that belongs to the DNA-binding site.
Hence, we simply index the each sequence with the feature for the
binding site and remove those sequences, that do not have a record
specifying the required feature.

But we have still an issue:
Some species seem to be overrepresented, as they show up multiple
times.
The reason for this is that some species, like *M. tuberculosis*, are
represented by multiple strains with (almost) equal *LexA* sequences.
To reduce this bias, we only want each species to occur only a single
time.
So we use a set to store the source name of sequences we already
listed and ignore all further occurences of that source species. 

.. GENERATED FROM PYTHON SOURCE LINES 88-121

.. code-block:: Python


    # List of sequences
    binding_sites = []
    # List of source species
    sources = []
    # Set for ignoring already listed sources
    listed_sources = set()
    for file, source in zip(files, all_sources):
        if source in listed_sources:
            # Ignore already listed species
            continue
        bind_feature = None
        annot_seq = gb.get_annotated_sequence(
            file, include_only=["Site"], format="gp"
        )
        # Find the feature for DNA-binding site
        for feature in annot_seq.annotation:
            # DNA binding site is a helix-turn-helix motif
            if "site_type" in feature.qual \
                and feature.qual["site_type"] == "DNA binding" \
                and "H-T-H motif" in feature.qual["note"]:
                    bind_feature = feature
        if bind_feature is not None:
            # If the feature is found,
            # get the sequence slice that is defined by the feature...
            binding_sites.append(annot_seq[bind_feature])
            # ...and save the respective source species
            sources.append(source)
            listed_sources.add(source)
    print("Binding sites:")
    for site in binding_sites[:20]:
        print(site)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Binding sites:
    VREICTAVGLRSTSTVHSHLN
    VRELCDELGFKSPNTAHFHLK
    VREICKAVGLSSTSSVHFHLK
    RAEIARELGFKSPNAAEEHLK
    VREIGDAVGLMSSSTVHGHLQ
    FDEMKEALDLASKSGIHRLIT
    VREIGEAVGLASSSTVHGHLA
    FREIGNAAGLKSPSSVKHQLQ
    VREIGQAVGLASSSTVHGHLS
    IREIGEALDIRSTNGVNDHLK
    RAEIAAELGFSSPNAAEEHLR
    RAEIAAEFGFSSPNAAEEHLR
    RAEIANTLGFKSANAAEEHLQ
    RAEIARALGFRSPNAAEDHLK
    RAEIAAELGFSSPNAAEEHLR
    FDEMKEALDLASKSGIHRLIT
    RAEIAAELGFSSPNAAEEHLR
    RAEIAAELGFSSPNAAEEHLR
    VREIGQAVGLASSSTVHGHLS
    RAEIARILGFKSANAAEEHIK


.. GENERATED FROM PYTHON SOURCE LINES 122-126

Now we can perform a multiple sequence alignment of the binding site
sequences. Here we use Clustal Omega to perform this task.
Since we have up to 200 sequences we visualize only a small portion of
the alignment.

.. GENERATED FROM PYTHON SOURCE LINES 126-137

.. code-block:: Python


    alignment = clustalo.ClustalOmegaApp.align(binding_sites)
    fig = plt.figure(figsize=(4.5, 4.0))
    ax = fig.add_subplot(111)
    graphics.plot_alignment_similarity_based(
        ax, alignment[:,:20], labels=sources[:20], symbols_per_line=len(alignment)
    )
    # Source names in italic
    ax.set_yticklabels(ax.get_yticklabels(), fontdict={"fontstyle":"italic"})
    fig.tight_layout()


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_lexa_conservation_001.png
   :alt: lexa conservation
   :srcset: /examples/gallery/sequence/images/sphx_glr_lexa_conservation_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 138-139

Finally we can generate our sequence logo and the consensus sequence.

.. GENERATED FROM PYTHON SOURCE LINES 139-156

.. code-block:: Python


    profile = seq.SequenceProfile.from_alignment(alignment)

    print("Consensus sequence:")
    print(profile.to_consensus())

    fig = plt.figure(figsize=(8.0, 3.0))
    ax = fig.add_subplot(111)
    graphics.plot_sequence_logo(ax, profile, scheme="flower")
    ax.set_xticks([5,10,15,20])
    ax.set_xlabel("Residue position")
    ax.set_ylabel("Bits")
    # Only show left and bottom spine
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    fig.tight_layout()

    plt.show()


.. image-sg:: /examples/gallery/sequence/images/sphx_glr_lexa_conservation_002.png
   :alt: lexa conservation
   :srcset: /examples/gallery/sequence/images/sphx_glr_lexa_conservation_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Consensus sequence:
    RAEIADALGLRSPSAAHEHLK


.. _sphx_glr_download_examples_gallery_sequence_lexa_conservation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: lexa_conservation.ipynb <lexa_conservation.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: lexa_conservation.py <lexa_conservation.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_