Finding homologous regions in two genomes#

Presumably, plant cells obtained their ability for photosynthesis through endosymbiosis: In the past, eukaryotic cells probably have incorporated a cyanobacterium that evolved to the current chloroplasts in plant cells. As a side effect, chloroplasts contain their own genome, that has a high similarity to cyanobacteria. This example highlights regions in the chloroplast genome that have been conserved, by comparing the chloroplast genome of Arabidopsis thaliana to the genome of the cyanobacterium Synechocystis sp. PCC 6803.

To compare the genomes this script creates local alignments between both genome sequences using a k-mer based multi-step process, known from software like BLAST [1] or MMseqs [2]: Fast k-mer matching, ungapped alignment at the hit positions and a final time-consuming local gapped alignment. Between each step only promising hits are filtered and used in the next step.

At first, the genomic sequences are fetched and loaded.

# Code source: Patrick Kunzmann
# License: BSD 3 clause

import tempfile
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Rectangle
from matplotlib.ticker import MultipleLocator
import biotite
import biotite.application.tantan as tantan
import biotite.database.entrez as entrez
import biotite.sequence as seq
import biotite.sequence.align as align
import biotite.sequence.io as seqio
import biotite.sequence.io.genbank as gb

fasta_file = entrez.fetch(
    "NC_000932", tempfile.gettempdir(), "fasta", db_name="Nucleotide", ret_type="fasta"
)
chloroplast_seq = seqio.load_sequence(fasta_file)

fasta_file = entrez.fetch(
    "NC_000911", tempfile.gettempdir(), "fasta", db_name="Nucleotide", ret_type="fasta"
)
bacterium_seq = seqio.load_sequence(fasta_file)

For the k-mer matching step the genome of the cyanobacterium is indexed into a KmerTable. As homologous regions between both genomes may also appear on the complementary strand, both, the original genome sequence and its reverse complement, are indexed. Two additional techniques are used here: First, low complexity regions in the sequence are filtered out using Tantan [3] to reduce the number of spurious homologies. Second, spaced k-mers [4] are used instead of continuous ones to increase the sensitivity. However, not every spacing model performs equally well, so the proven one 111*1*11*1*11*111 [5] is used here.

repeat_mask = tantan.TantanApp.mask_repeats(bacterium_seq)
bacterium_seqs = [bacterium_seq, bacterium_seq.reverse(copy=False).complement()]

table = align.KmerTable.from_sequences(
    k=12,
    sequences=bacterium_seqs,
    spacing="111*1*11*1*11*111",
    ignore_masks=[repeat_mask, repeat_mask[::-1].copy()],
)

Now the sequence of the chloroplast genome is matched to the indexed sequences. Again, low complexity regions are removed.

matches = table.match(
    chloroplast_seq, ignore_mask=tantan.TantanApp.mask_repeats(chloroplast_seq)
)
print("Exemplary matches")
print(matches[:10])
print()
print("Number of hits:", len(matches))

Exemplary matches
[[      0       0 1850423]
 [      1       0 3240307]
 [      1       0 3554675]
 [      1       1 1850991]
 [      2       1 1558196]
 [      3       0  355498]
 [      3       0  479557]
 [      3       0  604150]
 [      3       0 1336224]
 [      3       0 2143919]]

Number of hits: 106643

Each row in the returned array represents one match: The first and the third column are the matched positions in the chloroplast and bacterial sequence, respectively. The second column indicates whether the match in the bacterial genome, refers to the original (0) or reverse complement (1) strand.

To reduce the number of alignments, that need to be created from these matches, only double hits progress into the next step. A match becomes a double hit, if there is at least one more match on the same diagonal. In a more sophisticated scenario, you could also require that the second match has a limited distance to the first one.

diagonals = matches[:, 2] - matches[:, 0]

# Store the indices to the match array
# for each combination of diagonal and strand on the bacterial genome
matches_for_diagonals = {}
for i, (diag, strand) in enumerate(zip(diagonals, matches[:, 1])):
    if (diag, strand) not in matches_for_diagonals:
        matches_for_diagonals[(diag, strand)] = [i]
    else:
        matches_for_diagonals[(diag, strand)].append(i)

# If a diagonal has more than one match,
# the first match on this diagonal is a double hit
double_hit_indices = [
    indices[0] for indices in matches_for_diagonals.values() if len(indices) > 1
]
double_hits = matches[double_hit_indices]
print("Number of double hits:", len(double_hits))

Number of double hits: 1231

The next step is a local ungapped alignment at the positions of the double hits using align_local_ungapped(): For each hit, the alignment is extended into both directions from the match until the similarity score drops more than a given threshold below the maximum score found. Only those hits, where the alignment exceeds a defined threshold score, are considered in the next step. Therefore, only the similarity score of the alignment is of interest, it is not necessary to return an actual alignment trace. As a result, the score_only=True parameter increases the performance drastically.

X_DROP = 20
ACCEPT_THRESHOLD = 100

matrix = align.SubstitutionMatrix.std_nucleotide_matrix()
ungapped_scores = np.array(
    [
        align.align_local_ungapped(
            chloroplast_seq,
            bacterium_seqs[strand],
            matrix,
            seed=(i, j),
            threshold=X_DROP,
            score_only=True,
        )
        for i, strand, j in double_hits
    ]
)

accepted_hits = double_hits[ungapped_scores > ACCEPT_THRESHOLD]
print("Number of accepted ungapped alignments:", len(accepted_hits))

Number of accepted ungapped alignments: 212

Finally the filtered match positions are used as seed for a gapped alignment using align_local_gapped(). For each match, this is by far the most time consuming step. However, since only a few matches remain, the total runtime is reasonable.

Again, only the score should be returned from the gapped alignments to improve the performance. Only those alignments that show ‘significant homology’ are aligned again for later evaluation of the actual alignment.

The significance of an homology is assessed by means of an E-value. The EValueEstimator computes the E-value from the similarity score. This calculation requires scoring scheme specific parameters that are estimated from a time-consuming sampling process.

X_DROP = 100
GAP_PENALTY = (-12, -4)
EVALUE_THRESHOLD = 0.1

# Initialize the estimator
# Use the symbol frequencies in the bacterial genome to sample
# the parameters
background = np.array(list(bacterium_seq.get_symbol_frequency().values()))
np.random.seed(0)
estimator = align.EValueEstimator.from_samples(
    chloroplast_seq.alphabet,
    # The scoring scheme must be the same as used for the alignment
    matrix,
    GAP_PENALTY,
    background,
)

# Compute similarity scores for each hit
gapped_scores = np.array(
    [
        align.align_local_gapped(
            chloroplast_seq,
            bacterium_seqs[strand],
            matrix,
            seed=(i, j),
            gap_penalty=GAP_PENALTY,
            threshold=X_DROP,
            score_only=True,
            max_table_size=100_000_000,
        )
        for i, strand, j in accepted_hits
    ]
)

# Calculate the E-values
# For numeric stability reasons the method returns the common logarithm
log_evalues = estimator.log_evalue(
    gapped_scores, len(chloroplast_seq), 2 * len(bacterium_seq)
)

# Align regions with significant homology again
# This time report the actual alignment
accepted_alignments = [
    (
        align.align_local_gapped(
            chloroplast_seq,
            bacterium_seqs[strand],
            matrix,
            seed=(i, j),
            gap_penalty=GAP_PENALTY,
            threshold=X_DROP,
        )[0],
        log_evalue,
    )
    for (i, strand, j), log_evalue in zip(accepted_hits, log_evalues)
    if log_evalue <= np.log10(EVALUE_THRESHOLD)
]

print("Number of accepted gapped alignments:", len(accepted_alignments))

Number of accepted gapped alignments: 118

Some of the obtained alignments may be duplicates that should be removed: although the previous filter for double hits limits the number of alignments to one per diagonal, different matches on similar diagonals may still result in the same alignment due to the insertion of gaps.

# Sort by significance, i.e. the E-value
accepted_alignments = sorted(accepted_alignments, key=lambda x: x[1])

# For each position in the chloroplast genome report only one alignment
# The most significant ones take precedence
unique_alignments = []
covered_range = np.zeros(len(chloroplast_seq), dtype=bool)
for alignment, log_evalue in accepted_alignments:
    # The start and exclusive end position of the aligned region
    # in chloroplast genome
    start = alignment.trace[0, 0]
    stop = alignment.trace[-1, 0]
    # If this region was not covered by any other alignment before,
    # accept it and mark the region as covered
    if not covered_range[start:stop].any():
        unique_alignments.append((alignment, log_evalue))
        covered_range[start:stop] = True

print("Number of unique alignments:", len(unique_alignments))

Number of unique alignments: 48

To take a closer look on the found homologous regions, they are viewed in its functional context. For this purpose, the aligned region is displayed for each local alignment (gray box) and the features of the chloroplast genome are plotted alongside. Functional RNAs and protein coding regions are shown in orange and green, respectively.

N_COL = 4
MAX_NAME_LENGTH = 30
EXCERPT_SIZE = 3000
MARGIN_SIZE = 250

COLORS = {
    "CDS": biotite.colors["dimgreen"],
    "tRNA": biotite.colors["orange"],
    "rRNA": biotite.colors["orange"],
}


# Fetch features of the chloroplast genome
gb_file = gb.GenBankFile.read(
    entrez.fetch("NC_000932", None, "gb", db_name="Nucleotide", ret_type="gb")
)
annotation = gb.get_annotation(gb_file, include_only=["CDS", "rRNA", "tRNA"])


def draw_arrow(ax, feature, loc):
    x = loc.first
    dx = loc.last - loc.first + 1
    if loc.strand == seq.Location.Strand.FORWARD:
        x = loc.first
        dx = loc.last - loc.first + 1
    else:
        x = loc.last
        dx = loc.first - loc.last + 1

    # Create head with 90 degrees tip -> head width/length ratio = 1/2
    ax.add_patch(
        biotite.AdaptiveFancyArrow(
            x,
            0.5,
            dx,
            0,
            tail_width=0.4,
            head_width=0.7,
            head_ratio=0.5,
            draw_head=True,
            color=COLORS[feature.key],
            linewidth=0,
        )
    )

    label = feature.qual.get("gene")

    if label is not None:
        ax.text(x + dx / 2, 0.5, label, color="black", ha="center", va="center", size=8)


# Fetch features of the chloroplast genome
gb_file = gb.GenBankFile.read(
    entrez.fetch("NC_000932", None, "gb", db_name="Nucleotide", ret_type="gb")
)
annotation = gb.get_annotation(gb_file, include_only=["CDS", "rRNA", "tRNA"])

n_rows = int(np.ceil(len(unique_alignments) / N_COL))
fig, axes = plt.subplots(n_rows, N_COL, figsize=(8.0, 1.5 * n_rows))

for (alignment, log_evalue), ax in zip(unique_alignments, axes.flatten()):
    # Transform 0-based sequence index to 1-based sequence position
    first = alignment.trace[0, 0] + 1
    last = alignment.trace[-1, 0] + 1
    center = (first + last) // 2
    if last - first < EXCERPT_SIZE - MARGIN_SIZE * 2:
        excerpt_loc = (center - EXCERPT_SIZE // 2, center + EXCERPT_SIZE // 2)
    else:
        # Exceed excerpt size to show entire alignment range
        excerpt_loc = (first - MARGIN_SIZE, last + MARGIN_SIZE)
    # Move excerpt into bounds of the sequence
    if excerpt_loc[0] < 1:
        offset = -excerpt_loc[0] + 1
        excerpt_loc = (excerpt_loc[0] + offset, excerpt_loc[1] + offset)
    excerpt = annotation[excerpt_loc[0] : excerpt_loc[1] + 1]

    ax.axhline(0.5, color="lightgray", linewidth=2, zorder=-1)
    # Draw features
    for feature in excerpt:
        for loc in feature.locs:
            draw_arrow(ax, feature, loc)
    # Draw rectangle representing homologous region
    ax.add_patch(
        Rectangle(
            (first, 0.1),
            last - first + 1,
            1 - 2 * 0.1,
            facecolor="None",
            edgecolor="black",
            alpha=0.2,
            linewidth=1,
            clip_on=False,
        )
    )

    ax.xaxis.set_major_locator(MultipleLocator(1000))
    ax.tick_params(labelsize=6)
    ax.set_xlim(excerpt_loc)
    ax.set_ylim([0, 1])
    ax.set_frame_on(False)
    ax.get_yaxis().set_tick_params(left=False, right=False, labelleft=False)

    exponent = int(np.floor(log_evalue))
    mantissa = 10 ** (log_evalue - exponent)
    homolog_excerpt = annotation[first : last + 1]
    if len(homolog_excerpt) > 0:
        # Select the longest feature in range for name display in title
        representative_feature = max(
            homolog_excerpt,
            key=lambda feature: -np.subtract(*feature.get_location_range()),
        )
        feature_name = representative_feature.qual["product"]
    else:
        # No feature in homologous region -> no name
        feature_name = ""
    # Truncate feature name if it is too long
    if len(feature_name) > MAX_NAME_LENGTH:
        feature_name = feature_name[:MAX_NAME_LENGTH] + "..."

    ax.set_title(
        f"{feature_name}\n"
        rf"E-Value: ${mantissa:.2f} \times 10^{{{exponent}}}$"
        f"\nIdentity: {align.get_sequence_identity(alignment) * 100:3.1f} %",
        loc="left",
        size=8,
    )

# Hide empty axes
for ax in axes.flatten()[len(unique_alignments) :]:
    ax.axis("off")

fig.tight_layout(h_pad=3.0, w_pad=0.5)

$23S ribosomal RNA E-Value: $4.20 \times 10^{-349}$ Identity: 80.8 %, 23S ribosomal RNA E-Value: $4.20 \times 10^{-349}$ Identity: 80.8 %, photosystem II 44 kDa protein E-Value: $8.66 \times 10^{-231}$ Identity: 71.3 %, 16S ribosomal RNA E-Value: $5.12 \times 10^{-211}$ Identity: 80.7 %, 16S ribosomal RNA E-Value: $5.12 \times 10^{-211}$ Identity: 80.7 %, photosystem I P700 chlorophyll... E-Value: $7.78 \times 10^{-198}$ Identity: 67.1 %, photosystem II 47 kDa protein E-Value: $1.54 \times 10^{-123}$ Identity: 68.3 %, ribulose-1,5-bisphosphate carb... E-Value: $2.01 \times 10^{-120}$ Identity: 67.0 %, ATP synthase CF1 beta subunit E-Value: $2.40 \times 10^{-118}$ Identity: 68.1 %, ATP synthase CF1 alpha subunit E-Value: $1.62 \times 10^{-110}$ Identity: 67.3 %, photosystem II protein D1 E-Value: $1.20 \times 10^{-103}$ Identity: 71.9 %, NADH dehydrogenase subunit 7 E-Value: $5.33 \times 10^{-81}$ Identity: 64.9 %, NADH dehydrogenase subunit 5 E-Value: $8.07 \times 10^{-67}$ Identity: 61.5 %, NADH dehydrogenase subunit 4 E-Value: $6.67 \times 10^{-65}$ Identity: 59.0 %, RNA polymerase beta subunit E-Value: $6.01 \times 10^{-61}$ Identity: 60.1 %, NADH dehydrogenase subunit 1 E-Value: $4.90 \times 10^{-55}$ Identity: 62.6 %, cytochrome b6 E-Value: $2.32 \times 10^{-51}$ Identity: 68.3 %, photosystem I P700 chlorophyll... E-Value: $3.60 \times 10^{-45}$ Identity: 58.3 %, ATP synthase CF0 A subunit E-Value: $2.04 \times 10^{-40}$ Identity: 65.3 %, acetyl-CoA carboxylase beta su... E-Value: $4.67 \times 10^{-40}$ Identity: 63.8 %, cytochrome f E-Value: $1.67 \times 10^{-34}$ Identity: 59.8 %, cytochrome b6/f complex subuni... E-Value: $4.18 \times 10^{-34}$ Identity: 61.4 %, ribosomal protein L2 E-Value: $2.36 \times 10^{-28}$ Identity: 59.7 %, photosystem II protein V E-Value: $7.65 \times 10^{-22}$ Identity: 60.7 %, NADH dehydrogenase subunit J E-Value: $4.38 \times 10^{-20}$ Identity: 63.4 %, NADH dehydrogenase subunit 1 E-Value: $6.93 \times 10^{-20}$ Identity: 62.7 %, ribosomal protein L16 E-Value: $2.09 \times 10^{-19}$ Identity: 64.2 %, Ycf1 E-Value: $4.77 \times 10^{-18}$ Identity: 52.4 %, ribosomal protein L2 E-Value: $1.20 \times 10^{-17}$ Identity: 59.3 %, ribosomal protein L14 E-Value: $1.58 \times 10^{-17}$ Identity: 64.7 %, ATP synthase CF0 C subunit E-Value: $1.19 \times 10^{-15}$ Identity: 68.5 %, ribosomal protein S12 E-Value: $2.71 \times 10^{-14}$ Identity: 71.7 %, ribosomal protein S12 E-Value: $2.71 \times 10^{-14}$ Identity: 71.7 %, ribosomal protein S3 E-Value: $5.13 \times 10^{-12}$ Identity: 56.4 %, photosystem II protein H E-Value: $1.39 \times 10^{-6}$ Identity: 65.4 %, photosystem I assembly protein... E-Value: $9.57 \times 10^{-6}$ Identity: 61.9 %, tRNA-Asn E-Value: $1.26 \times 10^{-5}$ Identity: 64.0 %, tRNA-Asn E-Value: $1.26 \times 10^{-5}$ Identity: 64.0 %, tRNA-Tyr E-Value: $3.47 \times 10^{-5}$ Identity: 91.8 %, tRNA-Asp E-Value: $3.15 \times 10^{-4}$ Identity: 71.6 %, ribosomal protein S14 E-Value: $8.68 \times 10^{-4}$ Identity: 52.7 %, tRNA-Gln E-Value: $1.99 \times 10^{-3}$ Identity: 71.4 %, tRNA-Arg E-Value: $7.89 \times 10^{-3}$ Identity: 80.6 %, tRNA-Arg E-Value: $7.89 \times 10^{-3}$ Identity: 80.6 %, tRNA-Pro E-Value: $1.65 \times 10^{-2}$ Identity: 71.7 %, tRNA-Phe E-Value: $2.17 \times 10^{-2}$ Identity: 87.7 %, tRNA-Ser E-Value: $5.97 \times 10^{-2}$ Identity: 66.4 %, tRNA-Cys E-Value: $6.54 \times 10^{-2}$ Identity: 72.4 %$

This plot shows that the conserved regions sharply match the position of genes. Especially genes that are part of the gene expression machinery or participate in the composition of the photosystems seem to be highly conserved.

References#

Gallery generated by Sphinx-Gallery