biotite.sequence.NucleotideSequence
- class biotite.sequence.NucleotideSequence(sequence=[], ambiguous=None)
Bases: Sequence
Representation of a nucleotide sequence (DNA or RNA).
This class may have one of two different alphabets:
- unambiguous_alphabet() contains only the unambiguous DNA letters ‘A’, ‘C’, ‘G’ and ‘T’.
- ambiguous_alphabet() uses an extended alphabet for ambiguous letters.
- Parameters:
- sequence : iterable object, optional
The initial DNA sequence. This may be either a list or a string, in upper or lower case letters. By default the sequence is empty.
- ambiguous : bool, optional
If true, the ambiguous alphabet is used. By default the object tries to use the unambiguous alphabet. If this fails due to ambiguous letters in the sequence, the ambiguous alphabet is used.
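The alphabet selection described for the ambiguous parameter can be sketched in plain Python. This is an illustration only, not biotite's implementation: the helper name choose_alphabet is hypothetical, the ambiguous symbols are assumed to be the IUPAC codes, and raising ValueError when ambiguous=False meets an ambiguous letter is an assumption that the text above does not spell out.

```python
# Illustrative sketch only; biotite's real classes are not used here.
UNAMBIGUOUS = ("A", "C", "G", "T")
# Assumption: the extended alphabet adds the IUPAC ambiguity codes.
AMBIGUOUS = UNAMBIGUOUS + ("R", "Y", "W", "S", "M", "K", "H", "B", "V", "D", "N")


def choose_alphabet(sequence, ambiguous=None):
    """Pick an alphabet for `sequence` (hypothetical helper)."""
    # Upper and lower case letters are both accepted.
    symbols = [s.upper() for s in sequence]
    if ambiguous is None:
        # Default: try the unambiguous alphabet first, fall back otherwise.
        fits = all(s in UNAMBIGUOUS for s in symbols)
        alphabet = UNAMBIGUOUS if fits else AMBIGUOUS
    else:
        # Explicit choice; strictness for ambiguous=False is an assumption.
        alphabet = AMBIGUOUS if ambiguous else UNAMBIGUOUS
    unknown = sorted({s for s in symbols if s not in alphabet})
    if unknown:
        raise ValueError(f"Symbols not in alphabet: {unknown}")
    return alphabet


print(choose_alphabet("acgt") is UNAMBIGUOUS)  # True
print(choose_alphabet("ACGN") is AMBIGUOUS)    # True
```

Passing ambiguous=True forces the extended alphabet even for a sequence that contains only A, C, G and T.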
- static ambiguous_alphabet()
Get the ambiguous nucleotide alphabet containing the symbols A, C, G and T and symbols describing ambiguous combinations of these.
- Returns:
- alphabet : LetterAlphabet
The ambiguous nucleotide alphabet.
- complement()
Get the complement nucleotide sequence.
- Returns:
- complement : NucleotideSequence
The complement sequence.
Examples
>>> dna_seq = NucleotideSequence("ACGCTT")
>>> print(dna_seq.complement())
TGCGAA
>>> print(dna_seq.reverse().complement())
AAGCGT
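For readers who want to see what the complement operation amounts to, here is a minimal sketch on plain strings. It assumes an unambiguous DNA sequence and does not use biotite; the function names are hypothetical stand-ins for the methods documented above.

```python
# Illustrative sketch on plain strings; unambiguous DNA letters only.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Swap each base for its Watson-Crick partner (A<->T, C<->G)."""
    return seq.upper().translate(COMPLEMENT_TABLE)


def reverse_complement(seq):
    """Reverse the sequence first, then complement it."""
    return complement(seq[::-1])


print(complement("ACGCTT"))          # TGCGAA
print(reverse_complement("ACGCTT"))  # AAGCGT
```

The two printed results match the outputs of the doctest above, where the reverse complement is obtained by chaining reverse() and complement().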
- copy(new_seq_code=None)
Copy the object.
- Parameters:
- new_seq_code : ndarray, optional
If this parameter is set, the sequence code is set to this value, rather than the original sequence code.
- Returns:
- copy
A copy of this object.
- static dtype(alphabet_size)
Get the sequence code dtype required for the given size of the alphabet.
- get_alphabet()
Get the Alphabet of the Sequence.
This method must be overridden when subclassing Sequence.
- Returns:
- alphabet : Alphabet
Sequence alphabet.
- get_symbol_frequency()
Get the number of occurrences of each symbol in the sequence.
If a symbol is in the alphabet but does not occur in the sequence, its number of occurrences is 0.
- Returns:
- frequency : dict
A dictionary containing the symbols as keys and the corresponding number of occurrences in the sequence as values.
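The behavior described above, including the zero entries for symbols that never occur, can be sketched with the standard library. This is an assumption-laden illustration on plain strings, not biotite's code; symbol_frequency is a hypothetical name.

```python
from collections import Counter


def symbol_frequency(sequence, alphabet=("A", "C", "G", "T")):
    """Count every alphabet symbol, reporting 0 for absent ones."""
    counts = Counter(sequence)
    # Iterate over the alphabet, not the sequence, so that symbols
    # missing from the sequence still appear with a count of 0.
    return {symbol: counts[symbol] for symbol in alphabet}


print(symbol_frequency("ACGCTT"))  # {'A': 1, 'C': 2, 'G': 1, 'T': 2}
print(symbol_frequency("AAAA"))    # {'A': 4, 'C': 0, 'G': 0, 'T': 0}
```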
- is_valid()
Check whether the sequence contains a valid sequence code.
A sequence code is valid if the code at each sequence position is smaller than the size of the alphabet.
An invalid code cannot be decoded into symbols. Furthermore, an invalid code can lead to serious errors in alignments, since the substitution matrix would be indexed with an invalid index.
- Returns:
- valid : bool
True if the sequence is valid, false otherwise.
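The validity rule stated above is small enough to sketch directly. The example below uses a plain list of non-negative integers in place of the code ndarray, and the names is_valid_code and decode are hypothetical; it is meant to show why an out-of-range code cannot be decoded, not how biotite implements the check.

```python
ALPHABET = ("A", "C", "G", "T")


def is_valid_code(code, alphabet_size):
    """True if every (non-negative) code is below the alphabet size."""
    return all(c < alphabet_size for c in code)


def decode(code, alphabet=ALPHABET):
    """Map each code to its symbol; fails on an out-of-range code."""
    return "".join(alphabet[c] for c in code)


good = [0, 1, 2, 3]
bad = [0, 1, 4]  # 4 has no symbol in a four-letter alphabet

print(is_valid_code(good, len(ALPHABET)))  # True
print(decode(good))                        # ACGT
print(is_valid_code(bad, len(ALPHABET)))   # False
```

Calling decode(bad) raises an IndexError, which is the plain-Python analogue of the decoding failure described above.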
- reverse(copy=True)
Reverse the Sequence.
- Parameters:
- copy : bool, optional
If set to False, the code ndarray of the returned sequence is an array view to the sequence code of this object. In this case, manipulations on the returned sequence would also affect this object. Otherwise, the sequence code is copied.
- Returns:
- reversed : Sequence
The reversed Sequence.
Examples
>>> dna_seq = NucleotideSequence("ACGTA")
>>> dna_seq_rev = dna_seq.reverse()
>>> print(dna_seq_rev)
ATGCA
- translate(complete=False, codon_table=None, met_start=False)
Translate the nucleotide sequence into a protein sequence.
If complete is true, the entire sequence is translated, beginning with the first codon and ending with the last codon, even if stop codons occur during the translation.
Otherwise this method returns possible ORFs in the sequence, even if no stop codon occurs in an ORF.
- Parameters:
- complete : bool, optional
If true, the complete sequence is translated. In this case the sequence length must be a multiple of 3. Otherwise all ORFs are translated. (Default: False)
- codon_table : CodonTable, optional
The codon table to be used. By default the NCBI “Standard” table with “ATG” as the single start codon is used.
- met_start : bool, optional
If true, the translation always starts with a methionine, even if the start codon codes for another amino acid. Otherwise the translation starts with the amino acid the codon codes for. Only applies if complete is false. (Default: False)
- Returns:
- protein : ProteinSequence or list of ProteinSequence
The translated protein sequence. If complete is true, only a single ProteinSequence is returned. Otherwise a list of ProteinSequence is returned, which contains every ORF.
- pos : list of tuple (int, int)
Only returned if complete is false. The list contains a tuple for each ORF. The first element of the tuple is the index in the NucleotideSequence where the translation starts. The second element is the exclusive stop index; it represents the first nucleotide in the NucleotideSequence after a stop codon.
Examples
>>> dna_seq = NucleotideSequence("AATGATGCTATAGAT")
>>> prot_seq = dna_seq.translate(complete=True)
>>> print(prot_seq)
NDAID
>>> prot_seqs, pos = dna_seq.translate(complete=False)
>>> for seq in prot_seqs:
...     print(seq)
MML*
ML*
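The two translation modes can be sketched on plain strings to make the ORF positions concrete. This is a simplified illustration, not biotite's implementation: translate_complete and find_orfs are hypothetical names, the codon table is the standard genetic code built from its usual compact string, and ending an ORF without a stop codon at the last complete codon is an assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_complete(seq):
    """Translate codon by codon from the first base, keeping stops."""
    if len(seq) % 3 != 0:
        raise ValueError("Sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def find_orfs(seq, start_codons=("ATG",), met_start=False):
    """Translate from every start codon up to and including a stop."""
    proteins, positions = [], []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in start_codons:
            continue
        residues = []
        stop = start
        for i in range(start, len(seq) - 2, 3):
            aa = CODON_TABLE[seq[i:i + 3]]
            residues.append(aa)
            stop = i + 3  # exclusive index after the current codon
            if aa == "*":
                break
        if met_start:
            # Force methionine even for an alternative start codon.
            residues[0] = "M"
        proteins.append("".join(residues))
        positions.append((start, stop))
    return proteins, positions


dna = "AATGATGCTATAGAT"
print(translate_complete(dna))  # NDAID
print(find_orfs(dna))           # (['MML*', 'ML*'], [(1, 13), (4, 13)])
```

Both ORFs share the exclusive stop index 13, the first nucleotide after the stop codon TAG at positions 10 to 12, which illustrates how the pos tuples relate to the returned protein sequences.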
- static unambiguous_alphabet()
Get the unambiguous nucleotide alphabet containing the symbols A, C, G and T.
- Returns:
- alphabet : LetterAlphabet
The unambiguous nucleotide alphabet.
Gallery
- Comparative genome assembly of SARS-CoV-2 B.1.1.7 variant
- Genome comparison between chloroplasts and cyanobacteria