Accessing sequence data in NCBI Entrez

An important source of biological sequences and their annotations is the NCBI Entrez database, commonly known as ‘the NCBI’. To download data, we need to provide the unique record identifier (UID) of the entry. This can be either the accession or the GI, which are parallel identification systems. Furthermore, we need

  • the database name from which we would like to download the record, which can either be the internal name (e.g. 'nuccore') or the user-facing name (e.g. 'Nucleotide'),

  • and the retrieval type, which is the file format of the downloaded data (e.g. 'fasta').
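
For orientation, these three inputs correspond to the `db`, `id` and `rettype` parameters of the public NCBI E-utilities `efetch` endpoint. The following is a minimal sketch of such a request URL, not a description of how Biotite assembles its requests: `build_efetch_url` is a hypothetical helper introduced here for illustration, and the sketch assumes the internal database name is used.

```python
from urllib.parse import urlencode

# Public E-utilities endpoint for downloading records
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(uids, db_name, ret_type):
    """Hypothetical helper: assemble an efetch request URL."""
    if isinstance(uids, str):
        uids = [uids]
    params = {
        "db": db_name,         # internal database name, e.g. 'protein'
        "id": ",".join(uids),  # one or more UIDs, comma-separated
        "rettype": ret_type,   # retrieval type, e.g. 'fasta' or 'gb'
        "retmode": "text",     # ask for plain text instead of XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


url = build_efetch_url("6BB5_A", "protein", "fasta")
print(url)
```

The helper only builds a string and sends no request, so it can be run offline.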

A list of valid combinations can be found in the NCBI E-utilities documentation. In the following case we will download the protein sequence of hemoglobin.

from tempfile import gettempdir, NamedTemporaryFile
import biotite.database.entrez as entrez

file_path = entrez.fetch(
    "6BB5_A", gettempdir(), suffix="fa",
    db_name="protein", ret_type="fasta"
)
with open(file_path) as file:
    print(file.read())
>pdb|6BB5|A Chain A, Hemoglobin subunit alpha
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVA
HVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY


Note the subunit alpha in the header of the FASTA file: Hemoglobin is a tetramer, consisting of two alpha and two beta subunits. Hence, we will download the sequence of the beta subunit as well. We can download multiple records at once by providing a list of UIDs. In addition, we are now also interested in sequence annotation, such as the sequence ranges where secondary structure is present. Therefore, we want to download the data in GenBank format.

from os.path import basename

# The suffix only sets the file extension; use "gb" to match the GenBank content
file_paths = entrez.fetch(
    ["6BB5_A", "6BB5_B"], gettempdir(), suffix="gb",
    db_name="protein", ret_type="gb"
)
print([basename(path) for path in file_paths])
['6BB5_A.gb', '6BB5_B.gb']

File formats like GenBank or FASTA allow multiple records in a single file. Downloading such multi-record files is also possible.

temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
file_path = entrez.fetch_single_file(
    ["6BB5_A", "6BB5_B"], temp_file.name,
    db_name="protein", ret_type="fasta"
)
with open(file_path) as file:
    print(file.read())
>pdb|6BB5|A Chain A, Hemoglobin subunit alpha
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVA
HVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY

>pdb|6BB5|B Chain B, Hemoglobin subunit beta
HLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF
SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL
AHKYH
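
To illustrate how records are delimited in such a multi-record file, here is a dependency-free sketch that splits FASTA text at the `>` header lines. In practice a dedicated reader such as Biotite's `biotite.sequence.io.fasta.FastaFile` would normally be preferred; `split_fasta` is a hypothetical helper used only for this illustration.

```python
def split_fasta(text):
    """Split multi-record FASTA text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]             # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # a sequence may span several lines
    # Join the per-line fragments into one sequence string per record
    return {name: "".join(parts) for name, parts in records.items()}


fasta_text = """\
>pdb|6BB5|A Chain A, Hemoglobin subunit alpha
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVA
HVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY

>pdb|6BB5|B Chain B, Hemoglobin subunit beta
HLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF
SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL
AHKYH
"""

records = split_fasta(fasta_text)
for header, sequence in records.items():
    print(header, len(sequence))
```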


Searching for records

We rarely know the UID of the record we are looking for upfront. Usually we only have some criteria, such as the name of a gene or the organism. biotite.database.entrez allows searching for UIDs that satisfy such criteria. The obtained list of UIDs can then be used to download the records as shown above.

# Search the Nucleotide database in all fields for the term "Lysozyme"
print(entrez.search(entrez.SimpleQuery("Lysozyme"), db_name="nuccore"))
['3144184022', '3144143239', '3144143182', '3143988922', '3143691305', '3143691303', '3143686089', '3143686087', '3143686085', '3143666002', '3143666000', '3143635791', '3143311536', '3141414784', '3140186849', '3140186840', '3140186817', '3140186785', '3143188452', '3143188372']

search() takes a Query and returns a list of UIDs. Note that by default only 20 results are returned. To increase or decrease this value, you can adjust the number parameter.

Instead of searching in all fields, we can also search for a term in a specific field. Furthermore, we can logically combine multiple Query objects using |, & and ^, which represent OR, AND and NOT linkage, respectively. The Query can be converted into a string representation that would also work in the search bar on the NCBI website.

composite_query = (
    entrez.SimpleQuery("50:100", field="Sequence Length") &
    (
        entrez.SimpleQuery("Escherichia coli", field="Organism") |
        entrez.SimpleQuery("Bacillus subtilis", field="Organism")
    )
)
print(composite_query)
print(entrez.search(composite_query, db_name="nuccore"))
(50:100[Sequence Length]) AND (("Escherichia coli"[Organism]) OR ("Bacillus subtilis"[Organism]))
['3144187317', '3144187216', '3144187195', '3144187117', '3144187048', '3144186983', '3144186934', '3144186845', '3144186840', '3144186567', '3144186477', '3144186467', '3143798037', '3143631776', '3143594027', '3143585088', '3143581833', '3143513349', '3142948689', '3142948553']

Increasing the request limit

The NCBI Entrez database has a quite conservative request limit. Hence, frequent accesses to the database may raise a RequestError. The limit can be greatly increased by providing an NCBI API key.

api_key = "api_key_placeholder"
entrez.set_api_key(api_key)
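
Even with an API key, a long series of requests may occasionally hit the limit. One common way to cope is to retry after a pause. The following is a minimal, library-independent sketch of that idea, not a Biotite feature: `with_retries` is a hypothetical helper, and `DemoError` and `flaky` merely simulate a request that fails twice.

```python
import time


def with_retries(func, retry_on, attempts=3, delay=1.0):
    """Call func(); if it raises `retry_on`, wait and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise                         # out of attempts: propagate
            time.sleep(delay * 2 ** attempt)  # exponential backoff


class DemoError(Exception):
    """Stand-in for the error raised when the request limit is hit."""


calls = {"n": 0}


def flaky():
    # Simulated request: fails on the first two calls, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise DemoError("simulated request limit")
    return "ok"


result = with_retries(flaky, DemoError, attempts=5, delay=0.01)
print(result, calls["n"])
```

With Biotite, `retry_on` would be `biotite.database.RequestError` and `func` a small wrapper such as `lambda: entrez.fetch("6BB5_A", gettempdir(), suffix="fa", db_name="protein", ret_type="fasta")`.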