National Center for Biotechnology Information

The National Center for Biotechnology Information (NCBI)[1][2] is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by Senator Claude Pepper.

The NCBI houses a series of databases relevant to biotechnology and biomedicine and is an important resource for bioinformatics tools and services. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature. Other databases include the NCBI Epigenomics database. All of these databases are available online through the Entrez search engine.

NCBI was directed by David Lipman,[2] one of the original authors of the BLAST sequence alignment program[3] and a widely respected figure in bioinformatics. He also led an intramural research program that included groups led by Stephen Altschul (another BLAST co-author), David Landsman, Eugene Koonin, John Wilbur, Teresa Przytycka, and Zhiyong Lu. Lipman stepped down from his post in May 2017.[4]

Headquarters: Bethesda, Maryland, U.S.
Coordinates: 38°59′45″N 77°05′56″W (38.995872°N 77.098811°W)


GenBank

NCBI has been responsible for making the GenBank DNA sequence database available since 1992.[5] GenBank coordinates with individual laboratories and other sequence databases such as those of the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ).[5]

Since 1992, NCBI has grown to provide other databases in addition to GenBank. NCBI provides Gene, Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D protein structures), dbSNP (a database of single-nucleotide polymorphisms), the Reference Sequence Collection, a map of the human genome, and a taxonomy browser, and coordinates with the National Cancer Institute to provide the Cancer Genome Anatomy Project. The NCBI assigns a unique identifier (taxonomy ID number) to each species of organism.[6]

The NCBI provides software tools that are available through a web browser or by FTP. For example, BLAST is a sequence similarity searching program that can run sequence comparisons against the GenBank DNA database in less than 15 seconds.

NCBI Bookshelf

The NCBI Bookshelf[7] is a collection of freely accessible, downloadable, online versions of selected biomedical books. The Bookshelf covers a wide range of topics including molecular biology, biochemistry, cell biology, genetics, microbiology, disease states from a molecular and cellular point of view, research methods, and virology. Some of the books are online versions of previously published books, while others, such as Coffee Break, are written and edited by NCBI staff. The Bookshelf complements the Entrez PubMed repository of peer-reviewed publication abstracts: Bookshelf contents provide established perspectives on evolving areas of study and a context in which many disparate pieces of reported research can be organized.

Basic Local Alignment Search Tool (BLAST)

BLAST is an algorithm used for calculating sequence similarity between biological sequences such as nucleotide sequences of DNA and amino acid sequences of proteins.[8] BLAST is a powerful tool for finding sequences similar to the query sequence within the same organism or in different organisms. It searches for the query sequence in NCBI databases and servers and posts the results back to the user's browser in the chosen format. Input sequences to BLAST are mostly in FASTA or GenBank format, while output can be delivered in a variety of formats such as HTML, XML, and plain text. HTML is the default output format for NCBI's web page. Results for NCBI BLAST are presented as a graphical overview of all the hits found, a table of sequence identifiers for the hits with their scoring data, and the alignments between the sequence of interest and the hits received, with the corresponding BLAST scores.[9]
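Since FASTA is named above as a common BLAST input format, a minimal sketch of reading it may help. The function name `parse_fasta` and the two example records are hypothetical illustrations, not part of any NCBI tool, and the parser assumes the simplest form of the format: a header line starting with `>` followed by one or more sequence lines.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts; remember its header text.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence data may span several lines; join them.
            records[header] += line
    return records


# Hypothetical example input, not real database entries.
example = """>seq1 hypothetical example
ACGTACGT
TTGA
>seq2 another example
GGCC
"""

records = parse_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```

A production workflow would more likely rely on an established parser; this sketch only shows how little structure the format itself requires.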


Entrez

The Entrez Global Query Cross-Database Search System is used at NCBI for all the major databases, such as Nucleotide and Protein Sequences, Protein Structures, PubMed, Taxonomy, Complete Genomes, OMIM, and several others.[10] Entrez is both an indexing and a retrieval system, holding data from various sources for biomedical research. NCBI distributed the first version of Entrez in 1991, composed of nucleotide sequences from PDB and GenBank, protein sequences from SWISS-PROT, translated GenBank, PIR, PRF, and PDB, and associated abstracts and citations from PubMed. Entrez is designed to integrate data from several different sources, databases, and formats into a uniform information model and retrieval system that can efficiently retrieve the relevant references, sequences, and structures.[11]
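As an illustration of querying a single Entrez database programmatically, the sketch below builds a search URL for NCBI's E-utilities web interface. E-utilities is not described in this article, so its use here is an assumption about how a reader might reach Entrez from code; the helper name `esearch_url` is hypothetical, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities web API (an assumption of this sketch;
# the article itself only describes the Entrez web search system).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database; no request is sent."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


url = esearch_url("pubmed", "BLAST sequence alignment", retmax=5)
print(url)
```

Fetching the URL would return a list of matching record identifiers; anyone doing so at volume should consult NCBI's usage guidelines first.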


Gene

Gene has been implemented at NCBI to characterize and organize information about genes. It serves as a major node in the nexus of genomic map, expression, sequence, protein function, structure, and homology data. A unique GeneID is assigned to each gene record and can be followed through revision cycles. Gene records for known or predicted genes are established here and are demarcated by map positions or nucleotide sequence. Gene has several advantages over its predecessor, LocusLink, including better integration with other NCBI databases, broader taxonomic scope, and enhanced options for query and retrieval provided by the Entrez system.[12]


Protein

The Protein database maintains text records for individual protein sequences, derived from many different resources such as the NCBI Reference Sequence (RefSeq) project, GenBank, PDB, and UniProtKB/Swiss-Prot. Protein records are available in different formats, including FASTA and XML, and are linked to other NCBI resources. Protein gives users access to relevant data such as genes, DNA/RNA sequences, biological pathways, expression and variation data, and literature. It also provides precomputed sets of similar and identical proteins for each sequence, as computed by BLAST. The Structure database of NCBI contains 3D coordinate sets for experimentally determined structures in PDB that are imported by NCBI. The Conserved Domain Database (CDD) contains sequence profiles that characterize highly conserved domains within protein sequences. It also holds records from external resources such as SMART and Pfam. Another protein resource, the Protein Clusters database, contains sets of protein sequences that are clustered according to the maximum alignments between the individual sequences as calculated by BLAST.[13]

PubChem database

The PubChem database of NCBI is a public resource for molecules and their activities against biological assays. PubChem is searchable and accessible through the Entrez information retrieval system.[14]

Implications of low-price DNA sequencing

In 2008, The New York Times wrote "The cost of determining a person’s complete genetic blueprint is about to plummet again — to $5,000." and added that the long-term goal was "the $1,000 genome."[15] Beneficiaries of cheaper sequencing include AncestryDNA, 23andMe (founded in 2006), and the customers of such services.



  1. ^ "The Human Genome Project". The New York Times.
  2. ^ a b "Research Institute Posts Gene Data on Internet". The New York Times. June 26, 1997.
  3. ^ "Sense from Sequences: Stephen F. Altschul on Bettering BLAST". 2000.
  4. ^ "National Library of Medicine Announces Departure of NCBI Director Dr. David Lipman". Retrieved 2017-05-06.
  5. ^ a b Mizrachi, Ilene (22 August 2007). "GenBank: The Nucleotide Sequence Database". National Center for Biotechnology Information (US) – via
  6. ^ "Home - Taxonomy - NCBI".
  7. ^ USA (2019-05-06). "Home - Books - NCBI". Retrieved 2019-06-12.
  8. ^ Altschul Stephen; Gish Warren; Miller Webb; Myers Eugene; Lipman David (1990). "Basic local alignment search tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/s0022-2836(05)80360-2. PMID 2231712.
  9. ^ Madden T. (2002). The NCBI handbook, 2nd edition, Chapter 16, The BLAST Sequence Analysis Tool
  10. ^ NCBI Resource Coordinators (2012). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Research 41 (Database issue): D8–D20.
  11. ^ Ostell J. (2002). The NCBI handbook, 2nd edition, Chapter 15, The Entrez Search and Retrieval System
  12. ^ Maglott D. Pruitt K. & Tatusova T. (2005). The NCBI handbook, 2nd edition, Chapter 19, Gene: A Directory of Genes
  13. ^ Sayers E. (2013). The NCBI handbook, 2nd edition, NCBI Protein Resources
  14. ^ Wang Y. & Bryant S H. (2014). The NCBI handbook, 2nd edition, NCBI PubChem BioAssay Database
  15. ^ Catherine Hutchings (October 7, 2008). "Your DNA: What Can You Afford (Not) To Know?". The New York Times.



The Acidimicrobiia are a class of Actinobacteria in which three families, eight genera, and nine species have been described. Acidimicrobium ferrooxidans is the type species of the order.


Archaeoglobaceae are a family of the Archaeoglobales. All known genera within the Archaeoglobaceae are hyperthermophilic and can be found near undersea hydrothermal vents. Archaeoglobaceae are the only family in the order Archaeoglobales, which is the only order in the class Archaeoglobi.


Bacilli is a taxonomic class of bacteria that includes two orders, Bacillales and Lactobacillales, which contain several well-known pathogens such as Bacillus anthracis (the cause of anthrax). Bacilli are almost exclusively gram-positive bacteria.

Conradi–Hünermann syndrome

Conradi–Hünermann syndrome is a rare type of chondrodysplasia punctata. It is associated with the EBP gene and affects between one in 100,000 and one in 200,000 babies.

Conserved Domain Database

The Conserved Domain Database (CDD) is a database of well-annotated multiple sequence alignment models and derived database search models, for ancient domains and full-length proteins.

DNA Data Bank of Japan

The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the International Nucleotide Sequence Database Collaboration (INSDC). It exchanges its data with the European Molecular Biology Laboratory at the European Bioinformatics Institute and with GenBank at the National Center for Biotechnology Information on a daily basis. Thus these three databanks contain the same data at any given time.

DDBJ began data bank activities in 1986 at NIG and remains the only nucleotide sequence data bank in Asia. Although DDBJ mainly receives its data from Japanese researchers, it accepts data from contributors in any other country. DDBJ is primarily funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT). DDBJ has an international advisory committee of nine members, three each from Europe, the US, and Japan. This committee advises DDBJ once a year on its maintenance, management, and future plans. DDBJ also has an international collaborative committee, made up of working-level participants, which advises on technical issues related to international collaboration.


The Deltaproteobacteria are a class of Proteobacteria. All species of this group are, like all Proteobacteria, Gram-negative.

The Deltaproteobacteria comprise a branch of predominantly aerobic genera, the fruiting body-forming Myxobacteria which release myxospores in unfavorable environments, and a branch of strictly anaerobic genera, which contains most of the known sulfate- (Desulfovibrio, Desulfobacter, Desulfococcus, Desulfonema, etc.) and sulfur-reducing bacteria (e.g. Desulfuromonas spp.) alongside several other anaerobic bacteria with different physiology (e.g. ferric iron-reducing Geobacter spp. and syntrophic Pelobacter and Syntrophus spp.).

A pathogenic intracellular deltaproteobacterium has recently been identified.


The Entrez (pronounced ɒnˈtreɪ) Global Query Cross-Database Search System is a federated search engine, or web portal that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website. The NCBI is a part of the National Library of Medicine (NLM), which is itself a department of the National Institutes of Health (NIH), which in turn is a part of the United States Department of Health and Human Services. The name "Entrez" (a greeting meaning "Come in!" in French) was chosen to reflect the spirit of welcoming the public to search the content available from the NLM.

Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query string and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system.


The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced and maintained by the National Center for Biotechnology Information (NCBI; a part of the National Institutes of Health in the United States) as part of the International Nucleotide Sequence Database Collaboration (INSDC).

GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. The database was started in 1982 by Walter Goad and Los Alamos National Laboratory. GenBank has become an important database for research in biological fields and has grown in recent years at an exponential rate, doubling roughly every 18 months. Release 194, produced in February 2013, contained over 150 billion nucleotide bases in more than 162 million sequences. GenBank is built by direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centers.
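To make the doubling claim concrete, the sketch below computes the growth factor implied by an 18-month doubling time, using the relation factor = 2^(t/18) for t months. The function name `growth_factor` is hypothetical, and the projection from the Release 194 figure is purely illustrative arithmetic under a constant-doubling assumption, not a reported GenBank statistic.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


# Lower bound quoted above for Release 194 (February 2013).
bases_feb_2013 = 150e9

# Illustrative extrapolation only: two doubling periods (36 months) later.
projected = bases_feb_2013 * growth_factor(36)
print(f"growth over 36 months: x{growth_factor(36):.1f}")
print(f"projected bases: {projected:.3g}")
```

Under that assumption the database would quadruple in three years, which shows how quickly an 18-month doubling time compounds.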


GeneReviews is an online database containing standardized peer-reviewed articles that describe specific heritable diseases. It was established in 1997 as GeneClinics by Roberta A Pagon (University of Washington) with funding from the National Institutes of Health. Its focus is primarily on single-gene disorders, providing current disorder-specific information on diagnosis, management, and genetic counseling. Links to disease-specific and/or general consumer resources are included in each article when available. The database is published on the National Center for Biotechnology Information Bookshelf site. Articles are updated every two or three years or as needed, and revised whenever significant changes in clinically relevant information occur. Articles are searchable by author, title, gene, and name of disease or protein, and are available free of charge.


HomoloGene, a tool of the United States National Center for Biotechnology Information (NCBI), is a system for automated detection of homologs (similarity attributable to descent from a common ancestor) among the annotated genes of several completely sequenced eukaryotic genomes.

HomoloGene processing begins with protein analysis of the input organisms. Sequences are compared using blastp, then matched up and put into groups using a taxonomic tree built from sequence similarity: more closely related organisms are matched up first, and more distant organisms are then added to the tree. The protein alignments are mapped back to their corresponding DNA sequences, and distance metrics such as the Jukes and Cantor (1969) molecular distance and the Ka/Ks ratio can then be calculated.
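The Jukes and Cantor (1969) distance mentioned above has a simple closed form: for a proportion p of differing sites between two aligned nucleotide sequences, d = -(3/4) ln(1 - (4/3) p). The following is a minimal sketch of that formula, not HomoloGene's actual implementation; the function names and the two short example sequences are hypothetical, and gaps and ambiguous bases are not handled.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length, non-empty")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if not 0.0 <= p < 0.75:
        # The logarithm is undefined once p reaches 3/4 (saturation).
        raise ValueError("distance is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Hypothetical aligned sequences differing at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jukes_cantor_distance(p)
print(f"p = {p:.2f}, d = {d:.4f}")
```

The corrected distance is always at least as large as p because it accounts for multiple substitutions at the same site.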

The sequences are matched up using a heuristic algorithm that maximizes the score globally, rather than locally, in a bipartite matching (see complete bipartite graph), and the statistical significance of each match is then calculated. Cutoffs are made per position and Ks values are set to prevent false "orthologs" from being grouped together. "Paralogs" are identified by finding sequences that are closer within species than other species.


Lentisphaerae is a phylum of bacteria closely related to Chlamydiae and Verrucomicrobia. It includes two monotypic orders, Lentisphaerales and Victivallales. Phylum members can be aerobic or anaerobic and fall under two distinct phenotypes. One consists of terrestrial gut microbiota from mammals and birds. The other phenotype includes marine micro-organisms: sequences from fish and coral microbiomes and marine sediment.


PubChem is a database of chemical molecules and their activities against biological assays. The system is maintained by the National Center for Biotechnology Information (NCBI), a component of the National Library of Medicine, which is part of the United States National Institutes of Health (NIH). PubChem can be accessed for free through a web user interface. Millions of compound structures and descriptive datasets can be freely downloaded via FTP. PubChem contains substance descriptions and small molecules with fewer than 1000 atoms and 1000 bonds. More than 80 database vendors contribute to the growing PubChem database.

PubMed Central

PubMed Central (PMC) is a free digital repository that archives publicly accessible full-text scholarly articles published in the biomedical and life sciences journal literature. As one of the major research databases developed by the National Center for Biotechnology Information (NCBI), PubMed Central is more than a document repository. Submissions to PMC undergo an indexing and formatting procedure that adds enhanced metadata, medical ontology, and unique identifiers, all of which enrich the XML structured data for each article on deposit. Content within PMC can be interlinked with many other NCBI databases and accessed via the Entrez search and retrieval system, further enhancing the public's ability to freely discover, read, and build upon this portfolio of biomedical knowledge.

PubMed Central is distinct from PubMed. PubMed Central is a free digital archive of full articles, accessible to anyone from anywhere via a web browser (with varying provisions for reuse). PubMed, by contrast, is a searchable database of biomedical citations and abstracts; the full-text article physically resides elsewhere (in print or online, free or behind a subscriber paywall).

As of December 2018, the PMC archive contained over 5.2 million articles, with contributions coming directly from publishers or from authors depositing their own manuscripts into the repository per the NIH Public Access Policy. Older data show that author-initiated deposits exceeded 103,000 papers in the 12 months from January 2013 to January 2014. PMC also identifies about 4,000 journals that participate in some capacity to automatically deposit their published content into the PMC repository. Some participating publishers delay the release of their articles on PubMed Central for a set time after publication. This delay is often referred to as an "embargo period" and can range from a few months to a few years depending on the journal; embargoes of six to twelve months are the most common. However, PubMed Central is a key example of "systematic external distribution by a third party", which is still prohibited by the contributor agreements of many publishers.


The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences (DNA, RNA) and their protein products. This database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule (i.e. DNA, RNA, or protein) for major organisms ranging from viruses to bacteria to eukaryotes.

For each model organism, RefSeq aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts. RefSeq is limited to major organisms for which sufficient data are available (more than 66,000 distinct “named” organisms as of September 2011), while GenBank includes sequences for any organism submitted (approximately 250,000 different named organisms).


Rubrobacter is a genus of Actinobacteria, given its own subclass (Rubrobacteridae). It is radiotolerant and may rival Deinococcus radiodurans in this regard.


The Thermodesulfobacteria are a phylum of thermophilic sulfate-reducing bacteria.


Treponema is a genus of spiral-shaped bacteria. The major human-pathogenic treponeme species is Treponema pallidum, whose subspecies are responsible for diseases such as syphilis, bejel, and yaws. Treponema carateum is the cause of pinta. Treponema paraluiscuniculi is associated with syphilis in rabbits.

United States National Library of Medicine

The United States National Library of Medicine (NLM), operated by the United States federal government, is the world's largest medical library. Located in Bethesda, Maryland, the NLM is an institute within the National Institutes of Health. Its collections include more than seven million books, journals, technical reports, manuscripts, microfilms, photographs, and images on medicine and related sciences, including some of the world's oldest and rarest works.

The current director of the NLM is Patricia Flatley Brennan.



This page is based on a Wikipedia article written by authors (here).
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.