The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration (INSDC). The National Center for Biotechnology Information is a part of the National Institutes of Health in the United States.

GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. The database was started in 1982 by Walter Goad and Los Alamos National Laboratory. GenBank has become an important database for research in biological fields and has grown exponentially in recent years, doubling roughly every 18 months.[2][3]

Release 194, produced in February 2013, contained over 150 billion nucleotide bases in more than 162 million sequences.[4] GenBank is built by direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centers.

Description: Nucleotide sequences for more than 300,000 organisms with supporting bibliographic and biological annotation.
Data types:
  • Nucleotide sequence
  • Protein sequence
Research center: NCBI
Primary citation: PMID 21071399
Release date: 1982
Download URL: NCBI FTP


Submissions

Only original sequences can be submitted to GenBank. Direct submissions are made using BankIt, a Web-based form, or Sequin, a stand-alone submission program. Upon receipt of a sequence submission, GenBank staff check the originality of the data, assign an accession number to the sequence, and perform quality assurance checks. The submissions are then released to the public database, where entries can be retrieved through Entrez or downloaded by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence-tagged site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are most often made by large-scale sequencing centers. The GenBank direct submissions group also processes complete microbial genome sequences.
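As an illustration of the accession numbers assigned at submission, the sketch below checks whether a string has the general shape of a nucleotide accession. It assumes the commonly documented layout of one letter plus five digits, or two letters plus six or eight digits, with an optional version suffix such as ".1". The function name is hypothetical, and the pattern deliberately ignores protein, WGS, and RefSeq accessions.

```python
import re

# Assumed layout: 1 letter + 5 digits, or 2 letters + 6 or 8 digits,
# optionally followed by a ".version" suffix. Other accession families
# (protein, WGS, RefSeq) are out of scope for this sketch.
_NUCLEOTIDE_ACCESSION = re.compile(
    r"^(?:[A-Z]\d{5}|[A-Z]{2}(?:\d{6}|\d{8}))(?:\.\d+)?$"
)


def looks_like_nucleotide_accession(text: str) -> bool:
    """Return True if text has the assumed nucleotide accession shape."""
    return bool(_NUCLEOTIDE_ACCESSION.match(text.strip().upper()))
```

A check like this only validates the format; it cannot tell whether a record with that accession actually exists in the database.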


History

Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory and others established the Los Alamos Sequence Database in 1979, which culminated in 1982 with the creation of the public GenBank.[5] Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences were stored in it.

In the mid-1980s, the Intelligenetics bioinformatics company at Stanford University managed the GenBank project in collaboration with LANL.[6] As one of the earliest bioinformatics community projects on the Internet, the GenBank project started the BIOSCI/Bionet news groups to promote open access communication among bioscientists. From 1989 to 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information.[7]

[Figure: GenBank and EMBL, Nucleotide Sequences 1986/1987, Volumes I to VII.]
[Figure: CD-ROM of GenBank v100.]


Growth

[Figure: Growth in GenBank base pairs, 1982 to 2018, on a semi-log scale.]

The GenBank release notes for release 162.0 (October 2007) state that "from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months".[4][8] As of 15 June 2018, GenBank release 226.0 has 209,775,348 loci, 263,957,884,539 bases, from 209,775,348 reported sequences.[4]

The GenBank database includes additional data sets that are constructed mechanically from the main sequence data collection, and therefore are excluded from this count.
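The doubling rate quoted from the release notes corresponds to the exponential model N(t) = N0 · 2^(t/T), with T being the doubling time (about 18 months in the quoted statement). The following is a minimal sketch of that arithmetic; the function names are illustrative and not taken from any NCBI tool.

```python
import math


def projected_bases(initial_bases: float, months_elapsed: float,
                    doubling_months: float = 18.0) -> float:
    """Project a base count forward, assuming steady exponential doubling."""
    return initial_bases * 2 ** (months_elapsed / doubling_months)


def implied_doubling_months(bases_start: float, bases_end: float,
                            months_elapsed: float) -> float:
    """Doubling time implied by two base counts taken months_elapsed apart."""
    return months_elapsed * math.log(2) / math.log(bases_end / bases_start)
```

Applying `implied_doubling_months` to the totals of two releases gives the doubling time those two data points imply, which can then be compared with the long-run figure stated in the release notes.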

Top organisms in GenBank (Release 191)[9]
Organism                            Base pairs
Homo sapiens                    16,310,774,187
Mus musculus                     9,974,977,889
Rattus norvegicus                6,521,253,272
Bos taurus                       5,386,258,455
Zea mays                         5,062,731,057
Sus scrofa                       4,887,861,860
Danio rerio                      3,120,857,462
Strongylocentrotus purpuratus    1,435,236,534
Macaca mulatta                   1,256,203,101
Oryza sativa Japonica Group      1,255,686,573
Nicotiana tabacum                1,197,357,811
Xenopus (Silurana) tropicalis    1,249,938,611
Drosophila melanogaster          1,119,965,220
Pan troglodytes                  1,008,323,292
Arabidopsis thaliana             1,144,226,616
Canis lupus familiaris             951,238,343
Vitis vinifera                     999,010,073
Gallus gallus                      899,631,338
Glycine max                        906,638,854
Triticum aestivum                  898,689,329

Incomplete identifications

Public databases that can be searched with the National Center for Biotechnology Information Basic Local Alignment Search Tool (NCBI BLAST) lack peer-reviewed sequences of type strains and sequences of non-type strains. Commercial databases, on the other hand, potentially contain high-quality filtered sequence data but offer only a limited number of reference sequences.

A paper published in the Journal of Clinical Microbiology[10] evaluated 16S rRNA gene sequencing results analyzed with GenBank in conjunction with other freely available, quality-controlled, web-based public databases, such as the EzTaxon-e and BIBI databases. The results showed that analyses performed using GenBank combined with EzTaxon-e (kappa = 0.79) were more discriminative than those using GenBank (kappa = 0.66) or other databases alone.
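The kappa values reported above measure agreement between two sets of identifications beyond what chance would produce. Assuming they are Cohen's kappa (the source does not say which variant was used), the statistic can be sketched as follows; the function name and inputs are illustrative only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the higher kappa for the combined databases reflects closer agreement with the reference identification.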



References

  1. The download page at UCSC says "NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims, and therefore cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in GenBank."
  2. Benson, D.; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Wheeler, D. L.; et al. (2008). "GenBank". Nucleic Acids Research. 36 (Database): D25–D30. doi:10.1093/nar/gkm929. PMC 2238942. PMID 18073190.
  3. Benson, D.; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Sayers, E. W.; et al. (2009). "GenBank". Nucleic Acids Research. 37 (Database): D26–D31. doi:10.1093/nar/gkn723. PMC 2686462. PMID 18940867.
  4. "GenBank release notes". NCBI.
  5. Hanson, Todd (2000-11-21). "Walter Goad, GenBank founder, dies". Newsbulletin: obituary. Los Alamos National Laboratory.
  6. LANL GenBank History.
  7. Benton, D. (1990). "Recent changes in the GenBank On-line Service". Nucleic Acids Research. 18 (6): 1517–1520. doi:10.1093/nar/18.6.1517. PMC 330520. PMID 2326192.
  8. Benson, D. A.; Cavanaugh, M.; Clark, K.; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Sayers, E. W. (2012). "GenBank". Nucleic Acids Research. 41 (Database issue): D36–D42. doi:10.1093/nar/gks1195. PMC 3531190. PMID 23193287.
  9. Benson, D. A.; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Sayers, E. W. (January 2011). "GenBank". Nucleic Acids Research. 39 (Database issue): D32–37. doi:10.1093/nar/gkq1079. PMC 3013681. PMID 21071399.
  10. Kyung Sun Park, Chang-Seok Ki, Cheol-In Kang, Yae-Jean Kim, Doo Ryeon Chung, Kyong Ran Peck, Jae-Hoon Song and Nam Yong Lee (May 2012). "Evaluation of the GenBank, EzTaxon, and BIBI Services for Molecular Identification of Clinical Blood Culture Isolates That Were Unidentifiable or Misidentified by Conventional Methods". J. Clin. Microbiol. 50 (5): 1792–1795. doi:10.1128/JCM.00081-12. PMC 3347139. PMID 22403421.



CAZy

CAZy is a database of Carbohydrate-Active enZYmes (CAZymes). The database contains a classification and associated information about enzymes involved in the synthesis, metabolism, and recognition of complex carbohydrates, i.e. disaccharides, oligosaccharides, polysaccharides, and glycoconjugates. Included in the database are families of glycoside hydrolases, glycosyltransferases, polysaccharide lyases, carbohydrate esterases, and non-catalytic carbohydrate-binding modules. The CAZy database also includes a classification of Auxiliary Activity redox enzymes involved in the breakdown of lignocellulose.

CAZy was established in 1999 to provide online and constantly updated access to the protein sequence-based family classification of CAZymes, which was originally developed in the early 1990s to classify the glycoside hydrolases. New entries are added shortly after they appear in the daily releases of GenBank. The rapid evolution of high-throughput DNA sequencing has resulted in the continuing exponential growth of the CAZy database, which now covers hundreds of thousands of sequences. CAZy continues to be curated and developed by the Glycogenomics group at AFMB, a research centre affiliated with the French National Centre for Scientific Research and Aix-Marseille University.

The CAZy database is coupled with CAZypedia, which was launched in 2007 as a research community-driven, wiki-based encyclopedia of CAZymes.

DNA Data Bank of Japan

The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the International Nucleotide Sequence Database Collaboration (INSDC). It exchanges its data on a daily basis with the European Molecular Biology Laboratory at the European Bioinformatics Institute and with GenBank at the National Center for Biotechnology Information. Thus these three databanks contain the same data at any given time.

DDBJ began data bank activities in 1986 at NIG and remains the only nucleotide sequence data bank in Asia. Although DDBJ mainly receives its data from Japanese researchers, it can accept data from contributors in any other country. DDBJ is primarily funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT). DDBJ has an international advisory committee of nine members, three each from Europe, the US, and Japan. This committee advises DDBJ once a year about its maintenance, management and future plans. DDBJ also has an international collaborative committee, made up of working-level participants, which advises on technical issues related to international collaboration.

David J. Lipman

David J. Lipman is an American biologist who from 1989 to 2017 was the Director of the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. NCBI is the home of GenBank, the U.S. node of the International Sequence Database Consortium, and PubMed, one of the most heavily used sites in the world for the search and retrieval of biomedical information. Lipman is one of the original authors of the BLAST sequence alignment program, and a respected figure in bioinformatics. In May 2017, it was announced that he would be leaving NCBI to take the position of Chief Science Officer at Impossible Foods.


Entrez

The Entrez (pronounced ɒnˈtreɪ) Global Query Cross-Database Search System is a federated search engine, or web portal, that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website. The NCBI is a part of the National Library of Medicine (NLM), which is itself a department of the National Institutes of Health (NIH), which in turn is a part of the United States Department of Health and Human Services. The name "Entrez" (a greeting meaning "Come in!" in French) was chosen to reflect the spirit of welcoming the public to search the content available from the NLM.

Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query string and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system.

Expressed sequence tag

In genetics, an expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and in gene-sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs now available in public databases (e.g. GenBank 1 January 2013, all species).

An EST results from one-shot sequencing of a cloned cDNA. The cDNAs used for EST generation are typically individual clones from a cDNA library. The resulting sequence is a relatively low-quality fragment whose length is limited by current technology to approximately 500 to 800 nucleotides. Because these clones consist of DNA that is complementary to mRNA, the ESTs represent portions of expressed genes. They may be represented in databases as either cDNA/mRNA sequence or as the reverse complement of the mRNA, the template strand.
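Because an EST may be stored either in the mRNA sense or as its reverse complement, comparing ESTs often starts by normalizing strand orientation. The sketch below shows the reverse-complement operation in plain Python; the function name is illustrative, and ambiguity codes other than N are left unchanged as a simplifying assumption.

```python
# Translation table for the four bases plus N, preserving letter case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, which is a convenient sanity check when handling ESTs of unknown orientation.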

One can map ESTs to specific chromosome locations using physical mapping techniques, such as radiation hybrid mapping, Happy mapping, or FISH. Alternatively, if the genome of the organism that originated the EST has been sequenced, one can align the EST sequence to that genome using a computer.

The current understanding of the human set of genes (as of 2006) includes the existence of thousands of genes based solely on EST evidence. In this respect, ESTs have become a tool to refine the predicted transcripts for those genes, which leads to the prediction of their protein products and ultimately of their function. Moreover, the situation in which those ESTs are obtained (tissue, organ, disease state - e.g. cancer) gives information on the conditions in which the corresponding gene is acting. ESTs contain enough information to permit the design of precise probes for DNA microarrays that then can be used to determine the gene expression.

Some authors use the term "EST" to describe genes for which little or no further information exists besides the tag. Nagaraj et al. (2007) have reviewed the significance of ESTs, their properties, methods to analyze EST datasets and their applications in various areas of biology.

Faroese goose

The Faroese goose (Føroyska Gásin in Faroese) is probably the oldest form of tame goose in Europe and possibly the direct descendant of the tame geese that the Landnám folk brought from Scandinavia and the British Isles.

Since the Faroe Islands have no predator that can kill the geese, a special "goose culture" has developed in the Faroe Islands, which has no equivalent in neighboring countries.

From May to October one can see flocks of geese walking freely in the outfields, where they feed on the short summer grass without any supplementary feeding.

In winter the geese move freely in the cultivated infields of the villages, which in some cases are of such good quality that in earlier times the geese needed no supplementary feed in the winter. In most places, however, caretakers provide supplementary food just before and during egg laying and when snow is on the ground. The properties of today's Faroese geese result from natural selection over centuries, in which only the most well-adapted birds reproduced.


FishBase

FishBase is a global database of fish species (specifically finfish). It is the largest and most extensively accessed online database on adult finfish. Over time it has "evolved into a dynamic and versatile ecological tool" that is widely cited in scholarly publications.

FishBase provides comprehensive species data, including information on taxonomy, geographical distribution, biometrics and morphology, behaviour and habitats, ecology and population dynamics as well as reproductive, metabolic and genetic data. There is access to tools such as trophic pyramids, identification keys, biogeographical modelling and fishery statistics, and there are direct species-level links to information in other databases such as LarvalBase, GenBank, the IUCN Red List and the Catalog of Fishes.

As of November 2018, FishBase included descriptions of 34,000 species and subspecies, 323,200 common names in almost 300 languages, 58,900 pictures, and references to 55,300 works in the scientific literature. The site has about 700,000 unique visitors per month.

Haplogroup I (mtDNA)

Haplogroup I is a human mitochondrial DNA (mtDNA) haplogroup. It is believed to have originated about 21,000 years ago, during the Last Glacial Maximum (LGM) period in West Asia (Olivieri 2013; Terreros 2011; Fernandes 2012). The haplogroup is unusual in that it is now widely distributed geographically, but is common in only a few small areas of East Africa, West Asia and Europe. It is especially common among the El Molo and Rendille peoples of Kenya, various regions of Iran, the Lemko people of Slovakia, Poland and Ukraine, the island of Krk in Croatia, the department of Finistère in France and some parts of Scotland.

Influenza Genome Sequencing Project

The Influenza Genome Sequencing Project (IGSP), initiated in early 2004, seeks to investigate influenza evolution by providing a public data set of complete influenza genome sequences from collections of isolates representing diverse species distributions.

The project is funded by the National Institute of Allergy and Infectious Diseases (NIAID), a division of the National Institutes of Health (NIH), and has been operating out of the NIAID Microbial Sequencing Center at The Institute for Genomic Research (TIGR, which in 2006 became The Venter Institute).

Sequence information generated by the project has been continually placed into the public domain through GenBank.

International Nucleotide Sequence Database Collaboration

The International Nucleotide Sequence Database Collaboration (INSDC) is a joint effort to collect and disseminate databases containing DNA and RNA sequences. It involves the following computerized databases: DNA Data Bank of Japan (Japan), GenBank (USA) and the European Nucleotide Archive (UK). New and updated data on nucleotide sequences contributed by research teams to each of the three databases are synchronized on a daily basis through continuous interaction between the staff at each of the collaborating organizations.

The DDBJ/EMBL/GenBank synchronization is maintained according to a number of guidelines which are produced and published by an International Advisory Board. The guidelines consist of a common definition of the feature tables for the databases, which regulate the content and syntax of the database entries, in the form of a common DTD (Document Type Definition).

The syntax is called INSDSeq and its core consists of the letter sequence of the gene product (the amino acid sequence) and the letter sequence of the nucleotide bases in the gene or decoded segment. A DBFetch operation at the EBI database returns a typical INSD entry; the same entry is also available at NCBI.
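To give a feel for what an INSDSeq entry looks like to a program, the sketch below parses a tiny hand-written XML fragment with Python's standard library. The record content (locus name and sequence) is made up for illustration, only a few of the elements defined by the format are shown, and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment, not a real database entry. Only a small
# subset of INSDSeq elements is included here.
SAMPLE = """<INSDSet>
  <INSDSeq>
    <INSDSeq_locus>EXAMPLE1</INSDSeq_locus>
    <INSDSeq_length>12</INSDSeq_length>
    <INSDSeq_moltype>DNA</INSDSeq_moltype>
    <INSDSeq_sequence>atggccattgta</INSDSeq_sequence>
  </INSDSeq>
</INSDSet>"""


def read_insdseq(xml_text: str) -> list:
    """Extract locus, declared length, and sequence from each INSDSeq entry."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter("INSDSeq"):
        records.append({
            "locus": entry.findtext("INSDSeq_locus"),
            "length": int(entry.findtext("INSDSeq_length")),
            "sequence": entry.findtext("INSDSeq_sequence"),
        })
    return records
```

Because all three collaborating databases share the same entry definition, a parser written against one of them should in principle read entries served by the others.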

List of alignment visualization software

This page is a subsection of the list of sequence alignment software.

Multiple alignment visualization tools typically serve four purposes:

  • Aid general understanding of large-scale DNA or protein alignments
  • Visualize alignments for figures and publication
  • Manually edit and curate automatically generated alignments
  • Analysis in depth

The rest of this article is focused on only multiple global alignments of homologous proteins. The first two are a natural consequence of most representations of alignments and their annotation being human-unreadable and best portrayed in the familiar sequence row and alignment column format, of which examples are widespread in the literature. The third is necessary because algorithms for both multiple sequence alignment and structural alignment use heuristics which do not always perform perfectly. The fourth is a good example of how interactive graphical tools enable a worker involved in sequence analysis to conveniently execute a variety of different computational tools to explore an alignment's phylogenetic implications, or to predict the structure and functional properties of a specific sequence, e.g., by comparative modelling.

Mycobacterium chelonae

Mycobacterium chelonae is a species of the phylum Actinobacteria (Gram-positive bacteria with high guanine and cytosine content, one of the dominant phyla of all bacteria), belonging to the genus Mycobacterium. It is a rapidly growing mycobacterium that is found throughout the environment, including in sewage and tap water. It can occasionally cause opportunistic infections of humans.

It is grouped in Runyon group IV.

Type strain: strain CM 6388 = ATCC 35752 = CCUG 47445 = CIP 104535 = DSM 43804 = JCM 6388 = NCTC 946.

The complete genome sequence of M. chelonae CCUG 47445 type strain was deposited and published in DNA Data Bank of Japan, European Nucleotide Archive and GenBank in 2016 under the accession number CP007220.

Mycobacterium immunogenum

Mycobacterium immunogenum is a species of the phylum Actinobacteria (Gram-positive bacteria with high guanine and cytosine content, one of the dominant phyla of all bacteria), belonging to the genus Mycobacterium.

These non-tuberculous mycobacteria are sometimes found in fouling water-based cutting fluids, often causing hypersensitivity pneumonitis in the machinists in the affected grinding plants.

The complete genome sequence of Mycobacterium immunogenum CCUG 47286T was deposited and published in DNA Data Bank of Japan, European Nucleotide Archive and GenBank in 2016 under the accession number CP011530.

National Center for Biotechnology Information

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by Senator Claude Pepper.

The NCBI houses a series of databases relevant to biotechnology and biomedicine and is an important resource for bioinformatics tools and services. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature. Other databases include the NCBI Epigenomics database. All these databases are available online through the Entrez search engine.

NCBI was directed by David Lipman, one of the original authors of the BLAST sequence alignment program and a widely respected figure in bioinformatics. He also led an intramural research program, including groups led by Stephen Altschul (another BLAST co-author), David Landsman, Eugene Koonin, John Wilbur, Teresa Przytycka, and Zhiyong Lu. David Lipman stood down from his post in May 2017.


Plazi

Plazi is a Swiss-based international non-profit association supporting and promoting the development of persistent and openly accessible digital bio-taxonomic literature. Plazi maintains a digital taxonomic literature repository to enable archiving of taxonomic treatments, enhances submitted taxonomic treatments by creating versions in the XML formats TaxonX and Taxpub, and educates about the importance of maintaining open access to scientific discourse and data. It is a contributor to the evolving e-taxonomy in the field of Biodiversity Informatics.

The approach was originally developed in a binational National Science Foundation (NSF) and German Research Foundation (DFG) digital library program to the American Museum of Natural History and the University of Karlsruhe, respectively, to create an XML schema modeling the content of bio-systematic literature. The TaxonX schema is applied to legacy publications using GoldenGATE, a semiautomatic editor. In its current state GoldenGATE is a complex markup tool allowing community involvement in the process of rendering documents into semantically enhanced documents.

Plazi developed ways to make distribution records in published taxonomic literature accessible through a TAPIR service that is harvested by the Global Biodiversity Information Facility (GBIF). Similarly, the Species Page Model (SPM) transfer schema has been implemented to allow harvesting of treatments (the scientific descriptions of species and higher taxa) by third parties such as the Encyclopedia of Life (EOL). If available, the treatments are enhanced with links to external databases such as GenBank, The Hymenoptera Name Server for scientific names or ZooBank, the registry of zoological names.

Plazi claims it adheres to copyright law and argues that taxonomic treatments do not qualify as literary and artistic work. Plazi claims that such works are therefore in the public domain and can be freely used and disseminated (with scientific practice requiring appropriate citation).


RefSeq

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences (DNA, RNA) and their protein products. This database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from viruses to bacteria to eukaryotes.

For each model organism, RefSeq aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts. RefSeq is limited to major organisms for which sufficient data are available (more than 66,000 distinct “named” organisms as of September 2011), while GenBank includes sequences for any organism submitted (approximately 250,000 different named organisms).

Small nucleolar RNA SNORD43

snoRNA U43 (also known as SNORD43) is a non-coding RNA (ncRNA) molecule which functions in the modification of other small nuclear RNAs (snRNAs). This type of modifying RNA is usually located in the nucleolus of the eukaryotic cell which is a major site of snRNA biogenesis. It is known as a small nucleolar RNA (snoRNA) and also often referred to as a guide RNA.

snoRNA U43 belongs to the C/D box class of snoRNAs, which contain the conserved sequence motifs known as the C box (UGAUGA) and the D box (CUGA). Most of the members of the box C/D family function in directing site-specific 2'-O-methylation of substrate RNAs.

U43 is encoded in intron 1 of the ribosomal protein L3 gene in human and cow. Three other snoRNAs (U82, U83a and U83b) are also encoded in the same host gene but from different introns. The Arabidopsis thaliana homologue is called snoR41 in the public sequence databases (GenBank). The rice homologue is expressed from a cluster also containing snoR16.

U43 is hypothesised to guide methylation of 2'-O-ribose residues on 18S ribosomal RNA.
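The C box (UGAUGA) and D box (CUGA) motifs named above are short enough to locate with a simple exact-match scan. The sketch below does this for an RNA or DNA string; it is only illustrative, since real snoRNA annotation also considers motif spacing, flanking structure and imperfect matches, and the function names are hypothetical.

```python
C_BOX = "UGAUGA"
D_BOX = "CUGA"


def find_motif(sequence: str, motif: str) -> list:
    """Return 0-based start positions of every (possibly overlapping) match."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def box_cd_motifs(sequence: str) -> dict:
    """Locate exact C box and D box matches; DNA input is converted to RNA."""
    rna = sequence.upper().replace("T", "U")
    return {"C box": find_motif(rna, C_BOX), "D box": find_motif(rna, D_BOX)}
```

Because a four-letter motif such as CUGA occurs often by chance, hits from a scan like this are candidates to inspect rather than evidence of a functional box.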

Small nucleolar RNA SNORD46

snoRNA U46 (also known as SNORD46) is a non-coding RNA (ncRNA) molecule which functions in the modification of other small nuclear RNAs (snRNAs). This type of modifying RNA is usually located in the nucleolus of the eukaryotic cell which is a major site of snRNA biogenesis. It is known as a small nucleolar RNA (snoRNA) and also often referred to as a guide RNA.

snoRNA U46 belongs to the C/D box class of snoRNAs, which contain the conserved sequence motifs known as the C box (UGAUGA) and the D box (CUGA). Most of the members of the box C/D family function in directing site-specific 2'-O-methylation of substrate RNAs.

U46 is encoded in intron 2 of the ribosomal protein S8 gene in human, and is hypothesised to guide methylation of 2'-O-ribose residues on 28S ribosomal RNA (rRNA). The homologue of this snoRNA in Arabidopsis thaliana is called snoZ153. Some human U40 sequences have been annotated in the sequence databases (GenBank) as U46.

Zebrafish Information Network

The Zebrafish Information Network (ZFIN) is an online biological database of information about the zebrafish (Danio rerio). The zebrafish is a widely used model organism for genetic, genomic, and developmental studies, and ZFIN provides an integrated interface for querying and displaying the large volume of data generated by this research. To facilitate use of the zebrafish as a model of human biology, ZFIN links these data to corresponding information about other model organisms (e.g., mouse) and to human disease databases. Abundant links to external sequence databases (e.g., GenBank) and to genome browsers are included. Gene product, gene expression, and phenotype data are annotated with terms from biomedical ontologies. ZFIN is based at the University of Oregon in the United States, with funding provided by the National Institutes of Health (NIH).

This page is based on a Wikipedia article.
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.