Bacterial genome

Bacterial genomes are generally smaller and less variable in size among species than the genomes of animals and single-celled eukaryotes. Bacterial genomes range in size from about 130 kbp[1][2] to over 14 Mbp.[3] A study that included, but was not limited to, 478 bacterial genomes concluded that as genome size increases, the number of genes increases at a disproportionately slower rate in eukaryotes than in non-eukaryotes. Thus, the proportion of non-coding DNA goes up with genome size more quickly in non-bacteria than in bacteria. This is consistent with the fact that most eukaryotic nuclear DNA is non-coding, while the majority of prokaryotic, viral, and organellar DNA is coding.[4]

Genome sequences are currently available from 50 different bacterial phyla and 11 different archaeal phyla. Second-generation sequencing has yielded many draft genomes (close to 90% of bacterial genomes in GenBank are currently not complete); third-generation sequencing might eventually yield a complete genome in a few hours. The genome sequences reveal much diversity in bacteria. Analysis of over 2000 Escherichia coli genomes reveals an E. coli core genome of about 3100 gene families and a total of about 89,000 different gene families.[5] Genome sequences show that parasitic bacteria have 500–1200 genes, free-living bacteria have 1500–7500 genes, and archaea have 1500–2700 genes.[6]

A striking discovery by Cole et al. described massive amounts of gene decay when comparing the leprosy bacillus to ancestral bacteria.[7] Studies have since shown that several bacteria have smaller genomes than their ancestors did.[8] Over the years, researchers have proposed several theories to explain the general trend of bacterial genome decay and the relatively small size of bacterial genomes. Compelling evidence indicates that the apparent degradation of bacterial genomes is due to a deletional bias.

Methods and techniques

As of 2014, there are over 30,000 sequenced bacterial genomes publicly available and thousands of metagenome projects. Projects such as the Genomic Encyclopedia of Bacteria and Archaea (GEBA) intend to add more genomes.[5]

The single gene comparison is now being supplanted by more general methods. These methods have resulted in novel perspectives on genetic relationships that previously have only been estimated.[5]

A significant achievement in the second decade of bacterial genome sequencing was the production of metagenomic data, which covers all DNA present in a sample. Previously, there were only two metagenomic projects published.[5]

Bacterial genomes

Genome size vs protein count
Log-log plot of the total number of annotated proteins in genomes submitted to GenBank as a function of genome size. Based on data from NCBI genome reports.

Bacteria possess a compact genome architecture distinct from eukaryotes in two important ways: bacteria show a strong correlation between genome size and number of functional genes in a genome, and those genes are structured into operons.[9][10] The main reason bacterial genomes are so dense compared to eukaryotic genomes (especially those of multicellular eukaryotes) is that eukaryotic genomes carry large amounts of noncoding DNA in the form of intergenic regions and introns.[10] Some notable exceptions include recently formed pathogenic bacteria. This was initially described in a study by Cole et al. in which Mycobacterium leprae was discovered to have a significantly higher proportion of pseudogenes relative to functional genes (~40%) than its free-living ancestors.[7]

Furthermore, amongst species of bacteria, there is relatively little variation in genome size when compared with the genome sizes of other major groups of life.[6] Genome size is of little relevance when considering the number of functional genes in eukaryotic species. In bacteria, however, the strong correlation between the number of genes and the genome size makes the size of bacterial genomes an interesting topic for research and discussion.[11]

The general trends of bacterial evolution indicate that bacteria started as free-living organisms. Evolutionary paths led some bacteria to become pathogens and symbionts. The lifestyles of bacteria play an integral role in their respective genome sizes. Free-living bacteria have the largest genomes out of the three types of bacteria; however, they have fewer pseudogenes than bacteria that have recently acquired pathogenicity.

Facultative and recently evolved pathogenic bacteria exhibit a smaller genome size than free-living bacteria, yet they have more pseudogenes than any other form of bacteria.

Obligate bacterial symbionts or pathogens have the smallest genomes and the fewest pseudogenes of the three groups.[12] The relationship between life-styles of bacteria and genome size raises questions as to the mechanisms of bacterial genome evolution. Researchers have developed several theories to explain the patterns of genome size evolution amongst bacteria.

Genome comparisons and phylogeny

As single-gene comparisons have largely given way to genome comparisons, the phylogeny of bacterial genomes has improved in accuracy. The Average Nucleotide Identity method quantifies genetic distance between entire genomes by taking advantage of regions of about 10,000 bp. With enough data from genomes of one genus, algorithms are executed to categorize species. This was done for the Pseudomonas avellanae species in 2013.[5]
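The idea of averaging identity over genome regions can be illustrated with a deliberately simplified sketch. It assumes the two sequences are already aligned and of equal length, which real Average Nucleotide Identity tools do not assume (they search for matching regions first). The function names and the fragment-length parameter are illustrative choices, not part of any published tool.

```python
def fragment_identities(query, reference, fragment_len=10_000):
    """Identity of each fragment, for two pre-aligned, equal-length sequences.

    Assumption: the sequences are already aligned position by position.
    Real ANI pipelines locate homologous regions with an alignment search.
    """
    if not query or len(query) != len(reference):
        raise ValueError("expects non-empty, pre-aligned sequences of equal length")
    identities = []
    for start in range(0, len(query), fragment_len):
        q = query[start:start + fragment_len]
        r = reference[start:start + fragment_len]
        matches = sum(1 for a, b in zip(q, r) if a == b)
        identities.append(matches / len(q))
    return identities


def average_nucleotide_identity(query, reference, fragment_len=10_000):
    """Mean of the per-fragment identities (a toy stand-in for ANI)."""
    ids = fragment_identities(query, reference, fragment_len)
    return sum(ids) / len(ids)


# Tiny example with 4 bp fragments: one perfect fragment, one with a mismatch.
print(average_nucleotide_identity("ACGTACGT", "ACGTACGA", fragment_len=4))
```

In this toy example the first fragment matches completely and the second matches at three of four positions, so the average is 0.875.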

To extract information about bacterial genomes, core- and pan-genome sizes have been assessed for several strains of bacteria. In 2012, the number of E. coli core gene families was about 3000. By 2015, with an over tenfold increase in available genomes, the pan-genome had increased as well. There is roughly a positive correlation between the number of genomes added and the growth of the pan-genome. On the other hand, the core genome has remained static since 2012. Currently, the E. coli pan-genome is composed of about 90,000 gene families. About one-third of these exist only in a single genome. Many of these, however, are merely gene fragments and the result of calling errors. Still, there are probably over 60,000 unique gene families in E. coli.[5]
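The core genome, the pan-genome, and the families found in only one genome correspond to simple set operations. The sketch below assumes each genome has already been reduced to a set of gene-family identifiers; the strain names and family labels are hypothetical placeholders, not real data.

```python
from collections import Counter


def summarize_pan_genome(genomes):
    """genomes: dict mapping a genome name to an iterable of gene-family IDs."""
    if not genomes:
        raise ValueError("need at least one genome")
    family_sets = [set(families) for families in genomes.values()]
    core = set.intersection(*family_sets)   # families present in every genome
    pan = set.union(*family_sets)           # families present in any genome
    counts = Counter(fam for fams in family_sets for fam in fams)
    singletons = {fam for fam, n in counts.items() if n == 1}  # in one genome only
    return {"core": core, "pan": pan, "singletons": singletons}


# Hypothetical toy input: three strains with made-up family labels.
toy = {
    "strain_1": {"famA", "famB", "famC"},
    "strain_2": {"famA", "famB", "famD"},
    "strain_3": {"famA", "famE"},
}
summary = summarize_pan_genome(toy)
print(len(summary["core"]), len(summary["pan"]), len(summary["singletons"]))
```

Adding a genome can only keep the core the same or shrink it, and can only keep the pan-genome the same or grow it, which matches the pattern described above.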

Theories of bacterial genome evolution

Bacteria lose a large number of genes as they transition from free-living or facultatively parasitic life cycles to permanent host-dependent life. Towards the lower end of the scale of bacterial genome size are the mycoplasmas and related bacteria. Early molecular phylogenetic studies revealed that mycoplasmas represented an evolutionarily derived state, contrary to prior hypotheses. Furthermore, it is now known that mycoplasmas are just one of many instances of genome shrinkage in obligately host-associated bacteria. Other examples are Rickettsia, Buchnera aphidicola, and Borrelia burgdorferi.[13]

Small genome size in such species is associated with certain particularities, such as rapid evolution of polypeptide sequences and low GC content in the genome. The convergent evolution of these qualities in unrelated bacteria suggests that an obligate association with a host promotes genome reduction.[13]

Given that in almost all fully sequenced bacterial genomes over 80% of the sequence consists of intact ORFs, and that gene length is nearly constant at ~1 kb per gene, it is inferred that small genomes have few metabolic capabilities. While free-living bacteria, such as E. coli, Salmonella species, or Bacillus species, usually have 1500 to 6000 proteins encoded in their DNA, obligately pathogenic bacteria often have as few as 500 to 1000 such proteins.[13]
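The inference from genome size to gene count is a back-of-the-envelope calculation. The sketch below uses the two figures given above (a coding fraction of about 80% and about 1 kb per gene) as default parameters. Since the coding fraction is stated as "over 80%", the result is only a rough estimate, and the function name is an illustrative choice.

```python
def estimate_gene_count(genome_size_bp, coding_fraction=0.8, mean_gene_length_bp=1_000):
    """Rough gene count: coding bases divided by the typical gene length."""
    return round(genome_size_bp * coding_fraction / mean_gene_length_bp)


# A 4 Mbp genome versus a 160 kbp genome under the same assumptions.
print(estimate_gene_count(4_000_000))
print(estimate_gene_count(160_000))
```

Under these assumptions the gene count scales linearly with genome size, which is why a genome of a few hundred kilobases can hold only a few hundred genes.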

One candidate explanation is that reduced genomes maintain genes that are necessary for vital processes pertaining to cellular growth and replication, in addition to those genes that are required to survive in the bacteria's ecological niche. However, sequence data contradict this hypothesis. The set of universal orthologs amongst eubacteria comprises only 15% of each genome. Thus, each lineage has taken a different evolutionary path to reduced size. Because universal cellular processes require over 80 genes, variation in genes implies that the same functions can be achieved by exploitation of nonhomologous genes.[13]

Host-dependent bacteria are able to secure many compounds required for metabolism from the host's cytoplasm or tissue. They can, in turn, discard their own biosynthetic pathways and associated genes. This removal explains many of the specific gene losses. For example, Rickettsia species, which rely on specific energy substrates from their host, have lost many of their native energy metabolism genes. Similarly, most small genomes have lost their amino acid biosynthesis genes, as these compounds are supplied by the host instead. One exception is Buchnera, an obligate maternally transmitted symbiont of aphids. It retains 54 genes for biosynthesis of crucial amino acids, but no longer has pathways for those amino acids that the host can synthesize. Pathways for nucleotide biosynthesis are gone from many reduced genomes. Those anabolic pathways that evolved through niche adaptation remain in particular genomes.[13]

The hypothesis that unused genes are eventually removed does not explain the loss of many genes that would remain helpful in obligate pathogens. For example, many eliminated genes code for products that are involved in universal cellular processes, including replication, transcription, and translation. Even genes supporting DNA recombination and repair are deleted from every small genome. In addition, small genomes have fewer tRNAs, so a single anticodon must pair with multiple codons, which likely yields less-than-optimal translation machinery. It is unknown why obligate intracellular pathogens would benefit by retaining fewer tRNAs and fewer DNA repair enzymes.[13]

Another factor to consider is the change in population that corresponds to an evolution towards an obligately pathogenic life. Such a shift in lifestyle often results in a reduction in the genetic population size of a lineage, since there is a finite number of hosts to occupy. The resulting genetic drift may lead to fixation of mutations that inactivate otherwise beneficial genes, or may decrease the efficiency of gene products. Hence, not only will useless genes be lost (as mutations disrupt them once the bacteria have settled into host dependency), but beneficial genes may also be lost if genetic drift renders purifying selection ineffective.[13]

The number of universally maintained genes is small and inadequate for independent cellular growth and replication, so that small genome species must achieve such feats by means of varying genes. This is done partly through nonorthologous gene displacement. That is, the role of one gene is replaced by another gene that achieves the same function. Redundancy within the ancestral, larger genome is eliminated. The descendant small genome content depends on the content of chromosomal deletions that occur in the early stages of genome reduction.[13]

The very small genome of M. genitalium possesses dispensable genes. In a study in which single genes of this organism were inactivated using transposon-mediated mutagenesis, at least 129 of its 484 ORFs were not required for growth. A much smaller genome than that of M. genitalium is therefore feasible.[13]

Doubling time

One theory predicts that bacteria have smaller genomes due to selective pressure on genome size to ensure faster replication. The theory is based upon the logical premise that smaller bacterial genomes take less time to replicate; consequently, smaller genomes would be selected preferentially due to enhanced fitness. However, a study by Mira et al. found little to no correlation between genome size and doubling time.[14] The data indicate that selection for faster replication is not a suitable explanation for the small sizes of bacterial genomes. Still, many researchers believe there is some selective pressure on bacteria to maintain small genome size.

Deletional bias

Selection is but one process involved in evolution. Two other major processes (mutation and genetic drift) can account for the genome sizes of various types of bacteria. A study done by Mira et al. examined the size of insertions and deletions in bacterial pseudogenes. Results indicated that mutational deletions tend to be larger than insertions in bacteria in the absence of gene transfer or gene duplication.[14] Insertions caused by horizontal or lateral gene transfer and gene duplication tend to involve transfer of large amounts of genetic material. Assuming a lack of these processes, genomes will tend to reduce in size in the absence of selective constraint. Evidence of a deletional bias is present in the respective genome sizes of free-living bacteria, facultative and recently derived parasites and obligate parasites and symbionts.

Free-living bacteria tend to have large population sizes and more opportunity for gene transfer. As such, selection can effectively operate on free-living bacteria to remove deleterious sequences, resulting in a relatively small number of pseudogenes. Additional selective pressure arises because free-living bacteria must produce all gene products independently of a host. Given that there is sufficient opportunity for gene transfer to occur and there are selective pressures against even slightly deleterious deletions, it is intuitive that free-living bacteria should have the largest genomes of all bacterial types.

Recently-formed parasites undergo severe bottlenecks and can rely on host environments to provide gene products. As such, in recently-formed and facultative parasites, there is an accumulation of pseudogenes and transposable elements due to a lack of selective pressure against deletions. The population bottlenecks reduce gene transfer and as such, deletional bias ensures the reduction of genome size in parasitic bacteria.

Obligate parasites and symbionts have the smallest genome sizes due to prolonged effects of deletional bias. Parasites which have evolved to occupy specific niches are not exposed to much selective pressure. As such, genetic drift dominates the evolution of niche-specific bacteria. Extended exposure to deletional bias ensures the removal of most superfluous sequences. Symbionts occur in drastically lower numbers and undergo the most severe bottlenecks of any bacterial type. There is almost no opportunity for gene transfer for endosymbiotic bacteria, and thus genome compaction can be extreme. One of the smallest bacterial genomes ever to be sequenced is that of the endosymbiont Carsonella ruddii.[15] At 160 kbp, the genome of Carsonella is one of the most streamlined examples of a genome examined to date.

Genomic reduction

Molecular phylogenetics has revealed that every clade of bacteria with genome sizes under 2 Mb was derived from ancestors with much larger genomes, thus refuting the hypothesis that bacteria evolved by the successive doubling of small-genomed ancestors.[16] Studies performed by Nilsson et al. examined the rates of genome reduction in obligately host-associated bacteria. Bacteria were cultured with frequent bottlenecks and grown in serial passage to reduce gene transfer, mimicking the conditions experienced by endosymbiotic bacteria. The data predicted that bacteria with a one-day generation time could lose as many as 1,000 kbp in as few as 50,000 years (a relatively short evolutionary time period). Furthermore, after deleting genes essential to the methyl-directed DNA mismatch repair (MMR) system, it was shown that the rate of bacterial genome size reduction increased by as much as 50 times.[17] These results indicate that genome size reduction can occur relatively rapidly, and that loss of certain genes can speed up the process of bacterial genome compaction.
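The figures above can be turned into rough per-year and per-generation rates. The short calculation below uses only the numbers quoted in the text and treats them as upper bounds ("as many as", "as much as"). The assumption of 365 generations per year follows from the stated one-day generation time.

```python
# Upper-bound figures quoted in the text.
LOSS_BP = 1_000 * 1_000        # 1,000 kbp expressed in base pairs
YEARS = 50_000
GENERATIONS_PER_YEAR = 365     # one-day generation time
MMR_SPEEDUP = 50               # maximum reported rate increase without MMR

bp_per_year = LOSS_BP / YEARS
bp_per_generation = bp_per_year / GENERATIONS_PER_YEAR
bp_per_year_without_mmr = bp_per_year * MMR_SPEEDUP

print(f"{bp_per_year:.0f} bp lost per year")
print(f"{bp_per_generation:.3f} bp lost per generation")
print(f"{bp_per_year_without_mmr:.0f} bp lost per year without MMR")
```

This works out to roughly 20 bp per year, or about 0.05 bp per generation, rising to roughly 1,000 bp per year at the maximum reported speed-up.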

This is not to suggest that all bacterial genomes are reducing in size and complexity. While many types of bacteria have reduced in genome size from an ancestral state, there are still a huge number of bacteria that maintained or increased genome size over ancestral states.[8] Free-living bacteria experience huge population sizes, fast generation times and a relatively high potential for gene transfer. While deletional bias tends to remove unnecessary sequences, selection can operate significantly amongst free-living bacteria resulting in evolution of new genes and processes.

Horizontal gene transfer

Unlike eukaryotes, which evolve mainly through the modification of existing genetic information, bacteria have acquired a large percentage of their genetic diversity by the horizontal transfer of genes. This creates quite dynamic genomes, in which DNA can be introduced into and removed from the chromosome.[18]

Bacteria have more variation in their metabolic properties, cellular structures, and lifestyles than can be accounted for by point mutations alone. For example, none of the phenotypic traits that distinguish E. coli from Salmonella enterica can be attributed to point mutation. On the contrary, evidence suggests that horizontal gene transfer has bolstered the diversification and speciation of many bacteria.[18]

Horizontal gene transfer is often detected via DNA sequence information. DNA segments obtained by this mechanism often reveal a narrow phylogenetic distribution between related species. Furthermore, these regions sometimes display an unexpected level of similarity to genes from taxa that are assumed to be quite divergent.[18]

Although gene comparisons and phylogenetic studies are helpful in investigating horizontal gene transfer, the DNA sequences of genes are even more revelatory of their origin and ancestry within a genome. Bacterial species differ widely in overall GC content, although the genes in any one species' genome are fairly similar with respect to base composition, patterns of codon usage, and frequencies of di- and trinucleotides. As a result, sequences that are newly acquired through lateral transfer can be identified via their characteristics, which remain those of the donor. For example, many of the S. enterica genes that are not present in E. coli have base compositions that differ from the overall 52% GC content of the entire chromosome. Within this species, some lineages have more than a megabase of DNA that is not present in other lineages. The base compositions of these lineage-specific sequences imply that at least half of these sequences were captured through lateral transfer. Furthermore, the regions adjacent to horizontally obtained genes often have remnants of translocatable elements, transfer origins of plasmids, or known attachment sites of phage integrases.[18]
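A minimal sketch of the base-composition approach is to compare the GC content of each window along a sequence with the GC content of the whole sequence and flag windows that deviate strongly. The window size, step, and deviation threshold below are arbitrary illustrative values. Published methods also use codon usage and oligonucleotide frequencies, which this sketch ignores.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(genome, window=100, step=100, max_deviation=0.10):
    """Return (start, end, gc) for windows whose GC differs from the genome-wide GC.

    Hypothetical thresholds; real analyses would tune these to the organism.
    """
    genome_gc = gc_content(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        window_gc = gc_content(genome[start:start + window])
        if abs(window_gc - genome_gc) > max_deviation:
            flagged.append((start, start + window, window_gc))
    return flagged


# Toy sequence: a 50% GC background with a 100 bp AT-only insert in the middle.
toy_genome = "ACGT" * 250 + "AAAT" * 25 + "ACGT" * 250
print(atypical_windows(toy_genome))
```

As noted in the text, an approach of this kind misses sequences acquired from donors whose base composition resembles that of the recipient.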

In some species, a large proportion of laterally transferred genes originate from plasmid-, phage-, or transposon-related sequences.[18]

Although sequence-based methods reveal the prevalence of horizontal gene transfer in bacteria, the results tend to be underestimates of the magnitude of this mechanism, since sequences obtained from donors whose sequence characteristics are similar to those of the recipient will avoid detection.[18]

Comparisons of completely sequenced genomes confirm that bacterial chromosomes are amalgams of ancestral and laterally acquired sequences. The hyperthermophilic Eubacteria Aquifex aeolicus and Thermotoga maritima each have many genes that are similar in protein sequence to homologues in thermophilic Archaea. 24% of Thermotoga's 1,877 ORFs and 16% of Aquifex's 1,512 ORFs show high matches to an Archaeal protein, while mesophiles such as E. coli and B. subtilis have far smaller proportions of genes that are most like Archaeal homologues.[18]

Mechanisms of lateral transfer

The genesis of new abilities due to horizontal gene transfer has three requirements. First, there must exist a possible route for the donor DNA to be accepted by the recipient cell. Additionally, the obtained sequence must be integrated with the rest of the genome. Finally, these integrated genes must benefit the recipient bacterial organism. The first two steps can be achieved via three mechanisms: transformation, transduction and conjugation.[18]

Transformation involves the uptake of naked DNA from the environment. Through transformation, DNA can be transmitted between distantly related organisms. Some bacterial species, such as Haemophilus influenzae and Neisseria gonorrhoeae, are continuously competent to accept DNA. Other species, such as Bacillus subtilis and Streptococcus pneumoniae, become competent when they enter a particular phase in their lifecycle.

Transformation in N. gonorrhoeae and H. influenzae is effective only if particular recognition sequences are found in the recipient genomes (5'-GCCGTCTGAA-3' and 5'-AAGTGCGGT-3', respectively). Although the existence of certain uptake sequences improves transformation capability between related species, many of the inherently competent bacterial species, such as B. subtilis and S. pneumoniae, do not display sequence preference.
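Locating such recognition sequences in a DNA string is a simple search on both strands. The sketch below uses the two motifs quoted above; the function names and the toy input are illustrative, and exact matching is an assumption made for simplicity.

```python
UPTAKE_SEQUENCES = {
    "N. gonorrhoeae": "GCCGTCTGAA",
    "H. influenzae": "AAGTGCGGT",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_uptake_sites(dna, motif):
    """Return sorted (position, strand) pairs for exact matches on either strand.

    Positions are 0-based offsets on the forward strand. A palindromic motif
    would be reported on both strands; neither motif above is palindromic.
    """
    dna = dna.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = dna.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = dna.find(pattern, start + 1)
    return sorted(hits)


# Toy input: one forward copy and one reverse-strand copy of the gonococcal motif.
motif = UPTAKE_SEQUENCES["N. gonorrhoeae"]
toy_dna = "TTT" + motif + "CCC" + reverse_complement(motif) + "AA"
print(find_uptake_sites(toy_dna, motif))
```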

New genes may be introduced into bacteria by a bacteriophage that has replicated within a donor through generalized transduction or specialized transduction. The amount of DNA that can be transmitted in one event is constrained by the size of the phage capsid, with an upper limit of about 100 kilobases. While phages are numerous in the environment, the range of microorganisms that can be transduced depends on receptor recognition by the bacteriophage. Transduction does not require donor and recipient cells to be present at the same time or in the same place. Phage-encoded proteins both mediate the transfer of DNA into the recipient cytoplasm and assist integration of DNA into the chromosome.[18]

Conjugation involves physical contact between donor and recipient cells and is able to mediate transfers of genes between domains, such as between bacteria and yeast. DNA is transmitted from donor to recipient by either a self-transmissible or a mobilizable plasmid. Conjugation may also mediate the transfer of chromosomal sequences by plasmids that integrate into the chromosome.

Despite the multitude of mechanisms mediating gene transfer among bacteria, the process's success is not guaranteed unless the received sequence is stably maintained in the recipient. DNA integration can be sustained through one of many processes. One is persistence as an episome, another is homologous recombination, and still another is illegitimate incorporation through chance double-strand break repair.[18]

Traits introduced through lateral gene transfer

Antimicrobial resistance genes grant an organism the ability to expand its ecological niche, since it can now survive in the presence of previously lethal compounds. As the benefit a bacterium gains from receiving such genes is independent of time and place, those sequences that are highly mobile are selected for. Plasmids are readily mobilized between taxa and are the most frequent way by which bacteria acquire antibiotic resistance genes.

Adoption of a pathogenic lifestyle often yields a fundamental shift in an organism's ecological niche. The erratic phylogenetic distribution of pathogenic organisms implies that bacterial virulence is a consequence of the presence or acquisition of genes that are missing in avirulent forms. Evidence of this includes the discovery of large 'virulence' plasmids in pathogenic Shigella and Yersinia, as well as the ability to bestow pathogenic properties onto E. coli via experimental exposure to genes from other species.[18]

References


  1. McCutcheon, J. P.; Von Dohlen, C. D. (2011). "An Interdependent Metabolic Patchwork in the Nested Symbiosis of Mealybugs". Current Biology. 21 (16): 1366–1372. doi:10.1016/j.cub.2011.06.051. PMC 3169327. PMID 21835622.
  2. Van Leuven, JT; Meister, RC; Simon, C; McCutcheon, JP (11 September 2014). "Sympatric speciation in a bacterial endosymbiont results in two genomes with the functionality of one". Cell. 158 (6): 1270–80. doi:10.1016/j.cell.2014.07.047. PMID 25175626.
  3. Han, K; Li, ZF; Peng, R; Zhu, LP; Zhou, T; Wang, LG; Li, SG; Zhang, XB; Hu, W; Wu, ZH; Qin, N; Li, YZ (2013). "Extraordinary expansion of a Sorangium cellulosum genome from an alkaline milieu". Scientific Reports. 3: 2101. doi:10.1038/srep02101. PMC 3696898. PMID 23812535.
  4. Hou, Lin. "Distinct Gene Number-Genome Size Relationships for Eukaryotes and Non-Eukaryotes: Gene Content Estimation for Dinoflagellate Genomes". PLoS One.
  5. Land, et al. "Insights from 20 years of bacterial genome sequencing". Funct Integr Genomics. 2015; 15(2): 141–161. This article contains quotations from this source, which is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
  6. Gregory, T. R. (2005). "Synergy between sequence and size in Large-scale genomics". Nature Reviews Genetics. 6 (9): 699–708. doi:10.1038/nrg1674. PMID 16151375.
  7. Cole, S. T.; Eiglmeier, K.; Parkhill, J.; James, K. D.; Thomson, N. R.; Wheeler, P. R.; Honoré, N.; Garnier, T.; Churcher, C.; Harris, D.; Mungall, K.; Basham, D.; Brown, D.; Chillingworth, T.; Connor, R.; Davies, R. M.; Devlin, K.; Duthoy, S.; Feltwell, T.; Fraser, A.; Hamlin, N.; Holroyd, S.; Hornsby, T.; Jagels, K.; Lacroix, C.; MacLean, J.; Moule, S.; Murphy, L.; Oliver, K.; Quail, M. A. (2001). "Massive gene decay in the leprosy bacillus". Nature. 409 (6823): 1007–1011. doi:10.1038/35059006. PMID 11234002.
  8. Ochman, H. (2005). "Genomes on the shrink". Proceedings of the National Academy of Sciences. 102 (34): 11959–11960. doi:10.1073/pnas.0505863102. PMC 1189353.
  9. Gregory, T. Ryan (2005). The evolution of the genome. Burlington, MA: Elsevier Academic. ISBN 0123014638.
  10. Koonin, E. V. (2009). "Evolution of genome architecture". The International Journal of Biochemistry & Cell Biology. 41 (2): 298–306. doi:10.1016/j.biocel.2008.09.015. PMC 3272702.
  11. Kuo, C. -H.; Moran, N. A.; Ochman, H. (2009). "The consequences of genetic drift for bacterial genome complexity". Genome Research. 19 (8): 1450–1454. doi:10.1101/gr.091785.109. PMC 2720180. PMID 19502381.
  12. Ochman, H.; Davalos, L. M. (2006). "The Nature and Dynamics of Bacterial Genomes". Science. 311 (5768): 1730–1733. doi:10.1126/science.1119966. PMID 16556833.
  13. Moran. "Microbial Minimalism: Genome Reduction in Bacterial Pathogens". Cell. Volume 108, Issue 5. 8 March 2002.
  14. Mira, A.; Ochman, H.; Moran, N. A. (2001). "Deletional bias and the evolution of bacterial genomes". Trends in Genetics. 17 (10): 589–596. doi:10.1016/S0168-9525(01)02447-7. PMID 11585665.
  15. Nakabachi, A.; Yamashita, A.; Toh, H.; Ishikawa, H.; Dunbar, H. E.; Moran, N. A.; Hattori, M. (2006). "The 160-Kilobase Genome of the Bacterial Endosymbiont Carsonella". Science. 314 (5797): 267. doi:10.1126/science.1134196. PMID 17038615.
  16. Ochman. "Genomes on the shrink". PNAS. Vol. 102, no. 34.
  17. Nilsson, A. I.; Koskiniemi, S.; Eriksson, S.; Kugelberg, E.; Hinton, J. C.; Andersson, D. I. (2005). "Bacterial genome size reduction by experimental evolution". Proceedings of the National Academy of Sciences. 102 (34): 12112–12116. doi:10.1073/pnas.0503654102. PMC 1189319. PMID 16099836.
  18. Ochman, Lawrence, and Groisman. "Lateral gene transfer and the nature of bacterial innovation". Nature. 18 May 2000.
100K Pathogen Genome Project

The 100K Pathogen Genome Project was launched in July 2012 by Bart Weimer (UC Davis) as an academic, public, and private partnership. It aims to sequence the genomes of 100,000 infectious microorganisms to create a database of bacterial genome sequences for use in public health, outbreak detection, and bacterial pathogen detection. The project is intended to speed up the diagnosis of foodborne illnesses, shorten infectious disease outbreaks, and provide a roadmap for developing tests to identify pathogens and trace their origins more quickly.

Partners announced at the launch of the project were UC Davis, Agilent Technologies, and the US Food and Drug Administration, with the US Centers for Disease Control and Prevention and the US Department of Agriculture noted as collaborators. As the project has proceeded, the partnership has evolved, with new partners joining or replacing some of the founding partners. The 100K Pathogen Genome Project was selected by the IBM/Mars Food Safety Consortium for metagenomic sequences.

The 100K Pathogen Genome Project is conducting high-throughput next-generation sequencing (NGS) to investigate the genomes of targeted microorganisms, with whole genome sequencing to be carried out on a small number of microorganisms for use as reference genomes. Most bacterial strains will be sequenced and assembled as draft genomes; however, the project has also produced closed genomes for a variety of enteric pathogens in the 100K bioproject. Data from this project is also available for download at the 100K Pathogen Genome Project website.

This strategy enables worldwide collaboration to identify sets of genetic biomarkers associated with important pathogen traits. This five-year microbial pathogen project will result in a free, public database containing the sequence information for each pathogen's genome. The completed gene sequences will be stored in the public database of the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH). Using the database, scientists will be able to develop new methods of controlling disease-causing bacteria in the food chain.


BASys (Bacterial Annotation System) is a freely available web server that can be used to perform automated, comprehensive annotation of bacterial genomes. With the advent of next-generation DNA sequencing it is now possible to sequence the complete genome of a bacterium (typically ~4 million bases) within a single day. This has led to an explosion in the number of fully sequenced microbes. As of 2013, there were more than 2700 fully sequenced bacterial genomes deposited with GenBank. However, a continuing challenge in microbial genomics is finding the resources or tools for annotating the large number of newly sequenced genomes. BASys was developed in 2005 in anticipation of these needs and was the world's first publicly accessible microbial genome annotation web server. Because of its widespread popularity, the BASys server was updated in 2011 through the addition of multiple server nodes to handle the large number of queries it was receiving.

The BASys server is designed to accept either assembled genome data (raw DNA sequence data) or complete proteome assignments as input. If raw DNA sequence is provided, BASys employs Glimmer (version 2.1.3) to identify the genes. The output from BASys is a comprehensive genome-wide annotation (with ~60 annotation subfields for each gene) and a zoomable, hyperlinked genome map of the query genome. BASys uses nearly 30 different programs to determine and annotate gene/protein names, GO functions, COG functions, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways.

In addition to its extensive annotation for each gene/protein in the query genome, BASys also generates colorful, clickable and fully zoomable circular maps of each input chromosome. These bacterial genome maps are generated using a program called CGView (Circular Genome Viewer), which was developed in 2004. The genome maps are designed to allow rapid navigation and detailed visualization of all the BASys-generated gene annotations. A complete BASys run takes approximately 16 h for an average bacterial chromosome (approximately 4 megabases). BASys annotations may be viewed and downloaded anonymously or through a password-protected access system. BASys will store its bacterial genome annotations on the server for a maximum of 180 days. BASys handles approximately 1000 submissions a year. BASys is accessible at


BacMap is a freely available web-accessible database containing fully annotated, fully zoomable and fully searchable chromosome maps from more than 2500 prokaryotic (archaebacterial and eubacterial) species. BacMap was originally developed in 2005 to address the challenges of viewing and navigating through the growing numbers of bacterial genomes that were being generated through large-scale sequencing efforts. Since it was first introduced, the number of bacterial genomes in BacMap has grown more than 15-fold. Essentially, BacMap functions as an online visual atlas of microbial genomes. All of the genome annotations in BacMap were generated through the BASys genome annotation system. BASys is a widely used microbial annotation infrastructure that performs comprehensive bioinformatic analyses on raw (or labeled) bacterial genome sequence data. All of the genome (chromosome) maps in BacMap were constructed using the program known as CGView. CGView is a popular visualization program for generating interactive, web-compatible circular chromosome maps (Fig. 1). Each chromosome map in BacMap is extensively hyperlinked and each chromosome image can be interactively navigated, expanded and rotated using navigation buttons or hyperlinks. All identified genes in a BacMap chromosome map are colored according to coding direction and, when sufficiently zoomed in, gene labels are visible. Each gene label on a BacMap genome map is also hyperlinked to a 'gene card' (Fig. 2). The gene cards provide detailed information about the corresponding DNA and protein sequences. Each genome map in BacMap is searchable via BLAST and a gene name/synonym search.

Because of the growing interest in metagenomics and large-scale bacterial genome analysis, BacMap was extensively updated in 2012. With the latest update, all of BacMap’s bacterial genome maps now have separate prophage genome maps as well as separate tRNA and rRNA maps. Each bacterial chromosome entry in BacMap now contains graphs and tables on a variety of gene and protein statistics. All of the bacterial species listed in BacMap now have bacterial 'biography' cards, with corresponding information on the microbe’s taxonomy, phenotypic traits, other descriptions and electron microscopy or other high-resolution images of the microbe itself. BacMap also has a number of updated data browsing and text searching tools that allow filtering, sorting and more facile display of the chromosome maps and their contents.


CFP-10, also known as ESAT-6-like protein esxB, secreted antigenic protein MTSA-10, or 10 kDa culture filtrate antigen CFP-10, is a protein encoded by the esxB gene. CFP-10 is a 10 kDa secreted antigen from Mycobacterium tuberculosis. It forms a 1:1 heterodimeric complex with ESAT-6. Both genes are expressed from the RD1 region of the bacterial genome and play a key role in the virulence of the infection.

Fertility factor (bacteria)

The fertility factor (first named F by one of its discoverers Esther Lederberg; also called the sex factor in E. coli or the F sex factor; also called F-plasmid) allows genes to be transferred from one bacterium carrying the factor to another bacterium lacking the factor by conjugation. The F factor is carried on the F episome, the first episome to be discovered. Unlike other plasmids, F factor is constitutive for transfer proteins due to a mutation in the gene finO. The F plasmid belongs to a class of conjugative plasmids that control sexual functions of bacteria with a fertility inhibition (Fin) system.


GeneMark is a generic name for a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. Developed in 1993, the original GeneMark was used in 1995 as a primary gene prediction tool for annotation of the first completely sequenced bacterial genome, that of Haemophilus influenzae, and in 1996 for the first archaeal genome, that of Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in two DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type (protein-coding and non-coding). The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of it being either "protein-coding" (carrying genetic code) in each of six possible reading frames (including three frames in the complementary DNA strand) or "non-coding". The original GeneMark (developed before the HMM era in bioinformatics) is an HMM-like algorithm; it can be viewed as an approximation to the posterior decoding algorithm known from HMM theory, applied to an appropriately defined HMM.
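The posterior computation described above can be illustrated with a deliberately simplified sketch. It is not GeneMark's actual model or parameter set: it uses zero-order, codon-position-specific nucleotide frequencies in place of higher-order Markov chains, and every probability and prior below is a made-up toy value chosen only for illustration. The function name `frame_posteriors` and the hypothesis labels are likewise hypothetical.

```python
import math

# Toy, made-up parameters (assumptions for illustration, not trained values).
# CODING[k][base] is the probability of a base at codon position k (0, 1, 2).
CODING = [
    {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
]
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, frame):
    """Log-likelihood of seq if base i sits at codon position (i + frame) % 3."""
    return sum(math.log(CODING[(i + frame) % 3][b]) for i, b in enumerate(seq))


def log_lik_noncoding(seq):
    return sum(math.log(NONCODING[b]) for b in seq)


def frame_posteriors(seq, prior_coding=0.5):
    """Posterior over six coding frames (both strands) plus 'non-coding'."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    log_joint = {}
    for f in range(3):
        # The coding prior is split evenly over the six frames (an assumption).
        log_joint[f"+{f + 1}"] = math.log(prior_coding / 6) + log_lik_coding(seq, f)
        log_joint[f"-{f + 1}"] = math.log(prior_coding / 6) + log_lik_coding(rc, f)
    log_joint["non-coding"] = math.log(1 - prior_coding) + log_lik_noncoding(seq)
    # Normalise with the log-sum-exp trick to avoid underflow on long fragments.
    m = max(log_joint.values())
    total = sum(math.exp(v - m) for v in log_joint.values())
    return {h: math.exp(v - m) / total for h, v in log_joint.items()}


posteriors = frame_posteriors("GAG" * 10)
best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))
```

Applying Bayes' rule to all seven hypotheses at once is what allows both strands to be evaluated simultaneously; a real predictor would slide such a window along the genome and use models trained on the species of interest.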


In the fields of molecular biology and genetics, a genome is the genetic material of an organism. It consists of DNA (or RNA in RNA viruses). The genome includes both the genes (the coding regions) and the noncoding DNA, as well as mitochondrial DNA and chloroplast DNA. The study of the genome is called genomics.

George Weinstock

George M. Weinstock (born February 6, 1949) is an American geneticist and microbiologist on the faculty of The Jackson Laboratory for Genomic Medicine, where he is a professor and the associate director for microbial genomics. Before joining The Jackson Laboratory, he taught at Washington University in St. Louis and served as associate director of The Genome Institute. Previously, Dr. Weinstock was Co-Director of the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine in Houston, Texas, and Professor of Molecular and Human Genetics there.[1] He received his B.S. degree from the University of Michigan in 1970 and his Ph.D. from the Massachusetts Institute of Technology in 1977. He has spent most of his career taking genomic approaches to study fundamental biological processes.

Weinstock's parents met during the Manhattan Project in Los Alamos, New Mexico, and he grew up meeting many of the participants in the atomic bomb project and their colleagues. He performed his PhD thesis under David Botstein at MIT, studying the structure of phage P22 chromosome.

As a postdoctoral fellow with Dr. I. R. Lehman at Stanford University School of Medicine, Dr. Weinstock and Kevin McEntee discovered that the RecA protein of E. coli catalyzed strand transfer in genetic recombination. Later, as a faculty member at the University of Texas at Houston, he led one of the first bacterial genome projects, collaborating with The Institute for Genomic Research to sequence the entire genome of a bacterium, Treponema pallidum, the organism that causes syphilis. In 1999 he joined Richard Gibbs at the HGSC, one of the five main centers working on the Human Genome Project. The HGSC produced sequences of human chromosomes 3, 12 and X. Dr. Weinstock was a principal investigator in projects producing genome sequences for rat, mouse, macaque, bovine, sea urchin, honey bee, fruit fly and many microbial genomes, as well as one of the first personal genome projects, sequencing Dr. James Watson’s genome using next-generation sequencing technology. He was a leader of the Human Microbiome Project, studying the collection of microbes that colonize the human body.

Hamilton O. Smith

Hamilton Othanel Smith (born August 23, 1931) is an American microbiologist and Nobel laureate. Smith graduated from University Laboratory High School of Urbana, Illinois. He attended the University of Illinois at Urbana-Champaign, but in 1950 transferred to the University of California, Berkeley, where he earned his B.A. in Mathematics in 1952.[1] He received his medical degree from Johns Hopkins University in 1956. In 1975, he was awarded a Guggenheim Fellowship, which he spent at the University of Zurich.

In 1970, Smith and Kent W. Wilcox discovered the first type II restriction enzyme, now called HindII. Smith went on to discover DNA methylases that constitute the other half of the bacterial host restriction and modification systems, as hypothesized by Werner Arber of Switzerland. He was awarded the Nobel Prize in Physiology or Medicine in 1978 for discovering type II restriction enzymes, with Werner Arber and Daniel Nathans as co-recipients.

He later became a leading figure in the nascent field of genomics, when in 1995 he and a team at The Institute for Genomic Research sequenced the first bacterial genome, that of Haemophilus influenzae. H. influenzae was the same organism in which Smith had discovered restriction enzymes in the late 1960s. He subsequently played a key role in the sequencing of many of the early genomes at The Institute for Genomic Research, and in the assembly of the human genome at Celera Genomics, which he joined when it was founded in 1998.

More recently, he has directed a team at the J. Craig Venter Institute that works towards creating a partially synthetic bacterium, Mycoplasma laboratorium. In 2003 the same group synthetically assembled the genome of a virus, Phi X 174 bacteriophage. Currently, Smith is scientific director of privately held Synthetic Genomics, which was founded in 2005 by Craig Venter to continue this work. Synthetic Genomics is working to produce biofuels on an industrial scale using recombinant algae and other microorganisms.

Listeria monocytogenes non-coding RNA

Listeria monocytogenes is a Gram-positive bacterium that causes food-borne infections such as listeriosis. This bacterium is ubiquitous in the environment, where it can act either as a saprophyte when free-living or as a pathogen when entering a host organism. Many non-coding RNAs have been identified within the bacterial genome; several of these have been classified as novel non-coding RNAs and may contribute to pathogenesis. Tiling arrays and mutagenesis identified many non-coding RNAs within the L. monocytogenes genome, and the location of these non-coding RNAs within the bacterial genome was confirmed by RACE (rapid amplification of cDNA ends) analysis. These studies showed that the expression of many non-coding RNAs was dependent on the environment and that several of these non-coding RNAs act as cis-regulatory elements. Comparisons between previously characterized non-coding RNAs and those present in the L. monocytogenes genome identified 50 novel non-coding RNAs in L. monocytogenes. An additional comparative study between the pathogenic L. monocytogenes and the non-pathogenic L. innocua identified several non-coding RNAs that are present only in L. monocytogenes, which suggests that these ncRNAs may have a role in pathogenesis. The tables below summarize the location, flanking genes and characteristics of the novel small non-coding RNAs identified and of the previously characterized non-coding RNAs present in L. monocytogenes.

Novel Non-coding RNAs

aArrows indicate the sense of the gene on the genome. Bold arrows indicate gene absent from L. innocua.

Listeria monocytogenes EGD-e strain was used in these studies (EMBL accession AL591824.1).

Characterised non-coding RNAs

Mycoplasma genitalium

Mycoplasma genitalium (MG, commonly known as Mgen) is a sexually transmitted, small and pathogenic bacterium that lives on the skin cells of the urinary and genital tracts in humans. Mgen infection is becoming increasingly common. Resistance to multiple antibiotics is occurring, including to azithromycin, which until recently was the most reliable treatment. The bacterium was first isolated from the urogenital tract of humans in 1981, and was eventually identified as a new species of Mycoplasma in 1983. It can cause negative health effects in men and women. It also increases the risk of HIV spread, with higher occurrences in homosexual men and those previously treated with azithromycin. Specifically, it causes urethritis in both men and women, and also cervicitis and pelvic inflammation in women. It presents clinically similar symptoms to those of Chlamydia trachomatis infection and has shown higher incidence rates than both Chlamydia trachomatis and Neisseria gonorrhoeae infections in some populations. Its complete genome sequence was published in 1995 (size 0.58 Mbp, with 475 genes). It was regarded as the cellular unit with the smallest genome size (in Mbp) until 2003, when a new species of Archaea, namely Nanoarchaeum equitans, was sequenced (0.49 Mbp, with 540 genes). However, Mgen still has the smallest genome of any known (naturally occurring) self-replicating organism and thus is often the organism of choice in minimal genome research.
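The two genome figures quoted above imply different gene densities, since the smaller N. equitans genome carries more genes. A minimal sketch of that arithmetic, using only the sizes and gene counts given in the text (the helper name `bp_per_gene` is hypothetical):

```python
# Genome size in base pairs and gene count, as quoted in the text.
GENOMES = {
    "Mycoplasma genitalium": (580_000, 475),
    "Nanoarchaeum equitans": (490_000, 540),
}


def bp_per_gene(size_bp, n_genes):
    """Average number of base pairs of genome per annotated gene."""
    return size_bp / n_genes


for name, (size_bp, n_genes) in GENOMES.items():
    print(f"{name}: {bp_per_gene(size_bp, n_genes):.0f} bp per gene")
```

This works out to roughly 1.2 kb of genome per gene for M. genitalium and roughly 0.9 kb per gene for N. equitans. The figure is an average over the whole genome and says nothing about individual gene lengths.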

The synthetic genome of Mgen, named Mycoplasma genitalium JCVI-1.0 (after the research centre, the J. Craig Venter Institute, where it was synthesised), was produced in 2008, becoming the first organism with a synthetic genome. In 2014, a protein from M. genitalium called Protein M was described.

Mycoplasma laboratorium

Mycoplasma laboratorium is a designed, partially synthetic species of bacterium derived from the genome of Mycoplasma genitalium. This effort in synthetic biology is being undertaken at the J. Craig Venter Institute by a team of approximately 20 scientists headed by Nobel laureate Hamilton Smith and including DNA researcher Craig Venter and microbiologist Clyde A. Hutchison III. Mycoplasma genitalium was chosen as it was the species with the smallest number of genes known at that time.

On May 21, 2010, Science reported that the Venter group had successfully synthesized the genome of the bacterium Mycoplasma mycoides from a computer record and transplanted it into an existing cell of Mycoplasma capricolum that had its DNA removed. The team used M. mycoides instead of M. genitalium because it grew faster. The new bacterium was viable (that is, capable of replicating billions of times) but not, strictly speaking, a truly synthetic life form. It is estimated that the synthetic genome cost US$40 million and 200 man-years to produce. Despite the controversy, Venter's company Synthetic Genomics has secured over $110 million in investment capital and inked a $300 million deal with Exxon Mobil to design algae for diesel fuel.


The myxobacteria ("slime bacteria") are a group of bacteria that predominantly live in the soil and feed on insoluble organic substances. The myxobacteria have very large genomes relative to other bacteria, typically 9–10 million nucleotides, with the exception of Anaeromyxobacter and Vulgatibacter. One of the myxobacteria, Minicystis rosea, has the largest bacterial genome, with over 16 million nucleotides. The second largest belongs to another myxobacterium, Sorangium cellulosum. Myxobacteria are included among the delta group of proteobacteria, a large taxon of Gram-negative forms.

Myxobacteria can move by gliding. They typically travel in swarms (also known as wolf packs), containing many cells kept together by intercellular molecular signals. Individuals benefit from aggregation as it allows accumulation of the extracellular enzymes that are used to digest food; this in turn increases feeding efficiency. Myxobacteria produce a number of biomedically and industrially useful chemicals, such as antibiotics, and export those chemicals outside the cell.


Pathogen infections are among the leading causes of infirmity and mortality among humans and other animals in the world. Until recently, it has been difficult to compile information to understand the generation of pathogen virulence factors as well as pathogen behaviour in a host environment. The study of pathogenomics attempts to utilize genomic and metagenomic data gathered from high-throughput technologies (e.g. sequencing or DNA microarrays) to understand microbe diversity and interaction as well as the host-microbe interactions involved in disease states. The bulk of pathogenomics research concerns itself with pathogens that affect human health; however, studies also exist for microbes that infect plants and animals.

Protospacer adjacent motif

Protospacer adjacent motif (PAM) is a 2–6 base-pair DNA sequence immediately following the DNA sequence targeted by the Cas9 nuclease in the CRISPR bacterial adaptive immune system. The PAM is a component of the invading virus or plasmid, but is not a component of the bacterial CRISPR locus. Cas9 will not successfully bind to or cleave the target DNA sequence if it is not followed by the PAM sequence. The PAM is an essential targeting component (not found in the bacterial genome) that distinguishes bacterial self from non-self DNA, thereby preventing the CRISPR locus from being targeted and destroyed by the nuclease.
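The requirement that a target be immediately followed by a PAM can be sketched as a simple sequence scan. The example below assumes the 5'-NGG-3' PAM of Streptococcus pyogenes Cas9 and a 20-nucleotide target, neither of which is stated in the text above; other Cas9 orthologues recognise different PAMs. It scans only the given strand, and the function name `find_targets` is hypothetical.

```python
import re


def find_targets(dna, protospacer_len=20, pam_regex="[ACGT]GG"):
    """Return (start, protospacer, pam) for every site followed by a PAM.

    Assumes the PAM lies immediately 3' of the protospacer on the same strand.
    """
    dna = dna.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(rf"(?=([ACGT]{{{protospacer_len}}})({pam_regex}))")
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]


invader = "A" * 20 + "TGG" + "A" * 5   # target followed by a PAM
no_pam = "A" * 30                      # same bases, but no PAM anywhere
print(find_targets(invader))
print(find_targets(no_pam))
```

A sequence identical to the target but lacking the adjacent PAM yields no hits, which mirrors how spacers stored in the CRISPR locus escape cleavage.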

Site-specific recombination

Site-specific recombination, also known as conservative site-specific recombination, is a type of genetic recombination in which DNA strand exchange takes place between segments possessing at least a certain degree of sequence homology. Site-specific recombinases (SSRs) perform rearrangements of DNA segments by recognizing and binding to short DNA sequences (sites), at which they cleave the DNA backbone, exchange the two DNA helices involved and rejoin the DNA strands. While in some site-specific recombination systems a recombinase enzyme and the recombination sites are enough to perform all these reactions, in other systems a number of accessory proteins and/or accessory sites are also needed. Multiple genome modification strategies rely on the capacities of SSRs, among them recombinase-mediated cassette exchange (RMCE), an advanced approach for the targeted introduction of transcription units into predetermined genomic loci.

Site-specific recombination systems are highly specific, fast and efficient, even when faced with complex eukaryotic genomes. They are employed in a variety of cellular processes, including bacterial genome replication, differentiation and pathogenesis, and movement of mobile genetic elements (Nash 1996). For the same reasons, they present a potential basis for the development of genetic engineering tools. Recombination sites are typically between 30 and 200 nucleotides in length and consist of two motifs with a partial inverted-repeat symmetry, to which the recombinase binds, and which flank a central crossover sequence at which the recombination takes place. The pairs of sites between which the recombination occurs are usually identical, but there are exceptions (e.g. attP and attB of λ integrase, see lambda phage).
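The site architecture described above, two recombinase-binding motifs with inverted-repeat symmetry flanking a central crossover sequence, can be checked on a concrete site. The sketch below uses the 34-bp loxP site recognised by Cre recombinase, an example that is not mentioned in the text: two 13-bp arms around an 8-bp spacer. The function names are hypothetical, and counting mismatches is just one simple way to quantify "partial" symmetry.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def arm_mismatches(site, arm_len):
    """Count positions where the left arm differs from the reverse
    complement of the right arm (0 means a perfect inverted repeat)."""
    site = site.upper()
    left, right = site[:arm_len], site[-arm_len:]
    return sum(a != b for a, b in zip(left, reverse_complement(right)))


# loxP written as left arm + central crossover (spacer) + right arm.
LOXP = "ATAACTTCGTATA" + "GCATACAT" + "TATACGAAGTTAT"

print(len(LOXP))                    # total site length in bp
print(arm_mismatches(LOXP, 13))     # mismatches between the two arms
print(LOXP[13:-13])                 # the central crossover sequence
```

In loxP the arms form a perfect inverted repeat while the spacer is asymmetric, which is what gives the site a direction; sites with only partial symmetry would return a small non-zero mismatch count.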

Sorangium cellulosum

Sorangium cellulosum is a soil-dwelling Gram-negative bacterium of the group myxobacteria. It is motile and shows gliding motility. Under stressful conditions, as in other myxobacteria, the cells congregate to form fruiting bodies and differentiate into myxospores. These congregating cells make isolation of pure cultures and colony counts on agar medium difficult, as the bacteria spread and colonies merge. It has an unusually large genome of 13,033,779 base pairs, which at the time of its publication made it the largest bacterial genome sequenced, by roughly 4 Mb.

Totally drug-resistant tuberculosis

Totally drug-resistant tuberculosis (TDR-TB) is a generic term for tuberculosis strains that are resistant to a wider range of drugs than strains classified as extensively drug-resistant tuberculosis. TDR-TB has been identified in three countries: India, Iran, and Italy. The emergence of TDR-TB has been documented in four major publications. However, it is not recognised by the World Health Organization.

TDR-TB has resulted from further mutations within the bacterial genome to confer resistance, beyond those seen in XDR- and MDR-TB. Development of resistance is associated with poor management of cases. Drug resistance testing occurs in only 9% of TB cases worldwide. Without testing to determine drug resistance profiles, MDR- or XDR-TB patients may develop resistance to additional drugs. TDR-TB is relatively poorly documented, as many countries do not test patient samples against a broad enough range of drugs to diagnose such a comprehensive array of resistance. The United Nations' Special Programme for Research and Training in Tropical Diseases has set up a TDR Tuberculosis Specimen Bank to archive specimens of TDR-TB.

Vibrio regulatory RNA of OmpA

VrrA (Vibrio regulatory RNA of OmpA) is a non-coding RNA that is conserved across all Vibrio species of bacteria and acts as a repressor for the synthesis of the outer membrane protein OmpA. This non-coding RNA was initially identified from Tn5 transposon mutant libraries of Vibrio cholerae, and its location within the bacterial genome was mapped to the intergenic region between genes VC1741 and VC1743 by RACE analysis. Outer membrane vesicles are secreted from the surface of Gram-negative bacteria, where they are thought to aid in virulence. Little is known about how these vesicles aid virulence, but it has been speculated that they may contribute by secreting toxins and helping to evade the immune system. Recent studies showed that VrrA expression is activated by the alternative stress sigma factor, sigma E; unlike in other bacteria such as E. coli and Salmonella, it does not require the Hfq protein to regulate the sigma factor. It was also shown that VrrA transcription increases on exposure to UV light and that overexpression of VrrA resulted in an increase in secreted outer membrane vesicles. From these studies it has been suggested that VrrA acts to relieve outer membrane stress by limiting the synthesis of OmpA protein and that outer membrane vesicles provide the bacteria with physical protection against UV light.

This page is based on a Wikipedia article.
Text is available under the CC BY-SA 3.0 license; additional terms may apply.