A nucleic acid sequence is a succession of letters that indicates the order of nucleotides within a DNA (using G, A, C, and T) or RNA (G, A, C, and U) molecule. By convention, sequences are usually presented from the 5' end to the 3' end. For DNA, the sense strand is used. Because nucleic acids are normally linear (unbranched) polymers, specifying the sequence is equivalent to defining the covalent structure of the entire molecule. For this reason, the nucleic acid sequence is also termed the primary structure.
The sequence has the capacity to represent information. Biological deoxyribonucleic acid represents the information that directs the functions of a living thing.
Nucleic acids also have a secondary structure and tertiary structure. Primary structure is sometimes mistakenly referred to as primary sequence. Conversely, there is no parallel concept of secondary or tertiary sequence.
Nucleic acids consist of a chain of linked units called nucleotides. Each nucleotide consists of three subunits: a phosphate group and a sugar (ribose in the case of RNA, deoxyribose in DNA) make up the backbone of the nucleic acid strand, and attached to the sugar is one of a set of nucleobases. The nucleobases are important in base pairing of strands to form higher-level secondary and tertiary structure such as the famed double helix.
The possible letters are A, C, G, and T, representing the four nucleotide bases of a DNA strand (adenine, cytosine, guanine, thymine) covalently linked to a phosphodiester backbone. In the typical case, the sequences are printed abutting one another without gaps, as in the sequence AAAGTCTGAC, read left to right in the 5' to 3' direction. With regard to transcription, a sequence is on the coding strand if it has the same order as the transcribed RNA.
One sequence can be complementary to another sequence, meaning that the base at each position is the complement of the corresponding base (A pairs with T, C with G) and the order is reversed. For example, the complementary sequence to TTAC is GTAA. If one strand of the double-stranded DNA is considered the sense strand, then the other strand, considered the antisense strand, will have the complementary sequence to the sense strand.
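The complement rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `reverse_complement` is an assumption made here, and only the four unambiguous DNA letters are handled.

```python
# Map each DNA base to its Watson-Crick partner (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary sequence, written 5' to 3'."""
    # Complement every base, then reverse so the result is also read 5' to 3'.
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("TTAC"))  # GTAA
```

Reversing after complementing is what keeps both strands written in the conventional 5' to 3' direction.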
Comparing and determining % difference between two nucleotide sequences.
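One simple way to compute a percent difference is to count mismatched positions between two sequences of equal length. The sketch below assumes the sequences are already aligned and contain no gaps; the function name `percent_difference` is hypothetical.

```python
def percent_difference(a: str, b: str) -> float:
    """Percentage of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    # Count positions where the two bases disagree.
    mismatches = sum(x != y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * mismatches / len(a)


print(percent_difference("AAAGTCTGAC", "AAAGTCTGAT"))  # 10.0
```

Sequences of different lengths would first need an alignment step, which this sketch deliberately leaves out.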
While A, T, C, and G represent a particular nucleotide at a position, there are also letters that represent ambiguity which are used when more than one kind of nucleotide could occur at that position. The rules of the International Union of Pure and Applied Chemistry (IUPAC) are as follows:
|Symbol||Description||Bases represented||Number of bases||Complement|
|B||not A (B comes after A)||C, G, T||3||V|
|D||not C (D comes after C)||A, G, T||3||H|
|H||not G (H comes after G)||A, C, T||3||D|
|V||not T (V comes after T and U)||A, C, G||3||B|
|N||any Nucleotide (not a gap)||A, C, G, T||4||N|
These symbols are also valid for RNA, except with U (uracil) replacing T (thymine).
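As an illustration of how ambiguity codes are used in practice, the sketch below tests whether a concrete DNA sequence is compatible with a pattern written in IUPAC letters. The dictionary includes the two-base codes (R, Y, S, W, K, M) from the standard IUPAC notation in addition to the symbols listed above; the names `IUPAC` and `matches` are assumptions made for this example.

```python
# IUPAC nucleotide codes mapped to the set of bases each one may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(pattern: str, seq: str) -> bool:
    """True if every base in seq is allowed by the code at that position."""
    if len(pattern) != len(seq):
        return False
    return all(
        base in IUPAC[code]
        for code, base in zip(pattern.upper(), seq.upper())
    )


print(matches("ANB", "ATC"))  # True
print(matches("ANB", "ATA"))  # False, because B excludes A
```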
Apart from adenine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), DNA and RNA also contain bases that have been modified after the nucleic acid chain has been formed. In DNA, the most common modified base is 5-methylcytidine (m5C). In RNA, there are many modified bases, including pseudouridine (Ψ), dihydrouridine (D), inosine (I), ribothymidine (rT) and 7-methylguanosine (m7G). Hypoxanthine and xanthine are two of the many bases created through mutagen presence, both of them through deamination (replacement of the amine-group with a carbonyl-group). Hypoxanthine is produced from adenine, and xanthine is produced from guanine. Similarly, deamination of cytosine results in uracil.
In biological systems, nucleic acids contain information which is used by a living cell to construct specific proteins. The sequence of nucleobases on a nucleic acid strand is translated by cell machinery into a sequence of amino acids making up a protein strand. Each group of three bases, called a codon, corresponds to a single amino acid, and there is a specific genetic code by which each possible combination of three bases corresponds to a specific amino acid.
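The codon lookup described above can be sketched as follows, assuming the standard genetic code and a reading frame that starts at the first base. The names `CODON_TABLE` and `translate` are chosen for this example; amino acids are given as one-letter codes and `*` marks a stop codon.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a coding sequence codon by codon until a stop codon."""
    dna = seq.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))  # MAK
```

Any trailing bases that do not form a complete codon are ignored in this sketch.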
The central dogma of molecular biology outlines the mechanism by which proteins are constructed using information contained in nucleic acids. DNA is transcribed into mRNA molecules, which travel to the ribosome, where the mRNA is used as a template for the construction of the protein strand. Since nucleic acids can bind to molecules with complementary sequences, there is a distinction between "sense" sequences, which code for proteins, and the complementary "antisense" sequence, which is by itself nonfunctional but can bind to the sense strand.
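The relationship between the two DNA strands and the transcript can be sketched as below. Both function names are hypothetical, and both inputs are assumed to be written 5' to 3' using only A, C, G, and T.

```python
# DNA base -> complementary RNA base.
DNA_TO_RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")


def mrna_from_sense(sense: str) -> str:
    """The mRNA has the same order as the sense strand, with U in place of T."""
    return sense.upper().replace("T", "U")


def mrna_from_antisense(antisense: str) -> str:
    """The mRNA is the reverse complement of the antisense (template) strand."""
    return antisense.upper().translate(DNA_TO_RNA_COMPLEMENT)[::-1]


print(mrna_from_sense("ATGGCC"))      # AUGGCC
print(mrna_from_antisense("GGCCAT"))  # AUGGCC
```

Both calls give the same transcript because GGCCAT is the reverse complement of ATGGCC.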
DNA sequencing is the process of determining the nucleotide sequence of a given DNA fragment. The sequence of the DNA of a living thing encodes the necessary information for that living thing to survive and reproduce. Therefore, determining the sequence is useful in fundamental research into why and how organisms live, as well as in applied subjects. Because of the importance of DNA to living things, knowledge of a DNA sequence may be useful in practically any biological research. For example, in medicine it can be used to identify, diagnose and potentially develop treatments for genetic diseases. Similarly, research into pathogens may lead to treatments for contagious diseases. Biotechnology is a burgeoning discipline, with the potential for many useful products and services.
RNA is usually not sequenced directly. Instead, it is copied into DNA by reverse transcriptase, and this DNA is then sequenced.
Current sequencing methods rely on the discriminatory ability of DNA polymerases, and therefore can only distinguish four bases. An inosine (created from adenosine during RNA editing) is read as a G, and 5-methyl-cytosine (created from cytosine by DNA methylation) is read as a C. With current technology, it is difficult to sequence small amounts of DNA, as the signal is too weak to measure. This is overcome by polymerase chain reaction (PCR) amplification.
Once a nucleic acid sequence has been obtained from an organism, it is stored in silico in digital format. Digital genetic sequences may be stored in sequence databases, be analyzed (see Sequence analysis below), be digitally altered and be used as templates for creating new actual DNA using artificial gene synthesis.
Digital genetic sequences may be analyzed using the tools of bioinformatics to attempt to determine their function.
The DNA in an organism's genome can be analyzed to diagnose vulnerabilities to inherited diseases, and can also be used to determine a child's paternity (genetic father) or a person's ancestry. Normally, every person carries two variations of every gene, one inherited from their mother, the other inherited from their father. The human genome is believed to contain around 20,000 - 25,000 genes. In addition to studying chromosomes to the level of individual genes, genetic testing in a broader sense includes biochemical tests for the possible presence of genetic diseases, or mutant forms of genes associated with increased risk of developing genetic disorders.
Genetic testing identifies changes in chromosomes, genes, or proteins. Usually, testing is used to find changes that are associated with inherited disorders. The results of a genetic test can confirm or rule out a suspected genetic condition or help determine a person's chance of developing or passing on a genetic disorder. Several hundred genetic tests are currently in use, and more are being developed.
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be due to functional, structural, or evolutionary relationships between the sequences. If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as insertion or deletion mutations (indels) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggests that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.
Computational phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young most recent common ancestor, while low identity suggests that the divergence is more ancient. This approximation, which reflects the "molecular clock" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the coalescence time), assumes that the effects of mutation and selection are constant across sequence lineages. Therefore, it does not account for possible difference among organisms or species in the rates of DNA repair or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between silent mutations that do not alter the meaning of a given codon and other mutations that result in a different amino acid being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.
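To make the link between observed differences and evolutionary distance concrete, the sketch below applies the Jukes-Cantor correction, a simple substitution model that is not named in the text above and is used here only as an example. It assumes equal base frequencies and equal substitution rates, so it shares the limitations of the basic molecular clock described above. The function name `jukes_cantor_distance` is hypothetical.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site from the observed fraction p of
    differing sites, under the Jukes-Cantor model."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    # d = -(3/4) * ln(1 - (4/3) * p)
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# The corrected distance exceeds the raw fraction, because some sites
# will have changed more than once.
print(jukes_cantor_distance(0.10) > 0.10)  # True
```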
Frequently the primary structure encodes motifs that are of functional importance. Some examples of sequence motifs are: the C/D and H/ACA boxes of snoRNAs, Sm binding site found in spliceosomal RNAs such as U1, U2, U4, U5, U6, U12 and U3, the Shine-Dalgarno sequence, the Kozak consensus sequence and the RNA polymerase III terminator.
Peng et al. found the existence of long-range correlations in the non-coding base pair sequences of DNA. In contrast, such correlations seem not to appear in coding DNA sequences. This finding has been explained by Grosberg et al. by the global spatial structure of the DNA.
In bioinformatics, a sequence entropy, also known as sequence complexity or information profile, is a numerical sequence providing a quantitative measure of the local complexity of a DNA sequence, independently of the direction of processing. Manipulating these information profiles enables sequences to be analyzed with alignment-free techniques, for example in the detection of motifs and rearrangements.
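One possible way to build such a profile is to compute the Shannon entropy of the base composition in a sliding window. This is an assumption made for illustration, not necessarily the definition used in the literature; the function names and the `width` parameter are hypothetical. Because the measure depends only on base counts, a window gives the same value whether it is read forwards or backwards.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the base composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(seq: str, width: int) -> list:
    """Entropy of every window of the given width along the sequence."""
    return [
        shannon_entropy(seq[i:i + width])
        for i in range(len(seq) - width + 1)
    ]


print(shannon_entropy("ACGT"))  # 2.0
```

Low values mark repetitive, low-complexity stretches; the maximum for DNA is 2 bits, reached when all four bases are equally frequent in the window.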
Apolipoprotein A-II is a protein that in humans is encoded by the APOA2 gene.
Afovirsen
Afovirsen is an oligonucleotide capable of antisense interactions with mRNA of human papillomavirus. It has been investigated as a tool for diagnostics and therapeutics.
Biopolymer
Biopolymers are polymers produced by living organisms; in other words, they are polymeric biomolecules. Biopolymers contain monomeric units that are covalently bonded to form larger structures. There are three main classes of biopolymers, classified according to the monomeric units used and the structure of the biopolymer formed: polynucleotides (RNA and DNA), which are long polymers composed of 13 or more nucleotide monomers; polypeptides, which are short polymers of amino acids; and polysaccharides, which are often linear bonded polymeric carbohydrate structures. Other examples of biopolymers include rubber, suberin, melanin and lignin.
Cellulose is the most common organic compound and biopolymer on Earth. About 33 percent of all plant matter is cellulose. The cellulose content of cotton is 90 percent; for wood it is 50 percent.
Felsenstein's tree-pruning algorithm
In statistical genetics, Felsenstein's tree-pruning algorithm (or Felsenstein's tree-peeling algorithm), attributed to Joseph Felsenstein, is an algorithm for computing the likelihood of an evolutionary tree from nucleic acid sequence data. The algorithm is often used as a subroutine in a search for a maximum likelihood estimate for an evolutionary tree. Further, it can be used in a hypothesis test for whether evolutionary rates are constant (by using likelihood ratio tests). It can also be used to provide error estimates for the parameters describing an evolutionary tree.
Fluorescence in situ hybridization
Fluorescence in situ hybridization (FISH) is a molecular cytogenetic technique that uses fluorescent probes that bind to only those parts of a nucleic acid sequence with a high degree of sequence complementarity. It was developed by biomedical researchers in the early 1980s to detect and localize the presence or absence of specific DNA sequences on chromosomes. Fluorescence microscopy can be used to find out where the fluorescent probe is bound to the chromosomes. FISH is often used for finding specific features in DNA for use in genetic counseling, medicine, and species identification. FISH can also be used to detect and localize specific RNA targets (mRNA, lncRNA and miRNA) in cells, circulating tumor cells, and tissue samples. In this context, it can help define the spatial-temporal patterns of gene expression within cells and tissues.
Infectious hypodermal and hematopoietic necrosis
Infectious hypodermal and hematopoietic necrosis (IHHN) is a viral disease of penaeid shrimp that causes mass mortality (up to 90%) among the Western blue shrimp (Penaeus stylirostris) and severe deformations in the Pacific white shrimp (P. vannamei). It occurs in Pacific farmed and wild shrimp, but not in wild shrimp on the Atlantic coast of the Americas. The shrimp-farming industry has developed several broodstocks of both P. stylirostris and P. vannamei that are resistant to IHHN infection. The disease is caused by a single-stranded DNA virus of the species Decapod pestyldensovirus 1, earlier known as IHHN virus, the smallest of the known penaeid shrimp viruses (22 nm).
Marker gene
In biology, a marker gene may have several meanings. In nuclear biology and molecular biology, a marker gene is a gene used to determine if a nucleic acid sequence has been successfully inserted into an organism's DNA. In particular, there are two sub-types of these marker genes: a selectable marker and a marker for screening. In metagenomics and phylogenetics, a marker gene is an orthologous gene group which can be used to delineate between taxonomic lineages.
Molecular beacon
Molecular beacons are oligonucleotide hybridization probes that can report the presence of specific nucleic acids in homogeneous solutions. The term more often used is molecular beacon probes. Molecular beacons are hairpin-shaped molecules with an internally quenched fluorophore whose fluorescence is restored when they bind to a target nucleic acid sequence. This is a novel non-radioactive method for detecting specific sequences of nucleic acids. They are useful in situations where it is either not possible or not desirable to isolate the probe-target hybrids from an excess of the hybridization probes.
NASBA (molecular biology)
Nucleic acid sequence based amplification (NASBA) is a method in molecular biology which is used to amplify RNA sequences.
Nucleic acid
Nucleic acids are biopolymers, or large biomolecules, essential to all known forms of life. The term nucleic acid is the overall name for DNA and RNA. They are composed of nucleotides, which are monomers made of three components: a 5-carbon sugar, a phosphate group and a nitrogenous base. If the sugar is ribose, the polymer is RNA (ribonucleic acid); if the sugar is deoxyribose, a derivative of ribose, the polymer is DNA (deoxyribonucleic acid).
Nucleic acids are the most important of all biomolecules. They are found in abundance in all living things, where they function to create, encode, and then store information in every living cell of every organism on Earth. In turn, they function to transmit and express that information inside and outside the cell nucleus, to the interior operations of the cell and ultimately to the next generation of each living organism. The encoded information is contained and conveyed via the nucleic acid sequence, which provides the 'ladder-step' ordering of nucleotides within the molecules of RNA and DNA.
Strings of nucleotides are bonded to form helical backbones (typically one for RNA, two for DNA) and assembled into chains of base pairs selected from the five primary, or canonical, nucleobases: adenine, cytosine, guanine, thymine, and uracil. Thymine occurs only in DNA and uracil only in RNA. Using amino acids and the process known as protein synthesis, the specific sequencing in DNA of these nucleobase pairs enables storing and transmitting coded instructions as genes. In RNA, base-pair sequencing provides for manufacturing new proteins that determine the frames and parts and most chemical processes of all life forms.
Nucleic acid design
Nucleic acid design is the process of generating a set of nucleic acid base sequences that will associate into a desired conformation. Nucleic acid design is central to the fields of DNA nanotechnology and DNA computing. It is necessary because there are many possible sequences of nucleic acid strands that will fold into a given secondary structure, but many of these sequences will have undesired additional interactions which must be avoided. In addition, there are many tertiary structure considerations which affect the choice of a secondary structure for a given design.
Nucleic acid design has similar goals to protein design: in both, the sequence of monomers is rationally designed to favor the desired folded or associated structure and to disfavor alternate structures. However, nucleic acid design has the advantage of being a much computationally simpler problem, since the simplicity of Watson-Crick base pairing rules leads to simple heuristic methods which yield experimentally robust designs. Computational models for protein folding require tertiary structure information whereas nucleic acid design can operate largely on the level of secondary structure. However, nucleic acid structures are less versatile than proteins in their functionality.
Nucleic acid design can be considered the inverse of nucleic acid structure prediction. In structure prediction, the structure is determined from a known sequence, while in nucleic acid design, a sequence is generated which will form a desired structure.
Nucleic acid thermodynamics
Nucleic acid thermodynamics is the study of how temperature affects the nucleic acid structure of double-stranded DNA (dsDNA). The melting temperature (Tm) is defined as the temperature at which half of the DNA strands are in the random coil or single-stranded (ssDNA) state. Tm depends on the length of the DNA molecule and its specific nucleotide sequence. DNA, when in a state where its two strands are dissociated (i.e., the dsDNA molecule exists as two independent strands), is referred to as having been denatured by the high temperature.
Palindromic sequence
A palindromic sequence is a nucleic acid sequence in a double-stranded DNA or RNA molecule wherein reading in a certain direction (e.g. 5' to 3') on one strand matches the sequence reading in the same direction (e.g. 5' to 3') on the complementary strand. This definition of palindrome thus depends on complementary strands being palindromic of each other.
The meaning of palindrome in the context of genetics is slightly different from the definition used for words and sentences. Since a double helix is formed by two paired antiparallel strands of nucleotides that run in opposite directions, and the nucleotides always pair in the same way (adenine (A) with thymine (T) in DNA or uracil (U) in RNA; cytosine (C) with guanine (G)), a (single-stranded) nucleotide sequence is said to be a palindrome if it is equal to its reverse complement. For example, the DNA sequence ACCTAGGT is palindromic because its nucleotide-by-nucleotide complement is TGGATCCA, and reversing the order of the nucleotides in the complement gives the original sequence.
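The definition above translates directly into a short check: a sequence is palindromic if it equals its own reverse complement. The sketch handles DNA letters only, and the function name `is_palindromic` is an assumption made for this example.

```python
# Map each DNA base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_palindromic(seq: str) -> bool:
    """True if the sequence equals its reverse complement."""
    seq = seq.upper()
    return seq == seq.translate(COMPLEMENT)[::-1]


print(is_palindromic("ACCTAGGT"))    # True
print(is_palindromic("AAAGTCTGAC"))  # False
```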
A palindromic nucleotide sequence is capable of forming a hairpin. Palindromic motifs are found in most genomes or sets of genetic instructions. They have been specially researched in bacterial chromosomes and in the so-called Bacterial Interspersed Mosaic Elements (BIMEs) scattered over them. In 2008, a genome sequencing project discovered that large portions of the human X and Y chromosomes are arranged as palindromes. A palindromic structure allows the Y chromosome to repair itself by bending over at the middle if one side is damaged.
Palindromes also appear to be found frequently in the peptide sequences that make up proteins, but their role in protein function is not clearly known. It has been suggested that the existence of palindromes in peptides might be related to the prevalence of low-complexity regions in proteins, as palindromes are frequently associated with low-complexity sequences. Their prevalence may also be related to the propensity of such sequences to form alpha helices or protein/protein complexes.
Reference genome
A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled by scientists as a representative example of a species' set of genes. As they are often assembled from the sequencing of DNA from a number of donors, reference genomes do not accurately represent the set of genes of any single person. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor. For example, GRCh37, the Genome Reference Consortium human genome (build 37), is derived from thirteen anonymous volunteers from Buffalo, New York. The ABO blood group system differs among humans, but the human reference genome contains only an O allele (although the other alleles are annotated).
As the cost of DNA sequencing falls, and new full genome sequencing technologies emerge, more genome sequences continue to be generated. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than the initial Human Genome Project. Most individuals with their entire genome sequenced, such as James D. Watson, had their genome assembled in this manner. For much of a genome, the reference provides a good approximation of the DNA of any single individual. But in regions with high allelic diversity, such as the major histocompatibility complex in humans and the major urinary proteins of mice, the reference genome may differ significantly from other individuals. Comparison between the reference (build 36) and Watson's genome revealed 3.3 million single nucleotide polymorphism differences, while about 1.4 percent of his DNA could not be matched to the reference genome at all. For regions where there is known to be large scale variation, sets of alternate loci are assembled alongside the reference locus.
Reference genomes can be accessed online at several locations, using dedicated browsers such as Ensembl or UCSC Genome Browser.
Sequence homology
Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal (or lateral) gene transfer event (xenologs).
Homology among DNA, RNA, or proteins is typically inferred from their nucleotide or amino acid sequence similarity. Significant similarity is strong evidence that two sequences are related by evolutionary changes from a common ancestral sequence. Alignments of multiple sequences are used to indicate which regions of each sequence are homologous.
Szybalski's rule
Szybalski's rule says that lower-protein particles like viruses contain more purines than pyrimidines in their nucleic acid sequence.
This is to prevent the formation of double-stranded RNA from one or two separate RNA strands that have complementary regions. The formation of double-stranded RNA is not efficient for viruses, as it may delay or stop RNA replication or protein formation.
Terminator (genetics)
In genetics, a transcription terminator is a section of nucleic acid sequence that marks the end of a gene or operon in genomic DNA during transcription. This sequence mediates transcriptional termination by providing signals in the newly synthesized transcript RNA that trigger processes which release the transcript RNA from the transcriptional complex. These processes include the direct interaction of the mRNA secondary structure with the complex and/or the indirect activities of recruited termination factors. Release of the transcriptional complex frees RNA polymerase and related transcriptional machinery to begin transcription of new mRNAs.
Tuberculous meningitis
Tuberculous meningitis, also known as TB meningitis or tubercular meningitis, is a Mycobacterium tuberculosis infection of the meninges, the system of membranes that envelops the central nervous system.
Ultra-conserved element
An ultra-conserved element (UCE) is a region of DNA that is identical in at least two different species.
One of the first studies of UCEs showed that certain human DNA sequences of length 200 nucleotides or greater were entirely conserved (identical nucleic acid sequence) in humans, rats, and mice. Despite often being noncoding DNA, some ultra-conserved elements have been found to be transcriptionally active, giving non-coding RNA molecules.