Genes and Genomes

In eukaryotes, genomic DNA is located in the cell nucleus, with small amounts also found in mitochondria and chloroplasts. In prokaryotes, the DNA is held within an irregularly shaped body in the cytoplasm called the nucleoid. The genetic information in a genome is held within genes, and the complete set of this information in an organism is called its genotype. A gene is a unit of heredity and is a region of DNA that influences a particular characteristic in an organism. Genes contain an open reading frame that can be transcribed, as well as regulatory sequences such as promoters and enhancers, which control the transcription of the open reading frame.

In many species, only a small fraction of the total sequence of the genome encodes protein. For example, only about 1.5% of the human genome consists of protein-coding exons, with over 50% of human DNA consisting of non-coding repetitive sequences. The reasons for the presence of so much non-coding DNA in eukaryotic genomes and the extraordinary differences in genome size, or C-value, among species represent a long-standing puzzle known as the "C-value enigma". However, DNA sequences that do not code for protein may still encode functional non-coding RNA molecules, which are involved in the regulation of gene expression.

Nucleic acid secondary structure

In biochemistry and structural biology, secondary structure is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA). It does not, however, describe specific atomic positions in three-dimensional space, which are considered to be tertiary structure. Secondary structure is formally defined by the hydrogen bonds of the biopolymer, as observed in an atomic-resolution structure. In proteins, the secondary structure is defined by patterns of hydrogen bonds between backbone amide and carboxyl groups (sidechain-mainchain and sidechain-sidechain hydrogen bonds are irrelevant), where the DSSP definition of a hydrogen bond is used. In nucleic acids, the secondary structure is defined by the hydrogen bonding between the nitrogenous bases.

For proteins, however, the hydrogen bonding is correlated with other structural features, which has given rise to less formal definitions of secondary structure. For example, residues in protein helices generally adopt backbone dihedral angles in a particular region of the Ramachandran plot; thus, a segment of residues with such dihedral angles is often called a "helix", regardless of whether it has the correct hydrogen bonds. Many other less formal definitions have been proposed, often applying concepts from the differential geometry of curves, such as curvature and torsion. Least formally, structural biologists solving a new atomic-resolution structure will sometimes assign its secondary structure "by eye" and record their assignments in the corresponding PDB file.

The secondary structure of a nucleic acid molecule refers to the basepairing interactions within a single molecule or set of interacting molecules. The secondary structure of biological RNAs can often be uniquely decomposed into stems and loops. Frequently these elements, or combinations of them, can be further classified, for example as tetraloops, pseudoknots and stem-loops. There are many secondary structure elements of functional importance to biological RNAs; some famous examples are the Rho-independent terminator stem-loops and the tRNA cloverleaf.
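As a minimal sketch of how stems and loops can be represented in software, the example below assumes the widely used dot-bracket notation, which the text above does not mention: "(" and ")" mark the two partners of a base pair and "." marks an unpaired base. The function name parse_dot_bracket is hypothetical. With only one bracket type this notation describes nested pairs, so pseudoknots are outside its scope.

```python
def parse_dot_bracket(structure):
    """Return a sorted list of (i, j) base-pair positions (0-based) from dot-bracket notation."""
    stack = []   # positions of '(' still waiting for a partner
    pairs = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# A stem of four base pairs closed by a four-base loop (a simple stem-loop).
stem_loop = "((((....))))"
print(parse_dot_bracket(stem_loop))
```

Each returned tuple is one base pair of the stem; positions that appear in no tuple belong to the loop.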

Alternate DNA structures

DNA exists in many possible conformations, including the A-DNA, B-DNA, and Z-DNA forms, although only B-DNA and Z-DNA have been directly observed in functional organisms. The conformation that DNA adopts depends on the hydration level, DNA sequence, the amount and direction of supercoiling, chemical modifications of the bases, the type and concentration of metal ions, as well as the presence of polyamines in solution.

The first published reports of A-DNA X-ray diffraction patterns, and also of B-DNA, used analyses based on Patterson transforms that provided only a limited amount of structural information for oriented fibers of DNA. An alternate analysis was then proposed by Wilkins et al. in 1953 for the in vivo B-DNA X-ray diffraction/scattering patterns of highly hydrated DNA fibers in terms of squares of Bessel functions. In the same journal, James D. Watson and Francis Crick presented their molecular modeling analysis of the DNA X-ray diffraction patterns to suggest that the structure was a double helix.

Although the "B-DNA form" is most common under the conditions found in cells, it is not a well-defined conformation but a family of related DNA conformations that occur at the high hydration levels present in living cells. Their corresponding X-ray diffraction and scattering patterns are characteristic of molecular paracrystals with a significant degree of disorder.

Protein quantification

For genes encoding proteins the expression level can be directly assessed by a number of means with some clear analogies to the techniques for mRNA quantification.

The most commonly used method is to perform a Western blot against the protein of interest - this gives information on the size of the protein in addition to its identity. A sample (often cellular lysate) is separated on a polyacrylamide gel, transferred to a membrane and then probed with an antibody to the protein of interest. The antibody can either be conjugated to a fluorophore or to horseradish peroxidase for imaging and/or quantification. The gel-based nature of this assay makes quantification less accurate but it has the advantage of being able to identify later modifications to the protein, for example proteolysis or ubiquitination, from changes in size.

By replacing the gene with a new version fused to a green fluorescent protein (or similar) marker, expression may be directly quantified in live cells. This is done by imaging using a fluorescence microscope. It is very difficult to clone a GFP-fused protein into its native location in the genome without affecting expression levels, so this method often cannot be used to measure endogenous gene expression. It is, however, widely used to measure the expression of a gene artificially introduced into the cell, for example via an expression vector. It is important to note that by fusing a target protein to a fluorescent reporter the protein's behavior, including its cellular localization and expression level, can be significantly changed.

The enzyme-linked immunosorbent assay works by using antibodies immobilised on a microtiter plate to capture proteins of interest from samples added to the well. Using a detection antibody conjugated to an enzyme or fluorophore the quantity of bound protein can be accurately measured by fluorometric or colourimetric detection. The detection process is very similar to that of a Western blot, but by avoiding the gel steps more accurate quantification can be achieved.


Bioinformatics involves the manipulation, searching, and data mining of DNA sequence data. The development of techniques to store and search DNA sequences has led to widely applied advances in computer science, especially string searching algorithms, machine learning and database theory. String searching or matching algorithms, which find an occurrence of a sequence of letters inside a larger sequence of letters, were developed to search for specific sequences of nucleotides. In other applications such as text editors, even simple algorithms for this problem usually suffice, but DNA sequences cause these algorithms to exhibit near-worst-case behaviour due to their small number of distinct characters. The related problem of sequence alignment aims to identify homologous sequences and locate the specific mutations that make them distinct. These techniques, especially multiple sequence alignment, are used in studying phylogenetic relationships and protein function. Data sets representing entire genomes' worth of DNA sequences, such as those produced by the Human Genome Project, are difficult to use without annotations, which label the locations of genes and regulatory elements on each chromosome. Regions of DNA sequence that have the characteristic patterns associated with protein- or RNA-coding genes can be identified by gene finding algorithms, which allow researchers to predict the presence of particular gene products in an organism even before they have been isolated experimentally.
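The string searching task described above can be illustrated with a short sketch. It is not one of the specialised algorithms the paragraph alludes to; it simply relies on Python's built-in str.find to report every occurrence of a short motif, including overlapping ones. The function name find_all and the example sequences are hypothetical.

```python
def find_all(sequence, motif):
    """Return the 0-based start positions of every (possibly overlapping) occurrence of motif."""
    if not motif:
        raise ValueError("motif must be non-empty")
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one position later so that overlapping matches are not skipped.
        start = sequence.find(motif, start + 1)
    return positions


genome = "GATATATGCATATACTT"
print(find_all(genome, "ATAT"))  # [1, 3, 9]
```

Restarting the search at start + 1 rather than after the end of the match matters for DNA, where short repetitive motifs frequently overlap themselves.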

Functional structure of a gene

All genes have regulatory regions in addition to regions that explicitly code for a protein or RNA product. A regulatory region shared by almost all genes is known as the promoter, which provides a position that is recognized by the transcription machinery when a gene is about to be transcribed and expressed. A gene can have more than one promoter, resulting in RNAs that differ in how far they extend in the 5' end. Although promoter regions have a consensus sequence that is the most common sequence at this position, some genes have "strong" promoters that bind the transcription machinery well, and others have "weak" promoters that bind poorly. These weak promoters usually permit a lower rate of transcription than the strong promoters, because the transcription machinery binds to them and initiates transcription less frequently. Other possible regulatory regions include enhancers, which can compensate for a weak promoter. Most regulatory regions are "upstream"—that is, before or toward the 5' end of the transcription initiation site. Eukaryotic promoter regions are much more complex and difficult to identify than prokaryotic promoters.

Many prokaryotic genes are organized into operons, or groups of genes whose products have related functions and which are transcribed as a unit. By contrast, eukaryotic genes are transcribed only one at a time, but may include long stretches of DNA called introns which are transcribed but never translated into protein (they are spliced out before translation). Splicing can also occur in prokaryotic genes, but is less common than in eukaryotes.

Genetic recombination

A DNA helix usually does not interact with other segments of DNA, and in human cells the different chromosomes even occupy separate areas in the nucleus called "chromosome territories". This physical separation of different chromosomes is important for the ability of DNA to function as a stable repository for information, as one of the few times chromosomes interact is during chromosomal crossover when they recombine. Chromosomal crossover is when two DNA helices break, swap a section and then rejoin.
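The "break, swap a section and rejoin" description can be made concrete with a toy model that treats each chromosome as a plain string of equal length. This is only an illustration of the exchange of sequence, not of the enzymatic mechanism; the function name crossover and its start and end parameters are hypothetical.

```python
def crossover(chrom_a, chrom_b, start, end):
    """Swap the segment [start, end) between two equal-length sequences."""
    if len(chrom_a) != len(chrom_b):
        raise ValueError("this toy model assumes sequences of equal length")
    if not 0 <= start <= end <= len(chrom_a):
        raise ValueError("start and end must describe a segment inside the sequences")
    new_a = chrom_a[:start] + chrom_b[start:end] + chrom_a[end:]
    new_b = chrom_b[:start] + chrom_a[start:end] + chrom_b[end:]
    return new_a, new_b


print(crossover("AAAAAA", "CCCCCC", 2, 4))  # ('AACCAA', 'CCAACC')
```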

Recombination allows chromosomes to exchange genetic information and produces new combinations of genes, which increases the efficiency of natural selection and can be important in the rapid evolution of new proteins. Genetic recombination can also be involved in DNA repair, particularly in the cell's response to double-strand breaks.

The most common form of chromosomal crossover is homologous recombination, where the two chromosomes involved share very similar sequences. Non-homologous recombination can be damaging to cells, as it can produce chromosomal translocations and genetic abnormalities. The recombination reaction is catalyzed by enzymes known as recombinases, such as RAD51. The first step in recombination is a double-stranded break either caused by an endonuclease or damage to the DNA. A series of steps catalyzed in part by the recombinase then leads to joining of the two helices by at least one Holliday junction, in which a segment of a single strand in each helix is annealed to the complementary strand in the other helix. The Holliday junction is a tetrahedral junction structure that can be moved along the pair of chromosomes, swapping one strand for another. The recombination reaction is then halted by cleavage of the junction and re-ligation of the released DNA.


Nucleobases are heterocyclic aromatic organic compounds containing nitrogen atoms. Nucleobases are the parts of RNA and DNA involved in base pairing. Cytosine, guanine, adenine, and thymine are found predominantly in DNA, while in RNA uracil replaces thymine. These are abbreviated as C, G, A, T, and U, respectively.

Nucleobases are complementary and, when forming base pairs, always join accordingly: cytosine with guanine and adenine with thymine (adenine with uracil in RNA). The interaction between cytosine and guanine is stronger than that between adenine and thymine because the former pair is joined by three hydrogen bonds while the latter pair has only two. Thus, the higher the GC content of double-stranded DNA, the more stable the molecule and the higher the melting temperature.
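The relationship between GC content and stability can be sketched numerically. The hydrogen bond counts below come directly from the text (three per G-C pair, two per A-T pair). The melting temperature estimate uses the Wallace rule of thumb, Tm = 2 x (A+T) + 4 x (G+C) in degrees Celsius, which is an assumption brought in for illustration and is only a rough guide for short oligonucleotides. All function names are hypothetical.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def hydrogen_bonds(sequence):
    """Hydrogen bonds in the duplex formed by this strand and its complement."""
    seq = sequence.upper()
    at_pairs = seq.count("A") + seq.count("T")
    gc_pairs = seq.count("G") + seq.count("C")
    return 2 * at_pairs + 3 * gc_pairs


def wallace_tm(sequence):
    """Rule-of-thumb melting temperature in degrees Celsius (short oligonucleotides only)."""
    seq = sequence.upper()
    at_pairs = seq.count("A") + seq.count("T")
    gc_pairs = seq.count("G") + seq.count("C")
    return 2 * at_pairs + 4 * gc_pairs


for oligo in ("ATATATAT", "ATGCATGC", "GCGCGCGC"):
    print(oligo, gc_content(oligo), hydrogen_bonds(oligo), wallace_tm(oligo))
```

The three example oligonucleotides have the same length, so the rising estimates reflect GC content alone.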

Two main nucleobase classes exist, named for the molecule which forms their skeleton. These are the double-ringed purines and single-ringed pyrimidines. Adenine and guanine are purines (abbreviated as R), while cytosine, thymine, and uracil are all pyrimidines (abbreviated as Y).

Hypoxanthine and xanthine are mutant forms of adenine and guanine, respectively, created in the presence of mutagens through deamination (replacement of the amine group with a carbonyl group). These are abbreviated HX and X.

Protein structure prediction

Biomolecular structure prediction is the prediction of the three-dimensional structure of a protein from its amino acid sequence, or of a nucleic acid from its base sequence. In other words, it is the prediction of secondary and tertiary structure from its primary structure. Structure prediction is the inverse of biomolecular design.

Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry. Protein structure prediction is of high importance in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes). Every two years, the performance of current methods is assessed in the CASP experiment.

There has also been a significant amount of bioinformatics research directed at the RNA structure prediction problem. A common problem for researchers working with RNA is to determine the three-dimensional structure of the molecule given just the nucleic acid sequence. However, in the case of RNA much of the final structure is determined by the secondary structure or intra-molecular base-pairing interactions of the molecule. This is shown by the high conservation of base-pairings across diverse species.

Secondary structure of small nucleic acid molecules is largely determined by strong, local interactions such as hydrogen bonds and base stacking. Summing the free energy for such interactions, usually using a nearest-neighbor model, provides an approximation for the stability of a given structure. The most straightforward way to find the lowest free energy structure would be to generate all possible structures and calculate the free energy for each, but the number of possible structures for a sequence increases exponentially with the length of the nucleic acid.[22] For longer molecules, the number of possible secondary structures is enormous.[21]
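Dynamic programming is one common way to avoid enumerating every structure. The sketch below is a simplified, Nussinov-style recursion that maximises the number of nested Watson-Crick base pairs; it is a stand-in for, not an implementation of, the nearest-neighbor free energy model described above. The function name max_base_pairs and the min_loop parameter (minimum number of unpaired bases enclosed by a pair) are assumptions, and wobble pairs and pseudoknots are ignored.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested Watson-Crick pairs, with at least min_loop bases in each hairpin loop."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] holds the best pair count for the subsequence rna[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, leaving at least min_loop bases between them
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in WATSON_CRICK:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGGAAAACCCC"))  # 4
```

The table has on the order of n squared entries and each takes up to n steps to fill, so the running time grows cubically with sequence length rather than exponentially.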

Sequence covariation methods rely on the existence of a data set composed of multiple homologous RNA sequences with related but dissimilar sequences. These methods analyze the covariation of individual base sites in evolution; maintenance at two widely separated sites of a pair of base-pairing nucleotides indicates the presence of a structurally required hydrogen bond between those positions. The general problem of pseudoknot prediction has been shown to be NP-complete.
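One way to quantify covariation between two alignment columns is their mutual information. The choice of this measure is an assumption made for illustration, since the text does not name a specific statistic. The sketch below expects an alignment given as a list of equal-length, gap-free strings; the function name mutual_information and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information in bits between columns i and j of an alignment."""
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    total = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        total += p_ab * log2(p_ab / (p_a * p_b))
    return total


# Toy alignment: the first and last columns always hold complementary bases,
# while the middle columns never vary.
toy_alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
print(mutual_information(toy_alignment, 0, 4))  # high: the columns covary
print(mutual_information(toy_alignment, 1, 2))  # zero: both columns are invariant
```

A high score between two distant columns is the signal described above: the bases change across species, but they change together so that pairing is preserved.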

Chemical structure

Deoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms. The main role of DNA molecules is the long-term storage of information and DNA is often compared to a set of blueprints, since it contains the instructions needed to construct other components of cells, such as proteins and RNA molecules. The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in regulating the use of this genetic information.

DNA is made of four types of nucleotides, containing different nucleobases: the pyrimidines cytosine and thymine, and the purines guanine and adenine. The nucleotides are attached to each other in a chain by bonds between their sugar and phosphate groups, forming a sugar-phosphate backbone. Two of these chains are held together by hydrogen bonding between complementary bases; the chains coil around each other, forming the DNA double helix.
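Because the two chains are complementary, the sequence of one strand determines the other. The sketch below computes the partner strand as a reverse complement. It relies on a fact not stated above, namely that the two strands run in opposite directions, so the complement is conventionally read in reverse. The names COMPLEMENT and reverse_complement are hypothetical.

```python
# Translation table mapping each base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Return the complementary strand, written in its own 5'-to-3' direction."""
    return strand.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.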

DNA sequencing

Finding a single gene amid the vast stretches of DNA that make up the human genome - three billion base-pairs' worth - requires a set of powerful tools. The Human Genome Project (HGP) was devoted to developing new and better tools to make gene hunts faster, cheaper and practical for almost any scientist to accomplish.

These tools include genetic maps, physical maps and DNA sequence - which is a detailed description of the order of the chemical building blocks, or bases, in a given stretch of DNA. Indeed, the monumental achievement of the HGP was its successful sequencing of the entire length of human DNA, also referred to as the human genome.

Scientists need to know the sequence of bases because it tells them the kind of genetic information that is carried in a particular segment of DNA. For example, they can use sequence information to determine which stretches of DNA contain genes, as well as to analyze those genes for changes in sequence, called mutations, that may cause disease.
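Both uses of sequence information mentioned above can be sketched in a few lines. The first function scans the forward strand for open reading frames, assuming the standard start codon ATG and stop codons TAA, TAG and TGA; real gene finding is considerably more involved, for example because of introns and the reverse strand. The second function lists positions where a sample differs from a reference of the same length. All names and example sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs; end is exclusive and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                # Count codons from the start codon up to, but not including, the stop.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)


def point_differences(reference, sample):
    """Return (position, reference_base, sample_base) for every mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only compares sequences of equal length")
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


print(find_orfs("CCATGAAATTTTAGGG"))            # [(2, 14)]
print(point_differences("GATTACA", "GATCACA"))  # [(3, 'T', 'C')]
```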

DNA Profiling

Forensic scientists can use DNA in blood, semen, skin, saliva or hair found at a crime scene to identify a matching DNA of an individual, such as a perpetrator. This process is called genetic fingerprinting, or more accurately, DNA profiling. In DNA profiling, the lengths of variable sections of repetitive DNA, such as short tandem repeats and minisatellites, are compared between people. This method is usually an extremely reliable technique for identifying a matching DNA. However, identification can be complicated if the scene is contaminated with DNA from several people. DNA profiling was developed in 1984 by British geneticist Sir Alec Jeffreys, and first used in forensic science to convict Colin Pitchfork in the 1988 Enderby murders case.
People convicted of certain types of crimes may be required to provide a sample of DNA for a database. This has helped investigators solve old cases where only a DNA sample was obtained from the scene. DNA profiling can also be used to identify victims of mass casualty incidents. On the other hand, many convicted people have been released from prison on the basis of DNA techniques, which were not available when a crime had originally been committed.
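The comparison of short tandem repeat lengths can be sketched as follows. This is a simplified illustration that assumes the sequence at each locus is already known and that one repeat count per locus is enough; laboratory profiling measures fragment lengths and records two alleles per locus. The locus names, repeat units, sequences and function names are all hypothetical.

```python
def longest_tandem_run(sequence, unit):
    """Largest number of back-to-back copies of unit found anywhere in sequence."""
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    best = 0
    for start in range(len(sequence)):
        count = 0
        # Keep stepping one unit length at a time while the unit repeats.
        while sequence.startswith(unit, start + count * len(unit)):
            count += 1
        best = max(best, count)
    return best


def str_profile(samples, units):
    """Map each locus name to the repeat count observed in its sequence."""
    return {locus: longest_tandem_run(seq, units[locus]) for locus, seq in samples.items()}


units = {"locus_1": "AGAT", "locus_2": "TCTA"}
scene = {"locus_1": "TTAGATAGATAGATCC", "locus_2": "GGTCTATCTATCAA"}
suspect = {"locus_1": "CAAGATAGATAGATGG", "locus_2": "AATCTATCTAGG"}

print(str_profile(scene, units))
print(str_profile(scene, units) == str_profile(suspect, units))
```

Two samples are reported as matching only when every locus agrees, which is why profiles built from many loci are far more discriminating than any single repeat count.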

Expression cloning

One of the most basic techniques of molecular biology to study protein function is expression cloning. In this technique, DNA coding for a protein of interest is cloned (using PCR and/or restriction enzymes) into a plasmid (known as an expression vector). This plasmid may have special promoter elements to drive production of the protein of interest, and may also have antibiotic resistance markers to help follow the plasmid.

This plasmid can be inserted into either bacterial or animal cells. Introducing DNA into bacterial cells can be done by transformation (via uptake of naked DNA), conjugation (via cell-cell contact) or by transduction (via viral vector). Introducing DNA into eukaryotic cells, such as animal cells, by physical or chemical means is called transfection. Several different transfection techniques are available, such as calcium phosphate transfection, electroporation, microinjection and liposome transfection. DNA can also be introduced into eukaryotic cells using viruses or bacteria as carriers; the latter is sometimes called bactofection and in particular uses Agrobacterium tumefaciens. The plasmid may be integrated into the genome, resulting in a stable transfection, or may remain independent of the genome, called transient transfection.

In either case, DNA coding for a protein of interest is now inside a cell, and the protein can now be expressed. A variety of systems, such as inducible promoters and specific cell-signaling factors, are available to help express the protein of interest at high levels. Large quantities of a protein can then be extracted from the bacterial or eukaryotic cell. The protein can be tested for enzymatic activity under a variety of situations, the protein may be crystallized so its tertiary structure can be studied, or, in the pharmaceutical industry, the activity of new drugs against the protein can be studied.

DNA sequencing methods

The first methods for sequencing DNA were developed in the mid-1970s. At that time, scientists could sequence only a few base pairs per year, not nearly enough to sequence a single gene, much less the entire human genome. By the time the HGP began in 1990, only a few laboratories had managed to sequence a mere 100,000 bases, and the cost of sequencing remained very high. Since then, technological improvements and automation have increased speed and lowered cost to the point where individual genes can be sequenced routinely, and some labs can sequence well over 100 million bases per year.

Beginning in the late 1990s, the scientific community witnessed a remarkable climax of accomplishments related to DNA sequencing. In addition to the historic sequencing of the human genome, sequences have now been generated for the genomes of several key model organisms, including the mouse (Mus musculus); the rat (Rattus norvegicus); two fruit flies (Drosophila melanogaster and D. pseudoobscura); two roundworms (Caenorhabditis elegans and C. briggsae); yeast (Saccharomyces cerevisiae) and several other fungi; a malaria-carrying mosquito (Anopheles gambiae) along with a malaria-causing parasite (Plasmodium falciparum); two sea squirts (Ciona savignyi and C. intestinalis); a long list of microbes; and a couple of plants, including mustard weed (Arabidopsis thaliana) and rice (Oryza sativa). Sequencing work is well underway on the honey bee (Apis mellifera), and is just getting started or expected to begin soon on the chimpanzee (Pan troglodytes), the cow (Bos taurus), the dog (Canis familiaris) and the chicken (Gallus gallus). The relative genetic simplicity of many of these model organisms makes them ideal terrain for future technology development.

Although providing a single reference sequence of the human genome is an extraordinary achievement, further advances in sequencing technology are necessary so large amounts of DNA can be manipulated and compared with other genomes quickly and cheaply. Comparing differences among long stretches of DNA - one million bases or more - taken from many individuals should yield an enormous amount of information about the role of inheritance in disease susceptibility, response to environmental influences and even evolution.

DNA nanotechnology

DNA nanotechnology uses the unique molecular recognition properties of DNA and other nucleic acids to create self-assembling branched DNA complexes with useful properties. DNA is thus used as a structural material rather than as a carrier of biological information. This has led to the creation of two-dimensional periodic lattices as well as three-dimensional structures in the shapes of polyhedra. Nanomechanical devices and algorithmic self-assembly have also been demonstrated, and these DNA structures have been used to template the arrangement of other molecules such as gold nanoparticles and streptavidin proteins.