Date post: 18-Jun-2020
Bioinformatics Factsheet http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html

1 of 6 22.08.2006 13:36

The completion of a "working draft" of the human genome--an important milestone in the Human Genome Project--was announced in June 2000 at a press conference at the White House and was published in the February 15, 2001 issue of the journal Nature.

National Center for Biotechnology Information


Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

BIOINFORMATICS

Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. This deluge of genomic information has, in turn, led to an absolute requirement for computerized databases to store, organize, and index the data and for specialized tools to view and analyze the data.

What Is a Biological Database?

A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name, the input sequence with a description of the type of molecule, the scientific name of the source organism from which it was isolated, and often, literature citations associated with the sequence.
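As a minimal sketch of the "single file containing many records" idea, the Python below models a record with the fields listed above and retrieves records with a simple query. The class name `SequenceRecord`, the function `query_by_organism`, and the three records are hypothetical placeholders chosen for illustration; they are not the format of any NCBI database.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One record in a toy nucleotide sequence database (hypothetical layout)."""
    contact_name: str
    sequence: str
    molecule_type: str          # description of the type of molecule, e.g. "DNA"
    organism: str               # scientific name of the source organism
    citations: list = field(default_factory=list)  # literature citations, if any


# Placeholder records invented for this example; not real database entries.
DATABASE = [
    SequenceRecord("Contact A", "ATGGCCTAA", "DNA", "Homo sapiens"),
    SequenceRecord("Contact B", "AUGGCCUAA", "mRNA", "Mus musculus"),
    SequenceRecord("Contact C", "ATGAAATGA", "DNA", "Homo sapiens"),
]


def query_by_organism(db, organism):
    """Retrieve every record whose source organism matches exactly."""
    return [rec for rec in db if rec.organism == organism]


human = query_by_organism(DATABASE, "Homo sapiens")
print(len(human))  # 2
```

Every record carries the same set of fields, which is what makes a uniform query such as `query_by_organism` possible.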

For researchers to benefit from the data stored in a database, two additional requirements must be met:

- easy access to the information
- a method for extracting only that information needed to answer a specific biological question

The data in GenBank are made available in a variety of ways, each tailored to a particular use, such as data submission or sequence searching.

Biology in the 21st century is being transformed from a purely lab-based science to an information science as well.

At NCBI, many of our databases are linked through a unique search and retrieval system, called Entrez. Entrez (pronounced ahn' tray) allows a user to not only access and retrieve specific information from a single database but to access integrated information from many NCBI databases. For example, the Entrez Protein database is cross-linked to the Entrez Taxonomy database. This allows a researcher to find taxonomic information (taxonomy is a division of the natural sciences that deals with the classification of animals and plants) for the species from which a protein sequence was derived.

What Is Bioinformatics?

Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. At the beginning of the "genomic revolution", a bioinformatics concern was the creation and maintenance of a database to store biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues but the development of complex interfaces whereby researchers could both access existing data as well as submit new or revised data.

Ultimately, however, all of this information must be combined to form a comprehensive picture of normal cellular activities so that researchers may study how these activities are altered in different disease states. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:

- the development and implementation of tools that enable efficient access to, and use and management of, various types of information
- the development of new algorithms (mathematical formulas) and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences

Although a human disease may not be found in exactly the same form in animals, there may be sufficient data for an animal model to allow researchers to make inferences about the process in humans.

NCBI's COGs database has been designed to simplify evolutionary studies of complete genomes and to improve functional assignment of individual proteins.

Why Is Bioinformatics So Important?

The rationale for applying computational approaches to facilitate the understanding of various biological processes includes:

- a more global perspective in experimental design
- the ability to capitalize on the emerging technology of database mining, the process by which testable hypotheses are generated regarding the function or structure of a gene or protein of interest by identifying similar sequences in better characterized organisms

Evolutionary Biology

New insight into the molecular basis of a disease may come from investigating the function of homologs of a disease gene in model organisms. In this case, homology refers to two genes sharing a common evolutionary history. Scientists also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary relationship.

Equally exciting is the potential for uncovering evolutionary relationships and patterns between different forms of life. With the aid of nucleotide and protein sequences, it should be possible to find the ancestral ties between different organisms. Thus far, experience has taught us that closely related organisms have similar sequences and that more distantly related organisms have more dissimilar sequences. Proteins that show a significant sequence conservation, indicating a clear evolutionary relationship, are said to be from the same protein family. By studying protein folds (distinct protein building blocks) and families, scientists are able to reconstruct the evolutionary relationship between two species and to estimate the time of divergence between two organisms since they last shared a common ancestor.
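One simple way to put a number on "similar sequences" is percent identity between two sequences that have already been aligned. The sketch below is an illustrative assumption, not the method used by any NCBI tool: the function name `percent_identity` is hypothetical, the input sequences are toy strings, and real analyses rely on alignment algorithms and statistical models rather than a position-by-position count.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which two sequences carry the same residue.

    Assumes the sequences are already aligned to equal length; a gap is written
    as "-" and never counts as a match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to equal length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return 100.0 * matches / len(seq_a)


# Toy sequences invented for illustration.
reference = "ACGTACGT"
print(percent_identity(reference, "ACGTACGA"))  # 87.5
print(percent_identity(reference, "ACCTAGGA"))  # 62.5
```

Under the observation in the text, the first comparison would suggest a closer relationship to the reference than the second.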

Phylogenetics is the field of biology that deals with identifying and understanding the relationships between the different kinds of life on earth.

Protein Modeling

The process of evolution has resulted in the production of DNA sequences that encode proteins with specific functions. In the absence of a protein structure that has been determined by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, researchers can try to predict the three-dimensional structure using protein or molecular modeling. This method uses experimentally determined protein structures (templates) to predict the structure of another protein that has a similar amino acid sequence (target).

Although molecular modeling may not be as accurate at determining a protein's structure as experimental methods, it is still extremely helpful in proposing and testing various biological hypotheses. Molecular modeling also provides a starting point for researchers wishing to confirm a structure through X-ray crystallography and NMR spectroscopy. Because the different genome projects are producing more sequences and because novel protein folds and families are being determined, protein modeling will become an increasingly important tool for scientists working to understand normal and disease-related processes in living organisms.

The Four Steps of Protein Modeling

1. Identify the proteins with known three-dimensional structures that are related to the target sequence
2. Align the related three-dimensional structures with the target sequence and determine those structures that will be used as templates
3. Construct a model for the target sequence based on its alignment with the template structure(s)
4. Evaluate the model against a variety of criteria to determine if it is satisfactory

Genome Mapping

Genomic maps serve as a scaffold for orienting sequence information. A few years ago, a researcher wanting to localize a gene, or nucleotide sequence, was forced to manually map the genomic region of interest, a time-consuming and often painstaking process. Today, thanks to new technologies and the influx of sequence data, a number of high-quality, genome-wide maps are available to the scientific community for use in their research.

Computerized maps make gene hunting faster, cheaper, and more practical for almost any scientist. In a nutshell, scientists would first use a genetic map to assign a gene to a relatively small area of a chromosome. They would then use a physical map to examine the region of interest close up, to determine a gene's precise location. In light of these advances, a researcher's burden has shifted from mapping a genome or genomic region of interest to navigating a vast number of Web sites and databases.

Map Viewer: A Tool for Visualizing Whole Genomes or Single Chromosomes

NCBI's Map Viewer is a tool that allows a user to view an organism's complete genome, integrated maps for each chromosome (when available), and/or sequence data for a genomic region of interest. When using Map Viewer, a researcher has the option of selecting either a "Whole-Genome View" or a "Chromosome or Map View". The Genome View displays a schematic for all of an organism’s chromosomes, whereas the Map View shows one or more detailed maps for a single chromosome. If more than one map exists for a chromosome, Map Viewer allows a display of these maps simultaneously.

Using Map Viewer, researchers can find answers to questions such as:

- Where does a particular gene exist within an organism's genome?
- Which genes are located on a particular chromosome and in what order?
- What is the corresponding sequence data for a gene that exists in a particular chromosomal region?
- What is the distance between two genes?
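The questions above amount to lookups over gene coordinates. The following is a rough sketch of how such lookups might work on a small in-memory map; it does not reflect how Map Viewer is implemented. The class `GeneLocus`, the helper functions, and the gene names and coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneLocus:
    """Position of one gene on a chromosome (1-based, inclusive coordinates)."""
    name: str
    chromosome: str
    start: int
    end: int


# Hypothetical loci invented for illustration; not real map data.
LOCI = [
    GeneLocus("geneC", "1", 9000, 9800),
    GeneLocus("geneA", "1", 1000, 2500),
    GeneLocus("geneB", "1", 4000, 5200),
    GeneLocus("geneD", "2", 300, 900),
]


def locate(name):
    """Where does a particular gene exist within the genome?"""
    for locus in LOCI:
        if locus.name == name:
            return locus
    raise KeyError(name)


def genes_on(chromosome):
    """Which genes are located on a chromosome, and in what order?"""
    on_chromosome = [g for g in LOCI if g.chromosome == chromosome]
    return [g.name for g in sorted(on_chromosome, key=lambda g: g.start)]


def distance_between(name_a, name_b):
    """Number of bases separating two genes on the same chromosome."""
    a, b = locate(name_a), locate(name_b)
    if a.chromosome != b.chromosome:
        raise ValueError("genes are on different chromosomes")
    first, second = sorted((a, b), key=lambda g: g.start)
    return max(0, second.start - first.end - 1)


print(genes_on("1"))                       # ['geneA', 'geneB', 'geneC']
print(distance_between("geneA", "geneB"))  # 1499
```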

The rapidly emerging field of bioinformatics promises to lead to advances in understanding basic biological processes and, in turn, advances in the diagnosis, treatment, and prevention of many genetic diseases. Bioinformatics has transformed the discipline of biology from a purely lab-based science to an information science as well. Increasingly, biological studies begin with a scientist conducting vast numbers of database and Web site searches to formulate specific hypotheses or to design large-scale experiments. The implications behind this change, for both science and medicine, are staggering.

Revised: March 29, 2004.


What Is a Genome http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html

1 of 22 22.08.2006 13:48


A Basic Introduction to the Science Underlying NCBI Resources

WHAT IS A GENOME?

Life is specified by genomes. Every organism, including humans, has a genome that contains all of the biological information needed to build and maintain a living example of that organism. The biological information contained in a genome is encoded in its deoxyribonucleic acid (DNA) and is divided into discrete units called genes. Genes code for proteins that attach to the genome at the appropriate positions and switch on a series of reactions called gene expression.

In 1909, Danish botanist Wilhelm Johannsen coined the word gene for the hereditary unit found on a chromosome. Nearly 50 years earlier, Gregor Mendel had characterized hereditary units as factors: observable differences that were passed from parent to offspring. Today we know that a single gene consists of a unique sequence of DNA that provides the complete instructions to make a functional product, called a protein. Genes instruct each cell type (such as skin, brain, and liver) to make discrete sets of proteins at just the right times, and it is through this specificity that unique organisms arise.

The Physical Structure of the Human Genome

Nuclear DNA

Inside each of our cells lies a nucleus, a membrane-bounded region that provides a sanctuary for genetic information. The nucleus contains long strands of DNA that encode this genetic information. A DNA chain is made up of four chemical bases: adenine (A) and guanine (G), which are called purines, and cytosine (C) and thymine (T), referred to as pyrimidines. Each base has a slightly different composition, or combination of oxygen, carbon, nitrogen, and hydrogen. In a DNA chain, every base is attached to a sugar molecule (deoxyribose) and a phosphate molecule, resulting in a nucleotide. Individual nucleotides are linked through the phosphate group, and it is the precise order, or sequence, of nucleotides that determines the product made from that gene.

Figure 1. The four DNA bases.

Each DNA nucleotide is made up of the sugar 2'-deoxyribose linked to a phosphate group and one of the four bases depicted above: adenine (top left), cytosine (top right), guanine (bottom left), and thymine (bottom right).

A DNA chain, also called a strand, has a sense of direction, in which one end is chemically different from the other. The so-called 5' end terminates in a 5' phosphate group (-PO4); the 3' end terminates in a 3' hydroxyl group (-OH). This is important because DNA strands are always synthesized in the 5' to 3' direction.
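Strand direction has a practical consequence when sequences are handled as text. By convention a sequence is written 5' to 3', and because the two strands of double-stranded DNA run in opposite directions and pair A with T and G with C, the partner strand is obtained by complementing each base and then reversing the result. The sketch below assumes an unambiguous A/C/G/T alphabet; the function name `reverse_complement` is a common convention rather than anything defined in this primer.

```python
# Translation table that swaps each base for its pairing partner (A<->T, G<->C).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3'.

    Assumes the input is written 5' to 3' and contains only A, C, G, and T.
    """
    return strand.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original strand, which is one quick sanity check on the pairing table.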

The DNA that constitutes a gene is a double-stranded molecule consisting of two chains running in opposite directions. The chemical nature of the bases in double-stranded DNA creates a slight twisting force that gives DNA its characteristic gently coiled structure, known as the double helix. The two strands are connected to each other by chemical pairing of each base on one strand to a specific partner on the other strand. Adenine (A) pairs with thymine (T), and guanine (G) pairs with cytosine (C). Thus, A-T and G-C base pairs are said to be complementary. This complementary base pairing is what makes DNA a suitable molecule for carrying our genetic information: one strand of DNA can act as a template to direct the synthesis of a complementary strand. In this way, the information in a DNA sequence is readily copied and passed on to the next generation of cells.

Organelle DNA

Not all genetic information is found in nuclear DNA. Both plants and animals have an organelle (a "little organ" within the cell) called the mitochondrion. Each mitochondrion has its own set of genes. Plants also have a second organelle, the chloroplast, which also has its own DNA. Cells often have multiple mitochondria, particularly cells requiring lots of energy, such as active muscle cells. This is because mitochondria are responsible for converting the energy stored in macromolecules into a form usable by the cell, namely, the adenosine triphosphate (ATP) molecule. Thus, they are often referred to as the power generators of the cell.

Unlike nuclear DNA (the DNA found within the nucleus of a cell), half of which comes from our mother and half from our father, mitochondrial DNA is only inherited from our mother. This is because the mitochondria that persist in the developing embryo of sexually reproducing animals come from the female gamete, or "egg", not from the male gamete, or sperm. Mitochondrial DNA also does not recombine; there is no shuffling of genes from one generation to the other, as there is with nuclear genes.

Large numbers of mitochondria are found in the tail of sperm, providing them with an engine that generates the energy needed for swimming toward the egg. However, when the sperm enters the egg during fertilization, the tail falls off, taking away the father's mitochondria.

Why Is There a Separate Mitochondrial Genome?

The energy-conversion process that takes place in the mitochondria takes place aerobically, in the presence of oxygen. Other energy conversion processes in the cell take place anaerobically, or without oxygen. The independent aerobic function of these organelles is thought to have evolved from bacteria that lived inside of other simple organisms in a mutually beneficial, or symbiotic, relationship, providing them with aerobic capacity. Through the process of evolution, these tiny organisms became incorporated into the cell, and their genetic systems and cellular functions became integrated to form a single functioning cellular unit. Because mitochondria have their own DNA, RNA, and ribosomes, this scenario is quite possible. This theory is also supported by the existence of a eukaryotic organism, a type of amoeba, which lacks mitochondria and therefore must always have a symbiotic relationship with an aerobic bacterium.

In addition to mRNA, DNA codes for other forms of RNA, including ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), and small nuclear RNAs (snRNAs). rRNAs and tRNAs participate in protein assembly whereas snRNAs aid in a process called splicing, the editing of mRNA before it can be used as a template for protein synthesis.

Why Study Mitochondria?

There are many diseases caused by mutations in mitochondrial DNA (mtDNA). Because the mitochondria produce energy in cells, symptoms of mitochondrial diseases often involve degeneration or functional failure of tissue. For example, mtDNA mutations have been identified in some forms of diabetes, deafness, and certain inherited heart diseases. In addition, mutations in mtDNA are able to accumulate throughout an individual's lifetime. This is different from mutations in nuclear DNA, which has sophisticated repair mechanisms to limit the accumulation of mutations. Mitochondrial DNA mutations can also concentrate in the mitochondria of specific tissues. A variety of deadly diseases are attributable to a large number of accumulated mutations in mitochondria. There is even a theory, the Mitochondrial Theory of Aging, that suggests that accumulation of mutations in mitochondria contributes to, or drives, the aging process. These defects are associated with Parkinson's and Alzheimer's disease, although it is not known whether the defects actually cause or are a direct result of the diseases. However, evidence suggests that the mutations contribute to the progression of both diseases.

In addition to the critical cellular energy-related functions, mitochondrial genes are useful to evolutionary biologists because of their maternal inheritance and high rate of mutation. By studying patterns of mutations, scientists are able to reconstruct patterns of migration and evolution within and between species. For example, mtDNA analysis has been used to trace the migration of people from Asia across the Bering Strait to North and South America. It has also been used to identify an ancient maternal lineage from which modern man evolved.

Ribonucleic Acids

Just like DNA, ribonucleic acid (RNA) is a chain, or polymer, of nucleotides with the same 5' to 3' direction of its strands. However, the ribose sugar component of RNA is slightly different chemically from that of DNA. RNA has a 2' oxygen atom that is not present in DNA. Other fundamental structural differences exist. For example, uracil takes the place of the thymine nucleotide found in DNA, and RNA is, for the most part, a single-stranded molecule. DNA directs the synthesis of a variety of RNA molecules, each with a unique role in cellular function. For example, all genes that code for proteins are first made into an RNA strand in the nucleus called a messenger RNA (mRNA). The mRNA carries the information encoded in DNA out of the nucleus to the protein assembly machinery, called the ribosome, in the cytoplasm. The ribosome complex uses mRNA as a template to synthesize the exact protein coded for by the gene.

"DNA makes RNA, RNA makes protein, and proteins make us."
Francis Crick

Proteins

Although DNA is the carrier of genetic information in a cell, proteins do the bulk of the work. Proteins are long chains containing as many as 20 different kinds of amino acids. Each cell contains thousands of different proteins: enzymes that make new molecules and catalyze nearly all chemical processes in cells; structural components that give cells their shape and help them move; hormones that transmit signals throughout the body; antibodies that recognize foreign molecules; and transport molecules that carry oxygen. The genetic code carried by DNA is what specifies the order and number of amino acids and, therefore, the shape and function of the protein.

The "Central Dogma", a fundamental principle of molecular biology, states that genetic information flows from DNA to RNA to protein. Ultimately, however, the genetic code resides in DNA because only DNA is passed from generation to generation. Yet, in the process of making a protein, the encoded information must be faithfully transmitted first to RNA then to protein. Transferring the code from DNA to RNA is a fairly straightforward process called transcription. Deciphering the code in the resulting mRNA is a little more complex. It first requires that the mRNA leave the nucleus and associate with a large complex of specialized RNAs and proteins that, collectively, are called the ribosome. Here the mRNA is translated into protein by decoding the mRNA sequence in blocks of three RNA bases, called codons, where each codon specifies a particular amino acid. In this way, the ribosomal complex builds a protein one amino acid at a time, with the order of amino acids determined precisely by the order of the codons in the mRNA.
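The flow from DNA to RNA to protein can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the input is taken to be the coding strand (so the mRNA is the same sequence with U in place of T), the codon table is the standard one packed into a 64-letter string of one-letter amino acid codes with `*` marking a stop, reading starts at the first base, and the example sequence is invented. The names `transcribe` and `translate` are hypothetical helpers.

```python
# Standard codon table, built from bases in the order U, C, A, G.
# The string lists one-letter amino acid codes; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (assumption: T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Decode an mRNA three bases at a time until a stop codon is reached."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":      # stop codon: no amino acid is added
            break
        protein.append(residue)
    return "".join(protein)


mrna = transcribe("ATGTTTTCTTGGTAA")   # toy sequence invented for illustration
print(mrna)             # AUGUUUUCUUGGUAA
print(translate(mrna))  # MFSW
```

Here M, F, S, and W are the one-letter codes for methionine, phenylalanine, serine, and tryptophan.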

In 1961, Marshall Nirenberg and Heinrich Matthaei correlated the first codon (UUU) with the amino acid phenylalanine. After that, it was not long before the genetic code for all 20 amino acids was deciphered.


A given amino acid can have more than one codon. These redundant codons usually differ at the third position. For example, the amino acid serine is encoded by UCU, UCC, UCA, or UCG (and, as Table 1 shows, also by AGU and AGC). This redundancy is key to accommodating mutations that occur naturally as DNA is replicated and new cells are produced. By allowing some of the random changes in DNA to have no effect on the ultimate protein sequence, a sort of genetic safety net is created. Some codons do not code for an amino acid at all but instruct the ribosome when to stop adding new amino acids.

Table 1. RNA triplet codons and their corresponding amino acids.

Rows: first base. Columns: second base.

    U                   C                   A                   G

U   UUU Phenylalanine   UCU Serine          UAU Tyrosine        UGU Cysteine
    UUC Phenylalanine   UCC Serine          UAC Tyrosine        UGC Cysteine
    UUA Leucine         UCA Serine          UAA Stop            UGA Stop
    UUG Leucine         UCG Serine          UAG Stop            UGG Tryptophan

C   CUU Leucine         CCU Proline         CAU Histidine       CGU Arginine
    CUC Leucine         CCC Proline         CAC Histidine       CGC Arginine
    CUA Leucine         CCA Proline         CAA Glutamine       CGA Arginine
    CUG Leucine         CCG Proline         CAG Glutamine       CGG Arginine

A   AUU Isoleucine      ACU Threonine       AAU Asparagine      AGU Serine
    AUC Isoleucine      ACC Threonine       AAC Asparagine      AGC Serine
    AUA Isoleucine      ACA Threonine       AAA Lysine          AGA Arginine
    AUG Methionine      ACG Threonine       AAG Lysine          AGG Arginine

G   GUU Valine          GCU Alanine         GAU Aspartate       GGU Glycine
    GUC Valine          GCC Alanine         GAC Aspartate       GGC Glycine
    GUA Valine          GCA Alanine         GAA Glutamate       GGA Glycine
    GUG Valine          GCG Alanine         GAG Glutamate       GGG Glycine

A translation chart of the 64 RNA codons.
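The redundancy visible in Table 1 can be checked programmatically. The sketch below rebuilds the same 64-codon chart from a packed string of one-letter amino acid codes (`*` marks a stop), groups codons by the residue they encode, and tests whether a single-base change leaves the amino acid unchanged. The helper name `is_silent` is a hypothetical choice for this example.

```python
from collections import defaultdict

# Same chart as Table 1, packed in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}

# Group the codons by the amino acid (or stop signal) they specify.
codons_for = defaultdict(list)
for codon, residue in CODON_TABLE.items():
    codons_for[residue].append(codon)


def is_silent(codon: str, position: int, new_base: str) -> bool:
    """True if changing one base (0-based position) leaves the amino acid unchanged."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return CODON_TABLE[mutated] == CODON_TABLE[codon]


print(len(CODON_TABLE))          # 64
print(sorted(codons_for["S"]))   # ['AGC', 'AGU', 'UCA', 'UCC', 'UCG', 'UCU']
print(sorted(codons_for["*"]))   # ['UAA', 'UAG', 'UGA']
print(is_silent("UCU", 2, "G"))  # True: UCG still encodes serine
print(is_silent("UCU", 0, "C"))  # False: CCU encodes proline
```

The third-position change is silent while the first-position change is not, which matches the observation that redundant codons usually differ at the third position.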

The Core Gene Sequence: Introns and Exons

Genes make up about 1 percent of the total DNA in our genome. In the human genome, the coding portions of a gene, called exons, are interrupted by intervening sequences, called introns. In addition, a eukaryotic gene does not code for a protein in one continuous stretch of DNA. Both exons and introns are "transcribed" into mRNA, but before it is transported to the ribosome, the primary mRNA transcript is edited. This editing process removes the introns, joins the exons together, and adds unique features to each end of the transcript to make a "mature" mRNA. One might then ask what the purpose of an intron is if it is spliced out after it is transcribed? It is still unclear what all the functions of introns are, but scientists believe that some serve as the site for recombination, the process by which progeny derive a combination of genes different from that of either parent, resulting in novel genes with new combinations of exons, the key to evolution.
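The editing step described above, removing introns and joining exons, can be illustrated with string slicing. This is only a sketch: it assumes the exon coordinates are already known, uses an invented transcript, and leaves out the features added to each end of a mature mRNA. The function name `splice` and the coordinate convention (0-based, end-exclusive) are assumptions made for the example.

```python
from typing import List, Tuple


def splice(primary_transcript: str, exons: List[Tuple[int, int]]) -> str:
    """Join the exon segments of a primary transcript, dropping everything else.

    Each exon is a (start, end) pair, 0-based and end-exclusive, listed in order.
    """
    return "".join(primary_transcript[start:end] for start, end in exons)


# Toy transcript invented for illustration: exon 1, one intron, exon 2.
exon_1 = "AUGGCC"
intron = "GUAAGUUUUCAG"
exon_2 = "UUUUAA"
primary = exon_1 + intron + exon_2

mature = splice(primary, [(0, 6), (18, 24)])
print(mature)  # AUGGCCUUUUAA
```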


Figure 2. Recombination.

Recombination involves pairing between complementary strands of two parental duplex DNAs (top and middle panel). This process creates a stretch of hybrid DNA (bottom panel) in which the single strand of one duplex is paired with its complement from the other duplex.

Gene Prediction Using Computers

When the complete mRNA sequence for a gene is known, computer programs are used to align the mRNA sequence with the appropriate region of the genomic DNA sequence. This provides a reliable indication of the beginning and end of the coding region for that gene. In the absence of a complete mRNA sequence, the boundaries can be estimated by ever-improving, but still inexact, gene prediction software. The problem is the lack of a single sequence pattern that indicates the beginning or end of a eukaryotic gene. Fortunately, the middle of a gene, referred to as the core gene sequence, has enough consistent features to allow more reliable predictions.
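To give a feel for what prediction software looks for, the sketch below scans a DNA string for open reading frames: stretches that begin with ATG and run in frame to a stop codon. This is a deliberately naive stand-in, not the approach of any actual gene prediction program. It ignores introns, the reverse strand, and every statistical signal real tools use, which is part of why the text calls such predictions inexact. The function name `find_orfs` and the `min_codons` threshold are assumptions for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA forms of UAA, UAG, and UGA


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end) spans of ATG...stop stretches on the forward strand.

    Spans are 0-based and end-exclusive, and include the stop codon.
    Only the three forward reading frames are scanned.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # close it and keep scanning
    return orfs


toy = "CCATGAAACCCTAGGG"   # toy sequence invented for illustration
print(find_orfs(toy))      # [(2, 14)]
print(toy[2:14])           # ATGAAACCCTAG
```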


From Genes to Proteins: Start to Finish

We just discussed that the journey from DNA to mRNA to protein requires that a cell identify where a gene begins and ends. This must be done both during the transcription and the translation process.

Transcription

Transcription, the synthesis of an RNA copy from a sequence of DNA, is carried out by an enzyme called RNA polymerase. This molecule has the job of recognizing the DNA sequence where transcription is initiated, called the promoter site. In general, there are two "promoter" sequences upstream from the beginning of every gene. The location and base sequence of each promoter site vary for prokaryotes (bacteria) and eukaryotes (higher organisms), but they are both recognized by RNA polymerase, which can then grab hold of the sequence and drive the production of an mRNA.

Eukaryotic cells have three different RNA polymerases, each transcribing a different class of genes. RNA polymerase II is responsible for synthesis of mRNAs from protein-coding genes. This polymerase requires a sequence resembling TATAA, commonly referred to as the TATA box, which is found 25-30 nucleotides upstream of the beginning of the gene, referred to as the initiator sequence.
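To make the idea of a promoter signal concrete, the following sketch looks for a TATAA motif whose first base lies 25 to 30 nucleotides upstream of a given start position. It is a simplified illustration: the function name `find_tata_box`, the exact-match rule, and the choice to measure distance from the first base of the motif are all assumptions made for the example. Real promoters only resemble TATAA and are usually scored with more tolerant methods.

```python
def find_tata_box(dna, start_site, motif="TATAA", min_up=25, max_up=30):
    """Return the index of the motif upstream of start_site, or None.

    The motif's first base must lie between min_up and max_up bases
    upstream of start_site (a 0-based index into dna).
    """
    dna = dna.upper()
    for distance in range(min_up, max_up + 1):
        pos = start_site - distance
        if pos >= 0 and dna[pos:pos + len(motif)] == motif:
            return pos
    return None


if __name__ == "__main__":
    # Made-up sequence: the motif starts 28 bases upstream of index 38.
    sequence = "G" * 10 + "TATAA" + "C" * 23 + "ATGAAA"
    print(find_tata_box(sequence, start_site=38))
```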

Transcription terminates when the polymerase stumbles upon a termination, or stop, signal. In eukaryotes, this process is not fully understood. Prokaryotes, however, tend to have a short region composed of G's and C's that is able to fold in on itself and form complementary base pairs, creating a stem in the new mRNA. This stem then causes the polymerase to trip and release the nascent, or newly formed, mRNA.

Translation

The beginning of translation, the process in which the genetic code carried by mRNA directs the synthesis of proteins from amino acids, differs slightly for prokaryotes and eukaryotes, although both processes always initiate at a codon for methionine. For prokaryotes, the ribosome recognizes and attaches at the sequence AGGAGGU on the mRNA, called the Shine-Dalgarno sequence, that appears just upstream from the methionine (AUG) codon. Curiously, eukaryotes lack this recognition sequence and simply initiate translation at the codon for the amino acid methionine, usually AUG, but sometimes GUG. Translation is terminated for both prokaryotes and eukaryotes when the ribosome reaches one of the three stop codons.
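The start-to-stop logic of translation can be sketched in a few lines of code. The example below is an illustration only: it uses the standard genetic code, begins at the first AUG it finds, and reads codons until it meets a stop codon. The function name `translate` is an assumption made for the example, and the sketch ignores the Shine-Dalgarno sequence and the occasional GUG start mentioned above.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with U
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG up to (not including) a stop codon."""
    mrna = mrna.upper().replace("T", "U")
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        # "X" marks any codon containing a base other than U, C, A, G.
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon: UAA, UAG, or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("CCAUGUUUUGA"))
```

Here each amino acid is written as its one-letter abbreviation, so M stands for methionine and F for phenylalanine.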

Over 98 percent of the genome is of unknown function. Although this DNA is often referred to as "junk" DNA, scientists are beginning to uncover the function of many of these intergenic sequences, the DNA found between genes.

Structural Genes, Junk DNA, and Regulatory Sequences

Structural Genes

Sequences that code for proteins are called structural genes. Although it is true that proteins are the major components of structural elements in a cell, proteins are also the real workhorses of the cell. They perform such functions as transporting nutrients into the cell; synthesizing new DNA, RNA, and protein molecules; and transmitting chemical signals from outside to inside the cell, as well as throughout the cell, both critical to the process of making proteins.

Regulatory Sequences

A class of sequences called regulatory sequences makes up a numerically insignificant fraction of the genome but provides critical functions. For example, certain sequences indicate the beginning and end of genes, mark sites for initiating replication and recombination, or provide landing sites for proteins that turn genes on and off. Like structural genes, regulatory sequences are inherited; however, they are not commonly referred to as genes.

Other DNA Regions

Forty to forty-five percent of our genome is made up of short sequences that are repeated, sometimes hundreds of times. There are numerous forms of this "repetitive DNA", and a few have known functions, such as stabilizing the chromosome structure or inactivating one of the two X chromosomes in developing females, a process called X-inactivation. The most highly repeated sequences found so far in mammals are called "satellite DNA" because their unusual composition allows them to be easily separated from other DNA. These sequences are associated with chromosome structure and are found at the centromeres (or centers) and telomeres (ends) of chromosomes. Although they do not play a role in the coding of proteins, they do play a significant role in chromosome structure, duplication, and cell division. The highly variable nature of these sequences makes them an excellent "marker" by which individuals can be identified based on the unique pattern of their satellite DNA.

Figure 3. A chromosome.

A chromosome is composed of a very long molecule of DNA and associated proteins that carry hereditary information. The centromere, shown at the center of this chromosome, is a specialized structure that appears during cell division and ensures the correct distribution of duplicated chromosomes to daughter cells. Telomeres are the structures that seal the end of a chromosome. Telomeres play a critical role in chromosome replication and maintenance by counteracting the tendency of the chromosome to otherwise shorten with each round of replication.

Another class of non-coding DNA is the "pseudogene", so named because it is believed to be a remnant of a real gene that has suffered mutations and is no longer functional. Pseudogenes may have arisen through the duplication of a functional gene, followed by inactivation of one of the copies. Comparing the presence or absence of pseudogenes is one method used by evolutionary geneticists to group species and to determine relatedness. Thus, these sequences are thought to carry a record of our evolutionary history.

How Many Genes Do Humans Have?

In February 2001, two largely independent draft versions of the human genome were published. Both studies estimated that there are 30,000 to 40,000 genes in the human genome, roughly one-third the number of previous estimates. More recently, scientists estimated that there are fewer than 30,000 human genes. However, we still have to make guesses at the actual number of genes, because not all of the human genome sequence is annotated and not all of the known sequence has been assigned a particular position in the genome.

So, how do scientists estimate the number of genes in a genome? For the most part, they look for tell-tale signs of genes in a DNA sequence. These include: open reading frames, stretches of DNA, usually greater than 100 bases, that are not interrupted by a stop codon such as TAA, TAG or TGA; start codons such as ATG; specific sequences found at splice junctions, a location in the DNA sequence where RNA splicing removes the non-coding areas to form a continuous gene transcript for translation into a protein; and gene regulatory sequences. This process is dependent on computer programs that search for these patterns in various sequence databases and then make predictions about the existence of a gene.

From One Gene–One Protein to a More Global Perspective

Only a small percentage of the 3 billion bases in the human genome becomes an expressed gene product. However, of the approximately 1 percent of our genome that is expressed, 40 percent is alternatively spliced to produce multiple proteins from a single gene. Alternative splicing refers to the cutting and pasting of the primary mRNA transcript into various combinations of mature mRNA. Therefore the one gene–one protein theory, originally framed as "one gene–one enzyme", does not precisely hold.
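The "cutting and pasting" behind alternative splicing can be pictured with a small sketch. The example below assumes the simplest possible model, in which the first and last exons are always kept and any internal exon may be skipped. The function name `exon_skipping_isoforms` and the exon sequences are made up for illustration. In a real cell, splicing is tightly regulated and only some of these combinations would actually be produced.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """List the mature mRNAs possible if internal exons may be skipped.

    exons is an ordered list of at least two exon sequences. The first
    and last exons are always kept, and exon order is never changed.
    """
    first, *internal, last = exons
    isoforms = []
    for kept_count in range(len(internal) + 1):
        # combinations() preserves the original order of the exons.
        for kept in combinations(internal, kept_count):
            isoforms.append("".join([first, *kept, last]))
    return isoforms


if __name__ == "__main__":
    # Four hypothetical exons; the two internal ones are optional.
    for isoform in exon_skipping_isoforms(["AUGGC", "CAUU", "GGAC", "UAA"]):
        print(isoform)
```

Even under this toy model, a gene with n optional internal exons could give rise to 2^n different mature transcripts, which shows how one gene can yield several products.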

With so much DNA in the genome, why restrict transcription to a tiny portion, and why make that tiny portion work overtime to produce many alternate transcripts? This process may have evolved as a way to limit the deleterious effects of mutations. Genetic mutations occur randomly, and the effect of a small number of mutations on a single gene may be minimal. However, having many genes, each with small changes, could weaken the individual, and thus the species. On the other hand, if a single mutation affects several alternate transcripts at once, it is more likely that the effect will be devastating: the individual may not survive to contribute to the next generation. Thus, alternate transcripts from a single gene could reduce the chances that a mutated gene is transmitted.

Gene Switching: Turning Genes On and Off

The estimated number of genes for humans, fewer than 30,000, is not so different from the 25,300 known genes of Arabidopsis thaliana, commonly called thale cress, a small plant in the mustard family. Yet, we appear, at least at first glance, to be a far more complex organism. A person may wonder how this increased complexity is achieved. One answer lies in the regulatory system that turns genes on and off. This system also precisely controls the amount of a gene product that is produced and can further modify the product after it is made. This exquisite control requires multiple regulatory input points. One very efficient point occurs at transcription, such that an mRNA is produced only when a gene product is needed. Cells also regulate gene expression by post-transcriptional modification; by allowing only a subset of the mRNAs to go on to translation; or by restricting translation of specific mRNAs to only when the product is needed. At other levels, cells regulate gene expression through DNA folding, chemical modification of the nucleotide bases, and intricate "feedback mechanisms" in which some of the gene's own protein product directs the cell to cease further protein production.

Controlling Transcription

Promoters and Regulatory Sequences

Transcription is the process whereby RNA is made from DNA. It is initiated when an enzyme, RNA polymerase, binds to a site on the DNA called a promoter sequence. In most cases, the polymerase is aided by a group of proteins called "transcription factors" that perform specialized functions, such as DNA sequence recognition and regulation of the polymerase's enzyme activity. Other regulatory sequences include activators, repressors, and enhancers. These sequences can be cis-acting (affecting genes that are adjacent to the sequence) or trans-acting (affecting expression of the gene from a distant site, even on another chromosome).

The Globin Genes: An Example of Transcriptional Regulation

An example of transcriptional control occurs in the family of genes responsible for the production of globin. Globin is the protein that complexes with the iron-containing heme molecule to make hemoglobin. Hemoglobin transports oxygen to our tissues via red blood cells. In the adult, red blood cells do not contain DNA for making new globin; they are ready-made with all of the hemoglobin they will need.

During the first few weeks of life, embryonic globin is expressed in the yolk sac of the egg. By week five of gestation, globin is expressed in early liver cells. By birth, red blood cells are being produced, and globin is expressed in the bone marrow. Yet, the globin found in the yolk is not produced from the same gene as is the globin found in the liver or bone marrow stem cells. In fact, at each stage of development, different globin genes are turned on and off through a process of transcriptional regulation called "switching".

To further complicate matters, globin is made from two different protein chains: an alpha-like chain coded for on chromosome 16; and a beta-like chain coded for on chromosome 11. Each chromosome has the embryonic, fetal, and adult form lined up on the chromosome in a sequential order for developmental expression. The developmentally regulated transcription of globin is controlled by a number of cis-acting DNA sequences, and although there remains a lot to be learned about the interaction of these sequences, one known control sequence is an enhancer called the Locus Control Region (LCR). The LCR sits far upstream on the sequence and controls the alpha genes on chromosome 16. It may also interact with other factors to determine which alpha gene is turned on.

Thalassemias are a group of diseases characterized by the absence or decreased production of normal globin, and thus hemoglobin, leading to decreased oxygen in the system. There are alpha and beta thalassemias, defined by the defective gene, and there are variations of each of these, depending on whether the embryonic, fetal, or adult forms are affected and/or expressed. Although there is no known cure for the thalassemias, there are medical treatments that have been developed based on our current understanding of both gene regulation and cell differentiation. Treatments include blood transfusions, iron chelators, and bone marrow transplants. With continuing research in the areas of gene regulation and cell differentiation, new and more effective treatments may soon be on the horizon, such as the advent of gene transfer therapies.

The Influence of DNA Structure and Binding Domains

Sequences that are important in regulating transcription do not necessarily code for transcription factors or other proteins. Transcription can also be regulated by subtle variations in DNA structure and by chemical changes in the bases to which transcription factors bind. As stated previously, the chemical properties of the four DNA bases differ slightly, providing each base with unique opportunities to chemically react with other molecules. One chemical modification of DNA, called methylation, involves the addition of a methyl group (-CH3). Methylation frequently occurs at cytosine residues that are followed by guanine bases (CpG sites), oftentimes in the vicinity of promoter sequences. The methylation status of DNA often correlates with its functional activity, where inactive genes tend to be more heavily methylated. This is because the methyl group serves to inhibit transcription by attracting a protein that binds specifically to methylated DNA, thereby interfering with polymerase binding. Methylation also plays an important role in genomic imprinting, which occurs when both maternal and paternal alleles are present but only one allele is expressed while the other remains inactive. Another way to think of genomic imprinting is as "parent of origin differences" in the expression of inherited traits. Considerable intrigue surrounds the effects of DNA methylation, and many researchers are working to unlock the mystery behind this concept.

Controlling Translation

Translation is the process whereby the genetic code carried by an mRNA directs the synthesis of proteins. Translational regulation occurs through the binding of specific molecules, called repressor proteins, to a sequence found on an RNA molecule. Repressor proteins prevent a gene from being expressed. As we have just discussed, the default state for a gene is that of being expressed via the recognition of its promoter by RNA polymerase. Close to the promoter region is another cis-acting site called the operator, the target for the repressor protein. When the repressor protein binds to the operator, RNA polymerase is prevented from initiating transcription, and gene expression is turned off.

The cell cycle is the process that a cell undergoes to replicate.

Translational control plays a significant role in the process of embryonic development and cell differentiation. Upon fertilization, an egg cell begins to multiply to produce a ball of cells that are all the same. At some point, however, these cells begin to differentiate, or change into specific cell types. Some will become blood cells or kidney cells, whereas others may become nerve or brain cells. When all of the cells formed are alike, the same genes are turned on. However, once differentiation begins, various genes in different cells must become active to meet the needs of that cell type. In some organisms, the egg stores immature mRNAs that become translationally active only after fertilization. Fertilization then serves to trigger mechanisms that initiate the efficient translation of mRNA into proteins. Similar mechanisms serve to activate mRNAs at other stages of development and differentiation, such as when specific protein products are needed.

Mechanisms of Genetic Variation and Heredity

Does Everyone Have the Same Genes?

When you look at the human species, you see evidence of a process called genetic variation; that is, there are immediately recognizable differences in human traits, such as hair and eye color, skin pigment, and height. Then there are the not so obvious genetic variations, such as blood type. These expressed, or phenotypic, traits are attributable to genotypic variation in a person's DNA sequence. When two individuals display different phenotypes of the same trait, they are said to have two different alleles for the same gene. This means that the gene's sequence is slightly different in the two individuals, and the gene is said to be polymorphic, "poly" meaning many and "morph" meaning shape or form. Therefore, although people generally have the same genes, the genes do not have exactly the same DNA sequence. These polymorphic sites influence gene expression and also serve as markers for genomic research efforts.

Genetic Variation

Most genetic variation occurs during the phases of the cell cycle when DNA is duplicated. Mutations in the new DNA strand can manifest as base substitutions, such as when a single base gets replaced with another; deletions, where one or more bases are left out; or insertions, where one or more bases are added. Mutations can either be synonymous, in which the variation still results in a codon for the same amino acid, or non-synonymous, in which the variation results in a codon for a different amino acid. Mutations can also cause a frame shift, which occurs when the variation bumps the reference point for reading the genetic code down a base or two and results in loss of part, or sometimes all, of that gene product. DNA mutations can also be introduced by toxic chemicals and, particularly in skin cells, exposure to ultraviolet radiation.
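The difference between synonymous, non-synonymous, and frame-shift mutations can be illustrated by comparing a reference coding sequence with a variant. The sketch below is a simplified illustration, not a real variant-calling method. It assumes both inputs are coding-strand DNA that starts in frame at the first codon, and the function names `translate_cds` and `classify_mutation` are made up for the example.

```python
from itertools import product

# Standard genetic code for DNA codons, in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate an in-frame coding sequence codon by codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def classify_mutation(reference, variant):
    """Label a variant relative to a reference coding sequence."""
    if reference.upper() == variant.upper():
        return "no change"
    length_change = len(variant) - len(reference)
    if length_change != 0:
        # Insertions or deletions that are not a multiple of three
        # shift the reading frame for every codon that follows.
        return "frameshift" if length_change % 3 else "in-frame insertion or deletion"
    if translate_cds(reference) == translate_cds(variant):
        return "synonymous"
    return "non-synonymous"


if __name__ == "__main__":
    reference = "ATGGCTTAA"  # Met, Ala, stop
    print(classify_mutation(reference, "ATGGCCTAA"))  # GCT to GCC, still Ala
    print(classify_mutation(reference, "ATGGATTAA"))  # GCT to GAT, Ala to Asp
    print(classify_mutation(reference, "ATGGCTAA"))   # one base deleted
```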

The manner in which a cell replicates differs with the various classes of life forms, as well as with the end purpose of the cell replication. Cells that compose tissues in multicellular organisms typically replicate by organized duplication and spatial separation of their cellular genetic material, a process called mitosis. Meiosis is the mode of cell replication for the formation of sperm and egg cells in plants, animals, and many other multicellular life forms. Meiosis differs significantly from mitosis in that the cellular progeny have their complement of genetic material reduced to half that of the parent cell.

Mutations that occur in somatic cells (any cell in the body except gametes and their precursors) will not be passed on to the next generation. This does not mean, however, that somatic cell mutations, sometimes called acquired mutations, are benign. For example, as your skin cells prepare to divide and produce new skin cells, errors may be inadvertently introduced when the DNA is duplicated, resulting in a daughter cell that contains the error. Although most defective cells die quickly, some can persist and may even become cancerous if the mutation affects the ability to regulate cell growth.

Mutations and the Next Generation

There are two places where mutations can be introduced and carried into the next generation. In the first stages of development, a sperm cell and egg cell fuse. They then begin to divide, giving rise to cells that differentiate into tissue-specific cell types. One early type of differentiated cell is the germ line cell, which may ultimately develop into mature gametes. If a mutation occurs in the developing germ line cell, it may persist until that individual reaches reproductive age. Now the mutation has the potential to be passed on to the next generation.

Mutations may also be introduced during meiosis, the mode of cell replication for the formation of sperm and egg cells. In this case, the germ line cell is healthy, and the mutation is introduced during the actual process of gamete replication. Once again, the sperm or egg will contain the mutation, and during the reproductive process, this mutation may then be passed on to the offspring.

One should bear in mind that not all mutations are bad. Mutations also provide a species with the opportunity to adapt to new environments, as well as to protect a species from new pathogens. Mutations are what lie behind the popular saying of "survival of the fittest", the basic theory of evolution proposed by Charles Darwin in 1859. This theory proposes that as new environments arise, individuals carrying certain mutations that confer an evolutionary advantage will survive to pass these mutations on to their offspring. It does not suggest that a mutation is derived from the environment, but that survival in that environment is enhanced by a particular mutation. Some genes, and even some organisms, have evolved to tolerate mutations better than others. For example, some viral genes are known to have high mutation rates. Mutations serve the virus well by enabling adaptive traits, such as changes in the outer protein coat so that it can escape detection and thereby destruction by the host's immune system. Viruses also produce certain enzymes that are necessary for infection of a host cell. A mutation within such an enzyme may result in a new form that still allows the virus to infect its host but that is no longer blocked by an anti-viral drug. This will allow the virus to propagate freely in its environment.

Mendel's Laws: How We Inherit Our Genes

In 1866, Gregor Mendel published his study of the transmission of seven different pea traits, carried out by carefully test-crossing many distinct varieties of peas. Studying garden peas might seem trivial to those of us who live in a modern world of cloned sheep and gene transfer, but Mendel's simple approach led to fundamental insights into genetic inheritance, known today as Mendel's Laws. Mendel did not actually know or understand the cellular mechanisms that produced the results he observed. Nonetheless, he correctly surmised the behavior of traits and the mathematical predictions of their transmission, the independent segregation of alleles during gamete production, and the independent assortment of genes. Perhaps as amazing as Mendel's discoveries was the fact that his work was largely ignored by the scientific community for over 30 years!

Mendel's Principles of Genetic Inheritance

Law of Segregation: Each of the two inherited factors (alleles) possessed by the parent will segregate and pass into separate gametes (eggs or sperm) during meiosis, which will each carry only one of the factors.

Law of Independent Assortment: In the gametes, alleles of one gene separate independently of those of another gene, and thus all possible combinations of alleles are equally probable.

Law of Dominance: Each trait is determined by two factors (alleles), inherited one from each parent. These factors each exhibit a characteristic dominant, co-dominant, or recessive expression, and those that are dominant will mask the expression of those that are recessive.

How Does Inheritance Work?

Our discussion here is restricted to sexually reproducing organisms where each gene in an individual is represented by two copies, called alleles, one on each chromosome of a pair. There may be more than two alleles, or variants, for a given gene in a population, but only two alleles can be found in an individual. Therefore, the probability that a particular allele will be inherited is 50:50; that is, alleles randomly and independently segregate into daughter cells, although there are some exceptions to this rule.

The term diploid describes a state in which a cell has two sets of homologous chromosomes, or two chromosomes that are the same. The maturation of germ line stem cells into gametes requires that the diploid number of each chromosome be reduced by half. Hence, gametes are said to be haploid, having only a single set of homologous chromosomes. This reduction is accomplished through a process called meiosis, where one chromosome in a diploid pair is sent to each daughter gamete. Human gametes, therefore, contain 23 chromosomes, half the number found in somatic cells (all the other cells of the body).

Because the chromosomes in one pair separate independently of all other chromosomes, each new gamete has the potential for a totally new combination of chromosomes. In humans, the independent segregation of the 23 chromosome pairs can lead to more than 8 million (2^23) different combinations in one individual's gametes. Only one of these gametes will combine with one of the more than 8 million possible combinations from the other parent, generating a staggering potential for individual variation. Yet, this is just the beginning. Even more variation is possible when you consider the recombination between sections of chromosomes during meiosis as well as the random mutation that can occur during DNA replication. With such a range of possibilities, it is amazing that siblings look so much alike!

Expression of Inherited Genes

Gene expression, as reflected in an organism's phenotype, is based on conditions specific for each copy of a gene. As we just discussed, for every human gene there are two copies, and for every gene there can be several variants or alleles. If both alleles are the same, the gene is said to be homozygous. If the alleles are different, they are said to be heterozygous. For some alleles, their influence on phenotype takes precedence over all other alleles. For others, expression depends on whether the gene appears in the homozygous or heterozygous state. Still other phenotypic traits are a combination of several alleles from several different genes. Determining the allelic condition used to be accomplished solely through the analysis of pedigrees, much the way Mendel carried out his experiments on peas. However, this method can leave many questions unanswered, particularly for traits that are a result of the interaction between several different genes. Today, molecular genetic techniques exist that can assist researchers in tracking the transmission of traits by pinpointing the location of individual genes, identifying allelic variants, and identifying those traits that are caused by multiple genes.

The Nature of Alleles

A dominant allele is an allele that is almost always expressed, even if only one copy is present. Dominant alleles express their phenotype even when paired with a different allele, that is, when heterozygous. In this case, the phenotype appears the same in both the heterozygous and homozygous states. Just how the dominant allele overshadows the other allele depends on the gene, but in some cases the dominant allele produces a gene product that the other allele does not. Well-known dominant alleles occur in the human genes for Huntington disease, a form of dwarfism called achondroplasia, and polydactylism (extra fingers and toes).

On the other hand, a recessive allele will be expressed only if there are two identical copies of that allele, or for a male, if one copy is present on the X chromosome. The phenotype of a recessive allele is only seen when both alleles are the same. When an individual has one dominant allele and one recessive allele, the trait is not expressed because it is overshadowed by the dominant allele. The individual is said to be a carrier for that trait. Examples of recessive disorders in humans include sickle cell anemia, Tay-Sachs disease, and phenylketonuria (PKU).

A particularly important category of genetic linkage has to do with the X and Y sex chromosomes. These chromosomes not only carry the genes that determine male and female traits, but also those for some other characteristics as well. Genes that are carried by either sex chromosome are said to be sex linked. Men normally have an X and a Y combination of sex chromosomes, whereas women have two X's. Because only men inherit Y chromosomes, they are the only ones to inherit Y-linked traits. Both men and women can have X-linked traits because both inherit X chromosomes.

X-linked traits not related to feminine body characteristics are primarily expressed in the phenotype of men. This is because men have only one X chromosome. Consequently, genes on that chromosome that do not code for gender are expressed in the male phenotype, even if they are recessive. In women, a recessive allele on one X chromosome is often masked in their phenotype by a dominant normal allele on the other. This explains why women are frequently carriers of X-linked traits but more rarely have them expressed in their own phenotypes. In humans, at least 320 genes are X-linked. These include the genes for hemophilia, red–green color blindness, and congenital night blindness. There are at least a dozen Y-linked genes, in addition to those that code for masculine physical traits.

It is now known that one of the X chromosomes in the cells of human females is completely, or mostly, inactivated early in embryonic life. This is a normal self-preservation action to prevent a potentially harmful double dose of genes. Recent research points to the "Xist" gene on the X chromosome as being responsible for a sequence of events that silences one of the X chromosomes in women. The inactivated X chromosomes become highly compacted structures known as Barr bodies. The presence of Barr bodies has been used at international sport competitions as a test to determine whether an athlete is a male or a female.

Exceptions to Mendel's Laws

There are many examples of inheritance that appear to be exceptions to Mendel's laws. Usually, they turn out to represent complex interactions among various allelic conditions. For example, co-dominant alleles both contribute to a phenotype. Neither is dominant over the other. Control of the human blood group system provides a good example of co-dominant alleles.

The Four Basic Blood Types

There are four basic blood types, and they are O, A, B, and AB. We know that our blood type is determined by the "alleles" that we inherit from our parents. For the blood type gene, there are three basic blood type alleles: A, B, and O. We all have two alleles, one inherited from each parent. The possible combinations of the three alleles are OO, AO, BO, AB, AA, and BB. Blood types A and B are "co-dominant" alleles, whereas O is "recessive". A codominant allele is apparent even if only one is present; a recessive allele is apparent only if two recessive alleles are present. Because blood type O is recessive, it is not apparent if the person inherits an A or B allele along with it. So, the possible allele combinations result in a particular blood type in this way:

OO = blood type O
AO = blood type A
BO = blood type B
AB = blood type AB
AA = blood type A
BB = blood type B

You can see that a person with blood type B may have a B and an O allele, or they may have two B alleles. If both parents are blood type B and both have a B and a recessive O, then their children will either be BB, BO, or OO. If the child is BB or BO, they have blood type B. If the child is OO, he or she will have blood type O.
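The genotype-to-blood-type rules and the BO x BO cross described above can be worked through with a short Python sketch. The function names `blood_type` and `cross` are hypothetical helpers introduced only for this illustration, and the sketch assumes each parent passes on either of their two alleles with equal chance.

```python
from collections import Counter
from itertools import product

def blood_type(genotype):
    """Map a two-allele ABO genotype such as "BO" to its blood type."""
    alleles = set(genotype)
    if alleles == {"A", "B"}:
        return "AB"   # A and B are co-dominant, so both show
    if "A" in alleles:
        return "A"    # AA or AO
    if "B" in alleles:
        return "B"    # BB or BO
    return "O"        # only OO shows the recessive type

def cross(parent1, parent2):
    """Count child genotypes over every pairing of one allele per parent."""
    return Counter("".join(sorted(pair)) for pair in product(parent1, parent2))

# Both parents are blood type B and each carries a recessive O allele.
children = cross("BO", "BO")
print(children)
print(Counter(blood_type(g) for g in children.elements()))
```

Of the four equally likely allele pairings, one is BB, two are BO, and one is OO, so under these assumptions three of the four pairings give blood type B and one gives blood type O.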

Pleiotropism, or pleiotropy, refers to the phenomenon in which a single gene is responsible for producing multiple, distinct, and apparently unrelated phenotypic traits, that is, an individual can exhibit many different phenotypic outcomes. This is because the gene product is active in many places in the body. An example is Marfan's syndrome, where there is a defect in the gene coding for a connective tissue protein. Individuals with Marfan's syndrome exhibit abnormalities in their eyes, skeletal system, and cardiovascular system.

Some genes mask the expression of other genes just as a fully dominant allele masks the expression of its recessive counterpart. A gene that masks the phenotypic effect of another gene is called an epistatic gene; the gene it subordinates is the hypostatic gene. The gene for albinism in humans is an epistatic gene. It is not part of the interacting skin-color genes. Rather, its dominant allele is necessary for the development of any skin pigment, and its recessive homozygous state results in the albino condition, regardless of how many other pigment genes may be present. Because of the effects of an epistatic gene, some individuals who inherit the dominant, disease-causing gene show only partial symptoms of the disease. Some, in fact, may show no expression of the disease-causing gene, a condition referred to as nonpenetrance. The individual in whom such a nonpenetrant mutant gene exists will be phenotypically normal but still capable of passing the deleterious gene on to offspring, who may exhibit the full-blown disease.
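The way the albinism gene overrides the skin-color genes can be illustrated with a toy Python function. Everything named here is an assumption made for the sketch: the symbols "A" and "a" for the dominant and recessive albinism alleles, the function name `skin_phenotype`, and the idea of summarizing the other pigment genes as a simple count of pigment-producing alleles.

```python
def skin_phenotype(albinism_genotype, pigment_allele_count):
    """Toy model of epistasis: the albinism gene masks the pigment genes.

    albinism_genotype: "AA", "Aa", or "aa" (hypothetical symbols)
    pigment_allele_count: number of pigment-producing alleles at the
        other, hypostatic skin-color genes (a simplifying assumption)
    """
    if albinism_genotype == "aa":
        # Recessive homozygote: no pigment, whatever the other genes say.
        return "albino"
    # At least one dominant allele: the pigment genes are free to act.
    return f"pigmented (pigment alleles: {pigment_allele_count})"

print(skin_phenotype("aa", 6))
print(skin_phenotype("Aa", 6))
print(skin_phenotype("AA", 2))
```

The first call returns "albino" even though six pigment alleles are present, which is the masking effect the text calls epistasis.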

Then we have traits that are multigenic, that is, they result from the expression of several different genes. This is true for human eye color, in which at least three different genes are responsible for determining eye color. A brown/blue gene and a central brown gene are both found on chromosome 15, whereas a green/blue gene is found on chromosome 19. The interaction between these genes is not well understood. It is speculated that there may be other genes that control other factors, such as the amount of pigment deposited in the iris. This multigenic system explains why two blue-eyed individuals can have a brown-eyed child.

Speaking of eye color, have you ever seen someone with one green eye and one brown eye? In this case, somatic mosaicism may be the culprit. This is probably easier to describe than explain. In multicellular organisms, every cell in the adult is ultimately derived from the single-cell fertilized egg. Therefore, every cell in the adult normally carries the same genetic information. However, what would happen if a mutation occurred in only one cell at the two-cell stage of development? Then the adult would be composed of two types of cells: cells with the mutation and cells without. If a mutation affecting melanin production occurred in one of the cells in the cell lineage of one eye but not the other, then the eyes would have different genetic potential for melanin synthesis. This could produce eyes of two different colors.

Penetrance refers to the degree to which a particular allele is expressed in a population phenotype. If every individual carrying a dominant mutant gene demonstrates the mutant phenotype, the gene is said to show complete penetrance.
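Penetrance is commonly summarized as the fraction of allele carriers who actually show the phenotype. A minimal Python sketch of that calculation follows; the function name `penetrance` and the counts used in the example are hypothetical and are not data from the primer.

```python
def penetrance(carriers_showing_phenotype, total_carriers):
    """Fraction of allele carriers who show the associated phenotype."""
    if total_carriers <= 0:
        raise ValueError("total_carriers must be positive")
    if not 0 <= carriers_showing_phenotype <= total_carriers:
        raise ValueError("affected carriers must be between 0 and the total")
    return carriers_showing_phenotype / total_carriers

# Hypothetical counts, for illustration only.
print(penetrance(80, 100))    # incomplete penetrance
print(penetrance(100, 100))   # complete penetrance
```

A result of 1.0 corresponds to complete penetrance as defined above; anything lower means some carriers are nonpenetrant.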

Molecular Genetics: The Study of Heredity, Genes, and DNA

As we have just learned, DNA provides a blueprint that directs all cellular activities and specifies the developmental plan of multicellular organisms. Therefore, an understanding of DNA, gene structure, and function is fundamental for an appreciation of the molecular biology of the cell. Yet, it is important to recognize that progress in any scientific field depends on the availability of experimental tools that allow researchers to make new scientific observations and conduct novel experiments. The last section of the genetic primer concludes with a discussion of some of the laboratory tools and technologies that allow researchers to study cells and their DNA.

