1
Introduction to molecular biology
Summary
• Cells
• Chromosomes
• DNA
• RNA
• Aminoacids
• Proteins
• Genomics
• Transcriptomics
• Proteomics
2
Cells
All the living beings are composed of cells, thatare the basic unit of life. Each cell derives fromother cell.
Prokaryotes
No nucleus or internal membranes. Eukaryotes
• Nucleus. • Internal membranes.• Organelles inside the cell that play different and specific roles.
Organisms can be:Unicellular
• Prokaryotes: bacteria, rchaea.• Eukaryotes: baker yeast.
Multicellular
•Eukaryotes: animals, plants, fungi…
Human beings: 60 E18 cells, 320 different types
Cells
Composition
70% Water7% Small molecules:
• Salts• Lipids• Aminoacids• Nucleotides
23% Macromolecules:• Proteins• Polysaccharide
Cell functions:
A cell contains all the necessary information toperform a replication (a virus does not!). Processesdeveloped by cells include:
Metabolic pathwaysTraduction of RNA to proteins…
3
Chromosomes
• The nucleus of Eukaryots containsone or more DNA molecules (doublestranded). Each of thesesupermoleluces are calledchomosomes.
• For examples, human beings have22 pairs of autosomes) and 1 pair ofsexual chromosomes. :
Cells
• Almost all the cells in an organism have the samegenome (some times there are slight differences).
• The DNA represents all the information needed by thecell to perform its functions.
4
Three basic macromolecules for life
• DNA
– It contains all the information needed by the cell (the “hard drive”)
– Actually, since almost all the cells in an organism share the samegenome, it contains all the information needed by ANY cell to performtheir functions.
– It stays (almost) always in the nucleus.
• RNA
– RNA has two main functions:
• It mimics the information in DNA (located in the nucleus) and migrates toother parts of the cell where this information is used (messenger RNA, mRNA)
• It has a crucial role in protein synthesis (transfer RNA, tRNA).
• Proteins
– Many different functions (signalling, structural, enzymes, regulation…). They are the key constituents of the organism.
Central dogma of molecular biology
• It is not a DOGMA– A dogma is some
important part of thefaith that must be believed.
– The researcher thatcoined this term finallyrecognized that “I did notknow what dogma meant”.
– There are strong supportto this… it is not a dogma (or at least there are other fields of knowledgethat deserve this termmuch more ☺.
5
DNA vs RNA
DNA: code of life
There are four different nucleotides for all living beings: Adenine (A), Guanine(G), Cytosine (C) y Thymine (T). They have two complimentary pairs: A-T andC-G
DeoxyriboNucleic Acid
7
Structure of a nucleotide
• Purines: Adenine (A) and Guanine (G). There is a double ring.
• Pyrimidines: Thymine (T), Cytosine (C) andUracil (U). Thymine is substituted by Uracilon RNA. Single ring.
One “nucleotide” is a compound formed by one base (A, C, T ó G), one sugarmolecule and phosphoric acid.
How to read a DNA sequence?
The DNA molecule is created when bonds betwen 5’
and 3’ of the nucleotides are set. DNA is alwyas readfrom 5’ to 3’. Funny equivocations…
All the nucleotides have two
bonds: 5’ and 3’. The number isthe position of the carbon atoms
in the sugar molecule.
Nucleotides, in turn, form a
phosphodiesther bond.
Sequence: TGACT
8
DNA
• It can be seen as a codewith only 4 letters instead of2 (binary coding).
• How many letters? 16 todescribe differentpossibilities.
None (gap)----
aNyG or A or T or CN
not-C, D follows CG or A or TD
not-T (not-U), V follows UG or C or AV
not-A, B follows AG or T or CB
not-G, H follows G in the alphabetA or C or TH
Weak interaction (2 H bonds)A or TW
Strong interaction (3 H bonds)G or CS
KetoG or TK
aMinoA or CM
pYrimidineT or CY
puRineG or AR
CytosineCC
ThymineTT
AdenineAA
GuanineGG
Origin of the nameMeaningSymbol
Code:
DNA is double stranded
• Hydrogen bonds between the nucleotide pairs..
• DNA is not symmetric!! It has two directions and isread, always, from 5’ to 3’.
• Both strands are complimentary: A-T and C-G
– Forward strand
– Reverse strand
9
Mitochondrial DNA
Mitochondrial DNA (mtDNA) is the DNA located in organelles called mitochondria. All mtDNA is received by the mother (since mitochondria is provided by the zygote.
Mitochondria are sometimes described as "cellular power plants," because they generate ATP, used as a source of chemical energy .
RNA (RiboNucleic Acid)
• Protein synthesis occurs in the Ribosomes• Organelles located in in the cytoplasm outside the nucleus.
• DNA is in the nucleus.• RNA transports the information from the nucleus to the Ribosomes
• The mechanism that creates RNA complimentary to DNA is called
Transcription.
DNA vs RNA:
(T) is substituted by uracil (U). • RNA is single stranded. It can bend and form two stranded chains (palindromes) (“Sit on a potato pan, Otis”).
10
Messenger RNA (mRNA)
• Part of the DNA is
trascripted into RNA (RNA is
a copy of the DNA).
• RNA goes to the cytoplasm
and in the ribosomes, mRNA
is used to build proteins.
• RNA itself is the message
from the nucleus to the
cytoplasm.
Transcription process
Inititiation
In the first stage, RNA polimerase binds to a region of DNA (that is called the promoter). The enzyme opens de DNA, and allows the creation of the RNA molecule thathas a complementary sequence to the DNA.
Elongation
RNA polimerase moves along the supporting strand andRNA nucleotides are inserted in the new RNA molecule
Termination
RNA termination process is a complex process (it involvespalidrons –hairpins- in prokaryots and more complexprocesses in eukaryots). Once it has finished, DNA isclosed again, and RNA moves form the nucleus to thecytoplasm.
11
Transcription in action
RNA maturation
• In Eukaryots, the sequence that appears in the genome is not exactly the onetranslated.
• RNA has a maturation process• Remove intermediate sequences called introns.
• Join the exons using polymerases.
• A single gene (DNA) can raise several variants (using different exons). This process iscalled Alternative Splicing.
12
What is a gene?
• Promoter region. It contains the necessary sequences to activate or deactivatethe gene. Limits are fuzzy and depends on different genes. Proximal promoter isconsidered to be 1000-50000 bp upstream the TSS (transcription start site)
• Exons: Coding regions of the gen (it converts into proteins) 1 to 178 exones/gene (average: 8.8)
8 bp to 17 kb /exon (average145 bp)
• Introns: Non coding region flanked by two exons.Size (average): 1 kb – 50 kb /intron (much larger than exons)
• Size of a gene: the largest: 2.4 Mb (Dystrophin). Average: 27 kb.
13
PROTEINS
Aminoacids
Amino acids are the basic structural building units of proteins. An amino acid is a molecule that contains both amine and carboxyl functional groups with the general formula H2NCHRCOOH, where R is an organic
substituent.
They form polymer chains
Short ones called peptides, large ones called polypeptides or proteins.
TranslationProcess to form the protein according to the mRNA template.
As both the amine and carboxylic acid groups of amino acids can react to form amide bonds, one amino
acid molecule can react with another and become joined through an amide linkage. This polymerization
of amino acids is what creates proteins.
14
Aminoacids
• 20 standard aminoacids• Bricks to build proteins.
• 10 essential amino acids
• Cannot be synthesized by human body. • They therefore must be obtained from food• Plants synthesizes all ofthem.
Aminoacids
Amino Acid 3-Letter 1-Letter polarity acidity hydrophobycityAlanine Ala A nonpolar neutral 1.8Arginine Arg R polar basic -4.5Asparagine Asn N polar neutral -3.5Aspartic acid Asp D polar acidic -3.5Cysteine Cys C polar neutral 2.5Glutamic acid Glu E polar acidic -3.5Glutamine Gln Q polar neutral -3.5Glycine Gly G nonpolar neutral -0.4Histidine His H polar basic -3.2Isoleucine Ile I nonpolar neutral 4.5Leucine Leu L nonpolar neutral 3.8Lysine Lys K polar basic -3.9Methionine Met M nonpolar neutral 1.9Phenylalanine Phe F nonpolar neutral 2.8Proline Pro P nonpolar neutral -1.6Serine Ser S polar neutral -0.8Threonine Thr T polar neutral -0.7Tryptophan Trp W nonpolar neutral -0.9Tyrosine Tyr Y polar neutral -1.3Valine Val V nonpolar neutral 4.2
15
Proteins
• Proteins are large molecules composed of aminoacids.
• Their 3D structure is complex
• It is not a double helix as DNA: the shape is different for each of
them.
• Proteins fold. This folding plays a crucial role in their function
• For example, mad cow disease is produced by an anormal folding
of a protein.
Protein structure
• Protein structure is crucial todetermine their chemicalproperties, and even, theirfunction.
• 3D structure determines whichare the aminoacids in thesurface.
• There are 4 levels at whichstructure can be studied:1. Aminoacid sequence
2. Polipeptide folding
3. Protein shape
4. Protein interactions (that includechanges in the positions of theatoms).
16
Central Dogma (once again)
• Transcription brings the data fromDNA to RNA
• RNA from the nucleus to theRibosoms
• Translation obtains protein accordingto the genetic code and thecorresponding mRNA
– tRNA is used as a lorry to carry theaminoacids as we will see.
Translation
• Translation is the second step in the central dogma. • mARN is decoded using the genetic code
• Aminoacids follow the sequence givenby mRNA.
• This process takes place in the cytoplasm.• tRNA is used as a “lorry” to carry theaminoacids.•Ribosomes are the factories to build the
proteins.
17
Trasnfer RNA (tRNA)
tRNA is a RNA that is used to carry aminoacids to the ribosomes in order to build teh proteins.tRNA abundance is larger than mRNA (75% vs 15%)
Most RNA in the cell is tRNAtRNA acknoledges mRNA and transfer the correspondign aminoacid to the protein beingcreated.
Genetic code
Codon: a sequence of 3 nucleotides that codes for an aminoacid according tothis table.
AUG codes methionine, and is also the start code. First AUG in mRNA is the region where translation starts.
18
Some exceptions:
Genetic code is almost universal.
Other considerations…
• A codon is a sequence of 3 nuclotides (DNA or RNA) thatcodes for a particular aminoacid.
– There are 4 possible bases (RNA) : A, C, G y U
– 3 bases per codon
• Therefore, there are: 4 * 4 * 4 = 64 possible codons
• Special codons:
– Start codon: AUG. Translation starts in this codon. It also codesfor an aminoacid (methinine)
– Stop codons (three flavours): UAA, UAG, UGA
• There are 61 codons left to code 19 aminoacids
– Genetic code is redundant: the same aminoacid may be codedby several codons.
19
Translation again:
Anticodon: A sequence of 3
nucleotides in tRNA that acknowledgethe corresponding codon in mRNA.
Using the anticodon, the aminoacid toinclude in the protein is selected.
tRNA carries the “free” aminoacid and, in the Ribosome, it is joined to thepolypeptide chain that it is beingcreated.
For example, tRNA with anticodonUAC, corresponds to the AUG codonthat, in turn, codes methionine.
ATGGAAGTATTTAAAGCGCCACCTATTGGGATATAAG…
ATG GAA GTA TTT AAA GCG CCA CCT ATT GGG ATA TAA G…
M E V F K A P P I G I stop
Translation in action
20
In brief:
• Proteins are coded in the genes in ADN located in the nucleus. DNA stays always in the nucleus.
• Ribosomes are factories to build proteins located in the cytoplasm. mRNA carries the mesage from the nucleus to the ribosomes. There is an intermediate step called mRNA maturation in whichintrons are excluded and exons are retained.
• Ribosomes build what mRNA codes, using aminoacids that in turn, are carried by tRNA.– Ribosomes are composed of proteins and rRNA (a third class of RNA…
and there are even more!!)
Some important
Definitions in
BIOINFORMATICS
21
• Coding From DNA to protein is done by codons. There are three possibilities (startswith the first, the second or the third nucleotide in the sequence). We can use onestrand (forward) or the other (reverse strand). Each of these six possbilities are called a reading frame. Only one of them is valid. For example, this sequence has the following possibilities :
ATGCC (M) ATGCC (C) ATGCC (A)
• A sequence flanked by start codon and a stop codon is called an Open ReadingFrame (ORF).
ORF (Open Reading Frame):
Genomic Sequence
Open reading frame
ATG TGA
ORFs as gene candidates
• An open reading frame that begins with a start codon (ATG)
• Most prokaryotic genes code for proteins that are 60 or more amino acids in length
• The probability that a random sequence of nucleotides of length n has no stop codons is (61/64)n
• When n is 50, there is a probability of 92% that the random sequence contains a stop codon
• When n is 100, this probability exceeds 99%
– A large sequence without stop codons is probably coding a protein.
22
• Nucleic acids = composed of nucleotides= bases or base pairs
• Short form: nt (nucleotides), bp (base pairs).
• 400-nt: means 400 nucleotide positions (in DNA theyare 800!)
• 400-bp menas 400 base pairs
• 1000000-bp = 1000-kb = 1-Mb
Definitions:
Genomic analysis
How to build a whole genome in four steps:– Cut it!:
• Restriction enzymes break the DNA in specific sites.
• It is divided into sort pieces.
– Copy it!: • It is easy to copy DNA (it was designed for that!).
• We get several clones of each DNA sequence using the Polymerase ChainReaction (PCR).
– Using a cycle of PCR, the concentration of DNA is doubled
» 20 cycles of PCR increases the concentration by 2^20…
» (about 1 million times)
– Read it!:• Electrophoresis to read small fragments.
– Ensembl it!:• Using all the fragments, there are overlapping sequences that can be used to
perform the ensembl (just like building a puzzle).
• This puzzle has “large sky regions” difficult to build: there are large parts ofthe genomes quite repetitve (and they are also important).
23
Genomic analysis and bioinformatics
DNA sequence analysis
Gene hunting
Protein sequenceanalysis
2001: First draft versionof the human genome. 2003: Human genomecurated. First “releaseversion”. Mouse genomecompleted.
Protein function can be inferred from
their sequence and structure. Structure
analysis gives better resutls
Once that we have the sequence we can find genes (using
statistical properties of the intra gene regions).
It is also important to measure gene expression and predict
their function.
• DNA– Useful for genomic diseases
• Single gene (mendelian), chromosomal.
• Multifactorial o complex diseases.
Predisposition to develop a disease
– Does not change if the organism has an acquired disease condition �
Not valid as a marker of an acquiered disease
• RNA– Easy to measure
– RNA concentration changes for disease state
Early marker for different diseases
• Proteins
– It is difficult to perform a whole proteome analysis.
– They finally explain most of the disease targets� Closer to thebiological fact
Most reasonable drug targets
Bioinformatics analysis:
24
Genomes:
ORGANISM CHROMOSOMES Size GENE Number
Homo sapiens
(Humans)23 3,200,000,000 ~ 30,000
Mus musculus
(Mouse)20 2,600,000,000 ~30,000
Drosophila
melanogaster
(Fruit Fly)
4 180,000,000 ~18,000
Saccharomyces
cerevisiae (Yeast)16 14,000,000 ~6,000
Zea mays (Corn) 10 2,400,000,000 ???
Transcriptome:
• Different mRNA (including splice forms) for a particular organism.
– About 30.000 genes
– About 250.000 splice variants
• Other RNA fucntions related with trasncription regulation
– miRNA: small pieces of RNA that interrupt the transcription of a gene
25
Proteome
• The complete collection of proteins in an organism– Nobody knows how many… At least several millions.
– For each splice variant, using post transductional modifications, different proteins (with different functions can be obtaines).
• One gene � Several splice forms � Several proteins– Proteins are modified by other molecules that are joined to it
• Phosphate, acyl, methil, sugars, lípids, etc.,
• They change radically the activity of the protein
– There are many proteins with two forms: idle and active. Thetransition is done by adding a phosphate group.
• Many disparate biological activity can be assigned to a single gene.
Questions?