Introduction to bioinformatics Lecture 2 Genes and Genomes

Centre for Integrative Bioinformatics VU

Organisational

• Course website: http://ibi.vu.nl/teaching/mnw_2year/mnw2_2007.php or click on http://ibi.vu.nl (> teaching > Introduction to Bioinformatics)

• Course book: Bioinformatics and Molecular Evolution by Paul G. Higgs and Teresa K. Attwood (Blackwell Publishing), 2005, ISBN (Pbk) 1-4051-0683-2

• Lots of information about Bioinformatics can be found on the web.

.....acctc ctgtgcaaga acatgaaaca nctgtggttc tcccagatgg gtcctgtccc aggtgcacct gcaggagtcg ggcccaggac tggggaagcc tccagagctc aaaaccccac ttggtgacac aactcacaca tgcccacggt gcccagagcc caaatcttgt gacacacctc ccccgtgccc acggtgccca gagcccaaat cttgtgacac acctccccca tgcccacggt gcccagagcc caaatcttgt gacacacctc ccccgtgccc ccggtgccca gcacctgaac tcttgggagg accgtcagtc ttcctcttcc ccccaaaacc caaggatacc cttatgattt cccggacccc tgaggtcacg tgcgtggtgg tggacgtgag ccacgaagac ccnnnngtcc agttcaagtg gtacgtggac ggcgtggagg tgcataatgc caagacaaag ctgcgggagg agcagtacaa cagcacgttc cgtgtggtca gcgtcctcac cgtcctgcac caggactggc tgaacggcaa ggagtacaag tgcaaggtct ccaacaaagc aaccaagtca gcctgacctg cctggtcaaa ggcttctacc ccagcgacat cgccgtggag tgggagagca atgggcagcc ggagaacaac tacaacacca cgcctcccat gctggactcc gacggctcct tcttcctcta cagcaagctc accgtggaca agagcaggtg gcagcagggg aacatcttct catgctccgt gatgcatgag gctctgcaca accgctacac gcagaagagc ctctc.....

DNA sequence

Genome size

Organism                    Number of base pairs
φX-174 virus                5,386
Epstein-Barr virus          172,282
Mycoplasma genitalium       580,000
Haemophilus influenzae      1.8 × 10^6
Yeast (S. cerevisiae)       12.1 × 10^6
Human                       3.2 × 10^9
Wheat                       16 × 10^9
Lilium longiflorum          90 × 10^9
Salamander                  100 × 10^9
Amoeba dubia                670 × 10^9

Four DNA nucleotide building blocks

G-C is more strongly hydrogen-bonded than A-T

A gene codes for a protein

Protein

mRNA

DNA

transcription

translation

CCTGAGCCAACTATTGATGAA

PEPTIDE

CCUGAGCCAACUAUUGAUGAA
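
The DNA → mRNA → protein example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names (transcribe, translate) are illustrative, and it assumes the standard genetic code and that the given DNA is the coding (sense) strand read in frame from its first base.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in T, C, A, G order at each position
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna):
    """Return the mRNA corresponding to a coding-strand DNA sequence (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "CCTGAGCCAACTATTGATGAA"
print(transcribe(dna))  # CCUGAGCCAACUAUUGAUGAA
print(translate(dna))   # PEPTIDE
```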

Central Dogma of Molecular Biology

Replication: DNA → DNA
Transcription: DNA → mRNA
Translation: mRNA → Protein

Transcription is carried out by RNA polymerase (II)
Translation is performed on ribosomes
Replication is carried out by DNA polymerase
Reverse transcriptase copies RNA into DNA

Transcription + Translation = Expression

But DNA can also be transcribed into non-coding RNA …

tRNA (transfer): transfer of amino acids to the ribosome during protein synthesis.

rRNA (ribosomal): essential component of the ribosomes (complex with rProteins).

snRNA (small nuclear): mainly involved in RNA splicing (removal of introns). snRNPs.

snoRNA (small nucleolar): involved in chemical modifications of ribosomal RNAs and other RNA genes. snoRNPs.

SRP RNA (signal recognition particle): forms an RNA-protein complex involved in protein secretion.

Further: microRNA, eRNA, gRNA, tmRNA etc.

Eukaryotes have spliced genes …

Promoter: involved in transcription initiation (TF/RNApol-binding sites)
TSS: transcription start site
UTRs: un-translated regions (important for translational control)
Exons will be spliced together by removal of the introns
Poly-adenylation site: important for transcription termination (but also: mRNA stability, export of mRNA from the nucleus, etc.)

DNA makes mRNA makes Protein

DNA makes RNA makes Protein

… yet another picture to appreciate the above statement

Some facts about human genes

There are about 20,000–25,000 genes in the human genome (~3% of the genome)

Average gene length is ~8,000 bp

Average of 5-6 exons per gene

Average exon length is ~ 200 bp

Average intron length is ~ 2000 bp

8% of the genes have a single exon

Some exons can be as small as 1 or 3 bp

DMD: the largest known human gene

The largest known human gene is DMD, the gene that encodes dystrophin: ~2.4 million bp over 79 exons

X-linked recessive disease (affects boys)

Two variants: Duchenne-type (DMD) and Becker-type (BMD)

Duchenne-type: more severe, frameshift-mutations

Becker-type: milder phenotype, “in frame”- mutations
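
The difference between frameshift and "in frame" mutations comes down to whether the number of deleted (or inserted) bases is a multiple of three. A minimal sketch, assuming a simple deletion inside the coding sequence; the function name classify_deletion is illustrative, and real cases (e.g. deletions spanning splice sites) are more complicated.

```python
def classify_deletion(n_deleted_bases):
    """Classify a coding-sequence deletion by its effect on the reading frame."""
    if n_deleted_bases <= 0:
        raise ValueError("number of deleted bases must be positive")
    # A multiple of 3 removes whole codons' worth of bases: downstream codons stay in frame
    return "in-frame" if n_deleted_bases % 3 == 0 else "frameshift"


for n in (1, 2, 3, 6, 7):
    print(n, classify_deletion(n))
```

In this simplified picture a frameshift garbles every codon downstream of the deletion (Duchenne-type), whereas an in-frame deletion removes some amino acids but leaves the rest of the protein intact (Becker-type).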

Posture changes during progression of Duchenne muscular dystrophy

Nucleic acid basics

Nucleic acids are polymers

Each monomer consists of 3 moieties

nucleoside

nucleotide

Nucleic acid basics (2)

A base can be one of 5 rings

Purines and pyrimidines can base-pair (Watson-Crick pairs)

Watson and Crick, 1953

Nucleic acid as hetero-polymers

Nucleosides, nucleotides

(Ribose sugar, RNA precursor)

(2’-deoxy ribose sugar, DNA precursor)

(2’-deoxy thymidine tri-phosphate, nucleotide)

DNA and RNA strands

REMEMBER:

DNA = deoxyribonucleotides; RNA = ribonucleotides (OH-groups at the 2’ position)

Note the directionality of DNA (5’-3’ & 3’-5’) or RNA (5’-3’)

DNA = A, G, C, T ; RNA = A, G, C, U

So …

DNA RNA

Stability of base-pairing

C-G base pairing is more stable than A-T (A-U) base pairing (why?)

3rd codon position has freedom to evolve (synonymous mutations)

Species can therefore optimise their G-C content (e.g. thermophiles are GC rich) (consequences for codon use?)

Thermocrinis ruber, heat-loving bacteria

Amino Acid       Single Letter Code   DNA codons
Isoleucine       I                    ATT, ATC, ATA
Leucine          L                    CTT, CTC, CTA, CTG, TTA, TTG
Valine           V                    GTT, GTC, GTA, GTG
Phenylalanine    F                    TTT, TTC
Methionine       M, Start             ATG
Cysteine         C                    TGT, TGC
Alanine          A                    GCT, GCC, GCA, GCG
Glycine          G                    GGT, GGC, GGA, GGG
Proline          P                    CCT, CCC, CCA, CCG
Threonine        T                    ACT, ACC, ACA, ACG
Serine           S                    TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine         Y                    TAT, TAC
Tryptophan       W                    TGG
Glutamine        Q                    CAA, CAG
Asparagine       N                    AAT, AAC
Histidine        H                    CAT, CAC
Glutamic acid    E                    GAA, GAG
Aspartic acid    D                    GAT, GAC
Lysine           K                    AAA, AAG
Arginine         R                    CGT, CGC, CGA, CGG, AGA, AGG
Stop codons      Stop                 TAA, TAG, TGA
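
The redundancy of the codon table, and the freedom of the 3rd codon position, can be checked directly. The sketch below assumes the standard genetic code; the names degeneracy and synonymous_fraction are illustrative. It counts the codons per amino acid and, for each codon position, the fraction of single-base changes that leave the encoded amino acid (or stop signal) unchanged.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in T, C, A, G order at each position
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of codons per amino acid ("*" stands for the stop codons)
degeneracy = Counter(CODON_TABLE.values())


def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1 or 2) that are synonymous."""
    same = total = 0
    for codon, amino_acid in CODON_TABLE.items():
        for base in BASES:
            if base == codon[position]:
                continue  # not a change
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TABLE[mutant] == amino_acid
    return same / total


print(dict(degeneracy))
for pos in range(3):
    print(f"codon position {pos + 1}: {synonymous_fraction(pos):.2f} synonymous")
```

Under these assumptions the third position comes out far more tolerant of change than the first two.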

DNA compositional biases

Base compositions of genomes: G+C (and therefore also A+T) content varies between different genomes

The GC-content is sometimes used to classify organisms in taxonomy

High G+C content bacteria: Actinobacteria, e.g. in Streptomyces coelicolor it is 72%

Low G+C content: Plasmodium falciparum (~20%)

Other examples:
Saccharomyces cerevisiae (yeast) 38%
Arabidopsis thaliana (plant) 36%
Escherichia coli (bacteria) 50%
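
G+C content is simply the fraction of G and C among the bases of a sequence. A minimal sketch; the function name gc_content is illustrative, and ambiguous symbols such as "n" are ignored here, which is one possible convention among several.

```python
def gc_content(sequence):
    """Return the fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return sum(b in "GC" for b in bases) / len(bases)


fragment = "attcgttggcaaatcgcccctatccggc"
print(f"G+C content: {gc_content(fragment):.0%}")
```

For a whole genome the same calculation is applied to the complete sequence, typically read from a FASTA file.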

Genetic diseases: cystic fibrosis

Known since very early on (“Celtic gene”)

Autosomal, recessive, hereditary disease (Chr. 7)

Symptoms:
Exocrine glands (which produce sweat and mucus)
Abnormal secretions
Respiratory problems
Reduced fertility and (male) anatomical anomalies


cystic fibrosis (2)

Gene product: CFTR (cystic fibrosis transmembrane conductance regulator)

CFTR is an ABC (ATP-binding cassette) transporter or traffic ATPase.

These proteins transport molecules such as sugars, peptides, inorganic phosphate, chloride, and metal cations across the cellular membrane.

CFTR transports chloride ions (Cl-) across the membranes of cells in the lungs, liver, pancreas, digestive tract, reproductive tract, and skin.

cystic fibrosis (3)

The CF gene, CFTR, has a 3-bp deletion leading to Del508 (Phe) in the 1480-aa protein (an epithelial Cl- channel)

The protein is degraded in the endoplasmic reticulum (ER) instead of being inserted into the cell membrane

The deltaF508 deletion is the most common cause of cystic fibrosis. The isoleucine (Ile) at amino acid position 507 remains unchanged because both ATC and ATT code for isoleucine
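
The effect of the 3-bp deletion on codons 507 and 508 can be traced in a small sketch. The 15-nucleotide window used here (codons 506 to 510, ATC ATC TTT GGT GTT) is an assumption made for illustration, as is the exact position of the deleted bases; the standard genetic code is assumed.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in T, C, A, G order at each position
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate an in-frame DNA fragment into one-letter amino acid codes."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Assumed window around position 508: Ile506 Ile507 Phe508 Gly509 Val510
wild_type = "ATCATCTTTGGTGTT"

# The deletion removes the last base of codon 507 and the first two bases of codon 508
deleted = wild_type[5:8]
mutant = wild_type[:5] + wild_type[8:]

print(deleted)               # CTT
print(translate(wild_type))  # IIFGV
print(translate(mutant))     # IIGV
```

Codon 507 changes from ATC to ATT, but both encode isoleucine, so the only change at the protein level is the missing phenylalanine.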

Diagram depicting the five domains of the CFTR membrane protein (Sheppard 1999).

Theoretical Model of NBD1. PDB identifier 1NBD as viewed in Protein Explorer http://proteinexplorer.org

Let’s return to DNA and RNA structure …

Unlike three dimensional structures of proteins, DNA molecules assume simple double helical structures independent of their sequences.

There are three kinds of double helices that have been observed in DNA: type A, type B, and type Z, which differ in their geometries.

RNA, on the other hand, can adopt structures as diverse as those of proteins, as well as a simple double helix of type A.

The ability to be both informational and structurally diverse suggests that RNA was the prebiotic molecule that could function in both replication and catalysis (the RNA World Hypothesis).

In fact, some viruses encode their genetic material in RNA (e.g. retroviruses).

Three dimensional structures of double helices

Side view: A-DNA, B-DNA, Z-DNA

Top view: A-DNA, B-DNA, Z-DNA

Space-filling models of A, B and Z- DNA

http://en.wikipedia.org/wiki/Image:A-B-Z-DNA_Side_View.png

http://en.wikipedia.org/wiki/Image:A-B-Z-DNA_Top_View.png

Major and minor grooves

Forces that stabilize nucleic acid double helix

There are two major forces that contribute to stability of helix formation:
Hydrogen bonding in base-pairing
Hydrophobic interactions in base stacking

Same-strand stacking; cross-strand stacking

Types of DNA double helix

Type A: major conformation of RNA, minor conformation of DNA; right-handed helix; short and broad

Type B: major conformation of DNA; right-handed helix; long and thin

Type Z: minor conformation of DNA; left-handed helix; longer and thinner

Secondary structures of Nucleic acids

DNA is primarily in duplex form

RNA is normally single-stranded and can adopt a variety of secondary structures other than the duplex.

Non B-DNA Secondary structures

Cruciform DNA

Triple helical DNA

Slipped DNA

Hoogsteen basepairs

Source: Van Dongen et al. (1999) , Nature Structural Biology 6, 854 - 859

More Secondary structures

RNA pseudoknots

Cloverleaf rRNA structure

Source: Cornelis W. A. Pleij in Gesteland, R. F. and Atkins, J. F. (1993) THE RNA WORLD. Cold Spring Harbor Laboratory Press.

16S rRNA Secondary Structure Based on Phylogenetic Data

3D structures of RNA: transfer-RNA structures

Secondary structure of tRNA (cloverleaf)

Tertiary structure of tRNA

3D structures of RNA: ribosomal-RNA structures

Secondary structure of large rRNA (16S)

Tertiary structure of large rRNA subunit

Ban et al., Science 289 (905-920), 2000

3D structures of RNA: catalytic RNA

Secondary structure of self-splicing RNA

Tertiary structure of self-splicing RNA

Some structural rules …

Base-pairing is stabilizing

Un-paired sections (loops) destabilize

3D conformation with interactions makes up for this

Three main principles

• DNA makes RNA makes Protein

• Structure more conserved than sequence

• Sequence → Structure → Function

How to go from DNA to protein sequence

A piece of double stranded DNA:

5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’

DNA direction is from 5’ to 3’

How to go from DNA to protein sequence

6-frame conceptual translation using the codon table:

5’ attcgttggcaaatcgcccctatccggc 3’

3’ taagcaaccgtttagcggggataggccg 5’

So, there are six possibilities to make a protein from an unknown piece of DNA, only one of which might be a natural protein
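
The six conceptual translations can be generated as follows. This is a minimal sketch, assuming the standard genetic code; the function names are illustrative, stop codons are shown as "*", and incomplete trailing codons are dropped.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in T, C, A, G order at each position
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate every complete codon; stop codons appear as '*'."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate the three forward and the three reverse reading frames."""
    forward = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frame_translation("attcgttggcaaatcgcccctatccggc").items()):
    print(frame, protein)
```

Note that the reverse frames are read from the complementary strand in its own 5’ to 3’ direction, which is why the sequence is reverse-complemented rather than just complemented.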

Remark

• Identifying (annotating) human genes, i.e. finding what they are and what they do, is a difficult problem
  – First, the gene should be delineated on the genome
    • Gene finding methods should be able to tell a gene region from a non-gene region
    • Start, stop codons, further compositional differences
  – Then, a putative function should be found for the gene located
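
As a first idea of how start and stop codons help delineate a gene, the sketch below scans the forward strand for open reading frames (ATG ... stop). It is a deliberately naive illustration, not a real gene finder: it ignores introns, the reverse strand and compositional signals, and the function name find_orfs and the min_codons threshold are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop stretches in the three forward frames.

    Indices are 0-based; 'end' is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


toy = "CCATGAAATTTTAGCC"
for start, end in find_orfs(toy):
    print(start, end, toy[start:end])
```

In eukaryotes such a scan is of limited use on genomic DNA, because exons are interrupted by introns; it is more meaningful on spliced mRNA/cDNA sequences.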

Dean, A. M. and G. B. Golding: Pacific Symposium on Biocomputing 2000

Evolution and three-dimensional protein structure information

Isocitrate dehydrogenase:

The distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution)

Genomic Data Sources
• DNA/protein sequence
• Expression (microarray)
• Proteome (X-ray, NMR, mass spectrometry)
• Metabolome
• Physiome (spatial, temporal)

Integrative bioinformatics

Dinner discussion: Integrative Bioinformatics & Genomics VU

metabolome

proteome

genome

transcriptome

physiome

Genomic Data Sources

Vertical Genomics

DNA makes RNA makes Protein (reminder)

DNA makes RNA makes Protein: Expression data

• More copies of mRNA for a gene lead to more protein

• mRNA can now be measured for all the genes in a cell at once through microarray technology

• Can have 60,000 spots (genes) on a single gene chip

• Colour change gives intensity of gene expression (over- or under-expression)

Proteomics

• Elucidating all 3D structures of proteins in the cell
  • This is also called Structural Genomics
• Finding out what these proteins do
  • This is also called Functional Genomics

Protein-protein interaction networks

Metabolic networks

Glycolysis and Gluconeogenesis

KEGG database (Japan)

High-throughput Biological Data

• Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming
  – genomic sequences
  – arrayCGH (Comparative Genomic Hybridization) data, gene expression data
  – mass spectrometry data
  – protein-protein interaction data
  – protein structures
  – ......

Protein structural data explosion

Protein Data Bank (PDB): 14,500 structures (6 March 2001)
10,900 X-ray crystallography, 1,810 NMR, 278 theoretical models, others...

Dickerson’s formula: equivalent to Moore’s law

On 27 March 2001 there were 12,123 3D protein structures in the PDB: Dickerson’s formula predicts 12,066 (within 0.5%)!

n = e^(0.19(y-1960)), with y the year.

Sequence versus structural data

• Structural genomics initiatives are now in full swing and growth is still exponential.

• However, growth of sequence data is even more rapid. There are now more than 500 completely sequenced genomes publicly available.

Increasing gap between structural and sequence data (“Mind the gap”)

Bioinformatics

[Figure: disciplines arranged from "Large – external (integrative)" to "Small – internal (individual)" under the headings "Science" and "Human": Planetary Science, Cultural Anthropology, Population Biology, Sociology, Sociobiology, Psychology, Systems Biology, Biology, Medicine, Molecular Biology, Chemistry, Physics]

Bioinformatics
• Offers an ever more essential input to
  – Molecular Biology
  – Pharmacology (drug design)
  – Agriculture
  – Biotechnology
  – Clinical medicine
  – Anthropology
  – Forensic science
  – Chemical industries (detergent industries, etc.)
