
CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Date post: 03-Jan-2016
Upload: matthew-mcdonald
Page 1: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

CS 5263 Bioinformatics

Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology

Page 2: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Outline

• Administrivia

• What is bioinformatics

• Why bioinformatics

• Course overview

• Short introduction to molecular biology

Page 3: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Survey form

• Your name

• Email

• Academic preparation

• Interests

• Purpose: help me better design lectures and assignments

Page 4: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Course Info

• Instructor: Jianhua Ruan
  – Office: S.B. 4.01.48
  – Phone: 458-6819
  – Email: [email protected]
  – Office hours: MW 2-3pm

• Web: http://www.cs.utsa.edu/~jruan/teaching/cs5263_fall_2008/

Page 5: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Course description

• A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint.

• Prerequisites:
  – Programming experience
  – Some knowledge of algorithms and data structures
  – Basic understanding of statistics and probability
  – Appetite to learn some biology

Page 6: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Textbooks

• An Introduction to Bioinformatics Algorithms

by Jones and Pevzner

• Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

by Durbin, Eddy, Krogh and Mitchison

• Additional resources
  – Papers
  – Handouts
  – See course website

Page 7: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Grading

• Attendance: 10%
  – At most 2 classes missed without affecting grade
• Homework: 50%
  – About 5 assignments
  – Combination of theoretical and programming exercises
  – No exams
  – No late submissions accepted
  – Read the collaboration policy!
• Final project and presentation: 40%

Page 8: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Why bioinformatics

• Advances in experimental technology have generated a huge amount of data
  – The human genome is “finished”
  – Even if it were, that’s only the beginning…
• The bottleneck is how to integrate and analyze the data
  – Noisy
  – Diverse

Page 9: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Growth of GenBank vs Moore’s law

Page 10: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Genome annotations

Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006

Page 11: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

What is bioinformatics

• National Institutes of Health (NIH):
  – Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Page 12: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

What is bioinformatics

• National Center for Biotechnology Information (NCBI):
  – The field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

Page 13: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

What is bioinformatics

• Wikipedia:
  – Bioinformatics refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired from the management and analysis of biological data.

Page 14: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Chemistry

Mathematics / Statistics

Computer Science / Informatics

Physics

Medicine

Biology / Molecular Biology

Bioinformatics

Page 15: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Course objectives

• Learn the basics of sequence analysis and other computational biology algorithms
• Become familiar with research topics in bioinformatics
• Be able to
  – Read and critique bioinformatics research articles
  – Identify subareas that best suit your background
  – Communicate and exchange ideas with (computational) biologists

Page 16: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

What will you learn?

• Basic concepts in molecular biology and genetics

• Algorithms to address selected problems in bioinformatics
  – Dynamic programming, string algorithms, graph algorithms
  – Statistical learning algorithms: HMM, EM, Gibbs sampling
  – Data mining: clustering / classification

• Applications to real data

Page 17: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

What will you not learn?

• Designing / performing biological experiments (duh!)

• Programming (in perl, etc).

• Building bioinformatics software tools (GUI, database, Web, …)

• Using existing tools / databases (well, not exactly true)

Page 18: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Covered topics

• Biology
• Sequence analysis
  – Sequence alignment
    • Pairwise, multiple, global, local, optimal, heuristic
  – String matching
  – Motif finding
• Gene prediction
• RNA structure prediction
• Phylogenetic tree
• Functional genomics
  – Microarray data analysis
  – Biological networks

8 weeks

5 weeks

1 week

Page 19: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Computer Scientists vs Biologists

(courtesy Serafim Batzoglou, Stanford)

Page 20: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Biologists vs computer scientists

• (almost) Everything is true or false in computer science

• (almost) Nothing is ever true or false in Biology

Page 21: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Biologists vs computer scientists

• Biologists seek to understand the complicated, messy natural world

• Computer scientists strive to build their own clean and organized virtual world

Page 22: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Biologists vs computer scientists

• Computer scientists are obsessed with being the first to invent or prove something

• Biologists are obsessed with being the first to discover something

Page 23: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Some examples of central role of CS in bioinformatics

Page 24: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

1. Genome sequencing

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT

3×10^9 nucleotides

~500 nucleotides

Page 25: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT

3×10^9 nucleotides

Computational Fragment Assembly
• Introduced ~1980
• 1995: assemble up to 1,000,000 long DNA pieces
• 2000: assemble whole human genome

A big puzzle: ~60 million pieces

1. Genome sequencing
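The numbers on these two slides can be related with a short calculation. The sketch below uses the slide's figures (3×10^9 nucleotides, ~500 nucleotides per fragment, ~60 million pieces) and the usual definition of average coverage as total sequenced bases divided by genome length; the variable and function names are illustrative assumptions, not part of the lecture.

```python
# Figures taken from the slides; names are hypothetical.
GENOME_LENGTH = 3 * 10**9   # nucleotides in the human genome
READ_LENGTH = 500           # approximate nucleotides per fragment
NUM_READS = 60 * 10**6      # "~60 million pieces"


def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of fragments covering each genome position."""
    return num_reads * read_length / genome_length


# Fragments needed to tile the genome exactly once, with no overlap.
min_reads = GENOME_LENGTH // READ_LENGTH
depth = coverage(NUM_READS, READ_LENGTH, GENOME_LENGTH)

print(min_reads)  # 6000000
print(depth)      # 10.0
```

Under these assumptions, 60 million fragments correspond to roughly tenfold redundancy, which is what gives an assembler overlapping pieces to join.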

Page 26: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Where are the genes?

2. Gene Finding

In humans:

~22,000 genes

~1.5% of human DNA

Page 27: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Start codon: ATG

5’ Exon 1 | Intron 1 | Exon 2 | Intron 2 | Exon 3 3’

Stop codon: TAG/TGA/TAA

Splice sites

2. Gene Finding

Hidden Markov Models

(Well studied for many years in speech recognition)
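The signals named on this slide (a start codon, then an in-frame stop codon) can be illustrated with a naive open-reading-frame scan. This is not the hidden Markov model approach mentioned above: it is a minimal sketch that ignores introns and splice sites and scans only the strand it is given. The function name and the `min_codons` parameter are assumptions made for illustration.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end) index pairs of ATG...stop stretches.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons before the stop codon, including the ATG itself.
    """
    orfs = []
    for frame in range(3):                      # three reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it at the first stop
    return orfs


example = "CCATGAAATGACC"
print(find_orfs(example))   # [(2, 11)]
print(example[2:11])        # ATGAAATGA
```

A real gene finder also has to score splice sites and tell coding from non-coding sequence statistically, which is where probabilistic models come in.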

Page 28: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

3. Protein Folding

• The amino-acid sequence of a protein determines the 3D fold
• The 3D fold of a protein determines its function
• Can we predict the 3D fold of a protein given its amino-acid sequence?
  – Holy grail of computational biology: a 40-year-old problem
  – Molecular dynamics, computational geometry, machine learning

Page 29: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

4. Sequence Comparison—Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  ||||x|  || |||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Sequence Alignment
• Introduced ~1970
• BLAST: 1990, one of the most cited papers in history
• Still a very active area of research

query

DB

BLAST

Efficient string matching algorithms

Fast database index techniques
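As a preview of how an optimal global alignment score can be computed, here is a minimal dynamic-programming sketch in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not values from the lecture, and the function returns only the score, not the alignment itself.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                       # a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap                 # gap in b
            left = curr[j - 1] + gap           # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4
print(global_alignment_score("ACGT", "AGT"))   # 2
```

The table has one cell per pair of prefixes, so the running time grows with the product of the two lengths. That cost is why heuristic tools such as BLAST are used for database-scale searches.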

Page 30: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).

Lipman & Pearson, 1985

Database size today: 10^12

(a 2-million-fold increase).

BLAST search: 1.5 minutes
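The growth factor quoted on this slide follows directly from the two database sizes. A quick check, using the slide's figures and hypothetical variable names:

```python
# Sizes as given on the slide; names are illustrative.
nbrf_library_1985 = 500_000      # residues, per Lipman & Pearson (1985)
database_size_today = 10**12     # size quoted on the slide

fold_increase = database_size_today // nbrf_library_1985
print(fold_increase)  # 2000000
```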


Page 31: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

5. Microarray analysis

Clinical prediction of leukemia type

• 2 types
  – Acute lymphoid (ALL)
  – Acute myeloid (AML)
• Different treatments & outcomes
• Predict type before treatment?

Bone marrow samples: ALL vs AML

Measure the expression level (amount of mRNA) of each gene

Page 32: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Some goals of biology for the next 50 years

• List all molecular parts that build an organism– Genes, proteins, other functional parts

• Understand the function of each part
• Understand how parts interact physically and functionally
• Study how function has evolved across all species
• Find genetic defects that cause diseases
• Design drugs rationally
• Sequence the genome of every human, use it for personalized medicine

• Bioinformatics is an essential component for all the goals above

Page 33: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

A short introduction to molecular biology

Page 34: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Life

• Two categories:
  – Prokaryotes (e.g. bacteria)
    • Unicellular
    • No nucleus
  – Eukaryotes (e.g. fungi, plants, animals)
    • Unicellular or multicellular
    • Have a nucleus

Page 35: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Prokaryote vs Eukaryote

• Eukaryotes have many membrane-bound compartments inside the cell
  – Different biological processes occur at different cellular locations

Page 36: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Organism, Organ, Cell

Organism

Organ

Page 37: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Chemical contents of cell

• Water
• Macromolecules (polymers): “strings” made by linking monomers from a specified set (alphabet)
  – Protein
  – DNA
  – RNA
  – …
• Small molecules
  – Sugar
  – Ions (Na+, K+, Ca2+, Cl-, …)
  – Hormone
  – …

Page 38: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

DNA

• DNA: forms the genetic material of all living organisms
  – Can be replicated and passed to descendants
  – Contains information to produce proteins
• To computer scientists, DNA is a string made from alphabet {A, C, G, T}
  – e.g. ACAGAACGTAGTGCCGTGAGCG
• Each letter is a nucleotide
• Length varies from hundreds to billions

Page 39: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

RNA

• Historically thought to be information carrier only
  – DNA => RNA => Protein
  – New roles have been found for RNAs
• To computer scientists, RNA is a string made from alphabet {A, C, G, U}
  – e.g. ACAGAACGUAGUGCCGUGAGCG
• Each letter is a nucleotide
• Length varies from tens to thousands

Page 40: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Protein

• Protein: the actual “worker” for almost all processes in the cell
  – Enzymes: speed up reactions
  – Signaling: information transduction
  – Structural support
  – Production of other macromolecules
  – Transport
• To computer scientists, protein is a string made from 20 kinds of characters
  – E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP
• Each letter is called an amino acid
• Length varies from tens to thousands

Page 41: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

DNA/RNA zoom-in

• Commonly referred to as nucleic acids
• DNA: deoxyribonucleic acid
• RNA: ribonucleic acid
• Found mainly in the nucleus of a cell (hence “nucleic”)
• Contain phosphoric acid as a component (hence “acid”)
• They are made up of a string of nucleotides

Page 42: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Nucleotides

• A nucleotide has 3 components
  – Sugar ring (ribose in RNA, deoxyribose in DNA)
  – Phosphoric acid
  – Nitrogen base
    • Adenine (A)
    • Guanine (G)
    • Cytosine (C)
    • Thymine (T) or Uracil (U)

Page 43: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Monomers of RNA: ribo-nucleotide

• A ribonucleotide has 3 components
  – Sugar: ribose
  – Phosphate group
  – Nitrogen base
    • Adenine (A)
    • Guanine (G)
    • Cytosine (C)
    • Uracil (U)

Page 44: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Monomers of DNA: deoxy-ribo-nucleotide

• A deoxyribonucleotide has 3 components
  – Sugar: deoxyribose
  – Phosphate group
  – Nitrogen base
    • Adenine (A)
    • Guanine (G)
    • Cytosine (C)
    • Thymine (T)

Page 45: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Polymerization: Nucleotides => nucleic acids

[Figure: three nucleotides, each consisting of a phosphate, a sugar, and a nitrogen base, linked into a chain]

Page 46: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

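[Figure: a single DNA strand, 5’-AGCGACTG-3’, drawn as bases attached to a sugar-phosphate backbone; the sugar carbons are numbered 1 to 5, with a free phosphate at the 5’ (5 prime) end and the 3’ (3 prime) end at the other side]

DNA

5’-AGCGACTG-3’

AGCGACTG

Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. DNA replication, transcription, etc.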

Page 47: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

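[Figure: a single RNA strand, 5’-AGUGACUG-3’, drawn as bases attached to a sugar-phosphate backbone, with a free phosphate at the 5’ (5 prime) end and the 3’ (3 prime) end at the other side]

RNA

5’-AGUGACUG-3’

AGUGACUG

Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. translation.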

Page 48: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

[Figure: two anti-parallel DNA strands held together by base pairs]

Base-pair:

A = T

G = C

Forward (+) strand: 5’-AGCGACTG-3’

Backward (-) strand: 3’-TCGCTGAC-5’

One strand is said to be reverse-complementary to the other.

DNA usually exists in pairs.

Page 49: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

DNA double helix

A G-C pair (three hydrogen bonds) is stronger than an A-T pair (two hydrogen bonds)

Page 50: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Reverse-complementary sequences

• 5’-ACGTTACAGTA-3’

• The reverse complement is:

3’-TGCAATGTCAT-5’

=>

5’-TACTGTAACGT-3’

• Or simply written as

TACTGTAACGT
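The two steps on this slide (complement each base, then reverse the result so it reads 5’ to 3’) can be sketched in a few lines of Python. The function name is an assumption for illustration; the example input is the sequence from the slide.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTTACAGTA"))  # TACTGTAACGT
```

Applying the function twice returns the original sequence, which is a handy sanity check.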

Page 51: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Orientation of the double helix

• Double helix is anti-parallel
  – 5’ end of each strand pairs with 3’ end of the other
  – 5’ to 3’ motion in one strand is 3’ to 5’ in the other
• Double helix has no orientation
  – Biology has no “forward” and “reverse” strand
  – Relative to any single strand, there is a “reverse complement” or “reverse strand”
  – Information can be encoded by either strand or both strands

5’ TTTTACAGGACCATG 3’
3’ AAAATGTCCTGGTAC 5’

Page 52: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

RNA

• RNAs are normally single-stranded

• Form complex structure by self-base-pairing

• A=U, C=G

• Can also form RNA-DNA and RNA-RNA double strands.– A=T/U, C=G

Page 53: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Protein zoom-in

• Protein is the actual “worker” for almost all processes in the cell
• A string built from 20 letters
  – E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH
• Each letter is called an amino acid

Generic chemical form of amino acid (amino group H2N, carboxyl group COOH, side chain R):

     R
     |
H2N--C--COOH
     |
     H

Page 54: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

• 20 amino acids, only differ at side chains
  – Each can be expressed by three letters
  – Or a single letter: A-Y, except B, J, O, U, X, Z

– Alanine = Ala = A

– Histidine = His = H
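The single-letter rule on this slide can be checked mechanically: removing B, J, O, U, X, and Z from the 26 uppercase letters leaves exactly 20. The sketch below does this and records only the two three-letter examples given on the slide; the names are illustrative assumptions.

```python
import string

# Letters the slide lists as not used for the 20 standard amino acids.
EXCLUDED = set("BJOUXZ")

AMINO_ACID_LETTERS = [c for c in string.ascii_uppercase if c not in EXCLUDED]

# Only the two pairs shown on the slide; the full table has 20 entries.
THREE_TO_ONE = {"Ala": "A", "His": "H"}

print(len(AMINO_ACID_LETTERS))      # 20
print("".join(AMINO_ACID_LETTERS))  # ACDEFGHIKLMNPQRSTVWY
```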

Amino acid

Page 55: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Amino acids => peptide

Two amino acids:

     R                 R
     |                 |
H2N--C--COOH      H2N--C--COOH
     |                 |
     H                 H

Joined by a peptide bond (CO--NH):

     R          R
     |          |
H2N--C--CO--NH--C--COOH
     |          |
     H          H

Page 56: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Protein

• Has orientations
• Usually recorded from N-terminal to C-terminal
• Peptide vs protein: basically the same thing
• Conventions
  – Peptide is shorter (< 50 aa), while protein is longer
  – Peptide refers to the sequence, while protein has 2D/3D structure

[Figure: a chain of residues (R) running from H2N at the N-terminal to COOH at the C-terminal]

Page 57: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Protein structure

• Linear sequence of amino acids folds to form a complex 3-D structure.

• The structure of a protein is intimately connected to its function.

Page 58: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Genome and chromosome

• Genome: the complete DNA sequences in the cell of an organism
  – May contain one (in most prokaryotes) or more (in eukaryotes) chromosomes
• Chromosome: a single large DNA molecule in the cell
  – May be circular or linear
  – Contains genes as well as “junk DNA”
  – Highly packed!

Page 59: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Formation of chromosome

Page 60: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Formation of chromosome

50,000 times shorter than extended DNA

The total length of DNA present in one adult human is the equivalent of nearly 70 round trips from the earth to the sun

Page 61: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Gene

• Gene: unit of heredity in living organisms
  – A segment of DNA with information to make a protein

Page 62: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Some statistics

Organism           Chromosomes   Bases         Genes
Human              46            3 billion     20k-25k
Dog                78            2.4 billion   ~20k
Corn               20            2.5 billion   50-60k
Yeast              16            20 million    ~7k
E. coli            1             4 million     ~4k
Marbled lungfish   ?             130 billion   ?

Page 63: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Human genome

• 46 chromosomes: 22 pairs + X + Y

• 1 from mother, 1 from father

• Female: X + X

• Male: X + Y

Page 64: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Human genome

• Every cell contains the same genomic information
  – Except sperm and eggs, which only contain half of the genome
    • Otherwise your children would have 46 + 46 chromosomes

Page 65: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Cell division: mitosis

• A cell duplicates its genome and divides into two identical cells

• These cells build up different parts of your body

Page 66: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Cell division: meiosis

• A reproductive cell divides into four cells, each containing only half of the genome
  – Diploid => haploid
• Two haploid cells (sperm + egg) form a zygote
  – Which will then develop into a multi-cellular organism by mitosis

Page 67: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Central dogma of molecular biology

DNA replication is critical in both mitosis and meiosis

Page 68: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

DNA Replication

• The process of copying a double-stranded DNA molecule
  – Semi-conservative: each daughter molecule keeps one parental strand and gains one newly synthesized strand

5’-ACATGATAA-3’

3’-TGTACTATT-5’

5’-ACATGATAA-3’ 5’-ACATGATAA-3’

3’-TGTACTATT-5’ 3’-TGTACTATT-5’

Page 69: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

• Mutation: changes in DNA base-pairs
• Proofreading and error-correcting mechanisms exist to ensure extremely high fidelity

Page 70: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Central dogma of molecular biology

Page 71: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Transcription

• The process by which a DNA sequence is copied to produce a complementary RNA
  – Called messenger RNA (mRNA) if the RNA carries instructions on how to make a protein
  – Called non-coding RNA if the RNA does not carry instructions on how to make a protein
  – Only consider mRNA for now
• Similar to replication, but
  – Only one strand is copied

Page 72: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Transcription

Coding strand (where genetic information is stored): 5’-ACGTAGACGTATAGAGCCTAG-3’

Template strand (for making mRNA): 3’-TGCATCTGCATATCTCGGATC-5’

mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’

Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.

DNA-RNA pair:

A=U, C=G

T=A, G=C
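The pairing rules above give two equivalent ways to obtain the mRNA: pair each template base with its RNA partner, or copy the coding strand and replace T with U. The sketch below implements both, using the strands from the slide; the function names are assumptions for illustration.

```python
# DNA template base -> paired RNA base (A=U, C=G, G=C, T=A).
TEMPLATE_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe_from_template(template_3_to_5: str) -> str:
    """Template strand written 3'->5' gives the mRNA written 5'->3'."""
    return template_3_to_5.translate(TEMPLATE_TO_RNA)


def transcribe_from_coding(coding_5_to_3: str) -> str:
    """The mRNA equals the coding strand with T replaced by U."""
    return coding_5_to_3.replace("T", "U")


coding = "ACGTAGACGTATAGAGCCTAG"
template = "TGCATCTGCATATCTCGGATC"

print(transcribe_from_template(template))  # ACGUAGACGUAUAGAGCCUAG
print(transcribe_from_coding(coding))      # ACGUAGACGUAUAGAGCCUAG
```

Because the template is written 3’ to 5’ here, no reversal is needed. A template given 5’ to 3’ would have to be reversed first.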

Page 73: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Translation

• The process of making proteins from mRNA
• A gene uniquely encodes a protein
• There are four bases in DNA (A, C, G, T), and four in RNA (A, C, G, U), but 20 amino acids in protein
• How many nucleotides are required to encode an amino acid in order to ensure correct translation?
  – 4^1 = 4 (too few)
  – 4^2 = 16 (too few)
  – 4^3 = 64 (enough)
• The actual genetic code used by the cell is a triplet.
  – Each triplet is called a codon
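The counting argument above amounts to finding the smallest word length k with 4^k >= 20. A minimal sketch, with a hypothetical function name:

```python
def min_codon_length(alphabet_size: int = 4, num_symbols: int = 20) -> int:
    """Smallest k such that alphabet_size ** k >= num_symbols."""
    k = 1
    while alphabet_size ** k < num_symbols:
        k += 1
    return k


print(min_codon_length())  # 3
```

With 64 possible triplets and only 20 amino acids, several codons can map to the same amino acid, and some can serve as stop signals.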

Page 74: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

The Genetic Code

[Figure: codon table; axis label: Third letter]

Page 75: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Translation

• The sequence of codons is translated to a sequence of amino acids

• Gene: -GCT TGT TTA CGA ATT-
• mRNA: -GCU UGU UUA CGA AUU-
• Peptide: -Ala-Cys-Leu-Arg-Ile-
• Start codon: AUG
  – Also codes for Met
• Stop codons: UGA, UAA, UAG
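The codon-by-codon reading described on this slide can be sketched as follows. The table is deliberately partial: it contains only the codons that appear on the slide, so any other codon raises a `KeyError`. The function name is an assumption for illustration.

```python
# Partial codon table: only the codons shown on the slide.
CODON_TABLE = {
    "GCU": "Ala", "UGU": "Cys", "UUA": "Leu",
    "CGA": "Arg", "AUU": "Ile", "AUG": "Met",
}
STOP_CODONS = {"UGA", "UAA", "UAG"}


def translate(mrna: str) -> list:
    """Read an mRNA three bases at a time until a stop codon or the end."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break                      # stop codons end translation
        peptide.append(CODON_TABLE[codon])
    return peptide


print(translate("GCUUGUUUACGAAUU"))  # ['Ala', 'Cys', 'Leu', 'Arg', 'Ile']
print(translate("AUGGCUUAAUGU"))     # ['Met', 'Ala']
```

This sketch starts reading at the first base. Locating the start codon and choosing the reading frame are left out.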

Page 76: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Translation

• Transfer RNA (tRNA): a different type of RNA.
  – Floats freely in the cell.
  – Every amino acid has its own type of tRNA that binds to it alone.
• Anti-codon to codon binding is crucial.

mRNA

tRNA-Leu

Nascent peptide

tRNA-Pro

Anti-codon

Page 77: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Transcriptional regulation

[Figure: a gene and its promoter, with the transcription starting site, RNA polymerase, and a transcription factor]

• Will talk more in later lectures
• RNA polymerase binds to a certain location on the promoter to initiate transcription
• Transcription factors bind to specific sequences on the promoter to regulate transcription
  – Recruit RNA polymerase: induce
  – Block RNA polymerase: repress
  – Multiple transcription factors may coordinate

Page 78: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Splicing

• Pre-mRNA needs to be “edited” to form mature mRNA
• Will talk more in later lectures.

[Figure: a gene with its promoter and transcription starting site is transcribed into pre-mRNA (5’ UTR, exons separated by introns, 3’ UTR); splicing removes the introns to give mature mRNA, in which the open reading frame (ORF) runs from the start codon to the stop codon]

Page 79: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Summary

• DNA: a string made from {A, C, G, T}
  – Forms the basis of genes
  – Has 5’ and 3’
  – Normally forms double-strand by reverse complement
• RNA: a string made from {A, C, G, U}
  – mRNA: messenger RNA
  – tRNA: transfer RNA
  – Other types of RNA: rRNA, miRNA, etc.
  – Has 5’ and 3’
  – Normally single-stranded, but can form secondary structure
• Protein: made from 20 kinds of amino acids
  – Actual worker in the cell
  – Has N-terminal and C-terminal
  – Sequence uniquely determined by its gene via the use of codons
  – Sequence determines structure, structure determines function
• Central dogma: DNA transcribes to RNA, RNA translates to protein
  – Both steps are regulated

Page 80: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Experimental techniques to manipulate DNA

Page 81: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

DNA synthesis

• Creating DNA synthetically in a laboratory
• Chemical synthesis
  – Chemical reactions
  – Arbitrary sequences
  – Maximum length 160-200
• Cloning: make copies based on a DNA template
  – Biological reactions
  – Requires template
  – Many copies of a long DNA in a short time

Page 82: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

in vivo DNA Cloning

• Connect a piece of DNA to bacterial DNA, which can then be replicated together with the host DNA

bacterial DNA

Page 83: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

in vitro DNA Cloning

• Polymerase chain reaction (PCR)

[Figure: PCR cycle. The double-stranded template is denatured, primers (< 30 bases) bind to the separated strands, and DNA polymerase extends the primers using dNTPs]

Page 84: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Some terms

• Denature: a DNA double-strand is separated into two strands
  – By raising temperature
• Renature: the process by which two denatured DNA strands re-form a double-strand
  – By cooling down slowly
• Hybridization: two heterogeneous DNAs form a double-stranded DNA
  – May have mismatches
  – The rationale behind many molecular biological techniques, including DNA microarrays

Page 85: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

DNA sequencing technology

• Read out the letters from a DNA sequence

1974, Frederick Sanger

GTGAGGCGCTGC

Page 86: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

DNA sequencing: Basic idea

• PCR primer extension

5’-TTACAGGTCCATACTA
3’-AATGTCCAGGTATGATACATAGG-5’

• We need to supply A, C, G, T for the synthesis to continue

• Besides A, C, G, T, we add some A*, C*, G*, and T*
  – Very similar to A, C, G, T in all aspects, except that
  – The extension will stop if one is used
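The idea of terminating bases can be illustrated with an idealized simulation. If extension can stop at every position where a starred base is incorporated, then the fragment lengths observed for each base, sorted together, spell out the newly synthesized strand. The sketch below assumes every possible termination is observed exactly once and ignores the primer; the function names are hypothetical.

```python
def terminated_lengths(new_strand: str) -> dict:
    """For each base, the lengths of fragments ending in that base's terminator."""
    lengths = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        lengths[base].append(position)   # extension may stop here on base*
    return lengths


def read_sequence(lengths: dict) -> str:
    """Recover the strand by ordering all fragments from shortest to longest."""
    base_at_length = {n: base for base, ns in lengths.items() for n in ns}
    return "".join(base_at_length[n] for n in sorted(base_at_length))


fragments = terminated_lengths("GTGAGGCGCTGC")
print(fragments["G"])            # [1, 3, 5, 6, 8, 11]
print(read_sequence(fragments))  # GTGAGGCGCTGC
```

In the laboratory the sorting step is done physically by separating fragments by size. The code mirrors only the logic, not the chemistry.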

Page 87: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

DNA sequencing, cont

Page 88: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

DNA sequencing, cont

Page 89: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.
Page 90: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Advances in DNA sequencing

• 1969: three years to sequence 115nt DNA

• 1979: three years to sequence ~1650nt

• 1989: one week to sequence ~1650nt

• 1995: Haemophilus genome sequenced at TIGR - 1,830,138nt

• 2000: Human Genome - working draft sequence, 3 billion bases

• 2003: (near) completion of human genome

Page 91: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

The bioinformatics landmark

• Completion of human genome sequencing was a success made possible by
  – Advances in sequencing technology
  – Speed of computation
  – Algorithm development in bioinformatics
• HGP (Human Genome Project) strategy
  – Hierarchical sequencing
  – Estimated 15 years (1990 – 2005), completed in 13 years
  – $3 billion
• Celera strategy
  – Whole-genome shotgun sequencing
  – Three years (1998-2001)
  – $300 million

Page 92: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Now

• Over 300 genomes have been sequenced

• ~10^11 to 10^12 nt

Page 93: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

2007

• Genomes of three individual humans were sequenced
  – James Watson
  – Craig Venter
  – TBN Chinese
• Cost for sequencing Watson’s genome
  – $3 million, 2 months
  – Compared to $3 billion, 13 years for HGP

Page 94: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

• Sequencing speed has been tremendously improved

• High efficiency and relatively low cost make it possible to sequence the genome of any individual from any species

What’s next?

Page 95: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Continue to sequence more species?

More individuals?

What to do with those sequences?

Page 96: CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

Coming next: biological sequence analysis

