
1 CAP5510 – Bioinformatics Fall 2009 Tamer Kahveci CISE Department University of Florida.

Date post: 21-Dec-2015

CAP5510 – Bioinformatics, Fall 2009

Tamer Kahveci

CISE Department

University of Florida


Vital Information

• Instructor: Tamer Kahveci

• Office: E436

• Time: Mon/Wed/Thu 3:00 - 3:50 PM

• Office hours: Mon/Wed 2:00-2:50 PM

• TA: TBA

• Course page: http://www.cise.ufl.edu/~tamer/teaching/fall2009


Goals

• Understand the major components of bioinformatics data and how computer technology is used to understand this data better.

• Learn the main open research problems in bioinformatics and gain the background needed to work on them.


This Course will

• Give you a feeling for the main issues in molecular biological computing: sequence, structure and function.

• Give you exposure to classic biological problems, as represented computationally.

• Encourage you to explore research problems and make a contribution.


This Course will not

• Teach you biology.

• Teach you programming.

• Teach you how to be an expert user of off-the-shelf molecular biology computer packages.

• Force you to make a novel contribution to bioinformatics.


Course Outline

• Introduction to terminology

• Biological sequences

• Sequence comparison
– Lossless alignment (DP)
– Lossy alignments (BLAST, etc.)

• Substitution matrices, statistics

• Multiple alignment

• Phylogeny

• Protein structures and function (primary, secondary, etc.)

• Structure alignment

• Structure prediction?

• Pathways


Grading

• Homeworks (35 %)

• Project (50 %)
– Contribution (2.5 % bonus)

• Survey (15 %)

How can I get an A?

Bioinformatics Daily: First homework is posted


Expectations

• Require
– Data structures and algorithms.
– Coding (C, Java)

• Encourage
– actively participate in discussions in the classroom
– read bioinformatics literature in general
– attend colloquiums on campus

• Academic honesty


Textbook

• Not required, but recommended.

• Class notes + papers.


Where to Look?

• Journals
– Bioinformatics
– Genome Research
– Nucleic Acids Research
– Journal of Computational Biology
– Protein Science

• Conferences
– RECOMB
– ISMB
– PSB
– CSB
– VLDB, ICDE, SIGMOD


What is Bioinformatics?

• Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. There are three important sub-disciplines within bioinformatics:
– the development of new algorithms and statistics with which to assess relationships among members of large data sets
– the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures
– the development and implementation of tools that enable efficient access and management of different types of information.

From NCBI (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/milestones.html


Does biology have anything to do with computer science?


Challenges 1/6

• Data diversity
– DNA (ATCCAGAGCAG)
– Protein sequences (MHPKVDALLSR)
– Protein structures
– Microarrays
– Pathways
– Bio-images
– Time series


Challenges 2/6

• Database diversity
– GenBank, SwissProt, …
– PDB, Prosite, …
– KEGG, EcoCyc, MetaCyc, …


Challenges 3/6

• Database size
– GenBank: as of August 2009, it holds over 85,759,586,764 bases.
– 400 K protein sequences, each about 300 amino acids long
– 50 K protein structures in PDB; 400 K in ModBase.

"Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading."

-- G. A. Petsko, Nature 401: 115-116 (1999)


Challenges 4/6

• Moore’s Law Matched by Growth of Data

• CPU vs Disk
– As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial

[Figure: number of protein domain structures in PDB (0 to 4500, 1980 to 1995) plotted alongside CPU instruction time in ns (0 to 140, 1979 to 1995)]


Challenges 5/6

• Deciphering the code
– Within same data type: hard
– Across data types: harder

caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact


Challenges 6/6

• Inaccuracy

• Redundancy


What is the Real Solution?

We need better computational methods

• Compact summarization

• Fast and accurate analysis of data

• Efficient indexing


A Gentle Introduction to Molecular Biology


Goals

• Understand the major components of biological data
– DNA, protein sequences, expression arrays, protein structures

• Get familiar with basic terminology

• Learn commonly used data formats


Genetic Material: DNA

• Deoxyribonucleic Acid, 1950s
– Basis of inheritance
– Eye color, hair color, …

• 4 nucleotides
– A, C, G, T


Chemical Structure of Nucleotides

Purines

Pyrimidines


Making of Long Chains

5’ -> 3’


DNA structure

• Double stranded, helix (Watson & Crick)

• Complementary
– A-T
– G-C

• Antiparallel
– 5’ -> 3’ (downstream)
– 3’ -> 5’ (upstream)

• Animation (ch3.1)


Base Pairs


Question

• 5’ - GTTACA – 3’

• 5’ – XXXXXX – 3’ ?

• 5’ – TGTAAC – 3’

• Reverse complements.
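
The answer to the question above can be checked mechanically: complement every base (A with T, G with C) and read the result in the opposite direction. The following is a minimal Python sketch; the function name `reverse_complement` and the `COMPLEMENT` table are illustrative names, not part of the course material.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' -> 3'."""
    # Walk the sequence backwards and swap each base for its partner,
    # so the result is again written 5' -> 3'.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("GTTACA"))  # TGTAAC
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.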


Repetitive DNA

• Tandem repeats: highly repetitive
– Satellites (100 k – 1 Gbp) / (a few hundred bp)
– Mini satellites (1 k – 20 kbp) / (9 – 80 bp)
– Micro satellites (< 150 bp) / (1 – 6 bp)
– DNA fingerprinting

• Interspersed repeats: moderately repetitive
– LINE
– SINE

• Proteins contain repetitive patterns too


Genetic Material: an Analogy

• Nucleotide => letter

• Gene => sentence

• Contig => chapter

• Chromosome => book
– Gender, hair/eye color, …
– Disorders: Down syndrome, Turner syndrome, …
  • http://gslc.genetics.utah.edu/units/disorders/karyotype/
– Chromosome number varies for species
  • http://www.web-books.com/MoBio/Free/Ch1C2.htm
– We have 46 (23 + 23) chromosomes
  • http://www.web-books.com/MoBio/Free/Ch1C5.htm

• Complete genome => volumes of encyclopedia

• The Hershey & Chase experiment shows that DNA is the genetic material. (ch14)


Functions of Genes 2/2

• Movement: contracting in order to pull things together or push things apart.

• Transcription control: deciding when other genes should be turned ON/OFF
– Animation (ch7)

• Trafficking: affecting where different elements end up inside the cell


Central Dogma


Introns and Exons 1/2


Introns and Exons 2/2

• Humans have about 35,000 genes = 40,000,000 DNA bases = about 1.3% of total DNA in genome.

• The remaining 2,960,000,000 bases carry control information (e.g. when, where, how long, etc.).


Central dogma

DNA (Genotype) => Gene expression => Protein (Phenotype)


Gene Expression

• Building proteins from DNA
– Promoter sequence: marks the start of a gene (13 nucleotides).

• Positive regulation: proteins that bind to DNA near promoter sequences increase transcription.

• Negative regulation


Microarray

Animation on creating microarrays


Amino Acids

• 20 different amino acids
– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ

• ~300 amino acids in an average protein, ~400 K known protein sequences

• How many nucleotides can encode one amino acid?
– 4^2 < 20 < 4^3
– E.g., Q (glutamine) = CAG
– Degeneracy
– Triplet code (codon)
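
The counting argument behind the triplet code can be made explicit: with 4 nucleotides, words of length k give 4^k combinations, and the shortest length that covers 20 amino acids is k = 3. The sketch below illustrates this; the helper name `min_codon_length` is made up for the example.

```python
def min_codon_length(alphabet_size: int, symbols_to_encode: int) -> int:
    """Smallest k such that alphabet_size ** k >= symbols_to_encode."""
    k = 1
    while alphabet_size ** k < symbols_to_encode:
        k += 1
    return k


k = min_codon_length(4, 20)
# Two nucleotides give 16 words (too few); three give 64 (enough).
print(k, 4 ** (k - 1), 4 ** k)  # 3 16 64
```

Because 64 is larger than 20, several codons can map to the same amino acid, which is the degeneracy mentioned above.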


Triplet Code


Molecular Structure of Amino Acid

Side Chain

• Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)

• Polar, Hydrophilic (S, T, C, Y, N, Q)

• Electrically charged (D, E, K, R, H)


Peptide Bonds


Direction of Protein Sequence

Animation on protein synthesis (ch15)


Data Format

• GenBank

• EMBL (European Mol. Biol. Lab.)

• SwissProt

• FASTA

• NBRF (Nat. Biomedical Res. Foundation)

• Others
– IG, GCG, Codata, ASN, GDE, Plain ASCII
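
Of the formats listed, FASTA is the simplest to read by hand: each record is a header line starting with ">" followed by one or more lines of sequence. The following is a minimal parsing sketch, not a full implementation of any of the formats; the function name `parse_fasta` and the two sample records (built from the example sequences shown earlier in these slides) are assumptions made for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[header] += line
    return records


# Hypothetical input with one DNA record and one protein record.
example = """>seq1 hypothetical DNA fragment
ATCCAG
AGCAG
>seq2 hypothetical peptide
MHPKVDALLSR
""".splitlines()

records = parse_fasta(example)
print(records["seq1 hypothetical DNA fragment"])  # ATCCAGAGCAG
```

Real files are often large, so a production reader would usually yield records one at a time rather than build a dictionary in memory.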


Primary Structure of Proteins

[Figure: backbone dihedral angles phi1, psi1, phi2, …; 2N angles]


Secondary Structure: Alpha Helix

• 1.5 Å translation

• 100 degree rotation

• Phi = -60

• Psi = -60
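
The per-residue translation and rotation above determine the overall geometry of the helix: a full turn takes 360 / 100 residues, and the rise per turn (the pitch) is that number times the per-residue translation. A small worked calculation, using only the two values from the slide (the variable names are illustrative):

```python
RISE_PER_RESIDUE = 1.5      # translation along the helix axis, in angstroms
ROTATION_PER_RESIDUE = 100  # rotation about the helix axis, in degrees

# Residues needed to complete one full 360-degree turn.
residues_per_turn = 360 / ROTATION_PER_RESIDUE

# Distance travelled along the axis during one full turn.
pitch = residues_per_turn * RISE_PER_RESIDUE

print(f"{residues_per_turn:.1f} residues per turn, pitch {pitch:.1f} A")
# 3.6 residues per turn, pitch 5.4 A
```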


Secondary Structure: Beta sheet

• Phi = -135

• Psi = 135

[Figure: anti-parallel and parallel beta sheets]


Ramachandran Plot

Sample pdb entry ( http://www.rcsb.org/pdb/ )


Tertiary Structure

• 3-d structure of a polypeptide sequence
– interactions between non-local atoms

[Figure: tertiary structure of myoglobin]


Quaternary Structure

• Arrangement of protein subunits

[Figures: quaternary structure of Cro; human hemoglobin tetramer]


Structure Summary

• 3-d structure determined by protein sequence

• Prediction remains a challenge

• Diseases caused by misfolded proteins
– Mad cow disease

• Classification of protein structure


STOP

Next Week:

• Basic sequence comparison

• Dynamic programming methods
– Global/local alignment
– Gaps

