Introduction to Bioinformatics (236523/234525)
Lecturer: Prof. Yael Mandel-Gutfreund
Teaching Assistants: Shula Shazman, Idit Kosti
Course web site: http://webcourse.cs.technion.ac.il/236523
What is Bioinformatics?
Course Objectives
• To introduce the bioinformatics discipline
• To familiarize students with the major biological questions that can be addressed by bioinformatics tools
• To introduce the major tools used for sequence and structure analysis and explain in general how they work (including their limitations)
Course Structure and Requirements
1. Class structure
   1. 2-hour lecture
   2. 1-hour tutorial
2. Homework
   • Homework assignments will be given every second week
   • The homework will be done in pairs
   • All homework assignments (5/5) must be submitted
3. A final project will be conducted and submitted in pairs
Grading
• 20% homework assignments
• 80% final project
Literature list
• Gibas, C., Jambeck, P. Developing Bioinformatics Computer Skills. O'Reilly, 2001.
• Lesk, A. M. Introduction to Bioinformatics. Oxford University Press, 2002.
• Mount, D. W. Bioinformatics: Sequence and Genome Analysis. 2nd ed., Cold Spring Harbor Laboratory Press, 2004.
Advanced Reading
• Jones, N. C., Pevzner, P. A. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.
What is Bioinformatics?
“The field of science in which biology, computer science, and information technology merge to form a single discipline”
Ultimate goal: to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned.
Central Paradigm in Molecular Biology
Gene (DNA) → mRNA → Protein
21st century:
Genome → Transcriptome → Proteome
From DNA to Genome
[Timeline figure, 1955 to 2000]
• Watson and Crick DNA model
• First protein sequence (1955)
• First protein structure
• First genome: Haemophilus influenzae
• Yeast genome
• First human genome draft (2000)
Complete Genomes
              2010   2005
Eukaryotes     133     39
Bacteria      1152    235
Archaea         94     23
Total         1379    294
1,000 Genomes Project: Expanding the Map of Human Genetics
Researchers hope the effort will speed up the discovery of the genetic roots of many diseases
Main goal: to understand the living cell
• Annotation
• Comparative genomics
• Structural genomics
• Functional genomics
25,000 genomes… What's next?
The “post-genomics” era
From… 25,000 genomes
To… understanding living cells
Annotation
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATGCGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAACTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTCAGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGAAGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAATAT GGA CAA TTG GTT TCT TCT CTG AAT .................... TGAAAAACGTA
Identify the genes within a given sequence of DNA
Identify the sites that regulate the gene
Predict the function
How do we identify a gene in a genome?
A gene is characterized by several features (promoter, ORF…); some are easier and some harder to detect…
Features annotated on the sequence above:
• TF binding site
• Promoter
• Transcription Start Site
• Ribosome binding site
• ORF = Open Reading Frame
• CDS = Coding Sequence
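Of the features listed above, the ORF is among the easier ones to look for computationally. The following is a minimal Python sketch, assuming the simplest possible definition: an ATG followed, in the same reading frame, by a stop codon, on the forward strand only. The function name `find_orfs`, the `min_codons` parameter and the toy sequence are illustrative assumptions and not part of the course material.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for every ATG...stop stretch on the forward strand.

    start/end are 0-based, end-exclusive indices; the stop codon is included.
    min_codons counts the codons before the stop codon (ATG included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(dna):
            if dna[pos:pos + 3] == "ATG":
                end = pos
                # Walk codon by codon until a stop codon or the end of the sequence.
                while end + 3 <= len(dna) and dna[end:end + 3] not in STOP_CODONS:
                    end += 3
                found_stop = end + 3 <= len(dna)
                if found_stop and (end - pos) // 3 >= min_codons:
                    orfs.append((pos, end + 3, frame))
                pos = end + 3  # continue scanning after this ORF
            else:
                pos += 3
    return orfs

# Hypothetical toy sequence, not taken from the slide.
toy = "CCATGGGACAATTGTGACC"
for start, end, frame in find_orfs(toy):
    print(frame, start, end, toy[start:end])
```

Real gene finders go well beyond this: they also scan the reverse strand and combine ORFs with signals such as promoters and ribosome binding sites, and in eukaryotes introns interrupt the coding sequence, so a plain ORF scan is not enough.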
Using bioinformatics approaches for gene hunting
Relatively easy in simple organisms (e.g. bacteria)
VERY HARD in higher organisms (e.g. humans)
Comparative genomics
How human are chimps?
A comparison between the full drafts of the human and chimp genomes revealed that they differ by only 1.23%
Perhaps not surprising!!!
So where are we different?
Human ATAGCGGGGGGATGCGGGCCCTATACCC
Chimp ATAGGGG--GGATGCGGGCCCTATACCC
Mouse ATAGCG---GGATGCGGCGC-TATACCA
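The differences in an alignment like this one can be counted column by column. A small Python sketch, assuming the three rows are already aligned to equal length with "-" marking a gap; the helper name `count_differences` is hypothetical, and gaps are simply counted as differences:

```python
def count_differences(seq_a, seq_b):
    """Count aligned columns in which two equal-length rows differ.

    A gap ('-') facing a base counts as a difference.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

# The three aligned rows from the slide.
alignment = {
    "Human": "ATAGCGGGGGGATGCGGGCCCTATACCC",
    "Chimp": "ATAGGGG--GGATGCGGGCCCTATACCC",
    "Mouse": "ATAGCG---GGATGCGGCGC-TATACCA",
}

for name in ("Chimp", "Mouse"):
    diffs = count_differences(alignment["Human"], alignment[name])
    print(f"Human vs {name}: {diffs} of {len(alignment['Human'])} columns differ")
```

A fragment this short says nothing about the genome-wide 1.23% figure; it only illustrates how the counting works once an alignment is available.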
And where are we similar?
VERY SIMILAR
Conserved between many organisms
VERY DIFFERENT
Functional genomics
TO BE IS NOT ENOUGH
At any point in time a gene can be functional or not
From the gene expression pattern we can learn:
What does the gene do?
When is it needed?
What other genes or proteins interact with it?
…
What's wrong?
Structural genomics
The three-dimensional structure of a protein can tell much more than the sequence alone
• Protein-ligand complexes
• Functional sites
• Fold
• Evolutionary relationship
• Shape and electrostatics
• Active sites
• Protein complexes
• Biological processes
Resources and Databases
The different types of data are collected in databases:
– Sequence databases
– Structural databases
– Databases of experimental results
All databases are connected
Sequence databases
• Gene database
• Genome database
• Disease-related mutation database
• …
Genome Browsers
Easy “walk” through the genome
• UCSC Genome Browser: http://genome.ucsc.edu/
• Ensembl Genome Browser: http://www.ensembl.org
• WormBase: http://www.wormbase.org/
• AceDB: http://www.acedb.org/
• Comprehensive Microbial Resource: http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
• FlyBase: http://flybase.bio.indiana.edu/
Mutation database
• A single-base difference at a single position between two individuals of the same species
• Such differences play an important role in differentiation and disease
Sickle Cell Anemia
• Caused by a single-base substitution (A replaced by T), which changes the encoded amino acid in hemoglobin from glutamic acid to valine
Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/
Healthy Individual
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
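The records above are in FASTA format: a header line starting with ">" followed by one or more sequence lines. A minimal Python sketch of a reader for this format, assuming the header fields are separated by "|" as in the records shown; `parse_fasta` is a hypothetical helper written for this example, not a library function:

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# The beta globin protein record from the slide.
example = """>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
"""

for header, sequence in parse_fasta(example):
    accession = header.split("|")[3]  # fourth '|'-separated field of the header
    print(accession, len(sequence))
```

Joining the sequence lines matters because FASTA files wrap long sequences over several lines; the line breaks carry no biological meaning.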
Diseased Individual
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGT
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPVEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
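The effect of the substitution can be reproduced by translating the start of the coding region with and without the change. The following is a hedged Python sketch using the standard genetic code. The 30-base fragment is the start of the HBB coding sequence in the mRNA record (beginning at the ATG start codon); the helper names `translate` and `substitute` are hypothetical; and the A to T change is applied at coding position 20, the position usually cited for the sickle cell variant.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def substitute(seq, position, new_base):
    """Return seq with the base at a 1-based position replaced by new_base."""
    return seq[:position - 1] + new_base + seq[position:]

# First 30 bases of the HBB coding sequence, starting at ATG.
healthy_cds = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
sickle_cds = substitute(healthy_cds, 20, "T")  # A to T at coding position 20

print(translate(healthy_cds))
print(translate(sickle_cds))
```

The two translations differ at a single residue, E in the healthy fragment and V in the mutated one, matching the two protein records. Counting the initial methionine this is residue 7; the traditional numbering, which omits the methionine, calls it position 6.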
Structure Databases
• Three-dimensional structures of proteins, nucleic acids, molecular complexes, etc.
• 3D data are available thanks to techniques such as NMR and X-ray crystallography
Databases of Experimental Results
• Experimental data such as microarray images (gene expression data)
• Proteomic data (protein expression data)
• Metabolic pathways, protein-protein interaction data, regulatory networks
• Etc.
Literature Databases
PubMed
Service of the National Library of Medicine
http://www.ncbi.nlm.nih.gov/pubmed/
Putting it all Together
• Each database contains specific information
• Like other biological systems, these databases are interrelated
GENOMIC DATA: GenBank, DDBJ, EMBL
ASSEMBLED GENOMES: GoldenPath, WormBase, TIGR
PROTEIN: PIR, SWISS-PROT
STRUCTURE: PDB, MMDB, SCOP
LITERATURE: PubMed
PATHWAY: KEGG, COG
DISEASE: LocusLink, OMIM, OMIA
GENES: RefSeq, AllGenes, GDB
SNPs: dbSNP
ESTs: dbEST, UniGene
MOTIFS: BLOCKS, Pfam, Prosite
GENE EXPRESSION: Stanford MGDB, NetAffx, ArrayExpress