BIOINF 4399B Computational Proteomics and Metabolomics
Oliver Kohlbacher & Sven Nahnsen
WS 11/12 1. Introduction and Overview
Overview
• Administrative stuff (credits, requirements)
• Motivation/ quick review of relevant contents of Bioinformatics 2
• Overview of the contents of this lecture
• Proteomics and Metabolomics
• Computational mass spectrometry
• http://abi.inf.uni-tuebingen.de/Teaching/ws-2011-12/
Course Requirements
To pass this course you must:
• regularly and actively participate in the weekly problem sessions,
• pass the final exam and assignments
• work on the assignments alone (no groups!)
Course Credits & Grading
• Credits
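• MSc Bioinfo: 4 credit points (LP), module “Wahlpflichtbereich Bioinformatik” (elective area Bioinformatics)
• MSc Info: 4 credit points (LP), area “Wahlpflichtbereich Informatik” (elective area Computer Science)
• Diplom: 2+2 semester hours per week (SWS), Praktische Informatik (Practical Computer Science; passing required)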
• Grade
• 40% assignments
• 60% finals
• Finals: oral exam (30 minutes) covering the contents of the whole lecture and the assignments
• Finals will be scheduled at the end of the semester (in the week of February 7)
Assignments
• One assignment every week
• One week for each assignment; hand in via e-mail before the problem sessions on Tuesdays
• Work alone on assignments!
• Assignments will comprise theoretical tasks as well as programming tasks
• Schedule of the 1st assignment: online Oct. 17; printed Oct. 18; due Oct. 25
Recommended Software
OpenMS/ TOPP
A software library for mass spectrometry
www.openms.de
Contact
• Questions concerning the lecture/assignments
• Website
abi.inf.uni-tuebingen.de/Teaching/SS12/CPM
• Timo Sachsenberg (Sand 14, C322)
• Mathias Walzer (Sand 14, C 304)
• Sven Nahnsen (Sand 14, C322, please send e-mail first)
The central dogma of molecular biology
Origin of the “Central Dogma of Molecular Biology” (Francis Crick, 1956)
• First articulation by Francis Crick in 1956
• Published in Nature in 1970
The central dogma – classical view
• In general, the classic view reflects how biology is (biological data are) organized
• Genomics enabled a more complex view
• Barry, P. 2007. Genome 2.0: Mountains of new data are challenging old views. Science News 172(10):154 (week of Sept. 8).
• The RNA revolution: Biology's Big Bang. The Economist, Jun 14th 2007
• Gerstein et al., 2007. What is a gene, post-ENCODE? History and updated definition. Genome Research 17(6):669-81
• RNA editing can lead to protein sequences that are very different from the initial DNA (Li, M. et al. Science doi:10.1126/science.1207018 (2011))
• …
National Human Genome Research Institute (NHGRI)
Reminder (Bioinformatics 2)
Systems Biology
Oltvai-Barabasi, Science, 2002
• Quantitative data on various levels of biological complexity form the foundation of systems biology
• Mathematical modeling has been based on gene expression
• Recent important technological improvements allow the analysis of protein and metabolite profiles in great depth
• Important layers for understanding biology
• New experimental techniques pose tremendous challenges for computational analysis
Aims of systems biology
• Describe large-scale organization
• Quantitative modeling
• Describe cell as system of networks
• Fundamental research: time-resolved quantitative understanding of living systems
• Medicine: enable personalized medicine (e.g., improve treatment strategies for cancer patients)
• Biotechnology: improve production, degradation, construction of synthetic organisms, etc.
Exp. Methods – Transcriptomics
• Extract and amplify RNA
• Hybridization on microarray
• Identify and quantify by fluorescence signal
• Sequences can be mapped back to genome
Lindsay, Nature Rev. Drug Discovery, 2003, 2, 803
Microarray Data Analysis
• Key problems in microarray data analysis are
• Data normalization
• Clustering
• Dimension reduction
• Diagnostics/classification
• Network inference
• Visualization of results
Janko Dietzsch, Nils Gehlenborg and Kay Nieselt. Mayday-a microarray data analysis workbench. Bioinformatics 2006 22(8):1010-1012
Genome sequencing
[Figure: two publications dated February 15, 2001 and February 16, 2001]
• 2001: initial publication
• 2003: 2nd draft “Human Genome”
• > 13 years of work and > 3 × 10^9 $
• 2010: 8 days and 1 × 10^4 $
• Future: a biotech company (Pacific Biosciences) expects to produce a similar amount of data within 3 years in < 15 min for < 1 × 10^3 $
Status genomics/transcriptomics
• Dramatic drop in cost for genome sequencing
• Number of sequenced genomes grows continuously
• Genome is a very static snapshot of living system
• Biological adaptation is rather slow; the genome serves as long-term information storage
• Proteins and their reaction products, the metabolites, are much closer to reality
• Genome and transcriptome databases are essential bases for proteomics and metabolomics research
Genomics vs. Proteomics
Genomics
• Genomes are rather static
• ~ 20 k genes
• established technology (capillary sequencer)
Proteomics
• Proteomes are dynamic (age, tissue, breakfast, …)
• up to 1000 k proteins
• emerging technologies (MS, HPLC/MS, protein chips)
Main fields of proteomics
• protein expression
• protein characterization (identification + PTMs)
• protein interaction
• protein localization
Applications of proteomics
Applications listed across the four fields (protein expression, protein characterization (identification + PTMs), protein interaction, protein localization):
• Determine content of a protein mixture
• Understanding regulation of protein activity
• Gene annotation
• Therapeutic markers
• Functional annotation (compartment and function)
• Drug target identification (listed for several fields)
Exp. Methods – Proteomics
• Compare two proteomes (e.g. healthy/diseased)
• Separate using 2D-PAGE (w.r.t. molecular mass, pI)
• Excise protein spots from the gel
• Tryptic digest of the proteins
• Identify proteins using mass spectrometry and Database search
Lindsay, Nature Rev. Drug Discovery, 2003, 2, 803
HPLC-MS
• Separation 1 (HPLC): separate peptides by their retention time (RT) on the column
• Ionization (ESI): electrospray, transfers charge to the peptides
• Separation 2 (TOF): MS separates by mass-to-charge ratio (m/z)
[Figure: one spectrum (scan) per retention time; axes: RT, intensity I]
Mass Spectrometry
• Mass spectrometry measures a peptide's mass-to-charge ratio (m/z)
• Peak area is proportional to peptide concentration
[Figure: mass spectrum; axes: m/z, intensity]
Proteomics: Database Search
• Identification of mass spectra is easily done through database search
• Search all peptides of matching mass from a database
• Construct a theoretical mass spectrum for these peptide candidates
• Score against the experimental spectrum
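The first step of the database search above (finding all peptides of matching mass) can be sketched as follows. This is a minimal illustration, not the method of any particular search engine: it assumes tryptic peptides without missed cleavages, unmodified residues, and a hypothetical absolute mass tolerance `tol_da`. The residue masses are standard monoisotopic values; the function names are made up for this example. Constructing theoretical spectra and scoring are omitted here.

```python
import re

# Monoisotopic residue masses in Da (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-terminus


def tryptic_digest(protein):
    """Cleave after K or R unless followed by P (no missed cleavages).

    Splitting on an empty match requires Python 3.7 or newer.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates(proteins, precursor_mass, tol_da=0.5):
    """All tryptic peptides whose mass lies within tol_da of the precursor."""
    hits = []
    for protein in proteins:
        for peptide in tryptic_digest(protein):
            if abs(peptide_mass(peptide) - precursor_mass) <= tol_da:
                hits.append(peptide)
    return hits
```

In practice the tolerance depends on the instrument, and the number of candidates per spectrum grows quickly with database size and tolerance.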
[Figure: experimental spectra are searched against a sequence DB]
Exp. Metabolomics
• Extract all metabolites/ small molecules, usually < 800 Da
• Separate the heterogeneous collection of analytes (lipids, di- or tripeptides, phospholipids, sugars, etc.)
• Identify and quantify the analytes
Exptl. Metabolomics
Nicholson and Lindon. Nature 2008, 455, 1054-1056
Metabolic Networks
http://www.genome.jp/dbget-bin/www_bget?pathway+ecj00020
Metabolomics: Simulation
http://www.systems-biology.org/cd/simulation/data/sim1cp2.gif
This lecture
[Figure: quantitative mass spectrometry, computational mass spectrometry, and system-wide biological data on proteome and metabolome level]
• Basics of Proteomics/ Metabolomics
• Basics of chromatography and mass spectrometry
• Computational mass spectrometry
• Algorithms for peptide/ protein quantification and identification
• Algorithms for metabolite identification and quantification
• Applications, e.g., biomarkers and complete proteomes
• Open research questions, e.g., clinical translation
Proteomics
• Studying the proteome
• Proteome := all proteins that are expressed in a given organism, tissue or cell at a given state and time
• Goal of studying proteomes: understand the function of all proteins in a biological system
• Large databases have been established, e.g., the Gene Ontology Consortium (www.geneontology.org) catalogues proteins by their molecular function, biological process and cellular component (compartment)
Protein
• A protein or polypeptide consists of a linear chain of amino acids that folds into a 3-dimensional structure
• Amino acids are connected via peptide bonds
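[Figure: peptide backbone H2N-CHR1-CO-NH-CHR2-CO-NH-CHR3-CO-NH-CHR4-COOH with side chains R1 to R4; the CO-NH links are the peptide bonds; the H2N end is the N-terminus, the COOH end is the C-terminus]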
Protein
• Defining a protein raises some problematic issues
• Protein identity: unique amino acid sequence and single source of origin?
• There may be different genes encoding the identical amino acid sequence
• Different organisms may encode identical proteins
• Splice variants: A gene can give rise to different mRNAs
• Polymorphisms: many genes occur in allelic variants encoding sequence variations
• Posttranslational modifications: PTMs are very heterogeneous and can significantly alter the function of the protein
Metabolomics
• Studying the metabolome
• Metabolome := the complete set of small-molecule metabolites (such as metabolic intermediates, hormones and other signaling molecules, and secondary metabolites) to be found within a biological sample, such as a single organism (http://en.wikipedia.org/wiki/Metabolome)
Technologies
Modern proteomics and metabolomics studies are based on liquid chromatography (LC) coupled to mass spectrometry (MS): LC-MS
(Liquid) chromatography
• The mobile phase is liquid; the stationary phase is usually solid
• Analytes are held back on a column
• The mobile phase is pumped over the column
• Analytes continuously separate and elute from the column according to specific properties (e.g. hydrophobicity)
• Other chromatography techniques (e.g. gas chromatography) will also be mentioned
HPLC (High Performance Liquid Chromatography)
[Figure: HPLC setup with pump, eluent (mobile phase), injection valve, analyte mixture, column (stationary phase) and detector; the detector signal is recorded over retention time (RT)]
Mass spectrometry
• Mass spectrometry (MS) is an analytical technique to measure the mass (or more precisely: mass-to-charge ratio, m/z) of an analyte
• MS has a long history in physics and chemistry and is today the key technology in proteomics and metabolomics
• “Soft ionization” methods enable its application in the bio(-analytical) sciences
• For OMICS analyses MS is usually coupled to a second separation technique (e.g. LC for proteomics and LC/GC for metabolomics)
• There are various types of mass spectrometers (see 3rd lecture)
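Because the instrument reports m/z rather than mass, converting between the two is a recurring step. The sketch below assumes positive-mode ions of the form [M + zH]^z+ (protonation only, no other adducts); the function names are chosen for this example.

```python
PROTON = 1.007276  # mass of a proton in Da


def mz_from_mass(neutral_mass, charge):
    """m/z of an [M + zH]^z+ ion for a given neutral monoisotopic mass."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON) / charge


def mass_from_mz(mz, charge):
    """Neutral mass recovered from an observed m/z and an assumed charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return mz * charge - charge * PROTON
```

The same analyte therefore appears at different m/z values depending on its charge state, which is why the charge has to be determined before a mass can be reported.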
Mass spectrometry
Modified from Aebersold and Mann, Nature, 2003
• Ionization techniques: Matrix Assisted Laser Desorption/Ionisation (MALDI), Electrospray ionization (ESI)
• Mass analyzers: Reflector time-of-flight (TOF), time-of-flight time-of-flight (TOF-TOF), Triple Quadrupole, Quadrupole time-of-flight, Ion trap, Fourier transform ion cyclotron resonance, Orbitrap
• Mass detector: Electron multiplier
Challenges in computational MS
• Huge data sets (up to TBs per experiment)
• Ambiguity in protein identification
• Uncertainty in proteome size
• Ambiguity of masses for small molecules
Computational tasks, ordered from upstream to downstream:
• Peak picking/ Feature Finding
• Map alignment/ Quantification
• Peptide/ Protein Identification
• Metabolite identification
• Statistical analysis
• Enrichment analysis
• Analysis of time course data
• Data integration
Peak Picking
[Figure: raw data converted to sticks]
• Identify peaks
• Integrate peaks to sticks
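The two steps above can be illustrated with a deliberately simple, threshold-based sketch: every contiguous run of data points above a noise threshold is treated as one peak and collapsed into a stick at its intensity-weighted centroid. This is an assumption made for illustration and not the peak picker of any specific tool; in particular it does not separate overlapping peaks, and `threshold` is a hypothetical user-chosen noise level.

```python
def to_stick(region):
    """Collapse one peak region into (centroid m/z, summed intensity)."""
    total = sum(intensity for _, intensity in region)
    centroid = sum(mz * intensity for mz, intensity in region) / total
    return (centroid, total)


def pick_peaks(mz_values, intensities, threshold):
    """Turn a profile spectrum into a list of sticks.

    A peak is every contiguous run of points with intensity > threshold.
    """
    sticks = []
    region = []
    for mz, intensity in zip(mz_values, intensities):
        if intensity > threshold:
            region.append((mz, intensity))
        elif region:
            sticks.append(to_stick(region))
            region = []
    if region:  # spectrum ended inside a peak
        sticks.append(to_stick(region))
    return sticks
```

Summing intensities is only a rough stand-in for peak integration when the m/z spacing is uniform; more careful approaches fit a peak shape model.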
Quantification
• Determine volume of each feature in a map
[Figure: feature model; m/z dimension: isotopic pattern; RT dimension: elution profile]
• Quantification as a 3D signal detection problem
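Once a feature has been detected, its volume can be approximated by integrating the signal over retention time for each isotopic trace and summing the results. The sketch below assumes the feature is already given as a list of isotope traces, each a list of (RT, intensity) points sorted by RT; this data layout and the use of the trapezoidal rule are choices made for this example.

```python
def trapezoid_area(points):
    """Area under a piecewise-linear curve given as sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area


def feature_volume(isotope_traces):
    """Sum of the elution-profile areas of all isotopic traces of a feature."""
    return sum(trapezoid_area(trace) for trace in isotope_traces)
```

Finding the traces in the first place (the actual signal detection problem) is the hard part and is not covered by this sketch.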
Map alignment
• Correct for retention time offset and distortions in label-free experiments
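A minimal sketch of retention time correction, assuming that pairs of corresponding features between a map and a reference map are already known and that the distortion is well described by a linear function (offset plus stretch). Real distortions are often nonlinear, so this is a simplification for illustration; the function names are hypothetical.

```python
def fit_linear(rt_map, rt_reference):
    """Least-squares fit of rt_reference = a * rt_map + b from matched pairs."""
    n = len(rt_map)
    if n < 2 or n != len(rt_reference):
        raise ValueError("need at least two matched RT pairs")
    mean_x = sum(rt_map) / n
    mean_y = sum(rt_reference) / n
    sxx = sum((x - mean_x) ** 2 for x in rt_map)
    if sxx == 0:
        raise ValueError("map retention times must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rt_map, rt_reference))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def align(rts, a, b):
    """Apply the fitted transformation to all retention times of a map."""
    return [a * rt + b for rt in rts]
```

After alignment, features from different runs can be matched by comparing their corrected RT and m/z values.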
Peptide Identification
• LC-MS/MS experiment: experimental spectra (m/z over RT) yield fragment m/z values
• Sequence db: theoretical fragment m/z values (theoretical spectra, m/z vs. intensity [%]) from suitable peptides
• Compare experimental and theoretical fragment m/z values
• Score hits
Example sequence db entry:
Q9NSC5|HOME3_HUMAN Homer protein homolog 3 - Homo sapiens (Human)
MSTAREQPIFSTRAHVFQIDPATKRNWIPAGKHALTVSYFY
DATRNVYRIISIGGAKAIINSTVTPNMTFTKTSQKFGQWDS
RANTVYGLGFASEQHLTQFAEKFQEVKEAARLAREKSQD
GGELTSPALGLASHQVPPSPLVSANGPGEEKLFRSQSADA
PGPTERERLKKMLSEGSVGEVQWEAEFFALQDSNNKLAG
ALREANAAAAQWRQQLEAQRAEAERLRQRVAELEAQAAS
EVTPTGEKEGLGQGQSLEQLEALVQTKDQEIQTLKSQTGG
PREALEAAEREETQQKVQDLETRNAELEHQLRAMERSLEE
ARAERERARAEVGRAAQLLDVSLFELSELREGLARLAEAAP
Example fragment m/z lists from the figure (the first list is shown repeatedly):
569.24 572.33 580.30 581.46 582.63 606.32 610.24 616.14
569.24 574.83 580.70 580.92 579.99 603.92 611.14 616.74
570.84 571.72 580.40 591.18 579.35 607.25 611.42 614.45
Scored hits:
Rank Peptide Score
1 QRESTATDILQK 18.77
2 EIEEDSLEGLKK 14.78
3 GIEDDLMDLIKK 12.63
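The compare-and-score idea can be sketched with singly charged b- and y-ions and a shared peak count. The shared peak count is a simple stand-in chosen for this example and is not the score shown in the hit list; the fragment tolerance `tol` and all function names are assumptions. Residue masses are standard monoisotopic values for unmodified amino acids.

```python
# Monoisotopic residue masses in Da (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.007276


def fragment_mz(peptide):
    """Sorted m/z values of all singly charged b- and y-ions of a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    fragments = []
    for i in range(1, len(masses)):
        fragments.append(sum(masses[:i]) + PROTON)          # b_i (N-terminal)
        fragments.append(sum(masses[i:]) + WATER + PROTON)  # y_(n-i) (C-terminal)
    return sorted(fragments)


def shared_peak_count(theoretical, experimental, tol=0.5):
    """Number of theoretical fragments with an experimental peak within tol."""
    return sum(
        any(abs(t - e) <= tol for e in experimental) for t in theoretical
    )


def rank_candidates(peptides, experimental, tol=0.5):
    """Candidates as (score, peptide) tuples, best score first."""
    scored = [
        (shared_peak_count(fragment_mz(p), experimental, tol), p)
        for p in peptides
    ]
    return sorted(scored, reverse=True)
```

Usage: `rank_candidates(["GAK", "KAG"], fragment_mz("GAK"))` ranks `"GAK"` first, since all four of its theoretical fragments are found in the (here simulated) experimental peak list.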
Protein inference
Nesvizhskii, Molecular and Cellular Proteomics, 2005
Metabolite identification
Kind and Fiehn. BMC Bioinformatics 2006, 7:235
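The ambiguity of masses for small molecules can be made concrete by enumerating all elemental compositions that fit a measured mass. The sketch below is a brute-force illustration under strong assumptions: only the elements C, H, N and O, a neutral monoisotopic mass as input, and no chemical plausibility rules (valence, element ratios, isotope patterns), so many of the returned formulas will not correspond to real compounds.

```python
# Monoisotopic atomic masses in Da.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def formulas(mass, ppm=5.0):
    """All (C, H, N, O) atom counts whose mass lies within ppm of `mass`."""
    tol = mass * ppm * 1e-6
    hits = []
    for c in range(int((mass + tol) // MONO["C"]) + 1):
        rest_c = mass - c * MONO["C"]
        for n in range(int((rest_c + tol) // MONO["N"]) + 1):
            rest_n = rest_c - n * MONO["N"]
            for o in range(int((rest_n + tol) // MONO["O"]) + 1):
                rest_o = rest_n - o * MONO["O"]
                # The remaining mass must be explained by hydrogens alone.
                h = round(rest_o / MONO["H"])
                if h >= 0 and abs(rest_o - h * MONO["H"]) <= tol:
                    hits.append((c, h, n, o))
    return hits
```

Usage: `formulas(180.063388)` contains `(6, 12, 0, 6)`, the composition of glucose (C6H12O6). Widening the tolerance or adding further elements can only increase the number of candidates, which is why additional constraints are needed for metabolite identification.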
Preliminary Schedule
Date             Topic
Oct 11           Today’s overview
Oct 18           Proteomics and Metabolomics
Oct 25           Physics and chemistry of LC-MS
Nov 1            All Saints' Day (Allerheiligen, public holiday)
Nov 8            Lab Excursion: Proteome Center Tübingen
Nov 15           Basic statistics for computational MS
Nov 22           Protein/ Metabolite quantification I
Nov 29           Protein/ Metabolite quantification II
Dec 6            Protein/ Metabolite quantification III
Dec 13           Peptide ID I
Dec 20           Peptide ID II
Dec 27 / Jan 3   Christmas break
Jan 10           Protein ID
Jan 17           Posttranslational Modifications
Jan 24           Metabolite ID
Jan 31           Summary/ repetition
Week of Feb 7    Exams
Textbooks
Eidhammer, Flikka, Martens, Mikalsen: Computational methods for mass spectrometry proteomics, Wiley, 2007
Good introduction:
• Biochemical basics
• Mass spectrometry
• Algorithms for protein identification/ quantification
Important papers
• Nature Reviews, Molecular Cell Biology, 2004
• Nature, 2003
• Nature Methods, 2007
• Nesvizhskii, Molecular and Cellular Proteomics, 2005
Materials
• The script will be made available as a printout at the beginning of each class
• Additional materials (papers, literature) are available on the course web site or can be requested via e-mail; course web site: http://abi.inf.uni-tuebingen.de/Teaching/ws-2011-12/
• Textbooks are available in the library in the section dedicated to this lecture (Handapparat)