Functional Genomics Introduction
Julie A Dickerson
Electrical and Computer Engineering
Iowa State University
Module Structure: Day 1
Introduction to Functional Genomics
Transcriptomics
Analysis and Experiment Design for Microarray Data (Dr. Peng Liu)
RNA-Seq Data (Mr. Kun Liang)
Lab: Using R for normalizing and processing microarray data, and clustering analysis of ‘omics data (John Van Hemert)
June 15, 2010
BBSI - 2010
Module Structure: Day 2
Metabolomics (Dr. Ann Perera)
Proteomics (Dr. Young-Jin Lee)
Pathways and data integration methods (Dr. Julie Dickerson and Erin Boggess)
Lab: Analyzing integrated sets of microarray, proteomics and metabolomics data (Erin Boggess)
F1: Outline
Module Structure
What is Functional Genomics?
Data Types Available
Transcriptomics
Basic biology behind microarrays
What can you learn from microarrays?
Types of arrays
Limitations of microarrays
Functional Genomics Definition
Functional genomics is a field of molecular biology that attempts to make use of the data produced by genomic projects to describe gene (and protein) functions and interactions. Functional genomics focuses on the dynamic aspects such as gene transcription, translation, and protein-protein interactions, as opposed to the static aspects such as DNA sequence or structures.
From Wikipedia, the free encyclopedia
Genome Wide View of Metabolism
Streptococcus pneumoniae
Explore capabilities of global network
How do we go from a pretty picture to a model we can manipulate?
Metabolic Pathways
Metabolites: glucose
Enzymes: phosphofructokinase
Reactions & Stoichiometry: 1 F6P => 1 FBP
Kinetics
Regulation: gene regulation, metabolite regulation
hexokinase
phosphoglucoisomerase
phosphofructokinase
aldolase
triosephosphate isomerase
G3P dehydrogenase
phosphoglycerate kinase
phosphoglycerate mutase
enolase
pyruvate kinase
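One way to move from a pathway picture toward a model that can be manipulated is to write each reaction as a set of stoichiometric coefficients. The sketch below is only an illustration: it encodes the single reaction given on the slide (1 F6P => 1 FBP), and the flux value and the function name `net_change` are assumptions made for the example.

```python
# Hypothetical toy model; the flux value below is made up for illustration.
# Each reaction maps metabolite -> stoichiometric coefficient
# (negative = consumed, positive = produced).
REACTIONS = {
    "phosphofructokinase": {"F6P": -1, "FBP": +1},  # 1 F6P => 1 FBP (from the slide)
}

def net_change(reactions, fluxes):
    """Net rate of change of each metabolite: sum of coefficient * flux over reactions."""
    change = {}
    for name, stoichiometry in reactions.items():
        for metabolite, coefficient in stoichiometry.items():
            change[metabolite] = change.get(metabolite, 0.0) + coefficient * fluxes[name]
    return change

rates = net_change(REACTIONS, {"phosphofructokinase": 2.0})
print(rates)
```

Adding kinetics or regulation would mean making each flux a function of metabolite concentrations, which this sketch leaves out.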
Metabolic Modeling: The Dream
June 11, 2009 BBSI - 2009
Data Types Available for Determining Function
Genomes: sequence
Genes: microarrays, next-gen sequencing
Proteins: proteomics
Metabolites: metabolomics
Phenotypes: phenomics
A VERY Simplified Eukaryotic Cell
nucleus
chromosome
DNA strands
DNA contains thousands of genes.
cytoplasm
Dan Nettleton, Department of Statistics, IOWA STATE UNIVERSITY, Copyright © 2008 Dan Nettleton
Posttranscriptional Modifications to Primary Transcript
Primary transcript: 5’ UTR, coding portions of RNA sequence corresponding to exons, intervening sequences corresponding to introns that are removed through splicing, 3’ UTR
Primary transcript after modification: messenger RNA (mRNA), with a 5’ cap (G), 5’ UTR, 3’ UTR, and poly-A tail (AAAAAA...AAAA)
Transcription takes place inside the nucleus.
nucleus
chromosome
DNA strands cytoplasm
Translation takes place outside the nucleus.
Translation
mRNA
Ribosome
amino acid sequence
folds to become a protein
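During translation, transfer RNA (tRNA) translates the genetic code.
[Figure: tRNA molecules pair their anticodons with mRNA codons and deliver amino acids; the tRNA with anticodon A A U carries leu and the tRNA with anticodon U G C carries thr.]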
The Genetic Code
Each entry is an mRNA codon followed by its amino acid. Rows are grouped by first base (U, C, A, G); columns are ordered by second base (U, C, A, G).
UUU phe   UCU ser   UAU tyr   UGU cys
UUC phe   UCC ser   UAC tyr   UGC cys
UUA leu   UCA ser   UAA STOP  UGA STOP
UUG leu   UCG ser   UAG STOP  UGG trp

CUU leu   CCU pro   CAU his   CGU arg
CUC leu   CCC pro   CAC his   CGC arg
CUA leu   CCA pro   CAA gln   CGA arg
CUG leu   CCG pro   CAG gln   CGG arg

AUU ile   ACU thr   AAU asn   AGU ser
AUC ile   ACC thr   AAC asn   AGC ser
AUA ile   ACA thr   AAA lys   AGA arg
AUG met   ACG thr   AAG lys   AGG arg

GUU val   GCU ala   GAU asp   GGU gly
GUC val   GCC ala   GAC asp   GGC gly
GUA val   GCA ala   GAA glu   GGA gly
GUG val   GCG ala   GAG glu   GGG gly
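The genetic code can be treated as a lookup table from codons to amino acids. The following is a minimal sketch of that idea, assuming an mRNA string that is already in the correct reading frame; the function name `translate` and the example sequence are illustrative and not part of the original slides.

```python
from itertools import product

BASES = "UCAG"
# Amino acids for all 64 codons in UCAG order (first, second, third base),
# as one-letter codes; '*' marks a STOP codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
THREE_LETTER = {
    "F": "phe", "L": "leu", "S": "ser", "Y": "tyr", "*": "STOP", "C": "cys",
    "W": "trp", "P": "pro", "H": "his", "Q": "gln", "R": "arg", "I": "ile",
    "M": "met", "T": "thr", "N": "asn", "K": "lys", "V": "val", "A": "ala",
    "D": "asp", "E": "glu", "G": "gly",
}
CODON_TABLE = {
    "".join(codon): THREE_LETTER[aa]
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate an in-frame mRNA string codon by codon, stopping at the first STOP."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "STOP":
            break
        peptide.append(residue)
    return peptide

# Made-up example sequence: AUG UUA ACG UAA
print(translate("AUGUUAACGUAA"))
```

The codons UUA and ACG in the example are the ones that pair with the anticodons A A U (leu) and U G C (thr) in the tRNA slide.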
Miscellaneous Comments
The biology is more complicated than I described.
Humans have roughly 20,000 protein-coding genes; earlier estimates were around 30,000. (The exact number is a subject for debate.) Regulation of these genes seems to be more important than their number!
Much of the variation is created by differences in how cells use the genes they have.
Microarrays are a tool that can help us understand how cells of various types use their genes in response to varying conditions.
04/21/23 BCB570 Gene Expression Data Analysis
Microarrays
With only a few exceptions, every cell of the body contains a full set of chromosomes and identical genes.
Only a fraction of these genes are turned on, however, and it is the subset that is "expressed" that confers unique properties to each cell type.
"Gene expression" is the term used to describe the transcription of the information contained within the DNA, the repository of genetic information, into messenger RNA (mRNA) molecules that are then translated into the proteins that perform most of the critical functions of cells.
Microarrays
Microarrays work by exploiting the ability of a given mRNA molecule (target) to bind specifically to, or hybridize to, the DNA template (probe) from which it originated.
Regulation of gene expression acts as both an "on/off" switch that controls which genes are expressed in a cell and a "volume control" that increases or decreases the level of expression of particular genes as necessary.
Source: The Genetic Science Learning Center, University of Utah
DNA Microarrays
Small, solid supports onto which the sequences from thousands of different genes are immobilized, or attached, at fixed locations.
The DNA is printed, spotted, or actually synthesized directly onto the support.
The spots themselves can be DNA, complementary DNA (cDNA, DNA synthesized from an mRNA template), or oligonucleotides (oligos, short fragments of single-stranded DNA that are typically 5 to 50 nucleotides long).
Why do microarray experiments?
Comparing two conditions to find differentially expressed genes
Control/treatment
Disease/normal
Compare more than two conditions, some of which may interact
Different treatments, different strains
Exploratory analysis
What genes are expressed under drought stress?
Why use microarrays (cont)?
What happens over time? Developmental stages
Predicting certain conditions (cancer vs. normal)
Patterns of gene expression that characterize a patient’s or organism’s response
Differentially Expressed Genes
Find genes that show a large difference in expression between groups and are similar within a group
Statistical tests check whether the groups have different means (t-test) or different variances (chi-squared, F-statistics)
Adapted from “Practical Microarray Analysis”, Presentation by Benedikt Brors, German Cancer Research Center
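As a rough illustration of the idea, a two-sample t statistic is large when the difference between group means is large relative to the spread within each group. The sketch below uses Welch's form of the t statistic, which is one common choice and not necessarily the one used in the course; the gene names and expression values are made up.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic: difference in means divided by its standard error."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical log2 expression values: (control replicates, treatment replicates).
expression = {
    "gene_A": ([8.1, 8.3, 7.9, 8.2], [10.0, 10.4, 9.8, 10.1]),
    "gene_B": ([6.0, 7.5, 5.2, 6.8], [6.3, 5.9, 7.1, 6.4]),
}

# Positive t means higher expression under treatment than under control.
t_stats = {
    gene: welch_t(treatment, control)
    for gene, (control, treatment) in expression.items()
}
for gene, t in sorted(t_stats.items(), key=lambda item: -abs(item[1])):
    print(f"{gene}: t = {t:.2f}")
```

In a real experiment the statistic would be computed for every gene on the array, so the resulting p-values need a multiple-testing adjustment before calling genes differentially expressed.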
Multiple Conditions
Are there differences in expression level between the k conditions?
Analysis of Variance (ANOVA)
Mutant 1 Mutant 2
Inoculated Control Inoculated Control
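For a single gene, the ANOVA question can be sketched as an F statistic that compares variation between the k condition means with variation within conditions. The example below is a simplified one-way version that treats each mutant-by-treatment combination as its own group; a full analysis of the two-factor layout above would also separate main effects from their interaction. The function name is illustrative.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic: between-group mean square divided by within-group mean square."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Small hand-checkable example with k = 3 conditions and 3 replicates each.
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f_value)
```

A large F value suggests that at least one condition mean differs from the others; it does not say which one.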
Some Example Microarray Experiments from Iowa State University
Jim Reecy from Animal Science: muscle undergoing hypertrophy vs. normal muscle
David Putthoff, Steve Rodermel, Thomas Baum from Plant Pathology: roots infected with soybean cyst nematodes vs. uninfected roots
Anne Bronikowski in Genetics: wheel-running mice vs. non-runners
Roger Wise, Rico Caldo in Plant Pathology: interaction between multiple isolates of powdery mildew and multiple genotypes of barley.
Wild-type vs. Myostatin Knockout Mice
Belgian Blue cattle have a mutation in the myostatin gene.
Identifying Genes Involved in Pathways That Distinguish Compatible from Incompatible Interactions
[Table: Bgh isolates (5874, K1) by barley genotypes (Mla6, Mla13, Mla1). Each of the six isolate-genotype combinations is classified as an interaction type: four are Incompatible and two are Compatible.]
Caldo, Nettleton, Wise (2004). The Plant Cell. 16, 2514-2528.
An Example Gene of Interest
[Figure: log expression of the gene plotted against hours after inoculation, with separate curves for incompatible and compatible interactions.]
Exploratory Analysis
Find patterns in data to see what genes are expressed under different conditions
Analysis includes clustering methods
Used when little or no prior knowledge exists about the problem
Copyright ©1999 by the National Academy of Sciences
Perou, Charles M. et al. (1999) Proc. Natl. Acad. Sci. USA 96, 9212-9217
Fig. 5 (see Supplemental data at http://www.pnas.org for the full cluster diagram with all gene names)
Time Series
Goal: find patterns of co-expressed genes over time or partial time
Typical length is 3-10 time points
Cluster to find similar patterns (k-means, self-organizing maps)
Correlations to find genes that behave like a given gene of interest.
0 hours 4 hours 12 hours 24 hours
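The correlation approach can be sketched as ranking every gene by the Pearson correlation between its time profile and the profile of the gene of interest. The gene names and values below are hypothetical, with one value per time point (0, 4, 12, 24 hours).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical log expression at 0, 4, 12, and 24 hours.
profiles = {
    "gene_of_interest": [1.0, 2.0, 4.0, 3.0],
    "gene_X": [0.5, 1.1, 2.0, 1.6],
    "gene_Y": [3.0, 2.5, 1.0, 1.4],
    "gene_Z": [2.0, 2.1, 1.9, 2.0],
}

target = profiles["gene_of_interest"]
ranked = sorted(
    ((pearson(target, profile), gene)
     for gene, profile in profiles.items() if gene != "gene_of_interest"),
    reverse=True,
)
for r, gene in ranked:
    print(f"{gene}: r = {r:.2f}")
```

Correlation compares the shape of two profiles rather than their absolute level, so gene_X ranks first even though its expression is about half that of the gene of interest.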
Classification
Learn characteristic patterns from a training set and evaluate with a test set.
Classify tumor types based on expression patterns
Predict disease susceptibility, stages, etc.
Source: “Practical Microarray Analysis”, Presentation by Benedikt Brors, German Cancer Research Center
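The train-then-test workflow can be illustrated with a nearest-centroid classifier, chosen here only because it is short; the slides do not name a specific classifier. Each sample is a vector of expression values, and all data, labels, and function names below are hypothetical.

```python
from math import dist  # available in Python 3.8+

def fit_centroids(samples, labels):
    """Average the training expression profiles of each class."""
    sums, counts = {}, {}
    for profile, label in zip(samples, labels):
        accumulator = sums.setdefault(label, [0.0] * len(profile))
        for i, value in enumerate(profile):
            accumulator[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, profile):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], profile))

# Hypothetical two-gene expression profiles.
train_x = [[9.0, 2.0], [8.5, 2.5], [3.0, 7.0], [2.5, 7.5]]
train_y = ["tumor", "tumor", "normal", "normal"]
test_x = [[8.8, 2.2], [2.8, 6.9]]
test_y = ["tumor", "normal"]

centroids = fit_centroids(train_x, train_y)
predictions = [predict(centroids, profile) for profile in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

Keeping the test samples out of the fitting step is what makes the reported accuracy an honest estimate of how the classifier would perform on new samples.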
Some Commonly Used Tools for Microarray Analysis
Oligonucleotide arrays
Affymetrix GeneChips
Nimblegen
Agilent
Oligonucleotides
An oligonucleotide is a short sequence of nucleotides (oligonucleotide = oligo for short).
An oligonucleotide microarray is a microarray whose probes consist of synthetically created DNA oligonucleotides.
Probe sequences are chosen to have good and relatively uniform hybridization characteristics.
A probe is chosen to match a portion of its target mRNA transcript that is unique to that sequence.
Oligo probes can distinguish among multiple mRNA transcripts with similar sequences.
Simplified Example
[Figure: transcripts of gene 1 and gene 2; shared green regions indicate a high degree of sequence similarity throughout much of the transcript.]
Oligo probe for gene 1: ATTACTAAGCATAGATTGCCGTATA
Oligo probe for gene 2: GCGTATGGCATGCCCGGTAAACTGG
Source: Dan Nettleton Course Notes Statistics 416/516X
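The point of the simplified example can be checked in a few lines: a probe taken from a region unique to one transcript matches only that transcript, while a sequence from a shared region matches both. In this sketch the two probe sequences come from the slide, but the shared flanking sequences are invented, and hybridization is simplified to exact substring matching.

```python
PROBE_GENE1 = "ATTACTAAGCATAGATTGCCGTATA"  # from the slide
PROBE_GENE2 = "GCGTATGGCATGCCCGGTAAACTGG"  # from the slide

# Hypothetical shared regions standing in for the similar parts of the transcripts.
SHARED_START = "CCGATTGACCTAGGCATTCA"
SHARED_END = "TTGACGGATCCAAGTCGTAA"

transcripts = {
    "gene 1": SHARED_START + PROBE_GENE1 + SHARED_END,
    "gene 2": SHARED_START + PROBE_GENE2 + SHARED_END,
}

def matching_genes(probe, transcripts):
    """Names of all transcripts that contain the probe sequence exactly."""
    return [gene for gene, sequence in transcripts.items() if probe in sequence]

print(matching_genes(PROBE_GENE1, transcripts))
print(matching_genes(PROBE_GENE2, transcripts))
print(matching_genes(SHARED_START, transcripts))
```

Real probe design also has to consider near matches and hybridization behavior, which exact matching ignores.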
Oligo Microarray Fabrication
Oligos can be synthesized and stored in solution.
Oligo sequences can be synthesized on a slide or chip using various commercial technologies.
The company Affymetrix uses a photolithographic approach which we will describe briefly.
Affymetrix GeneChips
Affymetrix (www.affymetrix.com) manufactures GeneChips.
GeneChips are oligonucleotide arrays.
Each gene (more accurately sequence of interest or feature) is represented by multiple short (25-nucleotide) oligo probes.
Some GeneChips include probes for around 120,000 genes and gene variants.
mRNA that has been extracted from a biological sample can be labeled (dyed) and hybridized to a GeneChip.
Only one sample is hybridized to each GeneChip.
Different Probe Pairs Represent Different Parts of the Same Gene
gene sequence
Probes are selected to be specific to the target gene and have good hybridization characteristics.
Source: Dan Nettleton Course Notes Statistics 416/516X
Affymetrix Probe Sets
A probe set is used to measure mRNA levels of a single gene.
Each probe set consists of multiple probe cells.
Each probe cell contains millions of copies of one oligo.
Each oligo is intended to be 25 nucleotides in length.
Probe cells in a probe set are arranged in probe pairs.
Each probe pair contains a perfect match (PM) probe cell and a mismatch (MM) probe cell.
A PM oligo perfectly matches part of a gene sequence.
A MM oligo is identical to a PM oligo except that the middle nucleotide (13th of 25) is replaced by its complementary nucleotide.
A Probe Set for Measuring Expression Level of a Particular Gene
gene sequence: ...TGCAATGGGTCAGAAGGACTCCTATGTGCCT...
perfect match sequence: AATGGGTCAGAAGGACTCCTATGTG
mismatch sequence: AATGGGTCAGAACGACTCCTATGTG
[Figure labels: probe pair, probe cell, probe set]
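The relationship between the PM and MM oligos can be written as a small function: complement the middle base of the 25-mer and leave everything else unchanged. The function name is illustrative; the sequences are the ones shown on the slide.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm_oligo):
    """Return the MM oligo: the PM oligo with its middle base complemented."""
    middle = len(pm_oligo) // 2  # index 12 for a 25-mer, i.e. the 13th base
    return pm_oligo[:middle] + COMPLEMENT[pm_oligo[middle]] + pm_oligo[middle + 1:]

gene_fragment = "TGCAATGGGTCAGAAGGACTCCTATGTGCCT"
pm = "AATGGGTCAGAAGGACTCCTATGTG"
mm = mismatch_probe(pm)

print(pm in gene_fragment)  # the PM oligo matches part of the gene sequence
print(mm)
```

The MM oligo differs from the PM oligo only at position 13, where G is replaced by C.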
Affymetrix’s Photolithographic Approach
[Figure: a GeneChip exposed through a series of masks, with nucleotides (A, C, G, T) added step by step to build up the oligos.]
Source: www.affymetrix.com
Image from Hybridized GeneChip
Image Processing for Affymetrix GeneChips
Image processing for Affymetrix GeneChips is typically done using proprietary Affymetrix software.
The entire surface of a GeneChip is covered with square-shaped cells containing probes.
Probes are synthesized on the chip in precise locations.
Thus spot finding and image segmentation are not major issues.
Probe Cell
8 x 8 = 64 pixels; border pixels excluded.
The 75th percentile of the 36 pixel intensities corresponding to the center 36 pixels is used to quantify fluorescence intensity for each probe cell.
These values are called PM values for perfect-match probe cells and MM values for mismatch probe cells.
The PM and MM values are used to compute expression measures for each probe set.
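The probe-cell summary described above can be sketched as follows. The percentile convention is an assumption: this example uses the nearest-rank definition (the 27th smallest of the 36 center values), and the Affymetrix software may interpolate differently. The function name and pixel values are made up.

```python
def probe_cell_intensity(cell):
    """Summarize an 8 x 8 probe cell by the 75th percentile of its center 6 x 6 pixels."""
    # Drop the border: keep rows and columns 1..6 of the 8 x 8 grid.
    center = sorted(cell[row][col] for row in range(1, 7) for col in range(1, 7))
    # Nearest-rank percentile: ceil(0.75 * 36) = 27, computed with integer arithmetic.
    rank = -(-75 * len(center) // 100)
    return center[rank - 1]

# Made-up cell: bright border pixels (1000) around center values 1..36.
cell = [[1000] * 8 for _ in range(8)]
value = 1
for row in range(1, 7):
    for col in range(1, 7):
        cell[row][col] = value
        value += 1

print(probe_cell_intensity(cell))
```

Because the border pixels are excluded, their extreme values have no effect on the result.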
Normalization
Outputs from each individual probe pair are statistically combined to give an expression level for the gene represented by the probe set.
Normalization accounts for background noise on the chip, levels of control probes, etc.
Key methods are MAS 5.0, RMA, and GCRMA.
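As one concrete piece of these methods, RMA includes a quantile normalization step that forces every chip to share the same intensity distribution. The sketch below shows only that step, on made-up numbers; it omits background correction and probe-set summarization, breaks ties arbitrarily, and is not the Bioconductor implementation.

```python
def quantile_normalize(chips):
    """Give every chip the same distribution: the mean of the sorted chips.

    chips: list of equal-length intensity lists, one list per chip.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the i-th smallest value across chips.
    reference = [sum(s[i] for s in sorted_chips) / len(chips) for i in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # probe indices, lowest first
        result = [0.0] * n
        for rank, index in enumerate(order):
            result[index] = reference[rank]
        normalized.append(result)
    return normalized

# Two hypothetical chips with three probes each.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
print(normalized)
```

After normalization each chip contains the same set of values, and only the ranking of probes within a chip is preserved.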
Summary of Microarrays
Positives: commercial chips are accurate and repeatable in experienced hands and the statistics and modeling have been well-explored
Negatives: cost; can only see what is on the chip; difficult to update to new knowledge.
June 11, 2007 BBSI - 2007
Short Read Sequencing
Sequencing technology has evolved in the last 15 years
Eventual goal is to be able to sequence a genome for $1000 (NIH).
Why not just sequence the transcriptome directly and see what is there?
Sequencing by synthesis (454)
Takes a single strand of DNA and synthesizes its complementary strand enzymatically one base at a time, detecting which base was actually added at each step.
Pyrosequencing detects the activity of DNA polymerase with a chemiluminescent enzyme.
Reads are about 400-500 bp.
Other Technologies
Illumina Solexa: 40-100 bp, tag DNA or RNA at both ends
ABI SOLiD: around 50 bp
Digital Gene Expression
Sequence census methods for functional genomics, Barbara Wold & Richard M Myers