Date post: | 15-Jan-2016 |
CS262 Lecture 16, Win07, Batzoglou
Gene Recognition
Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov
Gene structure
exon1 | intron1 | exon2 | intron2 | exon3
transcription → splicing → translation
exon = protein-coding; intron = non-coding
Codon: a triplet of nucleotides that is translated into one amino acid
Hidden Markov Models for Gene Finding
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
Labels along the sequence: intergene, exon, intron, exon, intron, exon, intergene
States: Intergene State, First Exon State, Intron State
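A minimal sketch of how such a model labels a sequence, using Viterbi decoding over three states. The state set is reduced to intergene, exon, and intron, and every probability below is a made-up placeholder rather than a trained value; a real gene finder uses many more states and higher-order emissions.

```python
import math

STATES = ["intergene", "exon", "intron"]

# Hypothetical parameters, chosen only for illustration.
START = {"intergene": 0.98, "exon": 0.01, "intron": 0.01}
TRANS = {
    "intergene": {"intergene": 0.90, "exon": 0.10, "intron": 0.00},
    "exon": {"intergene": 0.05, "exon": 0.85, "intron": 0.10},
    "intron": {"intergene": 0.00, "exon": 0.10, "intron": 0.90},
}
EMIT = {
    "intergene": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a DNA string."""
    score = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        prev = score[-1]
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for s at this position.
            best = max(STATES, key=lambda r: prev[r] + log(TRANS[r][s]))
            row[s] = prev[best] + log(TRANS[best][s]) + log(EMIT[s][base])
            ptr[s] = best
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


seq = "GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA"
path = viterbi(seq)
```

The returned list has one state label per nucleotide, which is the per-base annotation the slide draws under the sequence.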
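Duration HMM for Gene Finding
[Figure: a DNA sequence segmented into Exon1, Exon2, and Exon3; each exon has a duration d; position j+2 is marked]
Intron: $\prod_i P_{\mathrm{INTRON}}(x_i \mid x_{i-1} \dots x_{i-w})$
Exon: $P_{\mathrm{EXON\_DUR}}(d) \prod_i P_{\mathrm{EXON}((i - j + 2)\,\%\,3)}(x_i \mid x_{i-1} \dots x_{i-w})$
5' splice site: $P_{5'\mathrm{SS}}(x_{i-3} \dots x_{i+4})$
Stop codon: $P_{\mathrm{STOP}}(x_{i-4} \dots x_{i+3})$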
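A small sketch of the exon term, under simplifying assumptions: the context window is dropped (w = 0), so each base is scored by a frame-specific table alone, and the duration distribution is passed in as a function. The names `exon_log_score`, `dur_prob`, and `frame_emit` are hypothetical, not taken from any gene finder.

```python
import math


def exon_log_score(seq, j, d, dur_prob, frame_emit):
    """Log-score of an exon of length d starting at 0-based index j.

    dur_prob(d)       -> probability of exon duration d
    frame_emit[f][b]  -> probability of base b in codon frame f (0, 1, 2)
    The frame index follows the slide's (i - j + 2) % 3 convention.
    """
    score = math.log(dur_prob(d))
    for i in range(j, j + d):
        frame = (i - j + 2) % 3
        score += math.log(frame_emit[frame][seq[i]])
    return score


# Placeholder tables: uniform emissions in every frame.
UNIFORM = [{b: 0.25 for b in "ACGT"} for _ in range(3)]
example = exon_log_score("ATGAAA", 0, 6, lambda d: 0.1, UNIFORM)
```

Working in log space keeps long exons from underflowing, and the duration term is paid once per exon rather than once per base.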
HMM-based Gene Finders
• GENMARK (Borodovsky & McIninch 1993)
• GENIE (Kulp 1996)
• GENSCAN (Burge 1997)
   Big jump in accuracy of de novo gene finding
   Currently one of the best
   HMM with duration modeling for exon states
• FGENESH (Solovyev 1997)
   Currently one of the best
• HMMgene (Krogh 1997)
• VEIL (Henderson, Salzberg, & Fasman 1997)
Better way to do it: negative binomial
• EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)
• Negative binomial with n = 3
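As an illustration, the length distribution obtained by chaining n identical states, each left with probability p per step, can be written down directly. This sketch assumes that parameterization (each state emits at least one base); the slide gives only n = 3, so p and the function name are placeholders.

```python
from math import comb


def neg_binomial_length(d, p, n=3):
    """P(total duration = d) for n chained states with exit probability p.

    Each state has a geometric duration on {1, 2, ...}; their sum is
    negative binomial and cannot be shorter than n.
    """
    if d < n:
        return 0.0
    return comb(d - 1, n - 1) * p**n * (1 - p) ** (d - n)


# The probabilities over all lengths sum to (approximately) one.
total = sum(neg_binomial_length(d, 0.1) for d in range(1, 2000))
```

With n = 1 this reduces to the plain geometric distribution of a single self-looping state; larger n gives a peaked distribution that rules out very short lengths.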
GENSCAN’s hidden weapon
• C+G content is correlated with:
   Gene content (+)
   Mean exon length (+)
   Mean intron length (–)
• These quantities affect parameters of model
• Solution: train parameters of model in four different C+G content ranges!
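A sketch of how a sequence could be routed to one of four parameter sets by its C+G fraction. The slide does not state the range boundaries, so the cut points in `GC_BOUNDS` are assumptions for illustration only, and both function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of C and G among the A/C/G/T bases of a sequence."""
    seq = seq.upper()
    counted = [b for b in seq if b in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A/C/G/T bases")
    return sum(b in "CG" for b in counted) / len(counted)


# Assumed boundaries between the four C+G ranges (not given on the slide).
GC_BOUNDS = (0.43, 0.51, 0.57)


def parameter_set_index(seq, bounds=GC_BOUNDS):
    """Return 0..3, the index of the C+G range the sequence falls in."""
    gc = gc_fraction(seq)
    return sum(gc >= b for b in bounds)


low = parameter_set_index("ATATAT")
high = parameter_set_index("GGCC")
```

The returned index would then select which set of trained transition, duration, and emission parameters to use for that region.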
Evaluation of Accuracy
(Slide by NF Samatova)
• Sensitivity (Sn): fraction of exons (coding nucleotides) whose boundaries are predicted exactly (that are predicted as coding)
• Specificity (Sp): fraction of the predicted exons (coding nucleotides) that are exactly correct (that are coding)
• Correlation Coefficient (CC): combined measure of sensitivity and specificity
   Range: -1 (always wrong) to +1 (always right)

                          Actual coding   Actual no coding
Predicted coding               TP               FP
Predicted no coding            FN               TN
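The nucleotide-level versions of these measures can be sketched as follows. The encoding of annotations as strings of 'C' (coding) and 'N' (non-coding) is an assumption made for this example, as is the function name; specificity is taken in the gene-finding sense defined above, TP / (TP + FP).

```python
import math


def nucleotide_accuracy(actual, predicted):
    """Return (Sn, Sp, CC) for two equal-length 'C'/'N' label strings."""
    if len(actual) != len(predicted):
        raise ValueError("label strings must have equal length")
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == "C" and p == "C":
            tp += 1
        elif a == "N" and p == "C":
            fp += 1
        elif a == "N" and p == "N":
            tn += 1
        else:
            fn += 1
    nan = float("nan")
    sn = tp / (tp + fn) if tp + fn else nan
    sp = tp / (tp + fp) if tp + fp else nan
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else nan
    return sn, sp, cc


perfect = nucleotide_accuracy("CCCCNNNN", "CCCCNNNN")
half = nucleotide_accuracy("CCCCNNNN", "CCNNNNCC")
```

CC is undefined when any row or column of the table is empty, which is why the sketch returns NaN in that case instead of dividing by zero.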
Results of GENSCAN
• On the initial test dataset (Burset & Guigo)
   80% exact exon detection
   10% partial exons
   10% wrong exons
• In general
   HMMs have been best in de novo prediction
   In practice they overpredict human genes by ~2x
Comparison-based Methods
Cross-species gene finding
5’ Exon1 | Intron1 | Exon2 | Intron2 | Exon3 3’
[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
Comparison of 1196 orthologous genes (Makalowski et al., 1996)
• Sequence identity between genes in human/mouse
   – exons: 84.6%
   – protein: 85.4%
   – introns: 35%
   – 5’ UTRs: 67%
   – 3’ UTRs: 69%
• 27 proteins were 100% identical
Not always: HoxA human-mouse
Patterns of Conservation
              Mutations   Gaps      Frameshifts
Genes         30%         1.3%      0.14%
Intergenic    58%         14%       10.2%
Separation    2-fold      10-fold   75-fold
Twinscan
• Twinscan is an augmented version of the GENSCAN HMM.
[Figure: HMM with states E and I, annotated with transitions, duration, and emissions (ACUAUACAGACAUAUAUCAU)]
Twinscan Algorithm
1. Align the two sequences (e.g. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )
   New “alphabet”: 4 x 3 = 12 letters = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions ek(b), where b ∈ { A-, A:, A|, …, U| }
   Emission distributions ek(b) estimated from real genes from human/mouse
   eI(x|) < eE(x|): matches favored in exons
   eI(x-) > eE(x-): gaps (and mismatches) favored in introns
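Step 2 can be sketched as a small function that turns a pairwise alignment into the 12-letter alphabet. The function name and the convention of skipping columns where the target itself has a gap are assumptions of this sketch; the real Twinscan preprocessing may differ.

```python
def conservation_symbols(target_aln, informant_aln):
    """Pair each target base with '|' (match), ':' (mismatch) or '-' (gap).

    Both arguments are rows of the same alignment, with '-' for gaps.
    Columns where the target has a gap carry no target base and are skipped.
    """
    if len(target_aln) != len(informant_aln):
        raise ValueError("alignment rows must have equal length")
    symbols = []
    for t, q in zip(target_aln, informant_aln):
        if t == "-":
            continue
        if q == "-":
            mark = "-"
        elif q == t:
            mark = "|"
        else:
            mark = ":"
        symbols.append(t + mark)
    return symbols


encoded = conservation_symbols("AC-GU", "A-UGC")
```

The resulting list has exactly one symbol per target base, so it can be fed to the same Viterbi recursion as a plain nucleotide sequence, only with a larger emission alphabet.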
Example
Human:     ACGGCGACGUGCACGU
Mouse:     ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
Input to Twinscan HMM:
A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
Recall: eE(A|) > eI(A|) and eE(A-) < eI(A-)
Likely exon
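To see why this input looks like an exon, one can sum per-symbol log-odds of exon versus intron emissions. The emission values below are invented for illustration and depend only on the conservation mark, not on the base; transitions and durations are ignored entirely.

```python
import math

# Hypothetical emission probabilities per conservation mark.
E_EXON = {"|": 0.80, ":": 0.15, "-": 0.05}
E_INTRON = {"|": 0.50, ":": 0.30, "-": 0.20}


def exon_vs_intron_log_odds(symbols):
    """Sum of log(eE / eI) over symbols such as 'A|' or 'G:'."""
    return sum(math.log(E_EXON[s[1]] / E_INTRON[s[1]]) for s in symbols)


symbols = "A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|".split()
score = exon_vs_intron_log_odds(symbols)
```

With 13 matches and 3 mismatches the total is positive under these placeholder values, which agrees with the slide's "likely exon" reading.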
HMMs for simultaneous alignment and gene finding:
Generalized Pair HMMs
The SLAM hidden Markov model
Exon GPHMM
[Figure: a pair of aligned exons, one of length d and one of length e]
1. Choose exon lengths (d, e).
2. Generate alignment of length d+e.
Approximate alignment
Measuring Performance
Example: HoxA2 and HoxA3
[Figure: annotation tracks for the region: SLAM, SGP-2, Twinscan, Genscan, TBLASTX, SLAM CNS, VISTA, RefSeq]
Gene Regulation and Microarrays
Overview
• A. Gene Expression and Regulation
• B. Measuring Gene Expression: Microarrays
• C. Finding Regulatory Motifs
Cells respond to environment
Cell responds to environment: various external messages
Genome is fixed – Cells are dynamic
• A genome is static
   Every cell in our body has a copy of the same genome
• A cell is dynamic
   Responds to external conditions
   Most cells follow a cell cycle of division
• Cells differentiate during development
• Gene expression varies according to:
   Cell type
   Cell cycle
   External conditions
   Location
slide credits: M. Kellis
Where gene regulation takes place
• Opening of chromatin
• Transcription
• Translation
• Protein stability
• Protein modifications
Transcriptional Regulation
• Efficient place to regulate:
No energy wasted making intermediate products
• However, slowest response time
After a receptor notices a change:
1. Cascade message to nucleus
2. Open chromatin & bind transcription factors
3. Recruit RNA polymerase and transcribe
4. Splice mRNA and send to cytoplasm
5. Translate into protein
Transcription Factors Binding to DNA
Transcription regulation:
Certain transcription factors bind DNA
Binding recognizes DNA substrings:
Regulatory motifs
Promoter and Enhancers
• Promoter necessary to start transcription
• Enhancers can affect transcription from afar
Regulation of Genes
[Figure sequence: DNA carrying a gene and its regulatory element; a transcription factor (protein) and RNA polymerase (protein) act at the regulatory element and the gene, and a new protein is produced]
[Figure: raw genomic DNA sequence, shown first unannotated and then with highlighted regions: promoter motifs, 3’ UTR motifs, exons, introns]
Example: A Human heat shock protein
• TATA box: positioning transcription start
• TATA, CCAAT: constitutive transcription
• GRE: glucocorticoid response
• MRE: metal response
• HSE: heat shock element
[Figure: promoter of heat shock hsp70, spanning positions -158 to 0 next to the GENE; binding-site labels: TATA, SP1, CCAAT, AP2, HSE, AP2, CCAAT, SP1]
The Cell as a Regulatory Network
• Genes = wires
• Motifs = gates
[Figure: regulatory circuit over A, B, C, and D]
gene D (Make D): If C then D; If B then NOT D; If A and B then D
gene B (Make B): If D then B
The Cell as a Regulatory Network (2)
DNA Microarrays
Measuring gene transcription in a high-throughput fashion
What is a microarray
• Measure the level of mRNA messages in a cell
[Figure: RNA 4 and RNA 6 are measured in the cell, converted by reverse transcription (RT) to cDNA 4 and cDNA 6, and hybridized to an array with one spot per gene (DNA 1 to DNA 6 for Gene 1 to Gene 6)]
slide credits: M. Kellis
What is a microarray
• A 2D array of DNA sequences from thousands of genes
• Each spot has many copies of same gene
• Measure number of hybridizations per spot
Result:
• Thousands of “experiments” (one per gene) in one go
• Perform many microarrays for different conditions:
   Time during cell cycle
   Temperature
   Nutrient level
Goal of Microarray Experiments
• Measure level of gene expression across many different conditions:
   Expression matrix M: {genes} × {conditions}
   M_ij = |gene_i| in condition_j (the expression level of gene i under condition j)
• Group genes into coregulated sets
Observe cells under different conditions
Find genes with similar expression profiles
• Potentially regulated by same TF
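One common way to compare expression profiles is the Pearson correlation between rows of the expression matrix. The slides do not fix a similarity measure, so this choice, the gene names, and the values below are all illustrative assumptions.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length expression profiles."""
    if len(u) != len(v):
        raise ValueError("profiles must have equal length")
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_u * var_v)


# Toy expression matrix: one row per gene, one column per condition.
M = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.1, 5.9, 8.0],
    "gene3": [4.0, 3.0, 2.0, 1.0],
}
similar = pearson(M["gene1"], M["gene2"])
opposite = pearson(M["gene1"], M["gene3"])
```

Here gene1 and gene2 rise together across conditions and would be candidates for a coregulated set, while gene3 moves in the opposite direction.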
slide credits: M. Kellis
Clustering vs. Classification
• Clustering
   Idea: groups of genes that share similar function have similar expression patterns
   • Hierarchical clustering
   • k-means
   • Bayesian approaches
   • Projection techniques
      • Principal Component Analysis
      • Independent Component Analysis
• Classification
   Idea: a cell can be in one of several states (Diseased vs. Healthy, Cancer X vs. Cancer Y vs. Normal)
   Can we train an algorithm to use the gene expression patterns to determine which state a cell is in?
   • Support Vector Machines
   • Decision Trees
   • Neural Networks
   • K-Nearest Neighbors
Clustering Algorithms
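• Hierarchical
[Figure: points a to h in the plane with a dendrogram; leaf order a b d e f g h c]
• K-means
[Figure: the same points grouped around centers c1, c2, c3; groups a b g h / c / d e f]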
slide credits: M. Kellis
Hierarchical clustering
• Bottom-up algorithm:
   Initialization: each point in a separate cluster
   At each step: choose the pair of closest clusters and merge them
• The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y
• Avoids the problem of specifying the number of clusters
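The bottom-up procedure can be sketched in a few lines. This is a naive version that recomputes all pairwise cluster distances at every step; Euclidean point distance and single-link as the default cluster distance are assumptions of the sketch, and the function names are hypothetical.

```python
import math


def single_link(X, Y):
    """Cluster distance: smallest Euclidean distance between members."""
    return min(math.dist(x, y) for x in X for y in Y)


def hierarchical(points, cluster_dist=single_link):
    """Agglomerative clustering; returns the list of merges in order."""
    clusters = [[p] for p in points]  # initialization: one cluster per point
    merges = []
    while len(clusters) > 1:
        # Choose the pair of closest clusters.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs, key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]])
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = hierarchical(points)
```

The sequence of merges is the dendrogram: cutting it after any number of merges yields a clustering, which is why the number of clusters need not be chosen in advance.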
slide credits: M. Kellis
Distance between clusters
• Single-link method: $CD(X,Y) = \min_{x \in X,\, y \in Y} D(x,y)$
• Complete-link method: $CD(X,Y) = \max_{x \in X,\, y \in Y} D(x,y)$
• Average-link method: $CD(X,Y) = \mathrm{avg}_{x \in X,\, y \in Y} D(x,y)$
• Centroid method: $CD(X,Y) = D(\mathrm{avg}(X), \mathrm{avg}(Y))$
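The four definitions translate almost directly into code. In this sketch D is taken to be Euclidean distance and points are coordinate tuples; both are assumptions, since the slides leave D unspecified.

```python
import math


def single_link(X, Y):
    """Minimum pairwise distance between the two clusters."""
    return min(math.dist(x, y) for x in X for y in Y)


def complete_link(X, Y):
    """Maximum pairwise distance between the two clusters."""
    return max(math.dist(x, y) for x in X for y in Y)


def average_link(X, Y):
    """Mean of all pairwise distances between the two clusters."""
    return sum(math.dist(x, y) for x in X for y in Y) / (len(X) * len(Y))


def centroid(X):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(X) for coords in zip(*X))


def centroid_link(X, Y):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(X), centroid(Y))


X = [(0, 0), (0, 2)]
Y = [(3, 0), (3, 2)]
```

On this toy pair, single-link and centroid both give 3, complete-link gives the diagonal, and average-link falls between the two extremes.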
slide credits: M. Kellis
Results of Clustering Gene Expression
• CLUSTER is simple and easy to use
• De facto standard for microarray analysis
Time: O(N^2 M), where N = #genes and M = #conditions
K-Means Clustering Algorithm
• Each cluster X_i has a center c_i
• Define the clustering cost criterion
   $\mathrm{COST}(X_1, \dots, X_k) = \sum_{X_i} \sum_{x \in X_i} |x - c_i|^2$
• Algorithm tries to find clusters X_1 … X_k and centers c_1 … c_k that minimize COST
• K-means algorithm:
   Initialize centers
   Repeat:
      Compute best clusters for given centers → attach each point to the closest center
      Compute best centers for given clusters → choose the centroid of points in cluster
   Until the changes in COST are “small”
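A compact sketch of this loop in plain Python. The stopping tolerance, the iteration cap, the random initialization from the data points, and the optional `init` argument are choices made for this example rather than part of the algorithm as stated.

```python
import math
import random


def kmeans(points, k, init=None, iters=100, tol=1e-9, seed=0):
    """Return (centers, clusters, cost) after alternating the two steps."""
    centers = list(init) if init else random.Random(seed).sample(points, k)
    prev_cost = float("inf")
    for _ in range(iters):
        # Best clusters for given centers: attach each point to the closest.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Best centers for given clusters: centroid (keep old if empty).
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        cost = sum(
            math.dist(p, centers[i]) ** 2
            for i, cl in enumerate(clusters)
            for p in cl
        )
        if prev_cost - cost < tol:  # change in COST is "small"
            break
        prev_cost = cost
    return centers, clusters, cost


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters, cost = kmeans(points, 2, init=[(0, 0), (10, 10)])
```

Each step can only lower COST, so the loop terminates, but it may stop in a local minimum; running from several initializations and keeping the lowest cost is a common safeguard.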
slide credits: M. Kellis
K-Means Algorithm
• Randomly initialize clusters
• Assign data points to nearest clusters
• Recalculate clusters
• Repeat … until convergence
Time: O(KNM) per iteration, where N = #genes and M = #conditions
Mixture of Gaussians – Probabilistic K-means
• Data is modeled as a mixture of K Gaussians $N(\mu_1, \sigma^2 I), \dots, N(\mu_K, \sigma^2 I)$
   Prior probabilities $\pi_1, \dots, \pi_K$
• Different $\sigma_i$ for every Gaussian i, or even different covariance matrices, are possible, but learning becomes harder
   $P(x) = \sum_i P(x \mid N(\mu_i, \sigma^2 I))\, \pi_i$
   Use EM to learn parameters
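The mixture density, and the soft assignments that the E-step of EM computes from it, can be sketched as follows. The sketch assumes a single shared variance for all components and uses hypothetical function names; it does not implement the M-step or the full EM loop.

```python
import math


def isotropic_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with covariance var * I at point x."""
    dim = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return math.exp(-sq_dist / (2 * var)) / ((2 * math.pi * var) ** (dim / 2))


def mixture_density(x, means, priors, var):
    """P(x): prior-weighted sum of the component densities."""
    return sum(
        prior * isotropic_gaussian_pdf(x, mean, var)
        for mean, prior in zip(means, priors)
    )


def responsibilities(x, means, priors, var):
    """E-step quantity: posterior probability of each component given x."""
    weighted = [
        prior * isotropic_gaussian_pdf(x, mean, var)
        for mean, prior in zip(means, priors)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


means = [(0.0, 0.0), (4.0, 0.0)]
priors = [0.5, 0.5]
midpoint = responsibilities((2.0, 0.0), means, priors, 1.0)
near_first = responsibilities((0.0, 0.0), means, priors, 1.0)
```

K-means corresponds to replacing these soft responsibilities by a hard assignment to the single most probable component, which is the sense in which the mixture model is a probabilistic K-means.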
Analysis of Clustering Data
• Statistical Significance of Clusters
Gene Ontology http://www.geneontology.org/
KEGG http://www.genome.jp/kegg/
• Regulatory motifs responsible for common expression
• Regulatory Networks
• Experimental Verification
Evaluating clusters – Hypergeometric Distribution
$P(\mathrm{pos} \ge r) = \sum_{m \ge r} \frac{\binom{p}{m} \binom{N-p}{k-m}}{\binom{N}{k}}$
• N experiments, p labeled ++, (N-p) labeled ––
• Cluster: k elements, m labeled ++
• P-value of a single cluster containing k elements of which at least r are ++
Each term of the sum: probability that a randomly chosen set of k experiments would result in m positive and k-m negative
The sum: P-value of uniformity in the computed cluster
slide credits: M. Kellis
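The hypergeometric tail can be computed exactly with integer binomial coefficients. The function name is hypothetical; the arguments follow the slide's notation (N experiments, p positives, cluster size k, at least r positives in the cluster).

```python
from math import comb


def hypergeom_pvalue(N, p, k, r):
    """P(at least r positives in a random set of k out of N, p positive)."""
    upper = min(k, p)
    # comb() returns 0 when k - m exceeds N - p, so impossible terms vanish.
    favorable = sum(comb(p, m) * comb(N - p, k - m) for m in range(r, upper + 1))
    return favorable / comb(N, k)


all_positive = hypergeom_pvalue(10, 5, 3, 3)
any_count = hypergeom_pvalue(10, 5, 3, 0)
```

A small value means that a cluster this enriched in positives would rarely arise from a random choice of k experiments.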