Date post: | 13-Jan-2016 |
Upload: | alison-johnson |
CS 5263 Bioinformatics
Lecture 23: Microarray Data Analysis
What is a Microarray
  Conceptually similar to a (reverse) Northern blot
  (Many) probes, rather than mRNAs, are fixed on some surface, in an ordered way
  [Figure labels: Gene 1, Gene 300]
Microarray categories
  cDNA microarray
    Each probe is the cDNA of a gene (500-5000 nt)
  Oligonucleotide microarray
    Each probe is a synthesized short DNA (uniquely corresponding to a substring of a gene)
    Affymetrix: ~25-mers
    Agilent: ~60-mers
Spotted cDNA microarray
Array Manufacturing
  Each tube contains cDNAs corresponding to a unique gene.
  The cDNAs are pre-amplified and spotted onto a glass slide.
Experiment
  [Figure labels: cy3, cy5]
Data acquisition
  Computer programs are used to process the image into digital signals.
  Segmentation: determine the boundary between signal and background
  Results: gene expression ratios between two samples
Affymetrix GeneChip
Array Design
  Multiple probes (11~16) for each gene
  [Figure from Affymetrix Inc.]
Experiment
  [Figure from Affymetrix Inc.]
  Each probe set combines to give an absolute expression level.
  Image segmentation is relatively easy, but the background still needs to be subtracted.
Affymetrix GeneChip: one-color design
cDNA microarray: two-color design
Preprocessing
  Image processing
    Analog to digital
  Background subtraction
    Account for non-specific hybridization
  Transformation
    Convenience, normal distribution assumptions
  Normalization
    Remove systematic biases
  Filtering, averaging, etc.
    Remove random noise
  The order may differ, and steps may be combined.
Background subtraction
  For cDNA arrays, relatively straightforward
  For oligo arrays, how to combine PM (perfect match) and MM (mismatch) probes?
    Does MM really measure non-specific hybridization?
    Recent studies suggest ignoring MM entirely or using it with caution
  Available software tools
    MAS 5 (by Affymetrix)
    dChIP
    GCRMA
Transformation
  Log transformation for two-color arrays
Transformation
  Log transformation for one-color arrays
  When you get a data set from someone, be careful with the scale
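As a small illustration of why the log scale is convenient for two-color data, the sketch below computes a base-2 log ratio. The function name log2_ratio is hypothetical and the choice of base 2 is an assumption taken from the log2(cy5 / cy3) notation used later in the lecture.

```python
import math

def log2_ratio(cy5, cy3):
    """Base-2 log ratio of two channel intensities (both must be positive)."""
    return math.log2(cy5 / cy3)

# A two-fold increase and a two-fold decrease become symmetric around zero.
up = log2_ratio(200, 100)
down = log2_ratio(100, 200)
print(up, down)
```

On the linear scale the same two genes would have ratios 2 and 0.5, which are not symmetric around 1; this is one reason to check which scale a data set uses before analyzing it.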
Normalization
  Purpose:
    Correct for systematic errors
    Make data from different samples comparable
Best approach to detect problems: visualization
An example data set
  J. DeRisi, V. Iyer, and P. Brown, Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale, Science, 278: 680-686, 1997
  Yeast cells grow in glucose medium
  When glucose is depleted, cells change their metabolic pathways
  cDNA microarray
    Test: 2, 4, 6, 8, 10, 12, 14 hours after growth
    Control: 0 hour
  No replicates!
  No normalization!
  Use fold change to get differentially expressed genes!
Histogram of log ratios
  Median = -0.27
  Two possibilities:
    Dye effect
    Sample difference
Total intensity normalization
  mean(cy3) = 3141
  mean(cy5) = 2838
  3141 / 2838 = 1.11
Other options:
  Use median
  Use a subset of genes
    Exclude 10% extreme
    House-keeping genes
    Spike-in genes
    Etc.
Net effect: constant factor for every gene
  Median = -0.1
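A minimal sketch of total intensity normalization, assuming the cy5 channel is rescaled so that its mean matches the cy3 mean and that ratios are reported as log2(cy5 / cy3). The function names are hypothetical; only the two channel means (3141 and 2838) come from the slides.

```python
import math

def total_intensity_factor(cy3, cy5):
    """Constant k such that mean(k * cy5) equals mean(cy3)."""
    return (sum(cy3) / len(cy3)) / (sum(cy5) / len(cy5))

def normalized_log_ratios(cy3, cy5):
    """log2(k * cy5 / cy3) for every gene, with one shared factor k."""
    k = total_intensity_factor(cy3, cy5)
    return [math.log2(k * r / g) for r, g in zip(cy5, cy3)]

# With the channel means quoted above, every log ratio shifts by log2(factor).
factor = 3141 / 2838
shift = math.log2(factor)
print(round(factor, 2), round(shift, 2))
```

Because the factor is the same for every gene, the whole histogram of log ratios moves by a constant (about 0.15 here) without changing shape, which is roughly the change in median from -0.27 to -0.1.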
Intensity-intensity plot
  Total intensity normalization worked well here
Intensity-intensity plot
  Did not work well for this experiment
  Dye-swapping can probably help?
M-A plot
  A: log2(cy5 * cy3) = log2(cy5) + log2(cy3)
  M: log2(cy5 / cy3) = log2(cy5) - log2(cy3)
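The two coordinates can be computed directly from the channel intensities. This sketch follows the definitions given here; note that many tools define A with an extra factor of 1/2 (the average log intensity), which only rescales the horizontal axis. The function name ma_values is hypothetical.

```python
import math

def ma_values(cy5, cy3):
    """Return (M, A) lists: M = log2(cy5) - log2(cy3), A = log2(cy5) + log2(cy3)."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(cy5, cy3)]
    a = [math.log2(r) + math.log2(g) for r, g in zip(cy5, cy3)]
    return m, a

# Two spots with the same total intensity but opposite regulation.
m, a = ma_values([8, 2], [2, 8])
print(m, a)
```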
M-A plot
  Dependency of M on A
Lowess normalization
  Lowess: Locally Weighted Regression
  Fit local polynomial functions
  M adjusted according to fitted line
  [Figure: M-A plots with axes A and M]
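The idea can be sketched in plain Python as a simplified lowess: for each spot, fit a weighted straight line to its nearest neighbours along A and subtract the fitted value from M. This is an illustrative sketch, not a full implementation: it assumes local linear fits with tricube weights, omits the robustness iterations of the standard algorithm, and the names lowess_fit, lowess_normalize and the frac parameter are hypothetical.

```python
def lowess_fit(a, m, frac=0.5):
    """Fitted M at each A from a locally weighted linear regression."""
    n = len(a)
    k = max(2, int(frac * n))                      # neighbours used per local fit
    fitted = []
    for x0 in a:
        nearest = sorted(range(n), key=lambda i: abs(a[i] - x0))[:k]
        h = abs(a[nearest[-1]] - x0) or 1.0        # bandwidth: farthest neighbour
        w = [(1 - (abs(a[i] - x0) / h) ** 3) ** 3 for i in nearest]  # tricube
        sw = sum(w)
        mean_a = sum(wi * a[i] for wi, i in zip(w, nearest)) / sw
        mean_m = sum(wi * m[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (a[i] - mean_a) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (a[i] - mean_a) * (m[i] - mean_m) for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (x0 - mean_a))
    return fitted

def lowess_normalize(a, m, frac=0.5):
    """Subtract the fitted trend so that M no longer depends on A."""
    return [mi - fi for mi, fi in zip(m, lowess_fit(a, m, frac))]

# A purely intensity-dependent bias (M = 0.5 * A + 1) is removed entirely.
a_vals = [float(i) for i in range(10)]
m_vals = [0.5 * x + 1 for x in a_vals]
residuals = lowess_normalize(a_vals, m_vals)
```

Unlike total intensity normalization, the correction here differs from spot to spot, so it can remove a dependency of M on A rather than only a constant offset.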
Replicate filtering
  Genes with very high variability in replicates are questionable
  [Figure axes: Log2(ratio1), Log2(ratio2), Ratio 1, Ratio 2]
Preprocessing questions
  What kind of array is it?
    Two-color? One-color?
    Oligo array? cDNA array?
  How is the experiment designed?
    Time series?
    Test vs control? (what kind of control?)
  What kind of preprocessing has been done?
    What values are given: raw intensities or ratios?
    Transformation? Log scale? Linear scale?
    Normalization: within-array? Cross-array?
  What are the next steps?
    Identifying differentially expressed genes?
    Clustering?
Identify differentially expressed genes
  Naive approach: fold change
    Log2(cy5 / cy3) > 1: up-regulated / induced
    Log2(cy5 / cy3) < -1: down-regulated / repressed
  Still widely used: very simple
  Main problem: genes with low expression levels may have a large fold change by chance
    From 10 to 100: ten fold
    From 1000 to 3000: three fold
    However: low intensity => relatively high variance
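The fold-change rule is easy to write down, which is part of its appeal. The sketch below applies the log2 cutoffs of +1 and -1 to single intensity pairs; the function name classify and the label strings are hypothetical.

```python
import math

def classify(cy5, cy3, threshold=1.0):
    """Label a gene by its log2 fold change against a symmetric cutoff."""
    m = math.log2(cy5 / cy3)
    if m > threshold:
        return "induced"
    if m < -threshold:
        return "repressed"
    return "unchanged"

# Both changes pass the cutoff, although the low-intensity one
# (10 -> 100) is far more likely to be measurement noise.
print(classify(100, 10), classify(3000, 1000))
```

The rule looks only at the ratio, so it has no way to express that a ten-fold change near the detection limit is less trustworthy than a three-fold change at high intensity.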
Problem with fold change
  The most differentially expressed genes are the ones with the lowest average expression levels
More robust estimation of differential expression
  Estimate variance as a function of average expression
  Compute a Z-score depending on location: $Z(x) = (x - \mu) / \sigma(x)$
    $x$: log2(R/G) value
    $\mu$: local mean
    $\sigma(x)$: local standard deviation
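One way to estimate a local mean and standard deviation is a sliding window over genes sorted by average intensity. The sketch below is an assumption about how "local" could be defined, not necessarily the method behind the slides; the function name local_z_scores and the window parameter are hypothetical.

```python
import statistics

def local_z_scores(a, m, window=50):
    """Z-score of each log ratio against genes with similar intensity.

    a: average-intensity value per gene, m: log2(R/G) per gene.
    Each gene is compared with up to `window` neighbours in intensity rank.
    """
    order = sorted(range(len(a)), key=lambda i: a[i])
    half = window // 2
    z = [0.0] * len(a)
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        local = [m[j] for j in order[lo:hi]]
        mu = statistics.mean(local)
        sd = statistics.stdev(local)       # needs at least two genes in the window
        z[i] = (m[i] - mu) / sd if sd > 0 else 0.0
    return z

z = local_z_scores([1, 2, 3, 4, 5], [0.0, 1.0, 0.0, -1.0, 0.0], window=5)
```

A gene in a noisy low-intensity region then needs a larger log ratio to reach the same Z-score as a gene in a quiet high-intensity region.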
SAM (Significance Analysis of Microarrays)
  Test: replicate 1, replicate 2, replicate 3
  Control: replicate 1, replicate 2, replicate 3
Which one is more significantly differentially expressed?
         T1     T2     T3     C1     C2     C3   Ratio
Gene1  1000   2000   1500    200    300    250       6
Gene2  1000   2000   3000   1000   1500    500       2
Gene3   100   1000    100     20     80     50       8
Gene4  1800   1700   1900   1000    800    900       2
SAM (Significance Analysis of Microarrays)
  Basic idea: Student's t-test
  Larger t => higher significance
  P-value can be directly computed for the t-test
  A permutation test can be used
         T1     T2     T3     C1     C2     C3       t
Gene1  1000   2000   1500    200    300    250     4.3
P1     1000    300    300    200   2000    200    -0.4
P2     2000   1000    200    300   1000    250    0.96
P3      200   1000   1000    300    250   1000    0.60
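The t-statistic and a permutation test can be sketched with the standard library. This is a plain two-sample t-statistic with unpooled variances, which reproduces the Gene1 value in the table; it is not the full SAM procedure, which modifies the statistic. The function names are hypothetical.

```python
import itertools
import math
import statistics

def t_statistic(test, control):
    """Two-sample t-statistic with unpooled variances."""
    se = math.sqrt(statistics.variance(test) / len(test)
                   + statistics.variance(control) / len(control))
    return (statistics.mean(test) - statistics.mean(control)) / se

def permutation_p_value(test, control):
    """Fraction of label assignments with |t| at least as large as observed."""
    observed = abs(t_statistic(test, control))
    values = list(test) + list(control)
    hits = total = 0
    for idx in itertools.combinations(range(len(values)), len(test)):
        group_t = [values[i] for i in idx]
        group_c = [values[i] for i in range(len(values)) if i not in idx]
        total += 1
        if abs(t_statistic(group_t, group_c)) >= observed - 1e-12:
            hits += 1
    return hits / total

gene1_t = t_statistic([1000, 2000, 1500], [200, 300, 250])
gene1_p = permutation_p_value([1000, 2000, 1500], [200, 300, 250])
print(round(gene1_t, 1), gene1_p)
```

With three replicates per group there are only 20 ways to assign the labels, and the observed split and its mirror image always count, so the smallest two-sided permutation p-value attainable for a single gene is 2/20 = 0.1. This is one reason permutation-based methods pool information across many genes.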
False Discovery Rate (FDR)
  Multiple testing problem
    P-value cutoff = 0.05
    We tested 10000 genes
    Would expect 500 genes by chance at this significance level
  Bonferroni correction
    Use p-value cutoff 0.05 / 10000
    Meaning among all genes selected, only 0.05 are expected to be false positives
    Too conservative
  False Discovery Rate (FDR)
    FDR = 0.1, meaning among all genes selected (say 100), we would expect 10 to be false positives
    Acceptable to biologists
    Several different approaches to estimate
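The arithmetic above, plus one common way of controlling the FDR, can be sketched as follows. The Benjamini-Hochberg step-up procedure is used here only as an example of the "several different approaches"; the slides do not say which one is intended, and the function names and p-values are hypothetical.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cutoff under the Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(p_values, fdr=0.1):
    """Indices of tests selected at the given FDR (Benjamini-Hochberg)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    last = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / n:
            last = rank                    # largest rank passing its threshold
    return sorted(order[:last])

expected_by_chance = 0.05 * 10000          # false positives with no correction
cutoff = bonferroni_cutoff(0.05, 10000)
selected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], fdr=0.05)
print(expected_by_chance, cutoff, selected)
```

Bonferroni applies one very strict cutoff to every gene, whereas the step-up rule relaxes the cutoff as more genes pass, which is why it usually selects more genes at the cost of a controlled fraction of false positives.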
Microarray data analysis
  Often perform multiple experiments under varying conditions
    Temperature
    Time series
    Different chemical treatment
    Different tissue
    Different mutant
Microarray data analysis
  For each gene we have a vector ej = (e1j, e2j, ..., edj)
What to do next?
Supervised vs unsupervised learning
  Supervised learning (classification)
    Associate genes with phenotypes
    E.g.: genes A, B and C induced => cancer; repressed => not cancer
    Goal: to learn such a function from data
    Classification algorithms: decision tree, SVM, neural networks, naive Bayes, etc.
  Unsupervised learning (clustering)
AML: acute myeloid leukemia
ALL: acute lymphoblastic leukemia
Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286: 531-537, 1999
Clustering microarray data
  Group genes into co-expressed sets
    Genes with similar expression patterns across multiple experiments may be co-regulated
  Group experiments into clusters
    Experiments within the same group may have similar gene expression signatures
    For example, disease sub-types that can be classified from gene expression data
Clustering microarray data
  How to tell if two expression vectors are similar?
    Define the (dis)-similarity measure between two vectors
  How to group multiple profiles into meaningful subsets?
    Describe the clustering procedure
  Are the results meaningful?
    Evaluate biological meaning of a clustering
(Dis)-similarity measures
  Euclidean distance
  Pearson correlation coefficient
  Cosine similarity
  Etc.
Clustering algorithms
  Hierarchical clustering
  K-means clustering
  Self Organizing Maps (SOMs)
  Spectral clustering
  Etc.
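For reference, the three measures named above can be written in a few lines each. This is a minimal sketch for equal-length expression vectors; the function names are hypothetical. Pearson correlation is computed as the cosine similarity of the mean-centered vectors, which is equivalent to the usual formula.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def pearson(x, y):
    """Pearson correlation: cosine similarity after centering each vector."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

# Same shape at different magnitude: far apart in Euclidean distance,
# but perfectly correlated.
g1, g2 = [1, 2, 3], [2, 4, 6]
print(euclidean(g1, g2), pearson(g1, g2))
```

The choice matters for clustering: Euclidean distance separates genes by magnitude, while correlation groups genes whose profiles rise and fall together regardless of scale.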
Hierarchical clustering
  Agglomerative or divisive (less popular)
  Agglomerative basic idea:
    Given n genes
    Initially every gene is in its own cluster
    For each iteration:
      Find the two most similar genes (or gene groups) and combine them into one cluster
      (how to define similarity between two groups?)
    Terminate when only one cluster is left
Hierarchical clustering
  Exact behavior depends on how to compute the distance between two clusters
  No need to specify number of clusters
  A distance cutoff is often chosen to break the tree into clusters
  [Figure labels: a, b, c, d, e, f]
Distance between clusters
  Single-linkage
    Not recommended
  Complete-linkage
  Average-linkage
  Centroid method
  http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
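A compact sketch of agglomerative clustering with single, complete and average linkage. It is a naive implementation meant to show the procedure, not an efficient one; the centroid method is left out, the function name agglomerative is hypothetical, and stopping at a requested number of clusters stands in for choosing a distance cutoff on the tree.

```python
def agglomerative(points, dist, linkage="average", n_clusters=1):
    """Merge the two closest clusters until n_clusters remain.

    Returns clusters as lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]   # every item starts alone

    def cluster_distance(c1, c2):
        d = [dist(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(d)                  # closest pair
        if linkage == "complete":
            return max(d)                  # farthest pair
        return sum(d) / len(d)             # average over all pairs

    while len(clusters) > n_clusters:
        pairs = [(cluster_distance(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

values = [0.0, 0.1, 5.0, 5.2, 9.0]
result = agglomerative(values, lambda x, y: abs(x - y), "average", n_clusters=2)
print(result)
```

Only cluster_distance changes between linkage rules; the merge loop is identical, which is why the exact behavior of hierarchical clustering depends on that one definition.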
K-means
  Basic idea:
    Given n genes
    Estimate number of clusters: k
    (Randomly) choose k genes as cluster centers
    Repeat until assignment is stable:
      Assign each gene to the closest center
      Re-compute center for each cluster
  Similarity to EM. Objective function: minimize total distance to cluster centers.
  http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
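The K-means loop can be sketched in a few lines. This version assumes squared Euclidean distance and a seeded random choice of initial centers; the function name kmeans and the seed and max_iter parameters are hypothetical.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return (assignment, centers) for a list of equal-length vectors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # k data points as initial centers
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:   # assignment is stable: stop
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, label in zip(points, assignment) if label == c]
            if members:                    # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

data = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = kmeans(data, k=2)
print(labels)
```

The result can depend on the initial centers, so in practice the algorithm is often run several times with different seeds and the run with the smallest total distance is kept.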
An example
  A synthetic data set
  [Figure axes: Genes, Experiments]
Hierarchical clustering
  Average linkage. Cluster genes only.
  Average linkage. Cluster both genes and experiments.
K-means
  K = 15
Another view of clusters
  [Figure axes: Experiments, Log ratio]
Evaluating clustering
  Do genes in the same cluster share similar functions?
    Functional enrichment analysis
  Do genes in the same cluster share similar cis-regulatory motifs?
    Motif finding
Gene Ontology (GO)
  Gene functions were often defined using free text
    Hard to extract, transfer, revise, predict, annotate, comprehend, manage
  The vocabulary should be pre-defined and commonly agreed upon
  Gene Ontology provides a controlled vocabulary to describe gene and gene product attributes
Gene ontology
  Two parts
    Ontology: list of vocabulary terms to use
    Annotations: characterizing genes using ontology terms
  Three ontology categories
    Biological process
    Molecular function
    Cellular component
Part of a GO graph
Each GO category is a directed acyclic graph
A term can have multiple parents, and multiple children.
A gene can be annotated by multiple terms.
If a gene is annotated by a child term, it is automatically annotated by all ancestor terms.
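The propagation rule can be sketched as a traversal of the term graph. The term and gene names below are made up for illustration, and the parent map is a hypothetical stand-in for a real ontology file.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:         # a term may be reached by several paths
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(direct_annotations, parents):
    """Extend each gene's direct terms with every ancestor term."""
    return {gene: set(terms).union(*(ancestors(t, parents) for t in terms))
            for gene, terms in direct_annotations.items()}

# Hypothetical mini-ontology: "child" has two parents that share one root.
parents = {"child": ["parent1", "parent2"], "parent1": ["root"], "parent2": ["root"]}
full = propagate({"geneX": ["child"]}, parents)
print(sorted(full["geneX"]))
```

Because a term can have multiple parents, the traversal has to track visited terms; a simple walk up a tree would count "root" twice here.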
Functional enrichment analysis
  Total number of genes: 6000
  Cluster A: 100 genes
  What functions do these 100 genes have?
  Do they share some functions significantly?
  [Figure legend: gene with function of interest, gene with other functions]
  Example:
    Among 6000 genes, 60 genes have function F.
    In cluster A, 55 out of 100 genes also have function F.
    Significance can be computed using the cumulative hypergeometric test.
Significance of enrichment
  M = 6000
  m = 100
  N = 60
  n = 55
  P-value = 8e-100
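The cumulative hypergeometric p-value can be computed exactly with integer binomial coefficients. The sketch below uses the same symbols as the example (M genes in total, N with function F, m genes in the cluster, n of them with F); the function name hypergeom_tail is hypothetical.

```python
from math import comb

def hypergeom_tail(M, N, m, n):
    """P(X >= n): probability that a random set of m genes, drawn from M genes
    of which N have the function, contains at least n genes with it."""
    upper = min(N, m)
    favourable = sum(comb(N, k) * comb(M - N, m - k) for k in range(n, upper + 1))
    return favourable / comb(M, m)         # exact integers, one final division

p = hypergeom_tail(6000, 60, 100, 55)
print(p)
```

Working with Python's arbitrary-precision integers until the final division avoids the underflow that would occur if each tiny term were computed as a floating-point probability.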
An application
  Tavazoie et al., Systematic determination of genetic network architecture, Nature Genetics, 22, 1999
  3000 yeast genes, 15 time points during cell cycle
  Use k-means clustering, k = 30
  Clusters correlate well with known function
  AlignACE motif finding
    600-long upstream regions
    Many motifs known
Cell-division cycle
  A cell duplicates its genome and divides into two identical cells
  Four phases
    G1 (preparation)
    S (DNA duplication)
    G2 (preparation)
    M (cell division)
Motifs in Clusters
Enriched functions in clusters
Overview of course
Main themes by biological subject
  Sequence analysis
    Alignment
    String matching
    Motif finding
    Gene prediction
    RNA secondary structure prediction
  Functional genomics
    Microarray
    Functional enrichment analysis
Main themes by algorithmic techniques
  Dynamic programming
    Alignment, HMM, RNA structure
  Probabilistic modeling
    HMM (regular grammar)
    RNA structure (context-free grammar)
    Motif finding
  Suffix trees
    Applications in bioinfo
  Clustering
Sequence alignment
  DP algorithm for global and local alignment
    Needleman-Wunsch
    Smith-Waterman
  Alignment with affine gap cost
  Linear space alignment
  Heuristic alignment
    Bounded alignment
    BLAST
  Alignment statistics
    Extreme value distribution
  Multiple sequence alignment
Probabilistic models
  HMMs
    HMMs for pair-wise sequence alignment
    HMMs for multiple alignment
    HMMs for gene prediction
  Stochastic context-free grammar for RNA structure prediction
  Viterbi and posterior decoding
  Expectation-maximization
  Gibbs Sampling
Goals
  Basis of sequence analysis and other computational biology algorithms
  Overall picture of the field
  Read / criticize research articles
  Think about the sub-field that best suits your background to explore
  Communicate and exchange ideas with (computational) biologists
Good luck with your final exams
See you on Dec 11
Please remember to turn in your homework and final project report by Dec 6