
Global expression analysis

Date post: 12-Jan-2016
Upload: melody
Transcript
Page 1: Global expression analysis

Global expression analysis

Monday 10/1: Intro
* 1-page Project Overview due
Intro to R lab

Wednesday 10/3: Stats & FDR - * read the paper!

Monday 10/8: Calling differentially expressed genes with baySeq - * read the paper!
baySeq lab for RNA-seq data

Wednesday 10/10: Clustering analysis

Monday 10/15: Clustering analysis
Clustering lab

Wednesday 10/17: Motif analysis

Monday 10/22: Motif analysis
Motif lab

Wednesday 10/24: ChIP/RIP/Nuc/Ect-Seq

Page 2: Global expression analysis

Global expression analysis

Goal: To measure transcript abundance of every gene in your organism at once …

AND make sense out of it

The power is in organizing genomic expression data to find meaningful patterns & groups of genes

Page 3: Global expression analysis

Gasch et al. 2000, 2001

Page 4: Global expression analysis

What kinds of information can we extract from genomic expression data?

1. Hypothetical functions for uncharacterized genes
-- genes encoding subunits of multi-subunit protein complexes are often highly coregulated
example: ribosomal protein genes, proteasome genes in yeast
-- genes involved in the same cellular processes are often coregulated

2. New roles for characterized genes

3. Better understanding of the experimental conditions
-- based on expression patterns of characterized genes

4. Implications of gene regulation
-- WT vs. mutants can identify transcription factor targets
-- promoter analysis of coregulated genes = upstream elements
-- gene coregulation with known pathway targets can implicate pathway activity

5. Understanding developmental pathways

6. Defining samples based on expression profiles
example: comparing tumor samples from patients

Page 5: Global expression analysis

Technologies for Quantifying & Identifying Nucleic Acids

DNA microarrays

1. Collect RNA
2. Generate fluorescently-labeled cDNA
3. Hybridize to array
4. Detect fluorescence emission with scanning laser

Data: Continuous measurements of relative fluorescence

Deep sequencing

1. Collect RNA
2. Make strand-specific cDNA library
3. Deep sequence short reads
4. Relate sequences back to genome / transcriptome location (or de novo assembly)

Data: Number of sequencing reads per each base in the genome = Discrete ‘Counts’

Page 6: Global expression analysis

[Figure labels: ORF; mRNA; array probes]

Tiled-genome arrays cover the entire genome

Page 7: Global expression analysis

Tiled genomic arrays (Nimblegen, Affymetrix, Agilent)

Tiled sequences across each gene / locus

To get relative differences in expression across two samples:
1. Need to normalize array signals across arrays
2. Need to compress measurements to a single score for each gene/transcript

Page 8: Global expression analysis

Tiled genomic arrays (Nimblegen, Affymetrix, Agilent)

PM = ‘perfect match’ oligo
MM = ‘mismatch’ oligo (central nucleotide is mutated)

‘Robust Multiarray Analysis’ (RMA; Irizarry et al. 2003)

1. On Affy: Throw out elements where MM signal > PM signal … but otherwise ignore MM

2. Local background subtraction from each probe intensity

3. Quantile normalization of arrays to be compared … sets the distribution of probe intensities to be the same

4. Convert intensity values to log2 scale

5. Use a linear model to fit a given probe set and compute one expression value per gene
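The quantile normalization and log2 steps of the RMA outline can be sketched in plain Python. This is only an illustration of the idea, not the RMA implementation: the function name `quantile_normalize` and the toy intensities are assumptions, tied values get no special treatment, and background subtraction and the probe-set model fit are left out.

```python
import math

def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equal-length lists, one list of probe intensities per array.
    Returns new lists in the original probe order.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Mean intensity at each rank (smallest, 2nd smallest, ...) across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity on this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized

# Toy probe intensities for two arrays (hypothetical values).
array1 = [500.0, 200.0, 300.0]
array2 = [400.0, 100.0, 600.0]

norm1, norm2 = quantile_normalize([array1, array2])

# Convert the normalized intensities to log2 scale.
log2_norm1 = [math.log2(v) for v in norm1]
log2_norm2 = [math.log2(v) for v in norm2]
```

After normalization both arrays contain exactly the same set of values; only the assignment of values to probes differs, which is what "sets the distribution of probe intensities to be the same" means in practice.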

Page 9: Global expression analysis

Deep sequencing for gene expression analysis

[Figure: two library protocols starting from mRNA]
Old protocol: mRNA -> make ds cDNA -> Sequence
New protocols: mRNA -> 1st strand cDNA (2nd strand with dUTP) -> Sequence

Number of sequencing reads per region ~= number of starting transcripts

Page 10: Global expression analysis

Number of sequencing reads per region ~= number of starting transcripts

BUT … have to account for the length of the gene/transcript

* But sometimes one lane of sequencing works better than others:
Simple normalization: Avg counts within gene length / Total counts in that lane
RPKM: Reads Per Kb per Million mapped reads

[Figure labels: counts per base pair; total reads in lane: 40 x 10^6 and 32 x 10^6]
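A small worked example may make the RPKM definition concrete. The lane totals (40 x 10^6 and 32 x 10^6) come from the slide; the function name `rpkm`, the gene length, and the read counts are assumptions chosen for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kb of transcript per Million mapped reads."""
    reads_per_kb = read_count / (gene_length_bp / 1_000)
    return reads_per_kb / (total_mapped_reads / 1_000_000)

# Hypothetical 2 kb gene sequenced in two lanes of different depth.
lane_a = rpkm(read_count=400, gene_length_bp=2_000, total_mapped_reads=40_000_000)
lane_b = rpkm(read_count=320, gene_length_bp=2_000, total_mapped_reads=32_000_000)

# Same gene, same lane, but twice as long with twice the reads:
# dividing by length removes the apparent difference.
longer_gene = rpkm(read_count=800, gene_length_bp=4_000, total_mapped_reads=40_000_000)
```

The raw counts differ between the two lanes (400 vs. 320), but once each is divided by its lane total the gene gets the same RPKM in both, which is the point of the normalization.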

Page 11: Global expression analysis

Another challenge: mapping reads to the genome/transcriptome

[Figure labels: DNA; intron; spliced transcript]

Should you restrict yourself to ORF annotations?

Can map reads to genome or transcriptome sequence, or assemble de novo.

Page 12: Global expression analysis

Comparing samples via fold-changes: the ratio of RPKM values across samples reflects differential expression

Usually work in log2 space
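The reason for working in log2 space can be shown with a short sketch. The function name `log2_fold_change` and the RPKM values are assumptions for illustration; genes with zero RPKM in either sample would need separate handling, which is not shown.

```python
import math

def log2_fold_change(rpkm_sample2, rpkm_sample1):
    """log2 of the expression ratio sample2 / sample1 (both must be > 0)."""
    return math.log2(rpkm_sample2 / rpkm_sample1)

# Hypothetical RPKM values for one gene in two samples.
induced = log2_fold_change(20.0, 5.0)    # 4-fold increase
repressed = log2_fold_change(5.0, 20.0)  # 4-fold decrease
unchanged = log2_fold_change(5.0, 5.0)
```

On the raw ratio scale a 4-fold increase is 4 and a 4-fold decrease is 0.25; on the log2 scale they become +2 and -2, so increases and decreases of the same magnitude are symmetric around 0.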

Page 13: Global expression analysis

Now each sample = list of normalized relative transcript values

ID              Array 1: Log ratio   Array 2: Log Ratio (635/532)
YPL187W         6.36                 -0.072
YGR043C         1.82                 -0.228
YGL089C         6.439
YCR040W         1.012                0.694
YCR039C         1.147                -0.487
YCL001W         1.934                -0.536
YJR004C         2.76                 0.026
YLL005C         2.395                -0.008
YGL101W         2.22                 0
YLR040C         2.073                -0.659
upgrade plate   1.863                -0.408
EMPTY           1.755                -0.008
upgrade plate   1.573                0.109
EMPTY           1.529                -0.866
YBL051C         1.419                -0.054
YLR349W         1.382                -0.457
YCL066W         1.338
YLR227W-A       1.335                -0.419
upgrade plate   1.314                -0.401
YDL186W         1.246                0.959
YDR536W         1.183                -0.58
upgrade plate   1.165                0.543
YHR124W         1.163                -0.465
EMPTY           1.127                -0.715
YAL065C         1.091                -1.133
YBR012W-A       1.078                0.676
YCL026C-A       1.046                -0.468
YJL078C         1.045                -0.889
YHR161C         1.033                -0.033
YBR244W         1.028
YGR237C         1                    -0.754
YGL189C         0.997                -0.11
YCL009C         0.989                0.014
YKL185W         0.968
YDR285W         0.95                 -0.435
YMR057C         0.949                0.672
Q0250           0.942                -0.219
YOR235W         0.924                1.166
YDR415C         0.922                -0.334
YER072W         0.906                -0.509
EMPTY           0.892                -1.174
EMPTY           0.89                 -0.818
YDL013W         0.877
YLR206W         0.874
YML047C         0.874                -0.819
YDR306C         0.858
YDR528W         0.823                0.276
YGL088W         0.8
YBL097W         0.787
YBR013C         0.782                -0.896
YIR019C         0.779
YDR361C         0.772                -1.017
YLR267W         0.769                -0.457
YAL008W         0.746                1.465
YGL128C         0.741                0.027
YDR530C         0.739                2.083

Page 14: Global expression analysis

Assessing replicates: how well do the data agree overall?
-- linear regression

[Figure: Example of good replicates. Scatter plot of Array 1 values (x-axis) vs. Array 2 values (y-axis) for DES460 + 0.2% MMS - 45 min, with linear fit y = 0.978x + 0.0095, R^2 = 0.8332]

[Figure: Example of bad replicates. Scatter plot of Array 1 values (x-axis) vs. Array 2 values (y-axis), with linear fit y = 0.1104x - 0.0358, R^2 = 0.0205]

Where does the noise come from?
-- can be biological variation
-- can be array artifacts
… should define both types of variation …
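The replicate comparison (fit a line to Array 2 vs. Array 1 values and report R^2) can be sketched with ordinary least squares. The function name `fit_line` and the toy log2 ratios are assumptions; the slide's own data points are not available in the transcript.

```python
def fit_line(x, y):
    """Least-squares fit y = slope * x + intercept; also returns R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical log2 ratios for the same genes on two replicate arrays.
array1_values = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
array2_values = [-1.8, -1.1, 0.2, 0.9, 2.1, 2.9]

slope, intercept, r_squared = fit_line(array1_values, array2_values)
```

Good replicates give a slope near 1, an intercept near 0, and R^2 close to 1, as in the first plot; a slope near 0 with R^2 near 0, as in the second plot, means the two arrays barely agree.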

Page 15: Global expression analysis

Now you have your data, in the form of relative log2 expression differences

Now what?

Page 16: Global expression analysis

Select differentially expressed genes to focus on

Methods of gene selection:

-- arbitrary fold-expression-change cutoff
example: genes that change >3X in expression between samples

-- statistically significant change in expression
requires replicates

[Figure labels: Expression difference; Gene X expression under condition 1; Gene X expression under condition 2]
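The fold-change cutoff is the simpler of the two selection methods and can be sketched in a few lines. The gene IDs and values are taken from the Array 1 column of the earlier table and are assumed here to be log2 ratios; the function name `select_by_fold_change` is hypothetical.

```python
import math

def select_by_fold_change(log2_ratios, fold_cutoff=3.0):
    """Return gene IDs whose expression changes by more than fold_cutoff.

    log2_ratios: dict mapping gene ID -> log2 expression ratio.
    Taking the absolute value keeps both increases and decreases.
    """
    threshold = math.log2(fold_cutoff)  # >3X corresponds to |log2 ratio| > ~1.585
    return [gene for gene, ratio in log2_ratios.items() if abs(ratio) > threshold]

array1_log2_ratios = {
    "YPL187W": 6.36,
    "YGR043C": 1.82,
    "YCR040W": 1.012,
    "YDR530C": 0.739,
}

selected = select_by_fold_change(array1_log2_ratios)
```

The cutoff is arbitrary and uses a single measurement per gene, which is why the statistical approach, which needs replicates, is listed as the alternative.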

Page 18: Global expression analysis

Select differentially expressed genes to focus on

Methods of gene selection:

-- arbitrary fold-expression-change cutoff
example: genes that change >3X in expression between samples

-- statistically significant change in expression
requires replicates

Use statistics to compare the mean & variation of 2 (or more) populations

[Figure label: Expression difference]

Page 19: Global expression analysis

Test if the means of 2 (or more) groups are the same or statistically different

The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject the null hypothesis

Choosing the right test:

-- parametric test if your data are normally distributed with equal variance

-- nonparametric test if either of the above is not true

Why do the data need to be normally distributed?

Page 20: Global expression analysis

Test if the means of 2 groups are the same or statistically different

The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject the null hypothesis

If your two samples are normally distributed with equal variance, use the t-test

$T = \dfrac{\bar{X}_1 - \bar{X}_2}{\mathrm{SED}}$

where $\bar{X}_1 - \bar{X}_2$ is the difference in the means and SED is the standard error of the difference in the means

If T > Tc, where Tc is the critical value for the degrees of freedom & confidence level, then reject H0

Notice that if the data aren’t normally distributed, the mean and standard deviation are not meaningful.
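The T statistic can be computed by hand for a small example. This sketch assumes the equal-variance (pooled) form of the two-sample t-test; the function name `pooled_t_statistic` and the replicate values are hypothetical, and the critical value is looked up from a t table rather than computed.

```python
import math

def pooled_t_statistic(group1, group2):
    """Two-sample t statistic assuming equal variance.

    Returns (T, degrees_of_freedom), where T = (mean1 - mean2) / SED.
    """
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    ss1 = sum((x - mean1) ** 2 for x in group1)
    ss2 = sum((x - mean2) ** 2 for x in group2)
    df = n1 + n2 - 2
    pooled_variance = (ss1 + ss2) / df
    # Standard error of the difference in the means.
    sed = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / sed, df

# Hypothetical log2 expression of Gene X, three replicates per condition.
condition1 = [4.0, 5.0, 6.0]
condition2 = [1.0, 2.0, 3.0]

t_value, df = pooled_t_statistic(condition1, condition2)

# Two-sided critical value for df = 4 at the 95% confidence level (from a t table).
t_critical = 2.776
reject_h0 = abs(t_value) > t_critical
```

The absolute value is used so that a difference in either direction can lead to rejecting H0; with only three replicates per group the critical value is large, so the means have to differ by several standard errors.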

Page 21: Global expression analysis

Differential expression on DNA microarrays: Bioconductor package Limma (ref)

** See previous years’ limma lab for a walk-through example

1. Load your data
2. Provide a ‘target’ file that says which samples are on which arrays
3. Provide a ‘design’ file (and in some cases a ‘contrast matrix’) to specify which samples you want to compare
4. Limma will look at the entire dataset and model the error on the data, to try to overcome measurement error
5. Limma then does a modified t-test to identify genes with significant expression differences across the samples you specified.
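The idea behind a "modified" t-test can be sketched as variance shrinkage: each gene's variance is pulled toward a dataset-wide value before the t statistic is computed, so that a gene with an accidentally tiny variance from a few replicates does not get an inflated score. The code below is a conceptual illustration in Python, not limma's API or its full procedure. The function names are hypothetical, and `prior_var` and `prior_df` are supplied by hand here, whereas limma estimates them from all genes.

```python
import math

def shrunken_variance(gene_var, gene_df, prior_var, prior_df):
    """Weighted average of the gene's own variance and a dataset-wide prior."""
    return (prior_df * prior_var + gene_df * gene_var) / (prior_df + gene_df)

def moderated_t(mean_diff, gene_var, gene_df, prior_var, prior_df, n1, n2):
    """t statistic computed with the shrunken variance instead of gene_var."""
    s2 = shrunken_variance(gene_var, gene_df, prior_var, prior_df)
    return mean_diff / math.sqrt(s2 * (1 / n1 + 1 / n2))

# Hypothetical gene: 2 replicates per condition, near-zero observed variance.
ordinary_t = 1.0 / math.sqrt(0.001 * (1 / 2 + 1 / 2))
moderated = moderated_t(
    mean_diff=1.0, gene_var=0.001, gene_df=2,
    prior_var=0.25, prior_df=4, n1=2, n2=2,
)
```

With these toy numbers the ordinary t statistic is very large only because the gene's variance estimate happens to be tiny; the moderated version is much smaller because the variance has been pulled toward the prior value.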

