RNA-Seq plant data analysis
K. Gorczak, K. Klamecka, A. Szabelska, J. Zyprych-Walczak, I. Siatkowski
Department of Mathematical and Statistical Methods, Poznan University of Life Sciences
July 03, 2014
Outline
1 What is RNA-Seq?
2 Steps of RNA-Seq experiment
3 Methods for differential analysis
4 Normalization
5 Differential expression analysis
6 Graphical presentation of the results
7 Conclusions
What is RNA-Seq?
RNA-Seq is a high-throughput sequencing technology that sequences cDNA in order to obtain information about the RNA content of a sample. Its applications include:
Analysis of gene expression
Detection of alternative splicing events
Gene fusion transcripts
Cancer research
Disease diagnosis
Cellular processes in plants or animals
What is RNA-Seq?
Platforms:
Applied Biosystems’ SOLiD: based on sequencing by ligation
Illumina’s Genome Analyzer: based on sequencing by synthesis
Roche’s 454 Life Sciences: based on pyrosequencing
Ion Torrent: based on ion semiconductor sequencing
Pacific Biosciences: based on single-molecule sequencing
What is RNA-Seq?
Figure: Next-generation sequencing platforms.
Steps of RNA-Seq experiment
An RNA-Seq experiment takes a sample of purified RNA, shears it, converts it to cDNA and sequences it, yielding millions of short reads (Oshlack et al. 2010). The experiment then proceeds through a low-level analysis (such as base calling, read mapping, alignment), a high-level analysis (such as normalization, expression quantification, differential expression) and, finally, biological insight.
Steps of RNA-Seq experiment
Figure: An RNA-Seq experiment design (Oshlack et al. 2010).
Methods for differential analysis
Based on transcriptome: Cuffdiff, RSEM and EBSeq, BitSeq
Based on genome: edgeR, DESeq, SAMseq, NOISeq, EBSeq, baySeq
Methods for differential analysis
We assume that the read counts $K_{ij}$ follow a Negative Binomial (NB) distribution:
$$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi),$$
where
$K_{ij}$ is the observed count for gene $i = 1, \ldots, G$ and sample $j = 1, \ldots, m$
$\mu_{ij}$ is the mean
$\phi$ is the dispersion
the mean and variance are related by $\sigma_{ij}^2 = \mu_{ij} + \mu_{ij}^2 \phi$
$\mu_{ij} = \lambda_{ij} m_j$, where $m_j$ is the library size for sample $j$ and $\lambda_{ij}$ is the level of gene expression
The null hypothesis is tested for each gene:
$$H_0 : \lambda_{iA} = \lambda_{iB},$$
where $\lambda_{iA}$ and $\lambda_{iB}$ are the mean expression levels of gene $i$ in sample A and sample B, respectively.
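As a rough illustration of the model above, the following sketch evaluates the mean and variance relations for a single gene in a single sample. The function names and the numeric values are hypothetical and serve only to show how the dispersion inflates the variance relative to the Poisson case.

```python
def nb_mean(expression_level, library_size):
    """Mean count: mu_ij = lambda_ij * m_j."""
    return expression_level * library_size


def nb_variance(mu, dispersion):
    """Variance: sigma_ij^2 = mu_ij + mu_ij^2 * phi."""
    return mu + mu ** 2 * dispersion


# Hypothetical values: lambda = 2e-5, library size = 5 million reads, phi = 0.1
mu = nb_mean(2e-5, 5_000_000)
var = nb_variance(mu, 0.1)

# With phi = 0 the variance equals the mean (the Poisson case).
poisson_var = nb_variance(mu, 0.0)

print(mu, var, poisson_var)
```

With these assumed numbers the variance is roughly eleven times the mean, which is the kind of overdispersion the NB model is meant to capture.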
Normalization
Normalization is an essential step in the analysis of differentially expressed genes. It makes expression levels comparable between samples by accounting for technical effects introduced by sequencing. Several normalization methods are used for count-based differential analysis (Dillies et al. 2012):
Reads per Kilobase per Million reads (RPKM)
Total Count (TC)
Trimmed mean of M-values (TMM)
Median
Quantile
Upper Quartile
Relative log expression (RLE)
Normalization
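name: method
$d^{\mathrm{RPKM}}$: $d_j^{\mathrm{RPKM}} = \dfrac{10^9 K_{ij}}{\left(\sum_{i=1}^{G} K_{ij}\right) L_i}$
$d^{\mathrm{TC}}$: $d_j^{\mathrm{TC}} = \dfrac{\sum_{i=1}^{G} K_{ij}}{\sum_{i=1}^{G} K_{ir}}$
$d^{\mathrm{TMM}}$: $\log_2(d_j^{\mathrm{TMM}}) = \dfrac{\sum_{i=1}^{G'} w_{ij} M_{ij}}{\sum_{i=1}^{G'} w_{ij}}$, with $M_{ij} = \dfrac{\log_2(K_{ij} / d_j^{\mathrm{TC}})}{\log_2(K_{ir} / d_r^{\mathrm{TC}})}$ and $w_{ij} = \dfrac{d_j^{\mathrm{TC}} - K_{ij}}{d_j^{\mathrm{TC}} K_{ij}} + \dfrac{d_r^{\mathrm{TC}} - K_{ir}}{d_r^{\mathrm{TC}} K_{ir}}$
$d^{\mathrm{med}}$: $d_j^{\mathrm{med}} = \operatorname{median}_i \dfrac{K_{ij}}{\left(\prod_{v=1}^{m} K_{iv}\right)^{1/m}}$
$d^{\mathrm{Q}}$: $d_j^{\mathrm{Q}} = 10^{\,\log_{10} Q_j - \frac{1}{m} \sum_{v=1}^{m} \log_{10} Q_v}$
$d^{\mathrm{UQ}}$: $d_j^{\mathrm{UQ}} = \dfrac{\sum_{i=1}^{G} K_{ij}}{UQ_j}$
$d^{\mathrm{sam}}$: $d_j^{\mathrm{sam}} = \dfrac{\sum_{i \in G''} K_{ij}}{\sum_{i \in G''} \sum_{j=1}^{m} K_{ij}}$
$G''$ is the set of genes whose $GOF_i \in (0.25, 0.75)$,
where $GOF_i = \sum_{j=1}^{m} \dfrac{\left(K_{ij} - d_j^{\mathrm{TC}} K_{i\cdot}\right)^2}{d_j^{\mathrm{TC}} K_{i\cdot}}$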
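To make two of the simpler factors concrete, here is a minimal sketch of the total count factor $d_j^{\mathrm{TC}}$ and the median factor $d_j^{\mathrm{med}}$ on a toy count table. It is plain Python, not the DESeq or edgeR implementation: the function names and the toy counts are hypothetical, and skipping genes that have a zero count in any sample is an assumption made here so that the geometric mean stays non-zero.

```python
import math
from statistics import median


def total_count_factors(counts, ref=0):
    """d_j^TC: total count of sample j divided by that of reference sample `ref`."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [size / lib_sizes[ref] for size in lib_sizes]


def median_ratio_factors(counts):
    """d_j^med: median over genes of K_ij / geometric mean of gene i across samples.

    Assumption of this sketch: genes with a zero count in any sample are skipped.
    """
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) == 0:
            continue
        geo_mean = math.exp(sum(math.log(k) for k in row) / n_samples)
        for j, k in enumerate(row):
            ratios[j].append(k / geo_mean)
    return [median(r) for r in ratios]


# Hypothetical toy table: 4 genes (rows) x 3 samples (columns).
# Sample 2 was built with twice the counts of sample 0.
counts = [
    [10, 12, 20],
    [100, 90, 200],
    [0, 3, 0],
    [50, 55, 100],
]

tc = total_count_factors(counts)
med = median_ratio_factors(counts)
print(tc)
print(med)
```

Both factors should recover the twofold depth difference between sample 2 and sample 0 in this toy table, although the median factor ignores the gene with zero counts.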
Differential expression analysis
Computations
All computations were performed in the R environment (Ihaka and Gentleman 1996). Four R packages were used to find normalization factors: DESeq (Anders and Huber 2010), edgeR (Robinson et al. 2010), EBSeq (Leng et al. 2013) and SAMseq (Li and Tibshirani 2011). They are freely available from the Bioconductor repository (www.bioconductor.org). The edgeR package was used to find differentially expressed genes.
Data
The data are presented as a rectangular table of integer values, in which rows correspond to genes and columns to samples. Each cell gives the number of reads mapped to a given gene in a given sample. The dataset used here comes from the NBPSeq package: an RNA-Seq dataset from a pilot study of the defense response of Arabidopsis to infection by bacteria.
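The table layout described above can be sketched as follows. The gene names, sample names and counts are invented for illustration and are not taken from the NBPSeq dataset; the point is only to show genes as rows, samples as columns, and library sizes as column sums.

```python
# Hypothetical sample names, two per treatment group.
samples = ["A1", "A2", "B1", "B2"]

# Hypothetical count table: one row per gene, one column per sample.
counts = {
    "gene1": [35, 77, 40, 46],
    "gene2": [0, 0, 1, 0],
    "gene3": [120, 150, 300, 280],
}

# One cell: the number of reads mapped to gene3 in sample B1.
cell = counts["gene3"][samples.index("B1")]

# Library size of each sample is the sum of its column.
library_sizes = {
    name: sum(row[j] for row in counts.values())
    for j, name in enumerate(samples)
}

print(cell)
print(library_sizes)
```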
Differential expression analysis
Illumina platform
26222 genes
6 samples
2 treatment groups
Figure: Arabidopsis (source: http://www.abcam.com/index.html?pageconfig=resourcerid=11682pid=5).
Graphical presentation of the results
Figure: Venn diagram for differentially expressed genes.
Graphical presentation of the results
Figure: MA-plots of differentially expressed genes.
Graphical presentation of the results
Figure: Boxplot of baseMean and log2FC (a); overlap of differentially expressed genes (b).
Conclusions
RNA-Seq provides new knowledge of the range of gene expression levels
Normalization of count data... sequencing biases:
within-sample bias
between-sample bias
R packages... useful in assessing the results of the RNA-Seq experiment
Gene expression... for understanding the impact of the genes on certain diseases and cellular processes
Graphical presentation of the results facilitates the evaluationof the results
References
Oshlack A., Robinson M.D., Young M.D. (2010). From RNA-seq reads to differential expression results. Genome Biology 11: 220.
Dillies M., Rau A., Aubert J., Hennequet-Antier C., Jeanmougin M., Servant N., Keime C., Marot G., Castel D., Estelle J., Guernec G., Jagla B., Jouneau L., Laloe D., Le Gall C., Schaeffer B., Le Crom S., Guedj M., Jaffrezic F. (2012). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics. doi: 10.1093/bib/bbs046.
Ihaka R., Gentleman R. (1996). R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 5(3): 299-314. doi: 10.1080/10618600.1996.10474713.
Anders S., Huber W. (2010). Differential expression analysis for sequence count data. Genome Biology 11: R106.
Robinson M., McCarthy D., Smyth G.K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139-140. doi: 10.1093/bioinformatics/btp616.
Leng N., Dawson J., Thomson J.A., Ruotti V., Rissman A.I., Smits B.M.G., Haag J.D., Gould M.N., Stewart R.M., Kendziorski Ch. (2013). EBSeq: An empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. doi: 10.1093/bioinformatics/btt087.
Li J., Tibshirani R. (2011). Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Statistical Methods in Medical Research 22(5): 519-36. doi: 10.1177/0962280211428386.
Di Y., Schafer D.W., Cumbie J.S., Chang J.H. (2011). The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq. Statistical Applications in Genetics and Molecular Biology 10(1).
Bullard J.H., Purdom E., Hansen K.D., Dudoit S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11: 94.
Thank you for your attention!