The CCBR RNA-Seq Pipeline
Fathi Elloumi, Ph.D.
NCI CCBR
3/20/2017
Agenda
• Introduction
  • Data analysis workflow
  • Review main steps
• CCBR RNA-Seq pipeline
  • Workflow overview
  • Quality control reports
  • Principal component analysis (PCA) and differential expression reports
  • Downstream analysis after running the pipeline
• Running the CCBR pipeline
  • Use case and demo
RNA-Seq Applications
• Differential gene expression
  • Differential transcript expression
  • Still confined to known transcripts/isoforms
• Transcript discovery / whole-transcriptome profiling
  • Interest is in looking for new isoforms or unannotated genes
• Others
  • SNP / somatic variant / gene fusion detection
RNA-Seq project overview
[Figure: project flow with the stages Hypothesis, Experimental design, Prepare samples, RNA sequencing, and QC and data analysis; samples split into Group 1 and Group 2]
- RNA extraction protocol
- Depth
- Library type (SE/PE)
- Number of replicates
- …
Best practices
• Factor in at least 3 replicates (absolute minimum), but 4 if possible (optimum minimum). Biological replicates are recommended rather than technical replicates.
• Always process your RNA extractions at the same time. Extractions done at different times lead to unwanted batch effects.
• There are 2 major considerations for RNA-Seq libraries:
  • If you are interested in coding mRNA, you can select the mRNA library prep. The recommended sequencing depth is between 10-20M paired-end (PE) reads. Your RNA has to be high quality (RIN > 8).
  • If you are interested in long noncoding RNA as well, you can select the total RNA method, with a sequencing depth of ~25-60M PE reads. This is also an option if your RNA is degraded.
• Ideally, to avoid lane batch effects, all samples should be multiplexed together and run on the same lane. This may require an initial MiSeq run for library balancing. Additional lanes can be run if more sequencing depth is needed.
• If you are unable to process all your RNA samples together and need to process them in batches, make sure that replicates for each condition are in each batch so that the batch effects can be measured and removed bioinformatically.
https://bioinformatics.cancer.gov/content/rna-seq
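The last best-practice point (every batch should contain replicates of every condition) can be checked before samples are sent for sequencing. The sketch below is only an illustration: the function name, the tuple layout, and the sample names are hypothetical and not part of the CCBR pipeline.

```python
from collections import defaultdict

def batches_missing_conditions(samples):
    """samples: list of (sample_name, condition, batch) tuples.

    Returns {batch: sorted list of conditions absent from that batch}.
    An empty dict means every batch contains every condition.
    """
    all_conditions = {cond for _, cond, _ in samples}
    seen = defaultdict(set)
    for _, cond, batch in samples:
        seen[batch].add(cond)
    return {
        batch: sorted(all_conditions - conds)
        for batch, conds in seen.items()
        if all_conditions - conds
    }

# Hypothetical layout: batch2 has no control sample, so its batch
# effect could not be separated from the treatment effect.
design = [
    ("s1", "treat", "batch1"), ("s2", "control", "batch1"),
    ("s3", "treat", "batch2"), ("s4", "treat", "batch2"),
]
print(batches_missing_conditions(design))
```

A non-empty result flags batches where condition and batch are confounded.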
Typical RNA-Seq analysis workflow
[Figure: workflow diagram. Sequencing facility → raw reads (fastq files) → QC of raw data → trimming → trimmed reads (fastq files) → alignment → BAM files → QC of aligned data → expression quantification → gene and transcript counts → differential expression analysis → clustering & visualization. The QC steps output QC metrics and plots, with a "Good QC?" decision point.]
Quality control (QC) of raw data
• Detect issues related to sample collection, library preparation, or sequencing
• Need to check
  • Base quality score
  • Sequence quality
  • Sequence duplication level
  • GC content level
  • Presence of contaminants
    • Bacteria or virus
  • Adaptor presence
Alignment & quantification
HTSeq, Subread
Post-alignment QC
• % mapped and uniquely mapped reads: 70-90%
• Uniformity of read coverage over gene body
• Read distribution
• Check for read strandedness
• Biotype composition (check for rRNA)
Differential expression analysis
• What are the genes or transcripts that are differentially expressed between two or more groups?
• Do a statistical test:
  • t-test
  • Empirical Bayes (moderated t-test)
  • ANOVA (>2 groups)
  • …
• Adjust for multiple testing (FDR, …)
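To make the multiple-testing step concrete, here is a minimal sketch of the Benjamini-Hochberg FDR adjustment, one common choice for the "FDR" bullet above. The slides do not say which FDR procedure the pipeline tools apply, so treat this as an illustration of the idea rather than the pipeline's implementation.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# One p-value per gene (made-up numbers for illustration).
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Genes whose adjusted p-value falls below the chosen FDR threshold (for example 0.05) are then reported as differentially expressed.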
Known differential expression detection methods
Comparison of software packages for detecting differential expression in RNA-seq studies. Briefings in Bioinformatics, vol. 16, no. 1, 59-70
Normalization using scaling methods: overall gene expression is the same across all samples
Method: Description
Total count (TC): Gene counts are divided by the total number of mapped reads (or library size) associated with their sample and multiplied by the mean total count across all the samples of the dataset.
Upper quartile (UQ): Very similar in principle to TC; the total counts are replaced by the upper quartile of counts different from 0 in the computation of the normalization factors.
Median (Med): Also similar to TC; the total counts are replaced by the median of counts different from 0 in the computation of the normalization factors.
DESeq: A scaling factor for a given sample is the median of the ratio, for each gene, of its read count over its geometric mean across all samples.
Trimmed Mean of M-values (TMM): A scaling factor is computed as the weighted mean of log ratios between the sample and the reference, after exclusion of the most expressed genes and the genes with the largest log ratios.
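Two of the definitions above, total count (TC) and the DESeq scaling factor, translate directly into a few lines of code. The sketch below follows the table's wording; the function names and the dictionary layout are hypothetical, and real analyses should use the DESeq2 or edgeR packages rather than this illustration.

```python
import math
from statistics import median

def total_count_normalize(counts):
    """counts: {sample: [count per gene]}.

    Divide each count by its sample's library size, then multiply by the
    mean library size across samples (the TC method).
    """
    totals = {s: sum(c) for s, c in counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {s: [x / totals[s] * mean_total for x in c] for s, c in counts.items()}

def deseq_size_factors(counts):
    """Median-of-ratios scaling factor per sample.

    Genes with a zero count in any sample are skipped because their
    geometric mean is zero. At least one gene must be non-zero everywhere.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - lg
            for g, lg in enumerate(log_geo_means)
            if lg is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors
```

Dividing each sample's counts by its size factor puts the samples on a common scale; a sample sequenced twice as deeply gets a factor twice as large.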
Principal Component Analysis
• Method for dimension reduction to identify patterns (thousands of genes = thousands of dimensions)
• The eigenvector with the largest eigenvalue (the largest share of the total variance) is the first principal component. The eigenvector with the second-largest eigenvalue gives the direction of the second-largest variance.
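The eigenvector statement above can be illustrated with a small standard-library sketch that finds the first principal component by power iteration on the sample covariance matrix. This is an assumption-laden teaching example (hypothetical function name, tiny dense matrices); the pipeline's PCA report is produced by other tools.

```python
import math

def first_principal_component(rows, iterations=200):
    """rows: samples x features (at least 2 samples).

    Returns (unit eigenvector, eigenvalue) for the largest eigenvalue of
    the sample covariance matrix, found by power iteration. Assumes the
    starting vector is not orthogonal to that eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance at all
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v = variance captured along v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, eigenvalue
```

Projecting each centered sample onto the returned vector gives its PC1 coordinate, which is what a PCA plot shows on the first axis.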
Hierarchical Clustering
Dendrogram/tree
• Branching diagram representing a hierarchy of categories based on degree of similarity
• Can be drawn for genes and/or samples
[Figure: dendrogram with root, branches, and leaves labeled]
Algorithms for clustering:
• Bottom-up: agglomerative
Heatmap
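The bottom-up (agglomerative) approach named above starts with every sample in its own cluster and repeatedly merges the two closest clusters. The sketch below assumes Euclidean distance and average linkage, which the slides do not specify; names and data layout are hypothetical.

```python
import math

def agglomerative(points, n_clusters):
    """Bottom-up clustering with average linkage and Euclidean distance.

    points: {name: [values]}. Returns a list of sets of names.
    """
    clusters = [{name} for name in points]

    def linkage(a, b):
        # Average pairwise distance between members of the two clusters.
        total = sum(math.dist(points[x], points[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]
    return clusters
```

Recording the order and distance of each merge is what yields the dendrogram drawn beside a heatmap.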
RNA-Seq pipeline workflow
STEP 1: INITIAL QC
STEP 2: COUNTING & DEG
RNA-Seq: Initial QC workflow
- Trimmomatic: just adaptor clipping
- STAR 2-pass mode: for most sensitive novel junction discovery
Use case: 4 samples from the SEQC study
• Mixture of biological sources and a set of synthetic RNAs from the External RNA Controls Consortium (ERCC)
  • 2 samples from group A: Stratagene Universal Human Reference RNA (UHRR), from 10 human cell lines
  • 2 samples from group B: Ambion Human Brain Reference RNA (HBRR)
  • Illumina HiSeq 2000, 100 bp
Base quality (Q score)
Q = -10 log10(P), where P is the base-calling error probability
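The Q score formula above is easy to apply in both directions. The short sketch below also decodes a FASTQ quality string, assuming the Phred+33 ASCII encoding (Sanger / Illumina 1.8+); the function names are made up for illustration.

```python
import math

def phred_from_error(p):
    """Q = -10 * log10(P) for an error probability P in (0, 1]."""
    return -10 * math.log10(p)

def error_from_phred(q):
    """Inverse of the formula above: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

print(phred_from_error(0.001))   # error rate of 1 in 1000
print(error_from_phred(20))      # Q20
print(decode_qualities("I5!"))
```

So Q20 corresponds to a 1% chance that the base call is wrong, and Q30 to 0.1%.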
Sample QC report
Base quality distribution
Warning if the lower quartile for any base is less than 10, or if the median for any base is less than 25.
Failure if the lower quartile for any base is less than 5, or if the median for any base is less than 20.
Common reasons for warnings
- General degradation of quality over the duration of long runs
- Loss of quality earlier in the run (bubbles in flow cell)
- Reads of different length
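The warning and failure thresholds for the base quality distribution can be written as a small classifier. This is a sketch under assumptions: the function name is hypothetical, and the quartile is computed with Python's `statistics.quantiles`, whose interpolation may differ slightly from the QC tool's own calculation.

```python
from statistics import median, quantiles

def per_base_quality_status(quality_columns):
    """quality_columns: one list of Phred scores per read position
    (each list needs at least two scores).

    Returns "fail", "warn", or "pass" using the thresholds stated above.
    """
    status = "pass"
    for scores in quality_columns:
        lower_quartile = quantiles(scores, n=4)[0]
        med = median(scores)
        if lower_quartile < 5 or med < 20:
            return "fail"
        if lower_quartile < 10 or med < 25:
            status = "warn"
    return status

# Three read positions, the last one with degraded quality.
print(per_base_quality_status([[36, 38, 38, 40], [30, 32, 34, 36], [22, 24, 24, 26]]))
```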
Tile issues (bubble, smudge, or debris in lane)
Flowcell tile heatmap showing deviation from the average quality for each tile
Failure if any tile shows a mean Phred score more than 5 less than the mean for that base across all tiles
A good plot should be all blue!
Check proportion of sequences with low quality values
Failure if the most frequently observed mean quality is below 20
For bi-modal or complex distributions, check against the per-tile qualities
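The failure rule above (most frequently observed mean quality below 20) can be sketched as follows. The rounding of per-read means to integers and the Phred+33 encoding are assumptions made for this illustration, as is the function name.

```python
from collections import Counter

def per_sequence_quality_fails(quality_strings, offset=33, threshold=20):
    """Return True if the most common per-read mean quality is below threshold.

    quality_strings: non-empty FASTQ quality strings (Phred+33 assumed).
    Per-read means are rounded to whole numbers before counting.
    """
    means = [
        round(sum(ord(c) - offset for c in q) / len(q))
        for q in quality_strings
    ]
    most_common_mean, _ = Counter(means).most_common(1)[0]
    return most_common_mean < threshold
```

A histogram of the same `means` list is the plot where a bi-modal shape would show up.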
Per base sequence content should be uniform
Biased fragmentation
RNA-Seq libraries produce biased sequence composition at the start of the read (10-12 bp); this does not affect downstream analysis
GC content should be a normal distribution
Contaminant issue (adapter dimers = pairs of ligated adapters with no insert sequence). Need to check overrepresented sequences
No-call (N) distribution
Biased sequence composition
Expected / check with base quality
All sequences should have the same length
High duplication level should be carefully assessed
- Technical duplicates (PCR overamplification)
- Biological duplicates
  - Small RNA library
  - Over-sequencing of highly expressed transcripts to observe low-expressed ones
Check for adapter sequence
If insert sizes are shorter than the read length, the adapter sequence needs to be removed
Check for contamination in over-represented sequences:
error if any sequence is found to represent more than 1% of the total
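The 1% rule above amounts to counting identical reads and reporting any that exceed the threshold. A minimal sketch, with a hypothetical function name and exact-match counting on whole reads (QC tools typically work on a subset of reads and may truncate long ones):

```python
from collections import Counter

def overrepresented(sequences, threshold=0.01):
    """Return {sequence: fraction} for sequences above the threshold.

    sequences: list of read sequences; threshold: fraction of all reads.
    """
    counts = Counter(sequences)
    total = len(sequences)
    return {
        seq: n / total
        for seq, n in counts.items()
        if n / total > threshold
    }
```

Any sequence returned here is a candidate to compare against known adapter and contaminant sequences.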
FastqScreen: look for bacteria/virus contamination
MultiQC report
MultiQC: multiple-samples report
MultiQC report: mapping stats (number of mapped reads; mapping rate 70-90%)
MultiQC report: Picard duplication rate by paired reads
MultiQC report: Picard
MultiQC report: RNA quality check
Degraded RNA showing 3' bias in coverage
MultiQC report: RSeQC
MultiQC report: exon coverage
MultiQC report: count check
Checking unassigned rate for overlapping regions and multi-mapping reads
RNA-Seq: differential expression workflow
RNA-Seq: PCA report
RNA-Seq: edgeR DEG report (limma and DESeq2 also available)
RNA-Seq: edgeR DEG report
EdgeR_deg_HBRR_vs_UHRR.txt
What is the method to use?
DEG Venn diagram
No clear answer!
Compare results:
- PCA
- Sample clustering
- DEG results
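Comparing DEG results across methods, as in the Venn diagram above, comes down to set operations on the gene lists. The sketch below groups genes by the exact combination of methods that called them; the function name and the gene identifiers in the usage example are hypothetical.

```python
from collections import defaultdict

def venn_regions(deg_sets):
    """deg_sets: {method_name: set of gene ids called significant}.

    Returns {frozenset of method names: set of genes found by exactly
    those methods}, i.e. the regions of a Venn diagram.
    """
    membership = defaultdict(set)
    for method, genes in deg_sets.items():
        for gene in genes:
            membership[gene].add(method)
    regions = defaultdict(set)
    for gene, methods in membership.items():
        regions[frozenset(methods)].add(gene)
    return dict(regions)

# Made-up gene ids for illustration only.
regions = venn_regions({
    "edgeR": {"geneA", "geneB", "geneC"},
    "DESeq2": {"geneA", "geneB"},
    "limma": {"geneA", "geneD"},
})
print(regions[frozenset({"edgeR", "DESeq2", "limma"})])
```

Genes in the all-methods region are the most robust calls; genes reported by a single method deserve a closer look.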
Visualization and enrichment analysis
• Cluster the samples based on the top-ranked genes (sd, mad, IQR, …)
• Pathway enrichment (GSEA, IPA, …)
• Easy use of DEG files
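Ranking genes by a variability score before clustering, as the first bullet above suggests, can be sketched with the standard library. The function names and the dictionary layout are assumptions; the input would normally be normalized expression values such as those in a CPM table.

```python
from statistics import median, stdev

def mad(values):
    """Median absolute deviation from the median."""
    values = list(values)
    m = median(values)
    return median(abs(v - m) for v in values)

def top_variable_genes(expression, n, score=stdev):
    """expression: {gene: [value per sample]} with at least 2 samples.

    Returns the n gene names with the highest variability score.
    """
    ranked = sorted(expression, key=lambda g: score(expression[g]), reverse=True)
    return ranked[:n]
```

Passing `score=mad` switches the ranking from standard deviation to the more outlier-robust MAD; the selected genes then feed the clustering or heatmap.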
Dealing with batch effect
• Incorporate the batch effect as a covariate in the model
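Including batch as a covariate means adding batch indicator columns to the model's design matrix next to the group columns. The sketch below builds such a matrix with treatment coding (first level alphabetically as the reference); it is a hypothetical illustration of the idea, not how the pipeline's DEG tools are configured.

```python
def design_matrix(samples):
    """samples: list of (name, group, batch) tuples.

    Returns (column names, rows): an intercept, one 0/1 indicator per
    non-reference group, and one 0/1 indicator per non-reference batch.
    """
    groups = sorted({g for _, g, _ in samples})
    batches = sorted({b for _, _, b in samples})
    columns = (
        ["intercept"]
        + [f"group_{g}" for g in groups[1:]]
        + [f"batch_{b}" for b in batches[1:]]
    )
    rows = [
        [1]
        + [int(g == level) for level in groups[1:]]
        + [int(b == level) for level in batches[1:]]
        for _, g, b in samples
    ]
    return columns, rows
```

With batch columns in the model, the group coefficient is estimated after accounting for batch, which only works when each batch contains samples from each group.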
Viewing RNA-Seq data
• Integrative Genomics Viewer (IGV)
  • Read alignments
  • Splice junctions
CCBR Pipeliner
• Offers, for now, 3 NGS data workflows: RNASeq, ExomeSeq, and GenomeSeq.
• Each workflow:
  ✓ is version-aware
  ✓ is modular and extensible
    • Multiple options/programs can be selected for a task.
  ✓ is reproducible
    • uses a config file
  ✓ maintains an audit trail (as a log file)
  ✓ runs on the NIH cluster and uses the queue system
  ✓ informs the user, via email, once the run is complete
Data preparation / input
• Pipeliner takes in raw paired-end NGS data: fastq.gz files
• Fastq naming convention:
  • <samplename>.R1.fastq.gz
  • <samplename>.R2.fastq.gz
• Pipeliner can convert file names to the desired naming convention
  • labels.txt: two-column text file
  • SampleA_R1_001.fastq.gz TumR1_Batch1.R1.fastq.gz
• For DEG, you need to know the phenotype/group for the samples and the contrasts for differential analysis
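The labels.txt mapping described above (old file name in the first column, new name in the second) can be parsed and checked before a run. This is a sketch that assumes whitespace-separated columns; Pipeliner does the actual renaming itself, and the function names here are hypothetical.

```python
def parse_labels(text):
    """Parse labels.txt content into {original_name: new_name}.

    Assumes two whitespace-separated columns per line; other lines are skipped.
    """
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            mapping[fields[0]] = fields[1]
    return mapping

def follows_convention(filename):
    """True if the name matches <samplename>.R1.fastq.gz or <samplename>.R2.fastq.gz."""
    return filename.endswith((".R1.fastq.gz", ".R2.fastq.gz"))

labels = parse_labels("SampleA_R1_001.fastq.gz\tTumR1_Batch1.R1.fastq.gz\n")
for old, new in labels.items():
    print(old, "->", new, "(ok)" if follows_convention(new) else "(check name)")
```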
“groups.tab” file
SampleName   group     Sample label
sample1      treat     treat1
sample2      treat     treat2
sample3      treat     treat3
sample4      control   ctrl1
sample5      control   ctrl2
sample6      control   ctrl3
…            …         …
Mandatory fields (without labels)
Only one factor (you can simulate a multifactor variable)
“contrasts.tab” file
Group1   vs. Group2
treat    control
…        …
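Because every contrast must name groups that exist in groups.tab, a quick consistency check can catch typos before the DEG step. The sketch assumes whitespace-separated columns and files without a header row; the function names are hypothetical and not part of Pipeliner.

```python
def parse_groups(text):
    """Return {sample_name: (group, label)} from groups.tab content."""
    groups = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            sample, group, label = fields
            groups[sample] = (group, label)
    return groups

def parse_contrasts(text):
    """Return a list of (group1, group2) pairs from contrasts.tab content."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return pairs

def unknown_contrast_groups(groups, contrasts):
    """Group names used in contrasts that no sample belongs to."""
    known = {group for group, _ in groups.values()}
    return sorted({g for pair in contrasts for g in pair if g not in known})

groups = parse_groups("sample1\ttreat\ttreat1\nsample4\tcontrol\tctrl1\n")
contrasts = parse_contrasts("treat\tcontrol\n")
print(unknown_contrast_groups(groups, contrasts))
```

An empty list means every contrast refers to a defined group.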
CCBR RNASeq Pipeline (Initial QC)
Working directory: /data/<user>/…
Data directory: /scratch/elloumif/SEQC4/
CCBR RNASeq Pipeline (DEG Analysis)
Working directory: /data/<user>/…
Data directory: /scratch/elloumif/SEQC4/
RNA-Seq output: main directories
• rawQC: FastQC results on raw data
• Trim: trimmed data (adaptor cut)
• QC: FastQC results on trimmed data
• FQscreen: FastqScreen results (trimmed data)
• Reports: contains the MultiQC report and the main log file of the pipeline (snakemake.log)
• DEG_genes: DEG results based on gene counts + HTML reports
• DEG_genejunctions: DEG results based on junction gene counts + HTML reports
DEG directory output files
• Limma* files (txt, png, html)
• Deseq2* files
• edgeR* files
RNA-Seq output: main files (main working directory)
• BAM files (*.bam)
• Raw count data (3 methods):
  • Gene: RawCountFile_gene.txt and RawCountFile_genes_filtered.txt
  • Gene normalized data: CPM_TMM_counts.txt
  • RSEM results:
    • <sample>.rsem.genes.results
    • <sample>.rsem.isoforms.results
  • EBSeq results:
    • <sample>isoform..EBSeq
    • <sample>.isoform.EBSeq.normalized_data_matrix
    • <sample>.isoform.EBSeq.counts.matrix
• Run.json: configuration file (run settings)
Configuration file
Setup before running ccbrpipeliner
• Helix and Biowulf accounts
• X11 client (Windows: PuTTY, NoMachine; Mac: XQuartz, NoMachine)
• Space:
  • Biowulf home directories have a default allocation of 100 GB: not enough to run NGS pipelines.
  • Best option: have a lab-wide /data/labname storage allocation, with higher storage
• Basic knowledge of Unix commands (ssh, mkdir, vi)
CCBR Pipeliner availability
✓ https://github.com/CCBR/Pipeliner
✓ via the module “ccbrpipeliner” at Biowulf
CCBR Pipeliner documentation
https://github.com/CCBR/Pipeliner/blob/master/PipelinerVer1.0_documentation.pdf
Demo
Input files
• Fastq files
• Labels.txt
• Groups.tab
• Contrasts.tab
Output files
• FastQC report
• MultiQC report
• PCA report
• edgeR report
• Raw count files
• Normalized data files
Q&A