Page 1: Mapping RNA sequence data Part 1: RNA-Rocket RNAseq pipeline · reference genome choose Plasmodium falciparum 3D7. There are a number of options that may be modified, however, for

Mapping RNA sequence data Part 1: RNA-Rocket RNAseq pipeline

The goal of this exercise is to retrieve an RNA-seq dataset in FASTQ format and run it through an RNA-sequence analysis pipeline. We will be using Pathogen Portal’s RNA-Rocket, which includes a workflow for mapping RNA-Seq reads to a reference genome, using this mapping to assemble transcripts, mapping transcripts to existing annotations, and determining expression levels. The mapping workflow uses two algorithms: TopHat for aligning reads and Cufflinks for transcript prediction and calculating expression levels. The input required is FASTQ files, and the outputs are read alignments (BAM files) and tab-delimited assembly and expression files for known genes, isoforms and novel transcripts.

1. Create an account on RNA Rocket

a. Go to http://rnaseq.pathogenportal.org/
b. Click on Create an Account and fill in the required information.

Click here to create an account or log in to your existing account

2. Upload the RNA sequencing reads to your RNA Rocket launch pad. RNA Rocket allows you to directly retrieve FASTQ files of the sequencing reads using SRA accession numbers.

a. Background: This exercise will rely on data deposited in the Sequence Read Archive (SRA).

The data is based on transcriptomic analysis of three developmental stages of Plasmodium falciparum:

1. Salivary gland sporozoites
2. Cultured sporozoites
3. Cultured asexual stages

Each developmental stage was assayed by RNA sequencing (2 replicates per sample). The study accession number for this data on SRA is SRP033414, and additional information about this experiment may be obtained from GEO: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52867

Examining the information available in GEO and under the SRA accession numbers, you will notice that this data is paired end. So for each sample there should be two files, one for each of the pairs. More information for each sequencing run can be found at:

Salivary gland sporozoites sample 1: http://www.ncbi.nlm.nih.gov/sra/SRX385640
Salivary gland sporozoites sample 2: http://www.ncbi.nlm.nih.gov/sra/SRX385641
Cultured sporozoites sample 1: http://www.ncbi.nlm.nih.gov/sra/SRX385642
Cultured sporozoites sample 2: http://www.ncbi.nlm.nih.gov/sra/SRX385643
Asexual stage parasites sample 1: http://www.ncbi.nlm.nih.gov/sra/SRX385644
Asexual stage parasites sample 2: http://www.ncbi.nlm.nih.gov/sra/SRX385645

The required input file for RNA Rocket’s analysis pipeline is a FASTQ file, a text file (similar to FASTA) that includes sequence quality information and details in addition to the sequence (i.e. name, quality scores, sequencing machine ID, lane number etc.). FASTQ files are large and as a result not all sequencing repositories will store this format. However, tools are available to convert, for example, NCBI’s SRA format to FASTQ. Sequence data is housed in three repositories that are synchronized on a regular basis:

▪ The Sequence Read Archive at GenBank
▪ The European Nucleotide Archive at EMBL
▪ The DNA Data Bank of Japan
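To make the FASTQ layout described above more concrete, here is a minimal Python sketch that reads records from a FASTQ stream and decodes the quality string. It assumes the common four-lines-per-read layout and Phred+33 quality encoding; neither is stated in this handout, and the example read is made up rather than taken from SRP033414. The names `FastqRecord`, `read_fastq` and `phred_scores` are illustrative only.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str       # read identifier (text after the leading "@")
    sequence: str   # base calls
    quality: str    # one quality character per base


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (a blank header line is also treated as the end)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality]


# Made-up example read, not real data from this study.
example = io.StringIO("@read1/1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
scores = phred_scores(records[0].quality)
```

In this example the four "I" characters decode to a score of 40 and the "#" under the ambiguous base N decodes to 2, which is the kind of per-base detail that FASTA files do not carry.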

b. Upload data into your Launchpad. Note: During this exercise you will NOT download any data to your computer. Instead you will be providing information to enable transferring data from ENA/SRA to RNA-Rocket.

i. Click on the “LaunchPad” link in the Galaxy menu bar. Then select “From ENA/SRA”.

ii. On the next page, notice the instructions to use the global search on the ENA site. Click on continue.

iii. Cut and paste the study accession number (SRP033414) into the search box (see red circle below). Click on the search icon.

iii. Depending on RNA-Rocket’s configuration you may be taken to the EBI search results page, where you will need to click on the Study link ID in order to get to the study page. If your page looks like the second screenshot, please proceed to iv.

iv. Click on the link for File 1 in the column called “Fastq files (galaxy)” for the sample assigned to your group, then click on the back button on your browser and click on the link for File 2 from the same sample. This will begin the file transfer to RNA-Rocket. You may need to scroll down to see the Read Files tab, which contains the Fastq files (galaxy) column that you need. You will need to get 2 files, one for each file generated by the paired end sequencing.
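Since each sample must arrive as two files, a quick sanity check is to confirm that every run has both mates. The Python sketch below groups file names by run and reports the (File 1, File 2) pair. It assumes files are named `<run>_1.fastq` and `<run>_2.fastq`, optionally with a `.gz` suffix; that pattern is an assumption for illustration, and `RUN_A` is a placeholder rather than a real run accession from this study.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed naming pattern: <run>_<1 or 2>.fastq, optionally gzipped.
MATE_PATTERN = re.compile(r"^(?P<run>.+)_(?P<mate>[12])\.fastq(?:\.gz)?$")


def pair_mates(filenames: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Group file names by run and return {run: (mate_1_file, mate_2_file)}."""
    mates: Dict[str, Dict[str, str]] = {}
    for name in filenames:
        match = MATE_PATTERN.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name!r}")
        mates.setdefault(match["run"], {})[match["mate"]] = name

    pairs: Dict[str, Tuple[str, str]] = {}
    for run, found in mates.items():
        # A paired-end run needs exactly one _1 file and one _2 file.
        if set(found) != {"1", "2"}:
            raise ValueError(f"run {run!r} is missing a mate file")
        pairs[run] = (found["1"], found["2"])
    return pairs


# "RUN_A" is a placeholder name, not a real accession.
pairs = pair_mates(["RUN_A_2.fastq.gz", "RUN_A_1.fastq.gz"])
```

The function raises an error if only one of the two files is present, which mirrors the most likely mistake in this step: forgetting to go back and click the link for File 2.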

You should now see a window that looks similar to this:

To view the progress of your upload, click on “Project View” (red square in image above).

You can inspect the contents of completed tasks (like uploaded files) by clicking on the eye icon next to the name of the file (arrow in above image). Inspecting a FASTQ file should look like this:

c. Configure and initiate the RNA sequence analysis pipeline.

i. Background: Pathogen Portal uses two algorithms: TopHat for mapping, and Cufflinks for transcript prediction and expression value calculation. Note that there are many algorithms and methods for RNA-seq mapping and analysis, each with its advantages and disadvantages. You are encouraged to learn more about the algorithm you are using.

o TopHat: http://tophat.cbcb.umd.edu/
o Cufflinks: http://cufflinks.cbcb.umd.edu/index.html

ii. Navigate to the workflow. Click on the “LaunchPad” link in the upper menu bar. On the next page, scroll down to the “RNA-Seq Analysis” section and click on “Map Reads & Assemble Transcripts”.

iii. Select Analysis Type. On the next page, scroll down and choose Eukaryotic Paired-End Analysis under Select Analysis Type. We are analyzing a paired end eukaryotic sample.

iv. Select the target project from the drop down menu. You should only have one or two projects, one of which will contain both FASTQ files you uploaded (probably called “Uploaded Files”). Once you select the correct project you should see the two FASTQ files contained within it. Next click on continue.

v. Configure the pipeline. The pipeline consists of 7 steps.

Step 1: Input dataset – Select the upstream read file (ends in _1) and click on the arrow to move it to the “Selected” window.

Step 2: Input dataset – Select the downstream read file (ends in _2) and click on the arrow to move it to the “Selected” window.

Step 3: TopHat2 – Under Select a reference genome choose Plasmodium falciparum 3D7. There are a number of options that may be modified, however, for the purposes of this exercise the default parameters may be used.

Step 4: Cufflinks – Set the Maximum Intron Length (-I): 5000. The reference annotation should be automatically selected: Plasmodium falciparum 3D7. Select how to use the provided annotation: Assemble Novel + annotated transcripts.

Once again there are a number of options to modify but we only need to change the maximum Intron Length.

Step 5: BAM to BigWig – No change needed
Step 6: BAM to BigWig – No change needed
Step 7: Create a BedGraph of genome coverage – No change needed

Click on the Run Workflow button.

After you start the workflow you should get a confirmation window listing all the steps that have been added to the queue. The progress of your workflow can be viewed to the right. Completed tasks are in green, running tasks are in yellow and tasks waiting in the queue are in grey. The workflow will run overnight and we will view the results and calculate differential expression in a subsequent exercise.
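For readers who want to relate Steps 3 and 4 to the command-line tools, the Python sketch below builds roughly equivalent TopHat2 and Cufflinks command lines without running them. This is an illustration under stated assumptions, not RNA-Rocket's actual invocation, which this handout does not show. All file names, the index prefix and the output directories are placeholders, and the helper functions are hypothetical. The only settings taken from the exercise are the default TopHat2 parameters, the maximum intron length of 5000 (`-I`), and annotation-guided assembly that still reports novel transcripts (approximated here with Cufflinks' `-g` option).

```python
import shlex
from typing import List


def tophat2_command(index_prefix: str, reads_1: str, reads_2: str,
                    out_dir: str) -> List[str]:
    """Paired-end alignment with default parameters (Step 3)."""
    # Usage: tophat2 [options] <bowtie_index> <reads_1> <reads_2>
    return ["tophat2", "-o", out_dir, index_prefix, reads_1, reads_2]


def cufflinks_command(bam: str, annotation_gtf: str, out_dir: str,
                      max_intron_length: int = 5000) -> List[str]:
    """Annotation-guided transcript assembly (Step 4)."""
    return [
        "cufflinks",
        "-I", str(max_intron_length),  # maximum intron length
        "-g", annotation_gtf,          # guide assembly with the annotation, keep novel transcripts
        "-o", out_dir,
        bam,
    ]


# Every path below is a placeholder chosen for illustration.
align = tophat2_command("Pf3D7_index", "sample_1.fastq", "sample_2.fastq",
                        "tophat_out")
assemble = cufflinks_command("tophat_out/accepted_hits.bam", "Pf3D7.gtf",
                             "cufflinks_out")

print(shlex.join(align))
print(shlex.join(assemble))
```

Building the commands as lists and only printing them keeps the sketch safe to run on a machine without either tool installed. TopHat writes its alignments to `accepted_hits.bam` inside the output directory, which is why that file is passed on to Cufflinks.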

