
labibi Documentation
Release 1.0

C. Titus Brown

November 14, 2015

Contents

1 Welcome
   1.1 1. Learning goals
   1.2 2. Safe space and code of conduct
   1.3 3. Instructor introductions
   1.4 4. Amazon and cloud computing - why?
   1.5 5. Sticky notes and how they work + Minute Cards
   1.6 6. Warning

2 Non-model organisms and RNAseq
   2.1 The overall process

3 Getting started with Amazon EC2
   3.1 More details

4 Short read quality and trimming
   4.1 Prepping the computer
   4.2 Data source
   4.3 1. Copying in some data to work with
   4.4 1. Copying data into a working location
   4.5 2. FastQC
   4.6 3. Trimmomatic
   4.7 4. FastQC again
   4.8 5. Trim the rest of the sequences
   4.9 6. Interleave the sequences

5 Running digital normalization

6 Running the actual assembly

7 BLASTing your assembled data

8 Annotation of de novo transcriptome
   8.1 Identify the Gene/Transcript relationships
   8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
   8.3 Capturing BLAST Homologies
   8.4 Characterization of functional annotation features
   8.5 Integration of all annotations into one database
   8.6 Output an Annotation Report

9 Quantification and Differential Expression
   9.1 Download Express
   9.2 Align Reads with Bowtie
   9.3 Quantify Expression using eXpress
   9.4 Differential Expression

10 Remapping your reads to your assembled transcriptome

11 Miscellaneous advice
   11.1 Sequencing depth and number of samples
   11.2 Downloading your data
   11.3 Developing your own pipeline

12 More resources
   12.1 Informational resources
   12.2 Places to share data, scripts, and results files

13 Miscellaneous questions

14 Tips and Tricks for working with Remote Computers
   14.1 Use screen to run things that take a long time
   14.2 Use CyberDuck to transfer files
   14.3 Subsetting data
   14.4 Running full analyses on Amazon Web Services

15 Technical information

This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camille Scott, and Tamer Mansour. See the workshop organization page for more information, or contact Titus directly.

Tutorials:

CHAPTER 1

Welcome

1.1 1. Learning goals

For you:

• get a first (or second) look at tools

• gain some experience in the basic command line

• get 80% of the way to a complete analysis of some data

For us:

• what are the on-campus needs? who are the on-campus people?

1.2 2. Safe space and code of conduct

This is intended to be a safe and friendly place for learning.

Please see the Software Carpentry workshop Code of Conduct: http://software-carpentry.org/conduct.html

In particular, please ask questions, because I guarantee you that your question will help others.

1.3 3. Instructor introductions

Titus Brown - prof here at UC Davis in the School of Vet Med.

Tamer Mansour - MD/PhD, postdoc with Titus.

Camille Scott - CS graduate student here at Davis.

1.4 4. Amazon and cloud computing - why?

• simplifies software installation

• can be used for bigger analyses quite easily

• good for "burst" capacity (just got a data set)

• accessible everywhere

• they give us $100 gift certificates

1.5 5. Sticky notes and how they work + Minute Cards

Basic rules:

• no sticky note - "working on it"

• green sticky note - "all is well"

• red sticky note - "need help"

Place the sticky notes where we can see them from the back of the room – e.g. on the back of your laptop.

At the end of each session (coffee break, lunch, end of day), please write down on an index card one thing you learned and one thing you're still confused about.

1.6 6. Warning

The bit that isn't going to work is at the beginning. This is unavoidable. To reward you, we have a coffee break at 10:30am.

---

Next: Non-model organisms and RNAseq

CHAPTER 2

Non-model organisms and RNAseq

With non-model systems, where there is neither a good genome nor a lot of mRNAseq data, you have to build your own transcriptome from scratch – so-called "de novo transcriptome assembly." There are a few programs to do this - most notably Trinity and Oases - and we have found little difference.

The main problem you'll run into with non-model mRNAseq is that the output is fairly noisy with respect to splice variants. Our experience has been that many of these splice variants are probably "real" – in the sense of actually being present – but may be biological "noise", in the sense that they are not actually functional. (See this excellent paper by Pickrell and Pritchard making the case.) Regardless, there's little that you can do about this, although we will talk about it a bit on the second day.

2.1 The overall process

[Figure: the non-model RNAseq pipeline (_static/nonmodel-rnaseq-pipeline.png)]

• Copy over your RNAseq data (from two or more samples)

• Trim primers and junk from sequence (Short read quality and trimming)

• Do abundance normalization (Running digital normalization)

• Assemble everything together (Running the actual assembly)

This gives you an assembled transcriptome, consisting of many transcripts and transcript families.

At this point you can do one or more of the following:

• Annotate your transcripts (Annotation of de novo transcriptome)

• Quantify your transcripts and examine differential expression (Quantification and Differential Expression)

• BLAST your transcripts individually (BLASTing your assembled data)

Next: Getting started with Amazon EC2

CHAPTER 3

Getting started with Amazon EC2

Short version:

• go to http://aws.amazon.com/, log in, then "EC2" (upper left)

• select "Launch instance"

• select "Ubuntu 14.04" from the list

• select "m3.xlarge" from the list (towards bottom of "General purpose")

• click "Review and launch"

• select "Launch"

• if your first time through, create a key pair; otherwise select existing

• click "launch instance"

3.1 More details

3.1.1 Start up an EC2 instance

Log in

Go to 'https://aws.amazon.com' in a Web browser.

Select the 'My Account/Console' menu option 'AWS Management Console'.

Log in with your username & password.

Click on EC2 (upper left).

Select your zone

Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.

Then click on Launch Instance.

Select the machine operating system to boot

Find the "Ubuntu Server 14.04" image in the first list to show up.

Choose the machine size

Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.

Confirm and launch

Review the details (ignore the warnings) and click on Launch.

(First time through) generate a new key pair

If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.

(Next times through) select an existing key pair

Select a key pair and click 'Launch'.

Click on View Instances

Select the public DNS name for later use

3.1.2 Logging into your new instance "in the cloud" (Windows version)

First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Generate a ppk file from your pem file

(You only need to do this once for each key.)

Open puttygen, and select "Load".

Find and load your '.pem' file; it's probably in your Downloads folder. Note, you have to select 'All files' on the bottom.

Load it.

Now, "save private key". Put it somewhere easy to find.

Now that you've generated your PPK file from your PEM file, you can log in. To do that:

Logging into your EC2 instance with Putty

Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".

Declare victory

3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications... Utilities...) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".

Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note, you have to replace the stuff after the '@' sign with the name of your host; see the red circle in the screenshot.

At the end, you should see text and a prompt that look like this:

A final checklist:

• EC2 instance is running

• used ubuntu 14.04

• NOT a micro instance (m3.xlarge or bigger)

3.1.4 Amazon Web Services reference material

Instance types

Instance costs

CHAPTER 4

Short read quality and trimming

Note: Reminder: if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!

4.1 Prepping the computer

Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.

4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz

These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.

4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in –

ln -fs /mnt/data/* .

These are FASTQ files – let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
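For reference, each FASTQ record takes up four lines: an '@' header with the read name, the sequence, a '+' separator, and a quality string the same length as the sequence. The record below is purely illustrative (made-up read name, sequence, and qualities), but it is the shape of what 'less' shows you:

@HWI-ST330:304:H045HADXX:1:1101:1111:61397 1:N:0:ATCACG
CACTTGTAAGGGCAGGCCCCCTTCACCCTCC
+
@?@DDDDDDHHH?GHBGGB@C?DBEGIIIIA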

Questions:

• why do the files have DNA in the name?

• why are there R1 and R2 in the file names?

• why don't we combine all the files?

Links:

• FASTQ Format

4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:

fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html

• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html

Questions:

• What should you pay attention to in the FastQC report?

• Which is "better", R1 or R2? And why?

Links:

• FastQC

• FastQC tutorial video

4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
   0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
   0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
   0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
   ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
   LEADING:2 TRAILING:2 \
   SLIDINGWINDOW:4:2 \
   MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully

Questions:

• How do you figure out what the parameters mean?

• How do you figure out what parameters to use?

• What adapters do you use?

• What version of Trimmomatic are we using here? (And FastQC?)

• Do you think parameters are different for RNAseq and genomic data sets?

• What's with these annoyingly long and complicated filenames?

• why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

• Trimmomatic

4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now, view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html

• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

• is the quality-trimmed data "better" than before?

• Does it matter that you still have adapters?

4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE $base.extract.fastq.gz $baseR2.extract.fastq.gz \
        $base.qc.fq.gz s1_se \
        $baseR2.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention –

• lines beginning with '#' are comments

• anywhere you see a '$' is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop

• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about (see the quick example just below this list)

• what's with 'orphans.fq'?
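If the ${base/_R1_/_R2_} substitution looks mysterious, here is a quick way to see what it does at the command line, using a made-up sample name (purely illustrative):

base=0Hour_ATCACG_L002_R1_001
echo ${base/_R1_/_R2_}
# prints: 0Hour_ATCACG_L002_R2_001 - the first _R1_ has been swapped for _R2_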

Questions:

• how do you figure out if it's working?

   – copy/paste it from Word?

   – put in lots of echo?

   – edit one line at a time?

• how on earth do you figure out how to do this?

4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py $base.qc.fq.gz $baseR2.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next: Running digital normalization

CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=$base.dn.fq.gz
     gzip -c $filename > $output
done

Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly

CHAPTER 6

Running the actual assembly

Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make

Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.

Next: BLASTing your assembled data

CHAPTER 7

BLASTing your assembled data

First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.

CHAPTER 8

Annotation of de novo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms (see the example line just after this list):

• The different (i's) that correspond to the same (g) represent isoforms

• The different (g's) could represent different genes (or parts of genes)

• The component (TR|c) often contains related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details.
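To make that concrete: each line of the map is simply gene<TAB>transcript. With the Trinity version used here, a line might look something like the following (an illustrative, made-up entry; the exact identifier style varies between Trinity versions):

TR1|c0_g1    TR1|c0_g1_i1

Here c0 is the component, g1 a gene within that component, and i1 one isoform of that gene.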

8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:

cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies.

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6

8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features like:

• signal peptides: using signalP

• transmembrane regions: using tmHMM

• rRNA transcripts: using RNAMMER

8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)

2. Protein sequences (currently as defined by TransDecoder)

3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .

CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now, get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next, build an index file for your assembly:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id  target_id     length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nemaid7tr4    269     0.000000     0           0            0.000000     0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nemaid1tr1    811     508.137307   1301        45           158.338092   252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nemaid2tr1    790     487.144836   1845        356          1218.927626  1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nemaid3tr1    852     549.122606   1792        3            871.770849   1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nemaid4tr1    675     372.190166   1005        20           88.963433    161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nemaid62tr13  2150    1846.657210  9921        9825         9919.902997  11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nemaid63tr13  406     103.720396   360         270          271.097003   1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nemaid61tr13  447     144.526787   6           0            0.000000     0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nemaid21tr8   2075    1771.684102  2782        58           958.636395   1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03

9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:

csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate

• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
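If you'd rather take a quick look at the spreadsheet without leaving the terminal, something like this works (run in /mnt/work, where the script wrote its output):

head -5 nema-edgeR.csv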

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.

CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate

CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz

Tada!

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a minimal sketch of what such a script can look like follows this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
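Here is a minimal sketch of what such a qsub script can look like, assuming a PBS/Torque-style scheduler (as on many HPC systems, including MSU's); the resource requests, module name, and script name are placeholders you would replace for your own cluster and pipeline:

#!/bin/bash
#PBS -l nodes=1:ppn=4,mem=16gb,walltime=24:00:00
#PBS -N trim-reads

# qsub starts the job in your home directory; move to where you submitted from
cd ${PBS_O_WORKDIR}

# load whatever software your cluster provides as modules (placeholder name)
module load Trimmomatic

# then run the same commands you tested interactively, e.g. a trimming script
bash trim-all-samples.sh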

Next: More resources

CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing.

ANGUS - summer NGS course - lots of resources and materials, and book reference.

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.

12.2 Places to share data, scripts, and results files

Figshare

CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
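For example, if your sequencing center gives you a URL with the username and password embedded, the transfer (run on the Amazon machine, ideally inside screen) might look something like this – the URL here is only a placeholder:

wget ftp://username:password@sequencing.center.example.com/path/to/Sample_1_R1.fastq.gz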

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses."

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?

Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.

CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.

14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 2: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

Contents

1 Welcome 311 1 Learning goals 312 2 Safe space and code of conduct 313 3 Instructor introductions 314 4 Amazon and cloud computing - why 315 5 Sticky notes and how they work + Minute Cards 416 6 Warning 4

2 Non-model organisms and RNAseq 521 The overall process 5

3 Getting started with Amazon EC2 731 More details 7

4 Short read quality and trimming 1741 Prepping the computer 1742 Data source 1743 1 Copying in some data to work with 1744 1 Copying data into a working location 1845 2 FastQC 1846 3 Trimmomatic 1947 4 FastQC again 2048 5 Trim the rest of the sequences 2049 6 Interleave the sequences 22

5 Running digital normalization 23

6 Running the actual assembly 25

7 BLASTing your assembled data 27

8 Annotation of denovo transcriptome 2981 Identify the GeneTranscript relationships 2982 Generate the longest-ORF peptide candidates from the Trinity Assembly 2983 Capturing BLAST Homologies 3084 Characterization of functional annotation features 3085 Integration of all annotations into one database 3186 Output an Annotation Report 32

i

9 Quantification and Differential Expression 3391 Download Express 3392 Align Reads with Bowtie 3393 Quantify Expression using eXpress 3494 Differential Expression 34

10 Remapping your reads to your assembled transcriptome 37

11 Miscellaneous advice 39111 Sequencing depth and number of samples 39112 Downloading your data 39113 Developing your own pipeline 40

12 More resources 41121 Informational resources 41122 Places to share data scripts and results files 41

13 Miscellaneous questions 43

14 Tips and Tricks for working with Remote Computers 45141 Use screen to run things that take a long time 45142 Use CyberDuck to transfer files 45143 Subsetting data 45144 Running full analyses on Amazon Web Services 46

15 Technical information 47

ii

labibi Documentation Release 10

This workshop was given on May 4th and 5th 2015 by C Titus Brown Camile Scott and Tamer Mansour See theworkshop organization page for more information or contact Titus directly

Tutorials

Contents 1

labibi Documentation Release 10

2 Contents

CHAPTER 1

Welcome

11 1 Learning goals

For you

bull get a first (or second) look at tools

bull gain some experience in the basic command line

bull get 80 of way to a complete analysis of some data

For us

bull what are the on campus needs who are the on-campus people

12 2 Safe space and code of conduct

This is intended to be a safe and friendly place for learning

Please see the Software Carpentry workshop Code of Conduct httpsoftware-carpentryorgconducthtml

In particular please ask questions because I guarantee you that your question will help others

13 3 Instructor introductions

Titus Brown - prof here at UC Davis in the School of Vet Med

Tamer Mansour - MDPhD postdoc with Titus

Camille Scott - CS graduate student here at Davis

14 4 Amazon and cloud computing - why

• simplifies software installation

• can be used for bigger analyses quite easily

• good for "burst" capacity (just got a data set)

• accessible everywhere

• they give us $100 gift certificates

15 5 Sticky notes and how they work + Minute Cards

Basic rules:

• no sticky note - "working on it"

• green sticky note - "all is well"

• red sticky note - "need help"

Place the sticky notes where we can see them from the back of the room – e.g. on the back of your laptop.

At the end of each session (coffee break, lunch, end of day) please write down on an index card one thing you learned and one thing you're still confused about.

16 6 Warning

The bit that isn't going to work is at the beginning. This is unavoidable. To reward you, we have a coffee break at 10:30am.

---

Next: Non-model organisms and RNAseq


CHAPTER 2

Non-model organisms and RNAseq

With non-model systems, where there is neither a good genome nor a lot of mRNAseq data, you have to build your own transcriptome from scratch – so-called "de novo transcriptome assembly." There are a few programs to do this - most notably Trinity and Oases - and we have found little difference.

The main problem you'll run into with non-model mRNAseq is that the output is fairly noisy with respect to splice variants. Our experience has been that many of these splice variants are probably "real" – in the sense of actually present – but may be biological "noise", in the sense that they are not actually functional. (See this excellent paper by Pickrell and Pritchard making the case.) Regardless, there's little that you can do about this, although we will talk about it a bit on the second day.

21 The overall process

[figure: _static/nonmodel-rnaseq-pipeline.png]

• Copy over your RNAseq data (from two or more samples).

• Trim primers and junk from sequence (Short read quality and trimming).

• Do abundance normalization (Running digital normalization).

• Assemble everything together (Running the actual assembly).

This gives you an assembled transcriptome, consisting of many transcripts and transcript families.

At this point you can do one or more of the following:

• Annotate your transcripts (Annotation of denovo transcriptome).

• Quantify your transcripts and examine differential expression (Quantification and Differential Expression).

• BLAST your transcripts individually (BLASTing your assembled data).

Next: Getting started with Amazon EC2


CHAPTER 3

Getting started with Amazon EC2

Short version

• go to http://aws.amazon.com/, log in, then "EC2" (upper left)

• select "Launch instance"

• select "Ubuntu 14.04" from the list

• select "m3.xlarge" from the list (towards bottom of "General purpose")

• click "Review and launch"

• select "Launch"

• if your first time through, create a key pair; otherwise select existing

• click "launch instance"

31 More details

311 Start up an EC2 instance

Log in

Go to 'https://aws.amazon.com/' in a Web browser.

Select the 'My Account/Console' menu option 'AWS Management Console.'

Log in with your username & password.

Click on EC2 (upper left)


Select your zone

Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.

Then click on Launch Instance

Select the machine operating system to boot

Find the "Ubuntu Server 14.04" image in the first list to show up.


Choose the machine size

Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.

Confirm and launch

Review the details (ignore the warnings) and click on Launch


(First time through) generate a new key pair

If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.

(Next times through) select an existing key pair

Select a key pair and click 'Launch'.


Click on View Instances


Select the public DNS name for later use

312 Logging into your new instance "in the cloud" (Windows version)

First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Generate a ppk file from your pem file

(You only need to do this once for each key.)

Open puttygen; select "Load".


Find and load your '.pem' file; it's probably in your Downloads folder. Note, you have to select 'All files' on the bottom.

Load it.

Now, "save private key". Put it somewhere easy to find.


Now that you've generated your PPK file from your PEM file, you can log in. To do that:

Logging into your EC2 instance with Putty

Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".


Declare victory

313 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications/Utilities) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".

Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note, you have to replace the stuff after the '@' sign with the name of the host; see the red circle in:


At the end you should see text and a prompt that look like this

A final checklist:

• EC2 instance is running

• used ubuntu 14.04

• NOT a micro instance (m3.xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs


CHAPTER 4

Short read quality and trimming

Note: Reminder, if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer. How exciting!

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
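One of the questions later in this section asks which versions of Trimmomatic and FastQC we are using. A quick way to check what apt-get just installed is sketched below; these exact commands are my assumption about the Ubuntu packages, not part of the original workshop instructions:

# FastQC reports its own version
fastqc --version

# ask the package manager which Trimmomatic package version is installed
dpkg -s trimmomatic | grep Version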

42 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

43 1 Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts
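If you want to sanity-check how many reads are in each subset, remember that a FASTQ record is four lines long, so counting lines and dividing by four gives the read count. This is just an optional quick check, not part of the original protocol:

cd /mnt/data

# count FASTQ records in one file (4 lines per record)
echo $(( $(gunzip -c 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | wc -l) / 4 ))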

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.

44 1 Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in –

ln -fs /mnt/data/* .

These are FASTQ files – let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')

Question:

• why do the files have DNA in the name?

• why are there R1 and R2 in the file names?

• why don't we combine all the files?

Links:

• FASTQ Format

45 2 FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:

fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html

• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html

Questions:

• What should you pay attention to in the FastQC report?

• Which is "better", R1 or R2? And why?

Links:

• FastQC

• FastQC tutorial video

46 3 Trimmomatic

Now we're going to do some trimming. We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 \
     SLIDINGWINDOW:4:2 \
     MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully


Questions:

• How do you figure out what the parameters mean?

• How do you figure out what parameters to use?

• What adapters do you use?

• What version of Trimmomatic are we using here? (And FastQC?)

• Do you think parameters are different for RNAseq and genomic data sets?

• What's with these annoyingly long and complicated filenames?

• why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

• Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html

• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

• is the quality trimmed data "better" than before?

• Does it matter that you still have adapters?

48 5 Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention –

• # are comments

• anywhere you see a '$' it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop

• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about (there is a short standalone example just after this list)

• what's with 'orphans.fq'?
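If the ${base/_R1_/_R2_} syntax is unfamiliar, here is a tiny standalone illustration of bash's pattern substitution that you can paste into your terminal; the sample value is made up for illustration:

# put an example filename prefix in a variable
base=0Hour_ATCACG_L002_R1_001

# ${variable/pattern/replacement} returns the value with the first match replaced
echo ${base/_R1_/_R2_}    # prints 0Hour_ATCACG_L002_R2_001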

Questions:

• how do you figure out if it's working?

  – copy/paste it from Word

  – put in lots of echo

  – edit one line at a time

• how on earth do you figure out how to do this?

49 6 Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop.

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next: Running digital normalization


CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
   extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done


Now if you type

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly


CHAPTER 6

Running the actual assembly

Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make

Go into the work directory and prepare the data

cd /mnt/work

for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now run the Trinity assembler

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
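If you'd like to double-check the result before moving on, you can peek at the renamed FASTA headers and count the transcripts. This is an optional check along these lines, not part of the original protocol:

# show the first few sequence names in the renamed assembly
gunzip -c trinity-nematostella-raw.renamed.fasta.gz | grep '^>' | head

# count the total number of transcripts
gunzip -c trinity-nematostella-raw.renamed.fasta.gz | grep -c '^>'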


Next: BLASTing your assembled data


CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin

Install the BLAST server and configure it

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now copy in your newly created transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.


CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.

• The different (g's) could represent different genes (or parts of genes).

• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
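Since the map is a two-column, tab-separated file (gene in column 1, transcript in column 2), a quick way to see how many genes and transcripts Trinity reported is to count the unique entries in each column. This is an optional sanity check, not part of the Trinotate workflow itself:

# number of distinct genes
cut -f1 Trinity.fasta.gene_trans_map | sort -u | wc -l

# number of distinct transcripts
cut -f2 Trinity.fasta.gene_trans_map | sort -u | wc -l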

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the Transdecoder software to identify the longest-ORF peptide


cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
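To get a feel for how many candidate ORFs TransDecoder extracted, you can simply count the FASTA headers in that file; this is an optional check:

grep -c '^>' Trinity.fasta.transdecoder_dir/longest_orfs.pep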

83 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1. SwissProt database: The UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies (we will take a quick look at the tabular output just after this section).

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
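As promised above, once any of these searches has finished it is worth a quick look at the tabular output. Each line of an -outfmt 6 file is one hit, with the query id and subject id in the first two columns, so you can see how many transcripts picked up at least one SwissProt hit. These commands are just an optional check:

# first few hits
head blastx.outfmt6

# how many distinct transcripts have at least one hit?
cut -f1 blastx.outfmt6 | sort -u | wc -l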

84 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features like:

• signal peptides, using signalP

• transmembrane regions, using tmHMM

• rRNA transcripts, using RNAMMER

85 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)

2. Protein sequences (currently as defined by TransDecoder)

3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

86 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

91 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

92 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

93 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nemaid7tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nemaid1tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nemaid2tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nemaid3tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nemaid4tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nemaid62tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nemaid63tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nemaid61tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nemaid21tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one - diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate

• Learn R with Swirl

So download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
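If you want a quick look at the differential expression results without leaving the terminal, something like the following should work; note that the column layout described above is assumed, with the FDR-adjusted P-value in the last column:

# first few rows of the spreadsheet, including the header
head nema-edgeR.csv

# count rows with FDR-adjusted P-value below 0.05 (assumes FDR is the last column)
awk -F, 'NR > 1 && $NF < 0.05' nema-edgeR.csv | wc -l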


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate


CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

112 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a minimal sketch of such a script is shown after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
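As promised in step 5, here is a minimal sketch of what a qsub (PBS) submission script might look like. Everything in it - the resource requests, the module names, and the commands - is a placeholder that you would replace with whatever you tested interactively on your own cluster:

#!/bin/bash
#PBS -N my_rnaseq_step
#PBS -l nodes=1:ppn=4,mem=16gb,walltime=24:00:00

# start from the directory the job was submitted from
cd $PBS_O_WORKDIR

# load whatever modules your cluster provides (names vary by site)
module load FastQC

# run the commands you already tested on a small data set
fastqc *.fastq.gz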

Next: More resources


CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare


CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4 How do we transfer our data to Amazon (or any remote computer)

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center) you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers, "Running full analyses."

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often, you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
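One caution: if your data is paired-end, subset the R1 and R2 files with the same number of lines so the pairs stay matched up. For example (the second filename here is just illustrative):

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_2.fq.gz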


144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks      Used  Available Use% Mounted on
/dev/xvdb       38565344  20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 3: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

9 Quantification and Differential Expression 3391 Download Express 3392 Align Reads with Bowtie 3393 Quantify Expression using eXpress 3494 Differential Expression 34

10 Remapping your reads to your assembled transcriptome 37

11 Miscellaneous advice 39111 Sequencing depth and number of samples 39112 Downloading your data 39113 Developing your own pipeline 40

12 More resources 41121 Informational resources 41122 Places to share data scripts and results files 41

13 Miscellaneous questions 43

14 Tips and Tricks for working with Remote Computers 45141 Use screen to run things that take a long time 45142 Use CyberDuck to transfer files 45143 Subsetting data 45144 Running full analyses on Amazon Web Services 46

15 Technical information 47

ii

labibi Documentation Release 10

This workshop was given on May 4th and 5th 2015 by C Titus Brown Camile Scott and Tamer Mansour See theworkshop organization page for more information or contact Titus directly

Tutorials

Contents 1

labibi Documentation Release 10

2 Contents

CHAPTER 1

Welcome

11 1 Learning goals

For you

bull get a first (or second) look at tools

bull gain some experience in the basic command line

bull get 80 of way to a complete analysis of some data

For us

bull what are the on campus needs who are the on-campus people

12 2 Safe space and code of conduct

This is intended to be a safe and friendly place for learning

Please see the Software Carpentry workshop Code of Conduct httpsoftware-carpentryorgconducthtml

In particular please ask questions because I guarantee you that your question will help others

13 3 Instructor introductions

Titus Brown - prof here at UC Davis in the School of Vet Med

Tamer Mansour - MDPhD postdoc with Titus

Camille Scott - CS graduate student here at Davis

14 4 Amazon and cloud computing - why

bull simplifies software installation

bull can be used for bigger analyses quite easily

bull good for ldquoburstrdquo capacity (just got a data set)

bull accessible everywhere

3

labibi Documentation Release 10

bull they give us $100 gift certificates

15 5 Sticky notes and how they work + Minute Cards

Basic rules

bull no sticky note - ldquoworking on itrdquo

bull green sticky note - ldquoall is wellrdquo

bull red sticky note - ldquoneed helprdquo

Place the sticky notes where we can see them from the back of the room ndash eg on the back of your laptop

At the end of each session (coffee break lunch end of day) please write down on an index card one thing you learnedand one thing yoursquore still confused about

16 6 Warning

The bit that isnrsquot going to work is at the beginning This is unavoidable To reward you we have a coffee break at1030am

mdash

Next Non-model organisms and RNAseq

4 Chapter 1 Welcome

CHAPTER 2

Non-model organisms and RNAseq

With non-model systems where there is neither a good genome nor a lot of mRNAseq data you have to build yourown transcriptome from scratch ndash so-called ldquode novo transcriptome assemblyrdquo There are a few programs to do this -most notably Trinity and Oases - and we have found little difference

The main problem yoursquoll run into with non-model mRNAseq is that the output is fairly noisy with respect to splicevariants Our experience has been that many of these splice variants are probably ldquorealrdquo ndash in the sense of actuallypresent ndash but may be biological ldquonoiserdquo in the sense that they are not actually functional (See this excellent paper byPickrell and Pritchard making the case) Regardless therersquos little that you can do about this although we will talkabout it a bit on the second day

21 The overall process

_staticnonmodel-rnaseq-pipelinepng

bull Copy over your RNAseq data (from two or more samples)

bull Trim primers and junk from sequence (Short read quality and trimming)

bull Do abundance normalization (Running digital normalization)

bull Assemble everything together (Running the actual assembly)

This gives you an assembled transcriptome consisting of many transcripts and transcript families

At this point you can do one or more of the following

bull Annotate your transcripts (Annotation of denovo transcriptome)

bull Quantify your transcripts and examine differential expression (Quantification and Differential Expression)

bull BLAST your transcripts individually (BLASTing your assembled data)

Next Getting started with Amazon EC2

5

labibi Documentation Release 10

6 Chapter 2 Non-model organisms and RNAseq

CHAPTER 3

Getting started with Amazon EC2

Short version

bull go to httpawsamazoncom log in then ldquoEC2rdquo (upper left)

bull select ldquoLaunch instancerdquo

bull select ldquoUbuntu 1404rdquo from the list

bull select ldquom3xlargerdquo from the list (towards bottom of ldquoGeneral purposerdquo)

bull click ldquoReview and launchrdquo

bull select ldquoLaunchrdquo

bull if your first time through create a key pair otherwise select existing

bull click ldquolaunch instancerdquo

31 More details

311 Start up an EC2 instance

Log in

Go to lsquohttpsawsamazoncomlsquo in a Web browser

Select lsquoMy AccountConsolersquo menu option lsquoAWS Management Consolerdquo

Log in with your username amp password

Click on EC2 (upper left)

7

labibi Documentation Release 10

Select your zone

Many of the resources that we use are hosted by Amazon on the East coast Make sure that your dashboard has lsquoNVirginiarsquo on the upper right

Then click on Launch Instance

Select the machine operating system to boot

Find the ldquoUbuntu Server 1404rdquo image in the first list to show up

8 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Choose the machine size

Select lsquoGeneral purposersquo lsquom3xlargersquo and then lsquoReview and Launchrsquo

Confirm and launch

Review the details (ignore the warnings) and click on Launch

31 More details 9

labibi Documentation Release 10

(First time through) generate a new key pair

If you donrsquot have any key pairs enter a key pair name and then download a key pair Then click Launch Instance

(Next times through) select an existing key pair

Select a key pair and click lsquoLaunchrsquo

10 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Click on View Instances

31 More details 11

labibi Documentation Release 10

Select the public DNS name for later use

312 Logging into your new instance ldquoin the cloudrdquo (Windows version)

First Download Putty and Puttygen from here httpwwwchiarkgreenendorguk~sgtathamputtydownloadhtml

Generate a ppk file from your pem file

(You only need to do this once for each key)

Open puttygen select ldquoLoadrdquo


Find and load your '.pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.

Load it

Now ldquosave private keyrdquo Put it somewhere easy to find


Now that yoursquove generated your PPK file from your PEM file you can log in To do that

Logging into your EC2 instance with Putty

Open up putty and enter your hostname into the Host Name box

Now go find the lsquoSSHrsquo section and enter your ppk file (generated above by puttygen) Then select lsquoOpenrsquo

Log in as ldquoubunturdquo


Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note: you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the figure.


At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs


CHAPTER 4

Short read quality and trimming

Note Reminder if yoursquore on Windows you should install mobaxterm

OK you should now be logged into your Amazon computer How exciting

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First letrsquos set up a place to work

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next letrsquos install a few things

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.

42 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

43 1 Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next letrsquos grab part of the data set


cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs /mnt/data/* .

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now run FastQC on two files


fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type lsquolsrsquo

ls -d *fastqc*

to list the files and you should see

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

bull 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html

bull 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
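
(If you can't wait until tomorrow, a minimal sketch of copying the reports to your own laptop with scp is below; run it on your LOCAL computer, and note that the key filename and the ec2 hostname are placeholders you'd replace with your own:)

# run this on your laptop, not on the Amazon machine
# 'amazon.pem' and the ec2 hostname are placeholders
scp -i ~/Desktop/amazon.pem -r ubuntu@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:/mnt/work/*_fastqc .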

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now we're going to do some trimming. We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing wersquoll need are the adapters to trim off

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
    0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
    0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
    ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
    LEADING:2 TRAILING:2 \
    SLIDINGWINDOW:4:2 \
    MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully


Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic
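
(One way to answer the version question above -- a minimal sketch, assuming both tools were installed through the Ubuntu packages as we did earlier -- is to ask the tools and the package manager directly:)

# print the FastQC version, and the Ubuntu package versions for both tools
fastqc --version
dpkg -s trimmomatic fastqc | grep -E '^(Package|Version)'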

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html

bull 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz


curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
    # first, make the base by removing .extract.fastq.gz
    base=$(basename $filename .extract.fastq.gz)
    echo $base

    # now, construct the R2 filename by replacing R1 with R2
    baseR2=${base/_R1_/_R2_}
    echo $baseR2

    # finally, run Trimmomatic
    TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

    # save the orphans
    cat s1_se s2_se >> orphans.fq
done

Things to mention --

bull # characters mark comments; those lines are ignored by the shell.

bull anywhere you see a '$' it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop.

bull we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about (see the small example just after this list).

bull what's with 'orphans.fq'?
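
(Here is a tiny, standalone illustration of that ${.../.../...} substitution -- not part of the pipeline, just something you can paste into your shell to see what it does:)

# ${variable/pattern/replacement} substitutes the first match of 'pattern'
base=0Hour_ATCACG_L002_R1_001
echo ${base/_R1_/_R2_}     # prints 0Hour_ATCACG_L002_R2_001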

Questions

bull how do you figure out if itrsquos working

- copy/paste it from Word

- put in lots of echo

- edit one line at a time

bull how on earth do you figure out how to do this
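
(A common trick for the "is it working?" question is a dry run: put echo in front of the command so the loop prints what it would do without actually doing it. A minimal sketch, using the same filenames as above:)

# dry run: print the Trimmomatic commands instead of executing them
for filename in *_R1_*.extract.fastq.gz
do
    base=$(basename $filename .extract.fastq.gz)
    baseR2=${base/_R1_/_R2_}
    echo TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz ...
done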

49 6 Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in *_R1_*.qc.fq.gz
do
    # first, make the base by removing .qc.fq.gz
    base=$(basename $filename .qc.fq.gz)
    echo $base

    # now, construct the R2 filename by replacing R1 with R2
    baseR2=${base/_R1_/_R2_}
    echo $baseR2

    # construct the output filename
    output=${base/_R1_/}.pe.qc.fq.gz

    interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next Running digital normalization


CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First wersquoll run it on the interleaved files we generated in the previous section

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
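
(If you're curious how many reads digital normalization kept, one way to check -- a sketch, assuming the default .keep output names that normalize-by-median.py writes into the current directory -- is to count FASTQ records before and after:)

# each FASTQ record is 4 lines, so divide the line counts by 4
echo "reads before diginorm:" $(( $(gunzip -c *.pe.qc.fq.gz | wc -l) / 4 ))
echo "reads kept by diginorm:" $(( $(cat *.pe.qc.fq.gz.keep | wc -l) / 4 ))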

Next run diginorm on the orphaned reads (from trimming)

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper, Zhang et al. 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in *.pe.*.keep.abundfilt
do
    extract-paired-reads.py $filename
done

Put all the orphaned reads in one place

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
    base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
    output=${base}.dn.fq.gz
    gzip -c $filename > $output
done


Now if you type

ls *.dn.fq.gz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001.dn.fq.gz    6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz    6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz    6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz    6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Letrsquos remove some of the detritus before moving on

rm *.pe.* *.se.* *.abundfilt.* *.keep
rm normC20k20.ct

Next Running the actual assembly


CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next install Trinity v206

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make

Go into the work directory and prepare the data

cd /mnt/work
for i in *.dn.fq.gz
do
    split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now run the Trinity assembler

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dirTrinityfasta which you can get stats on like so

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
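
(As a quick sanity check -- a sketch, nothing more -- you can count how many sequences ended up in the raw and renamed assemblies; the two numbers should match:)

gunzip -c trinity-nematostella-raw.fa.gz | grep -c '^>'
gunzip -c trinity-nematostella-raw.renamed.fasta.gz | grep -c '^>'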


Next BLASTing your assembled data


CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now copy in your newly created transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set, try BLASTing the zebrafish lethal giant larvae homolog.
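
(You can also query the same database from the command line rather than through the web interface. A minimal sketch, assuming you supply your own FASTA file of nucleotide queries -- query.fa below is a hypothetical file, not something created earlier in this tutorial:)

cd /home/ubuntu/blastkit
# nucleotide query against the nucleotide database formatted above;
# query.fa is a placeholder for your own sequences
blastall -p blastn -d db/db.fa -i query.fa -e 1e-6 -o query-vs-assembly.txt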


CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes and isoforms:

bull The different (i's) that correspond to the same (g) represent isoforms.

bull The different (g's) could represent different genes (or parts of genes).

bull The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
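
(To make the gene/isoform naming concrete, you can peek at the first few lines of the map; the exact identifiers will differ for your assembly, so the two-column gene-to-transcript layout is the thing to look for:)

head -5 Trinity.fasta.gene_trans_map
# each line is:  <gene id> <TAB> <transcript/isoform id>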

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cd
sudo cpan URI::Escape

Note type yes for all interactive questions

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the Transdecoder software to identify the longest-ORF peptide


cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the Transdecoder output

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

83 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2 Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I ran them overnight already; you can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6

84 Characterization of functional annotation features

1 To identify protein domains, we need to install HMMER and download the Pfam domains database:


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database. A pre-generated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional load Uniref90 blast hits


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional Loading functional annotation features

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

86 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation (see the example after this list):

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1 'DNC': domain noise cutoff (default)

2 'DGC': domain gathering cutoff

3 'DTC': domain trusted cutoff

4 'SNC': sequence noise cutoff

5 'SGC': sequence gathering cutoff

6 'STC': sequence trusted cutoff
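
For example, a stricter report might look like this (the particular E-value and cutoff choice here are just an illustration, not a recommendation):

~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-5 --pfam_cutoff DGC > trinotate_annotation_report.strict.xls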

Let us see the output. Open a new shell on your local computer and copy the report over:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

91 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

92 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

93 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nema.id7.tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nema.id1.tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nema.id2.tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nema.id3.tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nema.id4.tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nema.id62.tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nema.id63.tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nema.id61.tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nema.id21.tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
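
(If you'd rather peek at the results on the command line than in Excel, a quick sketch -- assuming the column layout described above, with the FDR-adjusted P-value in the last column -- is:)

# show the header, then the 10 rows with the smallest FDR-adjusted P-values
head -1 nema-edgeR.csv
tail -n +2 nema-edgeR.csv | sort -t, -g -k5,5 | head -10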

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
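
(If you want the same number for each of your samples, a small loop -- a sketch along the same lines as the single command above -- will do it; bowtie2 prints the alignment rate to the terminal for each file:)

for filename in *_R1_*.extract.fastq.gz
do
    echo $filename
    zcat $filename | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done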


CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

112 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, the URI lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); see the sketch after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
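
Here is a minimal sketch of what such a qsub script might look like on an HPC running TORQUE/PBS; the resource numbers, module names, and script name are placeholders that will depend on your cluster and your pipeline:

#!/bin/bash
#PBS -l nodes=1:ppn=4,mem=16gb,walltime=24:00:00
#PBS -N trim-and-assemble

# 'module load' names are cluster-specific placeholders
module load Trimmomatic FastQC

# run from the directory the job was submitted from
cd ${PBS_O_WORKDIR}
bash my_pipeline.sh    # the script you already tested interactively on small data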

Next More resources


CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare


CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen you can "detach" from your remote computer and walk away for a bit. For example:

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)
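
(A couple of screen options that make this easier -- a small aside, not required for anything above -- are naming your sessions and listing them:)

screen -S diginorm     # start a session named 'diginorm'
# ... later, from a new login ...
screen -ls             # list running sessions
screen -r diginorm     # reattach to the named session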

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.


144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance see this set of instructions

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.


  • Welcome
    • 1. Learning goals
    • 2. Safe space and code of conduct
    • 3. Instructor introductions
    • 4. Amazon and cloud computing - why?
    • 5. Sticky notes and how they work + Minute Cards
    • 6. Warning
  • Non-model organisms and RNAseq
    • The overall process
  • Getting started with Amazon EC2
    • More details
  • Short read quality and trimming
    • Prepping the computer
    • Data source
    • 1. Copying in some data to work with
    • 1. Copying data into a working location
    • 2. FastQC
    • 3. Trimmomatic
    • 4. FastQC again
    • 5. Trim the rest of the sequences
    • 6. Interleave the sequences
  • Running digital normalization
  • Running the actual assembly
  • BLASTing your assembled data
  • Annotation of denovo transcriptome
    • Identify the Gene/Transcript relationships
    • Generate the longest-ORF peptide candidates from the Trinity Assembly
    • Capturing BLAST Homologies
    • Characterization of functional annotation features
    • Integration of all annotations into one database
    • Output an Annotation Report
  • Quantification and Differential Expression
    • Download Express
    • Align Reads with Bowtie
    • Quantify Expression using eXpress
    • Differential Expression
  • Remapping your reads to your assembled transcriptome
  • Miscellaneous advice
    • Sequencing depth and number of samples
    • Downloading your data
    • Developing your own pipeline
  • More resources
    • Informational resources
    • Places to share data, scripts, and results files
  • Miscellaneous questions
  • Tips and Tricks for working with Remote Computers
    • Use screen to run things that take a long time
    • Use CyberDuck to transfer files
    • Subsetting data
    • Running full analyses on Amazon Web Services
  • Technical information
Page 4: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

This workshop was given on May 4th and 5th 2015 by C Titus Brown Camile Scott and Tamer Mansour See theworkshop organization page for more information or contact Titus directly

Tutorials

Contents 1

labibi Documentation Release 10

2 Contents

CHAPTER 1

Welcome

11 1 Learning goals

For you

bull get a first (or second) look at tools

bull gain some experience in the basic command line

bull get 80 of way to a complete analysis of some data

For us

bull what are the on campus needs who are the on-campus people

12 2 Safe space and code of conduct

This is intended to be a safe and friendly place for learning

Please see the Software Carpentry workshop Code of Conduct httpsoftware-carpentryorgconducthtml

In particular please ask questions because I guarantee you that your question will help others

13 3 Instructor introductions

Titus Brown - prof here at UC Davis in the School of Vet Med

Tamer Mansour - MDPhD postdoc with Titus

Camille Scott - CS graduate student here at Davis

14 4 Amazon and cloud computing - why

bull simplifies software installation

bull can be used for bigger analyses quite easily

bull good for ldquoburstrdquo capacity (just got a data set)

bull accessible everywhere

3

labibi Documentation Release 10

bull they give us $100 gift certificates

15 5 Sticky notes and how they work + Minute Cards

Basic rules

bull no sticky note - ldquoworking on itrdquo

bull green sticky note - ldquoall is wellrdquo

bull red sticky note - ldquoneed helprdquo

Place the sticky notes where we can see them from the back of the room ndash eg on the back of your laptop

At the end of each session (coffee break lunch end of day) please write down on an index card one thing you learnedand one thing yoursquore still confused about

16 6 Warning

The bit that isnrsquot going to work is at the beginning This is unavoidable To reward you we have a coffee break at1030am

mdash

Next Non-model organisms and RNAseq

4 Chapter 1 Welcome

CHAPTER 2

Non-model organisms and RNAseq

With non-model systems where there is neither a good genome nor a lot of mRNAseq data you have to build yourown transcriptome from scratch ndash so-called ldquode novo transcriptome assemblyrdquo There are a few programs to do this -most notably Trinity and Oases - and we have found little difference

The main problem yoursquoll run into with non-model mRNAseq is that the output is fairly noisy with respect to splicevariants Our experience has been that many of these splice variants are probably ldquorealrdquo ndash in the sense of actuallypresent ndash but may be biological ldquonoiserdquo in the sense that they are not actually functional (See this excellent paper byPickrell and Pritchard making the case) Regardless therersquos little that you can do about this although we will talkabout it a bit on the second day

21 The overall process

_staticnonmodel-rnaseq-pipelinepng

bull Copy over your RNAseq data (from two or more samples)

bull Trim primers and junk from sequence (Short read quality and trimming)

bull Do abundance normalization (Running digital normalization)

bull Assemble everything together (Running the actual assembly)

This gives you an assembled transcriptome consisting of many transcripts and transcript families

At this point you can do one or more of the following

bull Annotate your transcripts (Annotation of denovo transcriptome)

bull Quantify your transcripts and examine differential expression (Quantification and Differential Expression)

bull BLAST your transcripts individually (BLASTing your assembled data)

Next Getting started with Amazon EC2

5

labibi Documentation Release 10

6 Chapter 2 Non-model organisms and RNAseq

CHAPTER 3

Getting started with Amazon EC2

Short version

bull go to httpawsamazoncom log in then ldquoEC2rdquo (upper left)

bull select ldquoLaunch instancerdquo

bull select ldquoUbuntu 1404rdquo from the list

bull select ldquom3xlargerdquo from the list (towards bottom of ldquoGeneral purposerdquo)

bull click ldquoReview and launchrdquo

bull select ldquoLaunchrdquo

bull if your first time through create a key pair otherwise select existing

bull click ldquolaunch instancerdquo

31 More details

311 Start up an EC2 instance

Log in

Go to lsquohttpsawsamazoncomlsquo in a Web browser

Select lsquoMy AccountConsolersquo menu option lsquoAWS Management Consolerdquo

Log in with your username amp password

Click on EC2 (upper left)

7

labibi Documentation Release 10

Select your zone

Many of the resources that we use are hosted by Amazon on the East coast Make sure that your dashboard has lsquoNVirginiarsquo on the upper right

Then click on Launch Instance

Select the machine operating system to boot

Find the ldquoUbuntu Server 1404rdquo image in the first list to show up

8 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Choose the machine size

Select lsquoGeneral purposersquo lsquom3xlargersquo and then lsquoReview and Launchrsquo

Confirm and launch

Review the details (ignore the warnings) and click on Launch

31 More details 9

labibi Documentation Release 10

(First time through) generate a new key pair

If you donrsquot have any key pairs enter a key pair name and then download a key pair Then click Launch Instance

(Next times through) select an existing key pair

Select a key pair and click lsquoLaunchrsquo

10 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Click on View Instances

31 More details 11

labibi Documentation Release 10

Select the public DNS name for later use

312 Logging into your new instance ldquoin the cloudrdquo (Windows version)

First Download Putty and Puttygen from here httpwwwchiarkgreenendorguk~sgtathamputtydownloadhtml

Generate a ppk file from your pem file

(You only need to do this once for each key)

Open puttygen select ldquoLoadrdquo

12 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Find and load your lsquopemrsquo file itrsquos probably in your Downloads folder Note you have to select lsquoAll filesrsquo on thebottom

Load it

Now ldquosave private keyrdquo Put it somewhere easy to find

31 More details 13

labibi Documentation Release 10

Now that yoursquove generated your PPK file from your PEM file you can log in To do that

Logging into your EC2 instance with Putty

Open up putty and enter your hostname into the Host Name box

Now go find the lsquoSSHrsquo section and enter your ppk file (generated above by puttygen) Then select lsquoOpenrsquo

Log in as ldquoubunturdquo

14 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First find your private key file itrsquos the pem file you downloaded when starting up your EC2 instance It should be inyour Downloads folder Move it onto your desktop and rename it to lsquoamazonpemrsquo

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~Desktopamazonpem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~Desktopamazonpem ubuntuec2----compute-1amazonawscom

Here yoursquore logging in as user lsquoubuntursquo to the machine lsquoec2-174-129-122-189compute-1amazonawscomrsquo using theauthentication key located in lsquoamazonpemrsquo on your Desktop

Note you have to replace the stuff after the lsquorsquo sign with the name of the host see the red circle in

31 More details 15

labibi Documentation Release 10

At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs

16 Chapter 3 Getting started with Amazon EC2

CHAPTER 4

Short read quality and trimming

Note Reminder if yoursquore on Windows you should install mobaxterm

OK you should now be logged into your Amazon computer How exciting

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First letrsquos set up a place to work

sudo chmod a+rwxt mnt

This makes lsquomntrsquo a place where we can put data and working files

Next letrsquos install a few things

sudo apt-get updatesudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs which wersquoll use below along with some software prerequisites thatwersquoll need for other things below

42 Data source

Wersquore going to be using a subset of data from Tulin et al 2013 a paper looking at early transcription in the organismNematostella vectensis the sea anemone

43 1 Copying in some data to work with

Wersquove loaded subsets of the data onto an Amazon location for you to make everything faster for todayrsquos work Wersquoregoing to put the files on your computer locally under the directory mntdata

mkdir mntdata

Next letrsquos grab part of the data set

17

labibi Documentation Release 10

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_001extractfastqgz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001extractfastqgz-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002extractfastqgz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default UNIX makes things writeable by the file ownerLetrsquos fix that before we go on any further

chmod u-w

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First make a working directory this will be a place where you can futz around with a copy of the data without messingup your primary data

mkdir mntworkcd mntwork

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs mntdata

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001extractfastqgz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

Wersquore going to use FastQC to summarize the data We already installed lsquofastqcrsquo on our computer - thatrsquos what thelsquoapt-get installrsquo did above

Now run FastQC on two files


fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
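(If you do want to pull the reports down to your own laptop right away, here is a minimal sketch using scp, run from your local machine. It assumes a Mac/Linux shell, that your key file is ~/Desktop/amazon.pem, and your instance's public DNS name; Windows users can use the CyberDuck approach described later.)

scp -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com:/mnt/work/*_fastqc.zip .

Unzip the copied files and open fastqc_report.html in a browser.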

Questions:

• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?

Links:

• FastQC
• FastQC tutorial video

4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz 0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
    0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se 0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
    ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully


Questions:

• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• Why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

• Trimmomatic

4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

• Is the quality trimmed data "better" than before?
• Does it matter that you still have adapters?

4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz


curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention --

• # are comments
• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about
• what's with 'orphans.fq'?
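If the ${base/_R1_/_R2_} syntax looks mysterious, you can try the filename manipulation by itself at the prompt (the value below is just an example):

base=0Hour_ATCACG_L002_R1_001
echo ${base/_R1_/_R2_}
# prints: 0Hour_ATCACG_L002_R2_001

It is bash's "replace pattern in a variable" substitution; basename strips a trailing suffix in the same spirit.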

Questions:

• how do you figure out if it's working?

  – copy/paste it from Word


  – put in lots of echo
  – edit one line at a time

• how on earth do you figure out how to do this?

4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next: Running digital normalization


CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done


Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly


CHAPTER 6

Running the actual assembly

Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make

Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
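If you want a quick, optional sanity check that the renaming worked, you can peek at the first few sequence headers:

gunzip -c trinity-nematostella-raw.renamed.fasta.gz | grep '^>' | head -5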


Next: BLASTing your assembled data


CHAPTER 7

BLASTing your assembled data

First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.


CHAPTER 8

Annotation of denovo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms
• The different (g's) could represent different genes (or parts of genes)
• The component (TR|c) often contains related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details.
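For orientation: the map is just two tab-separated columns, the gene (component) identifier followed by the transcript (isoform) identifier. The lines look roughly like this (the IDs below are made up for illustration; yours will differ):

TR1|c0_g1    TR1|c0_g1_i1
TR1|c0_g1    TR1|c0_g1_i2
TR2|c0_g1    TR2|c0_g1_i1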

8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:


cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6

8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like the following (illustrative commands are sketched after this list):

• signal peptides, using signalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER
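None of these three tools is installed by the apt-get commands above; signalP, tmHMM and RNAMMER require separate (free academic) downloads. Assuming signalP 4 and TMHMM 2 are installed and on your PATH, Trinotate-style invocations look roughly like this sketch:

signalp -f short -n signalp.out Trinity.fasta.transdecoder_dir/longest_orfs.pep
tmhmm --short < Trinity.fasta.transdecoder_dir/longest_orfs.pep > tmhmm.out

RNAMMER is usually run through a wrapper script shipped with Trinotate; we won't cover it here.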

8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI and DBD::SQLite perl packages:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
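For example, to ask for a stricter E-value and to set the Pfam domain noise cutoff explicitly, the report command would look like this (the threshold here is arbitrary, just for illustration):

~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-5 --pfam_cutoff DNC > trinotate_annotation_report.xls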

Let us see the output. Open a new shell on your local computer:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now, get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next, build an index file for your assembly:


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id  target_id     length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nemaid7tr4    269     0.000000     0           0            0.000000     0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nemaid1tr1    811     508.137307   1301        45           158.338092   252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nemaid2tr1    790     487.144836   1845        356          1218.927626  1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nemaid3tr1    852     549.122606   1792        3            871.770849   1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nemaid4tr1    675     372.190166   1005        20           88.963433    161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nemaid62tr13  2150    1846.657210  9921        9825         9919.902997  11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nemaid63tr13  406     103.720396   360         270          271.097003   1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nemaid61tr13  447     144.526787   6           0            0.000000     0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nemaid21tr8   2075    1771.684102  2782        58           958.636395   1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03

9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate
• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a minimal example is sketched after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
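Here is a minimal sketch of what such a qsub (PBS/Torque) submission script might look like. The resource requests, module name, and pipeline script name are placeholders; adapt them to your own cluster:

#!/bin/bash
#PBS -N my_rnaseq_pipeline
#PBS -l nodes=1:ppn=8,mem=32gb,walltime=24:00:00

# start in the directory the job was submitted from
cd ${PBS_O_WORKDIR}

# load whatever software modules your cluster provides (names vary)
module load Trimmomatic

# run the pipeline script you already tested interactively on small data
bash my_pipeline.sh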

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing.

ANGUS - summer NGS course - lots of resources and materials, and book reference.

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.


Page 5: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

2 Contents

CHAPTER 1

Welcome

11 1 Learning goals

For you

bull get a first (or second) look at tools

bull gain some experience in the basic command line

bull get 80 of way to a complete analysis of some data

For us

bull what are the on campus needs who are the on-campus people

12 2 Safe space and code of conduct

This is intended to be a safe and friendly place for learning

Please see the Software Carpentry workshop Code of Conduct httpsoftware-carpentryorgconducthtml

In particular please ask questions because I guarantee you that your question will help others

13 3 Instructor introductions

Titus Brown - prof here at UC Davis in the School of Vet Med

Tamer Mansour - MDPhD postdoc with Titus

Camille Scott - CS graduate student here at Davis

14 4 Amazon and cloud computing - why

bull simplifies software installation

bull can be used for bigger analyses quite easily

bull good for ldquoburstrdquo capacity (just got a data set)

bull accessible everywhere

3

labibi Documentation Release 10

bull they give us $100 gift certificates

15 5 Sticky notes and how they work + Minute Cards

Basic rules

bull no sticky note - ldquoworking on itrdquo

bull green sticky note - ldquoall is wellrdquo

bull red sticky note - ldquoneed helprdquo

Place the sticky notes where we can see them from the back of the room ndash eg on the back of your laptop

At the end of each session (coffee break lunch end of day) please write down on an index card one thing you learnedand one thing yoursquore still confused about

16 6 Warning

The bit that isnrsquot going to work is at the beginning This is unavoidable To reward you we have a coffee break at1030am

mdash

Next Non-model organisms and RNAseq

4 Chapter 1 Welcome

CHAPTER 2

Non-model organisms and RNAseq

With non-model systems where there is neither a good genome nor a lot of mRNAseq data you have to build yourown transcriptome from scratch ndash so-called ldquode novo transcriptome assemblyrdquo There are a few programs to do this -most notably Trinity and Oases - and we have found little difference

The main problem yoursquoll run into with non-model mRNAseq is that the output is fairly noisy with respect to splicevariants Our experience has been that many of these splice variants are probably ldquorealrdquo ndash in the sense of actuallypresent ndash but may be biological ldquonoiserdquo in the sense that they are not actually functional (See this excellent paper byPickrell and Pritchard making the case) Regardless therersquos little that you can do about this although we will talkabout it a bit on the second day

21 The overall process

_staticnonmodel-rnaseq-pipelinepng

bull Copy over your RNAseq data (from two or more samples)

bull Trim primers and junk from sequence (Short read quality and trimming)

bull Do abundance normalization (Running digital normalization)

bull Assemble everything together (Running the actual assembly)

This gives you an assembled transcriptome consisting of many transcripts and transcript families

At this point you can do one or more of the following

bull Annotate your transcripts (Annotation of denovo transcriptome)

bull Quantify your transcripts and examine differential expression (Quantification and Differential Expression)

bull BLAST your transcripts individually (BLASTing your assembled data)

Next Getting started with Amazon EC2

5

labibi Documentation Release 10

6 Chapter 2 Non-model organisms and RNAseq

CHAPTER 3

Getting started with Amazon EC2

Short version

bull go to httpawsamazoncom log in then ldquoEC2rdquo (upper left)

bull select ldquoLaunch instancerdquo

bull select ldquoUbuntu 1404rdquo from the list

bull select ldquom3xlargerdquo from the list (towards bottom of ldquoGeneral purposerdquo)

bull click ldquoReview and launchrdquo

bull select ldquoLaunchrdquo

bull if your first time through create a key pair otherwise select existing

bull click ldquolaunch instancerdquo

31 More details

311 Start up an EC2 instance

Log in

Go to lsquohttpsawsamazoncomlsquo in a Web browser

Select lsquoMy AccountConsolersquo menu option lsquoAWS Management Consolerdquo

Log in with your username amp password

Click on EC2 (upper left)

7

labibi Documentation Release 10

Select your zone

Many of the resources that we use are hosted by Amazon on the East coast Make sure that your dashboard has lsquoNVirginiarsquo on the upper right

Then click on Launch Instance

Select the machine operating system to boot

Find the ldquoUbuntu Server 1404rdquo image in the first list to show up

8 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Choose the machine size

Select lsquoGeneral purposersquo lsquom3xlargersquo and then lsquoReview and Launchrsquo

Confirm and launch

Review the details (ignore the warnings) and click on Launch

31 More details 9

labibi Documentation Release 10

(First time through) generate a new key pair

If you donrsquot have any key pairs enter a key pair name and then download a key pair Then click Launch Instance

(Next times through) select an existing key pair

Select a key pair and click lsquoLaunchrsquo

10 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Click on View Instances

31 More details 11

labibi Documentation Release 10

Select the public DNS name for later use

312 Logging into your new instance ldquoin the cloudrdquo (Windows version)

First Download Putty and Puttygen from here httpwwwchiarkgreenendorguk~sgtathamputtydownloadhtml

Generate a ppk file from your pem file

(You only need to do this once for each key)

Open puttygen select ldquoLoadrdquo

12 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Find and load your lsquopemrsquo file itrsquos probably in your Downloads folder Note you have to select lsquoAll filesrsquo on thebottom

Load it

Now ldquosave private keyrdquo Put it somewhere easy to find

31 More details 13

labibi Documentation Release 10

Now that yoursquove generated your PPK file from your PEM file you can log in To do that

Logging into your EC2 instance with Putty

Open up putty and enter your hostname into the Host Name box

Now go find the lsquoSSHrsquo section and enter your ppk file (generated above by puttygen) Then select lsquoOpenrsquo

Log in as ldquoubunturdquo

14 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First find your private key file itrsquos the pem file you downloaded when starting up your EC2 instance It should be inyour Downloads folder Move it onto your desktop and rename it to lsquoamazonpemrsquo

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~Desktopamazonpem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~Desktopamazonpem ubuntuec2----compute-1amazonawscom

Here yoursquore logging in as user lsquoubuntursquo to the machine lsquoec2-174-129-122-189compute-1amazonawscomrsquo using theauthentication key located in lsquoamazonpemrsquo on your Desktop

Note you have to replace the stuff after the lsquorsquo sign with the name of the host see the red circle in

31 More details 15

labibi Documentation Release 10

At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs

16 Chapter 3 Getting started with Amazon EC2

CHAPTER 4

Short read quality and trimming

Note Reminder if yoursquore on Windows you should install mobaxterm

OK you should now be logged into your Amazon computer How exciting

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First letrsquos set up a place to work

sudo chmod a+rwxt mnt

This makes lsquomntrsquo a place where we can put data and working files

Next letrsquos install a few things

sudo apt-get updatesudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs which wersquoll use below along with some software prerequisites thatwersquoll need for other things below

42 Data source

Wersquore going to be using a subset of data from Tulin et al 2013 a paper looking at early transcription in the organismNematostella vectensis the sea anemone

43 1 Copying in some data to work with

Wersquove loaded subsets of the data onto an Amazon location for you to make everything faster for todayrsquos work Wersquoregoing to put the files on your computer locally under the directory mntdata

mkdir mntdata

Next letrsquos grab part of the data set

17

labibi Documentation Release 10

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_001extractfastqgz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001extractfastqgz-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002extractfastqgz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default UNIX makes things writeable by the file ownerLetrsquos fix that before we go on any further

chmod u-w

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First make a working directory this will be a place where you can futz around with a copy of the data without messingup your primary data

mkdir mntworkcd mntwork

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs mntdata

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001extractfastqgz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

Wersquore going to use FastQC to summarize the data We already installed lsquofastqcrsquo on our computer - thatrsquos what thelsquoapt-get installrsquo did above

Now run FastQC on two files

18 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

fastqc 0Hour_ATCACG_L002_R1_001extractfastqgzfastqc 0Hour_ATCACG_L002_R2_001extractfastqgz

Now type lsquolsrsquo

ls -d fastqc

to list the files and you should see

0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip

We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you

bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now wersquore going to do some trimming Wersquoll be using Trimmomatic which (as with fastqc) wersquove already installedvia apt-get

The first thing wersquoll need are the adapters to trim off

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-semi-2015-03-04TruSeq2-PEfa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001extractfastqgz 0Hour_ATCACG_L002_R2_001extractfastqgz

0Hour_ATCACG_L002_R1_001qcfqgz s1_se 0Hour_ATCACG_L002_R2_001qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

You should see output that looks like thisQuality encoding detected as phred33Input Read Pairs 140557 Both Surviving 138775 (9873) Forward Only Surviving 1776 (126) Reverse Only Surviving 6 (000) Dropped 0 (000)TrimmomaticPE Completed successfully

46 3 Trimmomatic 19

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz

20 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

And link it in

cd mntworkln -fs mntdatafastqgz

Now we have a lot of files ndash and we really donrsquot want to trim each and every one of them by typing in a command foreach pair Here wersquoll make use of a great feature of the UNIX command line ndash the ability to automate such tasks

Herersquos a for loop that you can run - wersquoll walk through what it does while itrsquos running

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention –

bull lines starting with # are comments

bull anywhere you see a '$' it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop

bull we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about (see the small example just below)

bull what's with 'orphans.fq'?
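
If the variable substitutions are confusing, here is a tiny sketch you can paste into your shell (the filename is just a made-up example) to see what each step produces:

# hypothetical filename, only to illustrate the substitutions
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

base=$(basename $filename .extract.fastq.gz)
echo $base       # prints 0Hour_ATCACG_L002_R1_001

baseR2=${base/_R1_/_R2_}
echo $baseR2     # prints 0Hour_ATCACG_L002_R2_001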

Questions

bull how do you figure out if it's working?

ndash copy/paste it from Word


ndash put in lots of echo

ndash edit one line at a time

bull how on earth do you figure out how to do this

4.9 6. Interleave the sequences

Next we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next Running digital normalization


CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First wersquoll run it on the interleaved files we generated in the previous section

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
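
If you are curious how many reads diginorm kept, you can compare line counts before and after. This is just a sanity-check sketch; it assumes normalize-by-median.py wrote its output next to the inputs with a .keep suffix, and each read record is 4 lines:

# line count of one interleaved input (divide by 4 for the number of reads)
gunzip -c 0Hour_ATCACG_L002001.pe.qc.fq.gz | wc -l

# line count of the reads kept after normalization
wc -l 0Hour_ATCACG_L002001.pe.qc.fq.gz.keep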

Next run diginorm on the orphaned reads (from trimming)

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done


Now if you type

ls *.dn.fq.gz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001.dn.fq.gz    6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz    6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz    6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz    6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Letrsquos remove some of the detritus before moving on

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next Running the actual assembly


CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make

Go into the work directory and prepare the data

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now run the Trinity assembler

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.


Next BLASTing your assembled data


CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now copy in your newly created transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.


CHAPTER 8

Annotation of denovo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components genes and isoforms

bull The different (i)'s that correspond to the same (g) represent isoforms.

bull The different (g)'s could represent different genes (or parts of genes).

bull The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details
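
The map is just two tab-separated columns (gene identifier, then transcript identifier), so standard UNIX tools give a quick summary. For example, a rough count of Trinity 'genes' vs. transcripts:

# number of distinct Trinity 'genes' (column 1)
cut -f1 Trinity.fasta.gene_trans_map | sort -u | wc -l

# number of transcripts / isoforms (column 2)
cut -f2 Trinity.fasta.gene_trans_map | sort -u | wc -l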

8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install Transdecoder to do this job

cd
sudo cpan URI::Escape

Note type yes for all interactive questions

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the Transdecoder software to identify the longest-ORF peptide


cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the Transdecoder output

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
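
As a quick sanity check, you can also count how many candidate ORFs TransDecoder extracted by counting the FASTA headers in that file:

grep -c '^>' Trinity.fasta.transdecoder_dir/longest_orfs.pep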

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence.

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have already run them overnight. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
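
Whichever database you used, the -outfmt 6 files are plain tab-separated text (query id, subject id, percent identity, alignment length, mismatches, gap opens, query/subject coordinates, e-value, bit score), so you can inspect them directly. A small sketch using the SwissProt results:

head -3 blastx.outfmt6

# how many distinct transcripts picked up at least one hit
cut -f1 blastx.outfmt6 | sort -u | wc -l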

8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains (a quick way to peek at the resulting table is shown just after this list):

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like:

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER
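
Back to the hmmscan run from step 1: the --domtblout table is plain text with '#' comment lines, so once it finishes you can peek at the domain hits like this (just a quick check):

grep -v '^#' TrinotatePFAM.out | head -5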

8.5 Integration of all annotations into one database

install Trinotate

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

install sqlite

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load the Uniref90 blast hits:


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional Loading functional annotation features

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation

-E <float>: maximum E-value for reporting best blast hit and associated annotations

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell (on your local machine) and copy the report over:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
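
Despite the .xls extension, the report is a tab-separated text file, so you can also inspect it directly on the Amazon machine. For example, to list its column headers (a small sketch):

head -1 /mnt/work/trinotate_annotation_report.xls | tr '\t' '\n'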


CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03

9.4 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
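
If you'd like the same quick check for every sample rather than just one, a small loop (same idea as above) does it; bowtie2 prints an alignment summary for each file:

for fq in *_R1_*.extract.fastq.gz
do
    echo $fq
    zcat $fq | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done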


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.

Next More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or <insert other platform here> to run these analyses


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazon/using-screen for a more complete rundown on instructions.)
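
One extra convenience (not required, just handy): you can give each screen session a name, which makes it easier to pick the right one when you reconnect:

screen -S diginorm     # start a session named 'diginorm'
# ... run your long command, detach with ctrl-a d ...
screen -r diginorm     # later, reconnect to that session by name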

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
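
If your data are paired-end, subset both files the same way so the mates stay matched; a small sketch (assuming a matching _2 file exists):

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_2.fq.gz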


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks     Used  Available  Use%  Mounted on
/dev/xvdb       38565344 20098736   16500940   55%  /mnt

which tells you that /mnt has about 40 GB of disk space.

To add disk space to your Amazon instance see this set of instructions

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.


  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 6: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

CHAPTER 1

Welcome

11 1 Learning goals

For you

bull get a first (or second) look at tools

bull gain some experience in the basic command line

bull get 80 of way to a complete analysis of some data

For us

bull what are the on campus needs who are the on-campus people

12 2 Safe space and code of conduct

This is intended to be a safe and friendly place for learning

Please see the Software Carpentry workshop Code of Conduct httpsoftware-carpentryorgconducthtml

In particular please ask questions because I guarantee you that your question will help others

13 3 Instructor introductions

Titus Brown - prof here at UC Davis in the School of Vet Med

Tamer Mansour - MDPhD postdoc with Titus

Camille Scott - CS graduate student here at Davis

14 4 Amazon and cloud computing - why

bull simplifies software installation

bull can be used for bigger analyses quite easily

bull good for ldquoburstrdquo capacity (just got a data set)

bull accessible everywhere

3

labibi Documentation Release 10

bull they give us $100 gift certificates

15 5 Sticky notes and how they work + Minute Cards

Basic rules

bull no sticky note - ldquoworking on itrdquo

bull green sticky note - ldquoall is wellrdquo

bull red sticky note - ldquoneed helprdquo

Place the sticky notes where we can see them from the back of the room ndash eg on the back of your laptop

At the end of each session (coffee break lunch end of day) please write down on an index card one thing you learnedand one thing yoursquore still confused about

16 6 Warning

The bit that isnrsquot going to work is at the beginning This is unavoidable To reward you we have a coffee break at1030am

mdash

Next Non-model organisms and RNAseq

4 Chapter 1 Welcome

CHAPTER 2

Non-model organisms and RNAseq

With non-model systems where there is neither a good genome nor a lot of mRNAseq data you have to build yourown transcriptome from scratch ndash so-called ldquode novo transcriptome assemblyrdquo There are a few programs to do this -most notably Trinity and Oases - and we have found little difference

The main problem yoursquoll run into with non-model mRNAseq is that the output is fairly noisy with respect to splicevariants Our experience has been that many of these splice variants are probably ldquorealrdquo ndash in the sense of actuallypresent ndash but may be biological ldquonoiserdquo in the sense that they are not actually functional (See this excellent paper byPickrell and Pritchard making the case) Regardless therersquos little that you can do about this although we will talkabout it a bit on the second day

21 The overall process

_staticnonmodel-rnaseq-pipelinepng

bull Copy over your RNAseq data (from two or more samples)

bull Trim primers and junk from sequence (Short read quality and trimming)

bull Do abundance normalization (Running digital normalization)

bull Assemble everything together (Running the actual assembly)

This gives you an assembled transcriptome consisting of many transcripts and transcript families

At this point you can do one or more of the following

bull Annotate your transcripts (Annotation of denovo transcriptome)

bull Quantify your transcripts and examine differential expression (Quantification and Differential Expression)

bull BLAST your transcripts individually (BLASTing your assembled data)

Next Getting started with Amazon EC2

5

labibi Documentation Release 10

6 Chapter 2 Non-model organisms and RNAseq

CHAPTER 3

Getting started with Amazon EC2

Short version

bull go to httpawsamazoncom log in then ldquoEC2rdquo (upper left)

bull select ldquoLaunch instancerdquo

bull select ldquoUbuntu 1404rdquo from the list

bull select ldquom3xlargerdquo from the list (towards bottom of ldquoGeneral purposerdquo)

bull click ldquoReview and launchrdquo

bull select ldquoLaunchrdquo

bull if your first time through create a key pair otherwise select existing

bull click ldquolaunch instancerdquo

31 More details

311 Start up an EC2 instance

Log in

Go to lsquohttpsawsamazoncomlsquo in a Web browser

Select lsquoMy AccountConsolersquo menu option lsquoAWS Management Consolerdquo

Log in with your username amp password

Click on EC2 (upper left)

7

labibi Documentation Release 10

Select your zone

Many of the resources that we use are hosted by Amazon on the East coast Make sure that your dashboard has lsquoNVirginiarsquo on the upper right

Then click on Launch Instance

Select the machine operating system to boot

Find the ldquoUbuntu Server 1404rdquo image in the first list to show up

8 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Choose the machine size

Select lsquoGeneral purposersquo lsquom3xlargersquo and then lsquoReview and Launchrsquo

Confirm and launch

Review the details (ignore the warnings) and click on Launch

31 More details 9

labibi Documentation Release 10

(First time through) generate a new key pair

If you donrsquot have any key pairs enter a key pair name and then download a key pair Then click Launch Instance

(Next times through) select an existing key pair

Select a key pair and click lsquoLaunchrsquo

10 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Click on View Instances

31 More details 11

labibi Documentation Release 10

Select the public DNS name for later use

312 Logging into your new instance ldquoin the cloudrdquo (Windows version)

First Download Putty and Puttygen from here httpwwwchiarkgreenendorguk~sgtathamputtydownloadhtml

Generate a ppk file from your pem file

(You only need to do this once for each key)

Open puttygen select ldquoLoadrdquo

12 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Find and load your lsquopemrsquo file itrsquos probably in your Downloads folder Note you have to select lsquoAll filesrsquo on thebottom

Load it

Now ldquosave private keyrdquo Put it somewhere easy to find

31 More details 13

labibi Documentation Release 10

Now that yoursquove generated your PPK file from your PEM file you can log in To do that

Logging into your EC2 instance with Putty

Open up putty and enter your hostname into the Host Name box

Now go find the lsquoSSHrsquo section and enter your ppk file (generated above by puttygen) Then select lsquoOpenrsquo

Log in as ldquoubunturdquo

14 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First find your private key file itrsquos the pem file you downloaded when starting up your EC2 instance It should be inyour Downloads folder Move it onto your desktop and rename it to lsquoamazonpemrsquo

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~Desktopamazonpem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~Desktopamazonpem ubuntuec2----compute-1amazonawscom

Here yoursquore logging in as user lsquoubuntursquo to the machine lsquoec2-174-129-122-189compute-1amazonawscomrsquo using theauthentication key located in lsquoamazonpemrsquo on your Desktop

Note you have to replace the stuff after the lsquorsquo sign with the name of the host see the red circle in

31 More details 15

labibi Documentation Release 10

At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs

16 Chapter 3 Getting started with Amazon EC2

CHAPTER 4

Short read quality and trimming

Note Reminder if yoursquore on Windows you should install mobaxterm

OK you should now be logged into your Amazon computer How exciting

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First letrsquos set up a place to work

sudo chmod a+rwxt mnt

This makes lsquomntrsquo a place where we can put data and working files

Next letrsquos install a few things

sudo apt-get updatesudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs which wersquoll use below along with some software prerequisites thatwersquoll need for other things below

42 Data source

Wersquore going to be using a subset of data from Tulin et al 2013 a paper looking at early transcription in the organismNematostella vectensis the sea anemone

43 1 Copying in some data to work with

Wersquove loaded subsets of the data onto an Amazon location for you to make everything faster for todayrsquos work Wersquoregoing to put the files on your computer locally under the directory mntdata

mkdir mntdata

Next letrsquos grab part of the data set

17

labibi Documentation Release 10

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_001extractfastqgz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001extractfastqgz-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002extractfastqgz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default UNIX makes things writeable by the file ownerLetrsquos fix that before we go on any further

chmod u-w

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First make a working directory this will be a place where you can futz around with a copy of the data without messingup your primary data

mkdir mntworkcd mntwork

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs mntdata

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001extractfastqgz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

Wersquore going to use FastQC to summarize the data We already installed lsquofastqcrsquo on our computer - thatrsquos what thelsquoapt-get installrsquo did above

Now run FastQC on two files

18 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

fastqc 0Hour_ATCACG_L002_R1_001extractfastqgzfastqc 0Hour_ATCACG_L002_R2_001extractfastqgz

Now type lsquolsrsquo

ls -d fastqc

to list the files and you should see

0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip

We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you

bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now wersquore going to do some trimming Wersquoll be using Trimmomatic which (as with fastqc) wersquove already installedvia apt-get

The first thing wersquoll need are the adapters to trim off

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-semi-2015-03-04TruSeq2-PEfa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001extractfastqgz 0Hour_ATCACG_L002_R2_001extractfastqgz

0Hour_ATCACG_L002_R1_001qcfqgz s1_se 0Hour_ATCACG_L002_R2_001qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

You should see output that looks like thisQuality encoding detected as phred33Input Read Pairs 140557 Both Surviving 138775 (9873) Forward Only Surviving 1776 (126) Reverse Only Surviving 6 (000) Dropped 0 (000)TrimmomaticPE Completed successfully

46 3 Trimmomatic 19

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz

20 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

And link it in

cd mntworkln -fs mntdatafastqgz

Now we have a lot of files ndash and we really donrsquot want to trim each and every one of them by typing in a command foreach pair Here wersquoll make use of a great feature of the UNIX command line ndash the ability to automate such tasks

Herersquos a for loop that you can run - wersquoll walk through what it does while itrsquos running

rm -f orphansfq

for filename in _R1_extractfastqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename extractfastqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

finally run TrimmomaticTrimmomaticPE $baseextractfastqgz $baseR2extractfastqgz

$baseqcfqgz s1_se $baseR2qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

save the orphanscat s1_se s2_se gtgt orphansfq

done

Things to mention ndash

bull are comments

bull anywhere you see a lsquo$rsquo is replaced by the value of the variable after it so eg $filename is replaced by each ofthe files matching _R1_extractfastqgz once for each time through the loop

bull we have to do complicated things to the filenames to get this to work which is what the $base_R1__R2_stuff is about

bull whatrsquos with lsquoorphansfqrsquo

Questions

bull how do you figure out if itrsquos working

ndash copypaste it from Word

48 5 Trim the rest of the sequences 21

labibi Documentation Release 10

ndash put in lots of echo

ndash edit one line at a time

bull how on earth do you figure out how to do this

49 6 Interleave the sequences

Next we need to take these R1 and R2 sequences and convert them into interleaved form for the next step To do thiswersquoll use scripts from the khmer package which we need to install

sudo pip install -U setuptoolssudo pip install khmer==13

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in _R1_qcfqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename qcfqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

construct the output filenameoutput=$base_R1_peqcfqgz

interleave-readspy $baseqcfqgz $baseR2qcfqgz | gzip gt $output

done

gzip orphansfq

Next Running digital normalization

22 Chapter 4 Short read quality and trimming

CHAPTER 5

Running digital normalization

Next wersquore going to apply abundance normalization to the data ndash known as ldquodigital normalizationrdquo this approach wasdeveloped by our lab to make it possible to assemble large data sets more quickly and easily You can read more aboutit in Brown et al 2012 and also see some of its affects on transcriptome assembly in Lowe et al 2014

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First wersquoll run it on the interleaved files we generated in the previous section

cd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

(These parameters should work for essentially all mRNAseq data sets see the khmer documentation for more infor-mation)

Next run diginorm on the orphaned reads (from trimming)

normalize-by-medianpy -l normC20k20ct -s normC20k20ct orphansfqgz

Do k-mer abundance trimming on the reads which will eliminate the majority of the errors (thus further decreasingthe memory requirements) ndash

filter-abundpy -V normC20k20ct keep

See our paper Zhang et al 2014 lthttpwwwncbinlmnihgovpubmed25062443gtlsquo__ Table 3 for more informationon k-mer trimming effects

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in pekeepabundfiltdo

extract-paired-readspy $filenamedone

Put all the orphaned reads in one place

cat se orphansfqgzkeepabundfilt | gzip gt orphansdnfqgz

And now rename the paired-end files to something nice

for filename in peqcfqgzkeepabundfiltpedo

base=$(basename $filename peqcfqgzkeepabundfiltpe)output=$basednfqgzgzip -c $filename gt $output

done

23

labibi Documentation Release 10

Now if you type

ls dnfqgz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001dnfqgz 6Hour_CGATGT_L002002dnfqgz0Hour_ATCACG_L002002dnfqgz 6Hour_CGATGT_L002003dnfqgz0Hour_ATCACG_L002003dnfqgz 6Hour_CGATGT_L002004dnfqgz0Hour_ATCACG_L002004dnfqgz 6Hour_CGATGT_L002005dnfqgz0Hour_ATCACG_L002005dnfqgz orphansdnfqgz6Hour_CGATGT_L002001dnfqgz

Letrsquos remove some of the detritus before moving on

rm pe se abundfilt keeprm normC20k20ct

Next Running the actual assembly

24 Chapter 5 Running digital normalization

CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next install Trinity v206

cdcurl -L httpsgithubcomtrinityrnaseqtrinityrnaseqarchivev206targz gt trinitytargztar xzf trinitytargzmv trinityrnaseq trinity

cd trinitymake

Go into the work directory and prepare the data

cd mntworkfor i in dnfqgzdo

split-paired-readspy $idone

cat 1 gt leftfqcat 2 gt rightfq

Now run the Trinity assembler

~trinityTrinity --left leftfq --right rightfq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dirTrinityfasta which you can get stats on like so

curl -L -O httpsgithubcomged-labkhmerrawv13sandboxassemstats3pypython assemstats3py 300 trinity_out_dirTrinityfasta

Change the filename and rename all the sequences

gzip -c trinity_out_dirTrinityfasta gt trinity-nematostella-rawfagzcurl -O http2015-may-nonmodelreadthedocsorgendev_staticrename-with-partitionspychmod u+x rename-with-partitionspyrename-with-partitionspy nema trinity-nematostella-rawfagz

This last command will give you trinity-nematostella-rawrenamedfastagz which contains all ofthe renamed sequences

25

labibi Documentation Release 10

Next BLASTing your assembled data

26 Chapter 6 Running the actual assembly

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.


CHAPTER 8

Annotation of denovo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.

• The different (g's) could represent different genes (or parts of genes).

• The components (TR|c...) often contain related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details
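To get a feel for the format, peek at the first couple of lines; each line has two tab-separated columns, gene (component) ID then transcript (isoform) ID. The IDs shown in the comment below are made up for illustration -- yours will differ:

head -2 Trinity.fasta.gene_trans_map

# hypothetical output -- gene ID on the left, isoform ID on the right:
# TR1|c0_g1    TR1|c0_g1_i1
# TR1|c0_g1    TR1|c0_g1_i2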

8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type "yes" for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:


cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases.

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies.

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have already run them overnight. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6

8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features like the following (see the sketch after this list):

• signal peptides, using signalP

• transmembrane regions, using tmHMM

• rRNA transcripts, using RNAMMER
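These predictors are not installed by the apt-get commands above; they need separate (licensed) downloads. As a rough sketch only -- assuming signalP and tmHMM are installed and on your PATH, and following the style of the Trinotate documentation -- the commands look something like this:

cd /mnt/work

# signal peptide prediction on the TransDecoder peptides (sketch; check your signalP install)
signalp -f short -n signalp.out Trinity.fasta.transdecoder_dir/longest_orfs.pep

# transmembrane region prediction (sketch; check your tmHMM install)
tmhmm --short < Trinity.fasta.transdecoder_dir/longest_orfs.pep > tmhmm.out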

8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI Perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)

2. Protein sequences (currently as defined by TransDecoder)

3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation:

-E <float> : maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string> :

1. 'DNC' : domain noise cutoff (default)

2. 'DGC' : domain gathering cutoff

3. 'DTC' : domain trusted cutoff

4. 'SNC' : sequence noise cutoff

5. 'SGC' : sequence gathering cutoff

6. 'STC' : sequence trusted cutoff
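For example, a minimal sketch of regenerating the report with a stricter E-value and an explicit Pfam cutoff (both values here are arbitrary illustrations, not recommendations):

~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-5 --pfam_cutoff DNC > trinotate_annotation_report.xls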

Let us see the output. Open a new shell on your local computer and copy the report over:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nemaid7.tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nemaid1.tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nemaid2.tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nemaid3.tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nemaid4.tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nemaid62.tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nemaid63.tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nemaid61.tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nemaid21.tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03

9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate

• Learn R with Swirl

So download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
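If you want a quick look at the top hits without opening Excel, you can sort the CSV on the command line; this is just a convenience sketch that assumes the FDR-adjusted P-value is the last (fifth) column of nema-edgeR.csv:

cd /mnt/work

# keep the header, then sort the remaining rows numerically on the FDR column
(head -1 nema-edgeR.csv; tail -n +2 nema-edgeR.csv | sort -t, -k5,5g) | head -11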


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
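To spot-check several samples at once, you can wrap the same command in a loop; this is a sketch that reuses the subsetting trick above (the first 400,000 lines of each file) so it runs quickly. bowtie2 prints the alignment rate for each file to the terminal:

cd /mnt/work

for fq in 0Hour_ATCACG_L002_R1_001.extract.fastq.gz 6Hour_CGATGT_L002_R1_001.extract.fastq.gz
do
   echo "== ${fq} =="
   zcat ${fq} | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done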


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script (see the sketch after this list). Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
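As a concrete illustration of step 4, here is a minimal sketch of what such a script might look like; the filenames and commands are placeholders taken from this tutorial, not a ready-made pipeline:

#!/bin/bash
set -e            # stop at the first failing command
set -x            # echo each command as it runs, so you can see where you are

cd /mnt/work

# run the same steps you tested interactively, on your small data subset first
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz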

Next More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
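For example, reusing the hypothetical FTP details from the 'Downloading your data' section above, a single-file download run on your Amazon machine might look like this (replace the URL, username, and password with whatever your sequencing center actually gives you):

cd /mnt/data
wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz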

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.


Page 7: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

bull they give us $100 gift certificates

15 5 Sticky notes and how they work + Minute Cards

Basic rules

bull no sticky note - ldquoworking on itrdquo

bull green sticky note - ldquoall is wellrdquo

bull red sticky note - ldquoneed helprdquo

Place the sticky notes where we can see them from the back of the room ndash eg on the back of your laptop

At the end of each session (coffee break lunch end of day) please write down on an index card one thing you learnedand one thing yoursquore still confused about

16 6 Warning

The bit that isnrsquot going to work is at the beginning This is unavoidable To reward you we have a coffee break at1030am

mdash

Next Non-model organisms and RNAseq

4 Chapter 1 Welcome

CHAPTER 2

Non-model organisms and RNAseq

With non-model systems where there is neither a good genome nor a lot of mRNAseq data you have to build yourown transcriptome from scratch ndash so-called ldquode novo transcriptome assemblyrdquo There are a few programs to do this -most notably Trinity and Oases - and we have found little difference

The main problem yoursquoll run into with non-model mRNAseq is that the output is fairly noisy with respect to splicevariants Our experience has been that many of these splice variants are probably ldquorealrdquo ndash in the sense of actuallypresent ndash but may be biological ldquonoiserdquo in the sense that they are not actually functional (See this excellent paper byPickrell and Pritchard making the case) Regardless therersquos little that you can do about this although we will talkabout it a bit on the second day

21 The overall process

_staticnonmodel-rnaseq-pipelinepng

bull Copy over your RNAseq data (from two or more samples)

bull Trim primers and junk from sequence (Short read quality and trimming)

bull Do abundance normalization (Running digital normalization)

bull Assemble everything together (Running the actual assembly)

This gives you an assembled transcriptome consisting of many transcripts and transcript families

At this point you can do one or more of the following

bull Annotate your transcripts (Annotation of denovo transcriptome)

bull Quantify your transcripts and examine differential expression (Quantification and Differential Expression)

bull BLAST your transcripts individually (BLASTing your assembled data)

Next Getting started with Amazon EC2

5

labibi Documentation Release 10

6 Chapter 2 Non-model organisms and RNAseq

CHAPTER 3

Getting started with Amazon EC2

Short version

bull go to httpawsamazoncom log in then ldquoEC2rdquo (upper left)

bull select ldquoLaunch instancerdquo

bull select ldquoUbuntu 1404rdquo from the list

bull select ldquom3xlargerdquo from the list (towards bottom of ldquoGeneral purposerdquo)

bull click ldquoReview and launchrdquo

bull select ldquoLaunchrdquo

bull if your first time through create a key pair otherwise select existing

bull click ldquolaunch instancerdquo

31 More details

311 Start up an EC2 instance

Log in

Go to lsquohttpsawsamazoncomlsquo in a Web browser

Select lsquoMy AccountConsolersquo menu option lsquoAWS Management Consolerdquo

Log in with your username amp password

Click on EC2 (upper left)

7

labibi Documentation Release 10

Select your zone

Many of the resources that we use are hosted by Amazon on the East coast Make sure that your dashboard has lsquoNVirginiarsquo on the upper right

Then click on Launch Instance

Select the machine operating system to boot

Find the ldquoUbuntu Server 1404rdquo image in the first list to show up

8 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Choose the machine size

Select lsquoGeneral purposersquo lsquom3xlargersquo and then lsquoReview and Launchrsquo

Confirm and launch

Review the details (ignore the warnings) and click on Launch

31 More details 9

labibi Documentation Release 10

(First time through) generate a new key pair

If you donrsquot have any key pairs enter a key pair name and then download a key pair Then click Launch Instance

(Next times through) select an existing key pair

Select a key pair and click lsquoLaunchrsquo

10 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Click on View Instances

31 More details 11

labibi Documentation Release 10

Select the public DNS name for later use

312 Logging into your new instance ldquoin the cloudrdquo (Windows version)

First Download Putty and Puttygen from here httpwwwchiarkgreenendorguk~sgtathamputtydownloadhtml

Generate a ppk file from your pem file

(You only need to do this once for each key)

Open puttygen select ldquoLoadrdquo

12 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Find and load your lsquopemrsquo file itrsquos probably in your Downloads folder Note you have to select lsquoAll filesrsquo on thebottom

Load it

Now ldquosave private keyrdquo Put it somewhere easy to find

31 More details 13

labibi Documentation Release 10

Now that yoursquove generated your PPK file from your PEM file you can log in To do that

Logging into your EC2 instance with Putty

Open up putty and enter your hostname into the Host Name box

Now go find the lsquoSSHrsquo section and enter your ppk file (generated above by puttygen) Then select lsquoOpenrsquo

Log in as ldquoubunturdquo

14 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First find your private key file itrsquos the pem file you downloaded when starting up your EC2 instance It should be inyour Downloads folder Move it onto your desktop and rename it to lsquoamazonpemrsquo

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~Desktopamazonpem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~Desktopamazonpem ubuntuec2----compute-1amazonawscom

Here yoursquore logging in as user lsquoubuntursquo to the machine lsquoec2-174-129-122-189compute-1amazonawscomrsquo using theauthentication key located in lsquoamazonpemrsquo on your Desktop

Note you have to replace the stuff after the lsquorsquo sign with the name of the host see the red circle in

31 More details 15

labibi Documentation Release 10

At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs

16 Chapter 3 Getting started with Amazon EC2

CHAPTER 4

Short read quality and trimming

Note Reminder if yoursquore on Windows you should install mobaxterm

OK you should now be logged into your Amazon computer How exciting

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First letrsquos set up a place to work

sudo chmod a+rwxt mnt

This makes lsquomntrsquo a place where we can put data and working files

Next letrsquos install a few things

sudo apt-get updatesudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs which wersquoll use below along with some software prerequisites thatwersquoll need for other things below

42 Data source

Wersquore going to be using a subset of data from Tulin et al 2013 a paper looking at early transcription in the organismNematostella vectensis the sea anemone

43 1 Copying in some data to work with

Wersquove loaded subsets of the data onto an Amazon location for you to make everything faster for todayrsquos work Wersquoregoing to put the files on your computer locally under the directory mntdata

mkdir mntdata

Next letrsquos grab part of the data set

17

labibi Documentation Release 10

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_001extractfastqgz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001extractfastqgz-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002extractfastqgz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default UNIX makes things writeable by the file ownerLetrsquos fix that before we go on any further

chmod u-w

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First make a working directory this will be a place where you can futz around with a copy of the data without messingup your primary data

mkdir mntworkcd mntwork

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs mntdata

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001extractfastqgz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

Wersquore going to use FastQC to summarize the data We already installed lsquofastqcrsquo on our computer - thatrsquos what thelsquoapt-get installrsquo did above

Now run FastQC on two files

18 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

fastqc 0Hour_ATCACG_L002_R1_001extractfastqgzfastqc 0Hour_ATCACG_L002_R2_001extractfastqgz

Now type lsquolsrsquo

ls -d fastqc

to list the files and you should see

0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip

We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you

bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now wersquore going to do some trimming Wersquoll be using Trimmomatic which (as with fastqc) wersquove already installedvia apt-get

The first thing wersquoll need are the adapters to trim off

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-semi-2015-03-04TruSeq2-PEfa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001extractfastqgz 0Hour_ATCACG_L002_R2_001extractfastqgz

0Hour_ATCACG_L002_R1_001qcfqgz s1_se 0Hour_ATCACG_L002_R2_001qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

You should see output that looks like thisQuality encoding detected as phred33Input Read Pairs 140557 Both Surviving 138775 (9873) Forward Only Surviving 1776 (126) Reverse Only Surviving 6 (000) Dropped 0 (000)TrimmomaticPE Completed successfully

46 3 Trimmomatic 19

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz

20 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

And link it in

cd mntworkln -fs mntdatafastqgz

Now we have a lot of files ndash and we really donrsquot want to trim each and every one of them by typing in a command foreach pair Here wersquoll make use of a great feature of the UNIX command line ndash the ability to automate such tasks

Herersquos a for loop that you can run - wersquoll walk through what it does while itrsquos running

rm -f orphansfq

for filename in _R1_extractfastqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename extractfastqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

finally run TrimmomaticTrimmomaticPE $baseextractfastqgz $baseR2extractfastqgz

$baseqcfqgz s1_se $baseR2qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

save the orphanscat s1_se s2_se gtgt orphansfq

done

Things to mention ndash

bull are comments

bull anywhere you see a lsquo$rsquo is replaced by the value of the variable after it so eg $filename is replaced by each ofthe files matching _R1_extractfastqgz once for each time through the loop

bull we have to do complicated things to the filenames to get this to work which is what the $base_R1__R2_stuff is about

bull whatrsquos with lsquoorphansfqrsquo

Questions

bull how do you figure out if itrsquos working

ndash copypaste it from Word

48 5 Trim the rest of the sequences 21

labibi Documentation Release 10

ndash put in lots of echo

ndash edit one line at a time

bull how on earth do you figure out how to do this

49 6 Interleave the sequences

Next we need to take these R1 and R2 sequences and convert them into interleaved form for the next step To do thiswersquoll use scripts from the khmer package which we need to install

sudo pip install -U setuptoolssudo pip install khmer==13

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in _R1_qcfqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename qcfqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

construct the output filenameoutput=$base_R1_peqcfqgz

interleave-readspy $baseqcfqgz $baseR2qcfqgz | gzip gt $output

done

gzip orphansfq

Next Running digital normalization

22 Chapter 4 Short read quality and trimming

CHAPTER 5

Running digital normalization

Next wersquore going to apply abundance normalization to the data ndash known as ldquodigital normalizationrdquo this approach wasdeveloped by our lab to make it possible to assemble large data sets more quickly and easily You can read more aboutit in Brown et al 2012 and also see some of its affects on transcriptome assembly in Lowe et al 2014

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First wersquoll run it on the interleaved files we generated in the previous section

cd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

(These parameters should work for essentially all mRNAseq data sets see the khmer documentation for more infor-mation)

Next run diginorm on the orphaned reads (from trimming)

normalize-by-medianpy -l normC20k20ct -s normC20k20ct orphansfqgz

Do k-mer abundance trimming on the reads which will eliminate the majority of the errors (thus further decreasingthe memory requirements) ndash

filter-abundpy -V normC20k20ct keep

See our paper Zhang et al 2014 lthttpwwwncbinlmnihgovpubmed25062443gtlsquo__ Table 3 for more informationon k-mer trimming effects

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in pekeepabundfiltdo

extract-paired-readspy $filenamedone

Put all the orphaned reads in one place

cat se orphansfqgzkeepabundfilt | gzip gt orphansdnfqgz

And now rename the paired-end files to something nice

for filename in peqcfqgzkeepabundfiltpedo

base=$(basename $filename peqcfqgzkeepabundfiltpe)output=$basednfqgzgzip -c $filename gt $output

done

23

labibi Documentation Release 10

Now if you type

ls dnfqgz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001dnfqgz 6Hour_CGATGT_L002002dnfqgz0Hour_ATCACG_L002002dnfqgz 6Hour_CGATGT_L002003dnfqgz0Hour_ATCACG_L002003dnfqgz 6Hour_CGATGT_L002004dnfqgz0Hour_ATCACG_L002004dnfqgz 6Hour_CGATGT_L002005dnfqgz0Hour_ATCACG_L002005dnfqgz orphansdnfqgz6Hour_CGATGT_L002001dnfqgz

Letrsquos remove some of the detritus before moving on

rm pe se abundfilt keeprm normC20k20ct

Next Running the actual assembly

24 Chapter 5 Running digital normalization

CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next install Trinity v206

cdcurl -L httpsgithubcomtrinityrnaseqtrinityrnaseqarchivev206targz gt trinitytargztar xzf trinitytargzmv trinityrnaseq trinity

cd trinitymake

Go into the work directory and prepare the data

cd mntworkfor i in dnfqgzdo

split-paired-readspy $idone

cat 1 gt leftfqcat 2 gt rightfq

Now run the Trinity assembler

~trinityTrinity --left leftfq --right rightfq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dirTrinityfasta which you can get stats on like so

curl -L -O httpsgithubcomged-labkhmerrawv13sandboxassemstats3pypython assemstats3py 300 trinity_out_dirTrinityfasta

Change the filename and rename all the sequences

gzip -c trinity_out_dirTrinityfasta gt trinity-nematostella-rawfagzcurl -O http2015-may-nonmodelreadthedocsorgendev_staticrename-with-partitionspychmod u+x rename-with-partitionspyrename-with-partitionspy nema trinity-nematostella-rawfagz

This last command will give you trinity-nematostella-rawrenamedfastagz which contains all ofthe renamed sequences

25

labibi Documentation Release 10

Next BLASTing your assembled data

26 Chapter 6 Running the actual assembly

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+httpsgithubcomctbpygr-drawgit screed

sudo ln -s usrbinblastall usrlocalbin

Install the BLAST server and configure it

cdgit clone httpsgithubcomctbblastkitgit -b 2015-may-nonmodelsudo blastkitconfigure-lighttpdsh

cd blastkitwwwsudo ln -fs $PWD varwwwblastkit

mkdir fileschmod a+rxwt fileschmod +x homeubuntu

cd homeubuntublastkitpython checkpy

Now copy in your newly created transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt homeubuntublastkitdbdbfa

cd homeubuntublastkitformatdb -i dbdbfa -o T -p Fpython index-dbpy dbdbfa

You can now access your BLAST server at httpltamazon machine namegtblastkit

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set try BLASTing zebrafish lethal giant larvaehomolog

27

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>: one of

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff
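
For example, to ask for a stricter E-value and use the domain gathering cutoff for the Pfam assignments, the report step might look like this (the output filename here is just an illustration, not part of the workshop materials):

# stricter E-value threshold plus the Pfam domain gathering cutoff
~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-5 --pfam_cutoff DGC > trinotate_annotation_report.strict.xls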

Let us see the output. Open a new shell on your local computer:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression levels (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03

9.4 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

- False Discovery Rate

- Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
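
If you want a quick look at the most significant genes without leaving the command line, something like this works; it assumes the FDR-adjusted P-value is the last (fifth) column of nema-edgeR.csv, so check the header first:

# print the header, then the ten rows with the smallest FDR-adjusted P-values
head -1 nema-edgeR.csv
tail -n +2 nema-edgeR.csv | sort -t , -g -k 5,5 | head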

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
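
If you want to spot-check several samples at once, the same command can be wrapped in a small loop; the two filenames below are just examples from the data downloaded earlier:

for fq in 0Hour_ATCACG_L002_R1_001.extract.fastq.gz 6Hour_CGATGT_L002_R1_001.extract.fastq.gz
do
    echo "== $fq =="
    zcat $fq | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done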


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz' you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script (a minimal skeleton for such a script is sketched below). Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
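
To make the "write a script" step concrete, here is a minimal sketch of what such a script could look like, using the trimming step from this tutorial as the example command. The sample name and the subset filenames are placeholders, not files created elsewhere in this workshop, so substitute your own:

#!/bin/bash
# stop on the first error, and complain about unset variables
set -e
set -u

# placeholder sample name -- edit for your own data
SAMPLE=mysample

cd /mnt/work

# quality/adapter trim a small test subset of the reads
# (TruSeq2-PE.fa must be present in this directory, as in the trimming section)
TrimmomaticPE ${SAMPLE}_R1.subset.fastq.gz ${SAMPLE}_R2.subset.fastq.gz \
    ${SAMPLE}_R1.qc.fq.gz s1_se ${SAMPLE}_R2.qc.fq.gz s2_se \
    ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

# quick sanity check on the trimmed output
fastqc ${SAMPLE}_R1.qc.fq.gz ${SAMPLE}_R2.qc.fq.gz

echo "done with ${SAMPLE}"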

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

- close your terminal window;

- open up a new one and connect into your Amazon machine;

- type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
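
As a quick reference, these are standard screen keystrokes and commands, not anything specific to this workshop:

# while inside screen: detach with  Ctrl-a  followed by  d
# later, from a fresh login shell:
screen -ls    # list your running screen sessions
screen -r     # reattach (give the session id if more than one is listed)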

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
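
The same trick works for paired-end files, as long as you take the same number of lines from the R1 and R2 files so the pairs stay in sync; the filenames here are from the tutorial data set:

gunzip -c /mnt/data/0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | gzip > subset_R1.fq.gz
gunzip -c /mnt/data/0Hour_ATCACG_L002_R2_001.extract.fastq.gz | head -400000 | gzip > subset_R2.fq.gz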


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.


2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log CPM, P-value, and FDR-adjusted P-value.
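If you just want a quick count of how many transcripts pass a significance threshold, you can do that from the shell as well. This is a small sketch that assumes the FDR-adjusted P-value is the fifth column, as described above; adjust the column number if your file differs:

# skip the header line, then count rows with FDR-adjusted P-value below 0.05
awk -F, 'NR > 1 && $5 < 0.05' nema-edgeR.csv | wc -l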

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
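If you are curious whether quality trimming changed mappability, you can run the same spot check on one of the trimmed files; this assumes the .qc.fq.gz files from the trimming section are still in /mnt/work:

zcat 0Hour_ATCACG_L002_R1_001.qc.fq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null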


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada
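It is also a good idea to confirm that your downloads are intact. If the facility provides a checksum file (the file name below is only a guess; ask them what they actually provide), you can verify everything in one go:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/md5sum.txt
md5sum -c md5sum.txt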

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); see the sketch after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
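Here is a minimal sketch of what the qsub script from step 5 might look like; the resource requests and module names are placeholders that you will need to adapt to your own cluster:

#!/bin/bash -login
#PBS -l walltime=04:00:00,nodes=1:ppn=4,mem=16gb
# placeholders: adjust walltime, cores, memory, and module names for your HPC
cd ${PBS_O_WORKDIR}
module load Trimmomatic FastQC
# run the small-data script you already tested interactively
bash my-pipeline.sh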

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window

• open up a new one and connect into your Amazon machine

• type 'screen -r' to reconnect into your running screen

(See amazon/using-screen for a more complete rundown on instructions.)
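A couple of extra screen options are worth knowing (this is standard screen behavior, not specific to this workshop): you can name your sessions so they are easier to find again later.

screen -S diginorm     # start a session named 'diginorm'
screen -ls             # later: list running sessions
screen -r diginorm     # reattach to the named session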

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
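If you prefer the command line, scp does the same job. A sketch, run from your local laptop (swap in your own key file and your machine's public DNS name):

scp -i ~/Desktop/amazon.pem ubuntu@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:/mnt/work/nema-edgeR.csv .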

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
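You can double-check the subset with a little shell arithmetic; since each FASTQ record is four lines, this should print 100000:

echo $(( $(gunzip -c sample.fq.gz | wc -l) / 4 ))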


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
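To see how the 200-million-reads-per-TB rule of thumb applies to your own data, you can count reads directly; this assumes gzipped FASTQ files under /mnt/data:

# four lines per FASTQ record, so divide the line count by 4
zcat /mnt/data/*_R1_*.fastq.gz | wc -l | awk '{print $1/4, "R1 reads"}'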


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.



CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window

• open up a new one and connect into your Amazon machine

• type 'screen -r' to reconnect into your running screen

(See amazon/using-screen for a more complete rundown on instructions.)
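
If you end up with more than one long-running job, it can also help to give each screen session a name (these are standard screen options, not anything workshop-specific):

screen -S diginorm     # start a session named 'diginorm'
# run your command, then press Ctrl-a d to detach
screen -ls             # list the sessions that are still running
screen -r diginorm     # reattach to the named session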

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
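
If you prefer the command line, 'scp' can do the same job using the same .pem key you use for ssh; for example (run on your laptop, with the hostname replaced by your own instance's name and the remote path replaced by the file you actually want):

# copy a file from the Amazon machine down to the current directory on your laptop
scp -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com:/mnt/work/sample.fq.gz .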

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
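
To double-check the subset, you can count the lines - each FASTQ record is exactly four lines, so 400000 lines means 100000 records:

gunzip -c sample.fq.gz | wc -l    # should print 400000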


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine

2. you need to be prepared to let things run for a long time

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has 40 GB of disk space.
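
If the 1K-block counts are hard to read, 'df -h' prints the same information in human-readable units (GB/TB):

df -h /mnt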

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
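
Once you've attached a fresh volume by following those instructions, the final steps on the instance usually look something like this (a sketch only - check with 'lsblk' that the new volume really is /dev/xvdf before formatting anything):

lsblk                           # find the device name of the newly attached volume
sudo mkfs -t ext4 /dev/xvdf     # format it (only do this to a brand-new, empty volume!)
sudo mount /dev/xvdf /mnt       # mount it at /mnt
sudo chmod a+rwxt /mnt          # make it writeable, as we did earlier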


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel

labibi Documentation Release 10

Select your zone

Many of the resources that we use are hosted by Amazon on the East coast Make sure that your dashboard has lsquoNVirginiarsquo on the upper right

Then click on Launch Instance

Select the machine operating system to boot

Find the ldquoUbuntu Server 1404rdquo image in the first list to show up

8 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Choose the machine size

Select lsquoGeneral purposersquo lsquom3xlargersquo and then lsquoReview and Launchrsquo

Confirm and launch

Review the details (ignore the warnings) and click on Launch

31 More details 9

labibi Documentation Release 10

(First time through) generate a new key pair

If you donrsquot have any key pairs enter a key pair name and then download a key pair Then click Launch Instance

(Next times through) select an existing key pair

Select a key pair and click lsquoLaunchrsquo

10 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Click on View Instances

31 More details 11

labibi Documentation Release 10

Select the public DNS name for later use

312 Logging into your new instance ldquoin the cloudrdquo (Windows version)

First Download Putty and Puttygen from here httpwwwchiarkgreenendorguk~sgtathamputtydownloadhtml

Generate a ppk file from your pem file

(You only need to do this once for each key)

Open puttygen select ldquoLoadrdquo

12 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Find and load your lsquopemrsquo file itrsquos probably in your Downloads folder Note you have to select lsquoAll filesrsquo on thebottom

Load it

Now ldquosave private keyrdquo Put it somewhere easy to find

31 More details 13

labibi Documentation Release 10

Now that yoursquove generated your PPK file from your PEM file you can log in To do that

Logging into your EC2 instance with Putty

Open up putty and enter your hostname into the Host Name box

Now go find the lsquoSSHrsquo section and enter your ppk file (generated above by puttygen) Then select lsquoOpenrsquo

Log in as ldquoubunturdquo

14 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First find your private key file itrsquos the pem file you downloaded when starting up your EC2 instance It should be inyour Downloads folder Move it onto your desktop and rename it to lsquoamazonpemrsquo

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~Desktopamazonpem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~Desktopamazonpem ubuntuec2----compute-1amazonawscom

Here yoursquore logging in as user lsquoubuntursquo to the machine lsquoec2-174-129-122-189compute-1amazonawscomrsquo using theauthentication key located in lsquoamazonpemrsquo on your Desktop

Note you have to replace the stuff after the lsquorsquo sign with the name of the host see the red circle in

31 More details 15

labibi Documentation Release 10

At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs


CHAPTER 4

Short read quality and trimming

Note: Reminder: if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!

4.1 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
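(If you ever want to know exactly which versions you just installed – a question that comes up again below – one way to ask, assuming the Ubuntu package names used above, is:

fastqc --version
dpkg -s trimmomatic | grep -i version

The first prints FastQC's own version string; the second asks the package manager which version of the trimmomatic package it installed.)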

4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:


cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts
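If you want a quick sanity check on how many reads are in one of these subsets, you can count lines – each FASTQ record takes exactly four lines – like so:

gunzip -c 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | wc -l

and divide the number it prints by 4 to get the number of reads.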

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.

4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in –

ln -fs /mnt/data/* .

These are FASTQ files – let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')

Questions:

• why do the files have DNA in the name?

• why are there R1 and R2 in the file names?

• why don't we combine all the files?

Links:

• FASTQ Format
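If you want a reminder of what a FASTQ record actually looks like, you can print the first record of one of the files – the four lines are the read name, the sequence, a '+' separator, and the per-base quality string:

gunzip -c 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -4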

4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now run FastQC on two files


fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files and you should see

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html

• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html

Questions:

• What should you pay attention to in the FastQC report?

• Which is "better", R1 or R2? And why?

Links:

• FastQC

• FastQC tutorial video

4.6 3. Trimmomatic

Now we're going to do some trimming. We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 \
     SLIDINGWINDOW:4:2 \
     MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully


Questions

• How do you figure out what the parameters mean?

• How do you figure out what parameters to use?

• What adapters do you use?

• What version of Trimmomatic are we using here? (And FastQC?)

• Do you think parameters are different for RNAseq and genomic data sets?

• What's with these annoyingly long and complicated filenames?

• why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links:

• Trimmomatic

4.7 4. FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html

• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

• is the quality trimmed data "better" than before?

• Does it matter that you still have adapters?

4.8 5. Trim the rest of the sequences

First download the rest of the data

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE $base.extract.fastq.gz $baseR2.extract.fastq.gz \
        $base.qc.fq.gz s1_se \
        $baseR2.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention –

• # are comments;

• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;

• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about (there is a small stand-alone example of this just after this list);

• what's with 'orphans.fq'?
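Here is a tiny, stand-alone illustration of the two filename tricks the loop relies on – basename strips a suffix, and ${variable/pattern/replacement} swaps one piece of text for another. The filename is just one of the files we already downloaded; you can paste these lines into your terminal to see what they print:

filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz
base=$(basename $filename .extract.fastq.gz)
echo $base               # prints 0Hour_ATCACG_L002_R1_001
echo ${base/_R1_/_R2_}   # prints 0Hour_ATCACG_L002_R2_001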

Questions:

• how do you figure out if it's working?

  – copy/paste it from Word?

  – put in lots of echo?

  – edit one line at a time?

• how on earth do you figure out how to do this?

4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py $base.qc.fq.gz $baseR2.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next Running digital normalization


CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next run diginorm on the orphaned reads (from trimming)

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=$base.dn.fq.gz
     gzip -c $filename > $output
done


Now if you type

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002001.dn.fq.gz    6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz    6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz    6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz    6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next Running the actual assembly


CHAPTER 6

Running the actual assembly

Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make

Go into the work directory and prepare the data

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now run the Trinity assembler

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.


Next BLASTing your assembled data


CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin

Install the BLAST server and configure it

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py

Now copy in your newly created transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.


CHAPTER 8

Annotation of denovo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.

• The different (g's) could represent different genes (or parts of genes).

• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
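To make the gene/transcript relationship concrete, peek at the first couple of lines of the map; the first column is the gene, the second is the transcript (isoform). The identifiers below are made-up examples – yours will differ – but the pattern of one 'g' value shared by several '_i' values is what to look for:

head -2 Trinity.fasta.gene_trans_map
# TR1|c0_g1    TR1|c0_g1_i1
# TR1|c0_g1    TR1|c0_g1_i2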

8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install Transdecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the Transdecoder software to identify the longest-ORF peptide


cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the Transdecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1. SwissProt database: The UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6

8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features like:

• signal peptides, using signalP

• transmembrane regions, using tmHMM

• rRNA transcripts, using RNAMMER

8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)

2. Protein sequences (currently as defined by TransDecoder)

3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: Loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff
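So, for example, a stricter run of the report that sets both knobs explicitly might look like the following – the E-value and the choice of 'DGC' here are arbitrary illustrations, not recommendations:

~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-5 --pfam_cutoff DGC > trinotate_annotation_report_strict.xls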

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03

9.4 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log CPM, P-value, and FDR-adjusted P-value.
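As a quick command-line sanity check, you can count how many transcripts pass a given FDR threshold; this assumes the FDR-adjusted P-value is the last column of nema-edgeR.csv and that the first line is a header:

awk -F, 'NR > 1 && $NF < 0.05' nema-edgeR.csv | wc -l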

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a minimal example of what such a script can look like is sketched after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
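Here is a minimal sketch of what such a qsub script might look like on a PBS/Torque-style HPC. The resource numbers, the module name, and the script name are placeholders – replace them with whatever your cluster and your pipeline actually use:

#!/bin/bash
#PBS -l nodes=1:ppn=4,mem=16gb,walltime=24:00:00
#PBS -N trim_reads

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# load whatever module provides your tools on this cluster (placeholder name)
module load Trimmomatic

# run the same script you already tested interactively on small data
bash trim-all-samples.sh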

Next More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
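If you end up with more than one long-running job, it helps to give each screen session a name so you can tell them apart later; these are standard screen options:

screen -S assembly     # start a session named 'assembly'
screen -ls             # list existing sessions
screen -r assembly     # reattach to the one you want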

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
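For paired-end files you will usually want to subset R1 and R2 the same way, so that the pairs stay in sync; taking the same number of lines from the top of each file does that, since the files are ordered identically. The output names here are just examples:

gunzip -c 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | gzip > subset_R1.fq.gz
gunzip -c 0Hour_ATCACG_L002_R2_001.extract.fastq.gz | head -400000 | gzip > subset_R2.fq.gz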


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance see this set of instructions

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.


Page 12: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

Choose the machine size

Select lsquoGeneral purposersquo lsquom3xlargersquo and then lsquoReview and Launchrsquo

Confirm and launch

Review the details (ignore the warnings) and click on Launch

31 More details 9

labibi Documentation Release 10

(First time through) generate a new key pair

If you donrsquot have any key pairs enter a key pair name and then download a key pair Then click Launch Instance

(Next times through) select an existing key pair

Select a key pair and click lsquoLaunchrsquo

10 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Click on View Instances

31 More details 11

labibi Documentation Release 10

Select the public DNS name for later use

312 Logging into your new instance ldquoin the cloudrdquo (Windows version)

First Download Putty and Puttygen from here httpwwwchiarkgreenendorguk~sgtathamputtydownloadhtml

Generate a ppk file from your pem file

(You only need to do this once for each key)

Open puttygen select ldquoLoadrdquo

12 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Find and load your lsquopemrsquo file itrsquos probably in your Downloads folder Note you have to select lsquoAll filesrsquo on thebottom

Load it

Now ldquosave private keyrdquo Put it somewhere easy to find

31 More details 13

labibi Documentation Release 10

Now that yoursquove generated your PPK file from your PEM file you can log in To do that

Logging into your EC2 instance with Putty

Open up putty and enter your hostname into the Host Name box

Now go find the lsquoSSHrsquo section and enter your ppk file (generated above by puttygen) Then select lsquoOpenrsquo

Log in as ldquoubunturdquo

14 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First find your private key file itrsquos the pem file you downloaded when starting up your EC2 instance It should be inyour Downloads folder Move it onto your desktop and rename it to lsquoamazonpemrsquo

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~Desktopamazonpem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~Desktopamazonpem ubuntuec2----compute-1amazonawscom

Here yoursquore logging in as user lsquoubuntursquo to the machine lsquoec2-174-129-122-189compute-1amazonawscomrsquo using theauthentication key located in lsquoamazonpemrsquo on your Desktop

Note you have to replace the stuff after the lsquorsquo sign with the name of the host see the red circle in

31 More details 15

labibi Documentation Release 10

At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs

16 Chapter 3 Getting started with Amazon EC2

CHAPTER 4

Short read quality and trimming

Note Reminder if yoursquore on Windows you should install mobaxterm

OK you should now be logged into your Amazon computer How exciting

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First letrsquos set up a place to work

sudo chmod a+rwxt mnt

This makes lsquomntrsquo a place where we can put data and working files

Next letrsquos install a few things

sudo apt-get updatesudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs which wersquoll use below along with some software prerequisites thatwersquoll need for other things below

42 Data source

Wersquore going to be using a subset of data from Tulin et al 2013 a paper looking at early transcription in the organismNematostella vectensis the sea anemone

43 1 Copying in some data to work with

Wersquove loaded subsets of the data onto an Amazon location for you to make everything faster for todayrsquos work Wersquoregoing to put the files on your computer locally under the directory mntdata

mkdir mntdata

Next letrsquos grab part of the data set

17

labibi Documentation Release 10

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_001extractfastqgz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001extractfastqgz-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002extractfastqgz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default UNIX makes things writeable by the file ownerLetrsquos fix that before we go on any further

chmod u-w

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First make a working directory this will be a place where you can futz around with a copy of the data without messingup your primary data

mkdir mntworkcd mntwork

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs mntdata

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001extractfastqgz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

Wersquore going to use FastQC to summarize the data We already installed lsquofastqcrsquo on our computer - thatrsquos what thelsquoapt-get installrsquo did above

Now run FastQC on two files

18 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

fastqc 0Hour_ATCACG_L002_R1_001extractfastqgzfastqc 0Hour_ATCACG_L002_R2_001extractfastqgz

Now type lsquolsrsquo

ls -d fastqc

to list the files and you should see

0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip

We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you

bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now wersquore going to do some trimming Wersquoll be using Trimmomatic which (as with fastqc) wersquove already installedvia apt-get

The first thing wersquoll need are the adapters to trim off

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-semi-2015-03-04TruSeq2-PEfa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001extractfastqgz 0Hour_ATCACG_L002_R2_001extractfastqgz

0Hour_ATCACG_L002_R1_001qcfqgz s1_se 0Hour_ATCACG_L002_R2_001qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

You should see output that looks like thisQuality encoding detected as phred33Input Read Pairs 140557 Both Surviving 138775 (9873) Forward Only Surviving 1776 (126) Reverse Only Surviving 6 (000) Dropped 0 (000)TrimmomaticPE Completed successfully

46 3 Trimmomatic 19

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz

20 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

And link it in

cd mntworkln -fs mntdatafastqgz

Now we have a lot of files ndash and we really donrsquot want to trim each and every one of them by typing in a command foreach pair Here wersquoll make use of a great feature of the UNIX command line ndash the ability to automate such tasks

Herersquos a for loop that you can run - wersquoll walk through what it does while itrsquos running

rm -f orphansfq

for filename in _R1_extractfastqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename extractfastqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

finally run TrimmomaticTrimmomaticPE $baseextractfastqgz $baseR2extractfastqgz

$baseqcfqgz s1_se $baseR2qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

save the orphanscat s1_se s2_se gtgt orphansfq

done

Things to mention ndash

bull are comments

bull anywhere you see a lsquo$rsquo is replaced by the value of the variable after it so eg $filename is replaced by each ofthe files matching _R1_extractfastqgz once for each time through the loop

bull we have to do complicated things to the filenames to get this to work which is what the $base_R1__R2_stuff is about

bull whatrsquos with lsquoorphansfqrsquo

Questions

bull how do you figure out if itrsquos working

ndash copypaste it from Word

48 5 Trim the rest of the sequences 21

labibi Documentation Release 10

ndash put in lots of echo

ndash edit one line at a time

bull how on earth do you figure out how to do this

49 6 Interleave the sequences

Next we need to take these R1 and R2 sequences and convert them into interleaved form for the next step To do thiswersquoll use scripts from the khmer package which we need to install

sudo pip install -U setuptoolssudo pip install khmer==13

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next: Running digital normalization


CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data, known as "digital normalization". This approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements):

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
   extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done


Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step:

0Hour_ATCACG_L002001.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
6Hour_CGATGT_L002002.dn.fq.gz
6Hour_CGATGT_L002003.dn.fq.gz
6Hour_CGATGT_L002004.dn.fq.gz
6Hour_CGATGT_L002005.dn.fq.gz
orphans.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe.* *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly


CHAPTER 6

Running the actual assembly

Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make

Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.


Next: BLASTing your assembled data


CHAPTER 7

BLASTing your assembled data

First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.


CHAPTER 8

Annotation of denovo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.

• The different (g's) could represent different genes (or parts of genes).

• The components (TR|c) often contain related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
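
The map itself is just a two-column, tab-separated text file: the gene identifier in the first column, and the transcript (isoform) identifier in the second, one transcript per line. Assuming that layout, you can count how many genes and transcripts your assembly produced:

# number of distinct genes (column 1) and transcripts (column 2) in the map
cut -f1 Trinity.fasta.gene_trans_map | sort -u | wc -l
cut -f2 Trinity.fasta.gene_trans_map | sort -u | wc -l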

8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:


cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6

8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features (example commands after this list), like:

• signal peptides, using signalP

• transmembrane regions, using tmHMM

• rRNA transcripts, using RNAMMER
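
None of these three tools is installed by the steps in this tutorial. If you do have signalP and tmHMM installed and licensed separately, the usual Trinotate-style invocations look roughly like this (a sketch under that assumption, not something we run in class):

# signal peptide prediction on the TransDecoder peptides
signalp -f short -n signalp.out Trinity.fasta.transdecoder_dir/longest_orfs.pep

# transmembrane region prediction
tmhmm --short < Trinity.fasta.transdecoder_dir/longest_orfs.pep > tmhmm.out

The resulting signalp.out and tmhmm.out files can then be loaded into the Trinotate database (see the next section) with its LOAD_signalp and LOAD_tmhmm commands.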

8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI Perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)

2. Protein sequences (currently as defined by TransDecoder)

3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation (an example follows the list below):

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff
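
For example, to regenerate the report with a stricter E-value and the domain trusted cutoff, you could run something like this (the particular values are illustrative, not a recommendation):

~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-5 --pfam_cutoff DTC > trinotate_annotation_report.strict.xls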

Let us see the output. Open a new shell (on your local machine) and copy the report down:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next, build an index file for your assembly:


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nemaid7tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nemaid1tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nemaid2tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nemaid3tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nemaid4tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nemaid62tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nemaid63tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nemaid61tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nemaid21tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03

9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
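
Each of these counts files should now be a headerless, two-column, tab-separated table of transcript name and estimated counts, one transcript per line. A quick way to sanity-check them before moving on to R (an illustrative check, not part of the original pipeline):

head -5 0Hour_repl1_counts.txt    # peek at the first few transcript/count pairs
wc -l *_counts.txt                # all four files should have the same number of lines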

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one, diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate

• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;).

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget'; for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); see the sketch after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
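
As a rough illustration of steps 5 and 6, a qsub submission script is just your tested shell commands with a scheduler header on top. The sketch below assumes a PBS/TORQUE-style scheduler; the module name, resource limits, and the script name (trim-all-samples.sh) are placeholders you would replace with your own:

#!/bin/bash
#PBS -l walltime=24:00:00,nodes=1:ppn=4,mem=16gb
#PBS -N trim_reads

# load whatever modules your cluster actually provides (names vary by site)
module load Trimmomatic

# start from the directory the job was submitted from
cd $PBS_O_WORKDIR

# run the commands you already tested interactively on the small data subset
bash trim-all-samples.sh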

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms, but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options:

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine):

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.
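
For example, a download from your core might look roughly like this (the URL and credentials below are hypothetical placeholders; your sequencing core will give you the real ones):

cd /mnt/data
curl -O -L "http://username:password@sequencing.core.example.edu/yourlab/Sample1_R1.fastq.gz"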

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine; you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.


Page 13: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

(First time through) generate a new key pair

If you donrsquot have any key pairs enter a key pair name and then download a key pair Then click Launch Instance

(Next times through) select an existing key pair

Select a key pair and click lsquoLaunchrsquo

10 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Click on View Instances

31 More details 11

labibi Documentation Release 10

Select the public DNS name for later use

312 Logging into your new instance ldquoin the cloudrdquo (Windows version)

First Download Putty and Puttygen from here httpwwwchiarkgreenendorguk~sgtathamputtydownloadhtml

Generate a ppk file from your pem file

(You only need to do this once for each key)

Open puttygen select ldquoLoadrdquo

12 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Find and load your lsquopemrsquo file itrsquos probably in your Downloads folder Note you have to select lsquoAll filesrsquo on thebottom

Load it

Now ldquosave private keyrdquo Put it somewhere easy to find

31 More details 13

labibi Documentation Release 10

Now that yoursquove generated your PPK file from your PEM file you can log in To do that

Logging into your EC2 instance with Putty

Open up putty and enter your hostname into the Host Name box

Now go find the lsquoSSHrsquo section and enter your ppk file (generated above by puttygen) Then select lsquoOpenrsquo

Log in as ldquoubunturdquo

14 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First find your private key file itrsquos the pem file you downloaded when starting up your EC2 instance It should be inyour Downloads folder Move it onto your desktop and rename it to lsquoamazonpemrsquo

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~Desktopamazonpem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~Desktopamazonpem ubuntuec2----compute-1amazonawscom

Here yoursquore logging in as user lsquoubuntursquo to the machine lsquoec2-174-129-122-189compute-1amazonawscomrsquo using theauthentication key located in lsquoamazonpemrsquo on your Desktop

Note you have to replace the stuff after the lsquorsquo sign with the name of the host see the red circle in

31 More details 15

labibi Documentation Release 10

At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs

16 Chapter 3 Getting started with Amazon EC2

CHAPTER 4

Short read quality and trimming

Note Reminder if yoursquore on Windows you should install mobaxterm

OK you should now be logged into your Amazon computer How exciting

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First letrsquos set up a place to work

sudo chmod a+rwxt mnt

This makes lsquomntrsquo a place where we can put data and working files

Next letrsquos install a few things

sudo apt-get updatesudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs which wersquoll use below along with some software prerequisites thatwersquoll need for other things below

42 Data source

Wersquore going to be using a subset of data from Tulin et al 2013 a paper looking at early transcription in the organismNematostella vectensis the sea anemone

43 1 Copying in some data to work with

Wersquove loaded subsets of the data onto an Amazon location for you to make everything faster for todayrsquos work Wersquoregoing to put the files on your computer locally under the directory mntdata

mkdir mntdata

Next letrsquos grab part of the data set

17

labibi Documentation Release 10

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_001extractfastqgz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001extractfastqgz-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002extractfastqgz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default UNIX makes things writeable by the file ownerLetrsquos fix that before we go on any further

chmod u-w

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First make a working directory this will be a place where you can futz around with a copy of the data without messingup your primary data

mkdir mntworkcd mntwork

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs mntdata

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001extractfastqgz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

Wersquore going to use FastQC to summarize the data We already installed lsquofastqcrsquo on our computer - thatrsquos what thelsquoapt-get installrsquo did above

Now run FastQC on two files

18 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

fastqc 0Hour_ATCACG_L002_R1_001extractfastqgzfastqc 0Hour_ATCACG_L002_R2_001extractfastqgz

Now type lsquolsrsquo

ls -d fastqc

to list the files and you should see

0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip

We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you

bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now wersquore going to do some trimming Wersquoll be using Trimmomatic which (as with fastqc) wersquove already installedvia apt-get

The first thing wersquoll need are the adapters to trim off

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-semi-2015-03-04TruSeq2-PEfa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001extractfastqgz 0Hour_ATCACG_L002_R2_001extractfastqgz

0Hour_ATCACG_L002_R1_001qcfqgz s1_se 0Hour_ATCACG_L002_R2_001qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

You should see output that looks like thisQuality encoding detected as phred33Input Read Pairs 140557 Both Surviving 138775 (9873) Forward Only Surviving 1776 (126) Reverse Only Surviving 6 (000) Dropped 0 (000)TrimmomaticPE Completed successfully

46 3 Trimmomatic 19

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz

20 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

And link it in

cd mntworkln -fs mntdatafastqgz

Now we have a lot of files ndash and we really donrsquot want to trim each and every one of them by typing in a command foreach pair Here wersquoll make use of a great feature of the UNIX command line ndash the ability to automate such tasks

Herersquos a for loop that you can run - wersquoll walk through what it does while itrsquos running

rm -f orphansfq

for filename in _R1_extractfastqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename extractfastqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

finally run TrimmomaticTrimmomaticPE $baseextractfastqgz $baseR2extractfastqgz

$baseqcfqgz s1_se $baseR2qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

save the orphanscat s1_se s2_se gtgt orphansfq

done

Things to mention ndash

bull are comments

bull anywhere you see a lsquo$rsquo is replaced by the value of the variable after it so eg $filename is replaced by each ofthe files matching _R1_extractfastqgz once for each time through the loop

bull we have to do complicated things to the filenames to get this to work which is what the $base_R1__R2_stuff is about

bull whatrsquos with lsquoorphansfqrsquo

Questions

bull how do you figure out if itrsquos working

ndash copypaste it from Word

48 5 Trim the rest of the sequences 21

labibi Documentation Release 10

ndash put in lots of echo

ndash edit one line at a time

bull how on earth do you figure out how to do this

49 6 Interleave the sequences

Next we need to take these R1 and R2 sequences and convert them into interleaved form for the next step To do thiswersquoll use scripts from the khmer package which we need to install

sudo pip install -U setuptoolssudo pip install khmer==13

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in _R1_qcfqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename qcfqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

construct the output filenameoutput=$base_R1_peqcfqgz

interleave-readspy $baseqcfqgz $baseR2qcfqgz | gzip gt $output

done

gzip orphansfq

Next Running digital normalization

22 Chapter 4 Short read quality and trimming

CHAPTER 5

Running digital normalization

Next wersquore going to apply abundance normalization to the data ndash known as ldquodigital normalizationrdquo this approach wasdeveloped by our lab to make it possible to assemble large data sets more quickly and easily You can read more aboutit in Brown et al 2012 and also see some of its affects on transcriptome assembly in Lowe et al 2014

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First wersquoll run it on the interleaved files we generated in the previous section

cd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

(These parameters should work for essentially all mRNAseq data sets see the khmer documentation for more infor-mation)

Next run diginorm on the orphaned reads (from trimming)

normalize-by-medianpy -l normC20k20ct -s normC20k20ct orphansfqgz

Do k-mer abundance trimming on the reads which will eliminate the majority of the errors (thus further decreasingthe memory requirements) ndash

filter-abundpy -V normC20k20ct keep

See our paper Zhang et al 2014 lthttpwwwncbinlmnihgovpubmed25062443gtlsquo__ Table 3 for more informationon k-mer trimming effects

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in pekeepabundfiltdo

extract-paired-readspy $filenamedone

Put all the orphaned reads in one place

cat se orphansfqgzkeepabundfilt | gzip gt orphansdnfqgz

And now rename the paired-end files to something nice

for filename in peqcfqgzkeepabundfiltpedo

base=$(basename $filename peqcfqgzkeepabundfiltpe)output=$basednfqgzgzip -c $filename gt $output

done

23

labibi Documentation Release 10

Now if you type

ls dnfqgz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001dnfqgz 6Hour_CGATGT_L002002dnfqgz0Hour_ATCACG_L002002dnfqgz 6Hour_CGATGT_L002003dnfqgz0Hour_ATCACG_L002003dnfqgz 6Hour_CGATGT_L002004dnfqgz0Hour_ATCACG_L002004dnfqgz 6Hour_CGATGT_L002005dnfqgz0Hour_ATCACG_L002005dnfqgz orphansdnfqgz6Hour_CGATGT_L002001dnfqgz

Letrsquos remove some of the detritus before moving on

rm pe se abundfilt keeprm normC20k20ct

Next Running the actual assembly

24 Chapter 5 Running digital normalization

CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next install Trinity v206

cdcurl -L httpsgithubcomtrinityrnaseqtrinityrnaseqarchivev206targz gt trinitytargztar xzf trinitytargzmv trinityrnaseq trinity

cd trinitymake

Go into the work directory and prepare the data

cd mntworkfor i in dnfqgzdo

split-paired-readspy $idone

cat 1 gt leftfqcat 2 gt rightfq

Now run the Trinity assembler

~trinityTrinity --left leftfq --right rightfq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dirTrinityfasta which you can get stats on like so

curl -L -O httpsgithubcomged-labkhmerrawv13sandboxassemstats3pypython assemstats3py 300 trinity_out_dirTrinityfasta

Change the filename and rename all the sequences

gzip -c trinity_out_dirTrinityfasta gt trinity-nematostella-rawfagzcurl -O http2015-may-nonmodelreadthedocsorgendev_staticrename-with-partitionspychmod u+x rename-with-partitionspyrename-with-partitionspy nema trinity-nematostella-rawfagz

This last command will give you trinity-nematostella-rawrenamedfastagz which contains all ofthe renamed sequences

25

labibi Documentation Release 10

Next BLASTing your assembled data

26 Chapter 6 Running the actual assembly

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+httpsgithubcomctbpygr-drawgit screed

sudo ln -s usrbinblastall usrlocalbin

Install the BLAST server and configure it

cdgit clone httpsgithubcomctbblastkitgit -b 2015-may-nonmodelsudo blastkitconfigure-lighttpdsh

cd blastkitwwwsudo ln -fs $PWD varwwwblastkit

mkdir fileschmod a+rxwt fileschmod +x homeubuntu

cd homeubuntublastkitpython checkpy

Now copy in your newly created transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt homeubuntublastkitdbdbfa

cd homeubuntublastkitformatdb -i dbdbfa -o T -p Fpython index-dbpy dbdbfa

You can now access your BLAST server at httpltamazon machine namegtblastkit

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set try BLASTing zebrafish lethal giant larvaehomolog

27

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional: load the functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation

-E <float>: maximum E-value for reporting the best blast hit and associated annotations

--pfam_cutoff <string>, one of the following (an example follows the list):

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff
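
For example, to apply a stricter E-value together with the Pfam domain trusted cutoff, the report step could be run as below; the thresholds shown are arbitrary illustrations, not recommendations.

~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-5 --pfam_cutoff DTC > trinotate_annotation_report.strict.xls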

Let us look at the output. Open a new shell on your local machine and copy the report over:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03

9.4 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one, diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It contains the gene name, log fold change, P-value, and FDR-adjusted P-value columns.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
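
If you want a quick mapping-rate check for every sample rather than just one, the same command can be wrapped in a loop; this is just a sketch that reuses the subsetting trick above (bowtie2 prints the alignment rate to stderr for each file):

cd /mnt/work
for filename in *_R1_*.extract.fastq.gz
do
   echo $filename
   # map the first 100,000 reads of each R1 file against the transcriptome index
   zcat $filename | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done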


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget'; for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz' you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a minimal sketch follows this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
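
As a starting point, a qsub (PBS/Torque) submission script is usually just your working shell script with a few scheduler directives at the top. Here is a minimal sketch; the resource requests, module name, and pipeline script name are placeholders to adapt to your own cluster, so check your HPC's documentation for the exact syntax.

#!/bin/bash
#PBS -l nodes=1:ppn=4
#PBS -l mem=16gb
#PBS -l walltime=24:00:00

# move to the directory the job was submitted from
cd $PBS_O_WORKDIR

# load whatever software your commands need (module name is just an example)
module load Trimmomatic

# run the pipeline script you already tested on the small data set
bash my_pipeline.sh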

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms, but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options:

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions; the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine):

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.
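
For example, if your core gives you a URL with a username and password, the download from the Amazon machine typically looks something like this; the hostname, credentials, and file name below are made-up placeholders.

cd /mnt/data
curl -O -L ftp://username:password@ftp.example-sequencing-core.edu/project/Sample_1_R1.fastq.gz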

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine; you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
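
Once a bigger EBS volume is attached to your instance, the basic steps to put a filesystem on it and mount it over /mnt look roughly like this. This is a sketch: the device name /dev/xvdf is an assumption (it depends on how you attached the volume), and mkfs erases whatever is on that volume, so double-check with lsblk first.

lsblk                          # find the new, unformatted device (e.g. /dev/xvdf)
sudo mkfs -t ext4 /dev/xvdf    # put an ext4 filesystem on the new volume
sudo mount /dev/xvdf /mnt      # mount it on /mnt
sudo chmod a+rwxt /mnt         # make it writeable, as before
df -k /mnt                     # confirm the new size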


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

Page 14: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

Click on View Instances

31 More details 11

labibi Documentation Release 10

Select the public DNS name for later use

312 Logging into your new instance ldquoin the cloudrdquo (Windows version)

First Download Putty and Puttygen from here httpwwwchiarkgreenendorguk~sgtathamputtydownloadhtml

Generate a ppk file from your pem file

(You only need to do this once for each key)

Open puttygen select ldquoLoadrdquo

12 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Find and load your lsquopemrsquo file itrsquos probably in your Downloads folder Note you have to select lsquoAll filesrsquo on thebottom

Load it

Now ldquosave private keyrdquo Put it somewhere easy to find

31 More details 13

labibi Documentation Release 10

Now that yoursquove generated your PPK file from your PEM file you can log in To do that

Logging into your EC2 instance with Putty

Open up putty and enter your hostname into the Host Name box

Now go find the lsquoSSHrsquo section and enter your ppk file (generated above by puttygen) Then select lsquoOpenrsquo

Log in as ldquoubunturdquo

14 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First find your private key file itrsquos the pem file you downloaded when starting up your EC2 instance It should be inyour Downloads folder Move it onto your desktop and rename it to lsquoamazonpemrsquo

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~Desktopamazonpem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~Desktopamazonpem ubuntuec2----compute-1amazonawscom

Here yoursquore logging in as user lsquoubuntursquo to the machine lsquoec2-174-129-122-189compute-1amazonawscomrsquo using theauthentication key located in lsquoamazonpemrsquo on your Desktop

Note you have to replace the stuff after the lsquorsquo sign with the name of the host see the red circle in

31 More details 15

labibi Documentation Release 10

At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs

16 Chapter 3 Getting started with Amazon EC2

CHAPTER 4

Short read quality and trimming

Note Reminder if yoursquore on Windows you should install mobaxterm

OK you should now be logged into your Amazon computer How exciting

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First letrsquos set up a place to work

sudo chmod a+rwxt mnt

This makes lsquomntrsquo a place where we can put data and working files

Next letrsquos install a few things

sudo apt-get updatesudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs which wersquoll use below along with some software prerequisites thatwersquoll need for other things below

42 Data source

Wersquore going to be using a subset of data from Tulin et al 2013 a paper looking at early transcription in the organismNematostella vectensis the sea anemone

43 1 Copying in some data to work with

Wersquove loaded subsets of the data onto an Amazon location for you to make everything faster for todayrsquos work Wersquoregoing to put the files on your computer locally under the directory mntdata

mkdir mntdata

Next letrsquos grab part of the data set

17

labibi Documentation Release 10

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_001extractfastqgz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001extractfastqgz-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002extractfastqgz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default UNIX makes things writeable by the file ownerLetrsquos fix that before we go on any further

chmod u-w

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First make a working directory this will be a place where you can futz around with a copy of the data without messingup your primary data

mkdir mntworkcd mntwork

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs mntdata

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001extractfastqgz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

Wersquore going to use FastQC to summarize the data We already installed lsquofastqcrsquo on our computer - thatrsquos what thelsquoapt-get installrsquo did above

Now run FastQC on two files

18 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

fastqc 0Hour_ATCACG_L002_R1_001extractfastqgzfastqc 0Hour_ATCACG_L002_R2_001extractfastqgz

Now type lsquolsrsquo

ls -d fastqc

to list the files and you should see

0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip

We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you

bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now wersquore going to do some trimming Wersquoll be using Trimmomatic which (as with fastqc) wersquove already installedvia apt-get

The first thing wersquoll need are the adapters to trim off

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-semi-2015-03-04TruSeq2-PEfa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001extractfastqgz 0Hour_ATCACG_L002_R2_001extractfastqgz

0Hour_ATCACG_L002_R1_001qcfqgz s1_se 0Hour_ATCACG_L002_R2_001qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

You should see output that looks like thisQuality encoding detected as phred33Input Read Pairs 140557 Both Surviving 138775 (9873) Forward Only Surviving 1776 (126) Reverse Only Surviving 6 (000) Dropped 0 (000)TrimmomaticPE Completed successfully

46 3 Trimmomatic 19

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz

20 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

And link it in

cd mntworkln -fs mntdatafastqgz

Now we have a lot of files ndash and we really donrsquot want to trim each and every one of them by typing in a command foreach pair Here wersquoll make use of a great feature of the UNIX command line ndash the ability to automate such tasks

Herersquos a for loop that you can run - wersquoll walk through what it does while itrsquos running

rm -f orphansfq

for filename in _R1_extractfastqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename extractfastqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

finally run TrimmomaticTrimmomaticPE $baseextractfastqgz $baseR2extractfastqgz

$baseqcfqgz s1_se $baseR2qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

save the orphanscat s1_se s2_se gtgt orphansfq

done

Things to mention ndash

bull are comments

bull anywhere you see a lsquo$rsquo is replaced by the value of the variable after it so eg $filename is replaced by each ofthe files matching _R1_extractfastqgz once for each time through the loop

bull we have to do complicated things to the filenames to get this to work which is what the $base_R1__R2_stuff is about

bull whatrsquos with lsquoorphansfqrsquo

Questions

bull how do you figure out if itrsquos working

ndash copypaste it from Word

48 5 Trim the rest of the sequences 21

labibi Documentation Release 10

ndash put in lots of echo

ndash edit one line at a time

bull how on earth do you figure out how to do this

49 6 Interleave the sequences

Next we need to take these R1 and R2 sequences and convert them into interleaved form for the next step To do thiswersquoll use scripts from the khmer package which we need to install

sudo pip install -U setuptoolssudo pip install khmer==13

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in _R1_qcfqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename qcfqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

construct the output filenameoutput=$base_R1_peqcfqgz

interleave-readspy $baseqcfqgz $baseR2qcfqgz | gzip gt $output

done

gzip orphansfq

Next Running digital normalization

22 Chapter 4 Short read quality and trimming

CHAPTER 5

Running digital normalization

Next wersquore going to apply abundance normalization to the data ndash known as ldquodigital normalizationrdquo this approach wasdeveloped by our lab to make it possible to assemble large data sets more quickly and easily You can read more aboutit in Brown et al 2012 and also see some of its affects on transcriptome assembly in Lowe et al 2014

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First wersquoll run it on the interleaved files we generated in the previous section

cd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

(These parameters should work for essentially all mRNAseq data sets see the khmer documentation for more infor-mation)

Next run diginorm on the orphaned reads (from trimming)

normalize-by-medianpy -l normC20k20ct -s normC20k20ct orphansfqgz

Do k-mer abundance trimming on the reads which will eliminate the majority of the errors (thus further decreasingthe memory requirements) ndash

filter-abundpy -V normC20k20ct keep

See our paper Zhang et al 2014 lthttpwwwncbinlmnihgovpubmed25062443gtlsquo__ Table 3 for more informationon k-mer trimming effects

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in pekeepabundfiltdo

extract-paired-readspy $filenamedone

Put all the orphaned reads in one place

cat se orphansfqgzkeepabundfilt | gzip gt orphansdnfqgz

And now rename the paired-end files to something nice

for filename in peqcfqgzkeepabundfiltpedo

base=$(basename $filename peqcfqgzkeepabundfiltpe)output=$basednfqgzgzip -c $filename gt $output

done

23

labibi Documentation Release 10

Now if you type

ls dnfqgz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001dnfqgz 6Hour_CGATGT_L002002dnfqgz0Hour_ATCACG_L002002dnfqgz 6Hour_CGATGT_L002003dnfqgz0Hour_ATCACG_L002003dnfqgz 6Hour_CGATGT_L002004dnfqgz0Hour_ATCACG_L002004dnfqgz 6Hour_CGATGT_L002005dnfqgz0Hour_ATCACG_L002005dnfqgz orphansdnfqgz6Hour_CGATGT_L002001dnfqgz

Letrsquos remove some of the detritus before moving on

rm pe se abundfilt keeprm normC20k20ct

Next Running the actual assembly

24 Chapter 5 Running digital normalization

CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next install Trinity v206

cdcurl -L httpsgithubcomtrinityrnaseqtrinityrnaseqarchivev206targz gt trinitytargztar xzf trinitytargzmv trinityrnaseq trinity

cd trinitymake

Go into the work directory and prepare the data

cd mntworkfor i in dnfqgzdo

split-paired-readspy $idone

cat 1 gt leftfqcat 2 gt rightfq

Now run the Trinity assembler

~trinityTrinity --left leftfq --right rightfq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dirTrinityfasta which you can get stats on like so

curl -L -O httpsgithubcomged-labkhmerrawv13sandboxassemstats3pypython assemstats3py 300 trinity_out_dirTrinityfasta

Change the filename and rename all the sequences

gzip -c trinity_out_dirTrinityfasta gt trinity-nematostella-rawfagzcurl -O http2015-may-nonmodelreadthedocsorgendev_staticrename-with-partitionspychmod u+x rename-with-partitionspyrename-with-partitionspy nema trinity-nematostella-rawfagz

This last command will give you trinity-nematostella-rawrenamedfastagz which contains all ofthe renamed sequences

25

labibi Documentation Release 10

Next BLASTing your assembled data

26 Chapter 6 Running the actual assembly

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+httpsgithubcomctbpygr-drawgit screed

sudo ln -s usrbinblastall usrlocalbin

Install the BLAST server and configure it

cdgit clone httpsgithubcomctbblastkitgit -b 2015-may-nonmodelsudo blastkitconfigure-lighttpdsh

cd blastkitwwwsudo ln -fs $PWD varwwwblastkit

mkdir fileschmod a+rxwt fileschmod +x homeubuntu

cd homeubuntublastkitpython checkpy

Now copy in your newly created transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt homeubuntublastkitdbdbfa

cd homeubuntublastkitformatdb -i dbdbfa -o T -p Fpython index-dbpy dbdbfa

You can now access your BLAST server at httpltamazon machine namegtblastkit

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set try BLASTing zebrafish lethal giant larvaehomolog

27

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada
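One extra step that is not part of the original instructions, but is cheap insurance: if your sequencing facility publishes checksums for the files, verify the downloads before you start analyzing. The checksum filename below (md5sums.txt) is just a placeholder for whatever the facility actually provides.

cd ~/testdata

# Record checksums of everything you downloaded.
md5sum *.fastq.gz > local-md5s.txt

# If the facility provides an md5sum-format checksum file, verify the
# downloaded files against it.
md5sum -c md5sums.txt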

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); see the sketch after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
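For step 5, here is a rough sketch of what a minimal qsub (PBS/Torque) submission script for the trimming step might look like. This is not part of the original tutorial: the resource requests, the module name, and the TrimmomaticPE wrapper are assumptions you will need to adapt to your own cluster (check 'module avail' and your site's documentation); the Trimmomatic parameters themselves are the ones used earlier in this workshop.

#!/bin/bash
#PBS -l nodes=1:ppn=4,mem=16gb,walltime=24:00:00
#PBS -N trim-reads
#PBS -j oe

# Hypothetical module name; adjust to whatever your cluster provides.
module load Trimmomatic

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Example command for one sample; adjust filenames to your own data.
TrimmomaticPE sample_R1.fastq.gz sample_R2.fastq.gz \
    sample_R1.qc.fq.gz s1_se sample_R2.qc.fq.gz s2_se \
    ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 \
    SLIDINGWINDOW:4:2 MINLEN:25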

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc., etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
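A small addition beyond the original instructions: screen sessions can be given names, which helps once you have more than one long-running job. These are standard screen options; the session name "diginorm" is just an example.

screen -S diginorm      # start a session named "diginorm"
# ... detach with Ctrl-A then D (or just close the terminal) ...
screen -ls              # list the sessions that are still running
screen -r diginorm      # reattach to that session by name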

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
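For paired-end data, you generally want to subset R1 and R2 the same way so that the pairs stay in sync. A minimal sketch, using the file naming from earlier in this tutorial (the output names sample_R1.fq.gz and sample_R2.fq.gz are just placeholders):

# Take the same number of records from each file of the pair so that
# read 1 and read 2 stay matched up.
gunzip -c /mnt/data/0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | gzip > sample_R1.fq.gz
gunzip -c /mnt/data/0Hour_ATCACG_L002_R2_001.extract.fastq.gz | head -400000 | gzip > sample_R2.fq.gz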


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used  Available  Use%  Mounted on
/dev/xvdb       38565344 20098736   16500940   55%  /mnt

which tells you that /mnt has about 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.



Find and load your 'pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.

Load it

Now ldquosave private keyrdquo Put it somewhere easy to find


Now that you've generated your PPK file from your PEM file, you can log in. To do that:

Logging into your EC2 instance with Putty

Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".


Declare victory

3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications... Utilities...) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".
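If you want to double-check, this optional command (assuming you put the key on your Desktop as above) should now show permissions of '-rw-------' for the file:

ls -l ~/Desktop/amazon.pem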

Then type

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note, you have to replace the stuff after the '@' sign with the name of your host; see the red circle in the figure above.


At the end you should see text and a prompt that look like this

A final checklist:

• EC2 instance is running
• used ubuntu 14.04
• NOT a micro instance (m3.xlarge or bigger)

3.1.4 Amazon Web Services reference material

Instance types

Instance costs


CHAPTER 4

Short read quality and trimming

Note: Reminder: if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!

4.1 Prepping the computer

Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.

4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:


cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.

4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in --

ln -fs /mnt/data/* .

These are FASTQ files -- let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
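For orientation, each FASTQ record is four lines: an '@' header, the sequence, a '+' separator, and a quality string the same length as the sequence. A made-up record looks roughly like this:

@HWI-ST330:304:H045HADXX:1:1101:1111:61397 1:N:0:ATCACG
TTAGCCAGGTATGGTGGCGGGCGCCTGTAATCCCAGCTACTCGGG
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ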

Questions:

• Why do the files have DNA in the name?
• Why are there R1 and R2 in the file names?
• Why don't we combine all the files?

Links:

• FASTQ Format

4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:


fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html

Questions:

• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?

Links:

• FastQC
• FastQC tutorial video

4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 \
     SLIDINGWINDOW:4:2 \
     MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully


Questions:

• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• Why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

• Trimmomatic

4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

• Is the quality trimmed data "better" than before?
• Does it matter that you still have adapters?

4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz


curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention --

• # characters mark comments.
• Anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop.
• We have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about (see the small example below).
• What's with 'orphans.fq'?
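To see what that substitution does, here is a tiny example you can paste into your shell (the filename is just an illustration):

base=$(basename 0Hour_ATCACG_L002_R1_001.extract.fastq.gz .extract.fastq.gz)
echo $base                # prints 0Hour_ATCACG_L002_R1_001
echo ${base/_R1_/_R2_}    # prints 0Hour_ATCACG_L002_R2_001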

Questions:

• How do you figure out if it's working?
  – copy/paste it from Word
  – put in lots of echo
  – edit one line at a time
• How on earth do you figure out how to do this?

4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next Running digital normalization


CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done


Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next Running the actual assembly


CHAPTER 6

Running the actual assembly

Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make

Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
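If you just want a quick count of how many transcripts Trinity produced, counting the FASTA header lines also works (an optional check, not part of the protocol proper):

grep -c '^>' trinity_out_dir/Trinity.fasta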

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
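To confirm the renaming worked, you can peek at the first few sequence names (optional):

gunzip -c trinity-nematostella-raw.renamed.fasta.gz | grep '^>' | head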


Next BLASTing your assembled data


CHAPTER 7

BLASTing your assembled data

First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing the zebrafish lethal giant larvae homolog.
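If you prefer the command line to the Web interface, the legacy blastall binary installed above can search the same database directly; here 'query.fa' is a hypothetical FASTA file containing your query sequence:

cd /home/ubuntu/blastkit
blastall -p blastn -i query.fa -d db/db.fa -e 1e-5 -o query-vs-assembly.txt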


CHAPTER 8

Annotation of denovo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details. (A quick way to count genes and transcripts is shown just below.)
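As an optional sanity check on the mapping (this assumes, as the script above produces, that column 1 is the gene and column 2 the transcript):

cut -f1 Trinity.fasta.gene_trans_map | sort -u | wc -l    # number of 'genes'
cut -f2 Trinity.fasta.gene_trans_map | sort -u | wc -l    # number of transcripts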

8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the Transdecoder software to identify the longest-ORF peptide


cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases.

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies.

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6

8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
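The --domtblout file begins with '#' comment lines; a quick, optional way to look at the first few domain hits is:

grep -v '^#' TrinotatePFAM.out | head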

2. We can predict other features like:

• signal peptides, using signalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER

8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl packages:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next, build an index file for your assembly:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nemaid7.tr4    269   0.000000     0     0     0.000000     0.000000     0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nemaid1.tr1    811   508.137307   1301  45    158.338092   252.711602   4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nemaid2.tr1    790   487.144836   1845  356   1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nemaid3.tr1    852   549.122606   1792  3     871.770849   1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nemaid4.tr1    675   372.190166   1005  20    88.963433    161.343106   2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nemaid62.tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689 1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nemaid63.tr13  406   103.720396   360   270   271.097003   1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nemaid61.tr13  447   144.526787   6     0     0.000000     0.000000     2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nemaid21.tr8   2075  1771.684102  2782  58    958.636395   1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03

9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:

csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
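As a quick sanity check, all four count files should have the same number of lines (one per transcript):

wc -l *_counts.txt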

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate
• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
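If you want a quick count of transcripts passing a given FDR threshold, something like this works on the CSV (assuming, as I do here, that the FDR-adjusted P-value is the last column and the first line is a header):

awk -F, 'NR > 1 && $NF < 0.05' nema-edgeR.csv | wc -l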

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
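It can be useful to compute the same number for a second sample and compare; the command is the same, just with a different input file (here one of the 6Hour files we already downloaded):

zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null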


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.

Next More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks      Used  Available Use% Mounted on
/dev/xvdb       38565344  20098736   16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 17: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

Now that yoursquove generated your PPK file from your PEM file you can log in To do that

Logging into your EC2 instance with Putty

Open up putty and enter your hostname into the Host Name box

Now go find the lsquoSSHrsquo section and enter your ppk file (generated above by puttygen) Then select lsquoOpenrsquo

Log in as ldquoubunturdquo

14 Chapter 3 Getting started with Amazon EC2

labibi Documentation Release 10

Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First find your private key file itrsquos the pem file you downloaded when starting up your EC2 instance It should be inyour Downloads folder Move it onto your desktop and rename it to lsquoamazonpemrsquo

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~Desktopamazonpem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~Desktopamazonpem ubuntuec2----compute-1amazonawscom

Here yoursquore logging in as user lsquoubuntursquo to the machine lsquoec2-174-129-122-189compute-1amazonawscomrsquo using theauthentication key located in lsquoamazonpemrsquo on your Desktop

Note you have to replace the stuff after the lsquorsquo sign with the name of the host see the red circle in

31 More details 15

labibi Documentation Release 10

At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs

16 Chapter 3 Getting started with Amazon EC2

CHAPTER 4

Short read quality and trimming

Note Reminder if yoursquore on Windows you should install mobaxterm

OK you should now be logged into your Amazon computer How exciting

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First letrsquos set up a place to work

sudo chmod a+rwxt mnt

This makes lsquomntrsquo a place where we can put data and working files

Next letrsquos install a few things

sudo apt-get updatesudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs which wersquoll use below along with some software prerequisites thatwersquoll need for other things below

42 Data source

Wersquore going to be using a subset of data from Tulin et al 2013 a paper looking at early transcription in the organismNematostella vectensis the sea anemone

43 1 Copying in some data to work with

Wersquove loaded subsets of the data onto an Amazon location for you to make everything faster for todayrsquos work Wersquoregoing to put the files on your computer locally under the directory mntdata

mkdir mntdata

Next letrsquos grab part of the data set

17

labibi Documentation Release 10

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_001extractfastqgz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001extractfastqgz-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002extractfastqgz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default UNIX makes things writeable by the file ownerLetrsquos fix that before we go on any further

chmod u-w

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First make a working directory this will be a place where you can futz around with a copy of the data without messingup your primary data

mkdir mntworkcd mntwork

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs mntdata

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001extractfastqgz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

Wersquore going to use FastQC to summarize the data We already installed lsquofastqcrsquo on our computer - thatrsquos what thelsquoapt-get installrsquo did above

Now run FastQC on two files

18 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

fastqc 0Hour_ATCACG_L002_R1_001extractfastqgzfastqc 0Hour_ATCACG_L002_R2_001extractfastqgz

Now type lsquolsrsquo

ls -d fastqc

to list the files and you should see

0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip

We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you

bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now wersquore going to do some trimming Wersquoll be using Trimmomatic which (as with fastqc) wersquove already installedvia apt-get

The first thing wersquoll need are the adapters to trim off

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-semi-2015-03-04TruSeq2-PEfa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001extractfastqgz 0Hour_ATCACG_L002_R2_001extractfastqgz

0Hour_ATCACG_L002_R1_001qcfqgz s1_se 0Hour_ATCACG_L002_R2_001qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

You should see output that looks like thisQuality encoding detected as phred33Input Read Pairs 140557 Both Surviving 138775 (9873) Forward Only Surviving 1776 (126) Reverse Only Surviving 6 (000) Dropped 0 (000)TrimmomaticPE Completed successfully

46 3 Trimmomatic 19

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz

20 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

And link it in

cd mntworkln -fs mntdatafastqgz

Now we have a lot of files ndash and we really donrsquot want to trim each and every one of them by typing in a command foreach pair Here wersquoll make use of a great feature of the UNIX command line ndash the ability to automate such tasks

Herersquos a for loop that you can run - wersquoll walk through what it does while itrsquos running

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention --

• # are comments

• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop

• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about (see the short sketch after this list)

• what's with 'orphans.fq'?
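
If it helps to see those two filename tricks in isolation, here is a tiny sketch you can paste straight into the shell; the filename is just one of the files we downloaded above, and nothing is trimmed or written by these commands:

# pick one example filename
filename=0Hour_ATCACG_L002_R1_002.extract.fastq.gz

# strip the trailing .extract.fastq.gz to get the "base"
base=$(basename $filename .extract.fastq.gz)
echo $base        # prints: 0Hour_ATCACG_L002_R1_002

# bash string substitution: replace _R1_ with _R2_
baseR2=${base/_R1_/_R2_}
echo $baseR2      # prints: 0Hour_ATCACG_L002_R2_002

# echoing the full command before running it is a cheap dry run
echo TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz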

Questions

• how do you figure out if it's working?

-- copy/paste it from Word


-- put in lots of echo

-- edit one line at a time

• how on earth do you figure out how to do this?

49 6 Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
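
If you want to reassure yourself that the interleaving worked, a quick optional check (my suggestion, not a required step) is to look at the first two records of one of the new files and confirm that the read names come in R1/R2 pairs:

# show the first two FASTQ records (4 lines each) of an interleaved file
gunzip -c 0Hour_ATCACG_L002001.pe.qc.fq.gz | head -8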

Next Running digital normalization


CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
   extract-paired-reads.py $filename
done

Put all the orphaned reads in one place

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done


Now if you type

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
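
If you are curious how much diginorm reduced the data, one quick optional check (assuming the trimmed .pe.qc.fq.gz files are still around, which they are until the cleanup step below) is to count reads before and after; each FASTQ record is four lines:

# reads in one interleaved, quality-trimmed file (before diginorm)
echo $(( $(gunzip -c 0Hour_ATCACG_L002001.pe.qc.fq.gz | wc -l) / 4 ))

# reads in the corresponding digitally normalized file (after diginorm)
echo $(( $(gunzip -c 0Hour_ATCACG_L002001.dn.fq.gz | wc -l) / 4 ))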

Let's remove some of the detritus before moving on:

rm *.pe.* *.se *.abundfilt *.keep
rm normC20k20.ct

Next Running the actual assembly


CHAPTER 6

Running the actual assembly

Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make

Go into the work directory and prepare the data

cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
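
Before kicking off the assembly, it's worth a quick optional sanity check that the two files contain the same number of reads, since Trinity expects left.fq and right.fq to be properly paired:

# both numbers should be identical (each read is 4 lines of FASTQ)
echo $(( $(wc -l < left.fq) / 4 ))
echo $(( $(wc -l < right.fq) / 4 ))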

Now run the Trinity assembler

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
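
If you'd like a quick look at how many sequences ended up in the renamed assembly (an optional check), you can count FASTA headers directly:

# every sequence record starts with '>'
gunzip -c trinity-nematostella-raw.renamed.fasta.gz | grep -c '^>'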


Next BLASTing your assembled data


CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now copy in your newly created transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing the zebrafish lethal giant larvae homolog.
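
If you prefer the command line to the web interface, you can also query the same database directly with the legacy blastall program we linked into /usr/local/bin above. Here query.fa is just a placeholder for any FASTA file of your own - it is not a file provided by this tutorial:

cd /home/ubuntu/blastkit
# nucleotide query against the nucleotide database built by formatdb above
blastall -p blastn -i query.fa -d db/db.fa -e 1e-6 -o query-hits.txt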


CHAPTER 8

Annotation of denovo transcriptome

81 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms

• The different (g's) could represent different genes (or parts of genes)

• The component (TR|c) often contains related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details
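
Since the map is a two-column, tab-separated file (gene identifier, then transcript identifier), a quick optional way to summarize it is to count the distinct entries in each column:

# number of distinct 'genes' (column 1) and transcripts (column 2)
cut -f1 Trinity.fasta.gene_trans_map | sort -u | wc -l
cut -f2 Trinity.fasta.gene_trans_map | sort -u | wc -l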

82 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install Transdecoder to do this job

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the Transdecoder software to identify the longest-ORF peptide


cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the Transdecoder output

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

83 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins.

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence.

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
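
Either way, once the outfmt6 files exist you can get a rough feel for how many transcripts and predicted proteins picked up a hit. This is just an optional sanity check; it relies on the fact that column 1 of BLAST's tabular (outfmt 6) output is the query identifier:

# number of distinct transcripts with a SwissProt blastx hit
cut -f1 blastx.outfmt6 | sort -u | wc -l

# number of distinct predicted proteins with a SwissProt blastp hit
cut -f1 blastp.outfmt6 | sort -u | wc -l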

84 Characterization of functional annotation features

1. To identify protein domains, we need to install HMMER and download the Pfam domains database.


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2 We can predict other features like

• signal peptides, using signalP

• transmembrane regions, using tmHMM

• rRNA transcripts, using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

install sqlite

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
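
If you are curious what actually ended up in the database, you can poke at it with the sqlite3 client we installed above; this is purely exploratory and not required for the report step below:

# list the tables Trinotate created
sqlite3 Trinotate.sqlite ".tables"

# and dump the schema if you want the details
sqlite3 Trinotate.sqlite ".schema" | less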

86 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation (an example combining both is shown after the list below):

-E <float>: maximum E-value for reporting best blast hit and associated annotations

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff
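
For example, a stricter report than the default one above might look like the following; the E-value and cutoff here are arbitrary choices just to illustrate the two arguments, and the output filename is only a suggestion:

~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-5 --pfam_cutoff DGC > trinotate_annotation_report.strict.xls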

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

91 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

92 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
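
If you want a quick summary of how each alignment went before moving on, one option (my suggestion, not a required step; samtools was installed earlier alongside bowtie) is to convert a SAM file to BAM and run flagstat on it:

# convert one SAM file to BAM, then summarize the alignments
samtools view -bS 0Hour_ATCACG_L002_001.extract.sam > 0Hour_ATCACG_L002_001.extract.bam
samtools flagstat 0Hour_ATCACG_L002_001.extract.bam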

93 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nemaid7tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nemaid1tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nemaid2tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nemaid3tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nemaid4tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nemaid62tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nemaid63tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nemaid61tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nemaid21tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
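
Because results.xprs is a plain tab-separated table, ordinary shell tools work on it. For instance, as an optional exploration you can list the most highly expressed transcripts by sorting on the fpkm column (column 11 in the header above):

# skip the header, sort numerically (descending) on fpkm, show target_id and fpkm
tail -n +2 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | sort -t$'\t' -k11,11gr | cut -f2,11 | head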

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
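
If you'd rather stay on the command line than open Excel, a quick way to pull out the genes that look significant is something like the following; this is just a suggestion, and it assumes the FDR-adjusted P-value is the last column of the CSV, as described above:

# keep rows whose last field (the FDR-adjusted P-value) is below 0.05
awk -F, 'NR > 1 && $NF < 0.05' nema-edgeR.csv | head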

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate


CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

112 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.

Next More resources


CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare


CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2 What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3 What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4 How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5 How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6 Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7 How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window

• open up a new one and connect into your Amazon machine

• type 'screen -r' to reconnect into your running screen

(See amazon/using-screen for a more complete rundown on instructions.)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.


144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 18: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

Declare victory

313 Logging into your new instance ldquoin the cloudrdquo (Mac version)

OK so yoursquove created a running computer How do you get to it

The main thing yoursquoll need is the network name of your new computer To retrieve this go to the instance view andclick on the instance and find the ldquoPublic DNSrdquo This is the public name of your computer on the Internet

Copy this name and connect to that computer with ssh under the username lsquoubuntursquo as follows

First find your private key file itrsquos the pem file you downloaded when starting up your EC2 instance It should be inyour Downloads folder Move it onto your desktop and rename it to lsquoamazonpemrsquo

Next start Terminal (in Applications Utilities) and type

chmod og-rwx ~Desktopamazonpem

to set the permissions on the private key file to ldquoclosed to all evildoersrdquo

Then type

ssh -i ~Desktopamazonpem ubuntuec2----compute-1amazonawscom

Here yoursquore logging in as user lsquoubuntursquo to the machine lsquoec2-174-129-122-189compute-1amazonawscomrsquo using theauthentication key located in lsquoamazonpemrsquo on your Desktop

Note you have to replace the stuff after the lsquorsquo sign with the name of the host see the red circle in

31 More details 15

labibi Documentation Release 10

At the end you should see text and a prompt that look like this

A final checklist

bull EC2 instance is running

bull used ubuntu 1404

bull NOT a micro instance (m3xlarge or bigger)

314 Amazon Web Services reference material

Instance types

Instance costs

16 Chapter 3 Getting started with Amazon EC2

CHAPTER 4

Short read quality and trimming

Note Reminder if yoursquore on Windows you should install mobaxterm

OK you should now be logged into your Amazon computer How exciting

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First letrsquos set up a place to work

sudo chmod a+rwxt mnt

This makes lsquomntrsquo a place where we can put data and working files

Next letrsquos install a few things

sudo apt-get updatesudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs which wersquoll use below along with some software prerequisites thatwersquoll need for other things below

42 Data source

Wersquore going to be using a subset of data from Tulin et al 2013 a paper looking at early transcription in the organismNematostella vectensis the sea anemone

43 1 Copying in some data to work with

Wersquove loaded subsets of the data onto an Amazon location for you to make everything faster for todayrsquos work Wersquoregoing to put the files on your computer locally under the directory mntdata

mkdir mntdata

Next letrsquos grab part of the data set

17

labibi Documentation Release 10

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_001extractfastqgz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001extractfastqgz-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002extractfastqgz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default UNIX makes things writeable by the file ownerLetrsquos fix that before we go on any further

chmod u-w

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First make a working directory this will be a place where you can futz around with a copy of the data without messingup your primary data

mkdir mntworkcd mntwork

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs mntdata

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001extractfastqgz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

Wersquore going to use FastQC to summarize the data We already installed lsquofastqcrsquo on our computer - thatrsquos what thelsquoapt-get installrsquo did above

Now run FastQC on two files

18 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

fastqc 0Hour_ATCACG_L002_R1_001extractfastqgzfastqc 0Hour_ATCACG_L002_R2_001extractfastqgz

Now type lsquolsrsquo

ls -d fastqc

to list the files and you should see

0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip

We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you

bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now wersquore going to do some trimming Wersquoll be using Trimmomatic which (as with fastqc) wersquove already installedvia apt-get

The first thing wersquoll need are the adapters to trim off

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-semi-2015-03-04TruSeq2-PEfa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001extractfastqgz 0Hour_ATCACG_L002_R2_001extractfastqgz

0Hour_ATCACG_L002_R1_001qcfqgz s1_se 0Hour_ATCACG_L002_R2_001qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

You should see output that looks like thisQuality encoding detected as phred33Input Read Pairs 140557 Both Surviving 138775 (9873) Forward Only Surviving 1776 (126) Reverse Only Surviving 6 (000) Dropped 0 (000)TrimmomaticPE Completed successfully

46 3 Trimmomatic 19

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz

20 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

And link it in

cd mntworkln -fs mntdatafastqgz

Now we have a lot of files ndash and we really donrsquot want to trim each and every one of them by typing in a command foreach pair Here wersquoll make use of a great feature of the UNIX command line ndash the ability to automate such tasks

Herersquos a for loop that you can run - wersquoll walk through what it does while itrsquos running

rm -f orphansfq

for filename in _R1_extractfastqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename extractfastqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

finally run TrimmomaticTrimmomaticPE $baseextractfastqgz $baseR2extractfastqgz

$baseqcfqgz s1_se $baseR2qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

save the orphanscat s1_se s2_se gtgt orphansfq

done

Things to mention ndash

bull are comments

bull anywhere you see a lsquo$rsquo is replaced by the value of the variable after it so eg $filename is replaced by each ofthe files matching _R1_extractfastqgz once for each time through the loop

bull we have to do complicated things to the filenames to get this to work which is what the $base_R1__R2_stuff is about

bull whatrsquos with lsquoorphansfqrsquo

Questions

bull how do you figure out if itrsquos working

ndash copypaste it from Word

48 5 Trim the rest of the sequences 21

labibi Documentation Release 10

ndash put in lots of echo

ndash edit one line at a time

bull how on earth do you figure out how to do this

49 6 Interleave the sequences

Next we need to take these R1 and R2 sequences and convert them into interleaved form for the next step To do thiswersquoll use scripts from the khmer package which we need to install

sudo pip install -U setuptoolssudo pip install khmer==13

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in _R1_qcfqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename qcfqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

construct the output filenameoutput=$base_R1_peqcfqgz

interleave-readspy $baseqcfqgz $baseR2qcfqgz | gzip gt $output

done

gzip orphansfq

Next Running digital normalization

22 Chapter 4 Short read quality and trimming

CHAPTER 5

Running digital normalization

Next wersquore going to apply abundance normalization to the data ndash known as ldquodigital normalizationrdquo this approach wasdeveloped by our lab to make it possible to assemble large data sets more quickly and easily You can read more aboutit in Brown et al 2012 and also see some of its affects on transcriptome assembly in Lowe et al 2014

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First wersquoll run it on the interleaved files we generated in the previous section

cd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

(These parameters should work for essentially all mRNAseq data sets see the khmer documentation for more infor-mation)

Next run diginorm on the orphaned reads (from trimming)

normalize-by-medianpy -l normC20k20ct -s normC20k20ct orphansfqgz

Do k-mer abundance trimming on the reads which will eliminate the majority of the errors (thus further decreasingthe memory requirements) ndash

filter-abundpy -V normC20k20ct keep

See our paper Zhang et al 2014 lthttpwwwncbinlmnihgovpubmed25062443gtlsquo__ Table 3 for more informationon k-mer trimming effects

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in pekeepabundfiltdo

extract-paired-readspy $filenamedone

Put all the orphaned reads in one place

cat se orphansfqgzkeepabundfilt | gzip gt orphansdnfqgz

And now rename the paired-end files to something nice

for filename in peqcfqgzkeepabundfiltpedo

base=$(basename $filename peqcfqgzkeepabundfiltpe)output=$basednfqgzgzip -c $filename gt $output

done

23

labibi Documentation Release 10

Now if you type

ls dnfqgz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001dnfqgz 6Hour_CGATGT_L002002dnfqgz0Hour_ATCACG_L002002dnfqgz 6Hour_CGATGT_L002003dnfqgz0Hour_ATCACG_L002003dnfqgz 6Hour_CGATGT_L002004dnfqgz0Hour_ATCACG_L002004dnfqgz 6Hour_CGATGT_L002005dnfqgz0Hour_ATCACG_L002005dnfqgz orphansdnfqgz6Hour_CGATGT_L002001dnfqgz

Letrsquos remove some of the detritus before moving on

rm pe se abundfilt keeprm normC20k20ct

Next Running the actual assembly

24 Chapter 5 Running digital normalization

CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next install Trinity v206

cdcurl -L httpsgithubcomtrinityrnaseqtrinityrnaseqarchivev206targz gt trinitytargztar xzf trinitytargzmv trinityrnaseq trinity

cd trinitymake

Go into the work directory and prepare the data

cd mntworkfor i in dnfqgzdo

split-paired-readspy $idone

cat 1 gt leftfqcat 2 gt rightfq

Now run the Trinity assembler

~trinityTrinity --left leftfq --right rightfq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dirTrinityfasta which you can get stats on like so

curl -L -O httpsgithubcomged-labkhmerrawv13sandboxassemstats3pypython assemstats3py 300 trinity_out_dirTrinityfasta

Change the filename and rename all the sequences

gzip -c trinity_out_dirTrinityfasta gt trinity-nematostella-rawfagzcurl -O http2015-may-nonmodelreadthedocsorgendev_staticrename-with-partitionspychmod u+x rename-with-partitionspyrename-with-partitionspy nema trinity-nematostella-rawfagz

This last command will give you trinity-nematostella-rawrenamedfastagz which contains all ofthe renamed sequences

25

labibi Documentation Release 10

Next BLASTing your assembled data

26 Chapter 6 Running the actual assembly

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+httpsgithubcomctbpygr-drawgit screed

sudo ln -s usrbinblastall usrlocalbin

Install the BLAST server and configure it

cdgit clone httpsgithubcomctbblastkitgit -b 2015-may-nonmodelsudo blastkitconfigure-lighttpdsh

cd blastkitwwwsudo ln -fs $PWD varwwwblastkit

mkdir fileschmod a+rxwt fileschmod +x homeubuntu

cd homeubuntublastkitpython checkpy

Now copy in your newly created transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt homeubuntublastkitdbdbfa

cd homeubuntublastkitformatdb -i dbdbfa -o T -p Fpython index-dbpy dbdbfa

You can now access your BLAST server at httpltamazon machine namegtblastkit

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set try BLASTing zebrafish lethal giant larvaehomolog

27

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff
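For example, to require a stricter E-value and use the domain trusted cutoff for Pfam assignments, the report step can be re-run along these lines (a sketch; check the usage message printed by ~/Trinotate/Trinotate for the exact option names in your version):

~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-10 --pfam_cutoff DTC > trinotate_annotation_report.xls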

Let us see the output. Open a new shell on your local machine:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
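If you would rather not type out each bowtie/eXpress pair by hand, the two steps can be driven with a single loop, in the same style as the trimming loops used earlier in this tutorial. This is only a sketch: it assumes the filename pattern above and the bowtie index and eXpress binary locations we just set up.

cd /mnt/work
for r1 in *_R1_*.extract.fastq.gz
do
    # construct the matching R2 filename and the output SAM name
    r2=${r1/_R1_/_R2_}
    sam=${r1/_R1_/_}
    sam=${sam%.fastq.gz}.sam

    bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat $r1) -2 <(zcat $r2) > $sam
    ~/express-1.5.1-linux_x86_64/express --no-bias-correct -o $sam-express trinity-nematostella-raw.renamed.fasta $sam
done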

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nema.id7.tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nema.id1.tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nema.id2.tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nema.id3.tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nema.id4.tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nema.id62.tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nema.id63.tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nema.id61.tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nema.id21.tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03

9.4 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one - diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
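If you want a quick count of how many transcripts look significant, you can filter the CSV on the command line; this assumes the FDR-adjusted P-value is the last column and that the file has a single header row:

awk -F, 'NR > 1 && $NF < 0.05' nema-edgeR.csv | wc -l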


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
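If you want to spot-check several samples at once, a small loop like this (a sketch, using the same index and filename pattern as above) prints each filename followed by bowtie2's alignment summary; the summary goes to the terminal via stderr, so only the alignments themselves are thrown away:

cd /mnt/work
for r1 in *_R1_*.extract.fastq.gz
do
    echo === $r1
    zcat $r1 | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done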


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' - for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a minimal skeleton is sketched after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
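For step 5, a qsub submission script is mostly just the shell script you already tested, with a few scheduler directives at the top. The sketch below is for a PBS/Torque-style cluster; the resource requests, module names, and script name are placeholders you would replace with whatever your cluster actually provides:

#!/bin/bash -login
#PBS -l walltime=24:00:00,nodes=1:ppn=4,mem=16gb
#PBS -N my_rnaseq_pipeline

# start in the directory the job was submitted from
cd ${PBS_O_WORKDIR}

# load whatever modules your site provides (names vary by cluster)
module load Trimmomatic
module load FastQC

# run the pipeline script you already tested on small data
bash my_pipeline.sh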

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options -

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
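For paired-end data, you generally want to subset both files the same way so the read pairs stay in sync; taking the same number of lines from the top of each file works as long as the two files list reads in the same order, which is the usual case for files straight off the sequencer. A sketch, assuming a matching SRR534005_2.fastq.gz exists alongside the first file:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_R1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_R2.fq.gz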


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) -

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.
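The exact command depends on what your sequencing core gives you; if they hand you a URL plus a username and password, the download usually looks something like this (the hostname, path, and credentials below are made-up placeholders):

cd /mnt/data
curl -O -L "ftp://username:password@ftp.example-core.edu/run123/Sample_R1.fastq.gz"
curl -O -L "ftp://username:password@ftp.example-core.edu/run123/Sample_R2.fastq.gz"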

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine - you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
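Once the larger EBS volume is attached to your instance, it still needs a filesystem and a mount point before /mnt points at it. The device name below (/dev/xvdf) is an assumption - check what actually appears with 'lsblk' - and note that making a filesystem erases whatever is on that volume:

# find the new device, put a filesystem on it, and mount it as /mnt
lsblk
sudo mkfs -t ext4 /dev/xvdf
sudo mount /dev/xvdf /mnt
df -k /mnt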


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.


CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4 How do we transfer our data to Amazon (or any remote computer)

There are two options -

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE, or iPlant, or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazonusing-screen for a more complete rundown on instructions)
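
If you end up running more than one long job, it can help to give each screen session a name. This is standard screen usage rather than anything specific to this workshop; a quick example:

screen -S diginorm       # start a new session named "diginorm"
# ... start your long-running command, then press ctrl-a d to detach
screen -list             # show which sessions exist
screen -r diginorm       # reconnect to that particular session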

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
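
For paired-end data you would want to subset R1 and R2 the same way, so that the two files stay in sync; for example (the filenames are placeholders for your own pair):

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_R1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_R2.fq.gz

This works because head takes the same number of records from the front of each file, and the reads are in the same order in both files.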


144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) -

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine - you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has about 40 GB of disk space.

To add disk space to your Amazon instance see this set of instructions

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data. A sketch of the formatting and mounting steps follows.
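
Once a bigger EBS volume is attached to the instance (following the instructions above), making it usable usually boils down to something like the following. The device name /dev/xvdf is an assumption and may differ on your instance; check first with 'lsblk' or 'df'.

sudo mkfs -t ext4 /dev/xvdf    # format the new volume (this destroys anything already on it)
sudo mount /dev/xvdf /mnt      # mount it as /mnt (umount /mnt first if something else is mounted there)
sudo chmod a+rwxt /mnt         # make it world-writeable, as before
df -k /mnt                     # confirm the new size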


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

CHAPTER 4

Short read quality and trimming

Note: Reminder - if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!

41 Prepping the computer

Before we do anything else we need to set up a place to work and install a few things

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.

42 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

43 1 Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:


cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.

44 1 Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in -

ln -fs /mnt/data/* .

These are FASTQ files - let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')

Question

• why do the files have DNA in the name?

• why are there R1 and R2 in the file names?

• why don't we combine all the files?

Links

bull FASTQ Format

45 2 FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now run FastQC on two files


fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files and you should see

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that, and we'll show you that tomorrow (there is also a small scp sketch just below). But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html

• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
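
If you would rather pull the reports down to your own laptop right away, something like the following, run from your local machine, should work. The key file and machine name are the same placeholders used elsewhere in this tutorial; substitute your own.

scp -i YOUR_SECURITY_KEY.pem -r ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/*_fastqc* .

This copies both the report directories and the .zip files into your current local directory, where you can open fastqc_report.html in a browser.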

Questions

• What should you pay attention to in the FastQC report?

• Which is "better", R1 or R2? And why?

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need is the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz 0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
    0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se 0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
    ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully


Questions

• How do you figure out what the parameters mean?

• How do you figure out what parameters to use?

• What adapters do you use?

• What version of Trimmomatic are we using here? (And FastQC?)

• Do you think parameters are different for RNAseq and genomic data sets?

• What's with these annoyingly long and complicated filenames?

• Why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html

• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions

• Is the quality trimmed data "better" than before?

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz


curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files - and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line - the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention -

• # marks a comment;

• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;

• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;

• what's with 'orphans.fq'?

Questions

• how do you figure out if it's working?

  - copy/paste it from Word


  - put in lots of echo

  - edit one line at a time

• how on earth do you figure out how to do this? (See the dry-run sketch below for one way to check a loop before running it.)
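
One easy way to check a loop like this before running it for real is a "dry run" that only echoes what it would do. This is ordinary shell practice, not a required workshop step:

for filename in *_R1_*.extract.fastq.gz
do
     base=$(basename $filename .extract.fastq.gz)
     baseR2=${base/_R1_/_R2_}
     echo "would run Trimmomatic on: ${base} + ${baseR2}"
done

Once the echoed pairs look right, put the real TrimmomaticPE command back in.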

49 6 Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next Running digital normalization


CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data - known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next run diginorm on the orphaned reads (from trimming)

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) -

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done


Now if you type

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step -

0Hour_ATCACG_L002001.dn.fq.gz    6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz    6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz    6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz    6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next Running the actual assembly


CHAPTER 6

Running the actual assembly

Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make

Go into the work directory and prepare the data

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now run the Trinity assembler

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.


Next BLASTing your assembled data


CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin

Install the BLAST server and configure it

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now copy in your newly created transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set, try BLASTing the zebrafish lethal giant larvae homolog.
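
If you prefer the command line to the web interface, you can also query the same database directly with the blastall binary installed above. This is an optional aside; query.fa here is a placeholder for a protein FASTA file of your choosing:

cd /home/ubuntu/blastkit
blastall -p tblastn -i query.fa -d db/db.fa -e 1e-6 -m 8 -o query-vs-assembly.tsv

The -m 8 flag gives tabular output, one hit per line, which is easy to sort and filter with standard UNIX tools.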


CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components genes and isoforms

• The different (i's) that correspond to the same (g) represent isoforms.

• The different (g's) could represent different genes (or parts of genes).

• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install Transdecoder to do this job

cd
sudo cpan URI::Escape

Note type yes for all interactive questions

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the Transdecoder output

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

83 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, and rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already; you can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6

84 Characterization of functional annotation features

1. To identify protein domains, we need to install HMMER and download the Pfam domains database:


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like the following (example commands are sketched after this list):

• signal peptides, using signalP

• transmembrane regions, using tmHMM

• rRNA transcripts, using RNAMMER
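
None of these three tools is installed as part of this tutorial, and each has its own license and download process. If you do install them, the invocations typically look something like the following; treat these as hedged examples to check against each tool's own documentation, not as tested workshop commands. Their outputs can later be loaded into Trinotate with its LOAD_signalp and LOAD_tmhmm commands.

# signal peptide prediction on the TransDecoder peptides (signalP must be installed separately)
signalp -f short -n signalp.out Trinity.fasta.transdecoder_dir/longest_orfs.pep

# transmembrane region prediction (tmHMM must be installed separately)
tmhmm --short < Trinity.fasta.transdecoder_dir/longest_orfs.pep > tmhmm.out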

85 Integration of all annotations into one database

install Trinotate

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

install sqlite

sudo apt-get install sqlite3

We also need the DBI and DBD::SQLite perl packages:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Load the BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load the Uniref90 blast hits:


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: load the functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

86 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff

Let us look at the output. Open a new shell on your local computer and copy the report down:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

91 Download Express

Now get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

92 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

93 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id  target_id     length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nemaid7tr4    269     0.000000     0           0            0.000000     0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nemaid1tr1    811     508.137307   1301        45           158.338092   252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nemaid2tr1    790     487.144836   1845        356          1218.927626  1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nemaid3tr1    852     549.122606   1792        3            871.770849   1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nemaid4tr1    675     372.190166   1005        20           88.963433    161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nemaid62tr13  2150    1846.657210  9921        9825         9919.902997  11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nemaid63tr13  406     103.720396   360         270          271.097003   1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nemaid61tr13  447     144.526787   6           0            0.000000     0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nemaid21tr8   2075    1771.684102  2782        58           958.636395   1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one - diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

bull False Discovery Rate

bull Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
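
If you want this check for every sample rather than just one, the same command can be wrapped in a small loop; this is just a convenience sketch, not a required step:

for filename in *_R1_*.extract.fastq.gz
do
     echo $filename
     zcat $filename | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done

bowtie2 prints its alignment summary to stderr, so you will still see the "overall alignment rate" line for each file even though the alignments themselves are discarded.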

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 21: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_001extractfastqgz

Now if you type

ls -l

you should see something like

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001extractfastqgz-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002extractfastqgz

These are subsets of the original data where we selected for reads that belong to a few particular transcripts

One problem with these files is that they are writeable - by default UNIX makes things writeable by the file ownerLetrsquos fix that before we go on any further

chmod u-w

Wersquoll talk about what these files are below

44 1 Copying data into a working location

First make a working directory this will be a place where you can futz around with a copy of the data without messingup your primary data

mkdir mntworkcd mntwork

Now make a ldquovirtual copyrdquo of the data in your working directory by linking it in ndash

ln -fs mntdata

These are FASTQ files ndash letrsquos take a look at them

less 0Hour_ATCACG_L002_R1_001extractfastqgz

(use the spacebar to scroll down and type lsquoqrsquo to exit lsquolessrsquo)

Question

bull why do the files have DNA in the name

bull why are there R1 and R2 in the file names

bull why donrsquot we combine all the files

Links

bull FASTQ Format

45 2 FastQC

Wersquore going to use FastQC to summarize the data We already installed lsquofastqcrsquo on our computer - thatrsquos what thelsquoapt-get installrsquo did above

Now run FastQC on two files

18 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

fastqc 0Hour_ATCACG_L002_R1_001extractfastqgzfastqc 0Hour_ATCACG_L002_R2_001extractfastqgz

Now type lsquolsrsquo

ls -d fastqc

to list the files and you should see

0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip

We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you

bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now wersquore going to do some trimming Wersquoll be using Trimmomatic which (as with fastqc) wersquove already installedvia apt-get

The first thing wersquoll need are the adapters to trim off

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-semi-2015-03-04TruSeq2-PEfa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001extractfastqgz 0Hour_ATCACG_L002_R2_001extractfastqgz

0Hour_ATCACG_L002_R1_001qcfqgz s1_se 0Hour_ATCACG_L002_R2_001qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

You should see output that looks like thisQuality encoding detected as phred33Input Read Pairs 140557 Both Surviving 138775 (9873) Forward Only Surviving 1776 (126) Reverse Only Surviving 6 (000) Dropped 0 (000)TrimmomaticPE Completed successfully

46 3 Trimmomatic 19

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz


curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE $base.extract.fastq.gz $baseR2.extract.fastq.gz \
        $base.qc.fq.gz s1_se $baseR2.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention –

• # are comments;

• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;

• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about (there is a small stand-alone example below);

• what's with 'orphans.fq'?
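Here is a tiny, self-contained sketch of the two filename tricks used in the loop; the sample filename is made up, so you can paste this into your shell without touching any real data:

filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# basename strips a trailing suffix -> 0Hour_ATCACG_L002_R1_001
base=$(basename $filename .extract.fastq.gz)
echo $base

# ${var/pattern/replacement} swaps _R1_ for _R2_ -> 0Hour_ATCACG_L002_R2_001
baseR2=${base/_R1_/_R2_}
echo $baseR2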

Questions

• how do you figure out if it's working?

  – copy/paste it from Word


  – put in lots of echo

  – edit one line at a time

• how on earth do you figure out how to do this?

4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop!

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py $base.qc.fq.gz $baseR2.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next Running digital normalization


CHAPTER 5

Running digital normalization

Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly
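If you want to see what normalization does to your own files, a simple before/after read count is enough. This is just a sketch; zcat -f passes uncompressed files through unchanged, so you can re-run the same loop later on the .keep files that diginorm produces:

cd /mnt/work
for f in *.pe.qc.fq.gz
do
    # a FASTQ record is 4 lines, so reads = lines / 4
    echo "$f": $(( $(zcat -f "$f" | wc -l) / 4 )) reads
done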

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=$base.dn.fq.gz
     gzip -c $filename > $output
done


Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next Running the actual assembly


CHAPTER 6

Running the actual assembly

Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make

Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
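If you're curious what the renaming actually did, you can peek at the first few sequence names (a quick sanity check, not a required step):

gunzip -c trinity-nematostella-raw.renamed.fasta.gz | grep '^>' | head -5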


Next BLASTing your assembled data


CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py

Now copy in your newly created transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing the zebrafish lethal giant larvae homolog.
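You can also skip the Web interface and search from the command line with the legacy blastall that the blast2 package installed above. This is a sketch rather than part of the protocol; query.fa is a placeholder for whatever FASTA file of query sequences you want to use:

cd /home/ubuntu/blastkit
# nucleotide query against the nucleotide transcriptome database built above
blastall -p blastn -d db/db.fa -i query.fa -e 1e-5 -o query-vs-assembly.txt
# for a protein query (e.g. the zebrafish lethal giant larvae protein), use tblastn instead
blastall -p tblastn -d db/db.fa -i query.fa -e 1e-5 -o query-vs-assembly.txt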


CHAPTER 8

Annotation of denovo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.

• The different (g's) could represent different genes (or parts of genes).

• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details; a concrete example of the naming is shown below.
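Concretely, each line of the map is a tab-separated gene identifier followed by a transcript (isoform) identifier. With this version of Trinity, the lines look roughly like the made-up example below; your component, gene, and isoform numbers will differ:

TR1|c0_g1	TR1|c0_g1_i1
TR1|c0_g1	TR1|c0_g1_i2
TR2|c0_g1	TR2|c0_g1_i1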

8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the Transdecoder software to identify the longest-ORF peptide


cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
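The -outfmt 6 output is plain tab-separated text with twelve standard columns (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score), so you can sanity-check a search as soon as it finishes:

head -3 blastx.outfmt6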

2. Optional: Uniref90, which provides clustered sets of protein sequences, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/raw/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/raw/master/_static/uniref90.blastx.outfmt6

8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:


sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI Perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation (see the example after this list):

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff
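Putting those together, a stricter report than the default one above might look like this; the thresholds here are arbitrary placeholders, so pick values that make sense for your project:

~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-5 --pfam_cutoff DNC > trinotate_annotation_report.xls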

Let us see the output. Open a new shell on your local computer:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all of the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

...and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next, build an index file for your assembly:


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nema.id7.tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nema.id1.tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nema.id2.tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nema.id3.tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nema.id4.tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nema.id62.tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nema.id63.tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nema.id61.tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nema.id21.tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03

9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

bull False Discovery Rate

bull Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log CPM, P-value, and FDR-adjusted P-value.
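If you want a quick look at the most significant genes without opening Excel, you can sort the CSV by the FDR column. This sketch assumes the FDR-adjusted P-value is the last (fifth) column and that the first line is a header, so adjust it if your file differs:

tail -n +2 nema-edgeR.csv | sort -t , -k 5,5g | head -20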

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
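To get a rough mapping rate for every sample rather than just one, you can wrap the same command in a loop, in the same style as the loops used earlier; it still only uses the first 100,000 reads of each file, to keep it quick:

cd /mnt/work
for filename in *_R1_*.extract.fastq.gz
do
    echo $filename
    zcat $filename | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done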


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URI. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.

Next More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
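If your data is paired-end and you want the subset to stay properly paired, take the same number of lines from both files; the _2 filename below is an assumption about how the mate file is named, so adjust it to your own data:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_2.fq.gz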


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 22: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

fastqc 0Hour_ATCACG_L002_R1_001extractfastqgzfastqc 0Hour_ATCACG_L002_R2_001extractfastqgz

Now type lsquolsrsquo

ls -d fastqc

to list the files and you should see

0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip

We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you

bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml

Questions

bull What should you pay attention to in the FastQC report

bull Which is ldquobetterrdquo R1 or R2 And why

Links

bull FastQC

bull FastQC tutorial video

46 3 Trimmomatic

Now wersquore going to do some trimming Wersquoll be using Trimmomatic which (as with fastqc) wersquove already installedvia apt-get

The first thing wersquoll need are the adapters to trim off

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-semi-2015-03-04TruSeq2-PEfa

Now to run Trimmomatic

TrimmomaticPE 0Hour_ATCACG_L002_R1_001extractfastqgz 0Hour_ATCACG_L002_R2_001extractfastqgz

0Hour_ATCACG_L002_R1_001qcfqgz s1_se 0Hour_ATCACG_L002_R2_001qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

You should see output that looks like thisQuality encoding detected as phred33Input Read Pairs 140557 Both Surviving 138775 (9873) Forward Only Surviving 1776 (126) Reverse Only Surviving 6 (000) Dropped 0 (000)TrimmomaticPE Completed successfully

46 3 Trimmomatic 19

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz

20 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

And link it in

cd mntworkln -fs mntdatafastqgz

Now we have a lot of files ndash and we really donrsquot want to trim each and every one of them by typing in a command foreach pair Here wersquoll make use of a great feature of the UNIX command line ndash the ability to automate such tasks

Herersquos a for loop that you can run - wersquoll walk through what it does while itrsquos running

rm -f orphansfq

for filename in _R1_extractfastqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename extractfastqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

finally run TrimmomaticTrimmomaticPE $baseextractfastqgz $baseR2extractfastqgz

$baseqcfqgz s1_se $baseR2qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

save the orphanscat s1_se s2_se gtgt orphansfq

done

Things to mention ndash

bull are comments

bull anywhere you see a lsquo$rsquo is replaced by the value of the variable after it so eg $filename is replaced by each ofthe files matching _R1_extractfastqgz once for each time through the loop

bull we have to do complicated things to the filenames to get this to work which is what the $base_R1__R2_stuff is about

bull whatrsquos with lsquoorphansfqrsquo

Questions

bull how do you figure out if itrsquos working

ndash copypaste it from Word

48 5 Trim the rest of the sequences 21

labibi Documentation Release 10

ndash put in lots of echo

ndash edit one line at a time

bull how on earth do you figure out how to do this

49 6 Interleave the sequences

Next we need to take these R1 and R2 sequences and convert them into interleaved form for the next step To do thiswersquoll use scripts from the khmer package which we need to install

sudo pip install -U setuptoolssudo pip install khmer==13

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in _R1_qcfqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename qcfqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

construct the output filenameoutput=$base_R1_peqcfqgz

interleave-readspy $baseqcfqgz $baseR2qcfqgz | gzip gt $output

done

gzip orphansfq

Next Running digital normalization

22 Chapter 4 Short read quality and trimming

CHAPTER 5

Running digital normalization

Next wersquore going to apply abundance normalization to the data ndash known as ldquodigital normalizationrdquo this approach wasdeveloped by our lab to make it possible to assemble large data sets more quickly and easily You can read more aboutit in Brown et al 2012 and also see some of its affects on transcriptome assembly in Lowe et al 2014

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First wersquoll run it on the interleaved files we generated in the previous section

cd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

(These parameters should work for essentially all mRNAseq data sets see the khmer documentation for more infor-mation)

Next run diginorm on the orphaned reads (from trimming)

normalize-by-medianpy -l normC20k20ct -s normC20k20ct orphansfqgz

Do k-mer abundance trimming on the reads which will eliminate the majority of the errors (thus further decreasingthe memory requirements) ndash

filter-abundpy -V normC20k20ct keep

See our paper Zhang et al 2014 lthttpwwwncbinlmnihgovpubmed25062443gtlsquo__ Table 3 for more informationon k-mer trimming effects

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in pekeepabundfiltdo

extract-paired-readspy $filenamedone

Put all the orphaned reads in one place

cat se orphansfqgzkeepabundfilt | gzip gt orphansdnfqgz

And now rename the paired-end files to something nice

for filename in peqcfqgzkeepabundfiltpedo

base=$(basename $filename peqcfqgzkeepabundfiltpe)output=$basednfqgzgzip -c $filename gt $output

done

23

labibi Documentation Release 10

Now if you type

ls dnfqgz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001dnfqgz 6Hour_CGATGT_L002002dnfqgz0Hour_ATCACG_L002002dnfqgz 6Hour_CGATGT_L002003dnfqgz0Hour_ATCACG_L002003dnfqgz 6Hour_CGATGT_L002004dnfqgz0Hour_ATCACG_L002004dnfqgz 6Hour_CGATGT_L002005dnfqgz0Hour_ATCACG_L002005dnfqgz orphansdnfqgz6Hour_CGATGT_L002001dnfqgz

Letrsquos remove some of the detritus before moving on

rm pe se abundfilt keeprm normC20k20ct

Next Running the actual assembly

24 Chapter 5 Running digital normalization

CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next install Trinity v206

cdcurl -L httpsgithubcomtrinityrnaseqtrinityrnaseqarchivev206targz gt trinitytargztar xzf trinitytargzmv trinityrnaseq trinity

cd trinitymake

Go into the work directory and prepare the data

cd mntworkfor i in dnfqgzdo

split-paired-readspy $idone

cat 1 gt leftfqcat 2 gt rightfq

Now run the Trinity assembler

~trinityTrinity --left leftfq --right rightfq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dirTrinityfasta which you can get stats on like so

curl -L -O httpsgithubcomged-labkhmerrawv13sandboxassemstats3pypython assemstats3py 300 trinity_out_dirTrinityfasta

Change the filename and rename all the sequences

gzip -c trinity_out_dirTrinityfasta gt trinity-nematostella-rawfagzcurl -O http2015-may-nonmodelreadthedocsorgendev_staticrename-with-partitionspychmod u+x rename-with-partitionspyrename-with-partitionspy nema trinity-nematostella-rawfagz

This last command will give you trinity-nematostella-rawrenamedfastagz which contains all ofthe renamed sequences

25

labibi Documentation Release 10

Next BLASTing your assembled data

26 Chapter 6 Running the actual assembly

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+httpsgithubcomctbpygr-drawgit screed

sudo ln -s usrbinblastall usrlocalbin

Install the BLAST server and configure it

cdgit clone httpsgithubcomctbblastkitgit -b 2015-may-nonmodelsudo blastkitconfigure-lighttpdsh

cd blastkitwwwsudo ln -fs $PWD varwwwblastkit

mkdir fileschmod a+rxwt fileschmod +x homeubuntu

cd homeubuntublastkitpython checkpy

Now copy in your newly created transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt homeubuntublastkitdbdbfa

cd homeubuntublastkitformatdb -i dbdbfa -o T -p Fpython index-dbpy dbdbfa

You can now access your BLAST server at httpltamazon machine namegtblastkit

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set try BLASTing zebrafish lethal giant larvaehomolog

27

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

91 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

92 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
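If you would rather not type one bowtie command per sample, the same thing can be done with a for loop in the style used earlier for trimming. This is only a sketch; it assumes the paired files keep the _R1_/_R2_ naming shown above:

for r1 in *_R1_00[12].extract.fastq.gz
do
   # derive the matching R2 file and the output SAM name from the R1 filename
   r2=${r1/_R1_/_R2_}
   out=${r1/_R1_/_}
   out=${out%.fastq.gz}.sam
   bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
       -1 <(zcat $r1) -2 <(zcat $r2) > $out
done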

93 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nema.id7.tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nema.id1.tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nema.id2.tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nema.id3.tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nema.id4.tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nema.id62.tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nema.id63.tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nema.id61.tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nema.id21.tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
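If you just want to eyeball the most highly expressed transcripts in one sample, you can sort on the tpm column. This is a sketch that assumes the tab-separated layout shown above, with tpm as the 15th (last) column:

# skip the header line, then sort numerically (descending) on the tpm column
tail -n +2 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | sort -t$'\t' -k15,15gr | head -5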

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
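If you want a quick command-line look at the most significant genes, you can sort the CSV on its final column. This is a sketch; it assumes nema-edgeR.csv has a header row and that the FDR-adjusted P-value is the fifth and last column, as described above:

# print the header, then the ten rows with the smallest FDR-adjusted P-values
head -1 nema-edgeR.csv
tail -n +2 nema-edgeR.csv | sort -t, -k5,5g | head -10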

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
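To run the same spot check on every sample instead of just one, you can loop over the R1 files; bowtie2 prints its alignment summary to stderr, so echoing the filename first keeps the output readable. A sketch:

for r1 in *_R1_*.extract.fastq.gz
do
   echo "=== $r1"
   # map the first 100,000 reads and throw away the alignments themselves
   zcat $r1 | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done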


CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

112 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada
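Once the files are down, it is worth confirming that nothing was truncated in transit. If your facility provides MD5 checksums you can verify against them; the md5sums.txt filename here is just a placeholder for whatever they give you:

# record checksums for your own notes, or verify against the facility's list
md5sum *.fastq.gz
md5sum -c md5sums.txt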

113 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.) -- see the sketch after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
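For step 5, a minimal qsub script might look something like the sketch below. The resource requests, module name, and the trim-commands.sh filename are all placeholders to adapt to your own cluster and pipeline:

#!/bin/bash -login
#PBS -l walltime=04:00:00,nodes=1:ppn=4,mem=16gb
#PBS -N small-test

# load whatever modules your cluster provides for the tools you use
module load Trimmomatic

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# the commands you already tested interactively, saved as a script
bash trim-commands.sh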

Next: More resources


CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window

• open up a new one and connect into your Amazon machine

• type 'screen -r' to reconnect into your running screen

(See amazon/using-screen for a more complete rundown on instructions.)
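If you end up juggling several long-running jobs, giving each screen session a name makes them easier to tell apart; these are standard screen options:

screen -S diginorm     # start a session named 'diginorm'
screen -ls             # list running sessions
screen -r diginorm     # reattach to the named session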

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
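For paired-end data you will usually want to subset both files the same way so that the pairs stay matched; a sketch, where SRR534005_2.fastq.gz stands in for the matching R2 file:

# take the same number of records from each file so read 1 and read 2 stay in sync
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_2.fq.gz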


144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine

2. you need to be prepared to let things run for a long time

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
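Whatever size you choose, it is worth keeping an eye on how much space the intermediate files are using as the analysis runs; 'du' and 'df' cover this:

# space used by the work directory, and space left on the volume
du -sh /mnt/work
df -h /mnt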


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

Page 23: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

Questions

bull How do you figure out what the parameters mean

bull How do you figure out what parameters to use

bull What adapters do you use

bull What version of Trimmomatic are we using here (And FastQC)

bull Do you think parameters are different for RNAseq and genomic data sets

bull Whatrsquos with these annoyingly long and complicated filenames

bull why are we running R1 and R2 together

For a discussion of optimal RNAseq trimming strategies see MacManes 2014

Links

bull Trimmomatic

47 4 FastQC again

Run FastQC again on the trimmed files

fastqc 0Hour_ATCACG_L002_R1_001qcfqgzfastqc 0Hour_ATCACG_L002_R2_001qcfqgz

And now view my copies of these files

bull 0Hour_ATCACG_L002_R1_001qcfq_fastqcfastqc_reporthtml

bull 0Hour_ATCACG_L002_R2_001qcfq_fastqcfastqc_reporthtml

Letrsquos take a look at the output files

less 0Hour_ATCACG_L002_R1_001qcfqgz

(again use spacebar to scroll lsquoqrsquo to exit less)

Questions

bull is the quality trimmed data ldquobetterrdquo than before

bull Does it matter that you still have adapters

48 5 Trim the rest of the sequences

First download the rest of the data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgz

20 Chapter 4 Short read quality and trimming

labibi Documentation Release 10

curl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

And link it in

cd mntworkln -fs mntdatafastqgz

Now we have a lot of files ndash and we really donrsquot want to trim each and every one of them by typing in a command foreach pair Here wersquoll make use of a great feature of the UNIX command line ndash the ability to automate such tasks

Herersquos a for loop that you can run - wersquoll walk through what it does while itrsquos running

rm -f orphansfq

for filename in _R1_extractfastqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename extractfastqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

finally run TrimmomaticTrimmomaticPE $baseextractfastqgz $baseR2extractfastqgz

$baseqcfqgz s1_se $baseR2qcfqgz s2_se ILLUMINACLIPTruSeq2-PEfa24015 LEADING2 TRAILING2 SLIDINGWINDOW42 MINLEN25

save the orphanscat s1_se s2_se gtgt orphansfq

done

Things to mention ndash

bull are comments

bull anywhere you see a lsquo$rsquo is replaced by the value of the variable after it so eg $filename is replaced by each ofthe files matching _R1_extractfastqgz once for each time through the loop

bull we have to do complicated things to the filenames to get this to work which is what the $base_R1__R2_stuff is about

bull whatrsquos with lsquoorphansfqrsquo

Questions

bull how do you figure out if itrsquos working

ndash copypaste it from Word

48 5 Trim the rest of the sequences 21

labibi Documentation Release 10

ndash put in lots of echo

ndash edit one line at a time

bull how on earth do you figure out how to do this

49 6 Interleave the sequences

Next we need to take these R1 and R2 sequences and convert them into interleaved form for the next step To do thiswersquoll use scripts from the khmer package which we need to install

sudo pip install -U setuptoolssudo pip install khmer==13

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in _R1_qcfqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename qcfqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

construct the output filenameoutput=$base_R1_peqcfqgz

interleave-readspy $baseqcfqgz $baseR2qcfqgz | gzip gt $output

done

gzip orphansfq

Next Running digital normalization

22 Chapter 4 Short read quality and trimming

CHAPTER 5

Running digital normalization

Next wersquore going to apply abundance normalization to the data ndash known as ldquodigital normalizationrdquo this approach wasdeveloped by our lab to make it possible to assemble large data sets more quickly and easily You can read more aboutit in Brown et al 2012 and also see some of its affects on transcriptome assembly in Lowe et al 2014

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First wersquoll run it on the interleaved files we generated in the previous section

cd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

(These parameters should work for essentially all mRNAseq data sets see the khmer documentation for more infor-mation)

Next run diginorm on the orphaned reads (from trimming)

normalize-by-medianpy -l normC20k20ct -s normC20k20ct orphansfqgz

Do k-mer abundance trimming on the reads which will eliminate the majority of the errors (thus further decreasingthe memory requirements) ndash

filter-abundpy -V normC20k20ct keep

See our paper Zhang et al 2014 lthttpwwwncbinlmnihgovpubmed25062443gtlsquo__ Table 3 for more informationon k-mer trimming effects

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in pekeepabundfiltdo

extract-paired-readspy $filenamedone

Put all the orphaned reads in one place

cat se orphansfqgzkeepabundfilt | gzip gt orphansdnfqgz

And now rename the paired-end files to something nice

for filename in peqcfqgzkeepabundfiltpedo

base=$(basename $filename peqcfqgzkeepabundfiltpe)output=$basednfqgzgzip -c $filename gt $output

done

23

labibi Documentation Release 10

Now if you type

ls dnfqgz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001dnfqgz 6Hour_CGATGT_L002002dnfqgz0Hour_ATCACG_L002002dnfqgz 6Hour_CGATGT_L002003dnfqgz0Hour_ATCACG_L002003dnfqgz 6Hour_CGATGT_L002004dnfqgz0Hour_ATCACG_L002004dnfqgz 6Hour_CGATGT_L002005dnfqgz0Hour_ATCACG_L002005dnfqgz orphansdnfqgz6Hour_CGATGT_L002001dnfqgz

Letrsquos remove some of the detritus before moving on

rm pe se abundfilt keeprm normC20k20ct

Next Running the actual assembly

24 Chapter 5 Running digital normalization

CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next install Trinity v206

cdcurl -L httpsgithubcomtrinityrnaseqtrinityrnaseqarchivev206targz gt trinitytargztar xzf trinitytargzmv trinityrnaseq trinity

cd trinitymake

Go into the work directory and prepare the data

cd mntworkfor i in dnfqgzdo

split-paired-readspy $idone

cat 1 gt leftfqcat 2 gt rightfq

Now run the Trinity assembler

~trinityTrinity --left leftfq --right rightfq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dirTrinityfasta which you can get stats on like so

curl -L -O httpsgithubcomged-labkhmerrawv13sandboxassemstats3pypython assemstats3py 300 trinity_out_dirTrinityfasta

Change the filename and rename all the sequences

gzip -c trinity_out_dirTrinityfasta gt trinity-nematostella-rawfagzcurl -O http2015-may-nonmodelreadthedocsorgendev_staticrename-with-partitionspychmod u+x rename-with-partitionspyrename-with-partitionspy nema trinity-nematostella-rawfagz

This last command will give you trinity-nematostella-rawrenamedfastagz which contains all ofthe renamed sequences

25

labibi Documentation Release 10

Next BLASTing your assembled data

26 Chapter 6 Running the actual assembly

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+httpsgithubcomctbpygr-drawgit screed

sudo ln -s usrbinblastall usrlocalbin

Install the BLAST server and configure it

cdgit clone httpsgithubcomctbblastkitgit -b 2015-may-nonmodelsudo blastkitconfigure-lighttpdsh

cd blastkitwwwsudo ln -fs $PWD varwwwblastkit

mkdir fileschmod a+rxwt fileschmod +x homeubuntu

cd homeubuntublastkitpython checkpy

Now copy in your newly created transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt homeubuntublastkitdbdbfa

cd homeubuntublastkitformatdb -i dbdbfa -o T -p Fpython index-dbpy dbdbfa

You can now access your BLAST server at httpltamazon machine namegtblastkit

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set try BLASTing zebrafish lethal giant larvaehomolog

27

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

Page 25: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

ndash put in lots of echo

ndash edit one line at a time

bull how on earth do you figure out how to do this

49 6 Interleave the sequences

Next we need to take these R1 and R2 sequences and convert them into interleaved form for the next step To do thiswersquoll use scripts from the khmer package which we need to install

sudo pip install -U setuptoolssudo pip install khmer==13

Now letrsquos use a for loop again - you might notice this is only a minor modification of the previous for loop

for filename in _R1_qcfqgzdo

first make the base by removing extractfastqgzbase=$(basename $filename qcfqgz)echo $base

now construct the R2 filename by replacing R1 with R2baseR2=$base_R1__R2_echo $baseR2

construct the output filenameoutput=$base_R1_peqcfqgz

interleave-readspy $baseqcfqgz $baseR2qcfqgz | gzip gt $output

done

gzip orphansfq

Next Running digital normalization

22 Chapter 4 Short read quality and trimming

CHAPTER 5

Running digital normalization

Next wersquore going to apply abundance normalization to the data ndash known as ldquodigital normalizationrdquo this approach wasdeveloped by our lab to make it possible to assemble large data sets more quickly and easily You can read more aboutit in Brown et al 2012 and also see some of its affects on transcriptome assembly in Lowe et al 2014

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly

First wersquoll run it on the interleaved files we generated in the previous section

cd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

(These parameters should work for essentially all mRNAseq data sets see the khmer documentation for more infor-mation)

Next run diginorm on the orphaned reads (from trimming)

normalize-by-medianpy -l normC20k20ct -s normC20k20ct orphansfqgz

Do k-mer abundance trimming on the reads which will eliminate the majority of the errors (thus further decreasingthe memory requirements) ndash

filter-abundpy -V normC20k20ct keep

See our paper Zhang et al 2014 lthttpwwwncbinlmnihgovpubmed25062443gtlsquo__ Table 3 for more informationon k-mer trimming effects

Now take all of the paired-end files and split them into paired and orphaned reads

for filename in pekeepabundfiltdo

extract-paired-readspy $filenamedone

Put all the orphaned reads in one place

cat se orphansfqgzkeepabundfilt | gzip gt orphansdnfqgz

And now rename the paired-end files to something nice

for filename in peqcfqgzkeepabundfiltpedo

base=$(basename $filename peqcfqgzkeepabundfiltpe)output=$basednfqgzgzip -c $filename gt $output

done

23

labibi Documentation Release 10

Now if you type

ls dnfqgz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001dnfqgz 6Hour_CGATGT_L002002dnfqgz0Hour_ATCACG_L002002dnfqgz 6Hour_CGATGT_L002003dnfqgz0Hour_ATCACG_L002003dnfqgz 6Hour_CGATGT_L002004dnfqgz0Hour_ATCACG_L002004dnfqgz 6Hour_CGATGT_L002005dnfqgz0Hour_ATCACG_L002005dnfqgz orphansdnfqgz6Hour_CGATGT_L002001dnfqgz

Letrsquos remove some of the detritus before moving on

rm pe se abundfilt keeprm normC20k20ct

Next Running the actual assembly

24 Chapter 5 Running digital normalization

CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next install Trinity v206

cdcurl -L httpsgithubcomtrinityrnaseqtrinityrnaseqarchivev206targz gt trinitytargztar xzf trinitytargzmv trinityrnaseq trinity

cd trinitymake

Go into the work directory and prepare the data

cd mntworkfor i in dnfqgzdo

split-paired-readspy $idone

cat 1 gt leftfqcat 2 gt rightfq

Now run the Trinity assembler

~trinityTrinity --left leftfq --right rightfq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dirTrinityfasta which you can get stats on like so

curl -L -O httpsgithubcomged-labkhmerrawv13sandboxassemstats3pypython assemstats3py 300 trinity_out_dirTrinityfasta

Change the filename and rename all the sequences

gzip -c trinity_out_dirTrinityfasta gt trinity-nematostella-rawfagzcurl -O http2015-may-nonmodelreadthedocsorgendev_staticrename-with-partitionspychmod u+x rename-with-partitionspyrename-with-partitionspy nema trinity-nematostella-rawfagz

This last command will give you trinity-nematostella-rawrenamedfastagz which contains all ofthe renamed sequences

25

labibi Documentation Release 10

Next BLASTing your assembled data

26 Chapter 6 Running the actual assembly

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+httpsgithubcomctbpygr-drawgit screed

sudo ln -s usrbinblastall usrlocalbin

Install the BLAST server and configure it

cdgit clone httpsgithubcomctbblastkitgit -b 2015-may-nonmodelsudo blastkitconfigure-lighttpdsh

cd blastkitwwwsudo ln -fs $PWD varwwwblastkit

mkdir fileschmod a+rxwt fileschmod +x homeubuntu

cd homeubuntublastkitpython checkpy

Now copy in your newly created transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt homeubuntublastkitdbdbfa

cd homeubuntublastkitformatdb -i dbdbfa -o T -p Fpython index-dbpy dbdbfa

You can now access your BLAST server at httpltamazon machine namegtblastkit

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set try BLASTing zebrafish lethal giant larvaehomolog

27

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links:

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
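If you just want to see the strongest hits without opening Excel, something like the following should work (a sketch; it assumes the FDR-adjusted P-value is the final, fifth column of nema-edgeR.csv):

# skip the header row, sort numerically by the assumed FDR column, show the top 10
tail -n +2 nema-edgeR.csv | sort -t, -g -k5,5 | head -10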

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
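To spot-check mapping rates for several samples at once, a small loop over the R1 files works; this is only a sketch, assuming the .extract.fastq.gz files are linked into /mnt/work and reusing the 'transcriptome' index built above:

cd /mnt/work
for reads in *_R1_*.extract.fastq.gz
do
    echo "== $reads =="
    # bowtie2 reports the overall alignment rate on stderr; the SAM output is discarded
    zcat "$reads" | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done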


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case you will see a 'testdata' directory. If you click on that, you'll see a bunch of fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a rough sketch of what this can look like follows this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
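Because every cluster's scheduler is configured a little differently, the following is only a rough sketch of what the qsub script in step 5 might look like -- the resource requests, module names, and filenames are placeholders, not values from this workshop:

#!/bin/bash
#PBS -l walltime=04:00:00,nodes=1:ppn=4,mem=16gb
#PBS -N small_test_pipeline

# load whatever modules your site provides (names vary between clusters)
module load bowtie2

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# the same commands you already tested interactively on the small data subset
# (the filenames here are placeholders)
zcat sample_R1.subset.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null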

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE, or iPlant, or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
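A couple of standard screen options (not specific to this workshop) make juggling sessions easier:

screen -S diginorm     # start a session with a memorable name
# ... detach with Ctrl-a d, or just close the terminal ...
screen -ls             # list the sessions running on this machine
screen -r diginorm     # reattach to the named session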

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
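For paired-end files you generally want the same records from both ends so the pairs stay matched; a minimal sketch (assuming the R1 and R2 files list reads in the same order, and using a hypothetical _2 filename for the mate file) is:

# take the same 100,000 records from each file of the pair
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_R1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_R2.fq.gz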


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
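Once a bigger EBS volume is attached to the instance, formatting and mounting it looks roughly like this -- a sketch only, with /dev/xvdf standing in for whatever device name your volume actually gets; follow the linked instructions for the authoritative steps:

# make a filesystem on the new volume and mount it at /mnt, then check the space
sudo mkfs -t ext4 /dev/xvdf
sudo mount /dev/xvdf /mnt
df -k /mnt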


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

Page 27: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

Now if you type

ls dnfqgz

yoursquoll see all of the files that you need to move on to the next step ndash

0Hour_ATCACG_L002001dnfqgz 6Hour_CGATGT_L002002dnfqgz0Hour_ATCACG_L002002dnfqgz 6Hour_CGATGT_L002003dnfqgz0Hour_ATCACG_L002003dnfqgz 6Hour_CGATGT_L002004dnfqgz0Hour_ATCACG_L002004dnfqgz 6Hour_CGATGT_L002005dnfqgz0Hour_ATCACG_L002005dnfqgz orphansdnfqgz6Hour_CGATGT_L002001dnfqgz

Letrsquos remove some of the detritus before moving on

rm pe se abundfilt keeprm normC20k20ct

Next Running the actual assembly

24 Chapter 5 Running digital normalization

CHAPTER 6

Running the actual assembly

Now wersquoll assemble all of these reads into a transcriptome using the Trinity de novo transcriptome assembler

First install some prerequisites for Trinity

sudo apt-get -y install bowtie samtools

Next install Trinity v206

cdcurl -L httpsgithubcomtrinityrnaseqtrinityrnaseqarchivev206targz gt trinitytargztar xzf trinitytargzmv trinityrnaseq trinity

cd trinitymake

Go into the work directory and prepare the data

cd mntworkfor i in dnfqgzdo

split-paired-readspy $idone

cat 1 gt leftfqcat 2 gt rightfq

Now run the Trinity assembler

~trinityTrinity --left leftfq --right rightfq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dirTrinityfasta which you can get stats on like so

curl -L -O httpsgithubcomged-labkhmerrawv13sandboxassemstats3pypython assemstats3py 300 trinity_out_dirTrinityfasta

Change the filename and rename all the sequences

gzip -c trinity_out_dirTrinityfasta gt trinity-nematostella-rawfagzcurl -O http2015-may-nonmodelreadthedocsorgendev_staticrename-with-partitionspychmod u+x rename-with-partitionspyrename-with-partitionspy nema trinity-nematostella-rawfagz

This last command will give you trinity-nematostella-rawrenamedfastagz which contains all ofthe renamed sequences

25

labibi Documentation Release 10

Next BLASTing your assembled data

26 Chapter 6 Running the actual assembly

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+httpsgithubcomctbpygr-drawgit screed

sudo ln -s usrbinblastall usrlocalbin

Install the BLAST server and configure it

cdgit clone httpsgithubcomctbblastkitgit -b 2015-may-nonmodelsudo blastkitconfigure-lighttpdsh

cd blastkitwwwsudo ln -fs $PWD varwwwblastkit

mkdir fileschmod a+rxwt fileschmod +x homeubuntu

cd homeubuntublastkitpython checkpy

Now copy in your newly created transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt homeubuntublastkitdbdbfa

cd homeubuntublastkitformatdb -i dbdbfa -o T -p Fpython index-dbpy dbdbfa

You can now access your BLAST server at httpltamazon machine namegtblastkit

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set try BLASTing zebrafish lethal giant larvaehomolog

27

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
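
A quick sanity check that the links are in place before moving on (the exact count depends on which files you downloaded in earlier chapters):

ls /mnt/work/*.extract.fastq.gz | wc -l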

9.1 Download Express

Now get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
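
Optionally, you can put the unpacked directory on your PATH for this session, so you can type express instead of the full path used below (the directory name matches the tarball above):

export PATH=$PATH:$HOME/express-1.5.1-linux_x86_64
which express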

9.2 Align Reads with Bowtie

Next, build an index file for your assembly:


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
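
If you would rather not type each command by hand, a small shell loop can derive the R2 and output filenames from each R1 file; this is a sketch covering the same four samples as above:

for r1 in 0Hour_ATCACG_L002_R1_00{1,2}.extract.fastq.gz 6Hour_CGATGT_L002_R1_00{1,2}.extract.fastq.gz
do
    # derive the matching R2 file and the output SAM name from the R1 name
    r2=${r1/_R1_/_R2_}
    out=${r1/_R1_/_}
    out=${out%.fastq.gz}.sam
    bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat "$r1") -2 <(zcat "$r2") > "$out"
done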

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nema.id7.tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nema.id1.tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nema.id2.tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nema.id3.tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nema.id4.tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nema.id62.tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nema.id63.tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nema.id61.tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nema.id21.tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03

9.4 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
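
Each counts file should now have two tab-separated columns (target_id and est_counts) and one row per transcript; a quick check:

wc -l *_counts.txt
head -3 0Hour_repl1_counts.txt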

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one: diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
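
If you want a rough count of transcripts called differentially expressed, something like this works; it assumes the FDR-adjusted P-value is the last column and that the file has a header row, so check the header first:

head -1 nema-edgeR.csv
awk -F, 'NR > 1 && $NF < 0.05' nema-edgeR.csv | wc -l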

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
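
To spot-check another sample the same way, you can route bowtie2's alignment summary (which it prints to stderr) into the pipe and discard the SAM output; the redirection order below does exactly that:

zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome 2>&1 > /dev/null | tail -5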


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' - for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a bare-bones skeleton is sketched after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
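
For step 5, a skeleton of such a qsub script might look like the sketch below; the resource requests, module name, and filenames are placeholders to adapt to your own cluster and data:

#!/bin/bash
#PBS -l walltime=04:00:00
#PBS -l nodes=1:ppn=4
#PBS -l mem=16gb
cd ${PBS_O_WORKDIR}
module load bowtie2            # hypothetical module name on your HPC
bowtie2-build assembly.fa transcriptome
bowtie2 -p 4 -U sample.fq.gz -x transcriptome -S sample.sam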

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options:

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window

• open up a new one and connect into your Amazon machine

• type 'screen -r' to reconnect into your running screen

(See amazon/using-screen for a more complete rundown on instructions.)
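
If you run several long jobs, naming the session makes it easier to find again; screen's -S and -r flags handle this (detach with Ctrl-a d):

screen -S diginorm
# ... run the normalize-by-median command inside it, then detach ...
screen -r diginorm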

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
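
The same trick works for a paired-end sample; take the same number of records from each file so R1 and R2 stay in sync (the _2 filename here assumes a matching second-read file exists):

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_2.fq.gz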


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine):

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine - you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 29: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

Next BLASTing your assembled data

26 Chapter 6 Running the actual assembly

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+httpsgithubcomctbpygr-drawgit screed

sudo ln -s usrbinblastall usrlocalbin

Install the BLAST server and configure it

cdgit clone httpsgithubcomctbblastkitgit -b 2015-may-nonmodelsudo blastkitconfigure-lighttpdsh

cd blastkitwwwsudo ln -fs $PWD varwwwblastkit

mkdir fileschmod a+rxwt fileschmod +x homeubuntu

cd homeubuntublastkitpython checkpy

Now copy in your newly created transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt homeubuntublastkitdbdbfa

cd homeubuntublastkitformatdb -i dbdbfa -o T -p Fpython index-dbpy dbdbfa

You can now access your BLAST server at httpltamazon machine namegtblastkit

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set try BLASTing zebrafish lethal giant larvaehomolog

27

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: load the functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation (an example combining them follows the list below):

-E <float>: maximum E-value for reporting the best blast hit and associated annotations

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff
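For instance, to require a stricter E-value and use the domain trusted cutoff, a command along these lines should work (a sketch; it simply combines the two options described above):

~/Trinotate/Trinotate Trinotate.sqlite report -E 1e-5 --pfam_cutoff DTC > trinotate_annotation_report.xls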

Let us look at the output. Open a new shell on your local machine and copy the report over:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
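The report has an .xls extension but is a plain tab-delimited text file, so you can also take a quick look at it on the instance itself before copying it down; for example:

wc -l trinotate_annotation_report.xls
cut -f1-3 trinotate_annotation_report.xls | head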


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now, get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next, build an index file for your assembly:


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
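Since results.xprs is tab-delimited, you can also pull out a single transcript's estimate with standard shell tools. For example, using the column layout shown above (column 2 is target_id, column 7 is est_counts) and one of the IDs from the table:

awk -F'\t' '$2 == "nemaid2tr1" {print $2, $7}' 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs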

9.4 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log CPM, P-value, and FDR-adjusted P-value.
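If you want a quick look at the most significant genes without leaving the terminal, something like this should work (a sketch that assumes the FDR-adjusted P-value is the last column of nema-edgeR.csv, as described above):

# skip the header row, sort numerically (ascending) on the FDR column, show the top ten
tail -n +2 nema-edgeR.csv | sort -t, -k5,5g | head -10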

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
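You can run the same spot check on any of the other samples to compare mapping rates; for example, for the first 6Hour file:

zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null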


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada
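It is also worth confirming that downloads like this arrived intact. If the facility publishes checksums, comparing them is straightforward (the checksums.md5 filename below is hypothetical):

md5sum Ath_Mut_1_R1.fastq.gz
# or, if they ship a file of checksums to compare against:
# md5sum -c checksums.md5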

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a minimal example follows this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
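As a rough illustration of step 5, a minimal qsub script might look like the sketch below; the resource requests, module name, and pipeline script name are placeholders to adapt to your own cluster:

#!/bin/bash
#PBS -l nodes=1:ppn=4,mem=16gb,walltime=24:00:00
# load whatever modules your commands need (placeholder module name)
module load trimmomatic
# run from the directory the job was submitted from
cd ${PBS_O_WORKDIR}
# the same commands you tested interactively, wrapped in a script (placeholder name)
bash my_pipeline.sh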

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often, you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
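If you end up with more than one detached session, you can list them and then reconnect to a specific one (the PID shown here is just an example):

screen -list      # show running and detached screen sessions
screen -r 12345   # reconnect to a particular session by PID (or name)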

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
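For paired-end data, subset both mates the same way so the reads stay in sync; for example (the _2 file is the assumed mate of the file above):

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_2.fq.gz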


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks      Used   Available  Use%  Mounted on
/dev/xvdb      38565344   20098736    16500940   55%  /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html
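In brief, once a fresh volume is attached to your instance, formatting and mounting it goes something like this sketch (the device name /dev/xvdf is an assumption; check lsblk first, and note that mkfs erases whatever is on the volume):

lsblk                         # find the new device, e.g. /dev/xvdf
sudo mkfs -t ext4 /dev/xvdf   # format the empty volume
sudo mount /dev/xvdf /mnt     # mount it at /mnt
df -k /mnt                    # confirm the new size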

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 30: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

CHAPTER 7

BLASTing your assembled data

First install a few prerequisites

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next grab things needed for the BLAST server

sudo pip install pygr whoosh Pillow Jinja2 git+httpsgithubcomctbpygr-drawgit screed

sudo ln -s usrbinblastall usrlocalbin

Install the BLAST server and configure it

cdgit clone httpsgithubcomctbblastkitgit -b 2015-may-nonmodelsudo blastkitconfigure-lighttpdsh

cd blastkitwwwsudo ln -fs $PWD varwwwblastkit

mkdir fileschmod a+rxwt fileschmod +x homeubuntu

cd homeubuntublastkitpython checkpy

Now copy in your newly created transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt homeubuntublastkitdbdbfa

cd homeubuntublastkitformatdb -i dbdbfa -o T -p Fpython index-dbpy dbdbfa

You can now access your BLAST server at httpltamazon machine namegtblastkit

Note that you will need to enable HTTP access on your Amazon firewall settings see amazonenable-http

If you want to use something that will get you results with the test data set try BLASTing zebrafish lethal giant larvaehomolog

27

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 31: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

28 Chapter 7 BLASTing your assembled data

CHAPTER 8

Annotation of denovo transcriptome

81 Identify the GeneTranscript relationships

we can generate this file like so

cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map

Letrsquos have a look on the map

less Trinityfastagene_trans_map

Components genes and isoforms

bull The different (irsquos) that correspond to the same (g) represent isoforms

bull The different (grsquos) could represent different genes (or parts of genes)

bull The component (TR|c) often contain related genes (paralogs or gene fragments)

Check the Trinityseq forum for more details

82 Generate the longest-ORF peptide candidates from the Trinity As-sembly

We need to install Transdecoder to do this job

cdsudo cpan URIEscape

Note type yes for all interactive questions

curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake

Now we can run the Transdecoder software to identify the longest-ORF peptide

29

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load the Uniref90 blast hits:


~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: load the functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations

--pfam_cutoff <string>: one of

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
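
Before (or instead of) copying the report to your laptop, you can peek at it on the instance itself. A small sketch; the report is a tab-separated table even though it carries an .xls extension:

# how many rows are in the report?
wc -l /mnt/work/trinotate_annotation_report.xls

# show the first few columns of the first few rows
cut -f1-3 /mnt/work/trinotate_annotation_report.xls | head -5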


CHAPTER 9

Quantification and Differential Expression

First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
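
If you would rather not copy and paste all of those commands, a small loop can fetch the same files. This is only a sketch that assumes the URL pattern above holds for every file; it also re-fetches the 0Hour _001 files downloaded in an earlier section, which is harmless:

cd /mnt/data
base=http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04
for sample in 0Hour_ATCACG 6Hour_CGATGT
do
    for num in 001 002 003 004 005
    do
        for read in R1 R2
        do
            curl -O -L ${base}/${sample}_L002_${read}_${num}.extract.fastq.gz
        done
    done
done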

and link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
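
If you want a quick look at the most highly expressed transcripts in one sample, you can pull out the target_id and tpm columns and sort them numerically. This is just a sketch, reusing the csvtool approach from the next section:

csvtool namedcol -t TAB target_id,tpm 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - | sort -k2,2 -g -r | head -10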

9.4 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
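
Before running edgeR it is worth checking that the four count files line up; they should all have the same number of rows (one per transcript). A quick sketch:

# all four files should report the same line count
wc -l *_counts.txt

# and each line should be a transcript name plus a count
head -3 0Hour_repl1_counts.txt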

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one - diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
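
You can run the same quick check for every sample with a loop; bowtie2 prints its alignment summary to stderr, so the rate for each file will appear in your terminal. A sketch:

cd /mnt/work
for R1 in *_R1_*.extract.fastq.gz
do
    echo $R1
    zcat $R1 | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
done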


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' - for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a minimal sketch appears after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
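
Here is a minimal sketch of what such a qsub (PBS/Torque) script might look like; the resource requests, module names, and the pipeline script name are placeholders that you would adapt to your own cluster:

#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l mem=16gb
#PBS -l nodes=1:ppn=4

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# load whatever modules your pipeline needs (names are site-specific)
module load trimmomatic
module load bowtie2

# run your pipeline script, on the small test data first
bash my_pipeline.sh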

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options:

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window

• open up a new one and connect into your Amazon machine

• type 'screen -r' to reconnect into your running screen

(See amazon/using-screen for a more complete rundown on instructions.)
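
screen can also manage several named sessions, which helps when you have more than one long-running job. A brief sketch:

# start a named session
screen -S assembly

# detach from inside the session by typing ctrl-a d

# list your running sessions
screen -ls

# reconnect to a specific session by name
screen -r assembly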

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
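
For paired-end data you generally want to subset both files the same way so that the read pairs stay in sync; taking the same number of lines from the start of each file does that. A sketch, using a hypothetical _2 file to match the _1 file above:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_R1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_R2.fq.gz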


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine):

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine - you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.


Page 33: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta

Check the Transdecoder output

less Trinityfastatransdecoder_dirlongest_orfspep

83 Capturing BLAST Homologies

Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases

1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot

Run blast to find homologies

(a) search Trinity transcripts

blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6

(b) search Transdecoder-predicted proteins

blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6

2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence

wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot

perform similar searches using uniref90 as the target database rename output files accordingly

blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6

I have ran them overnight already You can download these files to save time

wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6

84 Characterization of functional annotation features

1 identify protein domains we need to install HMMER and download the Pfam domains database

30 Chapter 8 Annotation of denovo transcriptome

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
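
Before moving on, it can save time later to check that all of the files were downloaded and linked correctly; a quick sanity check (not part of the original workshop steps):

cd /mnt/work
ls -lh *.extract.fastq.gz
ls *.extract.fastq.gz | wc -l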

9.1 Download Express

Now get express

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
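
Optionally, you can put the express binary on your PATH so you don't have to type the full path in the commands below (an optional convenience, assuming the tarball unpacked into ~/express-1.5.1-linux_x86_64):

export PATH=$PATH:$HOME/express-1.5.1-linux_x86_64
which express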

9.2 Align Reads with Bowtie

Next build an index file for your assembly


cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
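
If you want to align every sample rather than typing each command by hand, the same thing can be scripted with a small loop; this is a sketch that assumes the file naming pattern used above:

cd /mnt/work
for r1 in {0,6}Hour_*_L002_R1_*.extract.fastq.gz
do
    r2=${r1/_R1_/_R2_}        # matching R2 file
    out=${r1/_R1_/_}          # drop the _R1_ tag for the output name
    out=${out%.fastq.gz}.sam
    bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
        -1 <(zcat $r1) -2 <(zcat $r2) > $out
done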

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this

bundle_id  target_id     length  eff_length    tot_counts  uniq_counts  est_counts    eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nemaid7tr4    269     0.000000      0           0            0.000000      0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nemaid1tr1    811     508.137307    1301        45           158.338092    252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nemaid2tr1    790     487.144836    1845        356          1218.927626   1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nemaid3tr1    852     549.122606    1792        3            871.770849    1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nemaid4tr1    675     372.190166    1005        20           88.963433     161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nemaid62tr13  2150    1846.657210   9921        9825         9919.902997   11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nemaid63tr13  406     103.720396    360         270          271.097003    1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nemaid61tr13  447     144.526787    6           0            0.000000      0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nemaid21tr8   2075    1771.684102   2782        58           958.636395    1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03
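
If you want a quick look at the most highly expressed transcripts, you can sort on one of the numeric columns, for example by fpkm (column 11 in the header above); a small sketch, not part of the original workshop:

tail -n +2 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | sort -t$'\t' -k11,11gr | cut -f2,11,15 | head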

9.4 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:


csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links

• False Discovery Rate

• Learn R with Swirl

So download the script

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It contains the gene name, the log fold change, the P-value, and the FDR-adjusted P-value for each gene.
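
If you just want the most significant genes at the command line, you can sort the CSV on the FDR column; this sketch assumes the FDR-adjusted P-value is the last (fifth) column and that the gene names contain no commas:

cd /mnt/work
(head -1 nema-edgeR.csv; tail -n +2 nema-edgeR.csv | sort -t, -k5,5g) | head -20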

You can also view more informative versions of these files generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.


CHAPTER 10

Remapping your reads to your assembled transcriptome

First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like

97.18% overall alignment rate
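
To collect the same number for several samples at once, you can loop over the read files and capture bowtie2's summary, which it prints to stderr; a rough sketch, not part of the original workshop:

cd /mnt/work
for r1 in 0Hour_ATCACG_L002_R1_001.extract.fastq.gz 6Hour_CGATGT_L002_R1_001.extract.fastq.gz
do
    echo -n "$r1: "
    zcat $r1 | head -400000 | bowtie2 -U - -x transcriptome 2>&1 >/dev/null | grep 'overall alignment rate'
done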


CHAPTER 11

Miscellaneous advice

11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, the URI lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz


Tada

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.); a minimal example is sketched after this list.

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
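
For reference, a minimal qsub script might look something like the sketch below; the resource requests, module name, and script name are placeholders that you will need to adapt to your own cluster:

#!/bin/bash -login
#PBS -l walltime=24:00:00,nodes=1:ppn=4,mem=16gb
#PBS -N my_rnaseq_job

# placeholder module name -- ask your HPC admins what is actually available
module load trinity

# run from the directory you submitted the job from
cd $PBS_O_WORKDIR

# the commands you already tested interactively on small data go here
bash run_pipeline.sh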

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window

• open up a new one and connect into your Amazon machine

• type 'screen -r' to reconnect into your running screen

(See amazon/using-screen for a more complete rundown on instructions.)
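
screen can also manage several named sessions, which is handy when you have more than one long-running job going; these are standard screen options, not anything specific to this workshop:

screen -S normalize    # start a new session named 'normalize'
# ... start your long-running command, then press Ctrl-a d to detach ...
screen -ls             # list the sessions that are running
screen -r normalize    # reattach to the named session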

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
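
For paired-end data you will usually want to subset R1 and R2 the same way so that the pairs stay in sync; for example, using the workshop's 0Hour files (both files are in the same order, so taking the same number of lines from each keeps the reads paired):

gunzip -c /mnt/data/0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | gzip > sample.R1.fq.gz
gunzip -c /mnt/data/0Hour_ATCACG_L002_R2_001.extract.fastq.gz | head -400000 | gzip > sample.R2.fq.gz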


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
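
Once an analysis is running, it is also worth keeping an eye on how quickly the intermediate files are growing, for example:

df -h /mnt
du -sh /mnt/work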


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

Page 34: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog

2 We can predict other features like

bull signal peptides using signalP

bull transmembrane regions using tmHMM

bull rRNA transcripts using RNAMMER

85 Integration of all annotations into one database

install Trinotate

cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate

install sqlite

sudo apt-get install sqlite3

We need also the DBI perl package

sudo cpan DBIsudo cpan DBDSQLite

Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site

cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz

Load transcripts and coding regions We have three data types

1 Transcript sequences (de novo assembled transcripts or reference transcripts)

2 Protein sequences (currently as defined by TransDecoder)

3 GeneTranscript relationships

~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep

Loading BLAST homologies

~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6

Optional load Uniref90 blast hits

85 Integration of all annotations into one database 31

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 35: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6

Optional Loading functional annotation features

~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout

86 Output an Annotation Report

~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls

There are 2 arguments that we can use to control the accuracy of annotation

-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations

ndashpfam_cutoff ltstringgt

1 lsquoDNCrsquo domain noise cutoff (default)

2 lsquoDGCrsquo domain gathering cutoff

3 lsquoDTCrsquo domain trusted cutoff

4 lsquoSNCrsquo sequence noise cutoff

5 lsquoSGCrsquo sequence gathering cutoff

6 lsquoSTCrsquo sequence trusted cutoff

let us see the output Open a new shell

scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls

32 Chapter 8 Annotation of denovo transcriptome

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next: More resources


CHAPTER 12

More resources

12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.

12.2 Places to share data, scripts, and results files

Figshare


CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc., etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
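For example, if the average quality drops off badly toward the 3' end of your reads, you could crop the reads at the position where quality is still good. A sketch assuming a TrimmomaticSE-style wrapper is on your path (invoke Trimmomatic however you did in the trimming chapter); the filenames and the 70-base cutoff are made up for illustration:

# keep only the first 70 bases of each read in a single-end file
TrimmomaticSE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R1_001.crop70.fastq.gz \
              CROP:70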

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window

• open up a new one and connect into your Amazon machine

• type 'screen -r' to reconnect into your running screen

(See amazon/using-screen for a more complete rundown on instructions.)
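Day to day, you only need a handful of screen commands; as a quick reference (the session name 'assembly' is just an example):

screen -S assembly     # start a new session with a name you will recognize
# ... start your long-running command, then press Ctrl-a d to detach ...
screen -ls             # list the sessions that are still running
screen -r assembly     # reattach to the named session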

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
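If your data is paired-end, take the same number of lines from both files so the read pairs stay in sync; a sketch, assuming SRR534005_2.fastq.gz is the matching R2 file:

cd /mnt/data
gunzip -c SRR534005_1.fastq.gz | head -400000 | gzip > sample_1.fq.gz
gunzip -c SRR534005_2.fastq.gz | head -400000 | gzip > sample_2.fq.gz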


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has 40 GB of disk space.
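To see whether that is enough for your project, you can apply the rule of thumb above by counting how many reads you actually have; a sketch, assuming your raw files live in /mnt/data and end in .fastq.gz:

cd /mnt/data
# each FASTQ record is 4 lines, so reads = lines / 4
total_lines=$(zcat *.fastq.gz | wc -l)
echo "total reads: $(( total_lines / 4 ))"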

To add disk space to your Amazon instance, see this set of instructions:

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
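In brief, those instructions boil down to creating a volume in the AWS console, attaching it to your instance, and then formatting and mounting it from the command line. A sketch of the command-line half (the device name /dev/xvdf depends on what the console shows when you attach the volume):

# WARNING: mkfs erases the device -- only run it on a new, empty volume
sudo mkfs -t ext4 /dev/xvdf
sudo mount /dev/xvdf /mnt
df -k /mnt    # confirm the new, larger filesystem is mounted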


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel


• Welcome
  • 1. Learning goals
  • 2. Safe space and code of conduct
  • 3. Instructor introductions
  • 4. Amazon and cloud computing - why
  • 5. Sticky notes and how they work + Minute Cards
  • 6. Warning
• Non-model organisms and RNAseq
  • The overall process
• Getting started with Amazon EC2
  • More details
• Short read quality and trimming
  • Prepping the computer
  • Data source
  • 1. Copying in some data to work with
  • 1. Copying data into a working location
  • 2. FastQC
  • 3. Trimmomatic
  • 4. FastQC again
  • 5. Trim the rest of the sequences
  • 6. Interleave the sequences
• Running digital normalization
• Running the actual assembly
• BLASTing your assembled data
• Annotation of denovo transcriptome
  • Identify the Gene/Transcript relationships
  • Generate the longest-ORF peptide candidates from the Trinity Assembly
  • Capturing BLAST Homologies
  • Characterization of functional annotation features
  • Integration of all annotations into one database
  • Output an Annotation Report
• Quantification and Differential Expression
  • Download Express
  • Align Reads with Bowtie
  • Quantify Expression using eXpress
  • Differential Expression
• Remapping your reads to your assembled transcriptome
• Miscellaneous advice
  • Sequencing depth and number of samples
  • Downloading your data
  • Developing your own pipeline
• More resources
  • Informational resources
  • Places to share data, scripts, and results files
• Miscellaneous questions
• Tips and Tricks for working with Remote Computers
  • Use screen to run things that take a long time
  • Use CyberDuck to transfer files
  • Subsetting data
  • Running full analyses on Amazon Web Services
• Technical information
Page 36: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

CHAPTER 9

Quantification and Differential Expression

First make sure yoursquove downloaded all the original raw data

cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz

and link it in

cd mntworkln -fs mntdatafastqgz

91 Download Express

Now get express

cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz

92 Align Reads with Bowtie

Next build an index file for your assembly

33

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 37: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed

Using the index we built wersquoll align the reads from a few of our samples back to our assembly

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam

93 Quantify Expression using eXpress

Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)

~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam

~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam

This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command

head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs

You should see something like this

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03

94 Differential Expression

First install R and edgeR

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format

34 Chapter 9 Quantification and Differential Expression

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 38: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt

Wersquoll be using edgeR to do the basic differential expression analysis of our counts

To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it

Links

bull False Discovery Rate

bull Learn R with Swirl

So download the script

cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR

Now we run the differential expression script with

Rscript diff_expR

This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value

You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf

94 Differential Expression 35

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out. (A sketch follows this list of questions.)

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE, or iPlant, or <insert other platform here> to run these analyses?


Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online
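As an aside on question 4 above, here is a minimal sketch of pulling data straight from a sequencing core onto your Amazon machine. The URL, username, and password below are made up -- substitute the ones your core gives you:

cd /mnt/data

# hypothetical credentials and URL -- use the ones provided by your sequencing core
curl -O -u myusername:mypassword ftp://ftp.example-core.edu/run123/Sample_1_R1.fastq.gz

# or, equivalently, with wget:
wget --user=myusername --password=mypassword ftp://ftp.example-core.edu/run123/Sample_1_R1.fastq.gz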


CHAPTER 14

Tips and Tricks for working with Remote Computers

14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete set of instructions.)
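A minimal sketch of that detach/reattach cycle, assuming you give the session a name so it is easy to find again (the session name 'assembly' is just an example, and the long-running command is the one from above):

screen -S assembly        # start a named screen session
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
# press Ctrl-a, then d, to detach while the job keeps running

# ...later, after logging back into the machine:
screen -ls                # list the sessions that are still running
screen -r assembly        # reattach to the named session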

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
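If your data are paired-end, you will usually want to subset both mate files the same way so the read pairs stay in sync. A sketch, assuming the second file follows the usual _2 naming pattern:

# take the same number of records from each mate file so pairs stay matched
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample_R1.fq.gz
gunzip -c /mnt/data/SRR534005_2.fastq.gz | head -400000 | gzip > sample_R2.fq.gz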


14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has about 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
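Once a larger EBS volume is attached to the instance (per the instructions linked above), the format-and-mount steps look roughly like the sketch below. The device name /dev/xvdf is an assumption -- check 'lsblk' to see what your new volume is actually called:

lsblk                           # find the new, empty volume (assumed here to be /dev/xvdf)
sudo mkfs -t ext4 /dev/xvdf     # format it -- this erases anything already on the volume!
sudo mount /dev/xvdf /mnt       # mount it as /mnt
df -h /mnt                      # confirm the new size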


CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 39: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

36 Chapter 9 Quantification and Differential Expression

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 40: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

CHAPTER 10

Remapping your reads to your assembled transcriptome

First wersquoll need to make sure bowtie2 is installed

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome

cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome

And then finally count the number of reads that map to your transcriptome

zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull

You should get something like

9718 overall alignment rate

37

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 41: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

38 Chapter 10 Remapping your reads to your assembled transcriptome

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 42: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

CHAPTER 11

Miscellaneous advice

111 Sequencing depth and number of samples

Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )

In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes

112 Downloading your data

If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get

hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6

URI ftprnaseqmodelQecheJa6titanbchmsuedu

you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link

In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC

To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC

mkdir ~testdatacd ~testdata

wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata

This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do

wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz

39

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1. you need to get your data onto that machine.

2. you need to be prepared to let things run for a long time.

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
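(For example, under that rule of thumb a project with about 600 million reads would want roughly 3 TB of scratch space for intermediate files.)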

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
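Once a bigger EBS volume is attached to the instance, the formatting and mounting step itself is short. The sketch below assumes the new volume shows up as /dev/xvdf; the device name can differ on your instance, and mkfs destroys anything already on that volume, so check carefully before running it:

lsblk                        # find the new, empty volume (here assumed to be xvdf)
sudo mkfs -t ext4 /dev/xvdf  # put a filesystem on it (destroys existing data!)
sudo mount /dev/xvdf /mnt    # mount it on /mnt
df -k /mnt                   # confirm the new, larger size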

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 43: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

Tada

113 Developing your own pipeline

Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips

1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial

2 Generate a data subset (the first few 100k reads for example)

2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such

3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works

4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)

5 Make sure the qsub script works on your same small data

6 Scale up to a big test data set

7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)

8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare

Next More resources

40 Chapter 11 Miscellaneous advice

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 44: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

CHAPTER 12

More resources

121 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing

ANGUS - summer NGS course - lots of resources and materials and book reference

Data Carpentry - intro to R etc

Software Carpentry - more scripting Python etc

122 Places to share data scripts and results files

Figshare

41

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 45: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

42 Chapter 12 More resources

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 46: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

CHAPTER 13

Miscellaneous questions

1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly

This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)

The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly

The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)

2 What are ldquotranscript familiesrdquo

Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence

So transcript families are Trinityrsquos best guess at transcripts that come from the same locus

3 What should we look at in FastQC results for RNAseq data

The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position

4 How do we transfer our data to Amazon (or any remote computer)

There are two options ndash

If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)

If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out

5 How do we use Amazon to run full analyses

See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo

6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses

43

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 47: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you

7 How do we know if our reference transcriptome is ldquogood enoughrdquo

See Remapping your reads to your assembled transcriptome

8 How do I choose the set of tools to use

Our recommendations in order

(a) Find a tool that a nearby lab is using and start there

(b) Look at tools and workflows that are used in published papers by groups working in your area

(c) Look for good tutorials online

44 Chapter 13 Miscellaneous questions

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 48: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

CHAPTER 14

Tips and Tricks for working with Remote Computers

141 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo

For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen

screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz

The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example

bull close your terminal window

bull open up a new one and connect into your Amazon machine

bull type lsquoscreen -rrsquo to reconnect into your running screen

(See amazonusing-screen for a more complete rundown on instructions)

142 Use CyberDuck to transfer files

To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class

143 Subsetting data

If you want to generate a small subset of a FASTQ file for testing you can do something like this

gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz

This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file

45

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt

which tells you that mnt has 40 GB of disk space

To add disk space to your Amazon instance see this set of instructions

httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml

The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data

46 Chapter 14 Tips and Tricks for working with Remote Computers

CHAPTER 15

Technical information

The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel

47

  • Welcome
    • 1 Learning goals
    • 2 Safe space and code of conduct
    • 3 Instructor introductions
    • 4 Amazon and cloud computing - why
    • 5 Sticky notes and how they work + Minute Cards
    • 6 Warning
      • Non-model organisms and RNAseq
        • The overall process
          • Getting started with Amazon EC2
            • More details
              • Short read quality and trimming
                • Prepping the computer
                • Data source
                • 1 Copying in some data to work with
                • 1 Copying data into a working location
                • 2 FastQC
                • 3 Trimmomatic
                • 4 FastQC again
                • 5 Trim the rest of the sequences
                • 6 Interleave the sequences
                  • Running digital normalization
                  • Running the actual assembly
                  • BLASTing your assembled data
                  • Annotation of denovo transcriptome
                    • Identify the GeneTranscript relationships
                    • Generate the longest-ORF peptide candidates from the Trinity Assembly
                    • Capturing BLAST Homologies
                    • Characterization of functional annotation features
                    • Integration of all annotations into one database
                    • Output an Annotation Report
                      • Quantification and Differential Expression
                        • Download Express
                        • Align Reads with Bowtie
                        • Quantify Expression using eXpress
                        • Differential Expression
                          • Remapping your reads to your assembled transcriptome
                          • Miscellaneous advice
                            • Sequencing depth and number of samples
                            • Downloading your data
                            • Developing your own pipeline
                              • More resources
                                • Informational resources
                                • Places to share data scripts and results files
                                  • Miscellaneous questions
                                  • Tips and Tricks for working with Remote Computers
                                    • Use screen to run things that take a long time
                                    • Use CyberDuck to transfer files
                                    • Subsetting data
                                    • Running full analyses on Amazon Web Services
                                      • Technical information
Page 49: labibi Documentation...labibi Documentation, Release 1.0 This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camile Scott, and Tamer Mansour. Seethe workshop organization

labibi Documentation Release 10

144 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash

1 you need to get your data onto that machine

2 you need to be prepared to let things run for a long time

3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space

Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help

To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)

By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)

df -k mnt

You should see something like this

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
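
As a rough sketch of the last steps once a fresh volume is attached (the device name /dev/xvdf is an assumption; check the linked instructions and 'lsblk' for the actual name on your instance):

sudo mkfs -t ext4 /dev/xvdf     # format the new volume (this erases anything on it)
sudo mount /dev/xvdf /mnt       # mount it at /mnt
df -k /mnt                      # confirm the bigger disk is now in place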


CHAPTER 15

Technical information

The GitHub repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
