
RNA Seq Lab I: Analyzing cuffdiff output from an RNA-seq dataset¹

Bio 461 Developmental Biology Lab Saint Louis University Dr. Judith Ogilvie

Objectives:

Obtain GO terms and other gene attributes for differentially expressed genes
Obtain genomic DNA and mRNA sequences for candidate genes
Edit and annotate sequences

In this training module you will analyze cuffdiff output from an Illumina RNA-seq experiment that my lab conducted to identify differentially expressed mRNAs isolated from postnatal day 4 (P4) and P6 mouse retinas, both from wild type (wt) and from mutant rd1 mice. In Part A we will filter the list of differentially expressed genes through the Ensembl BioMart database and the DAVID database to find Gene Ontology (GO) terms and other gene attributes. In Part B we will use the UCSC Genome Browser to obtain sequence information for your genes of interest. In Part C we will import sequence information for candidate genes into a sequence editing and annotation program called ApE.

The RNA-seq experiment was analyzed using the Tuxedo Protocol with four different 2-way comparisons: (1) wt P4 compared to wt P6, (2) rd1 P4 compared to rd1 P6, (3) wt P4 compared to rd1 P4, and (4) wt P6 compared to rd1 P6 (see data spreadsheet). During this lab activity you will sort through the differentially regulated genes and pick out a handful of your choice for in-class validation using quantitative reverse transcriptase PCR (qRT-PCR).

The Excel spreadsheet list of genes that are significantly differentially expressed is tabbed with the following information:

Transcript: Transcript ID #
Nearest Ref Id: Ensembl transcript ID #
Gene: gene identifier #
Alias: gene name abbreviation
fold change: fold change of P6 wt/P6 rd1 FPKM
Direction: Up or Down regulated
P6 wt FPKM: fragments per kilobase of exon per million fragments for each gene in the P6 wt sample
P6 rd1 FPKM: fragments per kilobase of exon per million fragments for each gene in the P6 rd1 sample
Q-value: p-value adjusted for multiple testing (false discovery rate); the smaller the value, the more likely the observed expression change is “real”
Description: Gene name
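The relationship between the FPKM columns and the fold change column can be sketched in a few lines of Python. This is only an illustration: the FPKM values are made up, and treating a ratio above 1 as "Up" (higher in wt than in rd1) is an assumption that you should check against the Direction column in the actual spreadsheet.

```python
import math

def fold_change(fpkm_wt, fpkm_rd1):
    """Return (fold change, log2 fold change, direction) for a wt/rd1 FPKM pair."""
    if fpkm_wt <= 0 or fpkm_rd1 <= 0:
        # A ratio (and its log) is undefined when either sample has zero expression
        raise ValueError("both FPKM values must be greater than zero")
    fc = fpkm_wt / fpkm_rd1
    # Assumption: "Up" means higher in wt than in rd1
    direction = "Up" if fc > 1 else "Down"
    return fc, math.log2(fc), direction

# Hypothetical FPKM values, for illustration only
print(fold_change(40.0, 10.0))
print(fold_change(5.0, 20.0))
```

The log2 form is convenient because a 4-fold increase and a 4-fold decrease come out as +2 and -2, the same distance from zero.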

Part A: Assigning full gene names and Gene Ontology or GO Terms

Genes can be sorted using the Sort function under the “Data” tab in Excel. The initial spreadsheet is sorted first by direction (downregulated above upregulated genes), then by q-value, and within a significance range by fold change. You can change this order. You may also want a shorter list to work with. Copy your list into a new sheet in the Excel spreadsheet. Sort in a way that separates the genes you are interested in and delete the rest. Be sure to rename the sheet to indicate what is in this list. For example, you may want to include only genes that are downregulated or you may want to include only genes with more than a 2-fold change. Note that the second of our analysis tools prefers lists that are not more than 500 genes. If your total list is shorter than this, you probably want to work with the complete list.

1 Adapted from a lab developed by Ray Enke at James Madison University. Cold Spring Harbor Laboratory, DNA Learning Center, 1 Bungtown Road, Cold Spring Harbor, NY 11724
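If you prefer a script to Excel, the sorting and filtering described above can be sketched in Python. This is a minimal sketch that assumes the sheet has been saved as a CSV file with the column headers used here; the gene names and values are invented placeholders, not rows from the data set.

```python
import csv
import io

# Hypothetical rows standing in for the spreadsheet saved as CSV
SAMPLE = """Alias,fold change,Direction,Q-value
GeneA,0.20,Down,0.001
GeneB,0.45,Down,0.004
GeneC,0.80,Down,0.030
GeneD,3.10,Up,0.002
"""

def filter_genes(rows, direction=None, min_fold=2.0):
    """Keep genes changed at least min_fold in either direction, sorted by q-value."""
    kept = []
    for row in rows:
        fc = float(row["fold change"])
        if fc <= 0:
            continue  # a zero ratio cannot be inverted; inspect such rows by hand
        # A ratio of 0.25 is a 4-fold decrease, so invert ratios below 1
        magnitude = fc if fc >= 1 else 1 / fc
        if magnitude < min_fold:
            continue
        if direction is not None and row["Direction"] != direction:
            continue
        kept.append(row)
    kept.sort(key=lambda r: float(r["Q-value"]))
    return kept

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
down = filter_genes(rows, direction="Down")
print([r["Alias"] for r in down])
print(len(down) <= 500)  # the later analysis tool prefers lists of at most 500 genes
```

To run this on real data, replace `io.StringIO(SAMPLE)` with an opened CSV file and adjust the header names to match your sheet.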

To pick “interesting” genes out of the list, we need to get some additional information about each of them. A gene ontology or GO term is a short descriptor of a gene product’s function. There are three different kinds of GO terms: biological process, molecular function, and cellular component. We are most interested in the biological process. Why? There are two very useful free tools for identifying GO terms. The results should be the same, but the output appears in very different formats. We will start with a database called DAVID (Database for Annotation, Visualization and Integrated Discovery), which will allow us to select only GO terms associated with biological processes.

Navigate to DAVID (http://david.abcc.ncifcrf.gov/tools.jsp)
On the left side, click the “Upload” tab.
From your CuffDiff output file, select and copy the entire column A (Nearest Ref ID) and paste it into the window labeled “A. Paste a list.”
Select Identifier>>>Ensembl Transcript ID; List type>>>Gene list; click “Submit List.”
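Before pasting, it can help to tidy the ID column so that the header cell, blank cells, and duplicates are not submitted. The sketch below assumes the column holds mouse Ensembl transcript IDs (the prefix ENSMUST followed by 11 digits, optionally with a version suffix such as ".3"); the IDs in the example are placeholders built for illustration, not real transcripts from the data set.

```python
import re

# Mouse Ensembl transcript IDs: "ENSMUST" followed by 11 digits
ENSMUST = re.compile(r"^ENSMUST\d{11}$")

def clean_id_list(ids):
    """Drop blanks, headers and duplicates; strip version suffixes; keep the order."""
    seen = set()
    cleaned = []
    for raw in ids:
        ident = raw.strip().split(".")[0]  # "ENSMUST...1.3" becomes "ENSMUST...1"
        if ENSMUST.match(ident) and ident not in seen:
            seen.add(ident)
            cleaned.append(ident)
    return cleaned

# Placeholder IDs for illustration only
id_a = "ENSMUST" + "1".zfill(11)
id_b = "ENSMUST" + "2".zfill(11)
column = ["Nearest Ref Id", id_a, id_a + ".3", "", id_b]

# One ID per line, ready to paste into the list window
print("\n".join(clean_id_list(column)))
```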

You now have a number of DAVID tools you can use to analyze the data.

Click on “Gene Functional Classification Tool.” The top row will show an “enrichment score” for each group of functionally related genes. The larger the score, the more enriched the cluster. You can download the data as a tab-delimited file that can be imported into an Excel spreadsheet. When you are done with this data, go back to the previous screen.

Click on “Functional Annotation Tool.”
Annotation Summary Results>>>uncheck “Check defaults”>>>Clear All>>>expand “Gene_Ontology” by clicking on the plus sign>>>check GOTERM_BP_FAT to include only GO terms for Biological Processes.

You will have 3 options. You can click on all three and download the files.

o Functional Annotation Clustering will provide clusters of related GO terms similar to the clusters of functionally related genes above. The second column lists the GO terms in that cluster. The “count” represents how many of your genes have been annotated with that GO term and the P_value tells you how significant the enrichment is.

o Functional Annotation Chart includes each of the GO terms in the previous analysis, sorted by P-value, without clustering.

o Functional Annotation Table provides a table listing each gene and all of the GO terms associated with that gene.

The output may be cumbersome to sort through to identify interesting genes, so we will also use a database called Ensembl BioMart to assign GO terms to each differentially expressed gene. This tool gives a very nice spreadsheet, but does not allow you to select what kind of GO terms to include. BioMart prefers lists of fewer than 500 genes.

Navigate to Ensembl BioMart (http://useast.ensembl.org/biomart/martview/)
Choose database>>>Ensembl Genes 78>>>Choose dataset>>>Mus musculus genes (GRCm38.p3)
Filters>>>Gene>>>check ID list limit>>>select Associated gene names from dropdown

These commands tell the database that we are going to filter a list of gene name abbreviations, labeled Alias in your spreadsheet, through the annotated mouse genome (Mus musculus). In the RNA-seq spreadsheet, copy the entire column B (or a subset that you have selected) and paste the gene aliases into the BioMart search window. The next set of commands will tell the database what information we want back from our search.




There are many options you can select (see list under Attributes). For this exercise we will select the full gene name, a description of the gene, and associated GO terms for each gene as our output:

Attributes>>>Gene
Check only the following boxes under Gene: Description, Associated gene name
Check only the following box under External: GO Term name
Select Results (top tab)>>>for Export results to, select File>>>XLS>>>check Unique results only>>>Go. If it times out, simply repeat.

You now have a new spreadsheet with the Alias/Associated gene name, Description/Full gene name and GO terms for each of the genes. Note that genes with multiple GO terms are repeated in multiple rows. For example, using this information we now know that Cngb1 is a cyclic nucleotide gated ion channel that is involved in phototransduction and smell. Copy or move this spreadsheet into the existing RNA-seq data spreadsheet as a new sheet. Use this spreadsheet to search for keywords of interest. This list will include all of the GO terms. You may want to start by sorting by GO term and deleting the terms that are not of interest, such as cellular components. Some terms will be so similar that you may want to delete them to simplify the list (e.g., “phototransduction” and “phototransduction, visible light” in the example below).
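Because each gene is repeated once per GO term, it can be easier to collapse the rows to one entry per gene before searching. The following Python sketch shows one way to do that and to run a keyword search; the gene names and row layout are hypothetical stand-ins for the exported BioMart sheet.

```python
from collections import defaultdict

# Hypothetical rows: one (gene name, GO term) pair per spreadsheet row
ROWS = [
    ("GeneA", "phototransduction"),
    ("GeneA", "phototransduction, visible light"),
    ("GeneA", "membrane"),
    ("GeneB", "cell cycle"),
    ("GeneB", "cell cycle"),
]

def go_terms_by_gene(rows):
    """Collapse repeated rows into one list of unique GO terms per gene."""
    grouped = defaultdict(list)
    for gene, term in rows:
        if term and term not in grouped[gene]:
            grouped[gene].append(term)
    return dict(grouped)

def search(grouped, keyword):
    """Return the genes with at least one GO term containing the keyword."""
    keyword = keyword.lower()
    return sorted(gene for gene, terms in grouped.items()
                  if any(keyword in term.lower() for term in terms))

grouped = go_terms_by_gene(ROWS)
print(grouped["GeneB"])
print(search(grouped, "photo"))
```

The search is a case-insensitive substring match, so a keyword such as "photo" finds both "phototransduction" and "photoreceptor" terms.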

What kind of genes do you want to find to validate? My lab is most interested in photoreceptor development and degeneration. What keywords would you search for if you were interested in this process?

Control F in Excel>>>enter search term>>>Find next

Working in groups of 3, decide what genes you are going to study. Before you leave today, groups will select 1 process or gene family to study. Each person in the group will select at least 2 genes to work on. Once you’ve decided on your genes, write them on the board so we do not have duplicates.

Once your group has selected several genes, find them in the RNA-seq spreadsheet and copy those specific rows of data into the workbook tab named “genes to validate” so they are organized together. Add columns for the several GO terms.

Part B: Obtaining Sequences from the UCSC Genome Browser

Next we will find genomic DNA and mRNA sequences for each of our genes of interest. The UCSC Genome Browser consists of a suite of tools for viewing and mining genomic data. We will use some of the basic features of the browser to collect sequences for the mouse Rhodopsin (Rho) gene. Navigate to the UCSC Genome Browser homepage: http://genome.ucsc.edu/

Select Genomes or Genome Browser
In the pull down menus select Group>>>mammal; genome>>>mouse; assembly>>>2011; enter Rhodopsin as the search term>>>submit
Select Rho at chr6 from the result page to access the genome browser view.



This takes you to a view of the entire Rhodopsin gene on the mouse chromosome 6 with multiple other tracks showing data corresponding to this genetic location. For simplicity, first deselect all tracks and start from scratch.

Directly under the viewer select the hide all option to hide all tracks
Under genes and gene predictions select the Ensembl Genes and RefSeq Genes options with full display options
Select Refresh

The Rhodopsin gene with annotated exons (solid bars) and introns (arrowed lines) is now displayed in the viewer with corresponding genome coordinates (Ensembl annotation in red, RefSeq annotation in blue).

The direction of the arrowed line indicates which strand the gene is encoded on. Arrows pointing to the right indicate the gene is coded 5’ to 3’ on the top strand (left to right in this view); arrows pointing to the left indicate the gene is coded 5’ to 3’ on the bottom strand (right to left in this view). Looking at the viewer you can easily see that Rho is coded on the top strand with 5 exons and 4 introns (exon 1 on the far left).

If you are interested in exploring the many additional features and tracks available in the genome browser, there is an excellent tutorial that I encourage you to check out (http://www.openhelix.com/ucsc).

Obtaining sequence information:

To obtain sequence information for a gene or a genetic region, click on the gene name or gene ID on the left side of the viewer (e.g., “Rho”). This brings you to a page index where you can access more info about your gene. Under the Links to sequence heading you have options to view the genomic DNA, mRNA, or protein sequence for this region. We will collect gDNA and mRNA sequences for Rhodopsin. Select the Genomic sequence link first to go to a sequence formatting page. Get the Rho sequence with the following formatting options and paste it into a Word file.

5’UTRs, CDS exons, 3’UTRs, introns
One FASTA record per gene
Exons in upper case, everything else in lower case
Submit

This outputs the Rhodopsin genomic DNA sequence with all exons in upper case and introns and everything else in lower case. Visually, you should be able to pick out the 5 exons by seeing where the upper case letters are separated from lower case. For now, copy/paste the entire sequence into an MS Word file. Go back to the Rho index page and select the mRNA sequence link. This outputs the mRNA sequence (with Ts instead of Us), that is, all of the exon sequences stitched together with the intron sequences spliced out. Copy/paste this sequence into a Word file as well.
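The upper case/lower case convention also makes it easy to find the exons with a short script. The Python sketch below reads a FASTA record, reports where each upper-case run starts and ends, and stitches the runs together the way splicing does. It uses a short made-up sequence, not the real Rho sequence, and the positions it reports are counted from the first base of the pasted sequence rather than in chromosome coordinates.

```python
import re

def read_fasta_sequence(text):
    """Join the sequence lines of a single FASTA record, dropping the '>' header."""
    return "".join(line.strip() for line in text.splitlines()
                   if line.strip() and not line.startswith(">"))

def exon_spans(gdna):
    """Return (start, end, length) for each upper-case run, counting from 1."""
    return [(m.start() + 1, m.end(), m.end() - m.start())
            for m in re.finditer(r"[A-Z]+", gdna)]

def splice(gdna):
    """Concatenate the upper-case runs, i.e. remove everything in lower case."""
    return "".join(re.findall(r"[A-Z]+", gdna))

# Toy record, NOT the real Rho sequence: three "exons" separated by two "introns"
toy_fasta = ">toy_gene\nATGGCCgtaagtttcag\nGTACATCCgtgagtcctagTTGA\n"
gdna = read_fasta_sequence(toy_fasta)
for number, (start, end, length) in enumerate(exon_spans(gdna), start=1):
    print(f"exon {number}: {start}-{end} ({length} bp)")
print(splice(gdna))
```

Joining the lines before searching matters because an exon often continues across a line break in the FASTA output.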

Part C: Editing and annotating sequences in ApE

For your take home assignment you will import and annotate these sequences using a program called A plasmid Editor (ApE). ApE is freeware for sequence analysis developed by Wayne Davis at the University of Utah. The program is installed on all of the #3033 lab computers, but you will also need to install it on a computer that you can use outside of class. It can be easily downloaded at http://biologylabs.utah.edu/jorgensen/wayned/ape/ and installed onto your personal computer (note: there are slightly different installation instructions for Mac users).

Creating a new sequence file

Open ApE and copy/paste your gDNA and mRNA sequences that you obtained from UCSC into separate new DNA entries. Be sure not to paste in the FASTA sequence tags. You'll notice the software warns you if you try to paste in illegal letters (ie, not ATGC) and will remove them.

To find a particular sequence, press "Command F" or click on the "binoculars" icon or select "Find" under the "Edit" menu. Input the sequence you're looking for (type or copy/paste) into the search field and click "Find next". Use this feature to search either of your sequences for the 3rd exon of the Rho gene:

MmRho 3rd exon:

GTACATCCCTGAGGGCATGCAATGTTCATGCGGGATTGACTACTACACACTCAAGCCTGAGGTCAACAACGAATCCTTTGTCATCTACATGTTCGTGGTCCACTTCACCATTCCTATGATCGTCATCTTCTTCTGCTATGGGCAGCTGGTCTTCACAGTCAAGGAG
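The same kind of search can be sketched in Python, which also makes it easy to check the opposite strand. The helper names below are hypothetical and the sequences are short toy examples; to try it on your own data, paste your gDNA or mRNA sequence in as the target and the exon sequence above as the query.

```python
# Translation table for complementing DNA in either case
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def locate(query, target):
    """Return (strand, start, end) of the first match, counting from 1, or None."""
    q, t = query.upper(), target.upper()
    index = t.find(q)
    if index != -1:
        return ("+", index + 1, index + len(q))
    # Not found as written, so try the reverse complement
    index = t.find(reverse_complement(q))
    if index != -1:
        return ("-", index + 1, index + len(q))
    return None

# Toy sequences for illustration only
target = "ttgacGTACATCCaag"
print(locate("GTACATCC", target))
print(locate("GGATGTAC", target))
print(locate("AAAAAAAA", target))
```

A "-" result means the query matches the bottom strand, which is the situation you would annotate with the reverse-strand option for features such as reverse primers.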

Annotating sequence features

A nucleic acid sequence looks like nothing more than a bunch of random As, Cs, Gs and Ts. To make sense of it we will annotate our sequences, meaning we will highlight a few features as points of reference. To annotate, highlight a portion of sequence with the cursor, then select Features>>>New Feature. Give the feature a name (e.g., “exon 3”) and select a Forward color to highlight your sequence. (Note: for annotating features on the reverse strand, such as reverse primers, select the “Rev-Com” option on the top right of the edit feature screen.) Hit OK. You should see exon 3 annotated in your sequence in whatever color you selected. You can select areas of your sequence to find the precise nt location or size of a highlighted region using the metrics at the top of your sequence.

What nucleotides does exon 3 start and end at on the gDNA sequence? How big is exon 3 in bp? What nt represents the junction between exon 2 and exon 3 in the mRNA sequence?

To view and print your annotated sequence map, select Enzymes>>>Text Map. Keep default settings for configurations and hit OK. (This can also be done by selecting the “text map” icon in the tool bar above the sequence; 3rd icon from the left.) This command gives you your sequence + annotations in a printable format. Right click to print OR save your annotated text map sequence to a Word document as a screenshot and print via Word. There are a number of other features not described in this guide that you can explore on your own if you like.

Due in next class March 19 (individual assignment)

Text map print out of your MmRho genomic DNA sequence with exon 2 and exon 3 annotated (print out does not have to be in color)

Indicate the nt that each exon starts at and how big each exon is (mark these numbers in pen or pencil on your printouts)

Text map print out of your MmRho mRNA sequence with exon 2 and exon 3 annotated (does not have to be in color)

Don’t forget to put your name on the printouts! See examples below



Example ApE mRNA sequence:

Here’s the same sequence in “text map” view:

