Analysis of RNA sequencing data sets using the Galaxy environment
Dr. Orr Shomroni, Dr. Andreas Leha
14-15.09.2017
Outline
Day 1 topics:
● Introduction to RNA-seq workflow and Galaxy
● Sequence read formats and quality assessment
● Read alignment to the genome and quantification of expression
Day 2:
● Experiment design
● Analysis of differential expression
● Functional enrichment analysis of differentially expressed (DE) genes
Analysis of RNA sequencing data sets
Part I – from base to count
Dr. Orr Shomroni
Microarray and Deep-sequencing core facility
14.09.2017
https://ycl6.gitbooks.io/rna-seq-data-analysis/rna-seq_analysis_workflow.html
The RNA-seq workflow
● Differentially expressed genes across several conditions of an experiment
● “Simple” – two conditions:
  ● Wild type vs. gene knockout mouse
  ● Healthy person vs. cancer patient
  ● Control vs. treatment with drug
● Complexity can increase arbitrarily:
  ● Many conditions, confounding factors, time course experiments, etc.
RNA-seq workflow I – Hypothesis (a.k.a. the research question)
● Important to ensure (statistical) validity of results
● Depends on the hypothesis:
  ● Cell cultures or animals/patients?
  ● Phenotypic effect mild or severe?
  ● Inclusion of non-coding RNA?
  ● ...
● Affects choice of protocols for culturing, RNA extraction, sample preparation, sequencing, bioinformatics and esp. number of replicates per condition!
→ Involve statistician/bioinformatician from the beginning!
RNA-seq workflow I – Experimental design
www.yeastern.com
RNA-seq workflow I – RNA purification
● RNA extraction
  ● Trizol / ready-to-use kits
● Requires 100 ng to 1 µg of cell material
● RNA integrity number (RIN) – designed to estimate the integrity of total RNA samples
● If RIN is high enough, continue to library preparation
www.abmgood.com
RNA-seq workflow I – library preparation
● Library preparation should be carried out by experienced technicians
● For simple differential expression analysis, we recommend mRNA sequencing
● Cheaper
● RiboZero step to remove rRNA can result in some contamination
www.illumina.com
RNA-seq workflow I – Sequencing
Different technologies, but Illumina's sequencing by synthesis (SBS) approach is usually used for RNA-seq
→ cycle-specific fluorescence intensities
RNA-seq workflow I – Sequencing processing
● Post-processing of intensity values
● basecalling: convert sequence of intensities to nucleotide sequences (“reads”)
● demultiplexing: assign reads to samples based on their adapter sequences (“barcodes”)
→ Sample-specific sequence read files
● Fragments can be sequenced from one or both ends
→ “unpaired”/“single-end” vs. “paired-end”
● RNA-seq often run with single-end
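The demultiplexing step can be sketched as a lookup from barcode to sample. This is a minimal illustration only: the barcodes, sample names and the `demultiplex` helper are hypothetical, and exact barcode matching is a simplification (production demultiplexers usually tolerate sequencing errors in the barcode).

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group read sequences by sample, using exact barcode matches.

    reads: iterable of (barcode, sequence) pairs.
    barcode_to_sample: dict mapping a barcode string to a sample name.
    Reads with an unknown barcode are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Hypothetical barcodes and sample names, for illustration only.
barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}
reads = [("ACGT", "GGCATT"), ("TTAG", "CCATGA"), ("ACGT", "TTGACA"), ("NNNN", "AAAAAA")]
per_sample = demultiplex(reads, barcodes)
```

Each value in `per_sample` corresponds to what would be written to one sample-specific read file.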
biocluster.ucr.edu
RNA-seq workflow II – FASTQ – the sequencing read file format
● “Raw” reads from sample-specific fragments
● Per-base quality information (Phred scores, encoded as ASCII characters with an offset of 33)
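To make the file format concrete, here is a minimal sketch of a FASTQ reader. It assumes the common four-lines-per-record layout and an ASCII offset of 33 for the quality characters; the record used below is made up, and `read_fastq` is an illustrative helper, not part of Galaxy or FastQC.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file with exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record: " + header)
        # Each quality character encodes one Phred score as ASCII code minus offset.
        yield FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


# Made-up record: 'I' is ASCII 73 (Phred 40), '#' is ASCII 35 (Phred 2).
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
```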
RNA-seq workflow II – FASTQ processing
Steps towards identifying differential expression of genes between samples:
1) Quality assessment of raw reads
2) Alignment of reads to the genome
3) Quantification of gene expression
QC of Raw Reads
Read Alignment
Quantification
How can I do that on my own?
Galaxy
● Open source, web-based platform for data intensive biomedical research developed at Penn State and Johns Hopkins University
● Many (NGS) bioinformatics tools available as “plug-ins”
● “Container-based” – server runs in a container that can be installed and customized on other systems
→ many instances of Galaxy running worldwide
● User works on “histories” of data and processes, data can be shared with other users
● Galaxy@GWDG: https://galaxy.gwdg.de/
Galaxy – practical I
● Open https://galaxy.gwdg.de/ and login with your GWDG/course account
Galaxy – practical I
Uploading data into Galaxy – a sandbox example:
● Go to www.ensembl.org
● Click “Downloads”, then “Download data via FTP”
● Click on “GTF” for Human Gene sets
● Download “Homo_sapiens.GRCh38.90.gtf.gz” to your PC
● Go back to Galaxy
● Click “Get Data”, then “Upload File from your computer”
● Choose local file from your PC (check “Download” folder)
● If successful, close the window
● Optional: rename history (click on “unnamed history”)
You should see this: Your history should look like this:
Galaxy practical I
● Uploading data may be time-consuming
● Galaxy allows importing data from public repositories and sharing data with other users
● We shared a data set from a published study:
Published January 2017
Galaxy practical I
● “Shared Data” → “Data Libraries” → “RNA_Seq_CourseData” → “Raw Data”
● 3 control condition samples (“GFP...”), 3 overexpression samples (“PCDH7...”)
● Click any of the files to inspect data
● Add all files to your history; several options:
● Individually open files and click “to History” (slow)
● Mark files in folder view and click “to History” (fast)
● Mark whole folder and click “to History” (fast)
● Import into existing history, go to Main menu and click the eye symbol for one of the samples
Here, we demonstrate that overexpression of PCDH7 potently synergizes with lung cancer drivers, including mutant KRAS and EGFR, inducing transformation of human bronchial epithelial cells (HBEC) and promoting tumorigenesis in vivo.
You should see this:
Zoom in to see FastQ file features
← base quality information
← read length
← read nucleotide sequence
RNA-seq workflow II – essential questions about quality control
● How many reads should I have?
● >=25 million reads required for representative transcriptome profile of model organisms such as human and mouse
● PCR introduces many (uninformative) duplicates
● How good are the reads?
● Assess signal-to-noise ratio of sequencing
● Determine proportion of ambiguous bases (“N”)
● Identify fraction of adapters, contamination, etc.
RNA-seq workflow II – Phred scores reflecting on basecall accuracy
How good are the bases/reads? → Phred scale: logarithmic scale of basecall accuracy
Common threshold for good quality
Phred Quality Score | Probability of Incorrect Basecall | Basecall accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10000 | 99.99%
50 | 1 in 100000 | 99.999%
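The table follows from the definition of the Phred scale, Q = -10 · log10(P), where P is the probability of an incorrect basecall. A small sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a basecall with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the table: score, error probability, accuracy in percent.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(q, p, 100 * (1 - p))
```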
RNA-seq workflow II – Quality control indices
Further quality indices:
● Distribution of nucleotide frequencies across the sequences
● GC content per sequence
● Fraction of “N”
● Length distribution of sequences
● Sequence duplication level
● Amount of overrepresented sequences and short (6-8 bp) stretches of nucleotides (“k-mers”)
● Adapter content → “trimming” may be required
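A few of these indices are simple enough to compute by hand, which helps when reading a QC report. The sketch below uses straightforward definitions chosen for illustration; FastQC's exact calculations (e.g. for duplication levels) may differ.

```python
from collections import Counter


def gc_content(sequence):
    """Fraction of G and C among the called (A/C/G/T) bases of one read."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def n_fraction(sequences):
    """Fraction of ambiguous bases ("N") over all reads."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    return sum(s.upper().count("N") for s in sequences) / total


def length_distribution(sequences):
    """Number of reads observed for each read length."""
    return Counter(len(s) for s in sequences)


def duplicate_fraction(sequences):
    """Fraction of reads that are exact copies of an earlier read."""
    if not sequences:
        return 0.0
    return 1 - len(set(sequences)) / len(sequences)
```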
RNA-seq workflow II – FastQC: A quality control tool for high throughput sequence data
Systematically assess quality for NGS samples in Galaxy
→ FastQC
● Open source tool
● Runs on all platforms
● Assesses various quality parameters, including contamination by adapters
● Allows the user to provide contaminant sequences
● Generates intuitively interpretable output and visualization
RNA-seq workflow II – FastQC per base quality scores
Galaxy practical II – Quality control with FastQC
● General Sequencing → Quality Control → FastQC and read the description
● Click “Multiple datasets” and select all FASTQ files from your history
● Click “Execute”
Galaxy practical II – Quality control with FastQC
● Execution calls several instances of the FastQC program, which are “scheduled” by the server
→ execution time depends on file size, number of files, number of users and server load
● After a few minutes you should see FastQC results in your history (hit refresh symbol if not)
● As soon as any job is finished you can inspect the results → choose Webpage, then eye symbol
● Scroll through the Webpage
→ we are here to answer your questions!
● FastQC RawData contains detailed reports
RNA-seq workflow III – Short read alignment
Goal: determine the origin of sequenced reads w.r.t. the genome
http://www.nature.com/nbt/journal/v27/n5/fig_tab/nbt0509-455_F2.html
RNA-seq workflow III – Short read alignment
Sequence alignment:
Re-arrangement of two or more biological sequences to identify corresponding nucleotides/amino acids
Example:
sequence 1: ACATCGA
sequence 2: ACTAGCTA
possible alignment:
ACATCG--A
AC-TAGCTA
RNA-seq workflow III – Short read alignment
Terminology:
● match: two residues in a position match
● mismatch: residue is substituted by different residue
● gap: residue(s) is/are inserted or deleted
ACATCG--A
AC-TAGCTA
match
mismatch
insertion
deletion
RNA-seq workflow III – Short read alignment
Quality of an alignment: alignment score = sum of the scores of the individual positions
Example: position scores: match=+1, mismatch=-1, gap=-1
possibility 1:
A C A T C G - - A
A C - T A G C T A
score: 5*1 + 4*(-1) = 1

possibility 2:
A C A T C - G - - A
A C - T - A G C T A
score: 5*1 + 5*(-1) = 0
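The two scores can be checked with a few lines of code. This sketch scores two already-aligned rows column by column with the position scores from the example (match = +1, mismatch = -1, gap = -1); `alignment_score` is an illustrative helper.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Sum the position scores of two aligned rows of equal length ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("ACATCG--A", "AC-TAGCTA"))    # possibility 1 -> 1
print(alignment_score("ACATC-G--A", "AC-T-AGCTA"))  # possibility 2 -> 0
```

Possibility 1 scores higher because it explains the same five matches with one mismatch and three gap positions instead of five gap positions.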
RNA-seq workflow III – Short read alignment
Global vs local alignment:
● Global: align sequences end-to-end
● Local: find optimal placement of (sub)sequence(s) within longer sequence
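The difference can be illustrated with the two classic dynamic-programming schemes, Needleman-Wunsch (global) and Smith-Waterman (local). The sketch below only computes the best score, with the simple scores match = +1, mismatch = -1, gap = -1 assumed for illustration; short read aligners use faster index-based heuristics rather than this full table.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best end-to-end alignment score of a and b (Needleman-Wunsch)."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score of any pair of substrings of a and b (Smith-Waterman)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: the alignment may restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("ACATCGA", "ACTAGCTA"))
print(local_score("TTTACGTTTT", "GGACGGG"))
```

In the second call only the shared stretch "ACG" contributes to the local score; the flanking bases are ignored instead of being penalised.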
RNA-seq workflow III – Short read alignment
Application of sequence alignment:
● Homology detection: identify best match of a sequence to many sequences in a database → e.g. NCBI BLAST
● Identify conserved sites via multiple alignments of related protein sequences → e.g. EMBL-EBI Clustal Omega
● Short read alignment (“mapping”): Identify origin of a sequence w.r.t. a genomic reference sequence → e.g. Bowtie, BWA, TopHat, STAR, HiSAT, ...
RNA-seq workflow III – Short read alignment
Reference sequence:
the complete set of DNA sequences (genome) or mRNA sequences (transcriptome) of an organism
● usually provided as (multi-)Fasta file containing one sequence per chromosome/transcript
● completeness and complexity depend on how far the organism's genome project has advanced:
● human genome (GRCh38.p11): 24 almost complete chromosomal sequences + mitochondrial genome + ~170 orphan regions, ~6% undetermined nucleotides (“N”)
● western clawed frog: 400,000 “scaffolds” with ~12% “N”
Organism | Assembly | Length (Mb) | Chromosomes | Genes
Human (Homo sapiens) | GRCh38.p11 | 3253.85 | 22 chromosomes, 2 sex chromosomes and non-nuclear mitochondrial DNA | 60298
African clawed frog (Xenopus laevis) | Xenopus_laevis_v2 | 2718.43 | 18 chromosomes, non-nuclear mitochondrial DNA | 36776
RNA-seq workflow III – Short read alignment
#Organism/Name | SubGroup | Size (Mb) | Chrs | Organelles | Plasmids | Assemblies
Locusta migratoria | Insects | 5759.8 | - | - | - | 1
Orycteropus afer | Mammals | 4444.08 | - | 1 | - | 1
Chrysochloris asiatica | Mammals | 4210.11 | - | 1 | - | 1
Parhyale hawaiensis | Other Animals | 4023.76 | - | - | - | 1
Elephantulus edwardii | Mammals | 3843.98 | - | - | - | 1
Apodemus sylvaticus | Mammals | 3758.14 | - | - | - | 1
Dasypus novemcinctus | Mammals | 3631.52 | - | 1 | - | 1
Procavia capensis | Mammals | 3602.18 | - | - | - | 1
Monodelphis domestica | Mammals | 3598.44 | 9 | 1 | - | 1
Carlito syrichta | Mammals | 3453.86 | - | 1 | - | 1
Myodes glareolus | Mammals | 3443.07 | - | - | - | 1
Pongo abelii | Mammals | 3441.24 | 24 | 1 | - | 2
Cervus elaphus | Mammals | 3438.62 | 35 | - | - | 1
Dryococelus australis | Insects | 3416.45 | - | - | - | 1
Pan paniscus | Mammals | 3286.64 | 24 | 2 | - | 1
Choloepus hoffmanni | Mammals | 3286.01 | - | - | - | 1
Loxosceles reclusa | Other Animals | 3262.48 | - | - | - | 1
Homo sapiens | Mammals | 3253.85 | 48 | 1 | - | 61
RNA-seq workflow III – Short read alignment
Transcriptome sizes are substantially smaller, e.g. human transcriptome:
● 20,338 coding genes
● 22,521 non-coding genes
  ● 5,363 small non-coding
  ● 14,720 long non-coding
  ● 2,222 misc non-coding
Total number of transcripts can be much higher:
● 200,310 gene transcripts
RNA-seq workflow III – Short read alignment
Goal: determine (optimal) mapping of each sequencing read to reference genome/transcriptome
@SRR2549634.1 SEB9BZKS1:279:C4JALACXX:8:1101:1292:2222/1
NCCCCTTGGTCACCTTGCTTGATTATCGTAGCACCTTTGGGGACGGACTTC
@SRR2549634.2 SEB9BZKS1:279:C4JALACXX:8:1101:1771:2249/1
GTTAGATGCAACTCTTGGCCATAAATCGGCACATTCCTTACCGACTGGACC
@SRR2549634.3 SEB9BZKS1:279:C4JALACXX:8:1101:4645:2229/1
NGAATGGTATGTTGCTGGACCTCAGAAGGATGTTCAAAACCACAGTCAATG
@SRR2549634.4 SEB9BZKS1:279:C4JALACXX:8:1101:4518:2229/1
NTGGATCCTCAAATCCCACCACATCCATCCAAGGATCATGATTAAAAGCGT
@SRR2549634.5 SEB9BZKS1:279:C4JALACXX:8:1101:5231:2241/1
NTGGGTATTCACTGAAAGCTTCAACACACATTGGCTTAGATGGAACGAACT
@SRR2549634.6 SEB9BZKS1:279:C4JALACXX:8:1101:5383:2243/1
TGGGTGTAGACATCTTCAACACCAGCCAATTGCAACAACTTTTTGACAGCT
@SRR2549634.7 SEB9BZKS1:279:C4JALACXX:8:1101:7221:2245/1
TGGAAATGTTGTCCAGAGTTATCTGGATGATCTAACGTGGGGTTATTGTTT
@SRR2549634.8 SEB9BZKS1:279:C4JALACXX:8:1101:8304:2249/1
GCCAGACAGAGGTTTTTCAAATTAGGAAATGTTTGAGCCAATGTGGAAATT
@SRR2549634.9 SEB9BZKS1:279:C4JALACXX:8:1101:9168:2233/1
NCTATTTTCATCATCTGATTGAAAAAAAACATTGAAAATATACTCATCATT
@SRR2549634.10 SEB9BZKS1:279:C4JALACXX:8:1101:9915:2241/1
NGTGGACAAGATTCTTGGAGCCTTACCCTTGTGTGGACCCATACCGAAGTG
RNA-seq workflow III – Short read alignment
● Mapping = always local alignment
● Reads from RNA can span exons
→ “spliced” (gapped) alignment necessary
RNA-seq workflow III – Short read alignment
● Exact alignment of each read to each genome position is very slow
→ efficient algorithms make use of precomputed tables of short word occurrences in the reference sequence (“hashing”, “indexing”)
● Example:
ACATCGAT consists of
ACA CAT ATC TCG CGA GAT

words of length 3 | occurrences in the genome
AAA | chr1:12345-12347,...
AAC | chr3:9876-9874,...
AAG | ---
AAT | chrX:81838-81840
… | …
ACA | chr13:123-125,...
… | …
TTT | chr1:2435-2437,...
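The word-table idea can be sketched in a few lines: index every word of length k in the reference once, then look up the words of a read and let them "vote" for a start position. This is a toy illustration with a made-up one-sequence reference ("chrToy"); real aligners use far more compact index structures and handle mismatches, gaps and splicing.

```python
from collections import Counter, defaultdict


def kmers(sequence, k):
    """All overlapping words of length k, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_index(reference, k):
    """Map each word of length k to its (sequence name, 1-based start) occurrences."""
    index = defaultdict(list)
    for name, sequence in reference.items():
        for offset, word in enumerate(kmers(sequence, k)):
            index[word].append((name, offset + 1))
    return index


def candidate_starts(read, index, k):
    """Count, for each reference position, how many words of the read support
    the read starting there (exact word matches only)."""
    votes = Counter()
    for offset, word in enumerate(kmers(read, k)):
        for name, start in index.get(word, []):
            votes[(name, start - offset)] += 1
    return votes


print(kmers("ACATCGAT", 3))
index = build_index({"chrToy": "ACATCGAT"}, 3)  # hypothetical reference
print(candidate_starts("CATCG", index, 3).most_common(1))
```

Only the few candidate positions with many votes then need to be checked by a proper alignment, which is what makes the approach fast.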
RNA-seq workflow III – Short read alignment
Galaxy@GWDG provides three read alignment tools:
● RNA STAR*
  ● Advantage: one of the most sensitive, precise, versatile and fast read alignment programs
  ● Disadvantage: memory-intensive
● HISAT2** – fast and sensitive, can be run on a laptop
● TopHat*** – fast splice junction mapper, uses Bowtie2 and then analyzes the mapping results to identify splice junctions between exons
● genome indexes precomputed for human and mouse
*Dobin et al., Bioinformatics, 2013
**Kim et al., Nature Methods, 2015
***Kim et al., Genome Biology, 2013
Galaxy practical part III – short read alignment
● Transcriptomics → Mapping → HISAT2
● Select “unpaired reads”
● Choose one(!!!) of the six FASTQ files
● Select “Homo_sapiens...” as a reference genome
● Click “Execute”
● When job is scheduled click on HISAT2 again and read the description
Note: mapping will take a while (~30min.)!
RNA-seq workflow III – Short read alignment
Visualization of alignments as stacked read sequences:
RNA-seq workflow III – Short read alignment
More flexible: Genome browsers
● Visualization of reads, splice patterns, mutations etc.
● Integration of annotation, public data, known SNPs etc.
● UCSC online genome browser: genome.ucsc.edu
● Downloadable and usable from Galaxy: IGV from Broad Institute* software.broadinstitute.org/software/igv/
*Robinson et al., Nature Biotechnology, 2011
RNA-seq workflow III – Short read alignment
Read coverage: # of reads matching a position/region
● Allows statements about gene expression level (RNA-seq)
● High coverage helps to identify genomic variants
● Depends on sequencing depth
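Per-position coverage can be sketched as a simple pile-up of aligned intervals. The coordinates below are made up, positions are treated as 1-based and inclusive, and spliced reads (which would contribute several blocks) are ignored for simplicity.

```python
def coverage(alignments, region_start, region_end):
    """Number of aligned reads covering each position of a region.

    alignments: list of (start, end) intervals on one chromosome,
    1-based and inclusive. Returns one depth value per position
    from region_start to region_end.
    """
    depth = [0] * (region_end - region_start + 1)
    for start, end in alignments:
        first = max(start, region_start)
        last = min(end, region_end)
        for position in range(first, last + 1):
            depth[position - region_start] += 1
    return depth


# Three made-up reads on positions 1-6.
depth = coverage([(1, 4), (3, 6), (5, 5)], 1, 6)
mean_depth = sum(depth) / len(depth)
```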
RNA-seq workflow III – Short read alignment
SAM = Sequence Alignment/Map format
● Human-readable standard format for alignment characterization
● Contains general information on alignment program/parameters and reference sequence used
● One entry per alignment with information on location, quality and more
● BAM = Binary (compressed) version
● samtools: popular tool for SAM/BAM file manipulation
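Because SAM is plain tab-separated text, a single alignment entry can be unpacked by hand. The sketch below reads the 11 mandatory columns plus optional TAG:TYPE:VALUE fields and checks two bits of the FLAG column (0x4 = read unmapped, 0x10 = read on the reverse strand). The example line is made up, and header lines (starting with "@") are assumed to have been skipped; for real work a dedicated tool such as samtools is the better choice.

```python
from typing import Dict, NamedTuple


class SamAlignment(NamedTuple):
    read_name: str
    flag: int
    reference: str
    position: int          # 1-based leftmost mapping position
    mapping_quality: int
    cigar: str
    sequence: str
    tags: Dict[str, object]


def parse_sam_line(line: str) -> SamAlignment:
    """Parse one alignment line (not a header line) of a SAM file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 columns")
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return SamAlignment(fields[0], int(fields[1]), fields[2], int(fields[3]),
                        int(fields[4]), fields[5], fields[9], tags)


def is_unmapped(alignment: SamAlignment) -> bool:
    return bool(alignment.flag & 0x4)


def is_reverse_strand(alignment: SamAlignment) -> bool:
    return bool(alignment.flag & 0x10)


# Made-up alignment entry for illustration.
line = "read1\t16\tchr4\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNH:i:1"
alignment = parse_sam_line(line)
```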
RNA-seq workflow III – Short read alignment
Several metrics allow statements about the total sample alignment quality:
● Total number of mapped reads (→ coverage) and fraction of reads mapping to the genome...
  ● ...uniquely: evidence for particular gene/transcript
  ● ...multiply: paralogs, CNV, ribosomal RNA, ...
  ● ...not at all: contamination, genomic DNA, ...
● # mismatches
● # novel splice junctions
● ...
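The first group of metrics boils down to a three-way split of the reads. As a rough sketch, assume the number of reported alignments per read is already known (the `hits_per_read` dictionary below is made-up input; how this number is obtained depends on the aligner and its output):

```python
def summarize_mapping(hits_per_read):
    """Fractions of uniquely, multiply and not mapped reads.

    hits_per_read: dict mapping a read name to the number of genomic
    locations the read was aligned to (0 = not mapped at all).
    """
    total = len(hits_per_read)
    if total == 0:
        raise ValueError("no reads given")
    unique = sum(1 for hits in hits_per_read.values() if hits == 1)
    multiple = sum(1 for hits in hits_per_read.values() if hits > 1)
    unmapped = sum(1 for hits in hits_per_read.values() if hits == 0)
    return {
        "total": total,
        "unique": unique / total,
        "multiple": multiple / total,
        "unmapped": unmapped / total,
    }


# Made-up example: two unique reads, one multi-mapper, one unmapped read.
summary = summarize_mapping({"r1": 1, "r2": 1, "r3": 3, "r4": 0})
```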
RNA-seq workflow III – Short read alignment
Example mapping output:
● Click on the finished job and inspect the mapping statistics
● Click the info icon to access information on the job details, including the version of the software used
Galaxy practical part III – short read alignment
● Start IGV on your system (search on Desktop)
  ● Open “.bat” file
● Choose “Human Hg38” as a reference genome
● Go to the locus field and enter “PCDH7”
Galaxy practical part III – short read alignment
● Shared Data → Data Libraries → RNA_seq_CourseData → Aligned Files
● Import all alignment (“BAM”) files into your history
● Go to main view (“Analyze Data”)
● Click on any of the alignment files from GFP and click “display with IGV local”
● Click on any of the alignment files from PCDH7 and click “display with IGV local”
● Go to IGV, zoom in on the first exon of PCDH7
● Right-click on the data tracks and choose “Collapsed”
RNAseq-workflow IV - quantification of expression
Gene expression quantification
Goal: estimate the gene expression level from counting reads overlapping annotated genes
discoveringthegenome.org
RNAseq-workflow IV – quantification of expression
● Annotations are often available from genome project websites or Ensembl
● Standard format for annotations is the general feature format (GFF) or gene transfer format (GTF)
  ● Tab-delimited files with information on gene structures
  ● 9 fields including flexible “Attributes”
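A single GTF line can be taken apart as follows. This is a minimal sketch: the example line, its gene identifier and gene name are made up, comment lines (starting with "#") are assumed to have been skipped, and only quoted attribute values of the form key "value"; are handled.

```python
import re

# Column names of a GTF line, in file order.
GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute")

# Matches attribute entries of the form: key "value"
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Split one GTF line into its columns and unpack the attribute column."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GTF_COLUMNS):
        raise ValueError("unexpected number of columns: %d" % len(values))
    record = dict(zip(GTF_COLUMNS, values))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])
    record["attribute"] = dict(ATTRIBUTE_PATTERN.findall(record["attribute"]))
    return record


# Made-up exon entry for illustration.
line = ('chr4\texample_source\texon\t1000\t1200\t.\t+\t.\t'
        'gene_id "GENE0001"; gene_name "EXAMPLE1";')
exon = parse_gtf_line(line)
```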
RNAseq-workflow IV – quantification of expression
● The file we down-/uploaded earlier is an annotation in GTF format for the human genome
RNAseq-workflow IV - quantification of expression
Standard procedure: count number of reads that overlap features (here: exons of a gene) and summarize on meta-feature (here: gene) level
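The procedure can be sketched with plain interval overlaps. All coordinates and gene names below are made up, and the handling of reads that overlap more than one gene (set aside as "ambiguous" here) is one possible choice for illustration, not a description of a particular tool's settings.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if two 1-based, inclusive intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a


def count_reads_per_gene(reads, exons):
    """Count reads overlapping exons (features), summarized per gene (meta-feature).

    reads: list of (chromosome, start, end)
    exons: list of (gene_id, chromosome, start, end)
    Returns (counts per gene, number of ambiguous reads, number of unassigned reads).
    """
    counts = {gene_id: 0 for gene_id, _, _, _ in exons}
    ambiguous = 0
    unassigned = 0
    for chrom, start, end in reads:
        genes = {gene_id for gene_id, exon_chrom, exon_start, exon_end in exons
                 if exon_chrom == chrom and overlaps(start, end, exon_start, exon_end)}
        if len(genes) == 1:
            counts[genes.pop()] += 1  # counted once, even if several exons are hit
        elif genes:
            ambiguous += 1            # overlaps exons of more than one gene
        else:
            unassigned += 1           # overlaps no annotated exon
    return counts, ambiguous, unassigned


exons = [("geneA", "chr1", 100, 200), ("geneA", "chr1", 300, 400),
         ("geneB", "chr1", 380, 500)]
reads = [("chr1", 150, 199), ("chr1", 190, 310), ("chr1", 390, 420),
         ("chr1", 600, 650), ("chr2", 150, 199)]
counts, ambiguous, unassigned = count_reads_per_gene(reads, exons)
```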
RNAseq-workflow IV - quantification of expression
Questions and pitfalls when counting mapped reads
● Consider multiply mapped reads?
● Count on gene or exon/transcript level?
● How to count partially mapping reads?
● How to treat overlapping features?
● ...
RNAseq-workflow IV - quantification of expression
● Galaxy@GWDG provides featureCounts* tool for fast and flexible quantification
● Transcriptomics → Counting → featureCounts and read the description
● Click “Multiple datasets” and select all imported alignment files
● load the annotation file (the GTF file) from your history
● Click “Execute”
Quantification should take between 1 and 10 min.
*Liao et al., Bioinformatics, 2014
Galaxy practical part IV – gene expression quantification
● When any dataset is finished, click on eye symbol
● Copy identifier of a gene with >1000 reads assigned and paste it into Ensembl search window
● Optional: rename files according to alignment input
RNA workflow addendum – Summary of quality from multiple samples
● Quality assessment of 6 samples – easy enough to do one by one
● What about more?
● Solution: MultiQC
  ● Supports summary logs from many tools, including FastQC, STAR, Bowtie2, featureCounts, etc.
● Generates a single HTML file, summarizing all results in a single, interactive report
Galaxy practical addendum – quality summary (FastQC)
Questions?
Galaxy practical addendum – quality summary