Introduction to Bioinformatics: RNA-Seq Analysis
Hamza Farooq, MSc.
LMP Seminar Series
15 October 2018
Outline of seminar
- Next Generation Sequencing (NGS) overview
- Types of data generated from raw sequenced reads (FASTQ, SAM, BAM)
- How to download publicly available RNA-Seq data from Gene Expression Omnibus
- Preliminary differential expression analysis of publicly available RNA-Seq data using Galaxy
Module 1
Genome Variation
Two unrelated humans have genomes that are ~99.8% similar by sequence (~3-4 million differences).
Most differences are small, e.g. Single Nucleotide Polymorphisms (SNPs).
Human and chimpanzee genomes are about 96% similar.
Pictures: http://www.dana.org/news/publications/detail.aspx?id=24536, http://en.wikipedia.org/wiki/Chimpanzee
bioinformatics.ca
Sanger Sequencing
Slide credit: Aaron Quinlan
Use dideoxynucleotides to inhibit elongation of a DNA strand
Separate strands with gel electrophoresis
Technology revolution
From sequencing genomes in months and years to sequencing genomes in hours or minutes!
Project cost: from billions of dollars to thousands of dollars
DNA sequencing: DNA polymerase
[Figure: a single-stranded DNA template, free nucleotides and DNA polymerase combine for strand synthesis, 3' to 5'.]
DNA polymerase moves along the template in one direction, integrating complementary nucleotides as it goes.
Sequencing by synthesis (Solexa/Illumina)
Template to cluster: polymerase chain reaction (PCR)
Repeatedly inject a mixture of color-labeled nucleotides (A, C, G and T) and DNA polymerase. When a complementary nucleotide is added to a cluster, the corresponding color of light is emitted. Capture images of this as it happens.
[Figure: DNA polymerase adding labeled nucleotides to clusters, followed by an image capture (snap). Shown is just the first sequencing cycle.]
Sequencing by synthesis (Solexa/Illumina), continued
Line up images and, for each cluster, turn the series of light signals into the corresponding series of nucleotides.
[Figure: images of the same clusters across Cycle 1 to Cycle 5.]
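The step of turning light signals into nucleotides can be illustrated with a toy example. This is only a sketch and not Illumina's actual base-calling method: the channel order, the function name `call_bases` and the intensity values are all made up for illustration.

```python
# Toy base caller: for one cluster, pick the brightest colour channel in each cycle.
# The channel order and the intensities below are illustrative assumptions.
CHANNELS = ("A", "C", "G", "T")

def call_bases(cycles):
    """cycles: list of 4-tuples of intensities, one tuple per sequencing cycle."""
    read = []
    for intensities in cycles:
        # Index of the channel with the strongest signal in this cycle
        brightest = max(range(len(CHANNELS)), key=lambda i: intensities[i])
        read.append(CHANNELS[brightest])
    return "".join(read)

# Five cycles of made-up intensities for a single cluster
cluster_signal = [
    (0.1, 0.8, 0.0, 0.1),
    (0.9, 0.0, 0.1, 0.0),
    (0.0, 0.1, 0.1, 0.7),
    (0.2, 0.1, 0.6, 0.1),
    (0.1, 0.7, 0.1, 0.1),
]
print(call_bases(cluster_signal))  # CATGC
```

Each cycle contributes one base, so five cycles give a five-base read for that cluster.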
FASTQ format
End 1: Sample1_R1.fastq.gz, Sample2_R1.fastq.gz
End 2: Sample1_R2.fastq.gz, Sample2_R2.fastq.gz
Each sample will generate between 5 Gb (100x WES) and 300 Gb (100x WGS) of data
@ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1
GGCTCATCTTGAACTGGGTGGCGACCGTCCCTGGCCCCTTCTTGACACCCAGCGCNNNNNNNNNNNNNNNNA
+
4=B@D99BDDDDDDD:DD?B<<=?>6B#############################################
@ERR127302.2 HWI-EAS350_0441:1:1:1056:1163#0/1
GAATGAGAGGCCCTCCCCGTGGAGGCATGGTATCCGGCCGAGGGGGCTTAGTCATNNNNNNNNNNNNNNNNC
+
B?,B2,?=?1?1B?D@?:@?DB3>AD,8DD??-B?#####################################
@ERR127302.3 HWI-EAS350_0441:1:1:1057:13164#0/1
GGCCGCAGTGCCATTGAGCTCACCAAAATGCTCTGTGAAATCCTGCAGGTTGGGGANNNNNNNNNNNNNNGA
+
DFBH?GDEG>GEGGDHH>HBDBEGD8G<GG<DGGGCB><82???@DDBBDDGGE##################
Each record has four lines: header, sequence, placeholder (+), quality
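The four-line record layout (header, sequence, placeholder, quality) can be read with a short loop. The sketch below assumes strictly four lines per record with no wrapped sequences. The function names `read_fastq` and `open_fastq` and the two short example reads are made up for illustration.

```python
import gzip
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) from a text handle of 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # placeholder line starting with '+'
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'

def open_fastq(path):
    """Open a gzip-compressed FASTQ file (e.g. Sample1_R1.fastq.gz) as text."""
    return gzip.open(path, "rt")

# Two made-up records, used here instead of a real file
example = io.StringIO(
    "@read1\nGATTACA\n+\nDDDDD##\n"
    "@read2\nCCGGA\n+\nBBB?#\n"
)
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, seq, qual)
```

For a real file, the same generator could be used as `read_fastq(open_fastq(path))`. The sequence and quality strings of a record always have the same length.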
Header fields: instrument : flowcell lane : tile number : x : y # index for multiplexed sample / member of pair
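The header fields can be pulled apart with a regular expression. This is a sketch that assumes the header layout shown in the example reads (read ID, then instrument:lane:tile:x:y#index/pair). Headers from other instruments or software versions may be laid out differently. The pattern and the function name `parse_header` are illustrative.

```python
import re

# Pattern for headers such as:
#   @ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1
# Assumption: this matches only the layout used in the example reads.
HEADER_PATTERN = re.compile(
    r"^@?(?P<read_id>\S+)\s+"
    r"(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<pair>[12])$"
)

def parse_header(header):
    """Return the header fields as a dict of strings."""
    match = HEADER_PATTERN.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header layout: {header!r}")
    return match.groupdict()

fields = parse_header("@ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1")
for name, value in fields.items():
    print(f"{name}: {value}")
```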
Quality characters, from lowest to highest quality (ASCII 33 to 126):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
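Each quality character encodes a Phred score for one base. The sketch below assumes the common Phred+33 encoding, where the score is the character's ASCII code minus 33. Some older Illumina files use an offset of 64 instead, so the offset is left as a parameter. The function names are illustrative.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    Assumption: Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

quality = "4=B@D99BDD"  # first ten quality characters of read ERR127302.1
scores = phred_scores(quality)
print(scores)
print([round(error_probability(q), 4) for q in scores])
```

Under this assumption, the long runs of "#" at the end of the example reads correspond to a score of 2, the lowest quality in those reads.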
After generation of FASTQ data
• What to do with the obtained reads?
• Most of the time the reads will be aligned to a reference genome
– Leverages the high-quality assemblies that already exist for a species each time an individual is sequenced
SAM/BAM
• Used to store alignments
• SAM = text, BAM = binary
SRR013667.1 99 19 8882171 60 76M = 8882214 119
NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA
#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>
Fields: read name, flag, reference, position, CIGAR, mate position, followed by the bases and the base qualities
Sample1.bam
Sample2.bam
Between 10 Gb and 500 Gb per BAM file
SAM: Sequence Alignment/Map format
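A SAM alignment line is tab-separated, so it can be split into named fields. The sketch below uses the eleven mandatory columns of the SAM format. Two of them are not labelled on the slide: the value 60 is the mapping quality and 119 is the template length. The function names are illustrative, and the bases are copied as shown on the slide, where they appear shorter than the 76M CIGAR implies.

```python
# The eleven mandatory SAM columns, in order
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")

# Meaning of the lower flag bits
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
}

def parse_sam_line(line):
    """Split one alignment line into a dict of the mandatory fields."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    return record

def describe_flag(flag):
    """List the properties encoded in a SAM flag value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

# The example alignment from the slide, joined with tabs
line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>",
])
record = parse_sam_line(line)
print(record["rname"], record["pos"], record["cigar"])
print(describe_flag(record["flag"]))
```

Flag 99 is 1 + 2 + 32 + 64: a paired read, mapped in a proper pair, whose mate is on the reverse strand, and which is the first read of the pair.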
Summary
• NGS technology allows generation of hundreds of millions of reads for different high-throughput purposes, including transcript quantification and genome sequencing.
• FASTQ format – DNA sequence + quality of each nucleotide for each read from the sequencer.
• SAM format – FASTQ + alignment info (chromosome, start, end for each read)
• BAM format – SAM converted to binary form to conserve space
How to get and process sequencing data
• Overview of 3 useful websites:
- Gene Expression Omnibus
- Repository for descriptions of sequencing data generated from studies
- Sequence Read Archive
- Repository for downloading publicly accessible sequencing data
- Galaxy
- Online platform for free bioinformatics processing
Gene Expression Omnibus
Gene Expression Omnibus: Datasets
Sequence Read Archive
Galaxy
Workflows / Pipelines
Demo of “at home” RNA-Seq Analysis
First, we need our input data (GEO)
Description of the sample
SRA Interactive Download Page
All files related to that experiment are retrieved
Can filter the data before downloading, and can also download in FASTA format (i.e. FASTQ but with no quality information)
Alternatively, use the SRA command-line tools
- Available for all operating systems
- Can download reads more precisely
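As a rough sketch of the command-line route, the snippet below builds a `fastq-dump` command from the SRA Toolkit. It assumes the toolkit is installed and on the PATH. The accession SRR013667 is borrowed from the SAM example in these slides purely for illustration, and the helper names are made up. Limiting the number of spots is one way to download only part of a run.

```python
import shutil
import subprocess

def build_fastq_dump_command(accession, max_spots=None, outdir="."):
    """Build a fastq-dump command line as a list of arguments."""
    cmd = ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir]
    if max_spots is not None:
        # Stop after this spot ID, i.e. download only the first reads
        cmd += ["--maxSpotId", str(max_spots)]
    cmd.append(accession)
    return cmd

def run_if_available(cmd):
    """Run the command only when the tool is installed; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        print(f"{cmd[0]} not found; install the SRA Toolkit first")
        return False
    subprocess.run(cmd, check=True)
    return True

command = build_fastq_dump_command("SRR013667", max_spots=1000)
print(" ".join(command))
# To actually download, call: run_if_available(command)
```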
Disclaimer
• The dataset being analyzed is a FASTA file generated with old sequencing technology
• Subsampled reads for carcinoma vs matched normal samples
• However, the steps to analyze publicly downloaded or your own data using Galaxy would be similar
• Key takeaway: familiarize yourself with the Galaxy work environment
General overview of RNA-Seq Pipeline
• Raw FASTQ Data
– QC checking, adapter trimming, low-quality base trimming: FastQC, Cutadapt, Trim Galore
• QC Passed Reads
– Alignment to reference genome/transcriptome: STAR, HISAT2, TopHat
• Aligned BAM
– Quantification of transcripts: HTSeq, StringTie, Cufflinks, RSEM
• Quantified Transcripts
– Normalization and Differential Expression between Genes: DESeq2, Ballgown, Cuffdiff, EBSeq
• Final DE Gene List
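To give a feel for what the normalization and differential expression step computes, here is a deliberately simplified sketch. It is not the method used by DESeq2, Ballgown, Cuffdiff or EBSeq: it only scales counts to counts per million (CPM) and takes a log2 ratio, with no replicates and no statistical test. The gene names and counts are made up.

```python
import math

def counts_per_million(counts):
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

def log2_fold_change(normal, tumour, pseudocount=1.0):
    """log2(tumour / normal) on CPM values; the pseudocount avoids division by zero."""
    cpm_n = counts_per_million(normal)
    cpm_t = counts_per_million(tumour)
    return {
        gene: math.log2((cpm_t[gene] + pseudocount) / (cpm_n[gene] + pseudocount))
        for gene in cpm_n
    }

# Made-up counts: same proportions, but the second sample was sequenced twice as deep
normal = {"geneA": 500, "geneB": 300, "geneC": 200}
deeper = {"geneA": 1000, "geneB": 600, "geneC": 400}
same = log2_fold_change(normal, deeper)

# Made-up counts where geneB goes up and geneA goes down
carcinoma = {"geneA": 100, "geneB": 800, "geneC": 100}
changed = log2_fold_change(normal, carcinoma)
print(changed)
```

The first comparison shows why normalization matters: doubling the sequencing depth doubles every raw count, but the fold changes after CPM scaling are all zero.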
Going to Galaxy
Uploading Data
*Names changed, and custom reference and GTF (annotation) files uploaded
Alignment to Reference
Transcript Quantification
We’ll want the “assembled transcripts” output
Merging Quantified Transcripts
This will produce a single combined counts matrix
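Conceptually, merging means lining up the per-sample counts by gene. The sketch below illustrates the idea only and is not how the Galaxy tool is implemented. The function name `merge_counts`, the sample names, the gene names and the counts are all made up.

```python
def merge_counts(per_sample):
    """Combine per-sample {gene: count} dicts into a header row and matrix rows.

    Genes missing from a sample get a count of 0.
    """
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    samples = list(per_sample)
    header = ["gene"] + samples
    rows = [[gene] + [per_sample[s].get(gene, 0) for s in samples] for gene in genes]
    return header, rows

per_sample = {
    "normal": {"geneA": 10, "geneC": 3},
    "carcinoma": {"geneA": 25, "geneB": 7},
}
header, rows = merge_counts(per_sample)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```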
Differential Expression
Click on execute
Viewing Results
Exporting Results
How does the whole workflow look?
Extracting Workflow
Publicly Available Datasets
- TCGA: The Cancer Genome Atlas
- International Cancer Genome Consortium
- R2: Genomics Analysis and Visualization Platform
Acknowledgements
• The lecture slides shown today were adapted from the instructors at the Canadian Bioinformatics Workshops (bioinformaticsdotca.github.io)
– Workshops cover topics such as high-throughput sequencing, RNA-Seq analysis, and epigenomic data analysis
– Their course content is free to use!!
• Look at biostars.org for questions and help along the way
Thank you
Any questions?