Introduction to Bioinformatics: RNA-Seq Analysis Hamza Farooq, MSc. LMP Seminar Series 15 October 2018

Introduction to Bioinformatics: RNA-Seq Analysis

Hamza Farooq, MSc.

LMP Seminar Series

15 October 2018


Outline of seminar

- Next Generation Sequencing (NGS) overview

- Types of data generated from raw sequencing reads (FASTQ, SAM, BAM)

- How to download publicly available RNA-Seq data from Gene Expression Omnibus

- Preliminary differential expression analysis of publicly available RNA-Seq data using Galaxy


Module 1

Genome Variation

Two unrelated humans have genomes that are ~99.8% similar by sequence (~3-4 million differences). Most differences are small, e.g. Single Nucleotide Polymorphisms (SNPs).

Human and chimpanzee genomes are about 96% similar.

Pictures: http://www.dana.org/news/publications/detail.aspx?id=24536, http://en.wikipedia.org/wiki/Chimpanzee

bioinformatics.ca


Sanger Sequencing

Slide credit: Aaron Quinlan

Use dideoxynucleotides to inhibit elongation of a DNA strand

Separate strands with gel electrophoresis


Technology revolution

- Then: sequencing genomes in months and years; project cost: billions of dollars

- Now: sequencing genomes in hours/minutes; project cost: thousands of dollars


DNA sequencing: DNA Polymerase

Single-stranded DNA template + free nucleotides + DNA polymerase = strand synthesis

DNA polymerase moves along the template in one direction, integrating complementary nucleotides as it goes.

[Figure: DNA polymerase pairing free nucleotides (A-T, C-G) with a single-stranded DNA template; 3' and 5' ends labelled]
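The pairing rule that the polymerase follows can be sketched in a few lines of Python. This is only an illustration of complementary strand synthesis; the function name `synthesize_complement` and the example template are assumptions made for this sketch, not part of the slides.

```python
# Watson-Crick base pairing followed by DNA polymerase: A pairs with T, C with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesize_complement(template: str) -> str:
    """Return the newly synthesized strand, written 5'->3'.

    The template is given 5'->3'. The polymerase reads it 3'->5',
    so the new strand is the reverse complement of the template.
    """
    return "".join(PAIRS[base] for base in reversed(template.upper()))


# Hypothetical 10-base template.
print(synthesize_complement("CTACAGGGAC"))  # GTCCCTGTAG
```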


Sequencing by synthesis (Solexa/Illumina)

Each template is amplified by polymerase chain reaction (PCR) into a cluster.


Sequencing by synthesis (Solexa/Illumina)

Repeatedly inject a mixture of color-labeled nucleotides (A, C, G and T) and DNA polymerase. When a complementary nucleotide is added to a cluster, the corresponding color of light is emitted. Capture images of this as it happens.

[Figure: a field of clusters imaged ("snap") as DNA polymerase adds labeled nucleotides; only the first sequencing cycle is shown]


Sequencing by synthesis (Solexa/Illumina)

Line up images and, for each cluster, turn the series of light signals into the corresponding series of nucleotides.

[Figure: images of the same clusters at cycles 1 to 5]
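Turning light signals into nucleotides can be pictured with a toy base caller. The sketch below assumes one intensity value per color channel per cycle for a single cluster and simply picks the brightest channel; real base callers also correct for channel cross-talk and phasing, which is not modelled here. The function name `call_bases` and the intensity values are hypothetical.

```python
# Toy base caller for one cluster: four colour channels per cycle, in A, C, G, T order.
CHANNELS = ("A", "C", "G", "T")


def call_bases(intensities):
    """Return the read for one cluster.

    intensities: list of cycles, each a tuple of four channel signals (A, C, G, T).
    The base called at each cycle is the channel with the strongest signal.
    """
    read = []
    for cycle in intensities:
        brightest = max(range(4), key=lambda i: cycle[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Hypothetical signals for five cycles of a single cluster.
cluster = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.0, 0.8, 0.2),
    (0.0, 0.1, 0.1, 0.7),
    (0.2, 0.9, 0.1, 0.0),
    (0.8, 0.1, 0.1, 0.1),
]
print(call_bases(cluster))  # AGTCA
```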


FASTQ format

End 1: Sample1_R1.fastq.gz, Sample2_R1.fastq.gz

End 2: Sample1_R2.fastq.gz, Sample2_R2.fastq.gz

Each sample will generate between 5Gb (100x WES) and 300Gb (100x WGS) of data.

@ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1
GGCTCATCTTGAACTGGGTGGCGACCGTCCCTGGCCCCTTCTTGACACCCAGCGCNNNNNNNNNNNNNNNNA
+
4=B@D99BDDDDDDD:DD?B<<=?>6B#############################################
@ERR127302.2 HWI-EAS350_0441:1:1:1056:1163#0/1
GAATGAGAGGCCCTCCCCGTGGAGGCATGGTATCCGGCCGAGGGGGCTTAGTCATNNNNNNNNNNNNNNNNC
+
B?,B2,?=?1?1B?D@?:@?DB3>AD,8DD??-B?#####################################
@ERR127302.3 HWI-EAS350_0441:1:1:1057:13164#0/1
GGCCGCAGTGCCATTGAGCTCACCAAAATGCTCTGTGAAATCCTGCAGGTTGGGGANNNNNNNNNNNNNNGA
+
DFBH?GDEG>GEGGDHH>HBDBEGD8G<GG<DGGGCB><82???@DDBBDDGGE##################

Each record has four lines: header, sequence, placeholder (+), and quality.
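The four-line record layout (header, sequence, placeholder, quality) can be read with a small generator. This is a minimal sketch that assumes each record occupies exactly four lines with no wrapped sequences; the names `FastqRecord` and `read_fastq` and the example record are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # text after the leading "@"
    sequence: str  # called bases
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file, assuming exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical one-record file held in memory.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(example))
print(records[0].sequence)  # GATTACA
```

For a compressed file such as Sample1_R1.fastq.gz, the same function can be given a handle opened with `gzip.open(path, "rt")`.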


Header fields: instrument:flowcell lane:tile number:x:y#index for multiplexed sample/member of pair (for example, HWI-EAS350_0441:1:1:1055:4898#0/1)
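The header layout above can be split into named fields with a regular expression. The sketch below assumes the older Illumina header style shown on this slide (`instrument:lane:tile:x:y#index/pair`); newer instruments write a different header, which this pattern does not cover. `HEADER_RE` and `parse_header` are names invented for the example.

```python
import re

# Pattern for the older Illumina read identifier shown on the slide.
HEADER_RE = re.compile(
    r"(?P<instrument>[^:\s]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_header(header: str) -> dict:
    """Split a FASTQ header into instrument, lane, tile, x, y, index and pair member."""
    match = HEADER_RE.search(header)
    if match is None:
        raise ValueError(f"unrecognised header layout: {header!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "pair"):
        fields[key] = int(fields[key])
    return fields


fields = parse_header("@ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1")
print(fields["instrument"], fields["lane"], fields["tile"], fields["pair"])
```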


Quality characters, from lowest to highest quality:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
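Each quality character stands for a Phred score, which in turn encodes an error probability. The sketch below assumes the common Phred+33 encoding (score = ASCII code minus 33); the slides do not state the offset, so treat it as an assumption. The function names are made up for this example.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores, assuming a Phred+33 encoding by default."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# First few quality characters of a read, plus the low-quality "#" used for its tail.
scores = phred_scores("4=B@#")
print(scores)                 # [19, 28, 33, 31, 2]
print(error_probability(20))  # a score of 20 means a 1 in 100 chance of error
```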


After generation of FASTQ data

• What to do with the obtained reads?

• Most of the time the reads will be aligned to a reference genome

– Leverage the high-quality assemblies that already exist for a species when sequencing each individual


SAM/BAM

SAM: Sequence Alignment/Map format

• Used to store alignments

• SAM = text, BAM = binary

Example record:

SRR013667.1 99 19 8882171 60 76M = 8882214 119
NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA
#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>

Fields labelled on the slide: read name (SRR013667.1), flag (99), reference (19), position (8882171), CIGAR (76M), mate position (8882214), bases, base qualities

Files: Sample1.bam, Sample2.bam (between 10Gb and 500Gb per BAM)
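A SAM alignment line is tab-separated, with eleven mandatory fields, and the flag is a bit field. The sketch below splits the example record from the slide into named fields and decodes flag 99. The short field names follow the SAM specification, the flag labels are descriptive names chosen for this example, and the bases and qualities are used exactly as they appear on the slide (shorter than the 76M CIGAR implies).

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")

# A subset of the SAM flag bits, with descriptive labels chosen for this sketch.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def parse_sam_line(line: str) -> dict:
    """Split one SAM alignment line into its eleven mandatory fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) < 11:
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    record = dict(zip(SAM_FIELDS, values[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    return record


def decode_flag(flag: int) -> list:
    """List the labels of the flag bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


example_line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>",
])
record = parse_sam_line(example_line)
print(record["rname"], record["pos"], record["cigar"])
print(decode_flag(record["flag"]))
```

Flag 99 is 64 + 32 + 2 + 1: a paired read, properly paired, whose mate is on the reverse strand, and which is the first read of its pair.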


Summary

• NGS technology allows generation of hundreds of millions of reads for different high throughput purposes, including transcript quantification and genome sequencing.

• FASTQ format – DNA sequence + quality of each nucleotide for each read from the sequencer.

• SAM format – FASTQ + alignment info (chromosome and start position for each read; the end follows from the CIGAR string)

• BAM format – SAM converted to binary form to conserve space


How to get and process sequencing data

• Overview of 3 useful websites:

- Gene Expression Omnibus: repository for descriptions of sequencing data generated from studies

- Sequence Read Archive: repository for downloading publicly accessible sequencing data

- Galaxy: online platform for free bioinformatics processing


Gene Expression Omnibus


Gene Expression Omnibus: Datasets


Sequence Read Archive


Galaxy


Workflows / Pipelines



Demo of “at home” RNA-Seq Analysis


First, we need our input data (GEO)


Description of the sample


SRA Interactive Download Page

All files related to that experiment are retrieved

Can filter the data before downloading, and can also download in FASTA format (i.e. FASTQ without the quality information)
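The relationship between the two formats can be shown by converting one to the other: FASTA keeps the header and sequence of each record and drops the placeholder and quality lines. This is a minimal sketch that assumes four-line FASTQ records; `fastq_to_fasta` and the example record are hypothetical.

```python
def fastq_to_fasta(fastq_text: str) -> str:
    """Convert four-line FASTQ records to FASTA by dropping the '+' and quality lines."""
    lines = fastq_text.strip().splitlines()
    if not lines:
        return ""
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, sequence = lines[i], lines[i + 1]
        out.append(">" + header[1:])  # FASTA headers start with ">" instead of "@"
        out.append(sequence)
    return "\n".join(out) + "\n"


fasta = fastq_to_fasta("@read1\nGATTACA\n+\nIIIIII#\n")
print(fasta, end="")
```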


Alternatively, SRA command line

- Available for all major operating systems

- Can download reads more precisely
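One way to picture "downloading more precisely" is limiting a download to the first N reads. The sketch below only assembles a command line for the SRA Toolkit's `fastq-dump` program, using its `--split-files`, `--gzip`, `--outdir` and `--maxSpotId` options, and does not run it. The helper name, the output directory and the read limit are assumptions for illustration; the accession is taken from the read name on the SAM/BAM slide. Check the options against the version of the toolkit installed locally.

```python
import shlex
from typing import List, Optional


def fastq_dump_command(accession: str,
                       max_spots: Optional[int] = None,
                       outdir: str = ".") -> List[str]:
    """Build (but do not run) a fastq-dump command for one SRA run accession."""
    cmd = ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir]
    if max_spots is not None:
        # Stop after this spot (read) number, to fetch only the start of the run.
        cmd += ["--maxSpotId", str(max_spots)]
    cmd.append(accession)
    return cmd


cmd = fastq_dump_command("SRR013667", max_spots=10000, outdir="fastq")
print(shlex.join(cmd))
# To actually download, pass cmd to subprocess.run(cmd, check=True)
# on a machine where the SRA Toolkit is installed.
```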


Disclaimer

• The dataset being analyzed is a FASTA file from an older sequencing technology

• Subsampled reads for carcinoma vs matched normal samples

• However, the steps to analyze publicly downloaded or your own data using Galaxy would be similar

• Key take away: familiarize yourself with the Galaxy work environment


General overview of RNA-Seq Pipeline

Raw FASTQ Data

- QC checking, adapter trimming, low-quality base trimming (FastQC, Cutadapt, Trim Galore)

QC Passed Reads

- Alignment to reference genome/transcriptome (STAR, HISAT2, TopHat)

Aligned BAM

- Quantification of transcripts (HTSeq, StringTie, Cufflinks, RSEM)

Quantified Transcripts

- Normalization and differential expression of genes between conditions (DESeq2, Ballgown, Cuffdiff, EBSeq)

Final DE Gene List
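The idea behind the last step, normalizing counts and then comparing conditions, can be shown with a deliberately simple calculation: counts per million followed by a log2 fold change. This is not what DESeq2 or the other listed tools do internally; they use more careful normalization and a statistical test across replicates. The gene names, counts and function names below are invented for illustration.

```python
import math


def counts_per_million(counts: dict) -> dict:
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(case: dict, control: dict, pseudocount: float = 1.0) -> dict:
    """log2(case / control) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((case[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in case
    }


# Hypothetical raw counts for one normal and one carcinoma sample.
normal = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 600}
carcinoma = {"GENE_A": 400, "GENE_B": 300, "GENE_C": 300}

lfc = log2_fold_change(counts_per_million(carcinoma), counts_per_million(normal))
for gene, value in lfc.items():
    print(f"{gene}\t{value:+.2f}")
```

In this toy data GENE_A comes out about four-fold higher in the carcinoma sample (log2 fold change near +2), GENE_B is unchanged, and GENE_C is about two-fold lower.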


Going to Galaxy


Uploading Data


* Dataset names changed, and a custom reference and GTF (annotation) file uploaded


Alignment to Reference



Transcript Quantification

We’ll want the “assembled transcripts” output


Merging Quantified Transcripts

This will produce a single combined counts matrix
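What the merge step produces can be sketched as follows: per-sample count tables are combined into one matrix with a row per gene and a column per sample. This is only an illustration of the data structure, not the Galaxy tool's implementation; the sample names, gene names, counts and the `merge_counts` helper are hypothetical.

```python
def merge_counts(per_sample: dict):
    """Combine per-sample {gene: count} tables into one matrix.

    Returns the list of sample names (column order) and a {gene: [counts]} mapping.
    Genes missing from a sample are given a count of 0.
    """
    samples = list(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    matrix = {gene: [per_sample[s].get(gene, 0) for s in samples] for gene in genes}
    return samples, matrix


# Hypothetical quantification output for two samples.
per_sample = {
    "normal": {"GENE_A": 10, "GENE_B": 5},
    "carcinoma": {"GENE_A": 40, "GENE_C": 7},
}
samples, matrix = merge_counts(per_sample)

# Print as a tab-separated table: one header row, then one row per gene.
print("gene\t" + "\t".join(samples))
for gene, row in matrix.items():
    print(gene + "\t" + "\t".join(str(n) for n in row))
```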


Differential Expression

Click on execute


Viewing Results


Exporting Results



How does the whole workflow look?


Extracting Workflow



Publicly Available Datasets

TCGA: The Cancer Genome Atlas


International Cancer Genome Consortium


R2: Genomics Analysis and Visualization Platform


Acknowledgements

• The lecture slides shown today were adapted from the instructors at the Canadian Bioinformatics Workshop (bioinformaticsdotca.github.io)

– Workshops cover high-throughput sequencing, RNA-Seq analysis, epigenomic data analysis, etc.

– Their course content is free to use!

• Look at biostars.org for questions and help along the way


Thank you

Any questions?

