Aligning Reads Ramesh Hariharan Strand Life Sciences IISc

Date post: 23-Feb-2016
Transcript
Page 1: Aligning Reads  Ramesh  Hariharan Strand Life Sciences IISc

Aligning Reads

Ramesh Hariharan

Strand Life Sciences, IISc

Page 2

What is Read Alignment?

Page 3

Subject’s Genome:
AGGCTACGCATTTCCCATAAAGACCCACGCTTAAGTTC

Reference Genome (close but not quite the same as the Subject’s Genome):
AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC

Where do these match in the Reference?

Page 4

What does “Match” mean?

Page 5

Reference Genome: AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC

Exact Match: GCTACGCA

With Mismatches: CATAAAGAC

With Gaps: CACTT_AGT

Page 6

Why mismatches and gaps?

Page 7

The subject genome could be different from the reference

Page 8

Mismatches and Gaps

[Figure: Reads aligned against the Reference Genome, with a SNP and a Deletion marked]

Page 9

The reading process could be erroneous

Page 10

How many mismatches and gaps?

Page 11

Short reads (~50 bases): few mismatches and gaps

Long reads (~1000 bases): many more mismatches and gaps

Page 12

How do aligners fare?

Page 13

BWA: Very few mismatches and gaps

CoBWeb

BWA-SW: Many mismatches and gaps

BowTie: only mismatches, no gaps

No paired read handling

No handling of adaptor trimming for small RNA

Separate handling for RNASeq

BowTie2

Page 14

How does an Aligner work?

Page 15

For simplicity, assume Exact Match

Page 16

For each read, scan the entire reference genome sequence

SLOW!!!!
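A minimal sketch of this brute-force approach, assuming exact matching only; `naive_align` is an illustrative name, not part of any aligner. Every read triggers a fresh pass over the whole reference, so the total work grows with (number of reads) × (reference length).

```python
def naive_align(reads, reference):
    """Map each read to all positions where it occurs exactly in the reference."""
    hits = {}
    for read in reads:
        positions = []
        start = reference.find(read)
        while start != -1:                  # keep scanning for further occurrences
            positions.append(start)
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits

print(naive_align(["CG", "AC", "TT"], "CGACG"))
```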

Page 17

Index the Reference

The Reference: C G A C G

[Figure: tree index of the reference]

Page 18

How can we find Exact Matches of a read quickly with this index?

Page 19

The Reference: C G A C G

[Figure: the tree index of the reference, queried with the read C G C]
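One way to picture the lookup is with a suffix trie, sketched below under the assumption that the tree on the slide behaves like one. The function names and the `"#"` key used to store positions are illustrative choices. The read is walked down from the root one base at a time, so the lookup cost depends on the read length rather than on the reference length.

```python
def build_suffix_trie(reference):
    """Nested-dict trie of all suffixes; each node records under "#" the
    start positions of the suffixes that pass through it."""
    root = {}
    for start in range(len(reference)):
        node = root
        for ch in reference[start:]:
            node = node.setdefault(ch, {"#": []})
            node["#"].append(start)
    return root

def find_exact(trie, read):
    """Walk the read down from the root; return its start positions, or []."""
    node = trie
    for ch in read:
        if ch not in node:
            return []
        node = node[ch]
    return node.get("#", [])

trie = build_suffix_trie("CGACG")
print(find_exact(trie, "CG"))   # occurs twice
print(find_exact(trie, "CGC"))  # does not occur
```

This uncompressed trie stores every suffix in full, so its size grows quadratically with the reference; it is meant only to show the lookup, not to be a practical index.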

Page 20

The problem: 24GB

Page 21

Can this structure be compressed?

Page 22

The Burrows-Wheeler based Index

The Reference: C G A C $

All its circular shifts, sorted lexicographically:

A C $ C G
C G A C $
C $ C G A
G A C $ C
$ C G A C

This column is the BWT (the last column, G $ A C C)

The Index: now an array instead of a tree

Sampled to reduce memory at the expense of speed (Ferragina and Manzini)
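The construction above, and the way exact matches are counted from the BWT alone, can be sketched as follows. Two assumptions are made: `$` is sorted after the letters, which is the order the shifts appear in on this slide, and occurrence counts are recomputed by scanning the BWT on every step. A real index would precompute and sample those counts, which is the memory/speed trade-off mentioned above. The names `ORDER`, `bwt` and `count_matches` are illustrative.

```python
ORDER = {ch: i for i, ch in enumerate("ACGT$")}  # assumption: '$' sorts last

def sorted_rotations(text):
    """All circular shifts of text, sorted under ORDER."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations, key=lambda r: [ORDER[c] for c in r])

def bwt(text):
    """Last column of the sorted circular shifts."""
    return "".join(r[-1] for r in sorted_rotations(text))

def count_matches(text, read):
    """Count exact occurrences of read using backward search on the BWT."""
    last = bwt(text)
    first = sorted(last, key=ORDER.get)
    first_row = {}                       # first row starting with each character
    for i, ch in enumerate(first):
        first_row.setdefault(ch, i)
    lo, hi = 0, len(last)                # current range of matching rows
    for ch in reversed(read):            # extend the match one base to the left
        if ch not in first_row:
            return 0
        lo = first_row[ch] + last[:lo].count(ch)
        hi = first_row[ch] + last[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo

print(bwt("CGAC$"))
print(count_matches("CGAC$", "C"), count_matches("CGAC$", "GC"))
```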

Page 23

How about Mismatches and Gaps?

Page 24

BWA, BWA-SW and BowTie force mismatches and gaps into the BW Index searching procedure

Page 25

CoBWeb uses the BW Index to find a ‘seed’ exact match and does Smith-Waterman around this seed

This 15-mer occurs at locations x1, x2…

This 15-mer occurs at locations x3, x4…

This whole 30-mer occurs at location x5
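The seeding idea can be sketched as below. This is not CoBWeb's code: a plain k-mer hash table stands in for the BW Index, k is set to 4 instead of 15 so the toy reference is long enough, and the function names are made up for illustration. Each seed hit votes for the reference position where the whole read would start; positions supported by several adjacent seeds are the strongest candidates for the later Smith-Waterman step.

```python
from collections import defaultdict

REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def seed_candidates(read, index, k):
    """Implied read start position -> number of exact seed hits supporting it."""
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):      # non-overlapping seeds
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return dict(votes)

index = build_kmer_index(REFERENCE, 4)
# A read with one mismatch: its middle seed finds nothing, the outer two agree.
print(seed_candidates("CATAAAGACCCA", index, 4))
```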

Page 26

Dynamic Programming

• Given a location in the reference with a read anchor, how well does the read match here?

[Figure: the Read placed against the Reference at the anchor (14-mer)]

• Smith-Waterman (optimized for large gaps)
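For reference, a textbook Smith-Waterman scoring pass looks roughly like the sketch below. It uses a simple per-base gap penalty, so it is not the large-gap-optimized variant named above, and the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for the example. In practice the `ref` argument would be a window of the reference around the anchor rather than the whole genome.

```python
REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"

def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, (read_end, ref_end))."""
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]    # H[i][j]: best score ending at i, j
    best, best_end = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + step, # match or mismatch
                          H[i - 1][j] + gap,      # extra base in the read
                          H[i][j - 1] + gap)      # extra base in the reference
            if H[i][j] > best:
                best, best_end = H[i][j], (i, j)
    return best, best_end

print(smith_waterman("GCTACGCA", REFERENCE))   # exact match
print(smith_waterman("CACTTAGT", REFERENCE))   # needs one gap
```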

Page 27

Comparison with BWA

Read Length 50

Read Length 150

20% faster than BWA with comparable results

CoBWeb: 3 mismatches and 2 gaps

BWA: 2 mismatches + 1 gap of possibly multiple length

Page 28

Comparison with BWA-SW

Read Length 400

8 mismatches plus 10 gaps

                     CoBWeb   BWA-SW
Reads                1m       1m
Time taken           1130s    2242s
Incorrectly Mapped   12598    9819

5650 mapped incorrectly by BWA-SW

The remainder has poor BWA mapping quality

Page 29

Avadis NGS

Page 30

Avadis NGS: Alignment, DNA Var Detection, RNASeq, ChIPSeq, Small RNASeq

Page 31

Thank You
