Aligning Reads Ramesh Hariharan Strand Life Sciences IISc

Post on 23-Feb-2016


Aligning Reads

Ramesh Hariharan

Strand Life Sciences, IISc

What is Read Alignment?

AGGCTACGCATTTCCCATAAAGACCCACGCTTAAGTTC

Subject’s Genome

AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC

Reference Genome

Where do these match in the Reference?

Close but not quite the same as the Subject’s Genome

What does “Match” mean?

AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC

Reference Genome

GCTACGCA

Exact Match

CATAAAGAC

With Mismatches

CACTT_AGT

With Gaps
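
The exact and mismatch cases above can be checked by hand against the slide's reference. The following is a minimal sketch only: the function name `find_with_mismatches` and the brute-force sliding window are illustrative assumptions, not the method of any aligner named in this talk. It reports every 0-based offset where a read fits within a mismatch budget. Gaps are not handled here; they need dynamic programming.

```python
REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"  # reference genome from the slide


def find_with_mismatches(reference, read, max_mismatches):
    """Return (offset, mismatches) for each placement within the mismatch budget."""
    hits = []
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        # Count positions where the read and the reference window disagree.
        mismatches = sum(1 for r, w in zip(read, window) if r != w)
        if mismatches <= max_mismatches:
            hits.append((offset, mismatches))
    return hits


print(find_with_mismatches(REFERENCE, "GCTACGCA", 0))   # exact match only
print(find_with_mismatches(REFERENCE, "CATAAAGAC", 1))  # allow one mismatch
```

With these inputs the read GCTACGCA is placed at offset 2 with no mismatches, and CATAAAGAC fits at offset 15 with one mismatch (the read has A where the reference has T).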

Why mismatches and gaps?

The subject genome could be different from the reference

Reads

Reference Genome

SNP

Deletion

Mismatches and Gaps

The reading process could be erroneous

How many mismatches and gaps?

Short reads, ~50 bases: few mismatches and gaps

Long reads, ~1000 bases: many more mismatches and gaps

How do aligners fare?

BWA: Very few mismatches and gaps

CoBWeb, BWA-SW: Many mismatches and gaps

BowTie: only mismatches, no gaps

No paired read handling

No handling of adaptor trimming for small RNA

Separate handling for RNASeq

BowTie2

How does an Aligner work?

For simplicity, assume Exact Match

For each read, scan the entire reference genome sequence

SLOW!!!!
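
To see why the full scan is slow, here is a rough sketch (the function `naive_scan` and its comparison counter are hypothetical, added only for illustration). Every read is tried at every offset of the reference, so the work grows with the number of reads times the reference length.

```python
def naive_scan(reference, reads):
    """Exact-match each read by scanning the whole reference.

    Returns the hit positions per read and the number of character
    comparisons performed, as a crude measure of the work done.
    """
    comparisons = 0
    hits = {}
    for read in reads:
        positions = []
        for offset in range(len(reference) - len(read) + 1):
            matched = 0
            while matched < len(read):
                comparisons += 1
                if reference[offset + matched] != read[matched]:
                    break  # first disagreement rules this offset out
                matched += 1
            if matched == len(read):
                positions.append(offset)
        hits[read] = positions
    return hits, comparisons


hits, comparisons = naive_scan("CGAC", ["AC", "C"])
print(hits, comparisons)
```

Even in the best case each read costs at least one comparison per reference position, which is what indexing the reference is meant to avoid.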

[Figure: tree index over the reference]

Index the Reference

How can we find Exact Matches of a read quickly with this index?

[Figure: tree index over the reference]

C G C
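
One way to picture exact-match lookup in a tree index is the sketch below. It uses an uncompressed suffix trie built from nested dictionaries, which is an assumption made for brevity and may differ from the tree drawn on the slide; `build_suffix_trie`, `find_exact` and the `"starts"` key are hypothetical names. A lookup walks one edge per base of the read, so its cost depends on the read length and not on the reference length.

```python
def build_suffix_trie(reference):
    """Insert every suffix of the reference into a nested-dict trie.

    Each node keeps, under the key "starts", the reference offsets of all
    suffixes that pass through it.
    """
    root = {"starts": []}
    for start in range(len(reference)):
        node = root
        for base in reference[start:]:
            node = node.setdefault(base, {"starts": []})
            node["starts"].append(start)
    return root


def find_exact(trie, read):
    """Follow the read from the root; return all offsets where it occurs."""
    node = trie
    for base in read:
        if base not in node:
            return []  # the read falls off the tree: no exact match
        node = node[base]
    return node["starts"]


trie = build_suffix_trie("CGAC")
print(find_exact(trie, "AC"))
print(find_exact(trie, "CGC"))
```

This toy trie takes space quadratic in the reference length, so it only illustrates the lookup; the memory cost of tree indexes is the problem the next slide raises.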

The problem: 24GB

Can this structure be compressed?

The Reference: C G A C $

All its circular shifts, sorted lexicographically:

A C $ C G
C G A C $
C $ C G A
G A C $ C
$ C G A C

This column is the BWT (the last column: G $ A C C)
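
The construction can be reproduced in a few lines. This is a sketch under one stated assumption: the terminator `$` is ordered after the bases, which is what the row order on the slide implies (many implementations order it first instead). The names `ORDER`, `sorted_rotations` and `bwt` are illustrative.

```python
# Assumption: '$' sorts after A, C, G, T, as the slide's row order suggests.
ORDER = {"A": 0, "C": 1, "G": 2, "T": 3, "$": 4}


def sorted_rotations(reference):
    """All circular shifts of reference + '$', sorted under ORDER."""
    text = reference + "$"
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations, key=lambda row: [ORDER[base] for base in row])


def bwt(reference):
    """The BWT is the last column of the sorted rotations."""
    return "".join(row[-1] for row in sorted_rotations(reference))


for row in sorted_rotations("CGAC"):
    print(row)
print(bwt("CGAC"))  # G$ACC
```

Sorting whole rotations is only practical for toy inputs; it is used here because it mirrors the slide directly.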

The Index: now an array instead of a tree

The Burrows-Wheeler based Index

Sampled to reduce memory at the expense of speed (Ferragina and Manzini)
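
A toy version of searching such an array index is sketched below. It is not the code of BWA, BowTie or CoBWeb; the class `FMIndex`, its `step` parameter and the choice to order `$` last are assumptions made for illustration. Occurrence counts are stored only at every `step`-th row and the remainder is recounted on demand, which is one form of trading speed for memory. A real index would also sample the suffix array, which is kept whole here for simplicity. The alphabet is assumed to be A, C, G, T.

```python
ORDER = {"A": 0, "C": 1, "G": 2, "T": 3, "$": 4}  # assumption: '$' sorts last


class FMIndex:
    def __init__(self, reference, step=2):
        text = reference + "$"
        n = len(text)
        # Suffix array via sorted rotations (acceptable for a toy input).
        self.sa = sorted(range(n), key=lambda i: [ORDER[b] for b in text[i:] + text[:i]])
        self.bwt = "".join(text[i - 1] for i in self.sa)
        # first[b] = number of characters in the text that sort before b.
        self.first, total = {}, 0
        for base in sorted(ORDER, key=ORDER.get):
            self.first[base] = total
            total += text.count(base)
        # Checkpoint k holds the per-base counts in bwt[:k * step].
        self.step, self.checkpoints = step, []
        counts = dict.fromkeys(ORDER, 0)
        for row, base in enumerate(self.bwt):
            if row % step == 0:
                self.checkpoints.append(dict(counts))
            counts[base] += 1

    def occ(self, base, row):
        """Occurrences of base in bwt[:row]: nearest checkpoint plus a short recount."""
        k = min(row // self.step, len(self.checkpoints) - 1)
        return self.checkpoints[k][base] + self.bwt[k * self.step:row].count(base)

    def find(self, read):
        """Backward search: narrow a row interval one base at a time, right to left."""
        lo, hi = 0, len(self.bwt)
        for base in reversed(read):
            if base not in self.first:
                return []
            lo = self.first[base] + self.occ(base, lo)
            hi = self.first[base] + self.occ(base, hi)
            if lo >= hi:
                return []
        return sorted(self.sa[lo:hi])


index = FMIndex("CGAC")
print(index.bwt)         # G$ACC
print(index.find("AC"))  # [2]
```

A larger `step` stores fewer checkpoints but makes each `occ` call recount a longer stretch of the BWT.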

How about Mismatches and Gaps?

BWA, BWA-SW and BowTie force mismatches and gaps into the BW Index searching procedure

CoBWeb uses the BW Index to find a ‘seed’ exact match and does Smith-Waterman around this seed

This 15-mer occurs at locations x1, x2…

This 15-mer occurs at locations x3, x4…

This whole 30-mer occurs at location x5
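
The idea of combining seed locations can be sketched as a vote. This is an illustration under assumptions, not CoBWeb's actual procedure: a plain k-mer dictionary stands in for the BW Index, k is 4 instead of 15 so the slide's short reference can be used, and `build_kmer_index` and `seed_candidates` are hypothetical names. Each seed hit proposes a start position for the whole read, and positions proposed by several seeds are the candidates worth extending.

```python
from collections import Counter, defaultdict

REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"  # reference genome from the slide


def build_kmer_index(reference, k):
    """Map each k-mer to the list of offsets where it occurs (stand-in for the BW Index)."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_candidates(reference, read, k):
    """Split the read into non-overlapping k-mer seeds and vote on read start positions."""
    index = build_kmer_index(reference, k)
    votes = Counter()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # where the read would begin if this seed is correct
            if start >= 0:
                votes[start] += 1
    return votes


# A read with one mismatch relative to the reference at offset 15.
print(seed_candidates(REFERENCE, "CATAAAGACCCAC", 4).most_common())
```

Here two of the three seeds agree on start position 15, while the seed that overlaps the mismatch finds no hit at all; a dynamic programming step would then score the read at the winning position.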

Dynamic Programming

• Given a location in the reference with a read anchor, how well does the read match here?

Reference

Read

Anchor 14-mer

• Smith-Waterman (optimized for large gaps)
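
For reference, a textbook Smith-Waterman scorer looks roughly like the sketch below. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, the gap penalty is linear, and the optimization for large gaps mentioned on the slide is not reproduced.

```python
def smith_waterman(read, window, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a read and a reference window.

    Plain Smith-Waterman with a linear gap penalty, keeping only two rows
    of the dynamic programming table.
    """
    previous = [0] * (len(window) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        current = [0]
        for j in range(1, len(window) + 1):
            diagonal = previous[j - 1] + (match if read[i - 1] == window[j - 1] else mismatch)
            score = max(
                0,                      # start a fresh local alignment
                diagonal,               # align read[i-1] with window[j-1]
                previous[j] + gap,      # read base opposite a gap in the window
                current[j - 1] + gap,   # window base opposite a gap in the read
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# The gapped read from the earlier slide (CACTT_AGT) against its reference window.
print(smith_waterman("CACTTAGT", "CACTTAAGT"))  # 14: eight matches and one gap
```

In a seed-and-extend aligner this scoring would be run only on the window around each anchor, not on the whole reference.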

Comparison with BWA

Read Length 50

Read Length 150

20% faster than BWA with comparable results

CoBWeb: 3 mismatches and 2 gaps

BWA: 2 mismatches + 1 gap of possibly multiple length

Comparison with BWA-SW

Read Length 400

8 mismatches plus 10 gaps

                      CoBWeb    BWA-SW
Reads                 1m        1m
Time taken            1130s     2242s
Incorrectly Mapped    12598     9819

5650 mapped incorrectly by BWA-SW

The remainder has poor BWA mapping quality

Avadis NGS

Avadis NGS: Alignment, DNA Variant Detection, RNASeq, ChIPSeq, Small RNASeq

Thank You