Analysis and improvement of map-reduce data distribution in read mapping applications

Joan Protasio

[email protected]

Contents

• Introduction and problem statement

• Objectives

• Possible solutions: Map Reduce paradigm

• Hadoop and MR MAQ

• Experimentation and Results

• Conclusions

DNA Sequencing

• The suite of biochemical methods and techniques used to determine the actual order of nucleotides in a genome.

• Genomes span gigabases of nucleotides (Gbp)

• 23 chromosome pairs in humans

• Redundant DNA (40%)

Adenine (A)

Guanine (G)

Cytosine (C)

Thymine (T)

DNA Sequencing

• Since the 1960s, the suite of sequencing tools has evolved quickly and improved a great deal: Maxam-Gilbert, Frederick Sanger's dideoxy method, etc.

• 10 years since the first draft of the human genome, with the success of the Human Genome Project (HGP) by John Craig Venter.

• Next Generation Sequencing involves two steps:

- Sequencing step: sequencers generate raw data (short reads) of a genome.

- Read mapping step: computational treatment of the reads.

Mapping step

Records are mapped so that they cover the reference genome. Algorithms, or aligners, search for the best alignment for every record (this is called the score).

Large input datasets

High data parallelism

n bases

>> 22 GB RAM

>> Days of Execution

MAQ algorithm

• One of the most established read mapping algorithms (but single-threaded).

• Designed for sequencers that generate short reads.

• Mapping success rate of 94.2%.

• Time complexity: O(N log N + L + 2^(-k) N L)

• Space complexity: at maximum precision (3 gaps) => 6 hash tables × L

Drawbacks of MAQ

Whole human genome reference: 3.3 × 10^9 bp

Polonator sequencer reads: 300,000 (250 bp)

Conventional PC, Pentium 4: 1 GB RAM, 3.2 GHz (9,000 MIPS)

1.64 × 10^6    3 × 10^9    0.1

NOT VIABLE! HPC is needed

Hash table: (|{A,C,G,T}|^k × k) · 6 = (4^14 × 14) · 6
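
The hash table estimate can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each stored base occupies one byte (an assumption, not stated on the slide); the function and constant names are illustrative only.

```python
# Memory estimate for the hash tables, following the slide's formula
# (|alphabet|^k * k) * 6.
# Assumption (not from the slides): one byte per stored base.
ALPHABET = "ACGT"
K = 14            # seed length used on the slide
NUM_TABLES = 6    # hash tables needed at maximum precision

def hash_table_bytes(k: int = K, tables: int = NUM_TABLES) -> int:
    entries = len(ALPHABET) ** k   # number of possible seeds of length k
    return entries * k * tables    # k bases per entry, repeated per table

total = hash_table_bytes()
print(f"{total:,} bytes = {total / 1e9:.2f} GB")
```

Under that one-byte assumption the result is about 22.5 GB, which is consistent with the ">> 22 GB RAM" figure quoted for the mapping step and far beyond the 1 GB of a conventional PC.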

Objectives

• Implement a read mapping algorithm based on MAQ and oriented to High Performance Computing, using the Map Reduce paradigm and the Hadoop framework.

• Use the Seed and Extend alignment method.

• Analyze performance and scalability.

Possible solutions: Map Reduce paradigm

• Parallel programming paradigm aimed at projects that require scalability.

• Data-intensive applications (terabytes, petabytes).

• Compute-intensive applications (batch jobs, high CPU consumption).

• High-level, easy abstraction for programmers.

• Functional programming: functions Map and Reduce.

Map Reduce paradigm : Architecture

• The input and the data generated during the job are stored in the Distributed File System (DFS).

• Each node executes two phases: Map (M) and Reduce (R) (Shuffle, Sort and the function R).

• The data generated by Map are (key, value) tuples.

• The Shuffle phase groups the tuples by key.

Function Map

Map(record) → list(key_n, value_n)

List of (key, value): (Hola, 1), (Hola, 1), (Hola, 1), ..., (és, 1), ..., (prova, 1)

Function Shuffle & Sort

list(key, value) → list(key, list(value))

Group per key "Hola": (1, 1, 1)

Group per key "és": (1)

Function Reduce

(key, list(value)) → (key, value)

"Hola": 1+1+1 = 3

"és": 1 = 1

Output:

això, 1

de, 1

és, 1

Hola, 3

prova, 1

text, 1

un, 1
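
The three functions of this word count example can be sketched in a few lines. This is a single-process illustration, not Hadoop code; the input records are an assumption, chosen so that they produce the word counts shown on the slide.

```python
from collections import defaultdict

def map_fn(record):
    """Map(record) -> list(key, value): one (word, 1) tuple per word."""
    return [(word, 1) for word in record.split()]

def shuffle_sort(pairs):
    """list(key, value) -> list(key, list(value)): group the values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):
    """(key, list(value)) -> (key, value): add up the occurrences."""
    return key, sum(values)

# Assumed input text (not given on the slides).
records = ["Hola Hola Hola", "això és un text de prova"]

pairs = [pair for record in records for pair in map_fn(record)]
counts = dict(reduce_fn(key, values) for key, values in shuffle_sort(pairs))
print(counts["Hola"], counts["és"])
```

In a real job the framework runs the Map and Reduce calls on different nodes and performs the grouping itself; only `map_fn` and `reduce_fn` would be written by the programmer.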

MR MAQ

• Highly inspired by MAQ

• Implements the Seed and Extend method

• Uses the Hadoop framework

Seed and Extend method

Seed step

• Principle: it is easier to match a small part of a read than the whole read.

• Seed: n bases

Extend step

• Once the algorithm finds an exact match for the seed, it checks the rest of the read.
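
The two steps can be illustrated with a small sequential sketch. It assumes the seed is the first n bases of the read and that the extend step simply counts mismatches; MAQ's actual scoring is more elaborate, and all names and the toy sequences are illustrative.

```python
from collections import defaultdict

def build_seed_index(reference, n):
    """Index every n-base substring of the reference by its position."""
    index = defaultdict(list)
    for pos in range(len(reference) - n + 1):
        index[reference[pos:pos + n]].append(pos)
    return index

def seed_and_extend(read, reference, index, n, max_mismatches=2):
    """Return (position, mismatches) of the best hit, or None."""
    best = None
    # Seed step: exact lookup of the first n bases of the read.
    for pos in index.get(read[:n], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run past the end of the reference
        # Extend step: compare the rest of the read base by base.
        mismatches = sum(a != b for a, b in zip(read[n:], window[n:]))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best

reference = "ACGTACGTTAGCCGATAG"   # toy reference, not real data
index = build_seed_index(reference, n=4)
hit = seed_and_extend("TAGCCGTT", reference, index, n=4)
print(hit)
```

Only the positions that share the seed are ever compared in full, which is what makes the seed lookup cheaper than aligning the whole read against every position.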

Function Map (MR MAQ)

Implements the Seed step.

[Figure: flowchart of the Map function. Input: tuples (key, value). An IsRef? test separates long fragments of the genome (idPreProc) from whole short reads (idFASTA). In both branches the templates are applied (template function, gap, MAQ), deriving new reads from the original, e.g. [11110000], [00001111]. The seed (Seed & Extend method) takes n bases. Output: tuples (key, value) of the form (k_new, (Rest_of_read, ...)).]
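
The template step of the Map function can be sketched as follows. This is a hypothetical reading of the diagram, not the author's code: it assumes a template is a mask whose 1 positions select the bases that form the key, that genome fragments are seeded at every offset while reads are seeded only at offset 0, and that the key also carries the template index. The layout of the emitted value is likewise an assumption.

```python
# Template masks as shown on the slide: a '1' keeps the base, a '0' skips it.
TEMPLATES = ["11110000", "00001111"]

def apply_template(bases, template):
    """Build a seed from the bases at the positions marked '1'."""
    return "".join(b for b, t in zip(bases, template) if t == "1")

def map_fn(record_id, sequence, is_ref):
    """Emit (key, value) tuples, with key = (template index, seed).

    Hypothetical value layout: (is_ref, record_id, offset, sequence).
    """
    width = len(TEMPLATES[0])
    # Assumption: genome fragments are seeded at every offset, reads once.
    last_offset = len(sequence) - width if is_ref else 0
    tuples = []
    for offset in range(last_offset + 1):
        window = sequence[offset:offset + width]
        if len(window) < width:
            break  # sequence shorter than the template
        for t_idx, template in enumerate(TEMPLATES):
            key = (t_idx, apply_template(window, template))
            tuples.append((key, (is_ref, record_id, offset, sequence)))
    return tuples

for key, value in map_fn("read_1", "ACGTTGCA", is_ref=False):
    print(key, value)
```

With two complementary masks, a read that differs from the genome in a single base of the window still shares at least one key with it, so both tuples meet in the same group after the shuffle.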

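Function Shuffle & Sort (MR MAQ)

Groups the tuples of reads and genome per key (seed).

[Figure: Tuples_Genome (key, value) and Tuples_short·Read (key, value) that share a key are collected into Group_i.]

Function Reduce (MR MAQ)

Implements the Extend step.

[Figure: flowchart of the Reduce function. Input: (key, list(values)), where the key is the seed (n bases) and the values carry the remaining (total-n) bases of the whole reads. An IsRef? test identifies the whole short reads (idFASTA), to which the templates are applied.]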
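
A Reduce function that performs the Extend step on one group might look like the sketch below. It assumes each value is a tuple (is_ref, record_id, rest), where rest holds the (total-n) bases that follow the seed, and it accepts an alignment when the number of mismatches stays under a threshold. Both the value layout and the acceptance rule are illustrative assumptions.

```python
def reduce_fn(seed, values, max_mismatches=2):
    """Extend step for one group: all values share the same seed.

    values: list of (is_ref, record_id, rest); hypothetical layout in which
    rest holds the bases that follow the seed.
    Returns a list of (read_id, ref_id, mismatches).
    """
    genome = [(rid, rest) for is_ref, rid, rest in values if is_ref]
    reads = [(rid, rest) for is_ref, rid, rest in values if not is_ref]
    alignments = []
    for read_id, read_rest in reads:
        for ref_id, ref_rest in genome:
            if len(ref_rest) < len(read_rest):
                continue  # genome fragment too short to cover the read
            mismatches = sum(a != b for a, b in zip(read_rest, ref_rest))
            if mismatches <= max_mismatches:
                alignments.append((read_id, ref_id, mismatches))
    return alignments

group = [
    (True, "chr22:8", "CGAT"),    # genome tuple
    (True, "chr22:40", "AAAA"),   # genome tuple
    (False, "read_1", "CGTT"),    # short-read tuple
]
print(reduce_fn("TAGC", group))
```

Because the shuffle has already brought together every read and genome fragment with the same seed, each Reduce call only compares candidates that are known to match on those n bases.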

Experimentation and Results

Test Case, Environment definition

- Pentium D processor, 3.40 GHz.

- 1.00 GB of RAM.

- 2048 KB of cache memory in 2 levels (L1 and L2) per core.

- IDE hard disks.

- O.S. Debian Ubuntu 9.04 with kernel 2.6.28-11.37.

- Hadoop release 0.19.2.

- JDK 1.7

Input Description

• Whole human chromosome 22 (40 × 10^6 bp).

• 100,000 high-quality short reads (36 bp).

Experimentation 1: Real case (1 Node)

• Objective: tuning to optimize the total execution time.

• User parameters: 1 Map, 1 Reduce.

• Administrator parameters: HDFS block size = 32 MB, map buffer size = 100 MB, % occupation of the map buffer, etc.

Phase        Exec. time (s)
Map          201
Reduce       277

Detailed breakdown:

Phase        Exec. time (s)
Map          201
Shuffle      195
Sort         7
Reduce cpu   75

Experimentation 1: Real case (1 Node)

• Map: most of the time is spent on I/O.

• Reduce: CPU-intensive computation.

• Disk access (HDFS) is slower than buffer access (memory).

Execution time after tuning:

Phase        Exec. time (s)
Map          152
Shuffle      147
Sort         3
Reduce cpu   31

With tuning, the execution time drops by:

- Map: 24.37%

- Shuffle: 24.61%

- Sort: 57.14%

- Reduce: 58.66%
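
These percentages can be reproduced from the two timing tables as the relative reduction in execution time, (before - after) / before. A short check, with the per-phase times copied from the slides and illustrative names:

```python
# Execution times in seconds, before and after tuning (from the slides).
before = {"Map": 201, "Shuffle": 195, "Sort": 7, "Reduce cpu": 75}
after = {"Map": 152, "Shuffle": 147, "Sort": 3, "Reduce cpu": 31}

def reduction_pct(t_before, t_after):
    """Relative reduction in execution time, as a percentage."""
    return 100.0 * (t_before - t_after) / t_before

for phase in before:
    print(f"{phase}: {reduction_pct(before[phase], after[phase]):.2f}%")
```

The slide's figures agree with these values when they are truncated, rather than rounded, to two decimals.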

User params

• 4 Maps per core

• 1 Reduce per core

Administrator params

• HDFS block size ≈ map buffer size

• High % buffer occupation: disk access

Extensive testing

Experimentation: Execution Time

Nodes   Map   Reduce cpu   Total   Theoretical
1       152   31           183     183
2       62    15           88      91.5
4       30    9            54      45.75

Experimentation: Execution Time and Speed-Up

Experimentation: Efficiency
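
Speed-up and efficiency can be derived from the Total column of the execution-time table. The sketch below assumes the standard definitions, speed-up S(n) = T(1) / T(n) and efficiency E(n) = S(n) / n, and takes the theoretical time as T(1) / n, which matches the values in the table.

```python
# Total execution time per number of nodes (Total column of the table).
total_time = {1: 183, 2: 88, 4: 54}

def theoretical(n):
    """Ideal time under perfect linear scaling: T(1) / n."""
    return total_time[1] / n

def speedup(n):
    """S(n) = T(1) / T(n)."""
    return total_time[1] / total_time[n]

def efficiency(n):
    """E(n) = S(n) / n; 1.0 means perfect linear scaling."""
    return speedup(n) / n

for n in sorted(total_time):
    print(f"{n} nodes: theoretical={theoretical(n):.2f} "
          f"speedup={speedup(n):.2f} efficiency={efficiency(n):.2f}")
```

With 2 nodes the measured total is slightly below the theoretical time, so the efficiency exceeds 1; with 4 nodes it falls to roughly 0.85.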

Conclusions

• MR MAQ scales linearly and completes successfully in all executions.

• The efficiency of Map is excellent, but that of Reduce could improve (Amdahl's law).

• The abstraction of the framework is a benefit for the programming step, but a drawback for the experimentation and tuning steps.

Future work

• Execute several genomes concurrently.

• Customize the queueing module (FIFO by default).

• Graphical interface for the application.

Thank you.
