Date post: 23-Jul-2020
Analysis and improvement of map - reduce data distribution in read mapping applications Joan Protasio [email protected]
Page 1: Analysis and improvement of map-reduce data …...Contents •Introduction and problem statement •Objectives •Possible solutions: Map Reduce paradigm •Hadoop and MR MAQ •Experimentation

Analysis and improvement of map-reduce

data distribution in read mapping applications

Joan Protasio

[email protected]


Contents

• Introduction and problem statement

• Objectives

• Possible solutions: Map Reduce paradigm

• Hadoop and MR MAQ

• Experimentation and Results

• Conclusions


DNA Sequencing

• Refers to the suite of biochemical methods and techniques used to determine the actual order of nucleotides in a genome.

• Billions of nucleotides (gigabase pairs, Gbp)

• 23 chromosome pairs in humans

• Redundant DNA (40%)

Adenine (A)

Guanine (G)

Cytosine (C)

Thymine (T)


DNA Sequencing

• Since the 1960s, the suite of sequencing tools has evolved quickly and improved a great deal: the Maxam-Gilbert method, Frederick Sanger's dideoxy method, etc.

• Ten years have passed since the first draft of the human genome, with the success of the Human Genome Project (HGP) by John Craig Venter.

• Next Generation Sequencing involves two steps:

- Sequencing step: sequencers generate raw data (short reads) from a genome.

- Read mapping step: computational processing of those reads.


Mapping step

Records are mapped so that they cover the reference genome. Algorithms, or aligners, search for the best alignment for every record (that is called the score).

Large input datasets

High data parallelism

n bases


>> 22 GB RAM

>> Days of Execution

MAQ algorithm

• One of the most established read mapping algorithms (but single-threaded).

• Designed for sequencers that generate short reads.

• Mapping success rate of 94.2%.

• Time complexity: O(N log N + L + 2^(-k) N L)

• Space complexity: at maximum precision (3 gaps), 6 hash tables × L

Drawbacks of MAQ

Whole human reference genome: 3.3 × 10^9 bp

Polonator sequencer reads: 300,000 (250 bp each)

Conventional PC, Pentium 4: 1 GB RAM, 3.2 GHz (9,000 MIPS)

1.64 × 10^6    3 × 10^9    0.1

Not viable: HPC is needed

Hash table: (|{A,C,G,T}|^k × k) × 6 = (4^14 × 14) × 6
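
The hash table estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that each unit of the 4^k × k product costs one byte, which the slide does not state, and the names `K` and `TABLES` are just labels for the slide's 14 and 6.

```python
# Rough memory estimate for the MAQ hash tables, following the slide's formula.
# Assumption: one byte per unit of (4**K * K); the slide gives no unit.
K = 14        # seed length in bases, as on the slide
TABLES = 6    # number of hash tables at maximum precision

units_per_table = 4 ** K * K            # |{A,C,G,T}|^k * k
total_units = units_per_table * TABLES
total_gb = total_units / 1e9            # decimal gigabytes under the byte assumption

print(f"{total_units:,} units, about {total_gb:.1f} GB")
```

Under that assumption the total is a little over 22 × 10^9, which is in line with the "22 GB RAM" figure on this slide and far beyond the 1 GB of the conventional PC.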


Contents

• Introduction and problem statement

• Objectives

• Possible solutions: Map Reduce paradigm

• Hadoop and MR MAQ

• Experimentation and Results

• Conclusions


Objectives

• Implement a read mapping algorithm based on MAQ and oriented to High Performance Computing, using the Map Reduce paradigm and the Hadoop framework.

• Use the Seed and Extend alignment method.

• Analyze performance and scalability.


Contents

• Introduction and problem statement

• Objectives

• Possible solutions: Map Reduce paradigm

• Hadoop and MR MAQ

• Experimentation and Results

• Conclusions


Possible solutions: Map Reduce Paradigm

• Parallel programming paradigm oriented to projects that require scalability.

• Data-intensive applications (terabytes, petabytes).

• Compute-intensive applications (batch jobs, heavy CPU consumption).

• A powerful yet simple abstraction for programmers.

• Functional programming: the functions Map and Reduce.


Map Reduce paradigm : Architecture

• The input and the data generated during the job are stored in the Distributed File System (DFS).

• Each node executes two phases: Map (M) and Reduce (R), where Reduce comprises shuffle, sort and the function R.

• The data generated by Map are (key, value) tuples.

• The shuffle phase groups the tuples by key.


Function Map

Map(record) → list(key_n, value_n)

(Hola, 1), (Hola, 1), (Hola, 1), ..., (és, 1), ..., (prova, 1)

Function Shuffle & Sort

list(key, value) → list(key, list(value))

Group per key: "Hola" → (1, 1, 1); "és" → (1)

Function Reduce

(key, list(value)) → (key, value)

Hola: 1+1+1 = 3; és: 1 = 1

Output (key, value):

això, 1

de, 1

és, 1

Hola, 3

prova, 1

text, 1

un, 1
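
The word count flow on this slide can be sketched in plain Python, without Hadoop. The function names (`map_fn`, `shuffle`, `reduce_fn`) and the two input records are hypothetical; the records were chosen only so that the counts agree with the output listed above.

```python
from collections import defaultdict

def map_fn(record):
    """Map: emit a (word, 1) tuple for every word in one input record."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle & sort: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: collapse the list of values for one key into a single value."""
    return key, sum(values)

# Hypothetical input records, chosen to reproduce the counts on the slide.
records = ["Hola Hola Hola", "això és un text de prova"]

pairs = [pair for record in records for pair in map_fn(record)]
counts = dict(reduce_fn(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a real job the framework runs many `map_fn` and `reduce_fn` calls on different nodes; only the grouping by key has to move data between them.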


Contents

• Introduction and problem statement

• Objectives

• Possible solutions: Map Reduce paradigm

• Hadoop and MR MAQ

• Experimentation and Results

• Conclusions


MR MAQ

Heavily inspired by MAQ

Implements the Seed and Extend method

Uses the Hadoop framework

Seed and Extend method

Seed step

• Principle: it is easier to match a small part of a read than the whole read.

Extend step

• Once the algorithm finds an exact match for the seed, it checks the rest of the read.

Seed: n bases
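
The two steps can be illustrated with a small single-process sketch. It is not the MAQ or MR MAQ code: the function names, the toy reference, the choice of the first n bases as the seed and the `max_mismatches` threshold are all assumptions made for the example.

```python
def build_seed_index(reference, n):
    """Index every n-base substring of the reference by its start positions."""
    index = {}
    for pos in range(len(reference) - n + 1):
        index.setdefault(reference[pos:pos + n], []).append(pos)
    return index

def seed_and_extend(read, reference, index, n, max_mismatches=1):
    """Return (position, mismatches) for each accepted alignment of the read."""
    hits = []
    for pos in index.get(read[:n], []):          # seed step: exact match only
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue                             # read would run past the end
        # extend step: compare the rest of the read base by base
        mismatches = sum(a != b for a, b in zip(read[n:], window[n:]))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

reference = "ACGTACGTTAGCCGATAGGCT"   # toy reference, not real genome data
index = build_seed_index(reference, n=4)
hits = seed_and_extend("TAGCCGTT", reference, index, n=4)
print(hits)
```

The exact lookup in the seed step is a cheap dictionary access, so the more expensive base-by-base comparison runs only on the few positions where the seed already matches.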


MR MAQ: Function Map

Implements the Seed step.

Input: tuples (key, value). Each record is tested with IsRef?:

• Whole short read (idFASTA): template applied.

• Long fragment of genome (idPreProc): template applied.

Seed (Seed & Extend method): apply the MAQ gap template function, e.g. [11110000], [00001111], to derive new reads from the original. The seed is n bases long.

Output: tuples (k_new, (Rest_of_read, ...)).
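
A possible reading of this Map function is sketched below. It assumes that a 1 in a template marks a base that goes into the seed (the new key) and a 0 marks a base left in the value; the slide does not spell this out. The tuple layout `(is_ref, record_id, template_index, rest)` and all function names are hypothetical, and each record is treated as a single window for brevity.

```python
TEMPLATES = ["11110000", "00001111"]   # the two masks shown on the slide

def apply_template(sequence, template):
    """Split the bases under the template into seed (1s) and remainder (0s)."""
    head = sequence[:len(template)]
    seed = "".join(b for b, bit in zip(head, template) if bit == "1")
    rest = "".join(b for b, bit in zip(head, template) if bit == "0")
    # Bases beyond the template always stay in the remainder.
    return seed, rest + sequence[len(template):]

def map_fn(record_id, sequence, is_ref):
    """Emit one (seed, value) tuple per template for a read or genome fragment."""
    out = []
    for t_idx, template in enumerate(TEMPLATES):
        seed, rest = apply_template(sequence, template)
        out.append((seed, (is_ref, record_id, t_idx, rest)))
    return out

tuples = map_fn("read_1", "ACGTTGCAAT", is_ref=False)
for key, value in tuples:
    print(key, value)
```

Because reads and genome fragments go through the same templates, equal seeds produce equal keys, which is what lets the shuffle phase bring candidates together.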


MR MAQ: Function Shuffle & Sort

Groups per key (seed) the tuples of reads and genome: Tuples_Genome (key, value) and Tuples_short_read (key, value) form Group_i.

MR MAQ: Function Reduce

Implements the Extend step.

Input: (key, list(values)). The key is the seed (n bases); the values carry the whole reads, with (total-n) bases left to compare.

Each value is tested with IsRef?; for a whole short read (idFASTA) the template is applied.
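
The Reduce side can be sketched in the same spirit. Everything here is an assumption for illustration: the value layout `(is_ref, record_id, rest)`, the identifiers in the sample group and the `max_mismatches` threshold do not come from the slides.

```python
def reduce_fn(seed, values, max_mismatches=1):
    """Extend step: compare the rest of each read with each genome candidate."""
    refs = [v for v in values if v[0]]        # tuples coming from the genome
    reads = [v for v in values if not v[0]]   # tuples coming from short reads
    matches = []
    for _, read_id, read_rest in reads:
        for _, ref_id, ref_rest in refs:
            if len(ref_rest) < len(read_rest):
                continue                      # candidate too short to compare
            mismatches = sum(a != b for a, b in zip(read_rest, ref_rest))
            if mismatches <= max_mismatches:
                matches.append((read_id, ref_id, mismatches))
    return matches

# One group as delivered by shuffle & sort: every value shares the seed "ACGT".
group = [
    (True, "chr22:100", "TGCAAT"),
    (True, "chr22:950", "GGGGGG"),
    (False, "read_1", "TGCTAT"),
]
matches = reduce_fn("ACGT", group)
print(matches)
```

Each reducer only sees the reads and genome positions that already agree on the seed, so the pairwise comparison stays local to one group.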


Contents

• Introduction and problem statement

• Objectives

• Possible solutions: Map Reduce paradigm

• Hadoop and MR MAQ

• Experimentation and Results

• Conclusions


Experimentation and Results

Test Case, Environment definition

- Processor: Pentium D, 3.40 GHz.

- 1.00 GB of RAM.

- 2048 KB of cache memory in 2 levels (L1 and L2) per core.

- IDE hard disks.

- OS: Ubuntu 9.04 (Debian-based) with kernel 2.6.28-11.37.

- Hadoop release 0.19.2.

- JDK 1.7

Input Description

• Whole human chromosome 22 (40 × 10^6 bp).

• 100,000 high-quality short reads (36 bp each).


Experimentation 1: Real case (1 Node)

• Objective: tuning to optimize total execution time.

• User parameters: 1 Map, 1 Reduce.

• Administrator parameters: HDFS block size = 32 MB, map buffer size = 100 MB, % occupation of the map buffer, etc.

Phase    Execution time (s)
Map      201
Reduce   277

Phase        Execution time (s)
Map          201
Shuffle      195
Sort         7
Reduce cpu   75


Experimentation 1: Real case (1 Node)

• Map: most of the time is spent doing I/O.

• Reduce: CPU-intensive computation.

• Disk access (HDFS) is slower than buffer access (memory).

Phase        Execution time (s)
Map          152
Shuffle      147
Sort         3
Reduce cpu   31

With tuning, execution time drops by:

- Map: 24.37%

- Shuffle: 24.61%

- Sort: 57.14%

- Reduce: 58.66%
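
The percentages above can be recomputed from the timings before and after tuning. The sketch assumes that each figure is the relative reduction in execution time, (before - after) / before, which is what the numbers correspond to; the dictionary names are illustrative.

```python
# Execution times in seconds, taken from the two timing tables (1 node).
before = {"Map": 201, "Shuffle": 195, "Sort": 7, "Reduce cpu": 75}
after = {"Map": 152, "Shuffle": 147, "Sort": 3, "Reduce cpu": 31}

# Relative reduction in execution time per phase, as a percentage.
reduction = {
    phase: 100 * (before[phase] - after[phase]) / before[phase]
    for phase in before
}

for phase, pct in reduction.items():
    print(f"{phase}: {pct:.2f}% less time")
```

The recomputed values agree with the slide up to the last digit; the slide's figures appear to be truncated rather than rounded.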

User params

• 4 Maps per core

• 1 Reduce per core

Administrator params

• HDFS block size ≈ map buffer size

• High % buffer occupancy: disk access

Extensive testing


Experimentation: Time Execution

Nodes   Map (s)   Reduce cpu (s)   Total (s)   Theoretical (s)
1       152       31               183         183
2       62        15               88          91.5
4       30        9                54          45.75
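
Speed-up and efficiency follow directly from the Total column. The sketch below uses the usual definitions, speed-up = T(1) / T(n) and efficiency = speed-up / n, and assumes the Theoretical column is simply T(1) / n, which the table values agree with.

```python
# Total execution time in seconds per number of nodes, from the table above.
total_time = {1: 183, 2: 88, 4: 54}

t1 = total_time[1]
results = {}
for nodes, t in total_time.items():
    ideal = t1 / nodes             # ideal time under perfect linear scaling
    speedup = t1 / t               # how many times faster than one node
    efficiency = speedup / nodes   # fraction of the ideal speed-up achieved
    results[nodes] = (ideal, speedup, efficiency)
    print(f"{nodes} nodes: ideal {ideal:.2f} s, "
          f"speed-up {speedup:.2f}, efficiency {efficiency:.2f}")
```

With these numbers the 2-node run is slightly faster than the ideal time, while the 4-node run falls short of it.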


Experimentation: Time Execution and Speed-Up


Experimentation: Efficiency


Contents

• Introduction and problem statement

• Objectives

• Possible solutions: Map Reduce paradigm

• Hadoop and MR MAQ

• Experimentation and Results

• Conclusions


Conclusions

• MR MAQ scales linearly and completed successfully in all executions.

• The efficiency of Map is excellent, but that of Reduce could improve (Amdahl's law).

• The abstraction provided by the framework is a benefit during programming, but a drawback during experimentation and tuning.

Future work

• Execution with several genomes concurrently.

• Customize the queueing module (FIFO by default).

• Graphical interface for the application.


Thank you.


Analysis and improvement of map-reduce

data distribution in read mapping applications

Joan Protasio

[email protected]

