Analysis and improvement of map-reduce
data distribution in read mapping applications
Joan Protasio
Contents
• Introduction and problem statement
• Objectives
• Possible solutions: Map Reduce paradigm
• Hadoop and MR MAQ
• Experimentation and Results
• Conclusions
DNA Sequencing
• The set of biochemical methods and techniques used to determine the actual order
of nucleotides in a genome.
• Genomes span several gigabases of nucleotides (Gbp)
• 23 chromosome pairs in humans
• Redundant DNA (40%)
Adenine (A)
Guanine (G)
Cytosine (C)
Thymine (T)
DNA Sequencing
• Since the 1960s, sequencing tools have evolved rapidly, with many
improvements: Maxam-Gilbert, the dideoxy method of Frederick Sanger, etc.
• 10 years since the first draft of the human genome, with the success of the Human Genome
Project (HGP) and the parallel effort led by John Craig Venter.
• Next Generation Sequencing involves two steps:
- Sequencing step: sequencers generate the raw data (short reads) of a genome.
- Read mapping step: computational processing of the reads.
Mapping step
Records are mapped so that they cover the reference genome. Algorithms, or aligners, search for the best alignment of every record (rated by what is called the score).
• Large input datasets
• High data parallelism
• >> 22 GB of RAM
• >> days of execution
MAQ algorithm
• One of the most established read mapping algorithms (but single-threaded).
• Designed for sequencers that generate short reads.
• Mapping success rate of 94.2%.
• Time complexity: $O(N \log N + L + 2^{-k} N L)$
• Space complexity: at maximum precision (3 gaps) => 6 hash tables × L
Drawbacks of MAQ
Whole human genome reference: $3.3 \times 10^{9}$ bp
Reads from the Polonator sequencer: 300,000 (250 bp)
Conventional PC, Pentium 4: 1 GB RAM, 3.2 GHz (9,000 MIPS)
Time complexity terms ($N \log N$, $L$, $2^{-k} N L$): $1.64 \times 10^{6}$, $3 \times 10^{9}$, $0.1$
NOT VIABLE: HPC is needed
Hash table: $(\{A,C,G,T\}^{k} \times k) \cdot 6 = (4^{14} \times 14) \cdot 6$
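The arithmetic on this slide can be checked with a short sketch. It evaluates the hash table expression for k = 14 and the first complexity term for 300,000 reads. The function names are illustrative, and both the one-byte-per-unit size and the base-10 logarithm are assumptions made here, chosen because they appear consistent with the ">> 22 GB RAM" and 1.64 × 10^6 figures.

```python
import math

ALPHABET = "ACGT"      # the four nucleotides
K = 14                 # seed length used on the slide
TABLES = 6             # hash tables at maximum precision (3 gaps)
BYTES_PER_UNIT = 1     # assumption: one byte per table unit


def hash_table_units(k: int = K, tables: int = TABLES) -> int:
    """Evaluate (|alphabet|^k * k) * tables, the expression on the slide."""
    return (len(ALPHABET) ** k) * k * tables


def first_complexity_term(n_reads: int) -> float:
    """N log N with a base-10 logarithm (assumed base)."""
    return n_reads * math.log10(n_reads)


if __name__ == "__main__":
    units = hash_table_units()
    print(f"hash table units: {units}")
    print(f"approx. size: {units * BYTES_PER_UNIT / 1e9:.1f} GB")
    print(f"N log N for 300,000 reads: {first_complexity_term(300_000):.3g}")
```

Under these assumptions the tables alone exceed the 1 GB of RAM of the conventional PC by more than an order of magnitude.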
Objectives
• Implement a read mapping algorithm based on MAQ and oriented to
high-performance computing, using the Map Reduce paradigm and the
Hadoop framework.
• Use the Seed and Extend alignment method.
• Analyze performance and scalability.
Possible solutions: Map Reduce paradigm
• Parallel programming paradigm aimed at projects that require scalability.
• Data-intensive applications (terabytes, petabytes).
• Compute-intensive applications (batch jobs, heavy CPU consumption).
• High-level, easy-to-use abstraction for programmers.
• Functional programming: Map and Reduce functions.
Map Reduce paradigm: Architecture
• The input and the data generated during the job are stored in the Distributed File
System (DFS).
• Each node executes two phases: Map (M) and Reduce (R), where Reduce comprises
shuffle, sort and the R function.
• The data generated by Map are (key, value) tuples.
• The shuffle phase groups the tuples by key.
Function Map
Map(reg) → list(key_n, value_n)
List of (key, value):
(Hola, 1)
(Hola, 1)
(Hola, 1)
...
(és, 1)
...
(prova, 1)
Function Shuffle & Sort
list(key, value) → list(key, list(value))
Group per key "Hola": (Hola, [1, 1, 1])
Group per key "és": (és, [1])
Function Reduce
(key, list(value)) → (key, value)
Hola: 1 + 1 + 1 = 3
és: 1 = 1
Result:
això, 1
de, 1
és, 1
Hola, 3
prova, 1
text, 1
un, 1
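The word count example can be sketched in plain Python to make the three signatures concrete. This is a single-process illustration, not Hadoop code. The function names and the two input records are assumptions chosen to reproduce the counts on the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_fn(record: str) -> List[Tuple[str, int]]:
    """Map(reg) -> list(key, value): emit (word, 1) for every word."""
    return [(word, 1) for word in record.split()]


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """list(key, value) -> list(key, list(value)): group values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """(key, list(value)) -> (key, value): add up the counts of one word."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle & sort and reduce in sequence on one machine."""
    pairs = [pair for record in records for pair in map_fn(record)]
    grouped = shuffle_and_sort(pairs)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input records (Catalan sample words from the slide).
    print(word_count(["Hola Hola Hola", "això és un text de prova"]))
```

In a real cluster each map_fn call runs on the node that holds the input block, and the framework performs the grouping between nodes.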
MR MAQ
• Heavily inspired by MAQ
• Implements the Seed and Extend method
• Uses the Hadoop framework
Seed and Extend method
Seed step
• Principle: it is easier to match a small part of a read than the whole read.
Extend step
• Once the algorithm finds an exact match for the seed, it checks the rest of the read.
(Figure: a seed of n bases within a read)
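The two steps can be illustrated with a minimal single-machine sketch. It assumes the seed is the first n bases of the read and that the extend step simply counts mismatches against the reference. The function names and the mismatch threshold are hypothetical and are not taken from MAQ or MR MAQ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_seed_index(reference: str, n: int) -> Dict[str, List[int]]:
    """Index every n-base substring of the reference by its positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - n + 1):
        index[reference[pos:pos + n]].append(pos)
    return index


def seed_and_extend(read: str, reference: str, index: Dict[str, List[int]],
                    n: int, max_mismatches: int = 2) -> List[Tuple[int, int]]:
    """Return (position, mismatches) for every accepted alignment of read."""
    seed, rest = read[:n], read[n:]
    hits = []
    # Seed step: exact lookup of the first n bases.
    for pos in index.get(seed, []):
        candidate = reference[pos + n:pos + len(read)]
        if len(candidate) < len(rest):
            continue  # the read would run past the end of the reference
        # Extend step: compare the rest of the read base by base.
        mismatches = sum(a != b for a, b in zip(rest, candidate))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    ref = "ACGTACGTTTGACCA"
    idx = build_seed_index(ref, 4)
    print(seed_and_extend("ACGTTTGC", ref, idx, 4))
```

The seed "ACGT" occurs twice in this toy reference, but only one occurrence survives the extend step, which is the filtering effect the method relies on.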
Function Map (MR MAQ)
• Implements the Seed step.
• Input: tuples (key, value). Output: tuples (k_new, (Rest_of_read, ...)), where k_new is the seed of n bases.
• IsRef? No: the record is a whole short read (idFASTA); the template is applied to it.
• IsRef? Yes: the record is a long fragment of the genome (idPreProc); the template is applied to it.
• Template application (gap function of MAQ): new reads are derived from the original one with the templates [11110000] and [00001111].
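A possible reading of the Map step is sketched here. It assumes that each template is a 0/1 mask over the read, that the bases under the 1s form the seed key and the remaining bases form the rest, and that a genome fragment is cut into read-length windows. The value layout, the window strategy and all function names are hypothetical, since the slide does not specify them.

```python
from typing import Iterator, Tuple

TEMPLATES = ["11110000", "00001111"]  # the two masks shown on the slide

# Hypothetical value layout: (record_id, is_ref, offset, template_index, rest)
Value = Tuple[str, bool, int, int, str]


def apply_template(seq: str, template: str) -> Tuple[str, str]:
    """Split seq into (seed, rest) following a 0/1 mask of equal length."""
    seed = "".join(base for base, bit in zip(seq, template) if bit == "1")
    rest = "".join(base for base, bit in zip(seq, template) if bit == "0")
    return seed, rest


def map_record(record_id: str, seq: str, is_ref: bool,
               read_len: int = 8) -> Iterator[Tuple[str, Value]]:
    """Emit (seed, value) tuples for one short read or one genome fragment."""
    if is_ref:
        # Assumption: slide a read-length window over the genome fragment.
        windows = [(off, seq[off:off + read_len])
                   for off in range(len(seq) - read_len + 1)]
    else:
        windows = [(0, seq)]
    for offset, window in windows:
        for t_idx, template in enumerate(TEMPLATES):
            seed, rest = apply_template(window, template)
            yield seed, (record_id, is_ref, offset, t_idx, rest)


if __name__ == "__main__":
    for key, value in map_record("read1", "ACGTTTGA", is_ref=False):
        print(key, value)
```

Because reads and genome windows with the same seed produce the same key, the shuffle phase brings them to the same reducer.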
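Function Shuffle & Sort (MR MAQ)
• Groups the tuples of reads and genome per key (seed): Tuples_Genome (key, value) and Tuples_short·Read (key, value) with the same key end up in the same Group_i.
Function Reduce (MR MAQ)
• Implements the Extend step.
• Input: (key, list(values)), where the key is the seed (n bases) and the values carry the rest of the whole reads ((total-n) bases).
• IsRef? separates genome tuples from whole short reads (idFASTA) before the template is applied.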
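The grouping and the Extend step can be sketched as follows. The sketch assumes a simplified value layout (record id, IsRef flag, offset, rest of the sequence) and a mismatch threshold. Neither is specified on the slides, so both should be read as hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical value layout: (record_id, is_ref, offset, rest)
Value = Tuple[str, bool, int, str]
Hit = Tuple[str, str, int, int]  # (read_id, ref_id, ref_offset, mismatches)


def shuffle(pairs: Iterable[Tuple[str, Value]]) -> Dict[str, List[Value]]:
    """Group genome and read tuples that share the same seed."""
    groups: Dict[str, List[Value]] = defaultdict(list)
    for seed, value in pairs:
        groups[seed].append(value)
    return groups


def reduce_seed(seed: str, values: List[Value],
                max_mismatches: int = 2) -> List[Hit]:
    """Extend step: compare the rest of each read with each genome tuple."""
    refs = [v for v in values if v[1]]
    reads = [v for v in values if not v[1]]
    hits: List[Hit] = []
    for read_id, _, _, read_rest in reads:
        for ref_id, _, offset, ref_rest in refs:
            if len(ref_rest) != len(read_rest):
                continue
            mismatches = sum(a != b for a, b in zip(read_rest, ref_rest))
            if mismatches <= max_mismatches:
                hits.append((read_id, ref_id, offset, mismatches))
    return hits


if __name__ == "__main__":
    pairs = [
        ("ACGT", ("chr", True, 4, "TTGA")),
        ("ACGT", ("chr", True, 0, "ACGT")),
        ("ACGT", ("read1", False, 0, "TTGC")),
    ]
    for seed, values in shuffle(pairs).items():
        print(seed, reduce_seed(seed, values))
```

Each reducer only sees the tuples of its own seeds, which is why the work in this phase is CPU-bound rather than I/O-bound once the shuffle has finished.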
Experimentation and Results
Test case: environment definition
- Processor: Pentium D, 3.40 GHz.
- 1.00 GB of RAM.
- 2048 KB of cache memory in 2 levels (L1 and L2) per
core.
- IDE hard disks.
- OS: Debian-based Ubuntu 9.04 with kernel 2.6.28-11.37.
- Hadoop release 0.19.2.
- JDK 1.7
Input description
• Whole human chromosome 22 ($40 \times 10^{6}$ bp).
• 100,000 high-quality short reads (36 bp).
Experimentation 1: Real case (1 Node)
• Objective: tuning to optimize the total execution time.
• User parameters: 1 Map, 1 Reduce.
• Administrator parameters: HDFS block size = 32 MB, map buffer size = 100 MB, map buffer
occupancy %, etc.

Phase        Exec. time (s)
Map          201
Reduce       277

Reduce broken down into its sub-phases (195 + 7 + 75 = 277):

Phase        Exec. time (s)
Map          201
Shuffle      195
Sort         7
Reduce cpu   75
Experimentation 1: Real case (1 Node)
• Map: most of the time is spent on I/O.
• Reduce: CPU-intensive computation.
• Disk access (HDFS) is slower than buffer access (memory).
Execution times after tuning:

Phase        Exec. time (s)
Map          152
Shuffle      147
Sort         3
Reduce cpu   31
With tuning, execution time drops by:
- Map: 24.37%
- Shuffle: 24.61%
- Sort: 57.14%
- Reduce: 58.66%
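The percentages can be reproduced from the two timing tables as relative reductions in execution time. This is a minimal check with illustrative names. It suggests that the slide values are truncated rather than rounded.

```python
# Execution times in seconds, copied from the tables before and after tuning.
BEFORE = {"Map": 201, "Shuffle": 195, "Sort": 7, "Reduce cpu": 75}
AFTER = {"Map": 152, "Shuffle": 147, "Sort": 3, "Reduce cpu": 31}


def time_reduction_pct(before: float, after: float) -> float:
    """Relative reduction in execution time, as a percentage."""
    return 100.0 * (before - after) / before


def reductions() -> dict:
    """Reduction per phase, using the phases present in both tables."""
    return {phase: time_reduction_pct(BEFORE[phase], AFTER[phase])
            for phase in BEFORE}


if __name__ == "__main__":
    for phase, pct in reductions().items():
        print(f"{phase}: {pct:.2f}%")
```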
User parameters
• 4 Maps per core
• 1 Reduce per core
Administrator parameters
• HDFS block size ≈ map buffer size
• High buffer occupancy % (affects disk access)
Found through extensive testing
Experimentation: Execution Time (s)

Nodes   Map   Reduce cpu   Total   Theoretical
1       152   31           183     183
2       62    15           88      91.5
4       30    9            54      45.75
Experimentation: Execution Time and Speed-Up
Experimentation: Efficiency
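Speed-up and efficiency can be derived from the total times in the execution time table. The sketch assumes the usual definitions (speed-up = T1 / Tn, efficiency = speed-up / n) and an ideal time of T1 / n, which matches the theoretical column. The function names are illustrative.

```python
# Total execution time in seconds per number of nodes, from the table.
TOTAL_TIME = {1: 183.0, 2: 88.0, 4: 54.0}


def theoretical_time(nodes: int) -> float:
    """Ideal time under perfect linear scaling: T1 / n."""
    return TOTAL_TIME[1] / nodes


def speed_up(nodes: int) -> float:
    """Measured speed-up relative to the single-node run: T1 / Tn."""
    return TOTAL_TIME[1] / TOTAL_TIME[nodes]


def efficiency(nodes: int) -> float:
    """Speed-up divided by the number of nodes (1.0 means ideal)."""
    return speed_up(nodes) / nodes


if __name__ == "__main__":
    for n in sorted(TOTAL_TIME):
        print(f"nodes={n} theoretical={theoretical_time(n):.2f}s "
              f"speed-up={speed_up(n):.2f} efficiency={efficiency(n):.2f}")
```

With these figures the 2-node run is slightly faster than the ideal time, while the 4-node run falls behind it.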
Conclusions
• MR MAQ scales linearly and completes successfully in all executions.
• Map efficiency is excellent, but Reduce efficiency could improve (Amdahl's law).
• The abstraction provided by the framework helps in the programming step, but it
gets in the way during the experimentation and tuning steps.
Future Work
• Concurrent execution with several genomes.
• Customize the job queueing module (FIFO by default).
• Graphical interface for the application.
Thank you.