PARALLEL PROCESSING OF LARGE SCALE GENOMIC DATA
Candidacy Examination, 08/26/2014
Mucahid Kutlu
Motivation
• Sequencing costs are decreasing, creating a big data problem
*Adapted from genome.gov/sequencingcosts
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html
Parallel processing is inevitable!
Typical Analysis on Genomic Data
• Single Nucleotide Polymorphism (SNP) calling
• Example (*adapted from Wikipedia):

Reference   A G C G T A C C
Position    1 2 3 4 5 6 7 8

Alignment File-1:
Read-1  A G C G
Read-2  G C G G
Read-3  G C G T A
Read-4  C G T T C C

Alignment File-2:
Read-1  A G A G
Read-2  A G A G T
Read-3  G A G T
Read-4  G T T C C

A single SNP may cause a Mendelian disease!
Existing Solutions for Implementation
• Serial tools
  – SamTools, VCFTools, BedTools: file merging, sorting, etc.
  – VarScan: SNP calling
• Parallel implementations
  – TurboBLAST: searching local alignments
  – SEAL: read mapping and duplicate removal
  – Biodoop: statistical analysis
• Middleware systems
  – Hadoop
    • Not designed for the specific needs of genomic data
    • Limited programmability
  – Genome Analysis Tool Kit (GATK)
    • Designed for genomic data processing
    • Provides special data traversal patterns
    • Limited parallelization for some of its tools
Main Goal of My Thesis
• We want to develop middleware systems that
  – are specific to parallel genomic data processing,
  – allow parallelization of a variety of genomic algorithms,
  – work with different popular genomic data formats, and
  – ease programming, since most developers are biologists, not computer scientists.
Papers During My PhD Study
• Mucahid Kutlu, Gagan Agrawal. "Cluster-based SNP Calling on Large-Scale Genome Sequencing Data." 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2014). (Accepted; 19.1% acceptance rate)
• Mucahid Kutlu, Gagan Agrawal. "PAGE: A Framework for Easy PArallelization of GEnomic Applications." 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014). (Accepted; 21.1% acceptance rate)
• Mucahid Kutlu, Gagan Agrawal, Oguz Kurt. "Fault Tolerant Parallel Data-Intensive Algorithms." High Performance Computing (HiPC 2012). (25.1% acceptance rate)
• Mucahid Kutlu, Gagan Agrawal, Oguz Kurt. "Fault Tolerant Parallel Data-Intensive Algorithms." High Performance and Distributed Computing (HPDC 2012). (Poster paper)
• RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications. (To be submitted)
Outline
• Motivation & Background
• Current Work
  – PAGE: A Framework for Easy PArallelization of GEnomic Applications
  – RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
• Future Work
Our Work
• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
  – Allows us to exploit existing applications
  – No restriction on programming language
Intra-dependent Processing
• Each file is processed independently
[Figure: each input file (File-1 … File-m) is split into regions (Region-1 … Region-n); a map task processes each region, and a reduce task combines the partial outputs (O-11 … O-mn) into per-file outputs (Output-1 … Output-m).]

Inter-dependent Processing
• Each map task processes a particular region of ALL files
[Figure: map tasks for Region-1 … Region-n each read their region from all input files; a reduce task merges the partial outputs (O1 … On) into a single output.]
Data Partitioning
• Data is NOT packaged into equal-size data blocks as in Hadoop
  – Each application has a different way of reading the data
  – Equal-size block packaging ignores nucleotide base location information
• The genome is divided into regions, and each map task is assigned a region
  – Takes location information into account
  – The map task is responsible for accessing its region in each input file
  – This is a common feature of many genomic tools (GATK, SamTools)
Genome Partition
• PAGE provides two data partitioning methods
  – By-locus partitioning: chromosomes are divided into regions
  – By-chromosome partitioning: chromosomes preserve their unity
[Figure: the six chromosomes (Chr-1 … Chr-6) partitioned under each scheme.]
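The arithmetic behind by-locus partitioning can be sketched as follows. This is a minimal illustration assuming regions of (nearly) equal length; the function names are illustrative, not PAGE's API.

```c
/* Split a chromosome of `length` bases into `num_regions` regions of
 * (nearly) equal size. Returns the 0-based start of region `r`; the
 * first length % num_regions regions each get one extra base. */
long region_start(long length, int num_regions, int r) {
    long base = length / num_regions;
    long extra = length % num_regions;
    return (long)r * base + (r < extra ? r : extra);
}

/* Length of region `r` under the same partitioning. */
long region_length(long length, int num_regions, int r) {
    return region_start(length, num_regions, r + 1)
         - region_start(length, num_regions, r);
}
```

By-chromosome partitioning is the degenerate case with one region per chromosome.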
Challenges
• Load imbalance due to the nature of genomic data
  – It is not just an array of A, G, C and T characters
  – Coverage varies across the genome
• High overhead of tasks
• I/O contention
[Figure: coverage variance across regions.]
Task Scheduling
• PAGE provides two scheduling schemes
• Static
  – Each processor is responsible for regions of equal total length
  – All map tasks must finish before reduce tasks start
• Dynamic
  – Map & reduce tasks are assigned by a master process
  – Reduce tasks can start as soon as enough intermediate results are available
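The two schemes can be contrasted in a small sketch, assuming n map regions over p processors. The function names are illustrative, not PAGE's API.

```c
/* Static: processor `rank` owns a fixed, contiguous range of regions;
 * no master process is involved. */
void static_range(int n, int p, int rank, int *first, int *count) {
    int base = n / p, extra = n % p;
    *count = base + (rank < extra);                 /* spread remainder */
    *first = rank * base + (rank < extra ? rank : extra);
}

/* Dynamic: a master hands out the next unprocessed region on demand;
 * returns -1 once every region has been assigned. */
int dynamic_next(int *next, int n) {
    return (*next < n) ? (*next)++ : -1;
}
```

The dynamic scheme pays the cost of a master process but adapts when some regions take much longer than others.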
Sample Application Development with PAGE
• Serial execution command of the VarScan software
  – samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
  – Genome partition: by-locus
  – Scheduling scheme: dynamic (or static)
  – Execution model: inter-dependent
  – Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp > outputloc
  – Reduction: the cat bash shell command
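One way the framework could turn the user's template into a concrete per-task command is plain string substitution. This sketch is an assumption about the mechanics, not PAGE's actual implementation; the region and output strings are examples.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: build the per-region map command by substituting
 * the assigned region (e.g. "chr1:1-1000000") for regionloc and a
 * per-task output path for outputloc. */
int build_map_command(char *buf, size_t size,
                      const char *region, const char *out) {
    return snprintf(buf, size,
        "samtools mpileup -b file_list -r %s -f reference"
        " | java -jar VarScan.jar mpileup2snp > %s",
        region, out);
}
```

Each map task then runs its own command over its own region, and the cat reduction simply concatenates the per-region outputs.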
Experiments
• Experimental setup
  – Our cluster: each node has 12 GB memory and 8 cores (2.53 GHz)
  – We obtained the data from the 1000 Human Genome Project
  – We evaluated PAGE with 4 applications
    • VarScan: SNP detection
    • Realigner Target Creator: detects insertions/deletions in alignment files
    • Indel Realigner: applies local realignment to improve the quality of alignment files
    • Unified Genotyper: SNP detection
Comparison with GATK
[Charts: scalability and data size impact. Unified Genotyper tool of GATK: speedups of 10.9x and 12.8x (data size: 34 GB, 128 cores). VarScan application: speedups of 6.9x and 12.7x.]
Comparison with Hadoop Streaming
[Chart: performance comparison; data size: 52 GB, 128 cores.]
Outline
• Motivation & Background
• Current Work
  – PAGE: A Framework for Easy PArallelization of GEnomic Applications
  – RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
• Future Work
RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
• In this study, we improve our PAGE middleware in several respects
• Main goal: less I/O contention
• Main approach:
  – Utilize distributed disks
  – An intelligent replication technique
  – A scheduling scheme that minimizes network traffic
Execution Model
[Figure: RE-PAGE execution model.]
Allowing Remote Processing or Not?
• Advantages
  – Better workload balance
  – Data transfer will be more effective as computation becomes more data intensive
• Disadvantages
  – As the number of nodes increases, network traffic will increase
  – Data transfer can be problematic for large-scale data
Proposed Scheduling Schemes
• General idea: replicate data and prohibit remote processing
  – Replication increases the number of local tasks for each node and helps decrease workload imbalance
• Data chunks can have varying sizes and varying replication factors
• Master & worker approach
• We propose 3 scheduling schemes
  – Factoring
  – Help the Busiest Node (HBN)
  – Effective Memory Management (EMM)
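The chunk-size rule of classic factoring self-scheduling can be sketched as follows. This assumes the textbook rule (each round hands every one of the p workers about ceil(R / 2p) tasks, where R is the number still unassigned); the RE-PAGE variant, and the HBN and EMM schemes, differ in details not shown here.

```c
/* Classic factoring rule: chunk size for the current round, given the
 * number of remaining tasks and p workers. Later rounds use smaller
 * chunks, which smooths load imbalance near the end of the run. */
int factoring_chunk(int remaining, int p) {
    if (remaining <= 0) return 0;
    int chunk = (remaining + 2 * p - 1) / (2 * p);  /* ceil(R / 2p) */
    return chunk > 0 ? chunk : 1;                   /* at least one task */
}
```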
Proposed Replication Method
• Replicating all chunks onto all nodes is not feasible
• Depending on the target analysis, some genomic regions are more important than others
• General idea: replicate important regions more than others
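One way to realize "replicate important regions more" is to make the replication factor proportional to an importance weight. This is a hypothetical sketch of the idea, not RE-PAGE's actual policy; the weights and copy budget are illustrative.

```c
/* Give a region a replication factor proportional to its share of the
 * total importance weight, clamped between 1 copy and one copy per node. */
int replication_factor(double weight, double total_weight,
                       int total_copies, int num_nodes) {
    int r = (int)(total_copies * weight / total_weight + 0.5);
    if (r < 1) r = 1;                  /* every chunk stored at least once */
    if (r > num_nodes) r = num_nodes;  /* at most one copy per node */
    return r;
}
```

More copies of the hot regions mean more nodes can run those tasks locally, which is exactly what the local-only scheduling schemes need.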
Replication & Distribution
[Figure: how chunks are replicated and distributed across nodes.]
Scheduling Scheme Evaluation
• Experiments on real data: 32 nodes (256 cores), 20 BAM files (21 GB)
• All 3 scheduling schemes are better than random scheduling
• Factoring is the best across all experiments
Work Stealing vs. Our Approach
• Synthetic application: fixed data chunk size, varying execution time
• Performance comparison shown as Work Stealing / Our Approach
• As processing becomes more data intensive, our approach gives better results!
Data Size Impact
• Unified Genotyper, 32 nodes (256 cores)
• As data size increases, WS-3 becomes better than WS-1
• As data size increases, RE-PAGE becomes better than WS-3
[Chart: relative differences of +3%, +7%, +4% and -1%.]

Scalability Evaluation
• Coverage Analyzer and Unified Genotyper
[Charts: speedups of 4.2x, 7.1x, 2.2x and 9.9x.]
Outline
• Motivation & Background
• Current Work
  – PAGE: A Framework for Easy PArallelization of GEnomic Applications
  – RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
• Future Work
Future Work
• An API to develop parallel genomic applications for memory-constrained architectures
• Processing compressed genomic data
API for Memory-Constrained Architectures
• We have employed CPUs so far
• Co-processors can also be useful for genomic applications
• The trend in computing technologies: more cores, smaller memory
  – Intel Many Integrated Core (MIC) architecture
Proposed Work
• An API that helps the user implement parallel genomic applications on memory-constrained architectures
• In this work, executables are not used; the developer writes map and reduce functions in the C programming language
• The middleware helps the developer in 3 ways:
  – Reading data from BAM and Fasta files
  – Memory utilization
  – Parallel execution and task scheduling
Execution Flow
[Figure: input data is compressed, map tasks process the compressed data into intermediate results, and reduce tasks combine them into the final result.]
Data Reading
• The middleware reads the data from files and generates genome matrices, the compressed inputs of the map tasks
• A genome matrix can be of two types:
  – Sequence-based: each row keeps a sequence
  – Location-based: keeps the data in mpileup format; each row of the matrix keeps information for a different location
Genome Matrices
[Figure: examples of sequence-based and location-based genome matrices.]
Optimization of Memory Utilization
• To decrease memory usage, we apply two techniques:
  – Selective loading
  – Transparent compression
Selective Loading
• Each read sequence in SAM/BAM files consists of 11 mandatory sections and 1 optional section
  – Sequence ID, location, base sequence, strand, and others
• For many applications, we do not need all of them
  – For counting bases, sequence IDs can be ignored
• We load only the parts we need
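Selective loading can be expressed as a bitmask of the fields the application asks for; the reader then skips everything else. The flag names below are illustrative, not the middleware's actual constants.

```c
/* Hypothetical field-selection flags for a SAM/BAM reader. */
enum {
    FIELD_QNAME = 1 << 0,   /* sequence ID */
    FIELD_POS   = 1 << 1,   /* alignment location */
    FIELD_SEQ   = 1 << 2,   /* base sequence */
    FIELD_FLAG  = 1 << 3,   /* strand and other flags */
    FIELD_QUAL  = 1 << 4    /* base qualities */
};

/* Should this field be loaded into memory? */
int field_wanted(unsigned selected, unsigned field) {
    return (selected & field) != 0;
}
```

A base-counting application, for instance, would pass only FIELD_POS | FIELD_SEQ and never pay for sequence IDs.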
Transparent Compression
• Main idea: the genome matrices keep the data in compressed format, but the developer can access the data through our API as if it were uncompressed
• Compression technique: to be investigated
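The "access as if uncompressed" idea can be illustrated with run-length encoding as a stand-in (the slides deliberately leave the actual compression technique open). The type and function names are hypothetical.

```c
#include <stddef.h>

/* A matrix row stored run-length encoded: e.g. "AAAGGT" becomes
 * {{'A',3},{'G',2},{'T',1}}. */
typedef struct { char base; int run; } rle_t;

/* Transparent accessor: index the row as if it were the uncompressed
 * sequence, without ever materializing it. Returns 'N' out of range. */
char base_at(const rle_t *row, size_t nruns, long pos) {
    for (size_t i = 0; i < nruns; i++) {
        if (pos < row[i].run) return row[i].base;
        pos -= row[i].run;   /* skip this run and keep looking */
    }
    return 'N';
}
```

Whatever compression method is eventually chosen, the developer-facing contract stays the same: positional access into compressed storage.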
Sample Map Task

    void *map_coveragedepth(location_based_genome_matrix gm)  /* input genome matrix */
    {
        int i, j, position;
        char *chromosome, *sequence;
        reduce_object *total;   /* reduce object (allocation omitted on the slide) */

        for (i = 0; i < gm.number_of_results; i++) {
            /* getPosition_from_lbgm, get_chromosome_from_lbgm,
             * get_base_sequence_for_sample_n, count_num_bases and
             * add_results_to_reduce_object are methods we provide */
            position = getPosition_from_lbgm(gm.code[i], selected_parts);
            chromosome = get_chromosome_from_lbgm(gm.code[i], selected_parts);
            for (j = 0; j < gm.num_samples; j++) {
                sequence = get_base_sequence_for_sample_n(gm.code[i], selected_parts,
                                                          gm.num_samples, j);
                count_num_bases(sequence);
                add_results_to_reduce_object(total, position, chromosome, sequence);
            }
        }
        return (void *)total;
    }
Open Questions
• How to schedule map and reduce tasks?
• How to keep the intermediate results in memory?
  – The location-based genome matrix structure is useful for decreasing the size of the intermediate results
  – No iterative computation is needed for many applications (e.g. SNP calling)
  – Reduction is just concatenation of the intermediate results, so they can be written to disk as they are produced
A Middleware for Processing Compressed Genomic Data
• Compression is useful for archiving; however, it decreases processing performance
• There are an enormous number of compression methods for genomic data
  – No need for another compression method
• Our goal: a middleware that helps users process compressed data without fully decompressing it
Execution Model
[Figure: execution model for processing compressed genomic data.]
THANKS!