Analysis Tools for Data Enabled Science
SALSA HPC Group http://salsahpc.indiana.edu
School of Informatics and Computing, Indiana University
Bioinformatics Pipeline
Gene Sequences (N = 1 Million)
Select Reference
Reference Sequence Set (M = 100K)
Pairwise Alignment & Distance Calculation, O(N^2)
Distance Matrix
Multi-Dimensional Scaling (MDS)
Reference Coordinates x, y, z
N - M Sequence Set (900K)
Interpolative MDS with Pairwise Distance Calculation
N - M Coordinates x, y, z
Visualization 3D Plot
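The benefit of splitting the sequences into a reference set and an interpolated set can be checked with simple arithmetic. The sketch below counts distance evaluations under two assumptions that are not stated in the pipeline itself: full MDS needs every unordered pair among the N sequences, and each of the N - M remaining sequences is compared only against the M reference sequences. The function names are hypothetical.

```python
def full_pairwise(n):
    """Distance evaluations for all unordered pairs among n sequences."""
    return n * (n - 1) // 2


def interpolated(n, m):
    """Assumed cost model: all pairs inside the m reference sequences,
    plus each of the n - m remaining sequences against every reference."""
    return full_pairwise(m) + (n - m) * m


N, M = 1_000_000, 100_000          # sizes taken from the pipeline
full = full_pairwise(N)            # O(N^2) route
interp = interpolated(N, M)        # reference set + interpolation route
ratio = full / interp              # how many times fewer evaluations
```

Under this counting model the interpolated route needs roughly five times fewer distance evaluations than the full pairwise calculation.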
Structure of Twister4Azure
Iterative MapReduce for Azure
Job Start
Map, Combine (three tasks shown, reading from the Data Cache)
Reduce (two tasks shown)
Merge
Add Iteration?
Yes: Hybrid scheduling of the new iteration
No: Job Finish
Merge Step
In-Memory Caching of static data
Cache aware hybrid scheduling using Queues as well as using a bulletin board (special table)
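The loop in this diagram can be sketched in a few lines of single-process Python, using K-means as the iterative job. This is only an illustration of the Map, Combine, Reduce, Merge and "Add Iteration?" steps with static data held in memory; it is not the Twister4Azure API, and every function name here is hypothetical.

```python
import math


def map_task(cached_points, centroids):
    """Map + Combine: assign each cached point to its nearest centroid
    and emit one partial (sum, count) per centroid."""
    partial = {}
    for p in cached_points:
        k = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        s, n = partial.get(k, ([0.0] * len(p), 0))
        partial[k] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial


def reduce_task(partials):
    """Reduce: add up the partial sums from all map tasks, per centroid."""
    total = {}
    for partial in partials:
        for k, (s, n) in partial.items():
            ts, tn = total.get(k, ([0.0] * len(s), 0))
            total[k] = ([a + b for a, b in zip(ts, s)], tn + n)
    return total


def merge_task(total, old_centroids):
    """Merge: build the new centroids; keep the old one for an empty cluster."""
    return [
        [x / total[k][1] for x in total[k][0]] if k in total else list(c)
        for k, c in enumerate(old_centroids)
    ]


def kmeans(partitions, centroids, max_iter=50, tol=1e-9):
    cache = [list(p) for p in partitions]      # static data, loaded once
    for iteration in range(1, max_iter + 1):
        partials = [map_task(part, centroids) for part in cache]
        new = merge_task(reduce_task(partials), centroids)
        shift = max(math.dist(a, b) for a, b in zip(new, centroids))
        centroids = new
        if shift < tol:                        # "Add Iteration?" -> No
            break
    return centroids, iteration


centroids, iterations = kmeans(
    [[(0, 0), (0, 1)], [(10, 10), (10, 11)]],  # two cached partitions
    [(1, 1), (9, 9)],                          # initial centroids
)
```

Only the centroids change between iterations, so the point partitions are read once and reused, which is the role the Data Cache plays in the diagram.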
Performance – Kmeans Clustering
Performance with/without data caching
Speedup gained using data cache
Scaling speedup
Increasing number of iterations
[Figure: Relative Parallel Efficiency (0% to 160%) and Time (s) (0 to 1600) against Num Instances X Num Data Points: 8 X 16M, 16 X 32M, 32 X 64M, 48 X 96M, 64 X 128M]
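The chart does not spell out how relative parallel efficiency is computed. One common reading, assumed here, compares the work done per instance per second in a larger run against the smallest run. The helper name and the timings in the example are made-up placeholders, not measurements from the chart.

```python
def relative_parallel_efficiency(base, run):
    """Per-instance throughput of `run` divided by that of `base`.

    Each argument is a tuple (instances, data_points, seconds).
    A value of 1.0 means the larger run scales perfectly.
    """
    b_inst, b_pts, b_sec = base
    r_inst, r_pts, r_sec = run
    return (r_pts / (r_inst * r_sec)) / (b_pts / (b_inst * b_sec))


# Placeholder timings (hypothetical): data per instance stays constant,
# so perfect scaling would keep the run time unchanged.
base = (8, 16_000_000, 100.0)
run = (16, 32_000_000, 125.0)
efficiency = relative_parallel_efficiency(base, run)
```

With these placeholder numbers the larger run takes 25% longer for the same work per instance, giving a relative efficiency of 0.8.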
Performance Comparisons
BLAST Sequence Search
[Figure: Parallel Efficiency against Number of Query Files (128 to 728); series: Twister4Azure, Hadoop-Blast, DryadLINQ-Blast]
Cap3 Sequence Assembly
[Figure: Parallel Efficiency (50% to 100%) against Num. of Cores * Num. of Files; series: Twister4Azure, Amazon EMR, Apache Hadoop]
Smith-Waterman Sequence Alignment
[Figure: Adjusted Time (s) (0 to 3000) against Num. of Cores * Num. of Blocks; series: Twister4Azure, Amazon EMR, Apache Hadoop]
Twister v0.9
New Infrastructure for Iterative MapReduce Programming
Configuration program to set up the Twister environment automatically on a cluster
Full mesh network of brokers for facilitating communication
New messaging interface for reducing the message serialization overhead
Memory cache to share data between tasks and jobs
Twister-MDS Demo
This demo shows real-time visualization of a multidimensional scaling (MDS) calculation. Twister runs the parallel calculation inside the cluster, and PlotViz displays the intermediate results on the user's client computer. The program automates both the computation and the monitoring.
Twister-MDS Output
MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children’s Research Institute
Twister-MDS Work Flow
Master Node: Twister Driver, Twister-MDS
ActiveMQ Broker
Client Node: MDS Monitor, PlotViz
Local Disk
I. Send message to start the job
II. Send intermediate results
III. Write data
IV. Read data
MDS Output Monitoring Interface
Pub/Sub Broker Network
Twister-MDS Structure
Master Node: Twister Driver, Twister-MDS
Worker Node (shown twice): Worker Pool, Twister Daemon
Tasks: map, reduce (shown twice)
Steps: calculateBC, calculateStress
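The diagram names two steps, calculateBC and calculateStress, without defining them. A plausible reading, assumed here and not confirmed by the slides, is that they correspond to the two parts of an unweighted SMACOF iteration: building the matrix B(X) and applying it to the coordinates, then measuring the stress of the new layout. The serial sketch below follows that assumption; the Python function names only echo the diagram labels and are not the Twister-MDS code.

```python
import math


def distances(X):
    """Euclidean distance between every pair of rows of X."""
    return [[math.dist(a, b) for b in X] for a in X]


def calculate_stress(X, delta):
    """Sum of squared differences between layout and target distances."""
    D = distances(X)
    n = len(X)
    return sum((D[i][j] - delta[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def calculate_bc(X, delta):
    """One unweighted SMACOF update: X_new = (1/n) * B(X) * X."""
    D = distances(X)
    n, dim = len(X), len(X[0])
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and D[i][j] > 0:
                B[i][j] = -delta[i][j] / D[i][j]
        B[i][i] = -sum(B[i][j] for j in range(n) if j != i)
    return [[sum(B[i][k] * X[k][d] for k in range(n)) / n
             for d in range(dim)] for i in range(n)]


# Target distances come from a unit square; start from a perturbed layout.
delta = distances([(0, 0), (1, 0), (0, 1), (1, 1)])
X = [[0.1, 0.0], [0.9, 0.3], [0.2, 0.8], [1.2, 1.1]]
stresses = [calculate_stress(X, delta)]
for _ in range(20):
    X = calculate_bc(X, delta)
    stresses.append(calculate_stress(X, delta))
```

In a MapReduce setting the row loops in both functions are the natural units to distribute: each map task can handle a block of rows and a reduce task can assemble the result, which matches the two map/reduce pairs in the diagram.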
New Network of Brokers
Twister Driver Node
Twister Daemon Node
ActiveMQ Broker Node
Broker-Daemon Connection
Broker-Broker Connection
Broker-Driver Connection
7 Brokers and 32 Computing Nodes in total
Full Mesh Network
Hierarchical Sending
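Some quick arithmetic shows why spreading daemons over several brokers helps. The sketch below counts the broker-to-broker links in a full mesh and estimates the largest number of sends any single broker makes for one broadcast. The round-robin assignment of daemons to brokers and the forwarding model (the driver's broker forwards once to every other broker, then each broker delivers to its own daemons) are assumptions made for illustration, not details taken from the slides.

```python
def full_mesh_links(brokers):
    """Broker-Broker connections when every broker links to every other."""
    return brokers * (brokers - 1) // 2


def assign_daemons(daemons, brokers):
    """Assumed round-robin placement: daemon d attaches to broker d % brokers."""
    groups = [[] for _ in range(brokers)]
    for d in range(daemons):
        groups[d % brokers].append(d)
    return groups


def max_fanout(daemons, brokers):
    """Most sends made by one broker for a single broadcast.

    Broker 0 is taken as the driver's broker; under round-robin it also
    holds the largest daemon group, so it is the busiest broker.
    """
    groups = assign_daemons(daemons, brokers)
    return (brokers - 1) + len(groups[0])


links = full_mesh_links(7)                       # 7 brokers, as in the figure
sizes = [len(g) for g in assign_daemons(32, 7)]  # 32 computing nodes
single = max_fanout(32, 1)                       # one broker serves everyone
meshed = max_fanout(32, 7)                       # hierarchical sending
```

Under these assumptions, 7 brokers need 21 broker-to-broker links, and the busiest broker handles 11 sends per broadcast instead of 32.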
Performance Improvement
Twister-MDS Execution Time: 100 iterations, 40 nodes, under different input data sizes
Total Execution Time (Seconds)

Number of Data Points   Original (1 broker only)   Current (7 brokers, the best broker number)
38400                   189.288                    148.805
51200                   359.625                    303.432
76800                   816.364                    737.073
102400                  1508.487                   1404.431
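The size of the improvement is easier to judge as a percentage. The sketch below takes the eight bar labels from the chart and assumes, since the pairing is not explicit in the extracted text, that the larger value at each data size is the 1-broker time and the smaller one is the 7-broker time.

```python
points = [38400, 51200, 76800, 102400]
original = [189.288, 359.625, 816.364, 1508.487]   # assumed: 1 broker only
current = [148.805, 303.432, 737.073, 1404.431]    # assumed: 7 brokers


def percent_reduction(before, after):
    """Reduction in execution time, as a percentage of the original time."""
    return 100.0 * (before - after) / before


reductions = [round(percent_reduction(o, c), 1)
              for o, c in zip(original, current)]

for n, r in zip(points, reductions):
    print(f"{n:>7} points: {r}% less time with 7 brokers")
```

With this pairing the reduction is about 21.4%, 15.6%, 9.7% and 6.9% for the four data sizes, so the relative gain shrinks as the input grows.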
Harnessing the Power of Workflow
Design Workflow Pattern
Configure Trident Jobs
Future Work: Combine Windows Trident with Twister
Twister for Polar Science
The Center for Remote Sensing of Ice Sheets
Research, Education, Knowledge Transfer
Utilizing the Power of Twister to Perform Large Scale Scientific Calculation
Deploying a Twister Appliance for Polar Grid
copy
instantiate
…
Virtual Machines
GroupVPN
Virtual IP - DHCP 5.5.1.1
Virtual IP - DHCP 5.5.1.2
GroupVPN Credentials (from Web site)
Twister Architecture
Applications: Kernels, Genomics, Proteomics, Information Retrieval, Polar Science; Scientific Simulation Data Analysis and Management; Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topological Mapping
Security, Provenance, Portal
Services and Workflow
Programming Model: High Level Language
Runtime: Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
Storage: Distributed File Systems, Object Store, Data Parallel File System
Infrastructure: Linux HPC Bare-system, Amazon Cloud, Windows Server HPC Bare-system, Virtualization, Azure Cloud, Grid Appliance
Hardware: CPU Nodes, Virtualization, GPU Nodes
Twister Futures
Development of library of Collectives to use at Reduce phase
  Broadcast and Gather needed by current applications
  Discover other important ones
  Implement efficiently on each platform – especially Azure
Better software message routing with broker networks using asynchronous I/O with communication fault tolerance
Support nearby location of data and computing using data parallel file systems
Clearer application fault tolerance model based on implicit synchronization points at iteration end points
Later: Investigate GPU support
Later: run time for data parallel languages like Sawzall, Pig Latin, LINQ
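Broadcast and Gather, the two collectives named above, can be illustrated with a small in-process sketch. It uses Python threads as stand-ins for workers and is not a Twister interface; the function names and the `partial_sum` worker are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def broadcast(value, num_workers):
    """Broadcast: hand every worker a reference to the same read-only value."""
    return [value] * num_workers


def gather(worker_fn, inputs, shared):
    """Gather: run worker_fn(shared_value, input) on each worker and
    collect the results at the caller, in worker order."""
    with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
        return list(pool.map(worker_fn, shared, inputs))


def partial_sum(scale, chunk):
    """Example worker: scale the sum of its own chunk."""
    return scale * sum(chunk)


chunks = [[1, 2], [3, 4], [5, 6]]            # one chunk per worker
results = gather(partial_sum, chunks, broadcast(10, len(chunks)))
```

In an iterative job the broadcast value would be the small piece of state that changes each iteration, such as the current centroids, while the chunks stay with the workers.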
(a) Map Only: Input, map, Output
  CAP3 Analysis
  Smith-Waterman Distances
  Parametric sweeps
  PolarGrid Matlab data analysis
(b) Classic MapReduce: Input, map, reduce
  High Energy Physics (HEP) Histograms
  Distributed search
  Distributed sorting
  Information retrieval
(c) Iterative MapReduce: Input, map, reduce, Iterations
  Expectation maximization clustering, e.g. Kmeans
  Linear Algebra
  Multidimensional Scaling
  Page Rank
(d) Loosely Synchronous: Pij
  Many MPI scientific applications such as solving differential equations and particle dynamics
Domain of MapReduce and Iterative Extensions: (a) to (c)
MPI: (d)
Status of Iterative MapReduce
Education and Broader Impact
We devote considerable effort to guiding students who are interested in computing
Education
We offer classes with emerging new topics
Together with tutorials on the most popular cloud computing tools
Hosting workshops and spreading our technology across the nation
Giving students unforgettable research experience
Broader Impact
Acknowledgement
SALSA HPC Group, Indiana University
http://salsahpc.indiana.edu