
Analysis Tools for Data Enabled Science

Date post: 24-Feb-2016
Page 1: Analysis Tools for Data Enabled Science

Analysis Tools for Data Enabled Science

SALSA HPC Group http://salsahpc.indiana.edu

School of Informatics and Computing, Indiana University

Page 2: Analysis Tools for Data Enabled Science

Bioinformatics Pipeline

[Pipeline diagram]

Gene Sequences (N = 1 Million) -> Select Reference -> Reference Sequence Set (M = 100K) and N - M Sequence Set (900K)

Reference Sequence Set (M = 100K) -> Pairwise Alignment & Distance Calculation, O(N^2) -> Distance Matrix -> Multi-Dimensional Scaling (MDS) -> Reference Coordinates x, y, z

N - M Sequence Set (900K) -> Interpolative MDS with Pairwise Distance Calculation -> N - M Coordinates x, y, z

Reference Coordinates and N - M Coordinates -> Visualization 3D Plot
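The reference/remainder split in the pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the group's implementation: `toy_distance`, `split_reference`, `pairwise_matrix`, and `distances_to_reference` are hypothetical names, the random selection of references is an assumption, and the toy mismatch count stands in for the alignment-based distance named on the slide. If every remaining sequence is compared against every reference sequence, the work drops from N^2 = 10^12 distance evaluations to M^2 + (N - M) * M = 10^10 + 9 * 10^10 = 10^11.

```python
import random


def toy_distance(a: str, b: str) -> float:
    """Stand-in dissimilarity (NOT an alignment score): mismatches plus length gap, normalized."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


def split_reference(sequences, m, seed=0):
    """Pick m reference sequences at random (assumed policy); return (reference, remainder)."""
    rng = random.Random(seed)
    order = list(range(len(sequences)))
    rng.shuffle(order)
    reference = [sequences[i] for i in order[:m]]
    remainder = [sequences[i] for i in order[m:]]
    return reference, remainder


def pairwise_matrix(seqs, dist):
    """Full symmetric distance matrix over the reference set: O(M^2) distance calls."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = dist(seqs[i], seqs[j])
    return matrix


def distances_to_reference(remainder, reference, dist):
    """Distances from each remaining sequence to each reference: O((N - M) * M) calls."""
    return [[dist(s, r) for r in reference] for s in remainder]


sequences = ["ACGT", "ACGA", "TTGT", "ACG", "GGGG", "ACGT"]
reference, remainder = split_reference(sequences, m=2)
reference_matrix = pairwise_matrix(reference, toy_distance)       # input to MDS
cross_distances = distances_to_reference(remainder, reference, toy_distance)  # input to interpolation
```

The reference matrix would feed the full MDS step, and the cross distances would feed the interpolation step that places the remaining points.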

Page 3: Analysis Tools for Data Enabled Science

Structure of Twister4Azure

Page 4: Analysis Tools for Data Enabled Science

Iterative MapReduce for Azure

[Diagram: Job Start -> Map/Combine tasks (with Data Cache) -> Reduce -> Merge -> Add Iteration? Yes: hybrid scheduling of the new iteration; No: Job Finish]

Merge step

In-memory caching of static data

Cache-aware hybrid scheduling using queues as well as a bulletin board (special table)
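The Map/Combine, Reduce, Merge, and "Add Iteration?" steps can be illustrated with a small single-process K-means loop. This is a sketch under stated assumptions, not Twister4Azure code: all function names are hypothetical, the `cache` list stands in for the in-memory cache of static data partitions, and the queue and bulletin-board scheduling is not modeled.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))


def map_combine(partition, centroids):
    """Map + Combine: one partial (sum vector, count) per centroid for this partition."""
    partial = {}
    for point in partition:
        k = nearest(point, centroids)
        total, count = partial.get(k, ([0.0] * len(point), 0))
        partial[k] = ([t + p for t, p in zip(total, point)], count + 1)
    return partial


def reduce_step(partials):
    """Reduce: add the partial sums for each centroid across all partitions."""
    merged = {}
    for partial in partials:
        for k, (total, count) in partial.items():
            old_total, old_count = merged.get(k, ([0.0] * len(total), 0))
            merged[k] = ([a + b for a, b in zip(old_total, total)], old_count + count)
    return merged


def merge_step(reduced, centroids, tol):
    """Merge: build the new centroids and answer the 'Add Iteration?' question."""
    new = [list(c) for c in centroids]
    for k, (total, count) in reduced.items():
        new[k] = [t / count for t in total]
    shift = max(sum((a - b) ** 2 for a, b in zip(n, c)) ** 0.5
                for n, c in zip(new, centroids))
    return new, shift > tol


def iterative_kmeans(cache, centroids, tol=1e-9, max_iter=100):
    """Loop over iterations; the static partitions in cache are loaded once and reused."""
    iterations = 0
    again = True
    while again and iterations < max_iter:
        partials = [map_combine(part, centroids) for part in cache]
        reduced = reduce_step(partials)
        centroids, again = merge_step(reduced, centroids, tol)
        iterations += 1
    return centroids, iterations


cache = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
final_centroids, iterations = iterative_kmeans(cache, [[0.0, 0.0], [10.0, 10.0]])
```

Only the small centroid list changes between iterations, which is why caching the large static partitions avoids reloading them on every pass.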

Page 5: Analysis Tools for Data Enabled Science

Performance – Kmeans Clustering

Performance with/without data caching

Speedup gained using data cache

Scaling speedup

Increasing number of iterations

[Chart: Relative Parallel Efficiency (0% to 160%) and Time (s) (0 to 1600) vs. Num Instances X Num Data Points (8 X 16M, 16 X 32M, 32 X 64M, 48 X 96M, 64 X 128M)]

Page 6: Analysis Tools for Data Enabled Science

Performance Comparisons

[Chart] BLAST Sequence Search: Parallel Efficiency (0% to 100%) vs. Number of Query Files (128 to 728); series: Twister4Azure, Hadoop-Blast, DryadLINQ-Blast

[Chart] Cap3 Sequence Assembly: Parallel Efficiency (50% to 100%) vs. Num. of Cores * Num. of Files; series: Twister4Azure, Amazon EMR, Apache Hadoop

[Chart] Smith-Waterman Sequence Alignment: Adjusted Time (s) (0 to 3000) vs. Num. of Cores * Num. of Blocks; series: Twister4Azure, Amazon EMR, Apache Hadoop

Page 7: Analysis Tools for Data Enabled Science

Twister v0.9

New Infrastructure for Iterative MapReduce Programming

Configuration program to set up the Twister environment automatically on a cluster

Full mesh network of brokers for facilitating communication

New messaging interface for reducing the message serialization overhead

Memory cache to share data between tasks and jobs

Page 8: Analysis Tools for Data Enabled Science

Twister-MDS Demo

This demo shows real-time visualization of a multidimensional scaling (MDS) calculation. Twister runs the parallel calculation inside the cluster, and PlotViz displays the intermediate results on the user's client computer. The program automates both the computation and the monitoring.

Page 9: Analysis Tools for Data Enabled Science

MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children’s Research Institute

Twister-MDS Output

Page 10: Analysis Tools for Data Enabled Science

Twister-MDS Work Flow

[Diagram components: Master Node (Twister Driver, Twister-MDS); ActiveMQ Broker; Client Node (MDS Monitor, PlotViz, Local Disk)]

I. Send message to start the job

II. Send intermediate results

III. Write data

IV. Read data

Page 11: Analysis Tools for Data Enabled Science

Twister-MDS Structure

[Diagram components: MDS Output Monitoring Interface; Pub/Sub Broker Network; Master Node (Twister Driver, Twister-MDS); two Worker Nodes (each with a Worker Pool and a Twister Daemon)]

Two MapReduce computations, each with map and reduce tasks: calculateBC and calculateStress
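The slide names two computations, calculateBC and calculateStress, without giving formulas. A common way to organize iterative MDS is the SMACOF algorithm, where each iteration multiplies a matrix B(X) by the current coordinates and then evaluates the stress. The sketch below assumes that correspondence; it is a serial, unit-weight illustration with hypothetical function names, not the Twister-MDS source.

```python
import math


def distances(x):
    """Euclidean distance matrix of the current coordinates."""
    n = len(x)
    return [[math.dist(x[i], x[j]) for j in range(n)] for i in range(n)]


def calculate_bc(delta, x):
    """One SMACOF update with unit weights (Guttman transform): X_new = (1/N) * B(X) * X."""
    n, dim = len(x), len(x[0])
    d = distances(x)
    b = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and d[i][j] > 0.0:
                b[i][j] = -delta[i][j] / d[i][j]
        # Diagonal entry makes every row of B sum to zero.
        b[i][i] = -sum(b[i][j] for j in range(n) if j != i)
    return [[sum(b[i][k] * x[k][c] for k in range(n)) / n for c in range(dim)]
            for i in range(n)]


def calculate_stress(delta, x):
    """Raw stress: squared mismatch between target and current distances, over pairs i < j."""
    d = distances(x)
    n = len(x)
    return sum((delta[i][j] - d[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))


# Target dissimilarities taken from four known points; start from a perturbed layout.
truth = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]]
delta = distances(truth)
x = [[0.1, 0.0], [2.5, 0.3], [-0.2, 3.6], [3.3, 4.4]]
history = [calculate_stress(delta, x)]
for _ in range(20):
    x = calculate_bc(delta, x)
    history.append(calculate_stress(delta, x))
```

In a MapReduce setting, the rows of B(X) * X and the partial stress sums can be computed independently per block of points and added up in the reduce step, which is one plausible reason the diagram shows a map and a reduce for each computation.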

Page 12: Analysis Tools for Data Enabled Science

New Network of Brokers

[Diagram legend: Twister Driver Node; Twister Daemon Node; ActiveMQ Broker Node; Broker-Daemon Connection; Broker-Broker Connection; Broker-Driver Connection]

7 Brokers and 32 Computing Nodes in total

Full Mesh Network

Hierarchical Sending
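One way to picture hierarchical sending over a full mesh of brokers is the toy model below. It is an assumption-laden sketch: the round-robin attachment of daemons to brokers, the single entry broker, and all function names are hypothetical, and the real routing in Twister may differ. It only counts message hops for one broadcast from the driver to every daemon.

```python
def assign_daemons(num_brokers, num_daemons):
    """Attach daemons to brokers round-robin (an assumed policy, not from the slides)."""
    groups = [[] for _ in range(num_brokers)]
    for daemon in range(num_daemons):
        groups[daemon % num_brokers].append(daemon)
    return groups


def hierarchical_send(groups, entry_broker=0):
    """List the (sender, receiver) hops for one broadcast from the driver."""
    hops = [("driver", f"broker{entry_broker}")]
    # Full mesh: the entry broker reaches every other broker in one hop.
    for broker in range(len(groups)):
        if broker != entry_broker:
            hops.append((f"broker{entry_broker}", f"broker{broker}"))
    # Each broker then delivers to the daemons attached to it.
    for broker, daemons in enumerate(groups):
        for daemon in daemons:
            hops.append((f"broker{broker}", f"daemon{daemon}"))
    return hops


def max_fanout(hops):
    """Largest number of messages any single node has to send."""
    counts = {}
    for sender, _ in hops:
        counts[sender] = counts.get(sender, 0) + 1
    return max(counts.values())


single = hierarchical_send(assign_daemons(1, 32))
mesh = hierarchical_send(assign_daemons(7, 32))
```

In this model a single broker must send all 32 copies itself, while with 7 brokers the busiest broker sends 11 (6 to peer brokers plus 5 to its own daemons), at the cost of 6 extra broker-to-broker hops.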

Page 13: Analysis Tools for Data Enabled Science

Performance Improvement

Twister-MDS Execution Time: 100 iterations, 40 nodes, under different input data sizes

Total Execution Time (Seconds)

Number of Data Points    Original Execution Time (1 broker only)    Current Execution Time (7 brokers, the best broker number)
38400                    189.288                                     148.805
51200                    359.625                                     303.432
76800                    816.364                                     737.073
102400                   1508.487                                    1404.431
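The size of the improvement can be worked out directly from the figures on this slide. The short script below assumes the first series of bar labels (189.288 to 1508.487) is the original single-broker time and the second series (148.805 to 1404.431) is the current 7-broker time, matching the legend order; `time_saved` is a hypothetical helper name.

```python
data_points = [38400, 51200, 76800, 102400]
original_s = [189.288, 359.625, 816.364, 1508.487]   # 1 broker only (assumed series order)
current_s = [148.805, 303.432, 737.073, 1404.431]    # 7 brokers (assumed series order)


def time_saved(original, current):
    """Return (seconds saved, fraction of original time saved) for each data size."""
    return [(o - c, (o - c) / o) for o, c in zip(original, current)]


savings = time_saved(original_s, current_s)
for n, (saved, fraction) in zip(data_points, savings):
    print(f"{n:>7} points: {saved:8.3f} s saved ({fraction:.1%} of original)")
```

Under that reading, the relative saving shrinks as the input grows, from roughly 21% at 38400 points to about 7% at 102400 points, while the absolute saving in seconds increases.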

Page 14: Analysis Tools for Data Enabled Science

Harnessing the Power of Workflow

Design Workflow Pattern

Configure Trident Jobs

Page 15: Analysis Tools for Data Enabled Science

Harnessing the Power of Workflow

Future Work: Combine Windows Trident with Twister

Page 16: Analysis Tools for Data Enabled Science

Twister for Polar Science

The Center for Remote Sensing of Ice Sheets

Research, Education, Knowledge Transfer

Utilizing the Power of Twister to Perform Large Scale Scientific Calculation

Page 17: Analysis Tools for Data Enabled Science

Twister for Polar Science

Deploying a Twister Appliance for Polar Grid

[Diagram labels: copy; instantiate; Virtual Machines; GroupVPN; GroupVPN Credentials (from Web site); Virtual IP - DHCP 5.5.1.1; Virtual IP - DHCP 5.5.1.2]

Page 18: Analysis Tools for Data Enabled Science

Twister Architecture

[Layered architecture diagram, top to bottom]

Applications: Kernels, Genomics, Proteomics, Information Retrieval, Polar Science; Scientific Simulation Data Analysis and Management; Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topological Mapping

Programming Model: High Level Language

Runtime: Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)

Storage: Distributed File Systems, Data Parallel File System, Object Store

Infrastructure: Linux HPC Bare-system, Amazon Cloud, Windows Server HPC Bare-system, Virtualization, Azure Cloud, Grid Appliance

Hardware: CPU Nodes, GPU Nodes

Alongside the layers: Security, Provenance, Portal; Services and Workflow

Page 19: Analysis Tools for Data Enabled Science

Twister Futures

Development of a library of Collectives to use at the Reduce phase

    Broadcast and Gather needed by current applications

    Discover other important ones

    Implement efficiently on each platform, especially Azure

Better software message routing with broker networks using asynchronous I/O with communication fault tolerance

Support nearby location of data and computing using data parallel file systems

Clearer application fault tolerance model based on implicit synchronization points at iteration end points

Later: Investigate GPU support

Later: runtime for data parallel languages like Sawzall, Pig Latin, LINQ

Page 20: Analysis Tools for Data Enabled Science

Status of Iterative MapReduce

Domain of MapReduce and Iterative Extensions: (a), (b), (c)

(a) Map Only (Input -> map -> Output): CAP3 Analysis; Smith-Waterman Distances; Parametric sweeps; PolarGrid Matlab data analysis

(b) Classic MapReduce (Input -> map -> reduce): High Energy Physics (HEP) Histograms; Distributed search; Distributed sorting; Information retrieval

(c) Iterative MapReduce (Input -> map -> reduce, with Iterations): Expectation maximization clustering, e.g. Kmeans; Linear Algebra; Multidimensional Scaling; Page Rank

MPI: (d)

(d) Loosely Synchronous (Pij): Many MPI scientific applications such as solving differential equations and particle dynamics

Page 21: Analysis Tools for Data Enabled Science

Education and Broader Impact

We devote a lot of effort to guiding students who are interested in computing

Page 22: Analysis Tools for Data Enabled Science

Education

We offer classes with emerging new topics

Together with tutorials on the most popular cloud computing tools

Page 23: Analysis Tools for Data Enabled Science

Hosting workshops and spreading our technology across the nation

Giving students unforgettable research experience

Broader Impact

Page 24: Analysis Tools for Data Enabled Science

Acknowledgement

SALSA HPC Group, Indiana University

http://salsahpc.indiana.edu

