Semantics-based Distributed I/O with the ParaMEDIC Framework

P. Balaji, W. Feng, H. Lin
Mathematics and Computer Science, Argonne National Laboratory
Computer Science and Engineering, Virginia Tech
Dept. of Computer Science, North Carolina State University

Presented by Pavan Balaji (Argonne National Laboratory) at HPDC '08
Distributed Computation and I/O

• Growth of combined compute and I/O requirements
  – E.g., genomic sequence search, large-scale data mining, data visual analytics, and communication profiling
  – Commonality: they require a lot of compute power, and they use and generate a lot of data
• Data has to be managed for later processing or archival
• Managing large data volumes: distributed I/O
  – Non-local access to large compute systems: data is generated remotely and transferred to local systems
  – Resource locality: applications need both compute and storage, so data generated at one site is moved to another
Distributed I/O: The Necessary Evil

• A lot of prior research tries to improve distributed I/O
• It continues to be the elusive holy grail
  – Difficult to achieve high performance for “real data” [1]
• Bandwidth is not everything
  – Real software requires synchronization (milliseconds)
  – High-speed TCP eats up memory and slows down applications
  – Data encryption or endianness conversion is required in some cases
• Not everyone has a lambda grid
  – Scientists run jobs on large centers from their local systems
• There is just too much data!
  – Solution: FedEx!

[1] S. Simms, G. Pike, and D. Balog, “Wide Area Filesystem Performance Using Lustre on the TeraGrid,” TeraGrid Conference, 2007.
Case Study: mpiBLAST on the TeraGrid

[Chart: mpiBLAST performance breakup on the TeraGrid infrastructure; execution time (sec) vs. query size (KB), split into compute time and I/O time]

• 85% of the time is spent on I/O
• On a local-area network, mpiBLAST I/O time is less than 5%
Presentation Outline

• Distributed I/O on the WAN
• ParaMEDIC: Framework to Decouple Compute and I/O
• Case Studies with mpiBLAST and MPE
• Experimental Results
• Glimpses of Follow-on Work
• Concluding Remarks
ParaMEDIC Overview

• Parallel Meta-data Environment for Distributed I/O and Computing
• A new way of “programming” distributed I/O
  – The application generates its output data
  – ParaMEDIC takes over:
    • Transforms the output into (orders-of-magnitude smaller) “application-specific meta-data” at the compute site
    • Transports the meta-data over the WAN to the storage site
    • Transforms the meta-data back into the original data at the storage site (the host site for the global file system)
  – Similar to compression, yet different: it deals with data as abstract objects, not as a byte-stream (a minimal sketch follows below)
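To make this concrete, here is a minimal sketch in Python of how such a plugin slots into the compute-site/storage-site flow. The names (ParamedicPlugin, encode_metadata, decode_metadata, run_distributed_io, wan_send, wan_recv) are illustrative placeholders, not the actual PMAPI.

    # Minimal sketch of a ParaMEDIC-style plugin interface (illustrative, not the real PMAPI).
    # The compute site reduces the application output to small, application-specific
    # meta-data; the storage site regenerates the full output next to the global file system.

    class ParamedicPlugin:
        def encode_metadata(self, output):
            """Compute site: shrink the full output to application-specific meta-data."""
            raise NotImplementedError

        def decode_metadata(self, metadata):
            """Storage site: rebuild the original output from the meta-data."""
            raise NotImplementedError

    def run_distributed_io(plugin, output, wan_send, wan_recv):
        metadata = plugin.encode_metadata(output)   # extra computation at the compute site
        wan_send(metadata)                          # only the small meta-data crosses the WAN
        return plugin.decode_metadata(wan_recv())   # extra computation at the storage site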
The ParaMEDIC Framework

[Component diagram; its layers are:]
• Applications: mpiBLAST, communication profiling, remote visualization
• ParaMEDIC API (PMAPI)
• Application plugins: mpiBLAST plugin, communication-profiling plugin, basic compression
• ParaMEDIC data tools: data encryption, data integrity
• Other utilities: column parsing, data sorting
• Communication services: direct network, global filesystem
Tradeoffs in the ParaMEDIC Framework

• Trading computation for I/O
  – More computation: converting the output to meta-data and back requires extra work
  – Less I/O: only the meta-data is transferred over the WAN, so less WAN bandwidth is used
  – But computation is free; I/O is not! (a back-of-envelope sketch follows below)
• Trading portability for performance
  – Utility functions help develop application plugins, but non-zero effort will always be needed
  – Data is dealt with as high-level objects: a better chance of improved performance
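A back-of-envelope comparison illustrates the computation-for-I/O trade. Every number in the sketch below is a hypothetical placeholder, not a measurement from the paper.

    # Hypothetical back-of-envelope: ship raw output over the WAN vs. ship only meta-data
    # and recompute at the storage site. All numbers here are assumed placeholders.

    raw_output_gb  = 100.0   # size of the raw application output (assumed)
    metadata_gb    = 1.0     # size of the application-specific meta-data (assumed)
    wan_gbps       = 1.0     # effective WAN bandwidth in gigabits/sec (assumed)
    encode_time_s  = 120.0   # extra compute to build the meta-data (assumed)
    decode_time_s  = 300.0   # extra compute to regenerate the output (assumed)

    def wan_seconds(size_gb, bw_gbps):
        return size_gb * 8.0 / bw_gbps

    t_direct    = wan_seconds(raw_output_gb, wan_gbps)                                # 800 s
    t_paramedic = encode_time_s + wan_seconds(metadata_gb, wan_gbps) + decode_time_s  # 428 s
    # ParaMEDIC wins whenever the cheap local computation costs less than the WAN time it saves.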
Sequence Search with mpiBLAST

[Diagram: query sequences are searched against the database sequences to produce the output, shown once as a sequential search of the queries and once as a parallel search of the queries]
mpiBLAST Meta-Data

• The output is alignment information for a bunch of sequences
• The alignment of two sequences is independent of the remaining sequences
• Meta-data: the IDs of the matched database sequences
  – Only the meta-data is communicated over the WAN
  – At the storage site, the matched sequences are placed in a temporary database and re-aligned against the query sequences to regenerate the full output (a sketch follows below)
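Building on the plugin sketch from the ParaMEDIC overview, the mpiBLAST case could look roughly like the following. blast_search() and extract_sequences() are stand-ins for the real mpiBLAST routines, and the sketch assumes, for simplicity, that the sequence database is available at both sites.

    # Illustrative sketch of the mpiBLAST meta-data idea (not the actual plugin code).

    class MpiBlastPlugin(ParamedicPlugin):
        def __init__(self, query_sequences, database):
            self.queries  = query_sequences
            self.database = database   # assumed to be available at both sites

        def encode_metadata(self, blast_output):
            # Compute site: keep only the IDs of the database sequences that matched.
            return sorted({hit.sequence_id for hit in blast_output.hits})

        def decode_metadata(self, matched_ids):
            # Storage site: build a small temporary database holding only the matched
            # sequences, then redo the (now tiny) alignment to regenerate the full output.
            temp_db = extract_sequences(self.database, matched_ids)
            return blast_search(self.queries, temp_db)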
ParaMEDIC-powered mpiBLAST

[Architecture diagram: at the compute sites, an mpiBLAST master and its compute workers search the query; the compute master passes the query and the raw meta-data through the ParaMEDIC framework and over the WAN to the I/O master at the storage site; there, the I/O workers generate and read a temporary database, rerun the search, and write the results to the I/O servers hosting the file system]
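In terms of the earlier sketches, the end-to-end flow in this diagram could be wired up roughly as follows. load_queries, load_database, send, recv, and write_to_global_filesystem are hypothetical helpers, and the file names are placeholders.

    # Hypothetical driver tying the earlier sketches to the architecture above.
    plugin = MpiBlastPlugin(query_sequences=load_queries("queries.fa"),
                            database=load_database("microbial_db"))

    raw_output = blast_search(plugin.queries, plugin.database)       # compute-site workers
    results    = run_distributed_io(plugin, raw_output, send, recv)  # meta-data over the WAN
    write_to_global_filesystem(results)                              # storage-site I/O workers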
MPE: A Profiling Library for MPI

• MPE: MPI Profiling Environment
  – A suite of performance-analysis tools and libraries
  – Shipped as a part of the MPICH2 implementation of MPI
• Relies on the MPI profiling interface
  – The application is run as usual; MPE automagically logs communication calls and the time taken
• Generates lots of data
  – A large-scale application such as FLASH can generate about 2.5 MB of data per second per process
  – A 16K-process run for an hour generates roughly 150 TB of data (see the arithmetic below)
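The 150 TB figure follows directly from the numbers on this slide:

    # Sanity check of the data-volume claim, using the numbers from the slide.
    mb_per_sec_per_process = 2.5
    processes              = 16 * 1024      # "16K processes"
    seconds                = 3600           # one hour

    total_tb = mb_per_sec_per_process * processes * seconds / 1e6
    print(f"{total_tb:.0f} TB")             # ~147 TB, i.e. roughly 150 TB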
Example MPE Profiling Log (GROMACS)

• Identify the periodicity using Fourier transforms and store only the “diffs” in each period (a minimal sketch follows below)
• Can give about a 3-5X improvement
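A minimal sketch of the periodicity idea, assuming a per-timestep trace as input; this illustrates the approach, not MPE's actual implementation, and the 3-5X figure is the slide's claim, not a result of this code.

    # Find the dominant period in a per-timestep trace with an FFT, then store one
    # reference period plus per-period diffs (mostly near-zero values that compress well).
    import numpy as np

    def compress_periodic(trace):
        """trace: 1-D NumPy array of per-timestep values, e.g. bytes logged per step."""
        spectrum = np.abs(np.fft.rfft(trace - trace.mean()))
        cycles   = spectrum[1:].argmax() + 1          # skip the DC component
        period   = max(1, round(len(trace) / cycles))

        reference = trace[:period]
        diffs = [trace[s:s + period] - reference[:len(trace[s:s + period])]
                 for s in range(period, len(trace), period)]
        return period, reference, diffs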
LAN Emulating a 10Gbps WAN

[Chart: impact of input query size; execution time (sec) vs. query size (KB) for mpiBLAST and ParaMEDIC]

[Chart: impact of the number of output query sequences; execution time (sec) vs. number of requested sequences (100 to 10,000)]
Performance on Real Systems

[Chart: ANL to Virginia Tech, encrypted file system; execution time (sec) vs. query size (KB) for mpiBLAST and ParaMEDIC]

[Chart: TeraGrid infrastructure; execution time (sec) vs. query size (KB) for mpiBLAST and ParaMEDIC]
Performance Breakup on the TeraGrid

[Chart: mpiBLAST performance breakup (TeraGrid infrastructure); execution time (sec) vs. query size (KB), split into compute time and I/O time]

[Chart: ParaMEDIC performance breakup (TeraGrid infrastructure); execution time (sec) vs. query size (KB), split into compute time and post-processing plus I/O time]
Evaluation on a Worldwide Supercomputer

[Chart: absolute I/O time (seconds) vs. number of query sequence sets, for mpiBLAST and ParaMEDIC]

[Chart: factor of improvement vs. number of query sequence sets (1 to 128)]
Microbial Genome Database Search

• Semantics-aware metadata puts 2.5 × 10^14 searches at scientists' fingertips
  – All metadata results from all searches can fit on an iPod Nano
  – “Semantically compressed” 1 PB into 4 GB (~10^6X)
    • Conventional compression turns 1 PB into about 300 TB (~3X)

“ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing”, P. Balaji, W. Feng, J. Archuleta, and H. Lin. Storage Challenge Award, SC 2007.

“Distributed I/O with ParaMEDIC: Experiences with a Worldwide Supercomputer”, P. Balaji, W. Feng, H. Lin, J. Archuleta, S. Matsuoka, A. Warren, J. Setubal, E. Lusk, R. Thakur, I. Foster, D. S. Katz, S. Jha, K. Shinpaugh, S. Coghlan, and D. Reed. Best Paper Award, ISC 2008.
Concluding Remarks

• Distributed I/O is a necessary evil
  – Difficult to get high performance for “real data”
• Traditional approaches deal with data as a stream of bytes (which allows portability across any type of data)
• We propose ParaMEDIC
  – Semantics-based meta-data transformation of data
  – Trades portability for performance
• Evaluated on emulated and real systems
  – Order-of-magnitude benefits in performance