DOE Network PI Meeting 2005
Runtime Data Management for
Data-Intensive Scientific Applications
Xiaosong Ma, NC State University
Joint Faculty: Oak Ridge National Lab
ECPI: 2005–2008
Data-Intensive Applications on NLCF
- Data-processing applications: bio sequence DB queries, simulation data analysis, visualization
- Challenges
  - Rapid data growth (the "data avalanche"): computation and I/O requirements call for ultra-scale machines
  - Less studied than numerical simulations: scalability on large machines
  - Complexity and heterogeneity: case-by-case static optimization is costly
Run-Time Data Management
- Parallel execution plan optimization
  - Example: genome vs. database sequence comparison on 1000s of processors
  - Data placement is crucial for performance and scalability
  - Issues: data partitioning/replication, load balancing (see the partitioning sketch below)
- Efficient parallel I/O with scientific data formats
  - I/O subsystem performance is lagging behind; widely used scientific data formats (HDF, netCDF) further limit applications' I/O performance
  - Issues: library overhead, metadata management and accesses
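
To make the partitioning issue concrete, here is a minimal sketch of computing fragment boundaries at run time rather than fixing them ahead of the search; the function name, parameters, and byte-level granularity are illustrative assumptions, not code from the project.

```c
/* Minimal sketch (assumption, not from the slides): compute the byte range
 * of fragment `rank` out of `nprocs` on the fly, so the fragment count can
 * match any processor count without a static pre-partitioning step.
 * A real partitioner would round boundaries to sequence-record boundaries. */
void fragment_range(long long db_size, int nprocs, int rank,
                    long long *begin, long long *end)
{
    long long base = db_size / nprocs;   /* bytes every rank receives */
    long long rem  = db_size % nprocs;   /* leftover bytes, one per low rank */
    *begin = rank * base + (rank < rem ? rank : rem);
    *end   = *begin + base + (rank < rem ? 1 : 0);
}
```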
Proposed Approach
- Adaptive run-time optimization
- For parallel execution plan optimization
  - Connect scientific data processing to relational databases
  - Runtime cost modeling and evaluation
- For parallel I/O with scientific data formats
  - Library-level memory management
  - Hiding I/O costs: caching, prefetching, buffering
Prelim Result 1: Efficient Data Accesses for Parallel Sequence Searches
- BLAST
  - Widely used bio sequence search tool (NCBI BLAST Toolkit)
- mpiBLAST
  - Developed at LANL
  - Open-source parallelization of BLAST using database partitioning
  - Increasingly popular: more than 10,000 downloads since early 2003
  - Directly utilizes NCBI BLAST
  - Superlinear speedup with a small number of processors
Data Handling in mpiBLAST Not Efficient
- Databases are partitioned statically before the search
  - Inflexible: re-partitioning required to use a different number of processors
  - Management overhead: generates a large number of small files that are hard to manage, migrate, and share
- Result processing and output are serialized by the master node
- Result: rapidly growing non-search overhead as the number of processors grows and output data sizes grow
[Figure: execution time (seconds) vs. number of processors (4-62), broken into search time and other (non-search) time, for searching 150K queries against the NR database]
pioBLAST
- Efficient, highly scalable parallel BLAST implementation [IPDPS '05]
  - Improves mpiBLAST, focusing on data handling
  - Up to an order-of-magnitude improvement in overall performance
  - Currently being merged with mpiBLAST
- Major contributions
  - Applies collective I/O techniques to bioinformatics, enabling dynamic database partitioning and parallel database input and result output (see the sketch below)
  - Efficient result data processing, removing the master bottleneck
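
The collective-output idea can be illustrated with a hedged sketch: each worker writes its local result buffer into one shared file with a collective MPI-IO write, instead of funneling everything through a master. The function name and parameters are illustrative, not pioBLAST's actual code.

```c
#include <mpi.h>

/* Illustrative sketch: ranks compute disjoint file offsets via a prefix sum
 * of their result sizes, then write in one collective MPI-IO operation. */
void write_results_collectively(MPI_Comm comm, const char *path,
                                const char *results, int len)
{
    MPI_File fh;
    int rank;
    long long my_len = len, offset = 0;

    MPI_Comm_rank(comm, &rank);
    /* Exclusive prefix sum of result sizes gives each rank its file offset. */
    MPI_Exscan(&my_len, &offset, 1, MPI_LONG_LONG, MPI_SUM, comm);
    if (rank == 0) offset = 0;   /* MPI_Exscan leaves rank 0's value undefined */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    /* Collective write: the MPI-IO layer (e.g., ROMIO) can merge the ranks'
     * requests into large, well-formed file accesses. */
    MPI_File_write_at_all(fh, (MPI_Offset)offset, results, len, MPI_CHAR,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```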
pioBLAST Sample Performance Results
- Platform: SGI Altix at ORNL
  - 256 1.5 GHz Itanium2 processors, 8 GB memory per processor
- Database: NCBI nr (1 GB)
- Node scalability tests (figure below)
  - Queries: 150K queries randomly sampled from nr
  - Varied number of processors
[Figure: execution time (s) for each program and process count, broken into search time and other time]
Prelim Result 2: Active Buffering
- Hides periodic I/O costs behind computation phases [IPDPS '02, ICS '02, IPDPS '03, IEEE TPDS (to appear)]
- Organizes idle memory resources into a buffer hierarchy
- Masks the costs of scientific data formats
- Implemented in two parallel I/O systems (an ABT sketch follows below):
  - Panda parallel I/O library (University of Illinois): client-server architecture
  - ROMIO parallel I/O library (Argonne National Lab): popular MPI-IO implementation included in MPICH; server-less architecture; ABT (Active Buffering with Threads)
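
Here is a hedged sketch of the ABT idea, not the published implementation: ab_write() copies a periodic output into an in-memory queue and returns immediately, and a background thread drains the queue to disk, overlapping write costs with the next computation phase. All names are illustrative.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct buf { void *data; size_t len; struct buf *next; } buf_t;

static buf_t *head, *tail;                 /* FIFO of pending writes */
static int done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Called by the application in place of a blocking write. */
void ab_write(const void *data, size_t len)
{
    buf_t *b = malloc(sizeof *b);
    b->data = malloc(len);                 /* buffer in idle memory */
    memcpy(b->data, data, len);
    b->len = len;
    b->next = NULL;
    pthread_mutex_lock(&lock);
    if (tail) tail->next = b; else head = b;
    tail = b;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

/* Background I/O thread: performs the actual writes (and, in the real
 * system, any scientific-format conversion) while computation proceeds. */
void *ab_flusher(void *arg)
{
    FILE *f = fopen((const char *)arg, "wb");
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!head && !done)
            pthread_cond_wait(&nonempty, &lock);
        buf_t *b = head;
        if (b) { head = b->next; if (!head) tail = NULL; }
        pthread_mutex_unlock(&lock);
        if (!b) break;                     /* queue drained after shutdown */
        fwrite(b->data, 1, b->len, f);
        free(b->data);
        free(b);
    }
    fclose(f);
    return NULL;
}

/* At finalize time: signal shutdown and wait for the flusher to drain. */
void ab_finish(pthread_t flusher)
{
    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
    pthread_join(flusher, NULL);
}
```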
Write Throughput with Active Buffering
[Figure: write throughput per server (MB/s) vs. number of clients (2-32), for binary and HDF4 writes. Without buffer overflow, active buffering (AB) is compared against local buffering and plain MPI; with buffer overflow, AB is compared against ideal throughput and plain MPI]
Prelim Result 3: Application-level Prefetching
- GODIVA (General Object Data Interface for Visualization Applications) framework: hides periodic input costs behind computation phases [ICDE '04]
- In-memory database managing data buffer locations
- Relational database-like interfaces
- Developer-controllable prefetching and caching
- Developer-supplied read functions (see the sketch after the figure)
[Figure: execution time (s) for simple, medium, and complex datasets under three configurations (O, G, TG), broken into computation time and visible I/O time]
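
The following is a hedged sketch of GODIVA-style, developer-controlled prefetching; the real framework's interfaces differ, and every name here is illustrative. While the application processes timestep t, a helper thread runs the developer-supplied read function for timestep t + 1, hiding the input cost behind computation.

```c
#include <pthread.h>
#include <stdlib.h>

typedef void *(*read_fn)(int timestep);        /* developer-supplied reader */

typedef struct { read_fn read; int step; void *data; pthread_t tid; } prefetch_t;

static void *prefetch_worker(void *arg)
{
    prefetch_t *p = arg;
    p->data = p->read(p->step);                /* read timestep into memory */
    return NULL;
}

/* Begin fetching `step` in the background. */
prefetch_t *prefetch_start(read_fn read, int step)
{
    prefetch_t *p = malloc(sizeof *p);
    p->read = read;
    p->step = step;
    p->data = NULL;
    pthread_create(&p->tid, NULL, prefetch_worker, p);
    return p;
}

/* Block until the prefetch completes; returns the in-memory buffer. */
void *prefetch_wait(prefetch_t *p)
{
    pthread_join(p->tid, NULL);
    void *data = p->data;
    free(p);
    return data;
}
```

An analysis loop would then overlap I/O with computation: start prefetching timestep t + 1, process timestep t, then wait on the prefetch before the next iteration.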
On-going Research
- Parallel execution plan optimization
  - Explore the optimization space of bio sequence processing tools on large-scale machines
  - Develop algorithm-independent cost models
- Efficient parallel I/O with scientific data formats
  - Investigate unified caching, prefetching, and buffering
- DOE collaborators
  - Team led by Al Geist and Nagiza Samatova (ORNL)
  - Team led by Wu-Chun Feng (LANL)