Parallel and Grid I/O Infrastructure
W. Gropp, R. Ross, R. Thakur (Argonne National Lab)
A. Choudhary, W. Liao (Northwestern University)
G. Abdulla, T. Eliassi-Rad (Lawrence Livermore National Lab)
Outline
• Introduction
• PVFS and ROMIO
• Parallel NetCDF
• Query Pattern Analysis
Please interrupt at any point for questions!
What is this project doing?
• Extending existing infrastructure work
  – PVFS parallel file system
  – ROMIO MPI-IO implementation
• Helping match application I/O needs to underlying capabilities
  – Parallel NetCDF
  – Query Pattern Analysis
• Linking with Grid I/O resources
  – PVFS backend for GridFTP striped server
  – ROMIO on top of Grid I/O API
What Are All These Names?
• MPI - Message Passing Interface Standard
  – Also known as MPI-1
• MPI-2 - Extensions to the MPI standard
  – I/O, RDMA, dynamic processes
• MPI-IO - I/O part of the MPI-2 extensions
• ROMIO - Implementation of MPI-IO
  – Handles mapping MPI-IO calls into communication (MPI) and file I/O
• PVFS - Parallel Virtual File System
  – An implementation of a file system for Linux clusters
Fitting the Pieces Together
[Diagram: Query Pattern Analysis and Parallel NetCDF sit atop any MPI-IO implementation; the ROMIO MPI-IO implementation in turn sits atop the PVFS parallel file system and Grid I/O resources]

• Query Pattern Analysis (QPA) and Parallel NetCDF are both written in terms of MPI-IO calls
  – QPA tools pass information down through MPI-IO hints
  – Parallel NetCDF is written using MPI-IO for data read/write
• The ROMIO implementation uses PVFS as the storage medium on Linux clusters, or could hook into Grid I/O resources
PVFS and ROMIO
• Provide a little background on the two
  – What they are, an example to set context, status
• Motivate the work
• Discuss current research and development
  – I/O interfaces
  – MPI-IO hints
  – PVFS2

Our work on these two is closely tied together.
Parallel Virtual File System
• Parallel file system for Linux clusters
  – Global name space
  – Distributed file data
  – Builds on TCP and local file systems
• Tuned for high-performance concurrent access
• Mountable like NFS file systems
• User-level interface library (used by ROMIO)
• 200+ users on the mailing list, 100+ downloads/month
  – Up from 160+ users in March
• Installations at OSC, Univ. of Utah, Phillips Petroleum, ANL, Clemson Univ., etc.
PVFS Architecture
• Client-server architecture
• Two server types
  – Metadata server (mgr) - keeps track of file metadata (permissions, owner) and directory structure
  – I/O servers (iod) - orchestrate movement of data between clients and local I/O devices
• Clients access PVFS in one of two ways
  – MPI-IO (using the ROMIO implementation)
  – Mount through the Linux kernel (loadable module)
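To make the division of labor concrete, the following sketch shows how a client-side library might map a file byte range onto per-server regions. This is illustrative Python, not actual PVFS code: the strip size, server count, and simple round-robin layout are simplifying assumptions.

```python
def stripe_map(offset, length, strip_size=65536, num_servers=4):
    """Split a file byte range into per-server (server, local_offset, length)
    pieces, assuming round-robin striping of fixed-size strips across the
    I/O servers (a simplified model of a striped parallel file system)."""
    pieces = []
    end = offset + length
    while offset < end:
        strip = offset // strip_size                  # global strip index
        server = strip % num_servers                  # round-robin placement
        within = offset % strip_size                  # position inside strip
        local = (strip // num_servers) * strip_size + within
        n = min(strip_size - within, end - offset)    # bytes left in strip
        pieces.append((server, local, n))
        offset += n
    return pieces
```

A single contiguous client request thus fans out into at most one contiguous region per strip it touches, which the I/O servers can service concurrently.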
PVFS Performance
• Ohio Supercomputer Center cluster
• 16 I/O servers (IA32), 70+ clients (IA64), IDE disks
• Block-partitioned data, accessed through ROMIO
ROMIO
• Implementation of the MPI-2 I/O specification
  – Operates on a wide variety of platforms
  – Abstract Device Interface for I/O (ADIO) aids in porting to new file systems
  – Fortran and C bindings
• Successes
  – Adopted by industry (e.g. Compaq, HP, SGI)
  – Used at ASCI sites (e.g. LANL Blue Mountain)

[Diagram: MPI-IO interface atop the ADIO interface, atop FS-specific code (e.g. AD_PVFS, AD_NFS)]
Example of Software Layers
• FLASH Astrophysics application stores checkpoints and visualization data using HDF5
• HDF5 in turn uses MPI-IO (ROMIO) to write out its data files
• PVFS client library is used by ROMIO to write data to the PVFS file system
• PVFS client library interacts with PVFS servers over the network

[Diagram: FLASH Astrophysics Code → HDF5 I/O Library → ROMIO MPI-IO Library → PVFS Client Library → PVFS Servers]
Example of Software Layers (2)
• FLASH Astrophysics application stores checkpoints and visualization data using HDF5
• HDF5 in turn uses MPI-IO (IBM's implementation) to write out its data files
• GPFS file system stores the data to disks

[Diagram: FLASH Astrophysics Code → HDF5 I/O Library → IBM MPI-IO Library → GPFS]
Status of PVFS and ROMIO
• Both are freely available, widely distributed, documented, and supported products
• Current work focuses on:
  – Higher performance through richer file system interfaces
  – Hint mechanisms for optimizing the behavior of both ROMIO and PVFS
  – Scalability
  – Fault tolerance
Why Does This Work Matter?
• Much of the I/O on big machines goes through MPI-IO
  – Direct use of MPI-IO (visualization)
  – Indirect use through HDF5 or NetCDF (fusion, climate, astrophysics)
  – Hopefully soon through Parallel NetCDF!
• On clusters, PVFS is currently the most widely deployed parallel file system
• Optimizations in these layers are of direct benefit to those users
• Providing guidance to vendors for possible future improvements
I/O Interfaces
• Scientific applications keep structured data sets in memory and in files
• For highest performance, the description of the structure must be maintained through the software layers
  – Allow the scientist to describe the data layout in memory and in the file
  – Avoid packing into buffers in intermediate layers
  – Minimize the number of file system operations needed to perform I/O
File System Interfaces
• MPI-IO is a great starting point
• Most underlying file systems only provide POSIX-like contiguous access
• List I/O work was a first step in the right direction
  – Proposed FS interface
  – Allows movement of lists of data regions in memory and file with one call

[Diagram: noncontiguous regions in memory mapped to noncontiguous regions in the file]
List I/O
• Implemented in PVFS
• Transparent to the user through ROMIO
• Distributed in the latest releases

[Chart: tiled visualization reader bandwidth (MB/s, 0-30), comparing POSIX I/O, data sieving, two-phase collective I/O, and list I/O]
List I/O Example
• Simple datatype repeated over the file
• Desire to read the first 9 bytes
• This is converted into four [offset, length] pairs
• One can see how this process could result in a very large list of offsets and lengths

[Figure: Flattening a file datatype. The datatype (one byte at offset 0, two bytes at offset 2, extent 4) is tiled three times across file bytes 0-11; reading the first 9 bytes of data flattens into file offsets {0, 2, 6, 10} with lengths {1, 3, 3, 2}]
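The flattening step can be sketched in Python (an illustrative model, not ROMIO or PVFS code): tile the datatype's (displacement, length) blocks over the file and merge adjacent regions into [offset, length] pairs.

```python
def flatten(blocks, extent, nbytes):
    """Tile a datatype, given as a list of (displacement, length) blocks
    with the given extent, over a file, and flatten the first nbytes of
    data into merged (file_offset, length) pairs as a List I/O request
    would need."""
    pairs = []
    tile = 0
    remaining = nbytes
    while remaining > 0:
        for disp, length in blocks:
            if remaining <= 0:
                break
            n = min(length, remaining)
            off = tile * extent + disp
            # merge with the previous pair when the regions are contiguous
            if pairs and pairs[-1][0] + pairs[-1][1] == off:
                pairs[-1] = (pairs[-1][0], pairs[-1][1] + n)
            else:
                pairs.append((off, n))
            remaining -= n
        tile += 1
    return pairs
```

For the datatype above, `flatten([(0, 1), (2, 2)], 4, 9)` yields the four pairs from the figure: offsets {0, 2, 6, 10} with lengths {1, 3, 3, 2}.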
Describing Regular Patterns
• List I/O can't describe regular patterns (e.g. a column of a 2D matrix) in an efficient manner
• MPI datatypes can do this easily
• Datatype I/O is our solution to this problem
  – Concise set of datatype constructors used to describe types
  – API for passing these descriptions to a file system
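To see why a regular pattern defeats List I/O, compare two descriptions of a matrix-column read. Both helpers below are hypothetical illustrations: the first builds the offset/length list a List I/O request would need, the second packs the same access into a single vector-style tuple mirroring what an MPI vector datatype encodes.

```python
def column_as_list_io(nrows, ncols, col, elem_size=8):
    """Offset/length pairs a List I/O request needs to read one column of
    a row-major nrows x ncols matrix of elem_size-byte elements: one pair
    per row, so the list grows linearly with the matrix height."""
    return [((r * ncols + col) * elem_size, elem_size) for r in range(nrows)]

def column_as_vector(nrows, ncols, col, elem_size=8):
    """The same access as one constant-size, vector-style descriptor:
    (count, blocklength_bytes, stride_bytes, start_offset)."""
    return (nrows, elem_size, ncols * elem_size, col * elem_size)
```

A 1000x1000 matrix needs a 1000-entry list under List I/O, but only the single four-field descriptor under a datatype-style description.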
Datatype I/O
• Built using a generic datatype processing component (also used in MPICH2)
  – Optimizing for performance
• Prototype for PVFS in progress
  – API and server support
• Prototype of support in ROMIO in progress
  – Maps MPI datatypes to PVFS datatypes
  – Passes them through the new API
• This same generic datatype component could be used in other projects as well
Datatype I/O Example
• Same datatype as in the previous example
• Describe the datatype with one construct:
  – index {(0,1), (2,2)} describes a pattern of one short block and one longer one
  – automatically tiled (as with MPI types for files)
• Linear relationship between the number of contiguous pieces and the size of the request is removed

[Figure: the same datatype tiled three times across file bytes 0-11, described by the single index construct]
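The removed linear relationship can be demonstrated with a small sketch. The helper names and request encodings are hypothetical simplifications: the first counts the worst-case (unmerged) pairs List I/O must ship, the second shows that a datatype I/O request carries only the constructor plus a byte count, regardless of how much data is accessed.

```python
import math

def list_io_request_size(blocks, nbytes):
    """Worst-case number of (offset, length) pairs a List I/O request must
    ship for the first nbytes of a tiled datatype (ignoring the merging of
    adjacent regions, which only removes a constant factor)."""
    data_per_tile = sum(length for _, length in blocks)
    tiles = math.ceil(nbytes / data_per_tile)
    return tiles * len(blocks)            # grows linearly with nbytes

def datatype_io_request(blocks, extent, nbytes):
    """A datatype I/O request ships just the constructor and a byte count;
    its size is independent of nbytes."""
    return {"blocks": blocks, "extent": extent, "nbytes": nbytes}
```

For the index {(0,1), (2,2)} type, a 9-byte read needs 6 unmerged pairs and a 9000-byte read needs 6000, while the datatype description stays the same size for both.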
MPI Hints for Performance
• ROMIO has a number of performance optimizations built in
• The optimizations are somewhat general, but there are tuning parameters that are very specific
  – buffer sizes
  – number and location of processes to perform I/O
  – data sieving and two-phase techniques
• Hints may be used to tune ROMIO to match the system
ROMIO Hints
• Currently all of ROMIO's optimizations may be controlled with hints
  – data sieving
  – two-phase I/O
  – list I/O
  – datatype I/O
• Additional hints are being considered to allow ROMIO to adapt to access patterns
  – collective-only I/O
  – sequential vs. random access
  – inter-file dependencies
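As a sketch of how such hints are supplied: the hint names below are real ROMIO/MPI-IO hints, but the helper itself is hypothetical. In an actual MPI program the same key/value strings would be set on an MPI_Info object and passed to MPI_File_open.

```python
def romio_hints(cb_buffer_mb=16, cb_nodes=4, enable_data_sieving=True):
    """Build a dictionary of ROMIO hint key/value strings (all hint values
    are strings in MPI_Info).  Hint names are from the ROMIO users' guide;
    this helper is only an illustrative sketch."""
    return {
        "cb_buffer_size": str(cb_buffer_mb * 1024 * 1024),  # two-phase buffer
        "cb_nodes": str(cb_nodes),           # processes doing collective I/O
        "romio_ds_read": "enable" if enable_data_sieving else "disable",
        "romio_ds_write": "enable" if enable_data_sieving else "disable",
    }
```

Tuning then becomes a matter of adjusting these values to the machine, e.g. matching `cb_buffer_size` to the file system's preferred transfer size.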
PVFS2
• PVFS (version 1.x.x) plays an important role as a fast scratch file system for use today
• PVFS2 will supersede this version, adding
  – More comprehensive system management
  – Fault tolerance through lazy redundancy
  – Distributed metadata
  – Component-based approach for supporting new storage and network resources
• Distributed metadata and fault tolerance will extend scalability to thousands or tens of thousands of clients and hundreds of servers
• PVFS2 implementation is underway
Summary
• ROMIO and PVFS are a mature foundation on which to make additional improvements
• New, rich I/O descriptions allow for higher performance access
• Addition of new hints to ROMIO allows for fine-tuning its operation
• PVFS2 focuses on the next generation of clusters