Parallel I/O Performance Study
Christian Chilan
The HDF Group
September 9, 2008 SPEEDUP Workshop - HDF5 Tutorial 1
Introduction
• Parallel performance is affected by the I/O access pattern, the file system, and the MPI communication mode.
• Determining how these elements interact provides hints for improving performance.
• The study presents four test cases using h5perf and h5perf_serial.
  • h5perf has been extended to support parallel testing of 2D datasets.
  • h5perf_serial, based on h5perf, allows serial testing of n-dimensional datasets and various file drivers.
• Testing includes various combinations of MPI communication modes and HDF5 storage layouts.
• Finally, we make recommendations that can improve the I/O performance for specific patterns.
Testing Systems and Configuration
System    Architecture                   File System   MPI Implementation
abe       Linux Cluster with Intel 64    Lustre        MVAPICH2 1.0.2p1 Message Passing with Intel compiler
cobalt    ccNUMA with Itanium 2          CXFS          SGI Message Passing Toolkit 1.16
mercury   Linux Cluster with Itanium 2   GPFS          MPICH Myrinet 1.2.5..10, GM 2.0.8, Intel 8.0

Processors        4
Dataset Size      64K×64K (4GB)
I/O Selection     64MB per processor (shape depends on test case)
API               HDF5 v181 (default building options)
Iterations        3
MPI/IO Type       Collective / Independent
Storage Layout    Contiguous / Chunked (chunk size depends on test case)
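
The configuration values can be cross-checked with a little arithmetic. The sketch below assumes that the selections cover the whole dataset exactly once and are shared evenly among the processors; the variable names are illustrative only.

```python
# Sketch: derive the element size and the per-processor workload
# from the configuration listed above.
K = 1024
MB = K * K
GB = K * MB

dataset_shape = (64 * K, 64 * K)  # 64K x 64K elements
dataset_bytes = 4 * GB            # 4GB, as listed
selection_bytes = 64 * MB         # 64MB per I/O selection
processors = 4

elements = dataset_shape[0] * dataset_shape[1]
bytes_per_element = dataset_bytes // elements

# Assumption: selections tile the dataset and are split evenly.
selections_total = dataset_bytes // selection_bytes
selections_per_processor = selections_total // processors

print(bytes_per_element, selections_total, selections_per_processor)
# prints: 1 64 16
```

Under these assumptions each element occupies one byte, and each processor performs 16 selections of 64MB.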
HDF5 Storage Layouts
• Contiguous
  • HDF5 assigns a static contiguous region of storage for raw data.
[Figure: a dataset and its dataset storage.]
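
As a rough illustration of contiguous storage, the sketch below maps a 2D element position to a byte offset within the raw data region, assuming row-major (C) ordering and the one-byte element size implied by a 4GB dataset of 64K×64K elements. The function name is hypothetical.

```python
K = 1024

def contiguous_offset(row, col, ncols=64 * K, elem_size=1):
    """Byte offset of element (row, col) from the start of the raw data.

    Assumes row-major order; elem_size=1 is inferred, not stated.
    """
    return (row * ncols + col) * elem_size

# Neighbours within a row are adjacent in storage ...
print(contiguous_offset(0, 1) - contiguous_offset(0, 0))  # prints: 1
# ... while neighbours within a column are a full row apart.
print(contiguous_offset(1, 0) - contiguous_offset(0, 0))  # prints: 65536
```

This is why the shape of a selection matters: wide selections touch long runs of storage, while tall narrow selections touch many short ones.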
HDF5 Storage Layouts
• Chunked
  • HDF5 defines separate regions of storage for raw data, named chunks, which are pre-allocated in row-major order when a file is created in parallel.
  • This layout is only valid when a file is created and the chunks are pre-allocated. Further modification of the file may cause the chunks to be arranged differently.
Dataset:
C0 C1
C2 C3
Dataset storage:
C0 C1 C2 C3
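
The row-major chunk ordering can be sketched as follows. This is a minimal illustration that only holds while the chunks are still in their pre-allocated order, and it ignores file metadata and the base address of the raw data; the function name is hypothetical.

```python
def chunk_index(row, col, chunk_shape, dataset_shape):
    """Position of the chunk holding element (row, col), counting
    chunks in row-major order (C0, C1, C2, ...)."""
    chunks_per_row = dataset_shape[1] // chunk_shape[1]
    return (row // chunk_shape[0]) * chunks_per_row + (col // chunk_shape[1])

# A 4x4 dataset split into 2x2 chunks, as in the figure:
shape, chunk = (4, 4), (2, 2)
print([chunk_index(r, c, chunk, shape) for r, c in [(0, 0), (0, 2), (2, 0), (3, 3)]])
# prints: [0, 1, 2, 3]
```

Multiplying the index by the chunk size in bytes gives the chunk's relative position in storage under the same assumptions.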
Test Cases
• Case A
  • The transfer selections extend over the entire columns with a size of 64K×1K. If the storage is chunked, the size of the chunks is 1K×1K. The selections are interleaved horizontally with respect to the processors.
P0 P1 P2 P3 P0 P1 P2 P3 … P0 P1 P2 P3
[Figure labels: dataset 64K×64K; each selection is 1K wide.]
Test Cases
• Case B
  • The transfer selection only spans half the columns with a size of 32K×2K. If the storage is chunked, the size of the chunks is 2K×2K. The selections are interleaved horizontally with respect to the processors.
P0 P1 P2 P3 P0 P1 P2 P3 … P0 P1 P2 P3
P0 P1 P2 P3 P0 P1 P2 P3 … P0 P1 P2 P3
[Figure labels: dataset 64K×64K; each selection is 32K×2K.]
Test Cases
• Case C
  • The transfer selections only span half the rows with a size of 2K×32K. If the storage is chunked, the size of the chunks is 2K×2K. The lower dimension (column) is evenly divided among the processors.
P0 P0 … P0 P1 P1 … P1 P2 P2 … P2 P3 P3 … P3
P0 P0 … P0 P1 P1 … P1 P2 P2 … P2 P3 P3 … P3
[Figure labels: dataset 64K×64K; each selection is 2K×32K.]
Test Cases
• Case D
  • The transfer selection extends over the entire rows with a size of 1K×64K. If the storage is chunked, the size of the chunks is 1K×1K. The lower dimension (column) is evenly divided among the processors.
P0 P0 … P0 P1 P1 … P1 P2 P2 … P2 P3 P3 … P3
[Figure labels: dataset 64K×64K; each selection is 1K×64K.]
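
The four cases can be compared side by side with a short calculation. The sketch below takes the selection and chunk shapes stated above and assumes that selections are aligned to chunk boundaries and tile the dataset exactly; the names CASES and summarize are illustrative.

```python
K = 1024
DATASET = (64 * K, 64 * K)

# (selection shape, chunk shape) in elements, as stated for each case.
CASES = {
    "A": ((64 * K, 1 * K), (1 * K, 1 * K)),
    "B": ((32 * K, 2 * K), (2 * K, 2 * K)),
    "C": ((2 * K, 32 * K), (2 * K, 2 * K)),
    "D": ((1 * K, 64 * K), (1 * K, 1 * K)),
}

def summarize(selection, chunk):
    """Return (elements per selection, chunks per selection,
    selections needed to cover the dataset)."""
    elements = selection[0] * selection[1]
    chunks = (selection[0] // chunk[0]) * (selection[1] // chunk[1])
    count = (DATASET[0] // selection[0]) * (DATASET[1] // selection[1])
    return elements, chunks, count

for name, (selection, chunk) in CASES.items():
    print(name, summarize(selection, chunk))
```

Every case moves the same 64M elements per selection and needs 64 selections to cover the dataset. What differs is the shape, and therefore the number of chunks a selection touches: 64 in cases A and D, 16 in cases B and C.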
Access Patterns
• Contiguous
  • Each processor retrieves a separate region of contiguous storage. An example of this pattern is case D using contiguous storage.
• Non-contiguous
  • Separate regions are still assigned to each processor but such regions contain gaps. Examples of this pattern include case C using contiguous storage, and collective cases C-D using chunked storage.
P0 P1 P2 P3
P0 … P1 P1 … P2 P2 … P3 P3 ...P0
Access Patterns
• Interleaved (or overlapped)
  • Each processor writes into many portions that are interleaved with respect to the other processors. For example, using contiguous storage along with cases A-B generates this pattern:
    P0 P1 P2 P3 P0 P1 P2 P3 …
  • Another instance results from using chunked storage with collective cases A-B:
    P0 P1 P2 P3 P0 P1 P2 P3 …
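
One way to see where these patterns come from is to count the contiguous byte runs that a single selection occupies under contiguous storage. The sketch below assumes row-major ordering and one-byte elements; contiguous_runs is a hypothetical helper, not part of h5perf.

```python
K = 1024

def contiguous_runs(selection, dataset_cols=64 * K, elem_size=1):
    """Return (number of runs, bytes per run) for a rows x cols selection
    stored in row-major order."""
    rows, cols = selection
    if cols == dataset_cols:
        # Full-width rows follow each other directly in storage.
        return 1, rows * cols * elem_size
    # Otherwise each selected row is a separate run.
    return rows, cols * elem_size

for name, selection in [("A", (64 * K, 1 * K)), ("B", (32 * K, 2 * K)),
                        ("C", (2 * K, 32 * K)), ("D", (1 * K, 64 * K))]:
    print(name, contiguous_runs(selection))
# prints:
# A (65536, 1024)
# B (32768, 2048)
# C (2048, 32768)
# D (1, 67108864)
```

Cases A and B break each 64MB selection into tens of thousands of small runs that alternate with those of other processors, case C yields fewer and larger runs separated by gaps, and case D is a single run.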
Performance Results and Analysis
• The results correspond to maximum throughput values of Write Open-Close operations during 3 iterations.
• Serial throughput is the performance baseline since our objective is to determine how parallel access can improve performance.
• Unlike GPFS and CXFS, Lustre does not stripe files by default. To enable parallel access, the directory or file must be striped using the lfs command.
I/O Performance in Lustre
[Figure: two plots of throughput in MB/s (log scale, 1 to 1000) for Cases A-D on Lustre, one for contiguous storage and one for chunked storage. Series: ind/non-striped, ind/striped, coll/non-striped, coll/striped.]
I/O Performance in Lustre
• Striping partitions the file space into stripes and assigns them to several Object Storage Targets (OSTs) in round-robin fashion.
• Since each OST stores portions of the file that are different from the other OSTs, they all can access the file in parallel.
• The default configuration on abe uses a stripe size of 4MB and a stripe count of 16.
• Striping improves performance when the I/O request of each processor spans several stripes (and OSTs) after MPI aggregations, if any.
• When the processors make small independent I/O requests that are practically contiguous, as in cases A-B using chunked storage, a single OST can provide better performance due to asynchronous operations.
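
The round-robin assignment of stripes can be sketched as below, using the stripe size and stripe count quoted for abe. The OST numbers are positions within the file's stripe set rather than real OST identifiers, and both function names are hypothetical.

```python
MB = 1024 * 1024

def ost_for_offset(offset, stripe_size=4 * MB, stripe_count=16):
    """Position (0..stripe_count-1) of the OST holding this file offset."""
    return (offset // stripe_size) % stripe_count

def osts_touched(offset, length, stripe_size=4 * MB, stripe_count=16):
    """Sorted OST positions covered by a request of `length` bytes."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return sorted({s % stripe_count for s in range(first, last + 1)})

# A 64MB request spans 16 stripes of 4MB, so it reaches all 16 OSTs.
print(len(osts_touched(0, 64 * MB)))  # prints: 16
# A 1MB request stays inside one stripe and reaches a single OST.
print(osts_touched(0, 1 * MB))        # prints: [0]
```

This matches the observation above: large requests benefit from striping because they are spread over many OSTs, while small requests touch only one OST at a time.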
I/O Performance
[Figure: throughput in MB/s (log scale, 1 to 1000) for Cases A-D on abe. Series: serial/cont, serial/chk, ind/cont, ind/chk, coll/cont, coll/chk.]
I/O Performance
[Figure: throughput in MB/s (log scale, 1 to 1000) for Cases A-D on cobalt. Series: serial/cont, serial/chk, ind/cont, ind/chk, coll/cont, coll/chk.]
I/O Performance
[Figure: throughput in MB/s (log scale, 0.1 to 1000) for Cases A-D on mercury. Series: serial/cont, serial/chk, ind/cont, ind/chk, coll/cont, coll/chk.]
Performance of Serial I/O
• Access using contiguous storage has the steepest performance trend as the cases change from A to D.
• When using chunked storage, the throughput remains almost constant at the upper bound.
• The allocation of chunks at the time they are written causes the access pattern to be virtually contiguous regardless of the test cases.
Performance of Independent I/O
• Processors perform their I/O requests independently from each other.
• For contiguous storage, performance improves as the tests move from A to D.
• For chunked storage, throughput is high for interleaved cases A-B since the written blocks (chunks) become larger and caching is exploited. For cases C-D, the many write requests (one per chunk) multiply the overhead due to unnecessary locking and caching in Lustre and CXFS.
• Unlike these file systems, GPFS has shown better scalability [1,2].
Performance of Collective I/O
• The participating processors coordinate and combine their many requests into fewer I/O operations, reducing latency.
• Since the file space is evenly divided among the processors, there is no need for locking, which reduces overhead [3].
• For contiguous storage, performance is overall high but there is still an increasing trend as the cases change from A to D.
• For chunked storage, the performance is even higher with minor variations among the test cases because several chunks can be written with a single I/O operation.
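
The idea of combining many requests into fewer operations can be illustrated with a simple interval merge. This is only a sketch of the principle, not the two-phase algorithm used by MPI-IO implementations; merge_requests and the request sizes are hypothetical.

```python
def merge_requests(requests):
    """Merge (offset, length) requests that touch or overlap."""
    merged = []
    for offset, length in sorted(requests):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # Extend the previous request to cover this one.
            last_offset, last_length = merged[-1]
            merged[-1] = (last_offset, max(last_length, offset + length - last_offset))
        else:
            merged.append((offset, length))
    return merged

# Four processors each write three 1KB pieces, interleaved P0 P1 P2 P3 ...
requests = [(p * 1024 + r * 4096, 1024) for r in range(3) for p in range(4)]
print(len(requests), merge_requests(requests))
# prints: 12 [(0, 12288)]
```

Twelve small interleaved requests collapse into a single 12KB operation, which is the kind of reduction that makes collective mode attractive for interleaved patterns.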
Conclusion
• It is important to determine the access pattern by analyzing the I/O requirements of the application and the storage implementation.
• For contiguous access patterns, independent access is preferable because it omits unnecessary overhead of collective calls.
• For non-contiguous patterns, there is little difference between independent and collective access. However, writing many chunks in independent mode may be expensive in Lustre and CXFS if caching is not exploited.
• For interleaved access patterns, collective mode is usually faster.
• For all the access patterns, collective mode and chunked storage provide the combination that yields the highest average performance.
References
1. J. Borrill, L. Oliker, J. Shalf, and H. Shan. Investigation of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark. In Proceedings of SC’07: High Performance Networking and Computing, Reno, NV, November 2007.
2. W. Liao, A. Ching, K. Coloma, A. Choudhary, and L. Ward. An Implementation and Evaluation of Client-Side File Caching for MPI-IO. In Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2007, IEEE International Volume, Issue 26-30, pages 1-10, March 2007.
3. R. Thakur, W. Gropp, and E. Lusk. Data Sieving and Collective I/O in ROMIO. In Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation. IEEE Computer Society Press, February 1999.