Parallel I/O Performance Study

Christian Chilan

The HDF Group

September 9, 2008, SPEEDUP Workshop - HDF5 Tutorial

Introduction

• Parallel performance is affected by the I/O access pattern, the file system, and the MPI communication mode.

• Determining how these elements interact provides hints for improving performance.

• The study presents four test cases using h5perf and h5perf_serial.

• h5perf has been extended to support parallel testing of 2D datasets.

• h5perf_serial, based on h5perf, allows serial testing of n-dimensional datasets and various file drivers.

• Testing includes various combinations of MPI communication modes and HDF5 storage layouts.

• Finally, we make recommendations that can improve the I/O performance for specific patterns.

Testing Systems and Configuration

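| System  | Architecture                 | File System | MPI Implementation                                   |
|---------|------------------------------|-------------|------------------------------------------------------|
| abe     | Linux Cluster with Intel 64  | Lustre      | MVAPICH2 1.0.2p1 Message Passing with Intel compiler |
| cobalt  | ccNUMA with Itanium 2        | CXFS        | SGI Message Passing Toolkit 1.16                     |
| mercury | Linux Cluster with Itanium 2 | GPFS        | MPICH Myrinet 1.2.5..10, GM 2.0.8, Intel 8.0         |

| Parameter      | Value                                                  |
|----------------|--------------------------------------------------------|
| Processors     | 4                                                      |
| Dataset Size   | 64K×64K (4GB)                                          |
| I/O Selection  | 64MB per processor (shape depends on test case)        |
| API            | HDF5 v1.8.1 (default build options)                    |
| Iterations     | 3                                                      |
| MPI/IO Type    | Collective / Independent                               |
| Storage Layout | Contiguous / Chunked (chunk size depends on test case) |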


HDF5 Storage Layouts

• Contiguous

• HDF5 assigns a static contiguous region of storage for raw data.

[Figure: a dataset and its single contiguous region of dataset storage.]


HDF5 Storage Layouts

• Chunked

• HDF5 defines separate regions of storage for raw data, called chunks, which are pre-allocated in row-major order when a file is created in parallel.

• This layout is only valid when a file is created and the chunks are pre-allocated. Further modification of the file may cause the chunks to be arranged differently.

[Figure: a dataset split into a 2×2 grid of chunks (C0 C1 / C2 C3), stored in the file in the order C0 C1 C2 C3.]
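The row-major ordering of pre-allocated chunks can be sketched in a few lines of Python. This is only an illustration of the ordering described on the slide: the function names, the `base_offset` parameter, and the idea of computing a byte offset directly are assumptions made for the example, and a real HDF5 file locates chunks through its own index rather than through arithmetic like this.

```python
def chunk_index(chunk_row, chunk_col, chunks_per_row):
    """Position of a chunk in row-major order (C0, C1, C2, ...)."""
    return chunk_row * chunks_per_row + chunk_col


def chunk_offset(chunk_row, chunk_col, chunks_per_row, chunk_bytes, base_offset=0):
    """Hypothetical byte offset of a chunk if chunks are laid out back to back.

    Assumes freshly pre-allocated chunks of equal size with no metadata in
    between, which is a simplification of the real file format.
    """
    return base_offset + chunk_index(chunk_row, chunk_col, chunks_per_row) * chunk_bytes


if __name__ == "__main__":
    # The 2x2 grid from the slide: C0 C1 / C2 C3.
    for row in range(2):
        for col in range(2):
            print(f"chunk at grid ({row}, {col}) is C{chunk_index(row, col, 2)}")
```

With 1K×1K chunks of one-byte elements, for example, `chunk_offset(1, 0, 64, 1024 * 1024)` gives the start of the first chunk in the second chunk row under these assumptions.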


Test Cases

• Case A

• The transfer selections extend over entire columns, with a size of 64K×1K. If the storage is chunked, the size of the chunks is 1K×1K. The selections are interleaved horizontally with respect to the processors.

[Figure: the 64K×64K dataset divided into 64K×1K column strips, assigned in the repeating order P0 P1 P2 P3.]


Test Cases

• Case B

• The transfer selections span only half the length of the columns, with a size of 32K×2K. If the storage is chunked, the size of the chunks is 2K×2K. The selections are interleaved horizontally with respect to the processors.

[Figure: the 64K×64K dataset divided into two rows of 32K×2K blocks, each row assigned in the repeating order P0 P1 P2 P3.]
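The horizontal interleaving used in cases A and B can be sketched as a round-robin assignment of column blocks to processors. This is a minimal illustration, not h5perf code: the function names are hypothetical, and the assumption that block 0 belongs to P0 is read off the figures.

```python
K = 1024
PROCESSORS = 4
DATASET_COLS = 64 * K


def owner_of_column(col, block_cols, processors=PROCESSORS):
    """Rank that owns dataset column `col` when column blocks are dealt round-robin."""
    return (col // block_cols) % processors


def column_blocks_of(rank, block_cols, processors=PROCESSORS, dataset_cols=DATASET_COLS):
    """Indices of the column blocks assigned to `rank`."""
    n_blocks = dataset_cols // block_cols
    return [block for block in range(n_blocks) if block % processors == rank]


if __name__ == "__main__":
    # Case A uses 1K-wide blocks, case B uses 2K-wide blocks.
    for case, block_cols in (("A", 1 * K), ("B", 2 * K)):
        blocks = column_blocks_of(0, block_cols)
        print(f"case {case}: P0 owns {len(blocks)} column blocks, first few {blocks[:4]}")
```

Under these assumptions each processor owns every fourth column block, so its data is never adjacent to more of its own data in the horizontal direction.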


Test Cases

• Case C

• The transfer selections span only half the length of the rows, with a size of 2K×32K. If the storage is chunked, the size of the chunks is 2K×2K. The lower dimension (column) is evenly divided among the processors.

[Figure: the 64K×64K dataset divided into 2K×32K selections, grouped by processor in the order P0 … P0, P1 … P1, P2 … P2, P3 … P3.]


Test Cases

• Case D

• The transfer selections extend over entire rows, with a size of 1K×64K. If the storage is chunked, the size of the chunks is 1K×1K. The lower dimension (column) is evenly divided among the processors.

[Figure: the 64K×64K dataset divided into 1K×64K row strips.]
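As a quick arithmetic check, the four selection shapes all come to the same 64MB per transfer. The sketch below assumes one-byte elements, which is inferred from the stated dataset size (64K×64K = 4GB) rather than given explicitly, and it assumes the whole dataset is written exactly once and shared evenly by the 4 processors.

```python
K = 1024
MB = K * K

ELEMENT_BYTES = 1  # assumption: inferred from 64K x 64K = 4GB
PROCESSORS = 4
DATASET_SHAPE = (64 * K, 64 * K)

# Selection shapes (rows, columns) for each test case, as given in the slides.
SELECTIONS = {
    "A": (64 * K, 1 * K),
    "B": (32 * K, 2 * K),
    "C": (2 * K, 32 * K),
    "D": (1 * K, 64 * K),
}


def nbytes(shape):
    """Size in bytes of a 2D region with the assumed element size."""
    rows, cols = shape
    return rows * cols * ELEMENT_BYTES


def selections_per_processor(case):
    """Number of selections each processor needs to cover its share of the dataset."""
    return nbytes(DATASET_SHAPE) // (nbytes(SELECTIONS[case]) * PROCESSORS)


if __name__ == "__main__":
    for case, shape in SELECTIONS.items():
        print(f"case {case}: {nbytes(shape) // MB} MB per selection, "
              f"{selections_per_processor(case)} selections per processor")
```

Because the transfer size is the same in every case, differences in throughput between the cases come from the shape of the selection and not from the amount of data moved.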


Access Patterns

• Contiguous

• Each processor accesses a separate region of contiguous storage. An example of this pattern is case D using contiguous storage.

• Non-contiguous

• Separate regions are still assigned to each processor, but these regions contain gaps. Examples of this pattern include case C using contiguous storage, and collective cases C-D using chunked storage.

[Figure: storage divided into one region per processor (P0 P1 P2 P3) for the contiguous pattern, and into per-processor regions separated by gaps for the non-contiguous pattern.]


Access Patterns

• Interleaved (or overlapped)

• Each processor writes into many portions that are interleaved with respect to the other processors. For example, using contiguous storage with cases A-B generates this pattern.

• Another instance results from using chunked storage with collective cases A-B.

[Figure: storage divided into small portions assigned in the repeating order P0 P1 P2 P3, shown for both instances.]
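One way to see why the same amount of data produces such different patterns is to count the contiguous byte runs that a single selection occupies in contiguous (row-major) storage. The sketch below is illustrative only; it assumes one-byte elements (inferred from the 4GB dataset size) and the function name is hypothetical.

```python
K = 1024
DATASET_COLS = 64 * K
ELEMENT_BYTES = 1  # assumption: inferred from 64K x 64K = 4GB


def contiguous_runs(sel_rows, sel_cols, dataset_cols=DATASET_COLS):
    """Return (number_of_runs, bytes_per_run) for a rectangular selection.

    In row-major storage a selection that covers full rows is one unbroken
    run; otherwise every selected row is a separate run.
    """
    if sel_cols == dataset_cols:
        return 1, sel_rows * sel_cols * ELEMENT_BYTES
    return sel_rows, sel_cols * ELEMENT_BYTES


if __name__ == "__main__":
    shapes = {"A": (64 * K, 1 * K), "B": (32 * K, 2 * K),
              "C": (2 * K, 32 * K), "D": (1 * K, 64 * K)}
    for case, (rows, cols) in shapes.items():
        runs, run_bytes = contiguous_runs(rows, cols)
        print(f"case {case}: {runs} run(s) of {run_bytes} bytes")
```

Under these assumptions case D is a single 64MB run, case C is a few thousand 32KB runs, and cases A-B break into tens of thousands of runs of 1KB or 2KB that alternate with the runs of the other processors.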


Performance Results and Analysis

• The results correspond to the maximum throughput values of Write Open-Close operations over 3 iterations.

• Serial throughput is the performance baseline since our objective is to determine how parallel access can improve performance.

• Unlike GPFS and CXFS, Lustre does not stripe files by default. To enable parallel access, the directory or file must be striped using the lfs command.


I/O Performance in Lustre

[Figure: throughput in MB/s (log scale) for cases A-D on Lustre, in two panels (contiguous storage and chunked storage). Series: ind/non-striped, ind/striped, coll/non-striped, coll/striped.]

I/O Performance in Lustre

• Striping partitions the file space into stripes and assigns them to several Object Storage Targets (OSTs) in round-robin fashion.

• Since each OST stores portions of the file that are different from the other OSTs, they all can access the file in parallel.

• The default configuration on abe uses a stripe size of 4MB and a stripe count of 16.

• Striping improves performance when the I/O request of each processor spans several stripes (and OSTs) after MPI aggregations, if any.

• When the processors make small independent I/O requests that are practically contiguous, as in cases A-B using chunked storage, a single OST can provide better performance due to asynchronous operations.
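The round-robin assignment of stripes to OSTs can be sketched as follows, using the stripe size and stripe count quoted for abe. This is a simplified model: the function names are hypothetical, and the indices are relative positions within the file's stripe set, not the actual OST numbers chosen by Lustre.

```python
MB = 1024 * 1024
STRIPE_SIZE = 4 * MB   # default stripe size on abe, per the slide
STRIPE_COUNT = 16      # default stripe count on abe, per the slide


def ost_index(offset, stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """Relative OST index holding the byte at `offset` under round-robin striping."""
    return (offset // stripe_size) % stripe_count


def osts_touched(offset, length, stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """Sorted relative OST indices touched by a request of `length` bytes at `offset`."""
    first_stripe = offset // stripe_size
    last_stripe = (offset + length - 1) // stripe_size
    return sorted({stripe % stripe_count for stripe in range(first_stripe, last_stripe + 1)})


if __name__ == "__main__":
    print("64MB request:", len(osts_touched(0, 64 * MB)), "OSTs")
    print("1KB request: ", len(osts_touched(0, 1024)), "OST")
```

In this model a single 64MB request spans all 16 OSTs, while a 1KB request stays on one, which matches the observation that striping helps only when each request covers several stripes.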


I/O Performance


[Figure: throughput in MB/s (log scale) on abe. Series: serial/cont, serial/chk, ind/cont, ind/chk, coll/cont, coll/chk.]


I/O Performance


[Figure: throughput in MB/s (log scale) on cobalt. Series: serial/cont, serial/chk, ind/cont, ind/chk, coll/cont, coll/chk.]


I/O Performance

[Figure: throughput in MB/s (log scale) for cases A-D on mercury. Series: serial/cont, serial/chk, ind/cont, ind/chk, coll/cont, coll/chk.]


Performance of Serial I/O

• Access using contiguous storage has the steepest performance trend as the cases change from A to D.

• When using chunked storage, the throughput remains almost constant at the upper bound.

• The allocation of chunks at the time they are written causes the access pattern to be virtually contiguous regardless of the test cases.


Performance of Independent I/O

• Processors perform their I/O requests independently from each other.

• For contiguous storage, performance improves as the tests move from A to D.

• For chunked storage, throughput is high for interleaved cases A-B because the written blocks (chunks) become larger and caching is exploited. For cases C-D, the many write requests (one per chunk) multiply the overhead due to unnecessary locking and caching in Lustre and CXFS.

• Unlike these file systems, GPFS has shown better scalability [1,2].


Performance of Collective I/O

• The participating processors coordinate and combine their many requests into fewer I/O operations, reducing latency.

• Since the file space is evenly divided among the processors, there is no need for locking, which reduces overhead [3].

• For contiguous storage, performance is high overall, but there is still an increasing trend as the cases change from A to D.

• For chunked storage, the performance is even higher, with minor variations among the test cases, because several chunks can be written with a single I/O operation.
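The idea of combining many small requests into fewer I/O operations can be illustrated by merging adjacent byte ranges. This is a toy sketch, not the two-phase collective I/O implemented by MPI-IO libraries: the `coalesce` function is hypothetical, and it only shows why interleaved requests from several processors can collapse into one large operation.

```python
def coalesce(requests):
    """Merge (offset, length) requests that touch or overlap into larger ones."""
    merged = []
    for offset, length in sorted(requests):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # This request starts inside or right after the previous one: extend it.
            last_offset, last_length = merged[-1]
            end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, end - last_offset)
        else:
            merged.append((offset, length))
    return merged


if __name__ == "__main__":
    # Four ranks each issue four interleaved 1KB writes, as in cases A-B.
    piece = 1024
    requests = [((4 * i + rank) * piece, piece) for rank in range(4) for i in range(4)]
    merged = coalesce(requests)
    print(f"{len(requests)} requests become {len(merged)} operation(s)")
```

Requests separated by gaps stay separate in this sketch, so the benefit is largest for the interleaved pattern, where neighbouring pieces belong to different processors.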


Conclusion

• It is important to determine the access pattern by analyzing the I/O requirements of the application and the storage implementation.

• For contiguous access patterns, independent access is preferable because it avoids the unnecessary overhead of collective calls.

• For non-contiguous patterns, there is little difference between independent and collective access. However, writing many chunks in independent mode may be expensive in Lustre and CXFS if caching is not exploited.

• For interleaved access patterns, collective mode is usually faster.

• Across all the access patterns, collective mode combined with chunked storage yields the highest average performance.


References

1. J. Borrill, L. Oliker, J. Shalf, and H. Shan. Investigation of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark. In Proceedings of SC’07: High Performance Networking and Computing, Reno, NV, November 2007.

2. W. Liao, A. Ching, K. Coloma, A. Choudhary, and L. Ward. An Implementation and Evaluation of Client-Side File Caching for MPI-IO. In Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2007, IEEE International Volume, Issue 26-30, pages 1-10, March 2007.

3. R. Thakur, W. Gropp, and E. Lusk. Data Sieving and Collective I/O in ROMIO. In Proceedings of the 7th Symposium of the Frontiers of Massively Parallel Computation. IEEE Computer Society Press, February 1999.

