I/O performance overview
• Main factors in performance
• Know your I/O
• Striping
• Data layout
• Collective I/O
I/O performance
• Length of each basic operation matters
• High throughput, but high latency
• Avoid small reads / writes (see the sketch below)
• Locality matters
• Avoid jumping around in the file
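To make the small-operation point concrete, here is a rough sketch (not from the slides; the file name, record size, and buffer size are made up): collect many small records in a memory buffer and write them out a few MiB at a time instead of issuing one tiny write per record.

/* Sketch: aggregate small records into large writes.
 * Hypothetical example; "records.dat", RECORD_SIZE, and the
 * buffer size are placeholders for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RECORD_SIZE 64              /* one small record: 64 bytes */
#define RECORDS_PER_BUFFER 65536    /* 64 B * 65536 = 4 MiB per write */

int main(void) {
  FILE *f = fopen("records.dat", "wb");
  if (!f) return 1;

  char *buffer = malloc((size_t)RECORD_SIZE * RECORDS_PER_BUFFER);
  size_t filled = 0;                /* records currently buffered */

  for (long i = 0; i < 1000000; i++) {
    char record[RECORD_SIZE];
    memset(record, 'x', sizeof record);   /* stand-in for real data */

    /* Instead of writing each 64-byte record separately, copy it into
       the buffer and write 4 MiB at a time. */
    memcpy(buffer + filled * RECORD_SIZE, record, RECORD_SIZE);
    if (++filled == RECORDS_PER_BUFFER) {
      fwrite(buffer, RECORD_SIZE, filled, f);
      filled = 0;
    }
  }
  if (filled) fwrite(buffer, RECORD_SIZE, filled, f);  /* flush the rest */

  free(buffer);
  fclose(f);
  return 0;
}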
Single process I/O size
[Plot: Contiguous I/O operations. Throughput (MiB/s) vs. I/O operation size, 32 KiB to 16 MiB, write and read]
Single process I/O size
[Plot: Contiguous I/O operations, log scale. Throughput (MiB/s) vs. I/O operation size, 4 B to 16 MiB, write and read; 500 MiB/s marked]
Single process random access
[Plot: I/O operations at random offsets. Throughput (MiB/s, log scale) vs. I/O operation size, 4 B to 16 MiB, write and read]
Know your I/O
• How fast (or slow) is it?
• How is the data organized on disk?
• N-dimensional array?
• Text or binary?
• Which subset of the data for each process?
Darshan I/O analysis
• Darshan collects logs of all I/O
• On by default; disable with "module unload darshan"
• Logs are in /projects/monitoring_data/darshan/YYYY-MM/
• Only writes logs on normal exit (needs call to MPI_Finalize)
Darshan I/O analysis
[Figure]
Darshan commands
• darshan-job-summary.pl: generates a PDF report of the whole job
• darshan-summary-per-file.sh: generates a per-file report
• darshan-parser: extracts lots of raw data
Darshan 3 not backward-compatible with 2
• Darshan 2: *.darshan.gz
• Darshan 3: *.darshan

module swap darshan/3.1.3 darshan/2.3.0.1
module unload gnuplot/5.0.5
Parallel I/O striping
• Blue Waters uses the Lustre file system
• 360 OSTs, essentially 360 independent file servers
• Striping parameters for each file:
• Size: length of each stripe
• Count: number of file servers to use
[Diagram: a file's stripes spread across OST0, OST1, OST2, OST3, repeating]
Striping with Lustre
• lfs getstripe <file or dir>
• Prints the striping parameters for a file

$ lfs getstripe foo4
foo4
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  287
        obdidx       objid       objid   group
           287    59540013   0x38c822d       0
            11    65961401   0x3ee7db9       0
           151    65962343   0x3ee8167       0
            71    65963432   0x3ee85a8       0
Striping with Lustre
• lfs setstripe -s <size> -c <count> <file or dir>
• Directories: new files inherit the striping parameters
• Files: striping is set at creation time
• lfs setstripe on a nonexistent file creates it

lfs setstripe -s 1M -c 16 foo16
cp foo1 foo16
Set striping in MPI IO
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "64");
MPI_Info_set(info, "striping_unit", "1048576");
MPI_File_open(..., info, ...);
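A more complete, hedged version of the snippet above (the file name and hint values are placeholders; the hint names are the ones shown on the slide, and striping can only be applied when the file is first created):

/* Sketch: create a file with MPI-IO striping hints.
 * "output.dat" and the hint values are placeholders. */
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_File fh;
  MPI_Info info;

  MPI_Init(&argc, &argv);

  MPI_Info_create(&info);
  MPI_Info_set(info, "striping_factor", "64");    /* stripe count */
  MPI_Info_set(info, "striping_unit", "1048576"); /* stripe size: 1 MiB */

  MPI_File_open(MPI_COMM_WORLD, "output.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

  /* ... write data, e.g. with MPI_File_write_at_all ... */

  MPI_File_close(&fh);
  MPI_Info_free(&info);
  MPI_Finalize();
  return 0;
}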
Striping on Blue Waters
• Maximum stripe count: 36 on home / projects, 360 on scratch
• Can't set > 36 on scratch directly (bug?)
• Solution: set the stripe count on a directory; new files inherit it

lfs setstripe -c 360 mydata/
touch mydata/foo
Striped file throughput
[Plot: single-file throughput (GiB/s) vs. stripe count, 1 to 128, reaching about 43 GiB/s. Note: old data]
Stripe length: bigger is better
[Plot: time (seconds) to write MILC data vs. file size (GiB) with 4 MiB, 16 MiB, and 64 MiB stripes; up to a factor of 7.4x difference]
Stripe length: doesn't matter?
[Plot: time (seconds) and throughput (GiB/s) vs. stripe length, 1 to 32 MiB]
Striping with many files
• Leave stripe count = 1: the full file lives on one file server
• Each file is assigned to a random file server (sketch below)
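A minimal file-per-process sketch (the "out.%05d" naming scheme is made up): each rank opens and writes its own file, and the default stripe count of 1 is fine because the many files are themselves spread over the file servers.

/* Sketch: one output file per MPI rank (hypothetical name pattern). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  char name[64];
  snprintf(name, sizeof name, "out.%05d", rank);  /* e.g. out.00042 */

  FILE *f = fopen(name, "wb");
  if (f) {
    /* ... write this rank's data in large contiguous chunks ... */
    fclose(f);
  }

  MPI_Finalize();
  return 0;
}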
One file per process
[Plot: file-per-process throughput (GiB/s) vs. process count, from 1 to 256 nodes]
One file per process
[Plot: file-per-process throughput (GiB/s) vs. process count on 128 and 256 nodes; peak 153 GiB/s]
Data layout
• Common pattern: N-dimensional array
Data layout
• Tile on each process
[Diagram: 2D array divided into a 4 x 4 grid of tiles, numbered 0 through 15, one per process]
I/O for a tile
• Many small accesses (eek!), as sketched below
[Diagram: rows of tiles P0, P1, and P2 are interleaved in the file, so each tile is accessed in many small, scattered pieces]
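To see where the small accesses come from, here is a hedged sketch (array size, tile size, and file name are made up) of reading one tile row by row from a row-major array of doubles; every tile row becomes a separate ~8 KiB read at a scattered offset.

/* Sketch: naive tile read, one pread() per tile row (sizes are made up).
 * The global array is GLOBAL_ROWS x GLOBAL_COLS doubles, row-major;
 * this process owns a TILE_ROWS x TILE_COLS tile starting at (row0, col0). */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define GLOBAL_ROWS 4096
#define GLOBAL_COLS 4096
#define TILE_ROWS   1024
#define TILE_COLS   1024

int main(void) {
  int fd = open("mesh.dat", O_RDONLY);   /* hypothetical file name */
  if (fd < 0) return 1;

  int row0 = 1024, col0 = 2048;          /* this tile's corner (made up) */
  double *tile = malloc(sizeof(double) * TILE_ROWS * TILE_COLS);

  /* TILE_ROWS reads of only TILE_COLS * 8 = 8 KiB each, at offsets that
     jump GLOBAL_COLS * 8 bytes apart: many small, scattered operations,
     exactly the slow pattern from the earlier throughput plots. */
  for (int r = 0; r < TILE_ROWS; r++) {
    off_t offset = ((off_t)(row0 + r) * GLOBAL_COLS + col0) * sizeof(double);
    pread(fd, tile + (size_t)r * TILE_COLS,
          (size_t)TILE_COLS * sizeof(double), offset);
  }

  free(tile);
  close(fd);
  return 0;
}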
Tile solution: collective I/O
[Diagram: each process's tile in memory (P0 to P3) is redistributed so that disk access is a few large contiguous regions: P0, then P1, then P2, then P3]
• Small number of large I/O operations
• Redistribute data in memory
Collective I/O
• Built into MPI-IO
• Describe the shape of the data with MPI datatypes (see the sketch below)
• Use the Mesh IO library to make it easier
• Parallel HDF5
• Parallel NetCDF
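A minimal sketch of the collective approach with plain MPI-IO (the array size, the 2 x 2 process decomposition, and the file name are placeholders): each process describes its tile as a subarray, installs it as the file view, and uses a collective read so the library can merge the tiles into a few large accesses.

/* Sketch: collective read of one tile per process with MPI-IO datatypes.
 * Assumes exactly 4 processes in a 2 x 2 grid over a 4096 x 4096 array
 * of doubles; sizes and "mesh.dat" are hypothetical. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  if (nprocs != 4) { MPI_Finalize(); return 1; }

  int gsizes[2] = {4096, 4096};                       /* whole mesh */
  int lsizes[2] = {2048, 2048};                       /* this tile  */
  int starts[2] = {(rank / 2) * 2048, (rank % 2) * 2048};

  MPI_Datatype filetype;
  MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                           MPI_ORDER_C, MPI_DOUBLE, &filetype);
  MPI_Type_commit(&filetype);

  double *tile = malloc(sizeof(double) * lsizes[0] * lsizes[1]);

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "mesh.dat", MPI_MODE_RDONLY,
                MPI_INFO_NULL, &fh);
  MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

  /* Collective call: the MPI library redistributes data among processes
     so the file is touched with a few large, contiguous operations. */
  MPI_File_read_all(fh, tile, lsizes[0] * lsizes[1], MPI_DOUBLE,
                    MPI_STATUS_IGNORE);

  MPI_File_close(&fh);
  MPI_Type_free(&filetype);
  free(tile);
  MPI_Finalize();
  return 0;
}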
Mesh I/O
int Mesh_IO_read(MPI_File fh,
                 MPI_Offset offset,
                 MPI_Datatype etype,
                 int file_endian,
                 void *buf,
                 int ndims,
                 const int *mesh_sizes,
                 const int *file_mesh_sizes,
                 const int *file_mesh_starts,
                 int file_array_order,
                 const int *memory_mesh_sizes,
                 const int *memory_mesh_starts,
                 int memory_array_order);
• For N-dimensional meshes
• Describe the size of the full mesh and the submesh you want
• Collective operation
• Matching write function
• github.com/oshkosher/meshio
XPACC - tiles without collective I/O
• Constant data size (1 GB)
• Runtime grows with process count
• 4500 seconds on 4096 processes
XPACC - tiles with collective I/O
• Consistently under 1 second
Compression
• Highly compressible data: minimize I/O
• Bioinformatics (AGTCTGTCTTGC…)
• Example file: 40 GiB
• Compresses to 275 MiB (151x)
• Serial scan in 9.6 s (4.2 GiB/s)
• Tools: github.com/oshkosher/bioio
• zchunk: read in chunks, by offset + length
• zlines: read lines of text, by line number
Summary
• Make all reads/writes at least 64 KiB
• Best performance: many files
• With a single file, enable striping
• Use collective I/O to avoid small accesses
Questions / corrections / [email protected]