Page 1: Scalable I/O - Blue Waters

Scalable I/O

Ed Karrels, [email protected]

Page 2: Scalable I/O - Blue Waters

I/O performance overview

• Main factors in performance
  • Know your I/O
  • Striping
  • Data layout
  • Collective I/O

2 of 32

Page 3: Scalable I/O - Blue Waters

I/O performance

• Length of each basic operation
  • High throughput, high latency
  • Avoid using small reads / writes (see the sketch below)

• Locality matters
  • Avoid jumping around in the file

3 of 32
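The sketch below (an editorial addition, not from the slides) makes the "avoid small reads / writes" point concrete in C: instead of issuing one small write() per record, records are packed into a large buffer and flushed in big chunks. The record size, buffer size, and file name are arbitrary assumptions.

/* Sketch: aggregate many small records into large write() calls.
   RECORD_SIZE, BUF_SIZE, and the file name are illustrative assumptions. */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define RECORD_SIZE 64               /* tiny record: slow if written one at a time */
#define BUF_SIZE (4 * 1024 * 1024)   /* flush in 4 MiB chunks instead */

int main(void) {
    int fd = open("output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    char *buf = malloc(BUF_SIZE);
    if (!buf) return 1;
    size_t used = 0;
    char record[RECORD_SIZE];
    memset(record, 'x', sizeof record);

    for (long i = 0; i < 1000000; i++) {
        /* Rather than write(fd, record, RECORD_SIZE) here, which would be
           a million small I/O operations, accumulate and write 4 MiB at a time. */
        if (used + RECORD_SIZE > BUF_SIZE) {
            write(fd, buf, used);
            used = 0;
        }
        memcpy(buf + used, record, RECORD_SIZE);
        used += RECORD_SIZE;
    }
    if (used) write(fd, buf, used);   /* flush whatever is left */

    free(buf);
    close(fd);
    return 0;
}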

Page 4: Scalable I/O - Blue Waters

Single process I/O size

[Plot: Contiguous I/O operations, single process. Throughput (MiB/s, 400-700) vs. I/O operation size (32K to 16M), write and read curves.]

4 of 32

Page 5: Scalable I/O - Blue Waters

Single process I/O size

[Plot: Contiguous I/O operations, single process. Throughput (MiB/s, log scale 0.1-1000) vs. I/O operation size (4 bytes to 16M), write and read curves. Annotation: 500 MiB/s.]

5 of 32

Page 6: Scalable I/O - Blue Waters

Single process random access

[Plot: I/O operations at random offsets, single process. Throughput (MiB/s, log scale 0.0001-1000) vs. I/O operation size (4 bytes to 16M), write and read curves.]

6 of 32

Page 7: Scalable I/O - Blue Waters

Know your I/O

• How fast (or slow) is it?
• How is the data organized on disk?
  • N-dimensional array?
  • Text or binary?
• Which subset of the data for each process?

7 of 32

Page 8: Scalable I/O - Blue Waters

Darshan I/O analysis

• Darshan collects logs of all I/O
• On by default
  • disable with "module unload darshan"
• Logs are in /projects/monitoring_data/darshan/YYYY-MM/
• Only writes logs on normal exit (needs call to MPI_Finalize)

8 of 32

Page 9: Scalable I/O - Blue Waters

Darshan I/O analysis

9 of 32

Page 10: Scalable I/O - Blue Waters

Darshan commands

• darshan-job-summary.pl
  • Generate PDF report of the whole job
• darshan-summary-per-file.sh
  • Generate per-file report
• darshan-parser
  • Extract lots of raw data

10 of 32

Page 11: Scalable I/O - Blue Waters

Darshan 3 not backward-compatible with 2

• Darshan 2: *.darshan.gz
• Darshan 3: *.darshan

module swap darshan/3.1.3 darshan/2.3.0.1
module unload gnuplot/5.0.5

11 of 32

Page 12: Scalable I/O - Blue Waters

Parallel I/O striping

• Blue Waters uses Lustre file system
  • 360 OSTs
  • Essentially 360 independent file servers
• Striping parameters for each file
  • Size: length of each stripe
  • Count: number of file servers to use

[Diagram: a file's stripes laid out round-robin across OST0, OST1, OST2, OST3, then wrapping back to OST0-OST3.]

12 of 32

Page 13: Scalable I/O - Blue Waters

Striping with Lustre

• lfs getstripe <file or dir>
  • Print striping parameters for a file

$ lfs getstripe foo4
foo4
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  287
        obdidx          objid           objid           group
           287       59540013       0x38c822d               0
            11       65961401       0x3ee7db9               0
           151       65962343       0x3ee8167               0
            71       65963432       0x3ee85a8               0

13 of 32

Page 14: Scalable I/O - Blue Waters

Striping with Lustre

• lfs setstripe -s <size> -c <count> <file or dir>

• Directories
  • New files will inherit striping parameters

• Files
  • Set at creation time
  • lfs setstripe on a nonexistent file creates it

lfs setstripe -s 1M -c 16 foo16
cp foo1 foo16

14 of 32

Page 15: Scalable I/O - Blue Waters

Set striping in MPI IO

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "64");
MPI_Info_set(info, "striping_unit", "1048576");
MPI_File_open(..., info, ...);

15 of 32
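For context, here is a self-contained sketch (an editorial addition, not from the slides) of creating a striped file with these hints and then checking what the library actually applied; the file name, hint values, and error handling are assumptions.

/* Sketch: create a file with Lustre striping hints via MPI-IO and report
   the hint the library actually applied. Names and values are placeholders. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "64");     /* stripe count */
    MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MiB stripes */

    MPI_File fh;
    int err = MPI_File_open(MPI_COMM_WORLD, "striped_output.dat",
                            MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    if (err != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, 1);

    /* Striping hints only take effect when the file is created; check them. */
    MPI_Info used;
    char value[MPI_MAX_INFO_VAL + 1];
    int flag;
    MPI_File_get_info(fh, &used);
    MPI_Info_get(used, "striping_factor", MPI_MAX_INFO_VAL, value, &flag);
    if (rank == 0 && flag) printf("striping_factor = %s\n", value);

    MPI_Info_free(&used);
    MPI_Info_free(&info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}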

Page 16: Scalable I/O - Blue Waters

Striping on Blue Waters

• Maximum stripe count
  • home / projects: 36
  • scratch: 360

• Can’t set > 36 on scratch directly (bug?)
• Solution: set on directory, inherit

lfs setstripe -c 360 mydata/
touch mydata/foo

16 of 32

Page 17: Scalable I/O - Blue Waters

Striped file throughput

[Plot: Throughput (GiB/s, 0.3-100) vs. stripe count (1 to 128), single file, striped; reaches 43 GiB/s. *Note: old data.]

17 of 32

Page 18: Scalable I/O - Blue Waters

Stripe length – bigger is better

[Plot: Time to write MILC data. Time in seconds (2-100) vs. file size in GiB (10 to 100), with 4 MiB, 16 MiB, and 64 MiB stripes. Annotation: factor of 7.4x.]

18 of 32

Page 19: Scalable I/O - Blue Waters

Stripe length – doesn’t matter?

[Plot: Throughput in GiB/s (and time in seconds) vs. stripe length in MiB (1 to 32); values stay roughly between 6 and 15, with little variation across stripe lengths.]

19 of 32

Page 20: Scalable I/O - Blue Waters

Striping with many files

• Leave stripe count = 1
  • Full file on one file server
  • Each one assigned to a random file server

20 of 32

Page 21: Scalable I/O - Blue Waters

One file per process

[Plot: File-per-process throughput. Throughput (GiB/s, 0.1-1000) vs. process count (1 to 10000), one curve each for 1, 2, 4, 8, 16, 32, 64, 128, and 256 nodes.]

21 of 32

Page 22: Scalable I/O - Blue Waters

One file per process

[Plot: File-per-process throughput. Throughput (GiB/s, 100-160) vs. process count (1000 to 8000) for 128 and 256 nodes. Peak: 153 GiB/s.]

22 of 32

Page 23: Scalable I/O - Blue Waters

Data layout

• Common pattern: N-dimensional array

23 of 32

Page 24: Scalable I/O - Blue Waters

Data layout

• Tile on each process

[Diagram: the global array divided into a 4x4 grid of tiles, numbered 0-15, one per process:
   0  1  2  3
   4  5  6  7
   8  9 10 11
  12 13 14 15 ]

24 of 32

Page 25: Scalable I/O - Blue Waters

I/O for a tile

• Many small accesses (eek!)

[Diagram: file offsets for one row of tiles; the data owned by P0, P1, P2 is interleaved in the file, so each process touches many small pieces.]

25 of 32

Page 26: Scalable I/O - Blue Waters

Tile solution: collective I/O

[Diagram: "Disk access" shows contiguous ranges of the file assigned to P0, P1, P2, P3; "Memory layout" shows the 2x2 arrangement of tiles on P0, P1, P2, P3.]

• Small number of large I/O ops
• Redistribute data in memory

26 of 32

Page 27: Scalable I/O - Blue Waters

Collective I/O

• Built into MPI-IO
• Describe shape of data with MPI datatypes (see the sketch below)
• Use Mesh IO library to make it easier
• Parallel HDF5
• Parallel NetCDF

27 of 32
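To show what "describe shape of data with MPI datatypes" looks like in practice, here is a minimal collective-write sketch (an editorial addition, not from the slides): each process defines its tile as a subarray of the global mesh, sets a file view, and issues a collective write so the MPI-IO layer can merge the tiles into a small number of large operations. The tile size, process grid, and file name are assumptions.

/* Sketch: each process writes its tile of a 2-D global array with one
   collective call. Global size, decomposition, and file name are
   illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

#define TILE 256   /* each process owns a TILE x TILE block of doubles */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Factor the processes into a 2-D grid (rows x cols). */
    int dims[2] = {0, 0};
    MPI_Dims_create(nprocs, 2, dims);
    int prow = rank / dims[1];
    int pcol = rank % dims[1];

    /* Global array and this process's tile within it. */
    int gsizes[2] = {dims[0] * TILE, dims[1] * TILE};
    int lsizes[2] = {TILE, TILE};
    int starts[2] = {prow * TILE, pcol * TILE};

    /* Describe the tile as a subarray of the global array. */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    /* Fill the local tile with something recognizable. */
    double *tile = malloc((size_t)TILE * TILE * sizeof *tile);
    for (long i = 0; i < (long)TILE * TILE; i++) tile[i] = rank;

    /* Open the file and set the view so this process's offsets skip other tiles. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "mesh.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective write: MPI-IO redistributes data and issues large I/O ops. */
    MPI_File_write_all(fh, tile, TILE * TILE, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(tile);
    MPI_Finalize();
    return 0;
}

This is roughly the pattern that the Mesh IO library on the next page wraps behind a single call.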

Page 28: Scalable I/O - Blue Waters

Mesh I/O

int Mesh_IO_read(MPI_File fh,
                 MPI_Offset offset,
                 MPI_Datatype etype,
                 int file_endian,
                 void *buf,
                 int ndims,
                 const int *mesh_sizes,
                 const int *file_mesh_sizes,
                 const int *file_mesh_starts,
                 int file_array_order,
                 const int *memory_mesh_sizes,
                 const int *memory_mesh_starts,
                 int memory_array_order);

• For N-dimensional meshes
• Describe size of full mesh and the submesh you want
• Collective operation
• Matching write function
• github.com/oshkosher/meshio

28 of 32

Page 29: Scalable I/O - Blue Waters

XPACC - tiles without collective I/O

• Constant data size (1 GB)
• Run time grows with process count
• 4500 seconds on 4096 processes

29 of 32

Page 30: Scalable I/O - Blue Waters

XPACC - tiles with collective I/O

• Consistently under 1 second

30 of 32

Page 31: Scalable I/O - Blue Waters

Compression

• Highly compressible data – minimize I/O
  • Bioinformatics (AGTCTGTCTTGC…)

• Example file: 40 GiB
  • Compress to 275 MiB (151x)
  • Serial scan in 9.6 s (4.2 GiB/s)

• Tools - github.com/oshkosher/bioio
  • zchunk – read in chunks: offset+length
  • zlines – read lines of text: line number

31 of 32

Page 32: Scalable I/O - Blue Waters

Summary

• All reads/writes at least 64k
• Best performance: many files
• With single file, enable striping
• Use collective I/O to avoid small accesses

32 of 32

Questions / corrections / [email protected]

