HPC Challenge 07: X10
Vijay [email protected]
November 13, 2007
IBM Research
This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
Teams
Application
– Tong Wen (WAT)
– Mark Stephenson (AUS)
– Vipin Sachdeva (AUS)
– Guojing Cong (WAT)

Compiler
– Igor Peshansky (WAT)
– Krishna Venkata (IRL)
– Pradeep Varma (IRL)
– Nathaniel Nystrom (WAT)

Runtime
– Sreedhar Kodali (ISL)
– Ganesh Bikshandi (ISL)
– Sriram Krishnamurthy (WAT)
– Sayantan Sur (WAT)

BG Port
– Jose Castanos (WAT)

ARMCI Port
– Vinod Tipparaju (PNNL)
Special thanks to Calin Cascaval, John Gunnels and Doug Lea.
What is X10?
– Based on sequential Java, with extensions for concurrency, distribution and arrays
– Supports fine-grained asynchrony
– Supports simple constructs for atomicity and ordering
– Supports a partitioned shared-memory model (places)
– Clean foundations
  • Advanced type system
  • Determinate and deadlock-free subsets
– Designed for multicore and clusters, for both commercial applications and HPC
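
A minimal illustration of these constructs (a hypothetical fragment, not from the submitted codes; counter is an assumed distributed array):

final int[.] counter = new int[dist.UNIQUE];  // one counter element per place
finish ateach (point [p] : dist.UNIQUE) {     // start one activity at each place
  async {                                     // fine-grained local asynchrony
    atomic counter[p] += 1;                   // atomic section for an isolated update
  }
}                                             // finish awaits global termination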
How is it implemented?
X10 Compiler Architecture (diagram):
– X10 Frontend (Polyglot) parses X10 source into an AST.
– The X10 Backend lowers the AST into three code generators:
  • Java backend: emits Java/.class files, runs on a JVM (all platforms)
  • C++ backend: emits C/C++, compiled with GCC/xlC
  • IR backend: IR codegen emits C/C++, runs over LAPI
Stream in X10 (LOC=47)

finish async {
  final clock clk = clock.factory.clock();
  // One activity per place, all registered on clock clk:
  ateach (point [p] : dist.UNIQUE) clocked (clk) {
    final double[:rail]
      a = new double[R],
      b = new double[R] (point [i]) { return 1.5*i; },
      c = new double[R] (point [i]) { return 2.5*i; };
    for (int j = 0; j < NUM_TIMES; j++) {
      if (p == 0) times[j] = -mySecond();
      for (point [i] : R) a[i] = b[i] + alpha*c[i];  // triad
      next;  // clock barrier: all places complete this trial before timing stops
      if (p == 0) times[j] += mySecond();
    }
    for (point [i] : R)  // verification
      if (a[i] != b[i] + alpha*c[i])
        async (place.FIRST_PLACE) clocked (clk) verified[0] = false;
  }
}
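
mySecond() above is a timer helper not shown on the slide. A minimal sketch, assuming the STREAM benchmark's convention of returning wall-clock seconds (X10's Java-based sequential core can call Java library methods):

static double mySecond() {
  // Wall-clock time in seconds.
  return System.nanoTime() * 1.0e-9;
}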
Stream Performance
[Chart: Stream on Power5 SMPs. GB/s (0 to 4000) vs. number of processes (1 to 1024); series: X10, MPI. 64 nodes (16-way Power5+, 1.9GHz, 64GB).]

[Chart: Stream on BG/W. GB/sec (0 to 35000) vs. number of racks (1 to 16); series: X10 '07, UPC '06.]

BG/W note: UPC '06 uses xlc; X10 uses g++.
FT Code (LOC=137)
finish ateach (point [p] : UNIQUE) {
  final int numLocalRows = SQRTN/NUM_PLACES;
  int rowStartA = p*numLocalRows;                       // this place's rows
  region block = [0:numLocalRows-1, 0:numLocalRows-1];
  for (int k = 0; k < NUM_PLACES; k++) {                // for each block
    int colStartA = k*numLocalRows;
    … transpose locally …
    for (int i = 0; i < numLocalRows; i++) {
      final int srcIdx  = 2*((rowStartA + i)*SQRTN + colStartA),
                destIdx = 2*(SQRTN*(colStartA + i) + rowStartA);
      // Push one row of the transposed block to place k asynchronously:
      async (UNIQUE[k])
        Runtime.arrayCopy(Y, srcIdx, Z, destIdx, 2*numLocalRows);
    }
  }
}
The actual FFTs are done through calls to (sequential) FFTW.
Key routine: the global transpose.
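
For context, an outline of the standard transpose-based decomposition that makes the global transpose the key routine (a sketch of the well-known algorithm, not the slide's actual driver code):

// Distributed 1-D FFT of length N = SQRTN*SQRTN, viewed as a SQRTN x SQRTN
// matrix with numLocalRows rows per place:
// 1. global transpose                          (communication, as shown above)
// 2. FFT each local row with sequential FFTW   (computation only)
// 3. scale element (i,j) by the twiddle factor w^(i*j),
//    w a primitive N-th root of unity          (computation only)
// 4. global transpose                          (communication)
// 5. FFT each local row with sequential FFTW
// 6. global transpose                          (communication)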
FT Performance
32 nodes (16-way Power5+, 1.9GHz, 64GB)
BG/W: X10 uses g++. Runs on more racks are in progress.
(UPC: 51 GF/s for one rack.)
Random Access (LOC=79)
Core algorithm:

static void RandomAccessUpdate(final int logLocalTableSize, final long[:rail] Table) {
  finish ateach (point [p] : dist.UNIQUE) {
    final long localTableSize = 1 << logLocalTableSize,
               TableSize      = localTableSize*NUM_PLACES,
               mask           = TableSize-1,
               NumUpdates     = 4*localTableSize;
    long ran = HPCC_starts(p*NumUpdates);  // jump into the pseudo-random stream
    for (long i = 0; i < NumUpdates; i++) {
      final long temp = ran;
      final int index = (int)(temp & mask);
      // Fire-and-forget update at the place owning Table[index];
      // @aggregate lets the runtime batch these small messages:
      @aggregate async (dist.UNIQUE[index/(int)(TableSize/NUM_PLACES)])
        atomic Table[index] ^= temp;
      ran = (ran << 1) ^ ((long) ran < 0 ? POLY : 0);  // LFSR step
    }
  }
}
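
HPCC_starts(n), called above to jump to the n-th element of the pseudo-random stream, is not shown on the slide. A sketch following the standard HPCC reference implementation (POLY and PERIOD as defined there):

static final long POLY = 0x7L;
static final long PERIOD = 1317624576693539401L;

static long HPCC_starts(long n) {
  while (n < 0) n += PERIOD;
  while (n > PERIOD) n -= PERIOD;
  if (n == 0) return 0x1L;
  long temp = 0x1L;
  final long[] m2 = new long[64];
  for (int i = 0; i < 64; i++) {   // table of repeated squarings of x
    m2[i] = temp;
    temp = (temp << 1) ^ (temp < 0 ? POLY : 0);
    temp = (temp << 1) ^ (temp < 0 ? POLY : 0);
  }
  int i = 62;
  while (i >= 0 && ((n >> i) & 1) == 0) i--;  // highest set bit of n
  long ran = 0x2L;
  while (i > 0) {                  // square-and-multiply over GF(2)[x]/POLY
    temp = 0;
    for (int j = 0; j < 64; j++)
      if (((ran >> j) & 1) != 0) temp ^= m2[j];
    ran = temp;
    i -= 1;
    if (((n >> i) & 1) != 0)
      ran = (ran << 1) ^ (ran < 0 ? POLY : 0);
  }
  return ran;
}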
Performance
[Chart: Random Access on BG/W. GUPS (0 to 30) vs. number of racks (1 to 16); series: UPC '06, X10 '07. Note: comlib changed from '06 to '07.]

BG/W GUPS by number of racks:

Racks   UPC '06   X10 '07
1       0.58      0.58
2       1.15      1.04
4       2.28      1.31
8       4.49      2.51
16      8.83      4.51
32      14.8      n/a
64      28.3      n/a

[Chart: Random Access on Power5 SMP cluster. GUPS (0 to 0.45) vs. number of processes (1 to 1024); series: X10, MPI.]

Power5 cluster GUPS by number of processes:

Procs   X10        MPI
1       0.005613   0.001342
2       0.007755   0.002124
4       0.013426   0.004144
8       0.025345   0.002796
16      0.048783   0.005622
32      0.078405   0.01084
64      0.11704    0.020375
128     0.155228   0.029293
256     0.323228   0.055678
512     0.384118   0.094022
1024    0.398779   0.143919
LU Code (LOC=291)
– Single-place program
– 2-d block distribution of workers
– step() checks dependencies and executes the appropriate basic operation (LU, bSolve, lower)
– (Code doesn't compute y.)
Core algorithm:

void run() {
  // One worker per (pi,pj) position in the px × py worker grid:
  finish foreach (point [pi,pj] : [0:px-1, 0:py-1]) {
    int startY = 0, startX[] = new int[ny];
    final Block[] myBlocks = A[pord(pi,pj)].z;
    while (startY < ny) {
      boolean done = false;
      // Scan a bounded look-ahead window of columns:
      for (int j = startY; j < min(startY+LOOK_AHEAD, ny) && !done; ++j) {
        for (int i = startX[j]; i < nx; ++i) {
          final Block b = myBlocks[lord(i,j)];
          if (b.ready) {
            if (i == startX[j]) startX[j]++;   // column j's frontier advances
          } else done |= b.step(startY, startX);
          Thread.yield();
        }
      }
      if (startX[startY] == nx) { startY++; }  // column startY fully processed
    }
  }
}
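
pord and lord above are index helpers not shown on the slides. Hypothetical sketches (the linearization orders are assumptions, chosen only to match the loop structure in run()):

// Hypothetical: linearize position (pi,pj) in the px x py worker grid.
static int pord(int pi, int pj) { return pi*py + pj; }
// Hypothetical: linearize block (i,j) in a worker's nx x ny block array,
// i varying fastest to match the inner i-loop above.
static int lord(int i, int j) { return j*nx + i; }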
Example of steps

boolean step(final int startY, final int[] startX) {
  visitCount++;
  if (count == maxCount) {
    // All pending updates applied; do this block's own basic operation:
    return I<J ? stepIltJ() : (I==J ? stepIeqJ() : stepIgtJ());
  } else {
    // Otherwise apply the next update from blocks (I,count) and (count,J):
    Block IBuddy = getBlock(I, count);
    if (!IBuddy.ready) return false;
    Block JBuddy = getBlock(count, J);
    if (!JBuddy.ready) return false;
    mulsub(IBuddy, JBuddy);
    count++;
    return true;
  }
}

stepIltJ: wait; backsolve
stepIeqJ: wait; control panel LU factorization
stepIgtJ: wait; compute lower, participate in LU factorization

Call BLAS for DGEMM.
LU Performance (Note: SMP only)

[Chart: LU performance on a 64-way 2.3GHz p595+ PowerPC. GFlops (0 to 500) vs. problem size (4K to 64K); series: X10, UPC, MPI.]

Gflops by problem size:

ProbSz   BlkSz   PX   PY   X10      UPC      MPI
4096     256     8    8    53.24    7.10     108.00
8192     256     8    8    129.30   93.55    201.90
16384    256     8    8    238.65   250.14   274.30
32768    256     8    8    354.36   380.16   325.60
65536    256     8    8    404.43   428.60   358.40
IBM J9 JVM (64-bit)
Backup slides
Fragmented FT on PowerPC

[Chart: HPCC Fragmented FT on PowerPC (SQRTN_N=16K). GFlops (0 to 90) vs. number of processes (1 to 256).]

GFlops by number of processes:

Procs    1         2         4         8        16        32        64       128       256
GFlops   0.389602  0.644324  1.077085  1.91141  2.957925  6.792567  14.0677  25.14613  84.75768
Performs better than the non-fragmented FT.
Test run on 128 nodes.
Environment variable settings on V20
## MPI
# Eager/Rendezvous protocol
#export MP_BUFFER_MEM=64M
#export MP_EAGER_LIMIT=32768
# Time source
#export MP_CLOCK_SOURCE=SWITCH
export MP_CSS_INTERRUPT=no
export MP_SHARED_MEMORY=yes
export MP_WAIT_MODE=poll
export MP_SINGLE_THREAD=yes
# Uncomment the following for RDMA
#export MP_USE_BULK_XFER=yes
#export MP_BULK_MIN_MSG_SIZE=16384
## LAPI
export LAPI_USE_SHM=yes
export MP_ADAPTER_USE=dedicated
export MP_CPU_USE=unique
# user space protocol
export MP_EUIDEVICE=sn_all
export MP_EUILIB=us
## Set this in run script.
#export MP_MSG_API=mpi
## POE - job specification
export MP_PGMMODEL=spmd
export MP_TASK_AFFINITY=MCM
export MEMORY_AFFINITY=MCM
## POE - i/o control
#export MP_LABELIO=yes
export MP_EUIDEVELOP=MIN
Sample HPL Settings for LU (from program output)
The following parameter values will be used:

N      : 65536
NB     : 256
PMAP   : Row-major process mapping
P      : 8
Q      : 8
PFACT  : Right
NBMIN  : 4
NDIV   : 2
RFACT  : Crout
BCAST  : 1ringM 2ringM
DEPTH  : 1
SWAP   : Mix (threshold = 256)
L1     : no-transposed form
U      : transposed form
EQUIL  : no
ALIGN  : 8 double precision words
Input settings similar to http://icl.cs.utk.edu/hpcc/hpcc_record.cgi?id=171