
Delivering Parallel Programmability to the Masses via the Intel MIC Ecosystem: A Case Study

Kaixi Hou, Hao Wang, and Wu-chun Feng

Department of Computer Science, Virginia Tech

Intel Xeon Phi in HPC

• In the Top500 list* of supercomputers …
  – 27% of accelerator-based systems use Intel Xeon Phi (17/62)
  – Top 10 (systems equipped with Xeon Phi highlighted):
      1. Tianhe-2   (Rmax 33.86 petaflop/s, 3 Xeon Phi/node)
      2. Titan
      3. Sequoia
      4. K computer
      5. Mira
      6. Piz Daint
      7. Stampede   (Rmax 5.17 petaflop/s, 2 Xeon Phi/node)
      8. JUQUEEN
      9. Vulcan
      10. Cray XC30

* Released in June 2014

Intel Xeon vs. Intel Xeon Phi

  Intel Xeon                          Intel Xeon Phi
  ► Less than 12 cores/socket         ► Up to 61 cores
  ► Cores @ ~3 GHz                    ► Cores @ ~1 GHz
  ► 256-bit vector units              ► 512-bit vector units
  ► DDR3 80~100 GB/s BW               ► GDDR5 150 GB/s BW

• x86 architecture and programming models

So, is it easy to write programs for the Xeon Phi? Yes, it is easy to write and run programs on the Phi … but optimizing performance on the Phi is not easy!

Architecture-Specific Solutions

• Transposition in FFT [Park13]
  – Reduce memory accesses via cross-lane intrinsics
• Swendsen-Wang multi-cluster algorithm [Wende13]
  – Maneuver the elements in registers via data-reordering intrinsics
• Linpack benchmark [Heinecke13]
  – Reorganize the computation patterns and instructions via assembly code

If the optimizations are Xeon Phi-specific, the codes are neither easy to write nor portable.

Performance, Programmability, and Portability

• It's about more than performance … programmability and portability matter too.
• Solution: directive-based optimizations + "simple" algorithmic changes.
  – @Cache
    • Blocking to create better memory access patterns
  – @Vector units
    • Pragmas + loop structure changes
  – @Many cores
    • Pragmas
  – Find the optimal combination of parameters.

Outline

• Introduction
  – Intel Xeon Phi
  – Architecture-Specific Solutions
• Case Study: Floyd-Warshall Algorithm
  – Algorithmic Overview
  – Optimizations for Xeon Phi
    • Cache Blocking
    • Vectorization via Data-Level Parallelism
    • Many Cores via Thread-Level Parallelism
  – Performance Evaluation on Xeon Phi
• Conclusion

Case Study: Floyd-Warshall Algorithm

• All-pairs shortest paths (APSP) problem
• Algorithmic complexity: O(n³)
• Algorithmic overview
  – Keep an increasing subset of intermediate vertices for each iteration: a dynamic programming formulation (see the sketch below)

[Figure: example directed graph with four vertices a, b, c, d and weighted edges]
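For reference, here is a minimal C sketch of the unblocked algorithm that the later optimizations start from. The arrays dmat and pmat and the variable size follow the naming used on the later code slides; the function name, the value of N, and the extern declarations are illustrative only.

#define N 16000                 /* illustrative maximum vertex count */
extern int dmat[N][N];          /* distance matrix, initialized from the edge weights */
extern int pmat[N][N];          /* predecessor matrix for path reconstruction */

/* Naive Floyd-Warshall: for every intermediate vertex k, try to relax every pair (u, v). */
void floyd_warshall(int size) {
    for (int k = 0; k < size; k++)
        for (int u = 0; u < size; u++)
            for (int v = 0; v < size; v++)
                if (dmat[u][v] > dmat[u][k] + dmat[k][v]) {
                    dmat[u][v] = dmat[u][k] + dmat[k][v];
                    pmat[u][v] = k;     /* remember the intermediate vertex */
                }
}

Every iteration of k sweeps the whole n×n matrix, which is the data-locality problem examined below.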

A Quick Example

• Highlighted entries mark paths that pass through the newly added intermediate vertex of the current iteration.

[Figure: the four-vertex example graph from the previous slide, with the relaxed paths highlighted at each iteration]

Initial distance matrix:

     a  b  c  d
  a  0  4  -  4
  b  -  0  -  3
  c  2  -  0  -
  d  4  -  1  0

k = 1 (via a): c→b = 2+4 = 6, c→d = 2+4 = 6, d→b = 4+4 = 8

     a  b  c  d
  a  0  4  -  4
  b  -  0  -  3
  c  2  6  0  6
  d  4  8  1  0

k = 2 (via b): no shorter paths are found; the matrix is unchanged.

k = 3 (via c): d→a = 1+2 = 3, d→b = 1+6 = 7

     a  b  c  d
  a  0  4  -  4
  b  -  0  -  3
  c  2  6  0  6
  d  3  7  1  0

k = 4 (via d): a→c = 4+1 = 5, b→a = 3+3 = 6, b→c = 3+1 = 4

     a  b  c  d
  a  0  4  5  4
  b  6  0  4  3
  c  2  6  0  6
  d  3  7  1  0

Issue in Caching: Lack of Data Locality

[Figure: in the default algorithm, each update reads x1 = dmat[i][k] and x2 = dmat[k][j] and writes y = dmat[i][j]; successive iterations stride across distant rows of the matrix, so there is little cache reuse]

Default algorithm: data locality problem

Cache Blocking: Improve Data Reuse

[Figure: the distance matrix is tiled into blocks and padded to a multiple of the block size. For each block row/column of intermediate vertices k, Step 1 updates the diagonal (k, k) block, Step 2 updates the blocks in the same block row (k, j) and block column (i, k), and Step 3 updates the remaining (i, j) blocks.]
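A rough sketch of how the three steps could drive the per-block kernel. It assumes the update(k0, u0, v0) routine shown on the next slides, a matrix padded to a multiple of BLOCK_SIZE, and an illustrative padded_size parameter; the actual driver and block scheduling used by the authors are not shown in the slides.

#define BLOCK_SIZE 32                       /* illustrative; tuned later via Starchart */
void update(int k0, int u0, int v0);        /* per-block kernel (next slides) */

/* Blocked Floyd-Warshall driver (sketch). For each diagonal block kb:
 * Step 1: the (kb, kb) block depends only on itself.
 * Step 2: blocks in block row kb and block column kb depend on Step 1.
 * Step 3: all remaining blocks depend on Step 2. */
void blocked_floyd_warshall(int padded_size) {
    for (int kb = 0; kb < padded_size; kb += BLOCK_SIZE) {
        update(kb, kb, kb);                         /* Step 1: diagonal block */
        for (int b = 0; b < padded_size; b += BLOCK_SIZE) {
            if (b == kb) continue;
            update(kb, kb, b);                      /* Step 2: (k, j) blocks   */
            update(kb, b, kb);                      /* Step 2: (i, k) blocks   */
        }
        for (int ib = 0; ib < padded_size; ib += BLOCK_SIZE) {
            if (ib == kb) continue;
            for (int jb = 0; jb < padded_size; jb += BLOCK_SIZE) {
                if (jb == kb) continue;
                update(kb, ib, jb);                 /* Step 3: remaining blocks */
            }
        }
    }
}

The steps must run in order because each reads blocks produced by the previous one, but the blocks within Step 2 and within Step 3 are independent of each other, which is what the OpenMP parallelization below exploits.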

Vectorization: Data-Level Parallelism

• Pragmas to guide the compiler to vectorize the loop:
  – #pragma vector always: vectorize the loop regardless of the efficiency
  – #pragma ivdep: ignore assumed vector dependencies

[Figure: the distance matrix with a block at (u0, v0); the bottom-right block extends past the matrix edge into the padded area, which is why the loop bounds are clamped with MIN]

Core computation (Version 1):

void update(int k0, int u0, int v0) {
    for (k = k0; k < MIN(k0 + BLOCK_SIZE, size); k++) {
        for (u = u0; u < MIN(u0 + BLOCK_SIZE, size); u++) {
#pragma ivdep
            for (v = v0; v < MIN(v0 + BLOCK_SIZE, size); v++) {
                if (dmat[u][v] > dmat[u][k] + dmat[k][v]) {
                    dmat[u][v] = dmat[u][k] + dmat[k][v];
                    pmat[u][v] = k;
                }
            }
        }
    }
}

The compiler still refuses to vectorize the innermost loop and reports "Top test could not be found": the MIN() boundary checks keep the loop form irregular.

Vectorization: Data-Level Parallelism (Cont'd)

• Modify the boundary check conditions (u-loop & v-loop)
  – Extra computations in the padded area, but regular loop forms
• Keep the boundary check condition on the k-loop
  – It determines where to fetch data

[Figure: the bottom-right block now iterates over the full BLOCK_SIZE range and spills into the padded area; the extra computations are wasted work but make the code SIMD-friendly]

SIMD-friendly code (Version 3):

void update(int k0, int u0, int v0) {
    for (k = k0; k < MIN(k0 + BLOCK_SIZE, size); k++) {
        for (u = u0; u < (u0 + BLOCK_SIZE); u++) {
#pragma ivdep
            for (v = v0; v < (v0 + BLOCK_SIZE); v++) {
                if (dmat[u][v] > dmat[u][k] + dmat[k][v]) {
                    dmat[u][v] = dmat[u][k] + dmat[k][v];
                    pmat[u][v] = k;
                }
            }
        }
    }
}

Many Cores: Thread-Level Parallelism (TLP)

• OpenMP pragmas
  – A portable way to parallelize serial programs
  – Run-time specifications: thread number, thread affinity, etc.
• Utilize thread-level parallelism (TLP) in the Xeon Phi
  – Apply OpenMP pragmas on the loops of Step 2 and Step 3, which offer the most parallelism (see the sketch below)
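A minimal sketch of what such an annotation could look like for Step 3 of the blocked driver sketched earlier. The collapse and schedule clauses, the function name, and BLOCK_SIZE/padded_size are illustrative choices, not the authors' exact pragmas.

#define BLOCK_SIZE 32
void update(int k0, int u0, int v0);    /* Version 3 kernel from the previous slide */

/* Step 3 of one block iteration: every (i, j) block outside block row/column kb
 * is independent, so the loop nest is distributed over the Phi's hardware threads. */
void step3(int kb, int padded_size) {
#pragma omp parallel for collapse(2) schedule(dynamic)
    for (int ib = 0; ib < padded_size; ib += BLOCK_SIZE) {
        for (int jb = 0; jb < padded_size; jb += BLOCK_SIZE) {
            if (ib == kb || jb == kb) continue;
            update(kb, ib, jb);
        }
    }
}

At run time, the thread count and affinity would then be set through OMP_NUM_THREADS and KMP_AFFINITY (e.g., balanced), which is exactly the parameter space the following slides explore.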

Optimization Challenges in Thread-Level Parallelism

• Many configurable parameters
  – Ex: block size, thread number, runtime scheduling policy, etc.
• Difficulty in finding an appropriate combination of parameters
  – Inter-dependency between parameters
  – Huge search space

Optimization Approach to Thread-Level Parallelism

• Starchart: tree-based partitioning [Jia13]
  – Start from a pool of samples of the form (p1, p2, …, pn) → performance
  – V = performance variance of the current pool
  – Vi = performance variance of one subset + performance variance of the other subset, after splitting the pool on a value of parameter i
  – Select the parameter value that maximizes the variance reduction V − Vi (see the sketch below)
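A rough C sketch of the split criterion, for illustration only: the struct, the function names, and the threshold-based split are assumptions, and the real Starchart tool additionally handles categorical parameters, recursion over subtrees, and sample bookkeeping.

#include <stddef.h>

/* One measured configuration: parameter values plus the observed runtime. */
typedef struct {
    int    params[5];   /* e.g., data size, block size, thread num, task alloc, affinity */
    double runtime;     /* seconds */
} sample_t;

/* Variance of the runtimes of n samples. */
static double variance(const sample_t *s, size_t n) {
    if (n == 0) return 0.0;
    double mean = 0.0;
    for (size_t i = 0; i < n; i++) mean += s[i].runtime;
    mean /= (double)n;
    double var = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = s[i].runtime - mean;
        var += d * d;
    }
    return var / (double)n;
}

/* Variance reduction V - Vi obtained by splitting the pool on parameter p
 * at the given threshold. The split with the largest reduction identifies
 * the most performance-critical parameter value. */
double variance_reduction(const sample_t *pool, size_t n, int p, int threshold,
                          sample_t *left, size_t *nl, sample_t *right, size_t *nr) {
    *nl = *nr = 0;
    for (size_t i = 0; i < n; i++) {
        if (pool[i].params[p] <= threshold) left[(*nl)++] = pool[i];
        else                                right[(*nr)++] = pool[i];
    }
    double V  = variance(pool, n);
    double Vi = variance(left, *nl) + variance(right, *nr);
    return V - Vi;
}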

Applying Starchart

5 parameters: 480 possible combinations

[Figure: Starchart decision tree built from 200 samples (overall mean 1.47 s). The root splits on data size: 2000 vertices (97 samples, mean 0.72 s) vs. 4000 vertices (103 samples, mean 2.17 s). For the small data size, the next split is on block size (16/32/48: 70 samples, 0.60 s vs. 64: 27 samples, 1.05 s), followed by thread number and task allocation. For the large data size, the next split is on thread number (61: 21 samples, 3.20 s vs. 122/183/244: 82 samples, 1.91 s), followed by block size and task allocation.]

Choosing an appropriate block size and thread number is most important!

Chosen parameters (small data size): blockSize = 32, threadNum = 244, taskAlloc = block, threadAffinity = balanced
Chosen parameters (large data size): blockSize = 32, threadNum = 244, taskAlloc = cyc, threadAffinity = balanced

Outline

• Introduction
  – Intel Xeon Phi
  – Architecture-Specific Solutions
• Case Study: Floyd-Warshall Algorithm
  – Algorithmic Overview
  – Optimizations for Xeon Phi
    • Cache Blocking
    • Vectorization via Data-Level Parallelism
    • Many Cores via Thread-Level Parallelism
  – Performance Evaluation on Xeon Phi
• Conclusion

Performance Evaluation: Step-by-Step

[Figure: execution times in seconds for 2,000 vertices — default sequential version: 180.27; cache blocking: 205.29; cache blocking + changes to loop structure: 102.11; vectorization via SIMD pragmas: 24.95; thread-level parallelism via OpenMP pragmas: 0.64]

Each speedup below is relative to the preceding version.

• Cache blocking alone: 14% performance loss
  – Redundant computations induced in Step 2 and Step 3
  – Boundary check conditions in the loop structures
• Cache blocking with changes to loop structure: 1.76x
• Vectorization via SIMD pragmas: 4.09x
• Thread-level parallelism via OpenMP pragmas: 38.98x
• Overall: 281.67x over the default sequential version

Performance Evaluation: Scalability

• Baseline: OpenMP version of the default algorithm
• Optimized (MIC) vs. Baseline: up to 6.39x
• Optimized (MIC) vs. Optimized (CPU): up to 3.2x
  – Peak performance ratio of MIC to CPU: 3.23x (2148 Gflops vs. 665.6 Gflops)

[Figure: execution times (s, log scale) vs. number of vertices (1000 to 16000) for the default FW with OpenMP (MIC, the baseline), the blocked FW with SIMD pragmas + OpenMP (MIC), and the blocked FW with SIMD pragmas + OpenMP (CPU); annotated speedups of 6.39x, 3.2x, and 1.37x]

Performance Evaluation: Strong Scaling

• Balanced thread affinity:
  – 2x speedup going from 1 thread/core to 4 threads/core
• Other affinities:
  – Scatter: 2.6x
  – Compact: 3.8x

[Figure: execution times (s) for 16,000 vertices vs. number of threads (61, 122, 183, 244) under the balanced, scatter, and compact thread affinities]

Conclusion

• CPU programs can be recompiled and run directly on the Intel Xeon Phi, but achieving optimized performance requires considerable effort.
  – Considerations: performance, programmability, and portability
• We use directive-based optimizations and certain algorithmic changes to achieve significant performance gains for the Floyd-Warshall algorithm as a case study.
  – 6.4x speedup over a default OpenMP version of Floyd-Warshall on the Xeon Phi
  – 3.2x speedup over a 12-core multicore CPU (Sandy Bridge)

Thanks! Questions?

