Supporting Applications Involving Irregular Accesses and Recursive Control Flow on Emerging Parallel Environments 1 Xin Huo Department of Computer Science and Engineering The Ohio State University Advisor: Prof. Gagan Agrawal
Transcript
Page 1: Xin Huo Department of Computer Science and Engineering The Ohio State University

Supporting Applications Involving Irregular Accesses and Recursive Control Flow on

Emerging Parallel Environments

1

Xin HuoDepartment of Computer Science and Engineering

The Ohio State University

Advisor: Prof. Gagan Agrawal

Page 2:

Motivation - Parallel Architectures

• Many-core Architecture: GPU vs. Xeon Phi
- Programming model: CUDA/OpenCL vs. all x86-based programming models
- Running method: offload only vs. offload or native execution
- Memory configuration: configurable shared memory + L2 cache vs. coherent L2 cache
- SIMT vs. SIMD: SIMT supports non-contiguous and non-aligned accesses and control dependencies automatically; SIMD has no direct support for non-contiguous or non-aligned accesses, or for control/data dependencies

Page 3:

Motivation - Parallel Architectures

• Heterogeneous Architecture
- CPU - Decoupled GPU: x86 CPU cores and GPU engine arrays connected by a PCIe bus, with separate host memory and GPU memory
- CPU - Coupled GPU: Fusion APU with a shared memory controller and a single physical memory

Page 4:

Motivation - Application Patterns

• Irregular / Unstructured Reduction
- A dwarf in the Berkeley view on parallel computing (Molecular Dynamics and Euler)
- Challenge in parallelism: heavy data dependencies
- Challenge in memory performance: indirect memory accesses result in poor data locality
• Recursive Control Flow
- Conflicts between SIMD architecture and control dependencies
- Recursion support: SSE (no), OpenCL (no), CUDA (yes)
• Generalized Reduction and Stencil Computation
- Need to be reconsidered when harnessing thousands of threads

Page 5:

Thesis Work

- Application Patterns: Generalized Reduction, Irregular Reduction, Stencil Computation, Recursive Application
- Parallel Architectures: CPU + GPU, GPU, APU, Xeon Phi

Page 6:

Thesis Work

• Approaches for Parallelizing Reductions on GPUs (HiPC 2010)
• An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs (ICS 2011)
• Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations (HiPC 2011)
• Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture (SC 2012)
• Efficient Scheduling of Recursive Control Flow on GPUs (ICS 2013)
• A Programming System for Xeon Phis with Runtime SIMD Parallelization (ICS 2014)


Page 7:

Outline

• Latest work
- A Programming System for Xeon Phis with Runtime SIMD Parallelization
• Other work
- Strategy and runtime support for irregular reductions on GPUs
- Improving SIMD parallelism for recursive applications on GPUs
- Different strategies for generalized reduction on GPUs
- Task scheduling frameworks
  • Decoupled GPU + CPU
  • Coupled GPU + CPU
• Conclusion

Page 8:

Intel Xeon Phi Architecture

8

Page 9:

SIMD - Single Instruction Multiple Data

Scalar loop:

for (i = 0; i < n; ++i)
    A[i] = A[i] + B[i];

SIMD loop (vector width 4):

for (i = 0; i < n; i += 4)
    A[i:(i+4)] = A[i:(i+4)] + B[i:(i+4)];

Page 10:

MMX (64-bit), SSE (128-bit), AVX (256-bit) and IMCI (512-bit) Instruction Sets

10

Page 11:

GPU vs. Xeon Phi

                 GFlops  Memory size  Memory speed  Power  Core frequency  # of cores
Tesla C2070      1030    6 GB         1.5 GHz       238 W  1.15 GHz        448
Xeon Phi SE10P   1070    8 GB         5.5 GHz       300 W  1.1 GHz         61

Page 12:

Contributions

• MIMD and SIMD Programming System
- MIMD Parallelism
  • Pattern-knowledge-based parallel framework
  • Task partitioning & scheduling strategies
  • Data reorganization to help SIMD parallelism
  • Automatic MIMD API
- SIMD Parallelism
  • Automatic SIMD API
  • Control flow dependency resolution
  • Data dependency resolution

Page 13:

MIMD and SIMD Challenges for Different Patterns

Application Pattern    | MIMD Challenge               | SIMD Challenge
Generalized Reduction  | Job partition                | Unaligned/non-unit-stride access; control flow dependency; data dependency/conflicts
Stencil Computation    | Job partition                | Unaligned/non-unit-stride access; control flow dependency
Irregular Reduction    | Job partition; load balance  | Unaligned/random memory access; control flow dependency; data dependency/conflicts

Page 14:

MIMD Parallel System

- Task Definition: configuration, task type, parameter type, kernel function
- MIMD Parallel Framework: a Task is split by the Task Partitioner into jobs, which the Job Scheduler dispatches; Data Reorganization prepares the data layout for SIMD execution

Page 15:

MIMD Parallel System

• Task Partitioner
- Computation Space Partitioning
  • Achieves good load balance
  • Needs data replication or locking to resolve data dependencies
  • Used for Generalized Reduction, Stencil Computation
- Reduction Space Partitioning
  • No data dependency between partitions
  • Introduces computation redundancy and load imbalance, which can be reduced by appropriate partitioning and reordering methods
  • Used for Irregular Reduction
• Job Scheduler
- Static, dynamic, or user-defined strategies
• API Interface
- Easy interface
  • run(Task): register a Task with the MIMD framework and start to run
  • join(): block and wait for execution to finish

Page 16:

SIMD Data Type

Data Type    | Name                       | Description
Scalar type  | int, float, double, ...    | Data shared by all SIMD lanes; all basic data types and temporary variables belong to the scalar type.
Vector type  | vint, vfloat, vdouble, ... | Holds multiple data elements, one per SIMD lane.
Mask type    | mask                       | Helps handle control flow in vectorization.

Page 17:

SIMD Data Type

int s;           // one value, shared by all lanes
vint v;          // [value0, value1, value2, value3]
int s = s1;      // scalar copy: s1
vint v = v1;     // lane-wise copy: [v1-0, v1-1, v1-2, v1-3]
vint v = s;      // broadcast: [s, s, s, s]
vint array[2];   // [v0-0, v0-1, v0-2, v0-3], [v1-0, v1-1, v1-2, v1-3]

Page 18:

SIMD API

vint v1, v2; int s; mask m;
(op represents the supported mathematical or logical operations)

API             | Description                                                                                          | Examples
Assignment API  | Assignment between vector types, and from scalar type to vector type.                                | v1 = v2; v1 = s; v1 op= v2; v1 op= s;
Mathematic API  | Most mathematical operations, including +, -, *, /, %, between vector types and scalar types.        | v1 = v2 op v1; v1 = v2 op s;
Logic API       | Most logical operations, including ==, !=, <, >, <=, >=, between vector types and scalar types. The return type is the mask type. | m = v1 op v2; m = v1 op s;
Load/Store API  | Contiguous memory access: void load(void *src); void store(void *dst);                               | v1.load(addr); v1.store(addr);
                | Non-contiguous memory access (gather/scatter): void load(void *src, const vint &index, int scale) loads [src + index[i]*scale] into v[i]; void store(void *dst, const vint &index, int scale) stores v[i] to [dst + index[i]*scale]

Page 19:

Sample Kernels - Kmeans

Sequential code:

Input: float *data; int i; // the index of one node

// Iterate over k clusters to compute the distance to each cluster
for (int j = 0; j < k; ++j) {
    float dis = 0.0;
    for (int m = 0; m < 3; ++m) {
        dis += (data[i*3+m] - cluster[j*3+m]) *
               (data[i*3+m] - cluster[j*3+m]);
    }
    dis = sqrt(dis);
}

SIMD API code:

Input: vfloat *data; int i; // the index of one node

// Iterate over k clusters to compute the distance to each cluster
for (int j = 0; j < k; ++j) {
    vfloat dis = 0.0;
    for (int m = 0; m < 3; ++m) {
        dis += (data[i+m*n] - cluster[j*3+m]) *
               (data[i+m*n] - cluster[j*3+m]);
    }
    dis = sqrt(dis);
}

Page 20:

Sample Kernels - Sobel

Sequential code:

Input: (i, j) the index of one node, float b[N][N]
float Dx = 0.0;
float Dy = 0.0;

// Compute the weight for a node in a 3x3 area
for (int p = -1; p <= 1; p++) {
    for (int q = -1; q <= 1; q++) {
        Dx += wH[p+1][q+1] * b[i+p][j+q];
        Dy += wV[p+1][q+1] * b[i+p][j+q];
    }
}
float z = sqrt(Dx*Dx + Dy*Dy);
a[i][j] = z;

SIMD API code:

Input: (i, j) the index of one node, vfloat b[N][N]
vfloat Dx = 0.0;
vfloat Dy = 0.0;

// Compute the weight for a node in a 3x3 area
for (int p = -1; p <= 1; p++) {
    for (int q = -1; q <= 1; q++) {
        Dx += wH[p+1][q+1] * b[X(i,p,q)][Y(j,p,q)];
        Dy += wV[p+1][q+1] * b[X(i,p,q)][Y(j,p,q)];
    }
}
vfloat z = sqrt(Dx*Dx + Dy*Dy);
z.store(&a[i][j]);

Page 21:

Sample Kernels - Sobel

Sequential code: as on the previous slide.

Manual vectorization code:

Input: (i, j) the index of one node, float b[N][N]
__m512 Dx = _mm512_set1_ps(0.0);
__m512 Dy = _mm512_set1_ps(0.0);

// Compute the weight for a node in a 3x3 area
for (int p = -1; p <= 1; ++p) {
    for (int q = -1; q <= 1; ++q) {
        __m512 *tmp = (__m512*)&b[i+q][j+p*vec_width];
        __m512 tmpx = _mm512_mul_ps(*tmp, wH[p+1][q+1]);
        Dx = _mm512_add_ps(Dx, tmpx);
        __m512 tmpy = _mm512_mul_ps(*tmp, wV[p+1][q+1]);
        Dy = _mm512_add_ps(Dy, tmpy);
    }
}
__m512 sqDx = _mm512_mul_ps(Dx, Dx);
__m512 sqDy = _mm512_mul_ps(Dy, Dy);
__m512 ret = _mm512_add_ps(sqDx, sqDy);
ret = _mm512_sqrt_ps(ret);
_mm512_store_ps(&a[i][j], ret);

Page 22:

Data Layout Reorganization

• Why data reorganization?
- Vectorization only supports contiguous and aligned memory access
- Gather/Scatter APIs load and store non-contiguous and unaligned data
- But they add more than 60% overhead and introduce programming effort for users

[Chart: Kmeans (k = 100) execution time (s) vs. thread count (1-256), comparing Gather/Scatter against Data reorganization]

Page 23:

Data Layout Reorganization

• Non-unit-stride memory access
- AOS (Array of Structures) to SOA (Structure of Arrays)
- Structure{x, y, z}
- [x0, y0, z0, x1, y1, z1, ...] => [x0, x1, ..., y0, y1, ..., z0, z1, ...]
• Non-aligned memory access
- Non-linear data layout transformation proposed in "Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures"
- Provide functions to obtain the indices of the adjacent nodes after transformation
• Irregular/Indirect memory access
- Edge reordering method proposed by Bin Ren

[Figure: a node A and its eight neighbors (1-8) in a 3x3 stencil]

Page 24:

Control Flow Resolution

• IMCI first introduces operations with mask parameters
- mask: bitset (1: active lanes, 0: inactive lanes)
- old value: the default value for inactive lanes

Scalar control flow:

if (v < 3) v = 0;
else v = 1;

Vectorized with a mask:

vint v = [0, 2, 4, 6];
mask = v < 3;      // mask = [1, 1, 0, 0]: the first two lanes are active
old = 1;           // inactive lanes receive the old value 1
v = 0;             // result: v = [0, 0, 1, 1]

How do we pass the mask and old value to our operator-overload API?

Page 25:

Mask Operation Implementation

• IMCI implementation
- Mask operations have a different API from unmasked operations
  • _mm512_add_ps(v1, v2) vs. _mm512_mask_add_ps(old, mask, v1, v2)
• Our considerations
- Operator overload functions do not support extra parameters
- Minimize the complexity of the API
• The mask status is part of the execution state
- It exists during the whole thread life cycle - static
- It is private to each thread - thread local

Mask Status Class (MS)

Member                                 | Description
static __thread mask m                 | mask type variable
static __thread type old_val           | the default value for inactive lanes
static void set_mask(mask m, type val) | set or update the mask and old value
static void clear_mask()               | clear the old value and set all lanes to active

Page 26:

Mask Operation Implementation (continued)

Scalar code:

if (v < 3) v = 0;
else v = 1;

With an explicit mask and old value:

vint v = [0, 2, 4, 6];
mask = v < 3;
old = 1;
v = 0;

With the MS class:

vint v = [0, 2, 4, 6];
MS::set_mask(v < 3, 1);
v = 0;

Page 27:

Mask Operation Implementation

• When should our API use mask operations?
- Always use mask operations
  • No effort for users
  • Introduces extra performance overhead
- Let the user decide when to use mask operations
  • Two vector types: unmasked vector type and masked vector type
  • Change from the unmasked to the masked type by calling the mask() function

[Chart: Sobel (16384 x 16384) execution time (s) vs. thread count (1-256), comparing API against API-Mask]

Page 28:

Mask Operation Implementation (continued)

Scalar code:

if (v < 3) v = 0;
else v = 1;

With an explicit mask and old value:

vint v = [0, 2, 4, 6];
mask = v < 3;
old = 1;
v = 0;

With the MS class, masking every operation:

vint v = [0, 2, 4, 6];
MS::set_mask(v < 3, 1);
v = 0;

With the MS class, masking only where requested:

vint v = [0, 2, 4, 6];
MS::set_mask(v < 3, 1);
v.mask() = 0;

Page 29:

Data Dependency Resolution

• Writing conflicts in reduction
- Undefined behavior when different SIMD lanes update the same location
• Serialized reduction

template<class ReducFunc = func>
void reduction(int *update, int scale, int offset, vint *index, type value) {
    // Apply the reduction one lane at a time to avoid lane conflicts
    for (int i = 0; i < simd_width; ++i) {
        ReducFunc()(update[index[i]*scale + offset], value[i]);
    }
}

Page 30:

Data Dependency Resolution

• Reorder reduction
- Reorder the elements so that lanes updating the same location are grouped, e.g. [A, C, A, B, B, D] => [A, A, B, B, C, D]
- Still needs gather and scatter for non-contiguous memory access
• Group reduction
- Interleave the groups, e.g. [A, A, B, B, C, D] => [A, B, A, B, C, D], so that each vector pass touches each location at most once

Page 31:

Experiment Evaluations

• Environment configuration
- Xeon Phi SE10P with 61 cores at 1.1 GHz
- 8 GB GDDR5 memory
- Running in native mode
- Intel ICC 13.1.0 with the -O3 option
• Benchmarks
- Generalized reduction: Kmeans, NBC
- Stencil computation: Sobel, Heat3D
• Configurations compared
- SIMD-API
- SIMD-Manual
- Pthread-vec: -vec option
- Pthread-novec: -no-vec option
- OpenMP-vec: -vec option with #pragma vector always
- OpenMP-novec: -no-vec option

Page 32:

Speedup - Generalized Reduction

• SIMD-API achieves better performance than Pthread-novec and Pthread-vec
• SIMD-API introduces very small overhead compared to SIMD-Manual
• Kmeans: the performance of compiler vectorization depends on K
• NBC: a large number of divergences limits the performance of compiler vectorization

Page 33:

Speedup - Stencil Computation

• Sobel: compiler vectorization fails due to the complicated nested loops
• Heat3D: compiler vectorization achieves the best performance, very similar to our API version
• Unaligned access is the major challenge for vectorization
• The compiler can also auto-vectorize stencil computations under some conditions

Page 34:

Scalability

34

Page 35:

Comparison with OpenMP

• MIMD + SIMD parallelism
- More than 3x speedup for most applications
• MIMD parallelism
- Gains better performance due to the pattern-knowledge-based task partitioning and scheduling strategies

Page 36:

Outline

• Latest work
- A Programming System for Xeon Phis with Runtime SIMD Parallelization
• Other work
- Strategy and runtime support for irregular reductions on GPUs
- Improving SIMD parallelism for recursive applications on GPUs
- Different strategies for generalized reduction on GPUs
- Task scheduling frameworks
  • Decoupled GPU + CPU
  • Coupled GPU + CPU
• Conclusion

Page 37:

Irregular Reduction

• Irregular Applications
- Unstructured grid pattern
- Random and indirect memory accesses
• Molecular Dynamics
- Indirection Array -> Edges (Interactions)
- Reduction Objects -> Molecules (Attributes)
- Computation Space -> Interactions between molecules
- Reduction Space -> Attributes of molecules

Page 38:

Main Issues

• Traditional strategies are not effective
- Full Replication (private copy per thread)
  • Large memory overhead
  • Requires both intra-block and inter-block combination
  • Shared memory usage unlikely
- Locking Scheme (private copy per block)
  • Heavy conflicts within a block
  • Avoids intra-block combination, but not inter-block combination
  • Shared memory is only available for small data sets
• Need to choose a partitioning strategy
- Make sure data fits in shared memory
- Choice of partitioning space (computation vs. reduction)
- Tradeoffs: partitioning overhead vs. execution efficiency

Page 39:

Contributions

• A Novel Partitioning-based Locking Strategy
- Efficient shared memory utilization
- Eliminates both intra- and inter-block combination
• Optimized Runtime Support
- Multi-dimensional partitioning method
- Reordering & updating components for correctness and memory performance
• Significant Performance Improvements
- Exhaustive evaluation
- Up to 3.3x improvement over traditional strategies

Page 40:

Choice of Partitioning Space

• Two partitioning choices:
- Computation Space: partition on edges
- Reduction Space: partition on nodes

Page 41:

Computation Space Partitioning

• Partitioning on the iterations of the computation loop
• Pros:
- Load balance on computation
• Cons:
- Unequal reduction size in each partition
- Replicated reduction elements (4 out of 16 nodes are replicated)
- Combination cost
- Shared memory is infeasible

[Figure: a 16-node unstructured grid split into Partitions 1-4 by edges; nodes on crossing edges are replicated across partitions]

Page 42:

Reduction Space Partitioning

• Partitioning on the reduction elements
• Pros:
- Balanced reduction space
- Partitions are independent of each other
- Avoids combination cost
- Shared memory is feasible
• Cons:
- Imbalance in the computation space
- Replicated work caused by crossing edges

[Figure: the same 16-node grid split into Partitions 1-4 by nodes; crossing edges are computed in both partitions]

Page 43:

Reduction Space Partitioning - Challenges

• Unbalanced & replicated computation
- The partitioning method must balance cost against efficiency
  • Cost: execution time of the partitioning method
  • Efficiency: reducing the number of crossing edges (replicated work)
• Maintaining correctness on the GPU
- Reorder the reduction space
- Update/reorder the computation space

Page 44:

Runtime Partitioning Approaches

• Metis Partitioning (multi-level k-way partitioning)
- Executes sequentially on the CPU
- Minimizes crossing edges
- Cons: large overhead for data initialization (high cost)
• GPU-based (Trivial) Partitioning
- Parallel execution on the GPU
- Minimizes execution time
- Cons: large number of crossing edges among partitions (low efficiency)
• Multi-dimensional Partitioning (coordinate information)
- Executes sequentially on the CPU
- Balances cost and efficiency

Page 45:

Performance Gains

[Chart: Molecular Dynamics - comparison between Partitioning-Based Locking (PBL), Locking, Full Replication, and sequential CPU time]

[Chart: Euler - comparison between Partitioning-Based Locking (PBL), Locking, Full Replication, and sequential CPU time]

Page 46:

Outline

• Latest work
- A Programming System for Xeon Phis with Runtime SIMD Parallelization
• Other work
- Strategy and runtime support for irregular reductions on GPUs
- Improving SIMD parallelism for recursive applications on GPUs
- Different strategies for generalized reduction on GPUs
- Task scheduling frameworks
  • Decoupled GPU + CPU
  • Coupled GPU + CPU
• Conclusion

Page 47:

Current Recursion Support on Modern GPUs

• Problem definition: intra-warp thread scheduling
- Each thread in a warp executes an independent recursive function call
• SIMT (Single Instruction Multiple Thread): different from the traditional SSE extensions
- Each thread owns a stack for function calls
- Branch control: different branches execute serially
• Recursion support on current GPUs
- AMD GPUs and the OpenCL programming model
  • Do not support recursion
- NVIDIA GPUs
  • Support recursion from compute capability 2.0 and SDK 3.1
  • Performance is limited by the reconvergence method - immediate post-dominator reconvergence

Page 48:

Immediate Post-Dominator Reconvergence in Recursion

/* General branch */
Fib(n) {
    if (n < 2) {
        return 1;
    } else {
        x = Fib(n-1);
        y = Fib(n-2);
        return x + y;
    }
}

• Reconvergence can only happen at the same recursion level
• Threads with the shorter branch cannot return until the threads with the longer branch come back to the reconvergence point

Page 49:

Contributions

• Dynamic reconvergence method for recursion
- Reconvergence can happen before or after the immediate post-dominator
• Two implementations
- Dynamic greedy reconvergence
- Dynamic majority reconvergence
• Significant performance improvements
- Exhaustive evaluation on six recursion benchmarks with different characteristics
- Up to 6x improvement over the immediate post-dominator method

Page 50:

Dynamic Reconvergence Mechanisms

[Figure: threads T0 and T1 diverging and reconverging dynamically across recursion levels]

Page 51:

Dynamic Reconvergence Mechanisms

• Remove the static reconvergence at the immediate post-dominator
• Dynamic reconvergence implementations
- Greedy Reconvergence
  • Convergence group: the group of threads that have the same PC (Program Counter) address as the currently active threads
  • Greedily reconverge threads in the same convergence group
  • Reconvergence can cross different recursion levels
- Majority-based Reconvergence
  • Majority group: the largest group of threads with the same PC
  • Tends to maximize IPC (Instructions Per Cycle)
  • Threads in the minority have a chance to join the majority later
- Other dynamic reconvergence mechanisms
  • Greedy-return: schedule threads in the return branch first
  • Majority-threshold: prevent starvation of threads in the minority group

Page 52:

Outline

• Latest work
- A Programming System for Xeon Phis with Runtime SIMD Parallelization
• Other work
- Strategy and runtime support for irregular reductions on GPUs
- Improving SIMD parallelism for recursive applications on GPUs
- Different strategies for generalized reduction on GPUs
- Task scheduling frameworks
  • Decoupled GPU + CPU
  • Coupled GPU + CPU
• Conclusion

Page 53:

Different Strategies for Generalized Reductions on GPUs

• Tradeoffs between Full Replication and the Locking Scheme
• Hybrid Scheme
- Balances Full Replication and the Locking Scheme
- Introduces an intermediate scheduling layer, the "group", under the thread block
  • Intra-group: Locking Scheme
  • Inter-group: Full Replication
- The benefits of the Hybrid Scheme vary with group size
  • Reduced memory overhead
  • Better use of shared memory
  • Reduced combination cost & conflicts
• Extensive evaluation on different applications with different parameters

Page 54:

Porting Irregular Reductions on Heterogeneous CPU-GPU Architectures

• A Multi-level Partitioning Framework
- Parallelizes irregular reduction on a heterogeneous architecture
  • Coarse-grained level: tasks between CPU and GPU
  • Fine-grained level: tasks between thread blocks or threads
  • Both use reduction space partitioning
- Eliminates the device memory limitation on the GPU
• Runtime Support Scheme
- A pipeline scheme overlaps partitioning and computation on the GPU
- A work-stealing-based scheduling strategy provides load balance and increases the pipelining length
• Significant Performance Improvements
- Achieves 11% and 22% improvement for Euler and Molecular Dynamics

Page 55:

Accelerating Applications on an Integrated CPU-GPU

• Thread-Block-Level Scheduling Framework on the Fusion APU
- Work scheduling targets thread blocks (not devices)
- Only one kernel launch at the beginning -> small command launching overhead
- No synchronization between devices or thread blocks
- Inter- and intra-device load balance is achieved by fine-grained and factoring scheduling policies
• Lock-free implementations based on shared memory
- Master-worker scheduling
- Token scheduling
• Applications with different communication patterns
- Stencil computation (Jacobi): 1.6x
- Generalized reduction (K-means): 1.92x
- Irregular reduction (Molecular Dynamics): 1.15x

Page 56:

Future Work

• Extensions of the current SIMD programming system
- Support SIMD vectors with different lengths
  • Hide the different implementations for different vector lengths
  • Simulate new IMCI functions on older SIMD instruction sets
  • Support varying the vector length at runtime
- Support the CPU-MIC heterogeneous environment
  • Shared memory between CPU and MIC
  • Global synchronization on the MIC without involving the host

Page 57:

Conclusions

• Different strategies for generalized reductions on GPUs
- Approaches for Parallelizing Reductions on Modern GPUs (HiPC 2010, Best Student Paper)
• Strategy and runtime support for irregular reductions
- An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs (ICS 2011, Best Paper Nomination)
• Task scheduling frameworks for heterogeneous architectures
- Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations (HiPC 2011)
- Accelerating MapReduce on a Coupled CPU-GPU Architecture (SC 2012)
• Improved SIMD parallelism for recursion on GPUs
- Efficient Scheduling of Recursive Control Flow on GPUs (ICS 2013)
• MIMD and SIMD parallel framework and API for Xeon Phi
- A Programming System for Xeon Phis with Runtime SIMD Parallelization (ICS 2014)

Page 58:

Acknowledgements

Advisor: Dr. Gagan Agrawal
Committee members: Dr. P. Sadayappan, Dr. Feng Qin, Dr. Radu Teodorescu
Mentor at PNNL: Dr. Sriram Krishnamoorthy
Current lab members: Dr. Bin Ren, Dr. Tekin Bicer, Linchuan Chen, Yu Su, Yi Wang, Mehmet Can Kurt, Mucahid Kutlu, Jiaqi Liu, Sameh Shohdy
Previous lab members: Dr. Wenjing Ma, Dr. Vignesh Ravi, Dr. David Chiu, Dr. Qian Zhu, Dr. Fan Wang, Dr. Wei Jiang, Dr. Tantan Liu, Jiedan Zhu

Page 59:

Thanks for your attention! Q & A

