
Towards Intelligent Programming Systems for Modern Computing

Computer Science, North Carolina State University

Xipeng Shen

Unprecedented Scale

Sources: SciDAC, IBM

Year 2011: 20 petaflops, 10 MW power

Year 202X: 1000 petaflops, 20 MW power

50X performance

2X power!
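The two data points above imply a large jump in energy efficiency. A quick check of the arithmetic, using only the figures on the slide (the variable and function names are illustrative):

```python
# Figures taken from the slide, as (petaflops, megawatts).
y2011 = (20.0, 10.0)
y202x = (1000.0, 20.0)


def efficiency_ratio(old, new):
    """How many times more petaflops per megawatt `new` delivers than `old`."""
    old_eff = old[0] / old[1]
    new_eff = new[0] / new[1]
    return new_eff / old_eff


perf_gain = y202x[0] / y2011[0]                    # performance ratio
power_gain = y202x[1] / y2011[1]                   # power ratio
efficiency_gain = efficiency_ratio(y2011, y202x)   # performance-per-watt ratio
```

With these numbers, performance per watt has to improve by 25X to reach the target.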

Heterogeneity Becomes the Norm

Massively parallel accelerators are becoming ubiquitous.

Thesis

To address the challenges in modern computing, one key lies in making programming systems more intelligent.

For advancing programming systems, the right problem formulation goes a long way.

Modern Computing

Application: Data analytics, machine learning, …

Infrastructure: Data centers, cloud, IoT, …

Architecture: Heterogeneous parallel processors, emerging complex memory, …

TOP: Algorithmic optimizer for data analytics [VLDB’15, ICML’15]

GStreamline + PORPLE: Memory optimization for GPU [ASPLOS’11, Micro’14, ICS’16]

TOP: Enabling Algorithmic Optimizations for Distance-Related Problems

VLDB’2015, ICML’2015

Speedups of up to hundreds of X.

Yufei Ding

Role of Compiler

[Diagram labels: Learning Problem; ML Algorithm; Implementation; Execution; ML experts; compiler; runtime system/architecture; co-design]

Can compilers optimize algorithms?

Why algorithm level?

Reason 1: Large benefits: orders of magnitude speedups at no extra cost.

Reason 2: Compiler may outsmart ML experts. Really?

Example

Triangle Inequality (TI): |a - b| ≤ d ≤ a + b
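To make the inequality concrete, here is a minimal sketch of how the bounds can save a distance computation. It assumes Euclidean distance and a third point L whose distances to both endpoints are already known; the function names are illustrative and not taken from the talk.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def ti_bounds(a, b):
    """Bounds on d(q, t), given a = d(q, L) and b = d(L, t) for some point L."""
    return abs(a - b), a + b


def can_skip(best_so_far, a, b):
    """True if t cannot be closer to q than the best distance found so far,
    so d(q, t) does not need to be computed at all."""
    lower, _ = ti_bounds(a, b)
    return lower >= best_so_far
```

For example, with a = 10 and b = 1 the true distance must lie in [9, 11], so a candidate with a current best of 5 can be discarded without computing d.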

K-Means

NIPS’2012

K-Means

SIAM’2010

ICML’2003

IJCNN’11

VisionInterface’10

SSDM’10

P2P: Point-to-Point Shortest Path

SIAM’05

ALENEX’04

Observations

• TI has led to many enhanced algorithms across problems and domains.

• Applying TI well is tricky, hence the many manual efforts and publications.

Thoughts

• Can we have one abstraction that represents all these problems?

• Can we then generalize the TI optimizations into compiler-based transformations?

Abstract Distance Problem (Q, T, D, R, C)

• Q: query point set

• T: target point set

• D: distance

• R: relation

• C: constraints

Instances: KMeans, KNN, ICP, Shortest Distance, NBody, KNN join
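One way to picture the five-tuple is as a plain record with an unoptimized reference solver. This is a hedged sketch only: the class name, the string encoding of R, the dictionary encoding of C, and solve_naive are all assumptions made for illustration and are not the TOP representation.

```python
import math
from dataclasses import dataclass
from typing import Callable


def euclid(p, q):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


@dataclass
class DistanceProblem:
    Q: list       # query point set
    T: list       # target point set
    D: Callable   # distance function D(q, t)
    R: str        # relation of interest (hypothetical encoding, e.g. "k-closest")
    C: dict       # constraints (hypothetical encoding, e.g. {"k": 1})


def solve_naive(p):
    """Brute force for the "k-closest" relation: for every query point,
    return the indices of its k nearest targets, computing all |Q|*|T| distances."""
    k = p.C["k"]
    result = []
    for q in p.Q:
        order = sorted(range(len(p.T)), key=lambda j: p.D(q, p.T[j]))
        result.append(order[:k])
    return result
```

Under this encoding, a K-means assignment step would use the data points as Q, the centers as T, and k = 1, while KNN would use a larger k.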

Abstract Distance-Related Problem Essence & 7 Principles of TI Optimizations

Our Analysis and Abstraction

Key Insights

• Reuse through landmarks

• Spatial & temporal reuses

• Elasticity through hierarchical landmarks

• Efficient bound updates through ghosts for iterative algorithms

• Order of comparison

[Figure: example query points (q) and target points (t)]

See VLDB’15 for details.
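The first insight, reuse through landmarks, can be sketched as follows. This is an illustrative single-landmark nearest-target search, assuming Euclidean distance and precomputed landmark-to-target distances; it is not the TOP library's implementation, and all names are hypothetical.

```python
import math


def euclid(p, q):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def nearest_with_landmark(q, targets, landmark, landmark_to_target):
    """Find the index of a nearest target to q.

    landmark_to_target[j] holds the precomputed d(landmark, targets[j]) and is
    reused by every query. Returns (index, number of distances computed)."""
    a = euclid(q, landmark)
    computed = 1                      # the query-to-landmark distance
    best_idx, best_d = -1, math.inf
    for j, t in enumerate(targets):
        lower = abs(a - landmark_to_target[j])   # triangle-inequality lower bound
        if lower >= best_d:
            continue                  # t cannot be closer than the current best
        d = euclid(q, t)
        computed += 1
        if d < best_d:
            best_idx, best_d = j, d
    return best_idx, computed
```

With four collinear targets and the query next to the first one, the search computes two distances where brute force would compute four.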

TOP Framework

[Diagram: a basic algorithm description written with the TOP API goes to the compiler, which produces staged program code; with the TI Opt Lib this leads to efficient execution. Other labels: problem semantic; building blocks.]

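TOP_defDistance(Euclidean);                 // use Euclidean distance
T = init();                                 // initial targets (cluster centers)
changedFlag = 1;
while (changedFlag) {
    // for each point in S, find its 1 closest target in T
    N = TOP_findClosestTargets(1, S, T);
    // recompute T from the assignment N; the loop ends when changedFlag is cleared
    TOP_update(T, &changedFlag, N, S);
}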
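For reference, the loop above has the shape of classic K-means. The sketch below is the plain brute-force baseline, not the TOP API or its optimized output; find_closest_targets, update, and kmeans are illustrative stand-ins, and the max_iter guard is an added assumption.

```python
import math


def euclid(p, q):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def find_closest_targets(S, T):
    """For each point in S, the index of its closest center in T (brute force)."""
    return [min(range(len(T)), key=lambda j: euclid(s, T[j])) for s in S]


def update(T, N, S):
    """Recompute each center as the mean of its assigned points.
    Returns (new_centers, changed_flag)."""
    new_T = []
    for j, old in enumerate(T):
        members = [s for s, n in zip(S, N) if n == j]
        if members:
            dim = len(old)
            new_T.append(tuple(sum(m[d] for m in members) / len(members)
                               for d in range(dim)))
        else:
            new_T.append(tuple(old))   # an empty cluster keeps its center
    changed = any(euclid(a, b) > 1e-12 for a, b in zip(T, new_T))
    return new_T, changed


def kmeans(S, T, max_iter=100):
    """Alternate assignment and update until the centers stop moving."""
    N = find_closest_targets(S, T)
    for _ in range(max_iter):
        T, changed = update(T, N, S)
        if not changed:
            break
        N = find_closest_targets(S, T)
    return T, N
```

Every call to find_closest_targets here computes all point-to-center distances; that is the work the TI optimizations aim to avoid.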

Ad hoc

Systematic

On K-Means

[Figure: speedups of TOP and Yinyang K-Means over the baseline, K-Means with K=1024]

Baseline: classic K-means (16GB, 8-core Intel Ivy Bridge)

Code link in ICML’15 paper.

Clustering results are the same as the original method’s.

[Figure, two panels, each plotting the TOP-optimized version against the manually optimized version for Knn, Knnjoin, Kmeans, ICP, Nbody, and P2P, with a reference line. Left panel: speedups (X). Right panel: number of distance calculations.]

Average speedups: 50X vs 20X. Saves at least 93% of the calculations.

Insight: The right abstraction and formulation turn a compiler into an automatic algorithm optimizer, yielding large speedups.

Intel i5-4570 CPU and 8 GB memory

On All Benchmarks

Modern Computing

Application: Data analytics, machine learning, …

Infrastructure: Data centers, cloud, IoT, …

Architecture: Heterogeneous parallel processors, emerging complex memory, …

TOP: Algorithmic optimizer for data analytics [VLDB’15, ICML’15]

GStreamline + PORPLE: Memory optimization for GPU [ASPLOS’11, Micro’14, ICS’16]

Overcome GPU Limitations

Guoyang Chen (Qualcomm)

Bo Wu (Prof. @ Colorado Mines)

Zheng Zhang (Prof. @ Rutgers Univ)

Xipeng Shen xshen5@ncsu.edu

Graphics Processing Unit (GPU)

[Figure label: a SIMD group (warp)]

• Massive parallelism

• Favorable computing power, cost effectiveness, and energy efficiency

Challenges

Irregular Mem & Control

Dyn Task Parallelism

Scheduling Limitations

Our Explorations: Compiler-based software solutions

[Timeline figure. Axis dates: 11/06, 9/07, 5/09, 6/10, 3/11, 10/11, 6/12, 2/13, 9/13, 12/14, 5/15, 6/15, 12/15, 6/16]

• CUDA release

• LCPC talk by David Kirk

• IPDPS: cross input adap. opt.

• ICS: remove thread diverg. dyn.

• ASPLOS: GStreamline

• PACT: treat synch. correct. GPU2CPU

• ICS: syn. relax. & opt. GPU2CPU

• PPOPP: mem coalesc.

• PACT: NVM for GPU

• Micro: PORPLE

• ICS: SM centric

• HotOS: Co-run on Fused

• Micro: Free Launch

• ICS: Multiview

• PPOPP: EffiSha

• Sweet KNN; VersaPipe; Lean DNN; …

• IPDPS: Co-sched on Fused System

Solutions: Compiler-based software solutions

• Irregular Mem & Control: GStreamline & PORPLE [ASPLOS’11, Micro’14, ICS’16]

• Dyn Task Parallelism: FreeLaunch [Micro’15]

• Scheduling Limitations: SM-Centric & EffiSha [ICS’15, PPoPP’17] (Monday PPoPP Session 1)


Dynamic Irregularities

Memory:

P[ ] = { 0, 5, 1, 7, 4, 3, 6, 2}
tid: 0 1 2 3 4 5 6 7
... = A[P[tid]];

Regular case for comparison: P[ ] = { 0, 1, 2, 3, 4, 5, 6, 7}

Control flow (thread divergence):

A[ ]: 2 4 10 0 6 0 0
tid: 0 1 2 3 4 5 6 7
if (A[tid]) {...}
for (i=0; i<A[tid]; i++) {...}

Degrade throughput by up to (warp size - 1) times. (warp size = 32 in modern GPUs)
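The cost of both kinds of irregularity can be estimated with a small model. The sketch below assumes toy sizes chosen to fit the 8-thread example (a warp of 4 threads and memory segments of 4 elements, not the real warp size of 32), and it assumes one memory transaction per distinct segment a warp touches; the function names are illustrative.

```python
def transactions_per_warp(indices, warp_size, seg_elems):
    """Number of distinct memory segments touched by each warp, where
    indices[tid] is the element that thread tid accesses."""
    counts = []
    for start in range(0, len(indices), warp_size):
        warp = indices[start:start + warp_size]
        counts.append(len({i // seg_elems for i in warp}))
    return counts


def loop_utilization(trip_counts, warp_size):
    """Fraction of lane-iterations doing useful work when every warp keeps
    running until its longest-running thread finishes the loop."""
    useful = total = 0
    for start in range(0, len(trip_counts), warp_size):
        warp = trip_counts[start:start + warp_size]
        useful += sum(warp)
        total += max(warp) * len(warp)
    return useful / total if total else 1.0


irregular = transactions_per_warp([0, 5, 1, 7, 4, 3, 6, 2], 4, 4)
regular = transactions_per_warp([0, 1, 2, 3, 4, 5, 6, 7], 4, 4)
# First four loop trip counts from the slide's A[ ] row.
utilization = loop_utilization([2, 4, 10, 0], 4)
```

Under these assumptions the irregular index array needs two transactions per warp where the regular one needs one, and the divergent loop keeps the lanes busy only 40% of the time.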

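Solution 1: Thread-Data Remapping

[Figure: accesses spread over several memory segments cost 4 transactions per warp; accesses within a single memory segment cost 1 transaction per warp]

Irregularity within a warp: problematic; across warps: okay!

Principle of solution: Turn intra-warp irregularity into regularity or into inter-warp irregularity.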

Trans-1: Data Reordering

Original:

P[ ] = {0,5,2,3,2,3,7,6}
tid: 0 1 2 3 4 5 6 7
... = A[P[tid]];

Transformed:

Q[ ] = {0,1,2,3,2,3,6,7}
tid: 0 1 2 3 4 5 6 7
... = A’[Q[tid]];

Data are relocated from A to A’ while the mapping between threads and data values is maintained.

[Figure legend: tid = thread ID; the other symbols mark a thread, a data access, and a data relocation]
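The data reordering transformation can be sketched in a few lines. The relocation map new_pos below (swap elements 1 and 5, and 6 and 7) is an assumption chosen because it reproduces the Q[ ] shown on the slide; how such a map is actually derived is not covered here, and the function name is illustrative.

```python
def reorder_data(A, P, new_pos):
    """Relocate A[i] to A_new[new_pos[i]] and redirect the index array so that
    A_new[Q[tid]] equals A[P[tid]] for every thread."""
    A_new = [None] * len(A)
    for old, new in enumerate(new_pos):
        A_new[new] = A[old]
    Q = [new_pos[p] for p in P]
    return A_new, Q


A = [10, 11, 12, 13, 14, 15, 16, 17]     # placeholder data values
P = [0, 5, 2, 3, 2, 3, 7, 6]             # index array from the slide
new_pos = [0, 5, 2, 3, 4, 1, 7, 6]       # assumed relocation: old index -> new index
A_new, Q = reorder_data(A, P, new_pos)
```

Each thread still reads the same value as before; only the location of the data and the index array change.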

Trans-2: Job Swapping

• Job = operations + data elements accessed

Original:

P[ ] = {0,5,2,3,2,3,7,6}
tid: 0 1 2 3 4 5 6 7
... = A[P[tid]];

Transformed:

Q[ ] = {0,4,2,3,1,5,6,7}
tid: 0 1 2 3 4 5 6 7
newtid = Q[tid];
...
... = A[P[newtid]];
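Job swapping leaves the data in place and changes which thread runs which job. The sketch below applies the slide's Q[ ] and counts segments touched per warp, again assuming toy sizes (a warp of 4 threads, segments of 4 elements) and one transaction per distinct segment; the function names are illustrative.

```python
def swap_jobs(P, Q):
    """Element each thread accesses once thread tid takes over job Q[tid]."""
    return [P[Q[tid]] for tid in range(len(Q))]


def segments_per_warp(indices, warp_size, seg_elems):
    """Number of distinct memory segments touched by each warp."""
    return [len({i // seg_elems for i in indices[s:s + warp_size]})
            for s in range(0, len(indices), warp_size)]


P = [0, 5, 2, 3, 2, 3, 7, 6]     # index array from the slide
Q = [0, 4, 2, 3, 1, 5, 6, 7]     # job assignment from the slide
accessed = swap_jobs(P, Q)
before = segments_per_warp(P, 4, 4)
after = segments_per_warp(accessed, 4, 4)
```

Because Q is a permutation of the thread IDs, every job still runs exactly once. Under the toy sizes the first warp drops from two segments to one.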

G-Streamline [ASPLOS’2011]

1.08X to 2.5X speedups

First framework enabling runtime thread-data remapping.

CPU-GPU pipeline to hide transformation overhead.

Kernel splitting to resolve dependences.

Solution 2: Data Placement

GPU Memory

• Global memory: coalescing; cache hierarchy

• Texture memory: 2D/3D locality; texture cache; read-only

• Shared memory: on-chip; bank conflicts

• Constant memory: broadcasting; cached; read-only

• L1/L2 cache: private/shared

• Read-only cache: read-only data

• Texture cache: 2D/3D locality; read-only

Data Placement Problem

[Figure: data in a program can be placed in global memory, texture memory, shared memory, or constant memory, served through the L1/L2 cache, read-only cache, and texture cache]

3X performance difference

Data Placement Problem

Properties:

• Machine dependent: changes across models/generations

• Input dependent: changes across runs

Options:

• Manual efforts by programmers?

• Offline autotuning?

PORPLE as a Whole

• MSL (memory specification language)

• PORPLE-C (compiler)

• PLACER (placing engine)

[Diagram labels: architect/user; mem spec; original program; access patterns; staged program; microkernels; online profile; desired placement; efficient execution. The diagram is divided into an offline part and an online part.]

More details in our Micro’2014 paper.

Properties of PORPLE

• Good portability to new memory: just need a new MSL spec; the program adapts automatically

• Adaptivity to new program inputs: on-the-fly placement with placement-agnostic code

• Generality to regular & irregular programs: static analysis + lightweight online profiling
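To illustrate the kind of decision a placing engine has to make, here is a deliberately simple greedy sketch. It is not PORPLE's placement model (see the Micro’2014 paper for that): the record fields, the per-access costs, the capacities, and the greedy rule are all assumptions invented for this example.

```python
def choose_placement(arrays, memories):
    """Greedy sketch: visit arrays from most to least accessed and give each
    the cheapest memory that is legal for it and still has room.
    Arrays that fit nowhere are left out of the result."""
    free = {m["name"]: m["capacity"] for m in memories}
    by_cost = sorted(memories, key=lambda m: m["cost"])
    placement = {}
    for arr in sorted(arrays, key=lambda a: a["accesses"], reverse=True):
        for mem in by_cost:
            # Written arrays may not go into a read-only memory space.
            legal = arr["read_only"] or not mem["read_only_space"]
            if legal and free[mem["name"]] >= arr["size"]:
                placement[arr["name"]] = mem["name"]
                free[mem["name"]] -= arr["size"]
                break
    return placement


# Hypothetical machine description and program data (made-up numbers).
memories = [
    {"name": "shared", "capacity": 48, "cost": 1, "read_only_space": False},
    {"name": "constant", "capacity": 64, "cost": 2, "read_only_space": True},
    {"name": "global", "capacity": 10**9, "cost": 10, "read_only_space": False},
]
arrays = [
    {"name": "A", "size": 40, "accesses": 1000, "read_only": False},
    {"name": "B", "size": 40, "accesses": 500, "read_only": True},
    {"name": "C", "size": 16, "accesses": 100, "read_only": False},
]
placement = choose_placement(arrays, memories)
```

Changing the memory table (a different GPU) or the access counts (a different input) changes the result, which is why the slides describe placement as both machine dependent and input dependent.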

GPU Models

• K20c

• M2075

• C1060

Potential for Future Memory Systems

3D Stacked Memory

Persistent Memory

DRAM (NUMA)

Final Takeaways

• Large potential of compilers for modern computing

• The right problem formulation is key

TOP: An algorithmic optimizer. Up to 100x speedups.

PORPLE: Portable solution to memory complexity. Consistent speedups across GPUs.

GStreamline
