Post on 20-Jan-2021
transcript
Towards Intelligent Programming Systems for Modern Computing
Computer Science, North Carolina State University
Xipeng Shen
Unprecedented Scale
Sources: SciDAC, IBM
2011: 20 petaflops, 10 MW power
202X: 1000 petaflops, 20 MW power
50X perf
2X power!
Heterogeneity Becomes the Norm
Massively parallel accelerators are becoming ubiquitous.
Thesis
To address the challenges in modern computing, one key lies in making programming systems more intelligent.
For advancing programming systems, the right problem formulation goes a long way.
Modern Computing
Application: Data analytics, Machine learning, …
Infrastructure: Data centers, Cloud, IoT, …
Architecture: Heterogeneous parallel processors, Emerging complex memory, …
TOP: Algorithmic optimizer for data analytics [VLDB’15, ICML’15]
GStreamline + PORPLE: Memory optimization for GPU [ASPLOS’11, Micro’14, ICS’16]
TOP: Enabling Algorithmic Optimizations for Distance-Related Problems
VLDB’2015, ICML’2015
Speedups of up to hundreds of times.
Yufei Ding
Role of Compiler
[Diagram: the stack from Learning Problem to ML Algorithm, Implementation, and Execution, annotated with ML experts, compiler, runtime system/architecture, and co-design.]
Can compilers optimize algorithms?
Why algorithm level?
Reason 1: Large benefits: orders of magnitude speedups at no extra cost.
Reason 2: Compiler may outsmart ML experts. Really?
Example
Triangular Inequality: a - b ≤ d ≤ a + b
[Figure: a triangle with sides a, b, and d, shown with cluster centers C1 and C2 (C2 marked X) in a K-Means example (NIPS’2012).]
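As a minimal sketch of how the inequality above can save work, the following Python function decides which of two centers is closer to a point and, when the bound allows it, skips the second distance computation. The function name and the two-center setting are illustrative assumptions, not code from the cited papers.

```python
import math

def closer_center_with_ti(x, c1, c2):
    """Return (index of the closer center, number of point-to-center distances computed).

    Illustrative only: a is dist(x, c1), b is dist(c1, c2), d is dist(x, c2).
    """
    a = math.dist(x, c1)   # point to center 1
    b = math.dist(c1, c2)  # center to center, reusable across all points
    # Triangle inequality gives d >= b - a. If b - a >= a, then d >= a,
    # so center 2 cannot be closer and d never needs to be computed.
    if b - a >= a:
        return 0, 1
    d = math.dist(x, c2)
    return (0 if a <= d else 1), 2
```

For a point at (0, 0) with centers at (1, 0) and (10, 0), the bound alone settles the answer, so only one point-to-center distance is computed.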
K-Means: SIAM’2010, ICML’2003
K-NN: IJCNN’11, VisionInterface’10, SSDM’10
P2P (Point-to-Point Shortest Path): SIAM’05, ALENEX’04
Observations
• TI has led to many enhanced algorithms across problems and domains.
• Applying TI well is tricky, hence the many manual efforts and publications.
Thoughts
• Can we have an abstraction to represent all the problems?
• Can we then generalize the TI optimizations into compiler-based transformations?
Abstract Distance Problem (Q, T, D, R, C)
Q: Query Point Set
T: Target Point Set
D: Distance
R: Relation
C: Constraints
Instances: KMeans, KNN, ICP, Shortest Distance, NBody, KNN join
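One way to picture the five-tuple is as a small record plus a generic solver. The sketch below is an assumption for illustration: the class name, the field names, and the simplification of the relation R to "the k closest targets" are hypothetical and are not the TOP API.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]

def accept_all(q: Point, t: Point) -> bool:
    """Default constraint: every (query, target) pair is allowed."""
    return True

@dataclass
class DistanceProblem:
    queries: Sequence[Point]                               # Q
    targets: Sequence[Point]                               # T
    distance: Callable[[Point, Point], float]              # D
    k: int                                                 # R, simplified to "k closest targets"
    constraint: Callable[[Point, Point], bool] = accept_all  # C

def solve_naively(p: DistanceProblem) -> List[List[int]]:
    """Brute force: for each query, indices of its k closest allowed targets."""
    result = []
    for q in p.queries:
        candidates = sorted(
            (p.distance(q, t), j)
            for j, t in enumerate(p.targets)
            if p.constraint(q, t)
        )
        result.append([j for _, j in candidates[:p.k]])
    return result
```

Under this simplification, KNN is the tuple with k neighbors per query, and the assignment step of KMeans is the same tuple with the cluster centers as targets and k = 1.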
Our Analysis and Abstraction
KMeans, KNN, ICP, Shortest Distance, NBody, KNN join
Abstract Distance-Related Problem: Essence & 7 Principles of TI Optimizations
Key Insights
• Reuse through landmarks
• Spatial & temporal reuses
• Elasticity through hierarchical landmarks
• Efficient bounds update through ghosts for iterative algorithms
• Order of comparison
[Figure: a landmark L with query points q1, q2, q3 and target points t1, t2, t3.]
See VLDB’15 for details.
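The landmark idea can be sketched as follows, under the assumption of a single landmark and a nearest-neighbor query. Distances from the landmark to every target are computed once and reused for all queries; each query then pays one distance to the landmark and uses the triangle inequality to skip targets that cannot win. The function and parameter names are hypothetical, and this is not the algorithm from the VLDB’15 paper.

```python
import math

def nearest_with_landmark(q, targets, landmark, landmark_to_target):
    """Return (index of the nearest target, number of query-to-target distances computed).

    landmark_to_target[j] must hold dist(landmark, targets[j]), precomputed once.
    """
    d_ql = math.dist(q, landmark)  # one extra distance per query
    best_j, best_d, computed = -1, math.inf, 0
    for j, t in enumerate(targets):
        # Lower bound on dist(q, t) from the triangle inequality.
        lower = abs(d_ql - landmark_to_target[j])
        if lower >= best_d:
            continue  # this target cannot beat the current best
        d = math.dist(q, t)
        computed += 1
        if d < best_d:
            best_j, best_d = j, d
    return best_j, computed
```

How many targets get skipped depends on where the landmark sits relative to the data and on the order in which targets are compared, which is one reason the slide lists landmark choice and order of comparison as separate insights.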
TOP Framework
KMeans, KNN, ICP, Shortest Distance, NBody, KNN join
Abstract Distance-Related Problem: Essence & 7 Principles of TI Optimizations
[Diagram: TOP API, Compiler, and Opt Lib, with the labels "problem semantic" and "building blocks".]
TOP API
Basic algorithm description
Compiler
Staged program code
TI Opt Lib
Efficient execution
Usage
TOP_defDistance(Euclidean);
T = init();
changedFlag = 1;
while (changedFlag) {
    N = TOP_findClosestTargets(1, S, T);
    TOP_update(T, &changedFlag, N, S);
}
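For readers unfamiliar with the loop shape, here is a plain Python K-Means that follows the same structure as the usage code: find the closest target for each point, update the targets, and repeat until nothing changes. It is an unoptimized baseline written for illustration; the helper names are hypothetical stand-ins and do not reproduce the TOP API or its TI optimizations.

```python
import math

def find_closest_targets(points, centers):
    """For each point, the index of its nearest center (no TI pruning)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]

def update(centers, assignment, points):
    """Move each center to the mean of its points; return True if any center moved."""
    changed = False
    for j in range(len(centers)):
        members = [p for p, a in zip(points, assignment) if a == j]
        if not members:
            continue  # leave the center of an empty cluster where it is
        new_center = tuple(sum(c) / len(members) for c in zip(*members))
        if new_center != centers[j]:
            centers[j] = new_center
            changed = True
    return changed

def kmeans(points, initial_centers):
    """Iterate assignment and update until the centers stop changing."""
    centers = list(initial_centers)
    changed = True
    while changed:
        assignment = find_closest_targets(points, centers)
        changed = update(centers, assignment, points)
    return centers, assignment
```

Every iteration of this baseline computes all point-to-center distances, which is the cost that TI-based pruning aims to cut.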
Ad hoc
Systematic
[Figure: speedup (X) on K-Means (K=1024) for TOP and Yinyang K-Means. Baseline: Classic K-means (16GB, 8-core Intel Ivy Bridge).]
Code link in ICML’15 paper.
Clustering results are the same as the original method’s.
On K-Means
[Figure: speedup (X) for TOP and Yinyang K-Means. Baseline: Classic K-means (16GB, 8-core).]
On All Benchmarks
[Figure, panel "Speedups": speedups (X) by TOP version against speedups (X) by manual version, log-scale axes up to 10^4, for Knn, Knnjoin, Kmeans, ICP, Nbody, and P2P, with a reference line.]
[Figure, panel "# distance calculations": count in TOP version against count in manual version, log-scale axes from 10^6 to 10^13, for the same benchmarks, with a reference line.]
Average speedups: 50X vs 20X. At least 93% of distance calculations are saved.
Insight: The right abstraction and formulation turn a compiler into an automatic algorithm optimizer, yielding large speedups.
Intel i5-4570 CPU and 8G memory
Overcome GPU Limitations
Guoyang Chen (Qualcomm)
Bo Wu (Prof. @ Colorado Mines)
Zheng Zhang (Prof. @ Rutgers Univ)
Xipeng Shen xshen5@ncsu.edu
Graphics Processing Unit (GPU)
[Figure label: a SIMD group (warp)]
• Massive parallelism
• Favorable computing power, cost effectiveness, and energy efficiency
Challenges
Irregular Mem & Control
Dyn Task Parallelism
Scheduling Limitations
Our Explorations
Compiler-based software solutions
[Timeline with marks at 11/06, 9/07, 5/09, 6/10, 3/11, 10/11, 6/12, 2/13, 9/13, 12/14, 5/15, 6/15, 12/15, 6/16, 2/17, and 5/17. Events, in slide order:]
• CUDA release
• LCPC talk by David Kirk
• IPDPS cross input adap. opt.
• ICS remove thread diverg. dyn.
• ASPLOS GStreamline
• PACT treat synch. correct. GPU2CPU
• ICS syn. relax. & opt. GPU2CPU
• PPOPP mem coalesc.
• PACT NVM for GPU
• Micro PORPLE
• ICS SM centric
• HotOS Co-run on Fused
• Micro Free Launch
• ICS Multiview
• PPOPP EffiSha
• IPDPS Co-sched on Fused System (5/17)
• Sweet KNN; VersaPipe; Lean DNN; …
Solutions
Compiler-based software solutions
Irregular Mem & Control: GStreamline & PORPLE [asplos11, micro14, ics16]
Dyn Task Parallelism: FreeLaunch [micro15]
Scheduling Limitations: SM-Centric & EffiSha [ics15, ppopp17] (Monday PPoPP Session 1)
Dynamic Irregularities
Memory:
P[ ] = { 0, 5, 1, 7, 4, 3, 6, 2}
tid: 0 1 2 3 4 5 6 7
... = A[P[tid]];

Control flow (thread divergence):
A[ ]: 2 4 10 0 6 0 0
tid: 0 1 2 3 4 5 6 7
if (A[tid]) {...}
for (i=0;i<A[tid]; i++) {...}

Degrade throughput by up to (warp size - 1) times. (warp size = 32 in modern GPUs)
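Solution 1: Thread-Data Remapping
[Figure: irregular accesses within a warp spread over several memory segments (4 trans/warp); regular accesses, as with P[ ] = { 0, 1, 2, 3, 4, 5, 6, 7}, fall in one memory segment (1 trans/warp).]
Irregularity in a warp: problematic; across warps: okay!
Principle of solution: Turn intra-warp irregularity into regularity or into inter-warp irregularity.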
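A rough way to see why intra-warp irregularity matters is to count how many memory segments each warp touches, since each distinct segment costs one transaction in this simplified model. The sketch below assumes a toy warp size and segment size of 4 elements each (real GPUs use 32 threads per warp); the function name and parameters are illustrative.

```python
def transactions_per_warp(indices, warp_size, segment_size):
    """Count the distinct memory segments touched by each warp.

    indices[tid] is the element index that thread tid accesses.
    """
    counts = []
    for start in range(0, len(indices), warp_size):
        warp = indices[start:start + warp_size]
        counts.append(len({i // segment_size for i in warp}))
    return counts

irregular = [0, 5, 1, 7, 4, 3, 6, 2]  # P[ ] from the Dynamic Irregularities slide
regular = [0, 1, 2, 3, 4, 5, 6, 7]

print(transactions_per_warp(irregular, warp_size=4, segment_size=4))
print(transactions_per_warp(regular, warp_size=4, segment_size=4))
```

Under these toy parameters the irregular pattern needs two transactions per warp and the regular one needs one; the exact counts on the slide depend on the hardware parameters used in its figure.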
Trans-1: Data Reordering
Original:
P[ ] = {0,5,2,3,2,3,7,6}
... = A[P[tid]];

Transformed (<relocation> of A[ ] into A’[ ], plus <redirection> through Q[ ]):
Q[ ] = {0,1,2,3,2,3,6,7}
... = A’[Q[tid]];

Maintain mapping between threads & data values.
[Figure: threads tid 0 to 7 accessing A[ ] (original) and A’[ ] (transformed). Legend: tid = thread ID, with symbols for a thread, a data access, and a data relocation.]
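The relocation and redirection steps can be sketched on the host side as below. The placement rule used here (each value moves to the slot of the first thread that accesses it) is an assumption chosen because it reproduces the Q[ ] shown on the slide; G-Streamline's actual algorithm may differ, and the function name is hypothetical.

```python
def reorder_data(A, P):
    """Build A_new and Q so that A_new[Q[tid]] == A[P[tid]] for every thread.

    Each distinct element is relocated to the slot of the first thread that uses it.
    """
    A_new = [None] * len(P)
    Q = []
    slot_of = {}  # original index in A -> slot in A_new
    for tid, src in enumerate(P):
        if src not in slot_of:
            slot_of[src] = tid      # relocation target for this element
            A_new[tid] = A[src]
        Q.append(slot_of[src])      # redirection for this thread
    return A_new, Q

A = [10, 11, 12, 13, 14, 15, 16, 17]
P = [0, 5, 2, 3, 2, 3, 7, 6]
A_new, Q = reorder_data(A, P)
print(Q)  # [0, 1, 2, 3, 2, 3, 6, 7]
```

Threads 0 to 3 now read slots 0 to 3, so their accesses are contiguous, while every thread still sees the same data value as before.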
Trans-2: Job Swapping
• Job = operations + data elements accessed

Original:
P[ ] = {0,5,2,3,2,3,7,6}
... = A[P[tid]];

Transformed (<redirection>):
Q[ ] = {0,4,2,3,1,5,6,7}
newtid = Q[tid];
. . .
... = A[P[newtid]];

[Figure: threads tid 0 to 7 accessing A[ ] in the original and transformed versions.]
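The effect of the swap can be checked with a small simulation that applies Q[ ] from the slide and counts memory segments per warp before and after. The warp size and segment size of 4 are toy assumptions, and the helper names are hypothetical.

```python
def segments_per_warp(accessed, warp_size, segment_size):
    """Number of distinct memory segments each warp touches."""
    return [
        len({i // segment_size for i in accessed[s:s + warp_size]})
        for s in range(0, len(accessed), warp_size)
    ]

def accesses_after_swap(P, Q):
    """Element index each thread reads once it takes over job Q[tid]."""
    return [P[Q[tid]] for tid in range(len(P))]

P = [0, 5, 2, 3, 2, 3, 7, 6]
Q = [0, 4, 2, 3, 1, 5, 6, 7]  # threads 1 and 4 exchange jobs

print(segments_per_warp(P, 4, 4))                          # [2, 2]
print(segments_per_warp(accesses_after_swap(P, Q), 4, 4))  # [1, 2]
```

Unlike data reordering, the data stays where it is; only the assignment of jobs to threads changes, so the first warp's accesses end up in a single segment.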
G-Streamline [ASPLOS’2011]
1.08X to 2.5X speedups
First framework enabling runtime thread-data remapping.
CPU-GPU pipeline to hide transformation overhead.
Kernel splitting to resolve dependences.
Solution 2: Data Placement
[Figure: GPU memory types: global memory, texture memory, shared memory, constant memory, L1/L2 cache, read-only cache, texture cache.]
GPU Memory
Global memory: coalescing; cache hierarchy
Texture memory: 2D/3D locality; texture cache; read-only
Shared memory: on-chip; bank conflicts
Constant memory: broadcasting; cached; read-only
L1/L2 cache: private/shared
Read-only cache: read-only data
Texture cache: 2D/3D locality; read-only
Data Placement Problem
[Figure: data in a program (A, B, C, D, …) to be placed into global memory, texture memory, shared memory, or constant memory (caches shown: L1/L2 cache, read-only cache, texture cache). Which data goes where?]
3X performance difference
Data Placement Problem
Properties:
Machine dependent
Changes across models/generations
Input dependent
Changes across runs
Options:
Manual efforts by programmers?
Offline autotuning?
PORPLE as a Whole
Components: PLACER (placing engine), MSL (memory specification language), PORPLE-C (compiler)
[Diagram, split into offline and online phases, with the labels: architect/user, mem spec, org. program, access patterns, staged program, microkernels, input, online profile, desired placement, efficient execution.]
More details in our Micro’2014 paper.
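To make the placement decision concrete, here is a deliberately tiny model: every array has a size and an access count, every memory has a capacity and a cost per access, and an exhaustive search picks the cheapest placement that fits. All numbers, names, and the cost model itself are hypothetical; this is not PORPLE's placement engine or its performance model.

```python
from itertools import product

def best_placement(arrays, memories):
    """Pick the cheapest feasible placement under a toy cost model.

    arrays:   {name: (size, number_of_accesses)}
    memories: {name: (capacity, cost_per_access)}
    Returns (placement dict, total cost), or (None, inf) if nothing fits.
    """
    names = list(arrays)
    best, best_cost = None, float("inf")
    for choice in product(memories, repeat=len(names)):
        used = {m: 0 for m in memories}
        for n, m in zip(names, choice):
            used[m] += arrays[n][0]
        if any(used[m] > memories[m][0] for m in memories):
            continue  # this placement exceeds some memory's capacity
        cost = sum(arrays[n][1] * memories[m][1] for n, m in zip(names, choice))
        if cost < best_cost:
            best, best_cost = dict(zip(names, choice)), cost
    return best, best_cost
```

Because the access counts depend on the program input and the capacities and costs depend on the GPU model, the best placement in such a model changes across inputs and machines, which matches the "input dependent" and "machine dependent" properties listed for the data placement problem.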
Properties of PORPLE
• Good portability to new memory
• Just need new MSL spec
• Program adapts automatically
• Adaptivity to new program inputs
• On-the-fly placement with placement-agnostic code.
• Generality to regular & irregular programs
• Static analysis + lightweight online profiling
GPU Models
• K20c
• M2075
• C1060
Potential for Future Memory Systems
3D Stacked Memory
Persistent Memory
DRAM (NUMA)
Final Takeaways
• Large potential of compilers for modern computing
• Right problem formulation is a key
TOP: An algorithmic optimizer. Up to 100x speedups.
PORPLE: Portable solution to memory complexity. Consistent speedups across GPUs.
GStreamline