BERKELEY PAR LAB

Bridging the Performance-Productivity Gap with Selective Embedded Just-In-Time Specialization

Shoaib Kamil CSAIL, MIT

Armando Fox, Katherine Yelick UPCRC, EECS Dept, UC Berkeley

DSMC 2012

Page 2

Productivity-Performance Gap

• Domain scientists want to write code in high-level languages that match domain

x = A \ b   or   model = gmm(...)

• Not worry about typing, parallelism, etc.

implicit def s2r[A,_,I<:Seq[A]](xs: I) { … }

Page 3

Productivity-Performance Gap

• Domain scientists want to write code in high-level languages that match domain

• For best performance, must rely on efficiency programmer

• Optimized code highly dependent on platform

Productivity code: 1/10 the LOC, but 1/100 the performance. Efficiency code: 10-100x the LOC, for 100x the performance.

Bryan Catanzaro & PALLAS Group

Page 4

“Our” Pattern Language (OPL-2010) (Kurt Keutzer, Tim Mattson)

Applications

Structural Patterns: Model-View-Controller, Iterative-Refinement, Map-Reduce, Layered-Systems, Arbitrary-Static-Task-Graph, Pipe-and-Filter, Agent-and-Repository, Process-Control, Event-Based/Implicit-Invocation, Puppeteer

Computational Patterns: Graph-Algorithms, Dynamic-Programming, Dense-Linear-Algebra, Sparse-Linear-Algebra, Unstructured-Grids, Structured-Grids, Graphical-Models, Finite-State-Machines, Backtrack-Branch-and-Bound, N-Body-Methods, Circuits, Spectral-Methods, Monte-Carlo

Concurrent Algorithm Strategy Patterns: Task-Parallelism, Divide-and-Conquer, Data-Parallelism, Pipeline, Discrete-Event, Geometric-Decomposition, Speculation

Implementation Strategy Patterns:
  Program structure: SPMD, Data-Par/index-space, Fork/Join, Actors, Loop-Par., Task-Queue
  Data structure: Distributed-Array, Shared-Data, Shared-Queue, Shared-map, Partitioned Graph

Parallel Execution Patterns: MIMD, SIMD, Thread-Pool, Task-Graph, Transactions, Message-Passing, Collective-Comm., Transactional memory, Point-To-Point-Sync. (mutual exclusion), collective sync. (barrier), Memory sync/fence

Concurrency Foundation constructs (not expressed as patterns): Thread creation/destruction, Process creation/destruction

[Diagram spans the software stack from applications down to implementation, annotated with the example A = M x V.]

Page 5

Example: Stencil Computations

for (int i = 1; i < nx-1; i++)
  for (int j = 1; j < ny-1; j++)
    output[i][j] = f(output[i][j], neighbors(input, i, j));

• The function f() changes application-to-application

• Tuning of loops requires information about input set
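A minimal pure-Python sketch of this loop nest; the list-of-lists grid and the averaging choice of f are illustrative, not part of the DSL described later:

```python
def apply_stencil(inp, f):
    """Apply f at every interior point of a 2-D grid (list of lists),
    mirroring the nested loops on the slide."""
    nx, ny = len(inp), len(inp[0])
    out = [row[:] for row in inp]          # copy so edges pass through
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            nbrs = [inp[i-1][j], inp[i+1][j], inp[i][j-1], inp[i][j+1]]
            out[i][j] = f(out[i][j], nbrs)
    return out

# Example f: average of the four neighbors (a Jacobi-style update).
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 4.0
result = apply_stencil(grid, lambda old, nbrs: sum(nbrs) / len(nbrs))
```

Different applications swap in a different f; tuning the loops themselves (blocking, parallelization) is what the specializer automates.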


Page 6

What is an embedded DSL?

• DSL compiler using a host language’s syntax

– Common example: macro rewriting as in Lisp

– Difficulty depends on language capabilities

• Leverage capabilities of host language

• Often not the same semantics as the host language

– Contrast with APIs & libraries
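As a generic illustration of the idea (not Asp's actual API), operator overloading lets a Python-embedded DSL reuse the host language's syntax while building an expression tree to compile later, rather than evaluating eagerly:

```python
class Expr:
    """Nodes of a tiny expression DSL embedded in Python syntax."""
    def __add__(self, other): return BinOp('+', self, wrap(other))
    def __mul__(self, other): return BinOp('*', self, wrap(other))

class Var(Expr):
    def __init__(self, name): self.name = name
    def to_c(self): return self.name

class Const(Expr):
    def __init__(self, value): self.value = value
    def to_c(self): return repr(self.value)

class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def to_c(self):
        return f"({self.left.to_c()} {self.op} {self.right.to_c()})"

def wrap(x):
    return x if isinstance(x, Expr) else Const(x)

# Host syntax, DSL semantics: this builds a tree; nothing is evaluated
# until we generate backend code from it.
x, y = Var('x'), Var('y')
expr = x * 2 + y
print(expr.to_c())
```

This is the sense in which an embedded DSL differs from a plain library: the same `+` and `*` carry different semantics (tree construction), which a specializer can then compile.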


Page 7

“Stovepipes”: Using DSELs to Architect Applications

[Diagram: three apps (App 1, App 2, App 3) feed domain-specific “stovepipes” (Dense, Sparse, Graph Trav.) that target Multicore, GPU, and “Cloud” platforms.]

A single program expresses the computation; “stovepipes” turn the computation into optimized code at run time.


Page 8

Overview

• Motivation: Productivity-Performance Gap

• SEJITS Methodology

• Asp & DSELs for Python

• Mechanisms for DSEL Implementation

• Future/Current Work

• Conclusions


Page 9

Selected Embedded JIT Specialization

[Diagram: a productivity app (.py) runs in the interpreter; calls such as f() and B.h() are intercepted by the Asp Framework's DSEL compiler, which emits .c source, invokes cc/ld to build a .so, caches the result, and loads it to run on the OS/HW.]

Page 10

Selected Embedded Just-In-Time Specialization (SEJITS)

• Domain scientists (aka productivity programmers) write code in embedded DSLs

• Efficiency programmers create embedded DSLs instead of one-off libraries or application optimization

• Separation of concerns

• “Invisible” to productivity programmers

– Except it runs fast


Page 11

SEJITS Methodology

• Goal: productive portable performance

• Add DSEL support to productivity languages
  – Leverage features of modern “scripting” languages

– Leverage existing libraries for these languages

• Use external parallelizing/optimizing compilers
  – Leverage existing expertise of efficiency programmers

– Leverage existing high-performance external libraries

• Use auto-tuning
  – Search over multiple implementations


Page 12

Auto-tuning: Empirical Search for Best Performance

• Determining the best low-level code a priori is difficult, even for expert programmers

• Idea: generate many parameterized versions of a kernel

• Run all of them on the target machine and choose the fastest

• Usually run at install-time
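The search itself can be sketched in a few lines of Python; the kernel variants and timing harness below are illustrative stand-ins for real generated code:

```python
import timeit

def sum_loop(xs):
    total = 0.0
    for v in xs:
        total += v
    return total

def sum_builtin(xs):
    return sum(xs)

def sum_unrolled(xs):
    # unroll by 2 (assumes an even length, to keep the sketch short)
    total = 0.0
    for i in range(0, len(xs) - 1, 2):
        total += xs[i] + xs[i + 1]
    return total

def autotune(variants, data, reps=5):
    """Time each variant on representative input; return the fastest."""
    timings = {f.__name__: min(timeit.repeat(lambda: f(data),
                                             number=100, repeat=reps))
               for f in variants}
    best = min(timings, key=timings.get)
    return best, timings

data = [0.5] * 1000
best, timings = autotune([sum_loop, sum_builtin, sum_unrolled], data)
```

The winner depends on the machine, which is exactly why the search is empirical and usually run once at install time.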


Page 13

Auto-tuning Matrix Multiply

[Figures from Bilmes et al., PHiPAC Tech Report (MFLOPS vs. square matrix size; higher is better):
Figure 10: single-precision matrix multiply on a Sparcstation-20/61.
Figure 11: single-precision matrix multiply on a 100 MHz SGI Indigo R4K, with SGEMM from SGI's libblas_mips2_serial.a library.
Figure 12: double-precision matrix multiply on an HP 712/80i, with DGEMM from the pa1.1 version of libvec.a in HP's compiler distribution.
Legend: Naïve Code (C/FORTRAN, 3 nested loops); Vendor-provided Expert Code (SGEMM/DGEMM); Auto-tuned Code (PHiPAC). In each plot, auto-tuned PHiPAC code is competitive with the vendor-provided expert code and far outperforms the naïve 3-loop version.]

Page 14

Asp is SEJITS for Python

• Proof of concept framework for “easy-to-build” embedded parallel DSLs

• Pragmatic choice: Python is widely used in the scientific community

• DSL implementers can use some or all of the building blocks provided

http://sejits.org


Page 15

Implemented DSELs/Libraries

DSEL/Library | Platforms
Stencil/Structured Grid | x86+OpenMP
Semantic Graphs: Filtering & Semiring Operations in KDT | x86+MPI
Parallel Map | x86+processes, cloud
Gaussian Mixture Modeling | CUDA, Cilk Plus
CA Matrix Powers for CA Krylov Subspace Methods | x86+pthreads
Bag of Little Bootstraps* | x86+Cilk Plus, Cloud via Spark
GraphLab DSEL for Machine Learning via Graphs* | x86+pthreads
CA Parallel Recursive Structural Pattern* | x86+Cilk Plus

Page 16

Stencil DSEL Performance

11x faster than an auto-parallelizing compiler; ~2.5x faster than the state-of-the-art non-auto-tuning DSL.

Geometric mean of 93% of attainable peak.


Page 17

Stencil DSEL Performance


Page 18

Communication-Avoiding Recursive Matrix Multiply

• For recursive algorithms with a particular branching factor relative to memory usage

• Choose when to perform parallel steps vs serial steps

• Optimal choice attains lower bounds on communication for matrix multiply
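A hypothetical sketch of the recursion shape (pure Python, serial): CARMA-style code splits the largest of the three matrix dimensions at each level, and a real implementation would decide per level whether to execute the two halves as a parallel step or a serial step:

```python
def carma_splits(m, k, n, base=64):
    """Record which dimension a CARMA-style recursion would split at each
    level: always the largest of (m, k, n), halving until the base case.
    A real implementation would run the two halves in parallel when memory
    allows (a 'parallel step') and sequentially otherwise."""
    plan = []
    while max(m, k, n) > base:
        if m >= k and m >= n:
            plan.append('m'); m //= 2      # split rows of A and C
        elif n >= k:
            plan.append('n'); n //= 2      # split columns of B and C
        else:
            plan.append('k'); k //= 2      # split the inner dimension
        # splitting k sums two partial products; splitting m or n
        # partitions the output, so those halves are fully independent
    return plan

# A very rectangular multiply splits the long inner dimension repeatedly:
print(carma_splits(128, 8192, 128))
```

Choosing the split dimension this way is what lets the recursion attain the communication lower bounds mentioned above.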

CScADS Autotuning 2012

Page 19

CARMA Performance on NUMA Machine


Beat MKL by 10x, using MKL.

Lipshitz, Schwartz, Eliahu, Spillinger, Demmel, K.

Page 20

Mechanism: Code Templates

• Code snippets in the backend language, interspersed with Python

• For “simple” code generation


void vec_add(float *x, float *y) {
  % for i in range(vectorsize):
  x[${i}] += y[${i}];
  % endfor
}

void vec_add(float *x, float *y) {
  x[0] += y[0];
  x[1] += y[1];
  x[2] += y[2];
}
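The expansion above can be reproduced with any text-templating mechanism; Mako uses exactly the `% for` / `${i}` syntax shown, but to stay dependency-free this sketch builds the same string by hand:

```python
def render_vec_add(vectorsize):
    """Expand the slide's template with plain Python string building.
    (A real framework would delegate to a template engine such as Mako,
    whose '% for' / '${i}' control syntax the slide shows.)"""
    body = "\n".join(f"  x[{i}] += y[{i}];" for i in range(vectorsize))
    return f"void vec_add(float *x, float *y) {{\n{body}\n}}"

print(render_vec_add(3))
```

The point is that the loop bound comes from a run-time Python value, so the generated C can be fully unrolled for the actual vector size.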


Page 21

Mechanism: Phased Transformations

User code → Parse to Python AST → Convert to Domain-Specific IR → Optimize IR → Convert to Backend AST → Optimize Backend AST → Write Out Source Files → Call External Compiler → Load & Run Shared Lib → Return Value
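The first few phases can be sketched with Python's built-in ast module; the toy backend below handles only the node types in its one-line example and is not the Asp implementation:

```python
import ast

SOURCE = "def kernel(a, b):\n    return a * b + 1\n"

# Phase 1: parse user code into a Python AST.
tree = ast.parse(SOURCE)

# Phases 2-4 (collapsed into one toy pass): walk the AST and emit a C
# expression for the function body. Only BinOp/Name/Constant are handled.
C_OPS = {ast.Mult: '*', ast.Add: '+'}

def to_c(node):
    if isinstance(node, ast.BinOp):
        return f"({to_c(node.left)} {C_OPS[type(node.op)]} {to_c(node.right)})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise NotImplementedError(type(node).__name__)

func = tree.body[0]
ret = func.body[0].value
params = ", ".join(f"double {a.arg}" for a in func.args.args)
c_src = f"double {func.name}({params}) {{ return {to_c(ret)}; }}"
print(c_src)
# From here a real pipeline would write c_src to a file, call the
# external compiler, load the shared library, and return the result.
```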


Page 22

Example Code

from stencil_kernel import *

class Laplacian3D(StencilKernel):
    def kernel(self, in_grid, out_grid):
        for x in self.interior_points(out_grid):
            for y in self.neighbors(in_grid, x, 1):
                out_grid[x] += (1.0/6.0) * in_grid[y]



Page 24

Optimized Output

[Figure: generated source code, annotated with the applied optimizations: cache blocking, parallelization, and unrolling/register blocking.]


Page 25

Future/Current Work

• Improved auto-tuning via machine learning

• SEJITS + fast hardware prototyping & co-tuning

– CHISEL project

• Composition in pattern-based frameworks

• Multi-level debugging

• Synthesizing optimized code (versus compiling)


Page 26

Related Work

• Delite (Stanford)

• Petabricks (MIT)
  – Smart auto-tuning for algorithmic choice

• Auto-tuning compilers (Mary Hall)
  – User-guided auto-tuning for general compilers

– Difficult to automate due to domain knowledge required

• Auto-tuning motifs
  – PHiPAC, FFTW, ATLAS, OSKI, Spiral, & more


Page 27

Conclusions

• High performance productive programming is possible with the SEJITS approach

• Also makes it easier to write auto-tuners

• Much work in progress to make it even easier to use

• BSD Licensed, available

http://www.sejits.org/


Page 28

Acknowledgements

• Armando Fox, Katherine Yelick

• Par Lab professors: Krste Asanović, Ras Bodik, James Demmel, Armando Fox, Kurt Keutzer, John Kubiatowicz, David Patterson, Koushik Sen, David Wessel, Katherine Yelick

• Grad students: Scott Beamer, Derrick Coetzee, Henry Cook, Michael Driscoll, Ekaterina Gonina, Jeffrey Morlan, Jonathan Harper, Erin Carson, Nick Knight

• LBNL/external: Aydin Buluc, Sam Williams, Adam Lugowski, John Gilbert, Leonid Oliker, John Shalf

• Intel/Microsoft: Burton Smith, Tim Mattson, Henry Gabb, Robert Geva, Juan Vargas

• Many undergrads

This work was performed at the UC Berkeley Parallel Computing Laboratory (Par Lab), supported by DARPA (contract #FA8750-10-1-0191) and by the Universal Parallel Computing Research Centers (UPCRC) awards from Microsoft Corp. (Award #024263) and Intel Corp. (Award #024894), with matching funds from the UC Discovery Grant (#DIG07-10227) and additional support from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, Oracle, and Samsung.


Page 29

BACKUP SLIDES


Page 30

Introspect to Get AST
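The introspection step can be reproduced with the standard library: recover a live function object's source, then parse it into an AST. A stdlib function is used here so the source is guaranteed to exist on disk; in Asp it would be the productivity programmer's kernel method:

```python
import ast
import inspect
import textwrap

# Recover the source of a live function object, then parse it into a
# Python AST -- the entry point of the specialization pipeline.
src = textwrap.dedent(inspect.getsource(textwrap.dedent))
tree = ast.parse(src)
func = tree.body[0]
print(func.name, type(func).__name__)
```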


Page 31

Transform into IR

[Figure: the Python AST is transformed into an IR built from domain-specific constructs.]


Page 32

Transform into Platform AST & Optimize


• Holds the bulk of the performance expert's knowledge

• Uses Asp's infrastructure for common transformations

• Can generate many variants at once (for auto-tuning)
