Page 1: SAGE: Self-Tuning Approximation for Graphics Engines

Mehrzad Samadi¹, Janghaeng Lee¹, D. Anoushe Jamshidi¹, Amir Hormati², and Scott Mahlke¹

¹University of Michigan, ²Google Inc.

December 2013

Compilers Creating Custom Processors, University of Michigan, Electrical Engineering and Computer Science

Page 2: Approximate Computing

• Different domains:
  – Machine Learning
  – Image Processing
  – Video Processing
  – Physical Simulation
  – …

• Less work
• Higher performance
• Lower power consumption

[Figure: example outputs labeled Quality: 100%, 95%, 90%, 85%]

Page 3: Ubiquitous Graphics Processing Units

• Wide range of devices: supercomputers, servers, desktops, cell phones
• Mostly regular applications
• Work on large data sets

Good opportunity for automatic approximation

Page 4: SAGE Framework

• Simplify or skip processing
• Computationally expensive
• Lowest impact on the output quality

[Figure: Speedup versus Output Error; labels: 10%, 10%, 2~3x]

• Self-Tuning Approximation on Graphics Engines
• Write the program once
• Automatic approximation
• Self-tuning dynamically

Page 5: Overview

Page 6: SAGE Framework

[Diagram labels: Input Program, Approximation Methods, Static Compiler, Approximate Kernels, Tuning Parameters, Runtime System]

Page 7: Static Compilation

[Diagram labels: CUDA Code, Evaluation Metric, Target Output Quality (TOQ), Atomic Operation, Data Packing, Thread Fusion, Approximate Kernels, Tuning Parameters]

Page 8: Runtime System

[Figure: runtime timeline on CPU and GPU with phases Preprocessing, Tuning, Execution, and Calibration; plot of Quality and Speedup over Time against the TOQ line; time axis marks: Tuning, T, T + C, T + 2C]

Page 9: Runtime System

[Figure: build frame of the same runtime timeline and Quality/Speedup plot; time axis marks: Tuning, T, T + C]
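The time axis on these slides (Tuning, T, T + C, T + 2C) suggests that tuning comes first and that calibration then recurs every C invocations. The host-side C++ sketch below models that schedule only. The names `Phase` and `phase_for_invocation` are hypothetical, and the exact boundaries (tuning occupies invocations before T, calibration runs at T + C, T + 2C, and so on) are an assumption read off the axis labels, not something the slides state.

```cpp
#include <cassert>

// Hypothetical phase labels for one kernel invocation.
enum class Phase { Tuning, Execution, Calibration };

// Assumption: invocations [0, T) are spent tuning, and a calibration
// check runs at invocations T + C, T + 2C, ... Everything else is
// plain execution of the chosen approximate kernel.
inline Phase phase_for_invocation(long invocation, long T, long C) {
    assert(invocation >= 0 && T >= 0 && C > 0);
    if (invocation < T) return Phase::Tuning;
    const long since_tuning = invocation - T;
    if (since_tuning > 0 && since_tuning % C == 0) return Phase::Calibration;
    return Phase::Execution;
}
```

A larger C means fewer calibration checks and lower overhead, at the cost of reacting more slowly when quality drifts.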

Page 10: Approximation Methods

Page 11: Approximation Methods

[Diagram labels: CUDA Code, Evaluation Metric, Target Output Quality (TOQ), Atomic Operation, Data Packing, Thread Fusion, Approximate Kernels, Tuning Parameters]

Page 12: Atomic Operations

    // Compute histogram of colors in an image
    __global__ void histogram(int n, int* colors, int* bucket)
    {
        int tid = threadIdx.x + blockDim.x * blockIdx.x;
        int nThreads = gridDim.x * blockDim.x;
        // Each thread handles elements tid, tid + nThreads, tid + 2 * nThreads, ...
        for (int i = tid; i < n; i += nThreads) {
            int c = colors[i];
            atomicAdd(&bucket[c], 1);
        }
    }

• Atomic operations update a memory location such that the update appears to happen atomically

Pages 13 to 16: Atomic Operations

[Figure, animated across four slides; label: Threads]

Page 17: Atomic Add

[Plot: Slowdown (y-axis, 0 to 14) versus Conflicts per Warp (x-axis, 0 to 31)]
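To make the x-axis of this plot concrete, here is a minimal host-side C++ sketch that counts conflicts for one warp. The definition used (warp size minus the number of distinct target addresses) is an assumption chosen because it yields the 0 to 31 range shown on the axis; the slides do not define the term, and `conflicts_per_warp` is a hypothetical helper, not part of SAGE.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Assumed definition: a warp of 32 threads has (32 - number of distinct
// target addresses) conflicts, so the count ranges from 0 to 31.
constexpr std::size_t kWarpSize = 32;

// `targets` holds the bucket index that each thread of one warp updates.
inline int conflicts_per_warp(const std::vector<int>& targets) {
    assert(targets.size() == kWarpSize);
    const std::unordered_set<int> distinct(targets.begin(), targets.end());
    return static_cast<int>(targets.size() - distinct.size());
}
```

With this definition, a warp whose threads all update the same histogram bucket has 31 conflicts, and a warp whose threads all update different buckets has none.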

Page 18: Atomic Operation Tuning

[Figure: Iterations across threads T0, T32, T64, T96, T128, T160]

    // Compute histogram of colors in an image
    __global__ void histogram(int n, int* colors, int* bucket)
    {
        int tid = threadIdx.x + blockDim.x * blockIdx.x;
        int nThreads = gridDim.x * blockDim.x;
        // Each thread handles elements tid, tid + nThreads, tid + 2 * nThreads, ...
        for (int i = tid; i < n; i += nThreads) {
            int c = colors[i];
            atomicAdd(&bucket[c], 1);
        }
    }

Page 19: Atomic Operation Tuning

[Figure: Iterations across threads T0, T32, T64, T96, T128, T160]

• SAGE skips one iteration per thread
• To maximize the performance gain, it drops the iteration with the maximum number of conflicts

Page 20: Atomic Operation Tuning

[Figure: threads T0, T32, T64, T96, T128, T160]

• SAGE skips one iteration per thread
• To maximize the performance gain, it drops the iteration with the maximum number of conflicts
• It drops 50% of iterations

Page 21: Atomic Operation Tuning

[Figure: threads T0, T32, T64]

• Drop rate goes down to 25%

Page 22: Dropping One Iteration

[Figure: iterations 0 to 3 with conflict counts 2, 8, 17, and 12. After optimization: conflict detection steps CD 0 to CD 3, a max-conflict selection, and the remaining iterations 0, 1, and 3.]
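The selection step in this figure can be sketched on the host in C++ as follows. This is a minimal model under stated assumptions: the per-iteration conflict counts are already known (on the GPU they would come from the conflict detection steps), ties go to the earliest iteration, and all function names are hypothetical. `drop_rate` restates the arithmetic behind the 50% and 25% figures on the preceding slides: skipping one of k iterations per thread drops 1/k of the work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the iteration with the most conflicts (first one on ties).
inline std::size_t iteration_to_drop(const std::vector<int>& conflicts) {
    assert(!conflicts.empty());
    return static_cast<std::size_t>(
        std::max_element(conflicts.begin(), conflicts.end()) - conflicts.begin());
}

// Iterations a thread still executes after skipping the worst one.
inline std::vector<std::size_t> iterations_to_run(const std::vector<int>& conflicts) {
    const std::size_t dropped = iteration_to_drop(conflicts);
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < conflicts.size(); ++i) {
        if (i != dropped) kept.push_back(i);
    }
    return kept;
}

// Skipping one of k iterations per thread drops a fraction 1/k of the work.
inline double drop_rate(std::size_t iterations_per_thread) {
    assert(iterations_per_thread > 0);
    return 1.0 / static_cast<double>(iterations_per_thread);
}
```

For the conflict counts 2, 8, 17, 12 from the figure, the sketch drops iteration 2 and keeps iterations 0, 1, and 3. Fewer threads means more iterations per thread, which is why the drop rate falls from 50% to 25% when the iteration count per thread goes from two to four.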

Page 23: Approximation Methods

[Diagram labels: CUDA Code, Evaluation Metric, Target Output Quality (TOQ), Atomic Operation, Data Packing, Thread Fusion, Approximate Kernels, Tuning Parameters]

Pages 24 to 27: Data Packing

[Figure, animated across four slides; labels: Threads, Memory]

Pages 28 to 29: Quantization

• Preprocessing finds the min and max of the input sets and packs the data
• During execution, each thread unpacks the data and converts each quantization level back to a data value by applying a linear transformation

[Figure: the range from Min to Max divided into quantization levels 0 to 7; example 3-bit code 101]
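The two quantization steps can be sketched as a host-side C++ model. This is an illustration under assumptions, not SAGE's implementation: the struct `QuantParams`, the function names, the round-to-nearest rule, and the layout (levels packed into 32-bit words, lowest bits first) are all hypothetical choices. Only the overall scheme (find min and max, pack, unpack with a linear transformation) comes from the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameter block produced by preprocessing.
struct QuantParams {
    float min;
    float max;
    unsigned bits;  // bits per packed value, e.g. 3 for levels 0..7
};

inline std::uint32_t max_level(const QuantParams& p) { return (1u << p.bits) - 1u; }

// Preprocessing side: map a value in [min, max] to the nearest level.
inline std::uint32_t quantize(float x, const QuantParams& p) {
    assert(p.bits >= 1 && p.bits <= 16 && p.max > p.min);
    const float t = std::min(1.0f, std::max(0.0f, (x - p.min) / (p.max - p.min)));
    return static_cast<std::uint32_t>(std::lround(t * static_cast<float>(max_level(p))));
}

// Execution side: the linear transformation from a level back to a value.
inline float dequantize(std::uint32_t level, const QuantParams& p) {
    return p.min + (p.max - p.min) * static_cast<float>(level) /
                       static_cast<float>(max_level(p));
}

// Pack consecutive levels into 32-bit words, lowest bits first.
inline std::vector<std::uint32_t> pack(const std::vector<float>& data, const QuantParams& p) {
    const std::size_t per_word = 32u / p.bits;
    std::vector<std::uint32_t> words((data.size() + per_word - 1) / per_word, 0u);
    for (std::size_t i = 0; i < data.size(); ++i) {
        words[i / per_word] |= quantize(data[i], p) << ((i % per_word) * p.bits);
    }
    return words;
}

// Unpack element i, as each thread would before using the value.
inline float unpack(const std::vector<std::uint32_t>& words, std::size_t i,
                    const QuantParams& p) {
    const std::size_t per_word = 32u / p.bits;
    const std::uint32_t level =
        (words[i / per_word] >> ((i % per_word) * p.bits)) & max_level(p);
    return dequantize(level, p);
}
```

With 3 bits per value, ten values fit in one 32-bit word, so a thread fetches far fewer bytes per element; the price is a rounding error of at most half a quantization step.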

Page 30: Approximation Methods

[Diagram labels: CUDA Code, Evaluation Metric, Target Output Quality (TOQ), Atomic Operation, Data Packing, Thread Fusion, Approximate Kernels, Tuning Parameters]

Pages 31 to 32: Thread Fusion

[Figure, animated across two slides; labels: Threads, Memory, Memory, with ≈ marks between adjacent memory elements]

Page 33: Thread Fusion

[Figure: threads T0 and T1, each with a Computation phase and an Output writing phase]

Page 34: Thread Fusion

[Figure: Block 0 with threads T0, T1, ..., T254, T255 and Block 1 with threads T0, T1, ..., T254, T255; legend: Computation, Output writing]

Page 35: Thread Fusion

[Figure: Block 0 with threads T0 to T127 and Block 1 with threads T0 to T127]

Reducing the number of threads per block results in poor resource utilization

Page 36: Block Fusion

[Figure: Block 0 & 1 fused, with threads T0, T1, ..., T254, T255]
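The fusion idea on these slides can be modeled on the host in C++. This is a sketch under assumptions: a fused thread computes the value for the first element of its group and writes that value to every element in the group, and the names `run_fused`, `LaunchShape`, and `fused_launch` are hypothetical. The second function only illustrates the block-fusion arithmetic: keep the threads-per-block count fixed and reduce the number of blocks instead.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// CPU model of fused threads: one "thread" computes the value for the first
// element of each group of `fusion` neighbors and writes it to all of them.
inline std::vector<float> run_fused(std::size_t n, std::size_t fusion,
                                    const std::function<float(std::size_t)>& compute) {
    assert(fusion >= 1);
    std::vector<float> out(n);
    for (std::size_t first = 0; first < n; first += fusion) {
        const float value = compute(first);  // computation, done once per group
        for (std::size_t k = first; k < n && k < first + fusion; ++k) {
            out[k] = value;                  // output writing, done per element
        }
    }
    return out;
}

// Hypothetical launch shape. Fusing threads alone would shrink each block;
// fusing blocks as well keeps `threads_per_block` constant and lowers the
// number of blocks instead.
struct LaunchShape {
    std::size_t blocks;
    std::size_t threads_per_block;
};

inline LaunchShape fused_launch(std::size_t n, std::size_t threads_per_block,
                                std::size_t fusion) {
    assert(threads_per_block > 0 && fusion >= 1);
    const std::size_t threads = (n + fusion - 1) / fusion;
    return {(threads + threads_per_block - 1) / threads_per_block, threads_per_block};
}
```

For 512 outputs and 256 threads per block, a fusion factor of 2 turns two blocks of 256 threads into one block of 256 threads, rather than two blocks of 128 threads.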

Page 37: Runtime

Page 38: How to Compute Output Quality?

[Diagram labels: Accurate Version, Approximate Version, Evaluation Metric]

• High overhead
• Tuning should find a good enough approximate version as fast as possible
• Greedy algorithm
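The comparison implied here (run both versions, then score the approximate output against the accurate one) can be sketched in host-side C++. The metric below, quality as 100 × (1 − mean relative error), is an assumed stand-in; the slides leave the evaluation metric as an input, and `output_quality` and `meets_toq` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed metric: quality (%) = 100 * (1 - mean relative error), floored at 0.
inline double output_quality(const std::vector<double>& accurate,
                             const std::vector<double>& approximate) {
    assert(accurate.size() == approximate.size() && !accurate.empty());
    double total = 0.0;
    for (std::size_t i = 0; i < accurate.size(); ++i) {
        const double denom = std::fabs(accurate[i]);
        const double diff = std::fabs(accurate[i] - approximate[i]);
        // Fall back to absolute error when the accurate value is zero.
        total += denom > 0.0 ? diff / denom : diff;
    }
    const double quality = 100.0 * (1.0 - total / static_cast<double>(accurate.size()));
    return quality < 0.0 ? 0.0 : quality;
}

// True when the measured quality reaches the target output quality (TOQ).
inline bool meets_toq(double quality, double toq) { return quality >= toq; }
```

Because the check needs the accurate output as a reference, every quality measurement costs at least one full accurate run, which is the overhead the first bullet refers to.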

Page 39: Tuning

K(x,y): x is the tuning parameter of the first optimization, y is the tuning parameter of the second optimization

TOQ = 90%

K(0,0): Quality = 100%, Speedup = 1x

Page 40: Tuning

K(1,0): 94%, 1.15x
K(0,1): 96%, 1.5x

Page 41: Tuning

K(1,1): 94%, 2.5x
K(0,2): 95%, 1.6x

Page 42: Tuning

K(2,1): 88%, 3x
K(1,2): 87%, 2.8x

[Figure legend: Final Choice, Tuning Path]
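One way to read the tuning tree is as a greedy walk, sketched below in host-side C++. The rule used here is an assumption consistent with the slides rather than a confirmed description of SAGE: from the current kernel, try one more step of each optimization, move to the fastest child that still meets the TOQ, and stop when neither child does. `Measurement`, `Config`, `greedy_tune`, the `measure` callback, and the `max_level` bound are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <utility>

struct Measurement {
    double quality;  // output quality in percent
    double speedup;  // speedup over the accurate kernel
};
using Config = std::pair<int, int>;  // (x, y) in K(x, y)

// Greedy walk over K(x, y). `measure` is a hypothetical callback that runs
// kernel K(x, y) and reports its quality and speedup; `max_level` bounds
// each tuning parameter so the walk always terminates.
inline Config greedy_tune(double toq, int max_level,
                          const std::function<Measurement(Config)>& measure) {
    assert(max_level >= 0);
    Config current{0, 0};
    for (;;) {
        const Config children[2] = {{current.first + 1, current.second},
                                    {current.first, current.second + 1}};
        bool moved = false;
        double best_speedup = 0.0;
        Config best = current;
        for (const Config& child : children) {
            if (child.first > max_level || child.second > max_level) continue;
            const Measurement m = measure(child);
            if (m.quality >= toq && m.speedup > best_speedup) {
                best = child;
                best_speedup = m.speedup;
                moved = true;
            }
        }
        if (!moved) return current;  // no child meets the TOQ: keep this kernel
        current = best;
    }
}
```

Fed the quality and speedup pairs listed on the slides with TOQ = 90%, this walk goes K(0,0), K(0,1), K(1,1) and stops there, because K(2,1) and K(1,2) both fall below 90%. It measures only six kernels instead of the whole grid, which matters because each measurement needs an accurate reference run.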

Page 43: Evaluation

Page 44: Experimental Setup

• Backend of Cetus compiler
• GPU
  – NVIDIA GTX 560
  – 2GB GDDR5
• CPU
  – Intel Core i7
• Benchmarks
  – Image processing
  – Machine learning

Page 45: K-Means

[Plot: Output Quality (axis 80 to 100) and Accumulative Speedup (axis 0 to 2) over x-axis ticks 1 to 116, with the TOQ line marked]

Page 46: Confidence

[Plot: confidence (y-axis, 40 to 100) versus number of calibration points (x-axis, 0 to 400), with curves for CI = 99%, CI = 95%, and CI = 90%; a marked value reads 93]

After checking 50 samples, we will be 93% confident that 95% of the outputs satisfy the TOQ threshold.

$\underbrace{\text{posterior}}_{\text{Beta}} \propto \underbrace{\text{prior}}_{\text{Uniform}} \times \underbrace{\text{likelihood}}_{\text{Binomial}}$
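The 93% figure can be reproduced from the Beta, Uniform, and Binomial labels with one extra assumption that the slide does not spell out: every one of the checked samples met the TOQ. With a uniform Beta(1, 1) prior on the fraction p of outputs that satisfy the TOQ and n passes out of n, the posterior is Beta(n + 1, 1), whose CDF is p^(n + 1). The C++ sketch below evaluates the resulting tail probability; `confidence_all_pass` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Assumptions: uniform Beta(1, 1) prior on the fraction p of outputs that
// meet the TOQ, and all `n` checked samples met it. The posterior is then
// Beta(n + 1, 1), whose CDF is p^(n + 1), so
//   P(p > target) = 1 - target^(n + 1).
inline double confidence_all_pass(int n, double target) {
    assert(n >= 0 && target > 0.0 && target < 1.0);
    return 1.0 - std::pow(target, n + 1);
}
```

For n = 50 and a target of 0.95 this gives roughly 0.927, in line with the 93% quoted on the slide. The same formula shows why a stricter target needs many more calibration points to reach the same confidence.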

Page 47: Calibration Overhead

[Plot: Calibration Overhead (%) (y-axis, 0 to 25) versus Calibration Interval (x-axis, 0 to 200); series: Gaussian, K-Means]

Page 48: Performance

[Bar charts: Speedup (axis 1 to 4) of SAGE and Loop Perforation in two panels, TOQ = 95% and TOQ = 90%; benchmarks: K-Means, NB, Histogram, SVM, Fuzzy, MeanShift, Binarization, Dynamic, Mean Filter, Gaussian, plus the Geometric Mean; bar labels range from 1.3 to 6.4]

Page 49: Conclusion

• Automatic approximation is possible
• SAGE automatically generates approximate kernels with different parameters
• Runtime system uses tuning parameters to control the output quality during execution
• 2.5x speedup with less than 10% quality loss compared to the accurate execution

Page 51: Backup Slides

Page 52: What Does Calibration Miss?

[Plot: Percent Difference in Error (y-axis, 0% to 100%) versus Calibration Interval (x-axis, 10 to 100); series: flower(250), grandma(800), deadline(900), foreman(300), harbour(300), ice(240), akiyo(300), carphone(380), crew(300), galleon(350)]

Page 53: SAGE Can Control The Output Quality

[Plot: Speedup (y-axis, 1 to 7) versus Output Quality (%) (x-axis, 90 to 100); series: Naïve Bayes, Fuzzy Kmeans, Mean Filter]

Page 54: Distribution of Errors

[Plot: Percentage of Elements (y-axis, 0% to 100%) versus Error (x-axis, 0% to 100%); series: Histogram, Kmeans, Naïve Bayes, Fuzzy Kmeans, SVM, Dynamic, Mean Filter, Gaussian, Binarization, Meanshift]

