
A Dynamically Tuned Sorting Library

Date post: 03-Jan-2016
Description:
A Dynamically Tuned Sorting Library. Xiaoming Li, María Jesús Garzarán, and David Padua. In 2004 International Symposium on Code Generation and Optimization (CGO’04). University of Illinois at Urbana-Champaign.
Transcript
Page 1: A Dynamically Tuned Sorting Library

A Dynamically Tuned Sorting Library

In 2004 International Symposium on Code Generation and Optimization (CGO’04)

Xiaoming Li, María Jesús Garzarán, and David Padua

University of Illinois at Urbana-Champaign

Page 2: A Dynamically Tuned Sorting Library

Motivation

Sorting
– Core operation in many applications, such as databases
– Well understood symbolic computing problem

Library generators such as ATLAS and SPIRAL have used empirical search to adapt to
– Architectural features of the target machine
– Size of the input data

But the performance of sorting also depends on the distribution of the values to be sorted.

Page 3: A Dynamically Tuned Sorting Library

Motivation: main difficulties in building a sorting library

1. Theoretical complexity is not sufficient to measure quality
• Cache effect, instructions executed

2. Performance depends on the characteristics of the input
• Amount & distribution of data to sort
• A single algorithm is not optimal for all possible input sets

Page 4: A Dynamically Tuned Sorting Library


Contributions

1. Identify the architectural and runtime factors that affect the performance of the sorting algorithms.

2. Use empirical search to identify the best shape and parameter values of a sorting algorithm.

3. Use machine learning and runtime adaptation to select the best sorting algorithm for a specific input set.

Page 5: A Dynamically Tuned Sorting Library

Contributions

[Figure: execution time (cycles) vs. standard deviation of the inputs. IBM Power 3, sorting 12 M keys (32-bit integers).]

Page 6: A Dynamically Tuned Sorting Library

Outline

– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions

Page 7: A Dynamically Tuned Sorting Library

Sorting Algorithms

Our sorting library contains
– Quicksort
– CC-Radix
– Multiway Merge
– Insertion Sort (for small partitions)
– Sorting Networks (for small partitions)

Page 8: A Dynamically Tuned Sorting Library

Quicksort

Divide and conquer in-place sorting algorithm

Our implementation includes Sedgewick’s optimizations:
– Set guardians at both ends of the input array.
– Eliminate recursion.
– Correctly select the pivot.
– Use insertion sort for small partitions.
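
The optimizations listed above can be sketched in a few lines. This is only an illustration, not the library's code: the explicit stack stands in for "eliminate recursion", median-of-three is an assumed reading of "correctly select the pivot", and the `cutoff` value is a hypothetical parameter of the kind the library would find by empirical search. The guardian optimization is not shown.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place (inclusive bounds)."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def quicksort(a, cutoff=16):
    """Iterative quicksort; partitions of at most `cutoff` keys use insertion sort."""
    stack = [(0, len(a) - 1)]            # explicit stack replaces recursion
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= cutoff:
            insertion_sort(a, lo, hi)
            continue
        # Median-of-three pivot selection (an assumption, see the lead-in).
        mid = (lo + hi) // 2
        if a[mid] < a[lo]:
            a[lo], a[mid] = a[mid], a[lo]
        if a[hi] < a[lo]:
            a[lo], a[hi] = a[hi], a[lo]
        if a[hi] < a[mid]:
            a[mid], a[hi] = a[hi], a[mid]
        pivot = a[mid]
        i, j = lo, hi
        while i <= j:                    # Hoare-style partition around the pivot value
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))
    return a
```

The function sorts the list in place and also returns it, so `quicksort(keys, cutoff=20)` can be timed directly when comparing cutoff values.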

Page 9: A Dynamically Tuned Sorting Library

Radix sort

Non-comparison algorithm

[Figure: one pass of radix sort, showing the vector to sort, the counter array, the accumulated counts (accum.), and the destination vector.]
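
The three arrays named on this slide (counter, accum., destination vector) correspond to the steps of one counting pass. The sketch below shows a least-significant-digit radix sort built from such passes; the function names, the 8-bit digit size, and the restriction to non-negative integers are assumptions made for the example.

```python
def counting_pass(keys, shift, bits):
    """Stable pass that orders keys by the digit (key >> shift) & mask."""
    mask = (1 << bits) - 1
    counter = [0] * (1 << bits)
    for k in keys:                        # 1. count how many keys have each digit value
        counter[(k >> shift) & mask] += 1
    accum = [0] * (1 << bits)             # 2. prefix sum: first destination slot per digit value
    for d in range(1, 1 << bits):
        accum[d] = accum[d - 1] + counter[d - 1]
    dest = [0] * len(keys)
    for k in keys:                        # 3. scatter each key to its slot
        d = (k >> shift) & mask
        dest[accum[d]] = k
        accum[d] += 1
    return dest


def radix_sort(keys, key_bits=32, bits=8):
    """LSD radix sort for non-negative integers below 2**key_bits."""
    for shift in range(0, key_bits, bits):
        keys = counting_pass(keys, shift, bits)
    return keys
```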

Page 10: A Dynamically Tuned Sorting Library

CC-radix (Cache Conscious Radix Sort)

Tries to exploit data locality in caches
Based on radix sort (Jimenez and Larriba – UPC)

CC-radix(bucket)
  if fits in cache (bucket) then
    radix sort (bucket)
  else
    sub-buckets = Reverse sorting(bucket)
    for each sub-bucket in sub-buckets
      CC-radix(sub-bucket)
    endfor
  endif
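
A runnable sketch of the recursion above follows. It assumes that "Reverse sorting" means partitioning on the most significant remaining digit, and it models "fits in cache" with a hypothetical `cache_keys` limit on the number of keys; both are assumptions for illustration, and keys are taken to be non-negative integers.

```python
def lsd_radix_sort(keys, key_bits, bits):
    """Plain LSD radix sort over the low `key_bits` bits of each key."""
    mask = (1 << bits) - 1
    for shift in range(0, key_bits, bits):
        buckets = [[] for _ in range(1 << bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return keys


def cc_radix(keys, key_bits=32, bits=8, cache_keys=4096):
    """Sort keys that agree on every bit above the low `key_bits` bits."""
    if len(keys) <= cache_keys or key_bits <= bits:
        return lsd_radix_sort(keys, key_bits, bits)   # bucket "fits in cache"
    # Assumed meaning of "Reverse sorting": split on the most significant digit.
    shift = key_bits - bits
    mask = (1 << bits) - 1
    sub_buckets = [[] for _ in range(1 << bits)]
    for k in keys:
        sub_buckets[(k >> shift) & mask].append(k)
    out = []
    for sub_bucket in sub_buckets:        # sub-buckets are already in digit order
        out.extend(cc_radix(sub_bucket, shift, bits, cache_keys))
    return out
```

Each recursive call only has to sort the bits below the digit that was just used for partitioning, which is why `shift` is passed down as the new `key_bits`.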

Page 11: A Dynamically Tuned Sorting Library

Multiway Merge Sort

[Figure: p sorted subsets feeding a heap with 2*p − 1 nodes.]

This algorithm exploits data locality very efficiently
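
The structure in the figure can be approximated with a small sketch: split the input into p subsets, sort each one, then merge them through a heap. Note the assumption: this uses Python's `heapq` binary heap holding one entry per subset, not the 2*p − 1 node layout shown on the slide, and `sorted` stands in for whatever algorithm sorts each subset.

```python
import heapq


def multiway_merge(runs):
    """Merge already-sorted runs using a heap with one entry per run."""
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, r, i = heapq.heappop(heap)   # smallest head among all runs
        out.append(key)
        if i + 1 < len(runs[r]):          # refill the heap from the same run
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return out


def multiway_merge_sort(keys, p=4):
    """Sort by splitting into about p subsets, sorting each, and merging."""
    if not keys:
        return []
    size = -(-len(keys) // p)             # ceiling division
    runs = [sorted(keys[s:s + size]) for s in range(0, len(keys), size)]
    return multiway_merge(runs)
```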

Page 12: A Dynamically Tuned Sorting Library

Sorting algorithms for small partitions

Insertion sort
– Exploits locality in the cache line

Sorting networks
– Register blocking

Page 13: A Dynamically Tuned Sorting Library

Performance Comparison

[Figure: execution time (cycles, axis from 4000 to 7000) vs. standard deviation (axis from 100 to 10000000) for Intel MKL and Quicksort. Pentium III Xeon, 16 M keys (float).]

Page 14: A Dynamically Tuned Sorting Library

Outline

– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions

Page 15: A Dynamically Tuned Sorting Library

Factors that determine performance

Architectural Factors Considered
– Cache / TLB size
– Number of Registers
– Cache Line Size

Runtime Factors Considered
– Amount of data to Sort
– Distribution of the data

Page 16: A Dynamically Tuned Sorting Library

Architectural: Cache Size/TLB Size

Tiling: partition the data into subsets that fit in the cache
– Quicksort
• Use multiple pivots to tile
– CC-radix
• Fit each partition into cache
• The number of active partitions < TLB size
– Multiway Merge Sort
• Fit the heap into cache
• Fit sorted subsets into cache

Page 17: A Dynamically Tuned Sorting Library

Architectural: Number of Registers

For small partitions, sort in place using the processor registers.
Optimizations like unrolling and scheduling can be applied.

Unrolled sequence:
cmp&swap(r0,r1)
cmp&swap(r2,r3)
cmp&swap(r1,r2)
cmp&swap(r0,r3)
cmp&swap(r4,r5)
…..

Scheduled sequence (independent comparisons moved together):
cmp&swap(r0,r1)
cmp&swap(r2,r3)
cmp&swap(r4,r5)
cmp&swap(r1,r2)
cmp&swap(r0,r3)
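
A sorting network is a fixed list of compare-and-swap steps whose order does not depend on the data, which is what allows unrolling and scheduling. The sketch below uses the standard five-comparator network for four inputs; this is a textbook network chosen for the example, not the (partial) sequence shown on the slide, and a Python list stands in for the registers.

```python
def cmp_swap(v, i, j):
    """Leave the smaller of v[i], v[j] at index i and the larger at index j."""
    if v[i] > v[j]:
        v[i], v[j] = v[j], v[i]


# Standard 4-input network. The first two comparators touch disjoint
# positions, as do the third and fourth, so each pair could be scheduled together.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(v):
    """Sort a 4-element list in place with a fixed comparator sequence."""
    for i, j in NETWORK_4:
        cmp_swap(v, i, j)
    return v
```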

Page 18: A Dynamically Tuned Sorting Library

Architectural: Cache Line Size

Fanout = Cache Line Size
– Increases cache line utilization when accessing children nodes

[Figure label: Cache Line]

Page 19: A Dynamically Tuned Sorting Library

Runtime: Amount and Distribution Shape

[Figure: execution time (cycles) vs. number of keys (millions).]

Page 20: A Dynamically Tuned Sorting Library

Runtime: Amount and Distribution Shape

[Figure: execution time (cycles) vs. number of keys (millions).]

Page 21: A Dynamically Tuned Sorting Library

Runtime: Standard Deviation

[Figure: execution time (cycles) vs. standard deviation of the keys. Pentium III Xeon, 16 M keys.]

Page 22: A Dynamically Tuned Sorting Library

Outline

– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions

Page 23: A Dynamically Tuned Sorting Library

Library adaptation

Architectural Factors (handled by empirical search)
– Cache / TLB size
– Number of Registers
– Cache Line Size

Runtime Factors (handled by machine learning and runtime adaptation)
– Distribution shape of the data (does not matter)
– Amount of data to Sort
– Standard Deviation

Page 24: A Dynamically Tuned Sorting Library

The Library

Building the library (installation time)
– Empirical Search
– Learning Procedure
• Use of training data

Running the library (runtime)
– Runtime Procedure

The Learning Procedure and the Runtime Procedure together form the Runtime Adaptation.

Page 25: A Dynamically Tuned Sorting Library

Runtime Adaptation: Learning Procedure

Goal function:

f: (N, E) → {Multiway Merge Sort, Quicksort, CC-radix}

N: amount of input data
E: the entropy vector

– Use N to choose between Multiway Merge or Quicksort
– Use the entropy and the Winnow algorithm to learn the best algorithm
• Output: weight vector (w) and threshold (Ө)
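
For readers unfamiliar with Winnow, the sketch below shows the general idea: a linear threshold classifier over Boolean features whose weights are updated multiplicatively on each mistake. It is a generic Winnow2-style learner, not the library's training code. How the entropy vector is turned into Boolean features is not stated on the slides, so the feature encoding, the names, and the choice of threshold (the number of features) are all assumptions.

```python
def winnow_train(examples, n, alpha=2.0, max_epochs=100):
    """Learn weights w and threshold theta from (features, label) pairs.

    `features` is a sequence of n values in {0, 1}; `label` is a bool.
    """
    w = [1.0] * n
    theta = float(n)                      # fixed threshold (an assumed choice)
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in examples:
            score = sum(wi for wi, xi in zip(w, x) if xi)
            predicted = score >= theta
            if predicted == label:
                continue
            mistakes += 1
            # Promote active weights on a missed positive, demote on a false positive.
            factor = alpha if label else 1.0 / alpha
            w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
        if mistakes == 0:                 # a full pass with no mistakes: stop
            break
    return w, theta


def winnow_predict(w, theta, x):
    """Return True when the weighted sum of active features reaches theta."""
    return sum(wi for wi, xi in zip(w, x) if xi) >= theta
```

In the setting of this slide, a positive label would mean "CC-radix is the best algorithm for this training input", and the learned `w` and `theta` are what the runtime procedure would consume.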

Page 26: A Dynamically Tuned Sorting Library

Runtime Adaptation: Runtime Procedure

Sample the input array
Compute the entropy vector
Compute S = ∑_i w_i * entropy_i
If S ≥ threshold
  choose CC-radix
else
  choose others
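
The runtime procedure can be written out as a short sketch. The sample size, the 8-bit digit width, the random sampling strategy, and the function names are assumptions made for the example; the weights `w` and the `threshold` are taken as given from the learning procedure.

```python
import math
import random
from collections import Counter


def entropy_vector(keys, key_bits=32, bits=8):
    """Entropy of each `bits`-wide digit, least significant digit first."""
    mask = (1 << bits) - 1
    n = len(keys)
    vector = []
    for shift in range(0, key_bits, bits):
        counts = Counter((k >> shift) & mask for k in keys)
        vector.append(sum(-(c / n) * math.log2(c / n) for c in counts.values()))
    return vector


def choose_algorithm(keys, w, threshold, sample_size=1024, seed=0):
    """Return "CC-radix" when the weighted entropy sum reaches the threshold."""
    rng = random.Random(seed)
    # Sample the input so the decision costs far less than the sort itself.
    sample = keys if len(keys) <= sample_size else rng.sample(keys, sample_size)
    entropy = entropy_vector(sample)
    s = sum(wi * ei for wi, ei in zip(w, entropy))
    return "CC-radix" if s >= threshold else "others"
```

When the result is "others", the choice between Multiway Merge Sort and Quicksort would then be made from the amount of data, as described in the learning procedure.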

Page 27: A Dynamically Tuned Sorting Library

Outline

– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions

Page 28: A Dynamically Tuned Sorting Library

Experimental Setup

Test Platforms:
– SGI R12000: 300 MHz; L1I/D = 32KB; L2 = 4MB
– UltraSparcIII: 750 MHz; L1I/D = 32KB, 64KB; L2 = 8MB
– PentiumIII Xeon: 550 MHz; L1I/D = 16KB; L2 = 512KB
– IBM Power3: 375 MHz; L1I/D = 64KB; L2 = 8MB

Page 29: A Dynamically Tuned Sorting Library

Sun UltraSparcIII: 12 M keys

[Figure: execution time (cycles per key) vs. standard deviation of the keys.]

Page 30: A Dynamically Tuned Sorting Library

IBM Power3: 12 M Keys

[Figure: execution time (cycles per key) vs. standard deviation of the keys.]

Page 31: A Dynamically Tuned Sorting Library

Conclusions

Identify the architectural and runtime factors

Use empirical search to find the best parameter values

Our machine learning techniques prove to be quite effective:
– Always selects the best algorithm.
– The wrong decision introduces a 37% average performance degradation
– Overhead (average 5%, worst case 7%)

Page 32: A Dynamically Tuned Sorting Library


Future Work

1. Search in the space of sorting algorithms using high-level primitives

2. Extend sorting to include more data types

3. Include other comparison strategies

4. Parallel algorithms

5. Explore other database operations, such as join.

For example, less than to sort vectors, graphs, …

Page 33: A Dynamically Tuned Sorting Library

A Memory Hierarchy Conscious and Self-tunable Sorting Library

To appear in 2004 International Symposium on Code Generation and Optimization (CGO’04)

Xiaoming Li, María Jesús Garzarán, and David Padua

University of Illinois at Urbana-Champaign

Page 34: A Dynamically Tuned Sorting Library

Empirical search for small partitions (Intel Pentium III Xeon)

Configuration                        4M keys   16M keys   Threshold
Quicksort                            2.43s     10.89s     --
+ Insertion sort at the end          2.17s     9.76s      20
+ Insertion sort at each partition   2.32s     10.50s     20
+ Sorting networks                   2.081s    9.20s      12

Sorting networks obtain the best performance improvement (average 15%)

Page 35: A Dynamically Tuned Sorting Library

Runtime: Amount and Distribution Shape

[Figure: execution time (cycles) vs. number of keys (millions).]

Page 36: A Dynamically Tuned Sorting Library


Performance vs. Distribution

Page 37: A Dynamically Tuned Sorting Library


Performance vs. Distribution

Page 38: A Dynamically Tuned Sorting Library


Performance vs. Sdev

Page 39: A Dynamically Tuned Sorting Library


Performance vs. Sdev

Page 40: A Dynamically Tuned Sorting Library


Multiway Merge Sort

Page 41: A Dynamically Tuned Sorting Library


Runtime: Distribution of Data

Distribution shapes: Uniform, Normal, Exponential, …

Page 42: A Dynamically Tuned Sorting Library


Architectural: Number of Registers

Page 43: A Dynamically Tuned Sorting Library

Sorting algorithms for small partitions

Insertion sort
– Exploits locality in the cache line

Sorting networks
– Register blocking

Page 44: A Dynamically Tuned Sorting Library

Runtime: Distribution of Data

Distribution shapes: Uniform, Normal, Exponential, …

Distribution width:
– Standard deviation (sdev):
• Only good for one-peak distributions
• Expensive to calculate
– Entropy
• Represents the distribution of each bit

The goal is to distinguish the comparison-based algorithms from the radix-based one.

Page 45: A Dynamically Tuned Sorting Library

Entropy

Goal: determine when CC-radix is best

Standard Deviation
– Expensive to compute
– Not a good metric for our goal

Compute the entropy of each digit

Entropy = ∑_i −P_i * log2 P_i,

where P_i = c_i / N; c_i = number of keys that have a particular value for that digit.
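
The formula above translates directly into code. In this sketch a digit is assumed to be an 8-bit field of an integer key, selected by a hypothetical `shift` argument; the slides do not fix the digit width.

```python
import math
from collections import Counter


def digit_entropy(keys, shift, bits=8):
    """Entropy of the digit (key >> shift) & mask over all keys."""
    mask = (1 << bits) - 1
    counts = Counter((k >> shift) & mask for k in keys)   # c_i for each digit value i
    n = len(keys)                                         # N
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())
```

Two limiting cases help with intuition: if every key has the same value in the digit, the entropy is 0; if all 256 values of an 8-bit digit occur equally often, the entropy is 8 bits, the maximum for that digit.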

Page 46: A Dynamically Tuned Sorting Library

Learning Procedure

f: (N, E) → {Multiway merge, CC-radix} is a linearly separable problem:
– f(x1, x2, …, xn) is a decision problem where there exists a weight vector w and a threshold Ө such that f(x) is true if w * x ≥ Ө and false otherwise (w and x are vectors).

Use the Winnow machine learning algorithm to learn f: (N, E).
– The results of the learning are w and Ө.

Page 47: A Dynamically Tuned Sorting Library


Intel PIII Xeon

Page 48: A Dynamically Tuned Sorting Library


SGI R12000

Page 49: A Dynamically Tuned Sorting Library

Runtime: Amount of Data to Sort

Quicksort
– Cache misses increase with the amount of data.

CC-radix
– As the amount of data increases, CC-radix needs more partitioning passes.

Multiway Merge Sort
– Only shows an advantage when the amount of data is large, i.e., when the reduction in cache misses compensates for the complexity of the algorithm.

Page 50: A Dynamically Tuned Sorting Library

Empirical Search

Adaptation to the architecture of the machine
– Quicksort and CC-radix
• The best configuration does not change significantly with the characteristics of the input data set.
• Quicksort, CC-radix:
  - Use of insertion sort/sorting networks for small partitions
  - Threshold to use them
• CC-radix:
  - Size of the radix
– Multiway Merge Sort
• The best configuration changes with the amount and the distribution of the input data.
• The best values will be searched during the learning procedure.

Page 51: A Dynamically Tuned Sorting Library


Page 52: A Dynamically Tuned Sorting Library

Multiway Merge Sort

[Figure: worked example of four sorted runs being merged through a heap.]

Page 53: A Dynamically Tuned Sorting Library

Empirical Search

Example: Multiway Merge
• Search for the heap size that obtains the best performance:
  - For different amounts of data and standard deviations
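
In its simplest form, an empirical search of this kind runs the code with each candidate parameter value and keeps the fastest. The sketch below is a generic illustration under that assumption; the function name, the `repeats` parameter, and the use of wall-clock time (rather than the cycle counts reported in the slides) are choices made for the example.

```python
import time


def empirical_search(candidates, run, repeats=3):
    """Return the candidate value for which run(value) is fastest.

    Each candidate is timed `repeats` times and its best time is kept,
    which reduces the effect of measurement noise.
    """
    best_value, best_time = None, float("inf")
    for value in candidates:
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(value)
            times.append(time.perf_counter() - start)
        if min(times) < best_time:
            best_value, best_time = value, min(times)
    return best_value
```

For the multiway merge example, `candidates` would be the heap sizes to try and `run` would sort a training input with that heap size; the search would be repeated for each combination of data amount and standard deviation.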

