
CSE 326 Sorting David Kaplan Dept of Computer Science & Engineering Autumn 2001.

Page 1: Title

CSE 326: Sorting

David Kaplan

Dept of Computer Science & Engineering, Autumn 2001

Page 2: Outline

Sorting: The Problem Space

Sorting by Comparison
- Lower bound for comparison sorts
- Insertion Sort
- Heap Sort
- Merge Sort
- Quick Sort

External Sorting

Comparison of Sorting by Comparison

Page 3: Sorting: The Problem Space

General problem
- Given a set of N orderable items, put them in order

Without (significant) loss of generality, assume:
- Items are integers
- Ordering is ≤

(Most sorting problems map to the above in linear time.)

Page 4: Lower Bound for Sorting by Comparison

Sorting by Comparison
- The only information available to us is the set of N items to be sorted
- The only operation available to us is pairwise comparison between 2 items

What is the best running time we can possibly achieve?

Page 5: Decision Tree Analysis of Sorting by Comparison

[Figure: decision tree for sorting three items A, B, C by pairwise comparison. The root compares A and B; each edge is labeled with the result of one comparison (e.g. C<A); the six leaves are the orderings A,B,C; A,C,B; C,A,B; B,A,C; B,C,A; C,B,A.]

Legend
- Internal node: the facts known so far
- Leaf node: an ordering of A, B, C
- Edge: the result of one comparison

Page 6: Max depth of decision tree
- How many permutations are there of N numbers?
- How many leaves does the tree have?
- What’s the shallowest tree with a given number of leaves?
- What is therefore the worst running time (number of comparisons) by the best possible sorting algorithm?

Page 7: Lower Bound for Comparison Sort

Stirling’s approximation:

$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$$

$$\log(n!) \approx \log\left(\sqrt{2\pi n}\left(\frac{n}{e}\right)^n\right) = \log\sqrt{2\pi n} + n\log\frac{n}{e} = \Theta(n \log n)$$

- Any comparison sort of n items must have a decision tree with at least n! leaves, and therefore of height at least log(n!)
- log(n!) is Θ(n log n)
- Thus, every comparison sort needs Ω(n log(n)) comparisons in the worst case; none can be asymptotically faster
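
As a rough numerical check of the bound above, the following sketch compares the decision-tree height ceil(log2(n!)) with n log2(n) for a few values of n. The function name `min_comparisons` is illustrative and not from the slides.

```python
import math

def min_comparisons(n: int) -> int:
    """Smallest possible worst-case comparison count: ceil(log2(n!))."""
    return math.ceil(math.log2(math.factorial(n)))

# log2(n!) stays below n*log2(n) but grows at the same rate.
for n in (3, 10, 100, 1000):
    print(n, min_comparisons(n), round(n * math.log2(n)))
```

For n = 3 this gives 3 comparisons, which matches the depth of the three-item decision tree.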

Page 8: Insertion Sort

Basic idea
- After the kth pass, ensure that the first k+1 elements are sorted
- On the kth pass, swap the (k+1)th element to the left as necessary

Start:          7 2 8 3 5 9 6
After Pass 1:   2 7 8 3 5 9 6
After Pass 2:   2 7 8 3 5 9 6
During Pass 3:  2 7 3 8 5 9 6
After Pass 3:   2 3 7 8 5 9 6

What if array is initially sorted? Reverse sorted?
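
The passes above can be sketched in Python as follows. This is a minimal illustration, not code from the course; the name `insertion_sort_passes` and the snapshot list are assumptions made so the intermediate states can be compared with the slide.

```python
def insertion_sort_passes(items):
    """Insertion sort by adjacent swaps; returns the array state after each pass."""
    a = list(items)
    snapshots = []
    for k in range(1, len(a)):
        j = k
        # Swap the new element leftward until it is no smaller than its left neighbor.
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
        snapshots.append(list(a))
    return snapshots

passes = insertion_sort_passes([7, 2, 8, 3, 5, 9, 6])
for number, state in enumerate(passes, start=1):
    print("After Pass", number, state)
```

On an already sorted array the inner loop never runs, so each pass costs one comparison; on a reverse-sorted array every pass moves its element all the way to the front.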

Page 9: Why Insertion Sort is Slow

Inversion: a pair (i,j) such that i<j but Array[i] > Array[j]
- An array of size N can have Θ(N^2) inversions
- The average number of inversions in a random set of elements is N(N-1)/4

Insertion Sort only swaps adjacent elements
- Each swap removes only 1 inversion!
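
To make the inversion argument concrete, this sketch counts inversions by brute force and counts the adjacent swaps insertion sort performs; the two numbers agree. Both function names are illustrative.

```python
def count_inversions(a):
    """Brute-force count of pairs (i, j) with i < j and a[i] > a[j]."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n) if a[i] > a[j])

def insertion_sort_swaps(items):
    """Number of adjacent swaps insertion sort performs on a copy of items."""
    a = list(items)
    swaps = 0
    for k in range(1, len(a)):
        j = k
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            swaps += 1  # each adjacent swap removes exactly one inversion
            j -= 1
    return swaps

data = [7, 2, 8, 3, 5, 9, 6]
print(count_inversions(data), insertion_sort_swaps(data))
```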

Page 10: HeapSort

Sorting via Priority Queue (Heap)

[Figure: a binary heap of keys; the first items removed, in order, are 8 13 18 23 27]

Shove items into a priority queue, take them out smallest to largest

Worst Case:

Best Case:
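
A minimal sketch of sorting via a priority queue, using Python's standard `heapq` module as the heap. The function name `heap_sort` is an assumption, and this version copies the input rather than sorting in place.

```python
import heapq

def heap_sort(items):
    """Put every item into a min-heap, then remove them smallest to largest."""
    heap = list(items)
    heapq.heapify(heap)  # builds the heap in O(N)
    # N removals at O(log N) each.
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heap_sort([23, 8, 27, 13, 18]))
```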

Page 11: MergeSort

[Figure: merging cars by key (aggressiveness of driver). Most aggressive goes first.]

MergeSort (Table [1..n])
  Split Table in half
  Recursively sort each half
  Merge two sorted halves together
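
The three steps of the pseudocode can be sketched in Python as below. This is one straightforward reading of the outline, with illustrative names `merge` and `merge_sort`; it allocates new lists at each level, which is the extra-memory cost the later slides mention.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= keeps equal keys in their original order
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def merge_sort(table):
    """Split in half, recursively sort each half, merge the halves."""
    if len(table) <= 1:
        return list(table)
    mid = len(table) // 2
    return merge(merge_sort(table[:mid]), merge_sort(table[mid:]))

print(merge_sort([7, 2, 8, 3, 5, 9, 6]))
```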

Page 12: MergeSort Analysis

Running Time
- Worst case?
- Best case?
- Average case?

Other considerations besides running time?

Page 13: QuickSort

[Figure: items divided around a pivot into a less-than group and a greater-than group, then each group divided again. Picture from PhotoDisc.com]

Basic idea:
- Pick a “pivot”
- Divide into less-than & greater-than pivot
- Sort each side recursively

Page 14: QuickSort Partition

Pick pivot:               7 2 8 3 5 9 6
Partition with cursors:   7 2 8 3 5 9 6   (a "<" cursor and a ">" cursor)
2 goes to less-than:      7 2 8 3 5 9 6

Page 15: QuickSort Partition (cont’d)

6, 8 swap less/greater-than:      7 2 6 3 5 9 8
3, 5 less-than; 9 greater-than:   7 2 6 3 5 9 8
Partition done:                   7 2 6 3 5 9 8

Recursively sort each side.
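
The cursor-based partition traced above might be coded as follows. Two things here are assumptions rather than details stated on the slides: the pivot is taken to be the first element (7 in the example), and the sketch finishes by swapping the pivot into its final slot, a step the slides do not show.

```python
def partition(a, lo, hi):
    """Partition a[lo..hi] around pivot a[lo]; return the pivot's final index."""
    pivot = a[lo]
    i, j = lo + 1, hi
    while True:
        while i <= j and a[i] < pivot:  # left cursor skips less-than items
            i += 1
        while i <= j and a[j] > pivot:  # right cursor skips greater-than items
            j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]  # e.g. 8 and 6 trade places
        i += 1
        j -= 1
    a[lo], a[j] = a[j], a[lo]  # assumed final step: move the pivot between the sides
    return j

def quick_sort(a, lo=0, hi=None):
    """Sort list a in place by recursive partitioning."""
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = partition(a, lo, hi)
        quick_sort(a, lo, p - 1)
        quick_sort(a, p + 1, hi)

data = [7, 2, 8, 3, 5, 9, 6]
quick_sort(data)
print(data)
```

Before the final pivot swap, the array in this sketch passes through the same state as the slide, 7 2 6 3 5 9 8.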

Page 16: Analyzing QuickSort
- Picking pivot: constant time
- Partitioning: linear time
- Recursion: time for sorting left partition (say of size i) + time for right (size N-i-1)

T(1) = b
T(N) = T(i) + T(N-i-1) + cN, where i is the number of elements smaller than the pivot

Page 17: QuickSort Worst case

Pivot is always smallest element.

$$\begin{aligned}
T(N) &= T(i) + T(N-i-1) + cN \\
T(N) &= T(N-1) + cN \\
&= T(N-2) + c(N-1) + cN \\
&= T(N-k) + c\sum_{i=0}^{k-1}(N-i) \\
&= O(N^2)
\end{aligned}$$

Page 18: Optimizing QuickSort

Choosing the Pivot
- Randomly choose pivot: good theoretically and practically, but the call to the random number generator can be expensive
- Pick pivot cleverly: the “Median-of-3” rule takes Median(first, middle, last). Works well in practice.

Cutoff
- Use a simpler sorting technique below a certain problem size (Weiss suggests using insertion sort, with a cutoff limit of 5-20)
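
A sketch combining both optimizations: median-of-3 pivot selection and an insertion-sort cutoff. The cutoff value of 10 is an arbitrary pick from the 5-20 range quoted above, and all names (`CUTOFF`, `median_of_three`, `insertion_sort_range`) are illustrative rather than taken from Weiss or the slides.

```python
CUTOFF = 10  # assumed value; the slide only gives a range of 5-20

def insertion_sort_range(a, lo, hi):
    """Insertion sort on the slice a[lo..hi], in place."""
    for k in range(lo + 1, hi + 1):
        x = a[k]
        j = k - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def median_of_three(a, lo, hi):
    """Index of the median of a[lo], a[middle], a[hi]."""
    mid = (lo + hi) // 2
    return sorted((lo, mid, hi), key=lambda idx: a[idx])[1]

def quick_sort(a, lo=0, hi=None):
    """In-place quicksort with median-of-3 pivot and small-subarray cutoff."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        insertion_sort_range(a, lo, hi)
        return
    m = median_of_three(a, lo, hi)
    a[lo], a[m] = a[m], a[lo]  # park the pivot at the front
    pivot = a[lo]
    i, j = lo + 1, hi
    while True:
        while i <= j and a[i] < pivot:
            i += 1
        while i <= j and a[j] > pivot:
            j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]
        i += 1
        j -= 1
    a[lo], a[j] = a[j], a[lo]
    quick_sort(a, lo, j - 1)
    quick_sort(a, j + 1, hi)
```

With median-of-3, already sorted and reverse-sorted inputs no longer trigger the worst case that a first-element pivot would.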

Page 19: QuickSort Best Case

Pivot is always median element.

$$\begin{aligned}
T(N) &= T(i) + T(N-i-1) + cN \\
T(N) &= 2T(N/2 - 1) + cN \\
&\le 2T(N/2) + cN \\
&\le 4T(N/4) + c(2N/2 + N) \\
&\le 8T(N/8) + cN(1 + 1 + 1) \\
&\le kT(N/k) + cN\log k = O(N \log N)
\end{aligned}$$

Page 20: QuickSort Average Case

Assume all size partitions equally likely, with probability 1/N

$$T(N) = T(i) + T(N-i-1) + cN$$

The average value of T(i) or T(N-i-1) is $(1/N)\sum_{j=0}^{N-1} T(j)$, so

$$T(N) = (2/N)\sum_{j=0}^{N-1} T(j) + cN = O(N \log N)$$

details: Weiss pg 278-279

Page 21: External Sorting

When you just ain’t got enough RAM …
- e.g. Sort 10 billion numbers with 1 MB of RAM.
- Databases need to be very good at this

MergeSort Good for Something!
- Basis for most external sorting routines
- Can sort any number of records using a tiny amount of main memory
- In the extreme case, keep only 2 records in memory at once!

Page 22: External MergeSort
- Split input into two tapes
- Each group of 1 record is sorted by definition, so merge groups of 1 into groups of 2, again split between two tapes
- Merge groups of 2 into groups of 4
- Repeat until data entirely sorted

log N passes

Page 23: Better External MergeSort

Suppose main memory can hold M records:
- Read in groups of M records and sort them (e.g. with QuickSort)
- Number of passes reduced to log(N/M)

k-way mergesort reduces number of passes to log_k(N/M)
- Requires 2k output devices (e.g. mag tapes)

But wait, there’s more …

Polyphase merge does a k-way mergesort using only k+1 output devices (plus kth-order Fibonacci numbers!)
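
The pass counts above can be illustrated with an in-memory simulation, where Python lists stand in for tapes or files. This is only a sketch under that assumption: a real external sort would stream runs from disk, and it does not attempt polyphase merging. The name `external_merge_sort` is illustrative; `heapq.merge` is the standard-library k-way merge of sorted inputs.

```python
import heapq

def external_merge_sort(records, M, k):
    """Simulate external mergesort: sorted runs of M records, then k-way merge passes.

    Returns (sorted_records, number_of_merge_passes).
    """
    # Initial runs: read M records at a time and sort each group in memory.
    runs = [sorted(records[i:i + M]) for i in range(0, len(records), M)]
    passes = 0
    while len(runs) > 1:
        # One pass: merge every group of k runs into a single longer run.
        runs = [list(heapq.merge(*runs[i:i + k])) for i in range(0, len(runs), k)]
        passes += 1
    return (runs[0] if runs else []), passes

data = [(i * 7919) % 1009 for i in range(1000)]
print(external_merge_sort(data, M=10, k=2)[1])   # 100 runs merged 2 at a time
print(external_merge_sort(data, M=10, k=10)[1])  # 100 runs merged 10 at a time
```

With N = 1000 and M = 10 there are 100 initial runs, so 2-way merging takes ceil(log2(100)) = 7 passes and 10-way merging takes 2.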

Page 24: Sorting by Comparison Summary

Sorting algorithms that only compare adjacent elements are Θ(N^2) worst case – but may be Θ(N) best case

HeapSort - Θ(N log N) both best and worst case
- Suffers from two test-ops per data move

MergeSort - Θ(N log N) running time
- Suffers from extra-memory problem

QuickSort - Θ(N^2) worst case, Θ(N log N) best and average case
- In practice, median-of-3 almost always gets us Θ(N log N)
- Big win comes from {sorting in place, one test-op, few swaps}!

Any comparison-based sorting algorithm is Ω(N log N)

External sorting: MergeSort with Θ(log(N/M)) passes

Page 25: Sorting: The Problem Space Revisited

General problem
- Given a set of N orderable items, put them in order

Without (significant) loss of generality, assume:
- Items are integers
- Ordering is ≤
- Most sorting problems can be mapped to the above in linear time

But what if we have more information available to us?

Page 26: Sorting in Linear Time

Sorting by Comparison
- The only information available to us is the set of N items to be sorted
- The only operation available to us is pairwise comparison between 2 items
- The best worst-case running time is Θ(N log(N)), given these constraints

What if we relax the constraints?
- Know something in advance about item values

Page 27: BinSort (a.k.a. BucketSort)
- If keys are known to be in {1, …, K}
- Have array of size K
- Put items into correct bin (cell) of array, based on key

Page 28: BinSort example

K=5, list=(5,1,3,4,3,2,1,1,5,4,5)

Bins in array:
  key = 1:  1,1,1
  key = 2:  2
  key = 3:  3,3
  key = 4:  4,4
  key = 5:  5,5,5

Sorted list: 1,1,1,2,3,3,4,4,5,5,5

Page 29: BinSort Pseudocode

procedure BinSort (List L, K)
  LinkedList bins[1..K]
  // Each element of array bins is linked list.
  // Could also BinSort with array of arrays.

  For Each number x in L
    bins[x].Append(x)
  End For

  For i = 1..K
    For Each number x in bins[i]
      Print x
    End For
  End For
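
A runnable Python rendering of the pseudocode, using the "array of arrays" variant the comment mentions. The `key` parameter is an addition not present in the pseudocode; it is there only so that items carrying associated values can be binned by their key.

```python
def bin_sort(items, K, key=lambda x: x):
    """BinSort for keys known to lie in 1..K; returns a new sorted list."""
    bins = [[] for _ in range(K + 1)]  # bins[0] is unused so that keys index directly
    for x in items:
        bins[key(x)].append(x)  # appending preserves input order within a bin
    return [x for b in bins for x in b]

print(bin_sort([5, 1, 3, 4, 3, 2, 1, 1, 5, 4, 5], K=5))
```

Because each bin keeps items in arrival order, items with equal keys come out in their input order.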

Page 30: BinSort Running Time

K is a constant
- BinSort is linear time

K is variable
- Not simply linear time (the running time is O(N + K))

K is large (e.g. 2^32)
- Impractical

Page 31: BinSort is “stable”

Stable Sorting algorithm
- Items in input with the same key end up in the same order as when they began.
- Important if keys have associated values
- Critical for RadixSort

Page 32: Mr. Radix

Herman Hollerith invented and developed a punch-card tabulation machine system that revolutionized statistical computation.

Born in Buffalo, New York, the son of German immigrants, Hollerith enrolled in the City College of New York at age 15 and graduated from the Columbia School of Mines with distinction at the age of 19.

His first job was with the U.S. Census effort of 1880. Hollerith successively taught mechanical engineering at the Massachusetts Institute of Technology and worked for the U.S. Patent Office. The young engineer developed an electrically actuated brake system for the railroads, but the Westinghouse steam-actuated brake prevailed.

Hollerith began working on the tabulating system during his days at MIT, filing for the first patent in 1884. He developed a hand-fed 'press' that sensed the holes in punched cards; a wire would pass through the holes into a cup of mercury beneath the card, closing the electrical circuit. This process triggered mechanical counters and sorter bins and tabulated the appropriate data.

Hollerith's system, including punch, tabulator, and sorter, allowed the official 1890 population count to be tallied in six months, and in another two years all the census data was completed and defined; the cost was $5 million below the forecasts and saved more than two years' time.

His later machines mechanized the card-feeding process, added numbers, and sorted cards, in addition to merely counting data.

In 1896 Hollerith founded the Tabulating Machine Company, forerunner of the Computing-Tabulating-Recording Company (CTR). He served as a consulting engineer with CTR until retiring in 1921.

In 1924 CTR changed its name to IBM, the International Business Machines Corporation.

Herman Hollerith
Born February 29, 1860 - Died November 17, 1929
Art of Compiling Statistics; Apparatus for Compiling Statistics
Patent Nos. 395,781; 395,782; 395,783
Inducted 1990

Source: National Institute of Standards and Technology (NIST) Virtual Museum - http://museum.nist.gov/panels/conveyor/hollerithbio.htm

Page 33: RadixSort

Radix = “The base of a number system” (Webster’s dictionary)
- alternate terminology: radix is the number of bits needed to represent 0 to base-1; can say “base 8” or “radix 3”

Idea: BinSort on each digit, bottom up.

Page 34: RadixSort – magic! It works.

Input list:
  126, 328, 636, 341, 416, 131, 328
BinSort on lower digit:
  341, 131, 126, 636, 416, 328, 328
BinSort result on next-higher digit:
  416, 126, 328, 328, 131, 636, 341
BinSort that result on highest digit:
  126, 131, 328, 328, 341, 416, 636

Page 35: RadixSort – how it works

Input: 126 328 636 341 416 131 328

Bins (0-9) after BinSort on lowest digit:
  1: 341 131
  6: 126 636 416
  8: 328 328

Bins (0-9) after BinSort on next-higher digit:
  1: 416
  2: 126 328 328
  3: 131 636
  4: 341

Bins (0-9) after BinSort on highest digit:
  1: 126 131
  3: 328 328 341
  4: 416
  6: 636

Output: 126 131 328 328 341 416 636
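
The digit-by-digit passes can be sketched as a least-significant-digit radix sort in Python. This assumes non-negative integer keys; the `trace` parameter is an illustrative addition that records the list after each BinSort so it can be compared with the example.

```python
def radix_sort(keys, base=10, trace=None):
    """LSD radix sort of non-negative integers: one stable BinSort per digit."""
    out = list(keys)
    largest = max(out, default=0)
    place = 1  # 1 = lowest digit, then base, base**2, ...
    while place <= largest:
        bins = [[] for _ in range(base)]
        for x in out:
            bins[(x // place) % base].append(x)  # stable: keeps arrival order
        out = [x for b in bins for x in b]
        if trace is not None:
            trace.append(list(out))
        place *= base
    return out

steps = []
print(radix_sort([126, 328, 636, 341, 416, 131, 328], trace=steps))
for step in steps:
    print(step)
```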

Page 36: Not magic. It provably works.

Keys
- K-digit numbers base B

Claim: after ith BinSort, least significant i digits are sorted
- e.g. B=10, i=3, keys are 1776 and 8234. 8234 comes before 1776 for last 3 digits.

Page 37: RadixSort Proof by Induction

Base case:
- i=0. 0 digits are sorted (that wasn’t hard!)

Induction step: assume for i, prove for i+1.
- Consider two numbers: X, Y. Say X_i is the ith digit of X (from the right)
- X_{i+1} < Y_{i+1}: then the (i+1)th BinSort will put them in order
- X_{i+1} > Y_{i+1}: same thing
- X_{i+1} = Y_{i+1}: order depends on last i digits. Induction hypothesis says already sorted for these digits. (Careful about ensuring that your BinSort preserves order, aka “stable”…)

Page 38: What data types can you RadixSort?
- Any type T that can be BinSorted
- Any type T that can be broken into parts A and B, such that:
  - You can reconstruct T from A and B
  - A can be RadixSorted
  - B can be RadixSorted
  - A is always more significant than B, in ordering

Page 39: RadixSort Examples
- 1-digit numbers can be BinSorted
- 2 to 5-digit numbers can be BinSorted without using too much memory
- 6-digit numbers, broken up into A=first 3 digits, B=last 3 digits:
  - A and B can reconstruct original 6-digits
  - A and B each RadixSortable as above
  - A more significant than B

Page 40: RadixSorting Strings
- 1 Character can be BinSorted
- Break strings into characters
- Need to know length of biggest string (or calculate this on the fly)
- Null-pad shorter strings
- Running time:
  - N is number of strings
  - L is length of longest string
  - RadixSort takes O(N*L)
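
The recipe above can be sketched as follows. Rather than physically null-padding, this version treats a missing character as a pad that sorts before every real character, which has the same effect. It assumes every character has a code point below 256; the function name is illustrative.

```python
def radix_sort_strings(strings):
    """LSD radix sort of strings, one stable BinSort per character position."""
    out = list(strings)
    longest = max((len(s) for s in out), default=0)
    for pos in range(longest - 1, -1, -1):  # last character position first
        bins = [[] for _ in range(257)]  # bin 0 = pad, bins 1..256 = characters
        for s in out:
            b = ord(s[pos]) + 1 if pos < len(s) else 0
            bins[b].append(s)
        out = [s for b in bins for s in b]
    return out

print(radix_sort_strings(["banana", "apple", "app", "cherry", "b"]))
```

Each of the L passes touches all N strings once, which is where the O(N*L) bound comes from (treating the 257 bins as a constant).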

Page 41: Evaluating Sorting Algorithms
- What factors other than asymptotic complexity could affect performance?
- Suppose two algorithms perform exactly the same number of instructions. Could one be better than the other?

Page 42: Memory Hierarchy Stats (made up [1])

                      CPU cycles   Size
L1 (on chip) cache    0            32 KB
L2 cache              8            512 KB
RAM                   35           256 MB
Hard Drive            500,000      8 GB

[1] But plausible :-)

Page 43: Memory Hierarchy Exploits Locality of Reference

Idea: small amount of fast memory
- Keep frequently used data in the fast memory

LRU replacement policy
- Keep recently used data in cache
- To free space, remove Least Recently Used data

Page 44: Cache Details (simplified)

[Figure: main memory and the cache; the cache line size is 4 adjacent memory cells]

Page 45: Traversing an Array

One miss for every 4 accesses in a traversal

[Figure: array cells marked as cache misses and cache hits]

Page 46: Iterative MergeSort: Poor Cache Performance

[Figure: merge passes over the whole array, compared with the cache size]

No temporal locality!

Page 47: Recursive MergeSort: Better Cache Performance

[Figure: recursive subproblems, compared with the cache size]

Page 48: Quicksort: Pretty Good Cache Performance
- Initial partition causes a lot of cache misses
- As subproblems become smaller, they fit into cache
- Generally good cache performance

Page 49: Radix Sort: Lousy Cache Performance

On each BinSort
- Sweep through input list – cache misses along the way (bad!)
- Append to output list – indexed by pseudorandom digit (ouch!)

Truly evil for large Radix (e.g. 2^16), which reduces # of passes

Page 50: Sorting Summary

Linear-time sorting is possible if we know more about the input (e.g. that all input values fall within a known range)

BinSort
- Moderately useful O(n) categorizer
- Most useful as building block for RadixSort

RadixSort
- O(n), but input must conform to fairly narrow profile
- Poor cache performance

Memory Hierarchy and Cache
- Cache locality can have major impact on performance of sorting, and other, algorithms

