8. External Sorting
Suppose that a file is so large that the whole file cannot be accommodated in the internal memory of a computer.
What shall we do?
We need to use an EXTERNAL STORAGE DEVICE !!!
External Sorting
- Disk Sort
- Tape Sort
What is the major difference between the two external sorts?
Sorting with Disk
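“mergesort” = internal sort (forms the initial runs) + k-way merging of the runs
[Figure: the file is cut into pieces that are sorted internally; the resulting runs are then merged]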
Example
4500 records
250 records/block
available memory = 3 blocks
Def’n : A segment of a file is said to be a run if all the records in the segment are sorted.
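[Figure: balanced 2-way merge on four disks. The internal sort (I) produces the initial runs 1, 2, 3, 4, 5, 6, ...; runs 1, 3, 5, ... go to D1 and runs 2, 4, 6, ... go to D2, each of size 3 (blocks). Merging D1 with D2 writes runs of size 6 onto D3 and D4, and so on. A boxed number n denotes the size of a run.]
With 8 initial runs:
Runs                     Run size
1 3 5 7 | 2 4 6 8        3
12 34 56 78              6
1256 3478                12
12345678                 24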
How many passes?
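\[ 1 + \lceil \log_2 r \rceil \qquad (r : \text{\# of initial runs}) \]
\[ \text{\# of I/O operations} = O(n \log_2 r) = O\!\left(n \log_2 \frac{n}{a}\right), \qquad r = \frac{n}{a}, \quad a : \text{the initial run size} \]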
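The pass count can be checked on the example above (4500 records, 250 records/block, 3 blocks of memory). This is a minimal sketch; the function name `merge_pass_count` and the variable names are hypothetical, and it assumes every pass reads each block of the file exactly once.

```python
import math

def merge_pass_count(num_runs: int, k: int = 2) -> int:
    """1 run-formation pass plus the merge passes needed to reach one run."""
    merge_passes = 0
    runs = num_runs
    while runs > 1:
        runs = math.ceil(runs / k)   # one merge pass turns every k runs into one
        merge_passes += 1
    return 1 + merge_passes

# Numbers taken from the example slide.
records, records_per_block, memory_blocks = 4500, 250, 3
blocks = math.ceil(records / records_per_block)    # 18 blocks in the file
initial_runs = math.ceil(blocks / memory_blocks)   # 6 runs of 3 blocks each
passes = merge_pass_count(initial_runs)            # 1 + ceil(log2 6) = 4
block_reads = passes * blocks                      # assumed: one read per block per pass
```

With 2-way merging the six initial runs need three merge passes (6, 3, 2, 1 runs), so the file is read four times in total.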
k-way merging
[Figure: k-way merge tree; the r initial runs are merged k at a time, so the tree has $\log_k r$ levels]
# of passes
$1 + \lceil \log_k r \rceil$
# of I/O operations?
$O(n \log_k r)$
better than 2-way merging !!!
How about # of comparisons?
Is k-way merging always better than 2-way merging?
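A k-way merge can be sketched with a binary heap standing in for the selection tree. This is an illustrative sketch, not the textbook's algorithm: `k_way_merge` is a hypothetical name, the runs are in-memory lists rather than disk blocks, and keys are assumed to be integers.

```python
import heapq
from typing import Iterable, Iterator, List

def k_way_merge(runs: List[Iterable[int]]) -> Iterator[int]:
    """Merge k sorted runs, keeping one front key per run in a heap."""
    iterators = [iter(run) for run in runs]
    heap = []
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))      # (key, index of the run it came from)
    heapq.heapify(heap)
    while heap:
        key, idx = heapq.heappop(heap)     # smallest front key among the k runs
        yield key
        nxt = next(iterators[idx], None)   # refill from the same run
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))

# Four small sorted runs used only for illustration.
example_runs = [[10, 15, 16], [9, 20, 38], [20, 20, 30], [15, 25, 25]]
merged = list(k_way_merge(example_runs))
```

Each output key costs O(log k) comparisons in the heap, which is where the per-pass comparison count discussed next comes from.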
Replacement Selection
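[Figure: the same k-way merge tree over the r initial runs]
# of passes
#(P) = $1 + \lceil \log_k r \rceil$
#(P) decreases as k increases, or as r decreases (i.e., as the run size increases)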
# of comparisons(k-way merge)
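Example: 8 runs (k = 8), first three keys of each run
           run 1  run 2  run 3  run 4  run 5  run 6  run 7  run 8
3rd key      16     38     30     25     50     16    110     20
2nd key      15     20     20     25     15     11    120     18
1st key      10      9     20     15      8      9     90     17

Selection tree on the first keys (nodes numbered 1 to 15; the leaves are nodes 8 to 15):
nodes 8-15 (leaves)    10    9   20   15    8    9   90   17
nodes 4-7               9        15         8        17
nodes 2-3               9                   8
node 1 (root)           8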
How many comparisons in a pass?
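$n \log_2 k$   why? (each of the n records output costs about $\log_2 k$ comparisons in a selection tree with k leaves)
Total # of comparisons?
(# of passes) × (# of comparisons in a pass)
= $(\log_k r)(n \log_2 k)$
= $n \log_2 r$   independent of k !!!
#(c) decreases only when r decreases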
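The cancellation of k can be checked numerically. This is a small sketch with a hypothetical helper `total_comparisons`; it evaluates the product of merge passes and comparisons per pass exactly as written above, without rounding the number of passes.

```python
import math

def total_comparisons(n: int, r: int, k: int) -> float:
    """(log_k r) merge passes times (n log2 k) comparisons per pass."""
    merge_passes = math.log(r, k)     # log_k r
    per_pass = n * math.log2(k)       # n log2 k
    return merge_passes * per_pass

# For n = 4500 records and r = 64 initial runs, every k gives n * log2(r).
results = {k: total_comparisons(4500, 64, k) for k in (2, 4, 8)}
```

Increasing k reduces passes (and I/O) but leaves the comparison count at about n log2 r, so only a smaller r helps the CPU cost.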
How to increase run size(initial run size)
$x_1, x_2, \ldots, x_m \mid x_{m+1}, x_{m+2}, \ldots, x_{2m} \mid x_{2m+1}, x_{2m+2}, \ldots$
   (m keys)            (m keys)            (m keys)
r = # of runs = $\lceil n/m \rceil$   Any improvement?
Observation
See p.94 in textbook !!!
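Keys: 4, 2, 32, 12, 18, 24, 91, 11
(record size >> the size of pointer)
why do we need this?
[Figure: a tournament tree over the eight keys, with internal nodes numbered 1 to 7]
A tree of losers
[Figure: the tree of losers for the same keys; each internal node holds a parent pointer and a loser pointer, and a separate winner pointer refers to the overall winner]
Updating pointers
    ptr := winner.parent;
    while ptr ≠ nil do
        if (ptr.loser.key < winner.key) then
            interchange(ptr.loser, winner);
        end {if}
        ptr := ptr.parent;
    end {while}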
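Replacement selection itself can be sketched as follows. Note the swap of data structure: this sketch uses Python's `heapq` binary heap in place of the tree of losers, which plays the same role of repeatedly selecting the smallest key. The name `replacement_selection` and the parameter `m` (number of records that fit in memory) are hypothetical.

```python
import heapq
from typing import Iterable, List

def replacement_selection(keys: Iterable[int], m: int) -> List[List[int]]:
    """Form initial runs with m records in memory (heap instead of a loser tree)."""
    source = iter(keys)
    # Entries are (run_number, key), so keys held back for the next run
    # sort after every key that still belongs to the current run.
    heap = [(0, key) for _, key in zip(range(m), source)]
    heapq.heapify(heap)
    runs: List[List[int]] = []
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no == len(runs):            # first key of a new run
            runs.append([])
        runs[run_no].append(key)
        nxt = next(source, None)
        if nxt is not None:
            # A key smaller than the one just output cannot join this run.
            next_run = run_no if nxt >= key else run_no + 1
            heapq.heappush(heap, (next_run, nxt))
    return runs

runs = replacement_selection([4, 2, 32, 12, 18, 24, 91, 11], m=3)
```

On the eight keys from the slide with only three records in memory, the first run already contains seven keys, far more than the m = 3 a plain internal sort would give.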
Explain p.97-101, textbook !!!
Exercise :
In a complete 2-tree(T) with n leaf nodes,
show that
total # of nodes in T = 2n -1
Performance Analysis
(Average size of runs)
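m0 : # of records in (real) memory.
H. Seward (M.S. Thesis, MIT, 1954)
gave a good reason to believe that a run contains more than 1.5m0 records
(no proof)
E. Friend (JACM, 3, (1956))
experiment: run length ≈ 2m0
E. Moore (1961)
proved that 2m0 is the expected run length.
Sketch of Moore’s Proof
Snowplow
[Figure: a snowplow circling a track under steadily falling snow. In the steady state the track holds m0 units of snow, while the plow removes 2m0 units per lap.]
falling snow = uniform distribution of keys → expected run length 2m0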
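The 2m0 figure can be observed experimentally. This is a rough simulation sketch under stated assumptions: keys are uniform random numbers, a binary heap replaces the tree of losers, and the hypothetical function `average_run_length` only counts runs instead of storing them.

```python
import heapq
import random

def average_run_length(n: int, m0: int, seed: int = 0) -> float:
    """Average run length produced by replacement selection on n random keys."""
    rng = random.Random(seed)
    heap = [(0, rng.random()) for _ in range(m0)]   # (run_number, key)
    heapq.heapify(heap)
    remaining = n - m0
    last_run = 0
    while heap:
        run_no, key = heapq.heappop(heap)
        last_run = run_no
        if remaining > 0:
            remaining -= 1
            nxt = rng.random()
            # Keys smaller than the last output wait for the next run.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
    return n / (last_run + 1)

avg = average_run_length(50_000, 100)
```

For long inputs the average should come out close to 2 × m0 (about 200 here); the first and last runs are shorter, so the value is typically slightly below 2m0.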
Tape Sorting
• Balanced k-way merging
(similar to disk sorting)
• Polyphase merging
• Cascade merging
Polyphase Merging (Motivation)
– (R1, R2, …, R5000)
– length(Ri) = 20 bytes
– Only 1000 records fit in the internal memory at one time (≈ 20k bytes)
– 4 tapes available
Balanced 2-way merge
T1            T2            T3            T4
R1,1000       R1001,2000
R2001,3000    R3001,4000
R4001,5000
                            R1,2000       R2001,4000
                            R4001,5000
R1,4000       R4001,5000
                            R1,5000
Total # of I/O operations = 15000 (3 merge passes × 5000 records)
Tape 1        Tape 2        Tape 3        Tape 4
R1,1000       R1001,2000    R2001,3000
R3001,4000    R4001,5000
(rewind)
R3001,4000    R4001,5000                  R1,3000
                            R1,5000
• Total # of I/O operations
3000 + 5000 = 8000
Balanced Merge is not always best !!!
What if only 3 tapes available?
Tape 1        Tape 2        Tape 3
R1,1000       R1001,2000
R2001,3000    R3001,4000
R4001,5000
                            R1,2000
                            R2001,4000
                            R4001,5000
R1,2000       R2001,4000
R4001,5000
                            R1,4000
                            R4001,5000
R4001,5000    R1,4000
                            R1,5000
Total # of I/O Operations
5000 + 2000 + 5000 + 4000 + 5000 = 21,000 !!!
(merge, redistribute, merge, redistribute, merge)
Tape 1        Tape 2                  Tape 3
R1,1000       R1001,2000
R2001,3000    R3001,4000
R4001,5000
                                      R1,2000
R4001,5000                            R2001,4000
(rewind)
              R1,2000; 4001,5000      R2001,4000
(rewind)
R1,5000
Total # of I/O Operations
4000 + 3000 + 5000 = 12,000 !!!
Polyphase merge
(s^c denotes c runs of relative length s)
T1       T2       T3       T4       T5       T6
1^31     1^30     1^28     1^24     1^16     -
1^15     1^14     1^12     1^8      -        5^16
1^7      1^6      1^4      -        9^8      5^8
1^3      1^2      -        17^4     9^4      5^4
1^1      -        33^2     17^2     9^2      5^2
-        65^1     33^1     17^1     9^1      5^1
129^1    -        -        -        -        -
How to assign initial runs?
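Cascade Merge (tapes T1 to T6; s^c denotes c runs of relative length s; the tape being written or emptied is omitted from each row)
Initial    1^55    1^50    1^41    1^29    1^15
Pass 1     1^40    1^35    1^26    1^14    5^15
           1^26    1^21    1^12    4^14    5^15
           1^14    1^9     3^12    4^14    5^15
           1^5     2^9     3^12    4^14    5^15
          (1^5     2^9     3^12    4^14    5^15)
Pass 2     15^5    2^4     3^7     4^9     5^10
           15^5    14^4    3^3     4^5     5^6
           15^5    14^4    12^3    4^2     5^3
           15^5    14^4    12^3    9^2     5^1
          (15^5    14^4    12^3    9^2     5^1)
Pass 3     15^4    14^3    12^2    9^1     55^1
           15^3    14^2    12^1    50^1    55^1
           15^2    14^1    41^1    50^1    55^1
           15^1    29^1    41^1    50^1    55^1
          (15^1    29^1    41^1    50^1    55^1)
Pass 4     190^1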
Polyphase Merge   Gilstad(1960)
(s^c denotes c runs of relative length s)
Phase    T1       T2       T3       T4       T5       T6
1        1^31     1^30     1^28     1^24     1^16     -
2        1^15     1^14     1^12     1^8      -        5^16
3        1^7      1^6      1^4      -        9^8      5^8
4        1^3      1^2      -        17^4     9^4      5^4
5        1^1      -        33^2     17^2     9^2      5^2
6        -        65^1     33^1     17^1     9^1      5^1
7        129^1    -        -        -        -        -
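The phase table can be reproduced by a small simulation. This is a sketch under stated assumptions: the input must be a perfect distribution, only run counts and relative run lengths are tracked (no actual records), and `polyphase_phases` is a hypothetical name.

```python
from typing import List, Tuple

Tape = Tuple[int, int]   # (number of runs, relative run length); (0, 0) is empty

def polyphase_phases(run_counts: List[int]) -> List[List[Tape]]:
    """Tape contents after each phase, starting from a perfect distribution."""
    tapes: List[Tape] = [(c, 1) for c in run_counts] + [(0, 0)]   # last tape: output
    history = [list(tapes)]
    while sum(count for count, _ in tapes) > 1:
        out = next(i for i, (count, _) in enumerate(tapes) if count == 0)
        inputs = [i for i in range(len(tapes)) if i != out]
        merges = min(tapes[i][0] for i in inputs)   # merge until one input empties
        if merges == 0:
            raise ValueError("not a perfect distribution")
        merged_len = sum(tapes[i][1] for i in inputs)
        for i in inputs:
            count, length = tapes[i]
            tapes[i] = (count - merges, length) if count > merges else (0, 0)
        tapes[out] = (merges, merged_len)
        history.append(list(tapes))
    return history

phases = polyphase_phases([31, 30, 28, 24, 16])
```

Each entry of `phases` corresponds to one row of the table, for example `(16, 5)` on T6 after the first merge phase stands for 5^16.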
{{1,0,0,0,0},{1,1,1,1,1},{2,2,2,2,1},{4,4,4,3,2},{8,8,7,6,4},
{16,15,14,12,8},{31,30,28,24,16}}
Perfect Fibonacci Distribution !!!
What is the underlying rule?
Level i   a_i   b_i   c_i   d_i   e_i
0          1     0     0     0     0
1          1     1     1     1     1
2          2     2     2     2     1
3          4     4     4     3     2
4          8     8     7     6     4
5         16    15    14    12     8
6         31    30    28    24    16
Each level is computed from the one above it:
level 1 = (a_0 + b_0, a_0 + c_0, a_0 + d_0, a_0 + e_0, a_0)
level 2 = (a_1 + b_1, a_1 + c_1, a_1 + d_1, a_1 + e_1, a_1)
level 3 = (a_2 + b_2, a_2 + c_2, a_2 + d_2, a_2 + e_2, a_2)
In general:
n        a_n          b_n          c_n          d_n          e_n
n+1      a_n + b_n    a_n + c_n    a_n + d_n    a_n + e_n    a_n
Level i   a_i   b_i   c_i   d_i   e_i   output
0          1     0     0     0     0    T6
1          1     1     1     1     1    T1
2          2     2     2     2     1    T2
3          4     4     4     3     2    T3
4          8     8     7     6     4    T4
5         16    15    14    12     8    T5
6         31    30    28    24    16    T6
7         61    59    55    47    31
Merging down from level 3 (six tapes): 4 4 4 3 2 0 → 2 2 2 1 0 2 → 1 1 1 0 1 1
Level    T1                   T2                   T3                   T4                   T5
n-1      a_{n-1}              b_{n-1}              c_{n-1}              d_{n-1}              e_{n-1}
n        a_{n-1} + b_{n-1}    a_{n-1} + c_{n-1}    a_{n-1} + d_{n-1}    a_{n-1} + e_{n-1}    a_{n-1}
         = a_n                = b_n                = c_n                = d_n                = e_n
$e_n = a_{n-1}$
$d_n = a_{n-1} + e_{n-1} = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + d_{n-1} = a_{n-1} + (a_{n-2} + e_{n-2}) = a_{n-1} + a_{n-2} + a_{n-3}$
………….
Hence
$e_n = a_{n-1}$
$d_n = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + a_{n-2} + a_{n-3}$
$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$
($a_0 = 1$, $a_i = 0$ for $i = -1, -2, -3, -4$)
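The level-to-level rule is easy to run mechanically. This is a minimal sketch with a hypothetical function `perfect_distributions`; it assumes the level n+1 rule (a_n + b_n, a_n + c_n, a_n + d_n, a_n + e_n, a_n) generalized to any number of input tapes.

```python
from typing import List, Tuple

def perfect_distributions(levels: int, tapes: int = 5) -> List[Tuple[int, ...]]:
    """Perfect polyphase distributions for levels 0..levels on `tapes` input tapes."""
    dist = tuple([1] + [0] * (tapes - 1))     # level 0: (1, 0, 0, 0, 0)
    result = [dist]
    for _ in range(levels):
        a = dist[0]
        # New tape j gets a_n plus the old count of tape j+1; the last gets a_n.
        dist = tuple(a + dist[j + 1] for j in range(tapes - 1)) + (a,)
        result.append(dist)
    return result

levels = perfect_distributions(7)
totals = [sum(level) for level in levels]     # total runs at each level
```

The totals 1, 5, 9, 17, 33, 65, 129, ... are the only run counts for which five input tapes can be loaded without dummy runs.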
i      -4   -3   -2   -1    0    1    2    3    4    5    6    7
a_i     0    0    0    0    1    1    2    4    8   16   31   61
b_i                         0    1    2    4    8   15   30   59
c_i                         0    1    2    4    7   14   28   55
d_i                         0    1    2    3    6   12   24   47
e_i                         0    1    1    2    4    8   16   31
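$a_i = \langle 0, 0, 0, 0, 1, 1, 2, 4, 8, 16, 31, 61, \ldots \rangle$, $i = -4, -3, -2, -1, 0, 1, 2, \ldots$
“The kth order Fibonacci number”
\[ F^{(k)}_n = F^{(k)}_{n-1} + F^{(k)}_{n-2} + \cdots + F^{(k)}_{n-k} \qquad (n \ge k) \]
\[ F^{(k)}_n = 0 \ \ (0 \le n \le k-2), \qquad F^{(k)}_{k-1} = 1 \]
e.g.) The second order Fibonacci number
0 1 1 2 3 5 ……
\[ F^{(2)}_n = F^{(2)}_{n-1} + F^{(2)}_{n-2}, \qquad F^{(2)}_0 = 0, \quad F^{(2)}_1 = 1 \]
Fibonacci number !!!
$a_n = F^{(k)}_{n+k-1}$ if k (input) tapes are used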
why?
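The relation between a_n and the kth order Fibonacci numbers can be checked directly. This is a sketch with a hypothetical function `kth_order_fibonacci`, using the initial conditions F_n = 0 for 0 ≤ n ≤ k-2 and F_{k-1} = 1.

```python
def kth_order_fibonacci(n: int, k: int) -> int:
    """kth order Fibonacci number F^(k)_n: each term is the sum of the previous k."""
    values = [0] * (k - 1) + [1]          # F_0 .. F_{k-1}
    while len(values) <= n:
        values.append(sum(values[-k:]))
    return values[n]

second_order = [kth_order_fibonacci(n, 2) for n in range(7)]
# a_n for 5 input tapes, using a_n = F^(5)_{n+4}
a_values = [kth_order_fibonacci(n + 4, 5) for n in range(8)]
```

For k = 5 the shifted sequence reproduces the a_i row 1, 1, 2, 4, 8, 16, 31, 61 of the table.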
What if not perfect Fib. Dist’n?
Use dummy runs !!!
5 input tapes and 53 initial runs.
Level   T1   T2   T3   T4   T5   Total      Added to reach this level
1        1    1    1    1    1     5
2        2    2    2    2    1     9        1 1 1 1 0
3        4    4    4    3    2    17        2 2 2 1 1
4        8    8    7    6    4    33        4 4 3 3 2
5       16   15   14   12    8    65 > 53   (8 7 7 6 4)
………………………………
Runs (34) to (53) are placed toward level 5:
T1      T2      T3      T4      T5
(34)
(35)    (36)    (37)
(38)    (39)    (40)    (41)
(42)    (43)    (44)    (45)
(46)    (47)    (48)    (49)    (50)
(51)    (52)    (53)
Dummy runs (in parentheses) and the first merge phase:
T1      T2      T3      T4      T5      T6
(2)     (2)     (2)     (3)     (3)
1^8     1^7     1^6     1^4     -       5^8
(2)     (2)     (2)     (3)             5^5 5^3
not best
but simple and good !!!
For better one, see Knuth !!!
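The placement of runs and the resulting dummy counts can be sketched as follows. This follows the "horizontal" distribution rule as I understand it from Knuth, Vol. III; the function name `distribute_runs` and its return shape are hypothetical, and the sketch only counts runs per tape.

```python
from typing import List, Tuple

def distribute_runs(num_runs: int, input_tapes: int) -> Tuple[int, List[int], List[int]]:
    """Return (level reached, real runs per tape, dummy runs per tape)."""
    p = input_tapes
    target = [1] * p + [0]     # runs each tape should hold at the current level
    dummies = [1] * p + [0]    # runs still missing on each tape
    real = [0] * p
    level, j = 1, 0
    for placed in range(1, num_runs + 1):
        real[j] += 1           # write one run on tape j
        dummies[j] -= 1
        if placed == num_runs:
            break
        if dummies[j] < dummies[j + 1]:
            j += 1             # keep the missing counts non-increasing left to right
        elif dummies[j] == 0:
            level += 1         # level complete: move to the next perfect distribution
            a = target[0]
            for t in range(p):
                dummies[t] = a + target[t + 1] - target[t]
                target[t] = a + target[t + 1]
            j = 0
        else:
            j = 0
    return level, real, dummies[:p]

level, real, dummy = distribute_runs(53, 5)
```

For 53 runs on 5 input tapes this reaches level 5 with 14, 13, 12, 9 and 5 real runs and (2) (2) (2) (3) (3) dummy runs, matching the slide.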
Example (3 tapes)   ((sk)^c denotes c runs of size sk)
T1         T2         T3
(k)^8      (k)^5      -
(k)^3      -          (2k)^5
-          (3k)^3     (2k)^2
(5k)^2     (3k)^1     -
(5k)^1     -          (8k)^1
-          (13k)^1    -
Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8
Runs on two input tapes
# of runs    run size (k)    # of pairs    # of I/O’s (k)
8, 5         1, 1            5             10
5, 3         2, 1            3             9
3, 2         3, 2            2             10
2, 1         5, 3            1             8
1, 1         8, 5            1             13
1            13
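The I/O column of the table can be recomputed phase by phase. This is a sketch that assumes two input tapes with initial runs of size k and counts I/O in units of k; `three_tape_polyphase_io` is a hypothetical name.

```python
def three_tape_polyphase_io(runs_a: int, runs_b: int) -> int:
    """Total records moved by a 3-tape polyphase merge, in units of the run size k."""
    # Each input tape is (number of runs, run size in units of k).
    a, b = (runs_a, 1), (runs_b, 1)
    total = 0
    while a[0] > 0 and b[0] > 0:
        pairs = min(a[0], b[0])            # merge until the shorter tape is empty
        merged_size = a[1] + b[1]
        total += pairs * merged_size
        # The tape with runs left over stays an input; the output becomes the other.
        leftover = (a[0] - pairs, a[1]) if a[0] > b[0] else (b[0] - pairs, b[1])
        a, b = leftover, (pairs, merged_size)
    return total

total_io = three_tape_polyphase_io(8, 5)   # 10 + 9 + 10 + 8 + 13
passes = total_io / 13                     # I/O divided by the file size (13 runs of size k)
```

For the 8 + 5 example the total is 50k records, i.e. a little under 4 passes over the 13k records.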
How many passes over the data?
Total number of initial runs = $F_s$ for some s ($F_s$ : the sth Fibonacci number)
The $F_s$ runs are split into $F_{s-1}$ runs and $F_{s-2}$ runs:
T1           T2           T3
F_{s-1}      F_{s-2}      -
F_{s-3}      -            F_{s-2}
-            F_{s-3}      F_{s-4}
…………
See Fig. p.107, textbook !!!
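\[ \text{Total \# of I/O operations} = \sum_{i=1}^{s-2} k\,F_i\,F_{s-i+1} \]
($F_i$ pairs of runs are merged into runs of size $k F_{s-i+1}$; $F_0 = 0$, $F_1 = 1$)
\[ \text{\# of passes} = \frac{\sum_{i=1}^{s-2} k\,F_i\,F_{s-i+1}}{k\,F_s} = \frac{\sum_{i=1}^{s-2} F_i\,F_{s-i+1}}{F_s} \]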
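Lemma :
\[ \sum_{i=1}^{s-2} F_i F_{s-i+1} = \frac{(s-5)F_{s+1}}{5} + \frac{(2s+2)F_s}{5}, \qquad s \ge 2 \]
[proof] (By induction on s)
(s=2) LHS = $\sum_{i=1}^{0} F_i F_{3-i} = 0$
RHS = $\frac{(2-5)F_3}{5} + \frac{(2 \cdot 2+2)F_2}{5} = -\frac{6}{5} + \frac{6}{5} = 0$
(s=3) LHS = $\sum_{i=1}^{1} F_i F_{4-i} = F_1 F_3 = 2$
RHS = $\frac{(3-5)F_4}{5} + \frac{(2 \cdot 3+2)F_3}{5} = -\frac{6}{5} + \frac{16}{5} = 2$
(s=k) Suppose that
\[ \sum_{i=1}^{k'-2} F_i F_{k'-i+1} = \frac{(k'-5)F_{k'+1}}{5} + \frac{(2k'+2)F_{k'}}{5} \qquad \text{for all } k' \text{ with } 2 \le k' \le k \]
(s=k+1)
Exercise !!!
See page 106-107 in textbook !!!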
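Before attempting the induction, the identity sum_{i=1}^{s-2} F_i F_{s-i+1} = ((s-5) F_{s+1} + (2s+2) F_s) / 5 can be checked numerically. This is a quick sketch; the helper names are hypothetical and the Fibonacci numbers are indexed with F_0 = 0, F_1 = 1.

```python
def fib(n: int) -> int:
    """nth Fibonacci number with F_0 = 0, F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def lemma_lhs(s: int) -> int:
    # Sum over i = 1 .. s-2 of F_i * F_{s-i+1}
    return sum(fib(i) * fib(s - i + 1) for i in range(1, s - 1))

def lemma_rhs_times_5(s: int) -> int:
    # Five times the right-hand side, kept in integers to avoid rounding
    return (s - 5) * fib(s + 1) + (2 * s + 2) * fib(s)

checked = all(5 * lemma_lhs(s) == lemma_rhs_times_5(s) for s in range(2, 40))
```

A numeric check is not a proof, but it confirms the two base cases and gives some confidence in the statement before doing the inductive step.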
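From the previous lemma,
\[ \text{\# of passes} = \frac{1}{F_s}\sum_{i=1}^{s-2} F_i F_{s-i+1} = \frac{(s-5)F_{s+1}}{5F_s} + \frac{2s+2}{5} \]
$F_s = r$
\[ F_k = \frac{1}{\sqrt5}\left[\left(\frac{1+\sqrt5}{2}\right)^{k} - \left(\frac{1-\sqrt5}{2}\right)^{k}\right] \approx \frac{1}{\sqrt5}\left(\frac{1+\sqrt5}{2}\right)^{k} \qquad (1) \]
why?
$\frac{1+\sqrt5}{2} \approx 1.618$ : Golden Ratio !!!
From (1),
\[ s \approx \frac{\log_2 F_s + \log_2 \sqrt5}{\log_2(1+\sqrt5) - 1} \approx 1.67 + 1.44 \log_2 F_s \]
\[ \frac{F_{j+1}}{F_j} \approx \frac{13}{8} \qquad \text{for } j \ge 5 \]
Hence
\[ \text{\# of passes} \approx \frac{13}{8} \cdot \frac{s-5}{5} + \frac{2s+2}{5} = \frac{29s - 49}{40} \approx 1.04 \log_2 r \]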
Theorem:
Polyphase merge with 3 tapes, starting from $F_{s-1}$ runs on one tape and $F_{s-2}$ runs on another ($F_s = r$ = # of initial runs):
# of passes ≈ $1.04 \log_2 r$
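The constant 1.04 can be compared with an exact count. This is a sketch that assumes the total I/O of the 3-tape polyphase merge is the sum of F_i F_{s-i+1} for i = 1 .. s-2 (in units of the initial run size) and divides it by the file size F_s; the function names are hypothetical.

```python
import math

def fib(n: int) -> int:
    """nth Fibonacci number with F_0 = 0, F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def polyphase_passes(s: int) -> float:
    """Exact number of passes for r = F_s initial runs on 3 tapes."""
    total_io = sum(fib(i) * fib(s - i + 1) for i in range(1, s - 1))
    return total_io / fib(s)

s = 30
r = fib(s)                                   # 832040 initial runs
ratio = polyphase_passes(s) / math.log2(r)   # should be close to 1.04
```

For small s the ratio is a little off, but it settles near 1.04 as the number of runs grows.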
APPROXIMATED BEHAVIOR OF POLYPHASE MERGE SORTING
Tapes   Phases                Passes                Pass/phase (percent)   Growth ratio
3       2.078 lnS + 0.672     1.504 lnS + 0.992     72                     1.6180340
4       1.641 lnS + 0.364     1.015 lnS + 0.965     62                     1.8392868
5       1.524 lnS + 0.078     0.863 lnS + 0.921     57                     1.9275620
6       1.479 lnS – 0.185     0.795 lnS + 0.864     54                     1.9659482
7       1.460 lnS – 0.424     0.762 lnS + 0.797     52                     1.9835828
8       1.451 lnS – 0.642     0.744 lnS + 0.723     51                     1.9919642
9       1.447 lnS – 0.838     0.734 lnS + 0.646     51                     1.9960312
10      1.445 lnS – 1.017     0.728 lnS + 0.568     50                     1.9980295
20      1.443 lnS – 2.170     0.721 lnS – 0.030     50                     1.9999981
APPROXIMATED BEHAVIOR OF CASCADE MERGE SORTING
Tapes   Phases                Passes                Growth ratio
3       2.078 lnS + 0.672     1.504 lnS + 0.992     1.6180340
4       1.235 lnS + 0.754     1.102 lnS + 0.820     2.2469796
5       0.946 lnS + 0.796     0.897 lnS + 0.800     2.8793852
6       0.796 lnS + 0.821     0.773 lnS + 0.808     3.5133371
7       0.703 lnS + 0.839     0.691 lnS + 0.822     4.1481149
8       0.639 lnS + 0.852     0.632 lnS + 0.834     4.7833861
9       0.592 lnS + 0.861     0.587 lnS + 0.845     5.4189757
10      0.555 lnS + 0.869     0.552 lnS + 0.854     6.0547828
20      0.397 lnS + 0.905     0.397 lnS + 0.901     12.4174426
Cascade Merge
Level   a_i   b_i   c_i   d_i   e_i
0        1     0     0     0     0
1        1     1     1     1     1
2        5     4     3     2     1
3       15    14    12     9     5
4       55    50    41    29    15
n       a_n   b_n   c_n   d_n   e_n
n+1     a_n + b_n + c_n + d_n + e_n    a_{n+1} - e_n    b_{n+1} - d_n    c_{n+1} - c_n    d_{n+1} - b_n
        = a_{n+1}                      = b_{n+1}        = c_{n+1}        = d_{n+1}        = e_{n+1} = a_n
Perfect dist’n
for detail see Knuth Vol III !!!
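The cascade levels can be generated in the same spirit as the polyphase ones. This is a sketch under an assumption: it uses the equivalent prefix-sum form of the rule (a_{n+1} = a_n + ... + e_n, b_{n+1} = a_n + ... + d_n, ..., e_{n+1} = a_n), and `cascade_distributions` is a hypothetical name.

```python
from typing import List, Tuple

def cascade_distributions(levels: int, tapes: int = 5) -> List[Tuple[int, ...]]:
    """Perfect cascade distributions for levels 0..levels."""
    dist = tuple([1] + [0] * (tapes - 1))      # level 0: (1, 0, 0, 0, 0)
    result = [dist]
    for _ in range(levels):
        prefix = []
        running = 0
        for count in dist:                     # a, a+b, a+b+c, ...
            running += count
            prefix.append(running)
        dist = tuple(reversed(prefix))         # longest prefix sum goes to the first tape
        result.append(dist)
    return result

cascade = cascade_distributions(4)
# Same rule in the difference form: b_{n+1} = a_{n+1} - e_n, and so on.
differences_ok = all(
    nxt[j + 1] == nxt[j] - cur[len(cur) - 1 - j]
    for cur, nxt in zip(cascade, cascade[1:])
    for j in range(len(cur) - 1)
)
```

Level 4 gives 55 + 50 + 41 + 29 + 15 = 190 runs, the starting point of the cascade merge example.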