High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs
Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne
Key Laboratory of Computer System and Architecture, ICT, CAS, China
Outline
– GPU computation model
– Our sorting algorithm
  A new bitonic-based merge sort, named Warpsort
– Experimental results
– Conclusion
GPU computation model
Massively multi-threaded, data-parallel many-core architecture
Important features:
– SIMT execution model
  Avoid branch divergence
– Warp-based scheduling
  Implicit hardware synchronization among threads within a warp
– Access pattern
  Coalesced vs. non-coalesced
Why merge sort?
Similar situation to external sorting
– Limited shared memory on chip vs. limited main memory
Sequential memory access
– Easy to meet the coalesced access requirement
Why bitonic-based merge sort?
Massively fine-grained parallelism
– Because of its relatively high complexity, a bitonic network is not good at sorting large arrays
– Only used to sort small subsequences in our implementation
Again, the coalesced memory access requirement
Problems in a naïve bitonic network implementation
– Block-based bitonic network
– One element per thread
Some problems:
– In each stage, n elements produce only n/2 compare-and-swap operations
  Threads form both ascending pairs and descending pairs
– Between stages: synchronization
[Figure: block-based bitonic network, one thread per element, phases and stages labeled]
Too many branch divergences and synchronization operations
What do we use?
Warp-based bitonic network
– Each bitonic network is assigned to an independent warp, instead of a block
  Barrier-free: avoids synchronization between stages
– Threads in a warp perform 32 distinct compare-and-swap operations in the same order
  Avoids branch divergence
– At least 128 elements per warp
And further, a complete comparison-based sorting algorithm: GPU-Warpsort
Overview of GPU-Warpsort
[Figure: four-step pipeline from Input to Output — warp-based bitonic sorts, merges by warps, splits into independent subsequences, final merges by warps]
Step 1: divide the input sequence into small tiles, each sorted by a warp-based bitonic network
Step 2: merge by warps, until the parallelism is insufficient
Step 3: split into small independent subsequences
Step 4: merge by warps, and form the output
Step 1: barrier-free bitonic sort
Divide the input array into equal-sized tiles
Each tile is sorted by a warp-based bitonic network
– 128+ elements per tile to avoid branch divergence
– No need for __syncthreads()
– Ascending pairs + descending pairs
– Use max() and min() to replace if-swap pairs
bitonic_warp_128(key_t *keyin, key_t *keyout) {
    // phases 0 to log(128)-1
    for (i = 2; i < 128; i *= 2) {
        for (j = i/2; j > 0; j /= 2) {
            k0 ← position of preceding element in each pair to form ascending order
            if (keyin[k0] > keyin[k0+j]) swap(keyin[k0], keyin[k0+j]);
            k1 ← position of preceding element in each pair to form descending order
            if (keyin[k1] < keyin[k1+j]) swap(keyin[k1], keyin[k1+j]);
        }
    }
    // special case for the last phase: all pairs ascending
    for (j = 128/2; j > 0; j /= 2) {
        k0 ← position of preceding element in the thread's first pair to form ascending order
        if (keyin[k0] > keyin[k0+j]) swap(keyin[k0], keyin[k0+j]);
        k1 ← position of preceding element in the thread's second pair to form ascending order
        if (keyin[k1] > keyin[k1+j]) swap(keyin[k1], keyin[k1+j]);
    }
}
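The network above can be checked on the host. Below is a minimal C++ simulation of a 128-element bitonic sorting network (the function name and loop structure are ours, not the paper's kernel): the bit test `(idx & size) == 0` assigns each pair an ascending or descending direction, and `min`/`max` replace the if-swap, as the slide suggests. On the GPU, each of a warp's 32 threads would execute 4 of these compare-and-swap pairs per step.

```cpp
#include <algorithm>
#include <vector>

// Host-side sketch of a 128-element bitonic sorting network.
// Each step pairs element idx with idx ^ stride; the bit (idx & size)
// selects ascending vs. descending order, which is how the network
// forms ascending and descending pairs without branch divergence.
// In the final phase (size == 128) every pair is ascending.
void bitonic_sort_128(std::vector<int>& key) {
    const int n = 128;
    for (int size = 2; size <= n; size *= 2) {                // phases
        for (int stride = size / 2; stride > 0; stride /= 2) { // stages
            for (int idx = 0; idx < n; ++idx) {
                int partner = idx ^ stride;
                if (partner <= idx) continue;   // handle each pair once
                bool asc = (idx & size) == 0;   // direction of this pair
                int lo = std::min(key[idx], key[partner]);
                int hi = std::max(key[idx], key[partner]);
                key[idx]     = asc ? lo : hi;   // min/max replaces if-swap
                key[partner] = asc ? hi : lo;
            }
        }
    }
}
```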
Step 2: bitonic-based merge sort
t-element merge sort:
– Allocate a t-element buffer in shared memory
– Load the t/2 smallest elements from sequences A and B, respectively
– Merge
– Output the lower t/2 elements
– Load the next t/2 smallest elements from A or B
t = 8 in this example
[Figure: t = 8 example — the 4 smallest elements of sequence A (0 2 4 6) and of sequence B (1 3 5 7) are loaded into an 8-element buffer in shared memory and merged by a barrier-free bitonic merge network; the lower 4 elements (0 1 2 3) are output. The test A[3] < B[3] decides the refill: if yes, the next 4 elements are loaded from A; if no, from B.]
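The buffer-refill loop of this step can be sketched in host-side C++ (names are ours; `std::sort` on the small buffer stands in for the barrier-free bitonic merge network, and the buffer lives in shared memory on the GPU). The sketch assumes both input lengths are multiples of t/2, which Warpsort arranges by construction.

```cpp
#include <algorithm>
#include <vector>

// Host-side sketch of the t-element merge step. Both inputs must be
// sorted, with lengths divisible by t/2 (an assumption of this sketch).
std::vector<int> merge_by_buffer(const std::vector<int>& A,
                                 const std::vector<int>& B, int t) {
    const int h = t / 2;
    std::vector<int> out, buf;
    size_t ia = h, ib = h;                  // next unloaded index per input
    buf.insert(buf.end(), A.begin(), A.begin() + h);  // t/2 smallest of A
    buf.insert(buf.end(), B.begin(), B.begin() + h);  // t/2 smallest of B
    int lastA = A[h - 1], lastB = B[h - 1]; // max of each loaded half-block
    for (;;) {
        std::sort(buf.begin(), buf.end());  // stands in for the bitonic merge
        out.insert(out.end(), buf.begin(), buf.begin() + h); // emit lower half
        buf.erase(buf.begin(), buf.begin() + h);             // keep upper half
        if (lastA < lastB && ia + h <= A.size()) {   // the slide's A[3]<B[3] test
            buf.insert(buf.end(), A.begin() + ia, A.begin() + ia + h);
            lastA = A[ia + h - 1]; ia += h;
        } else if (lastA >= lastB && ib + h <= B.size()) {
            buf.insert(buf.end(), B.begin() + ib, B.begin() + ib + h);
            lastB = B[ib + h - 1]; ib += h;
        } else break;                        // chosen input is exhausted
    }
    // Everything still unemitted is >= everything emitted, so the tail
    // can be gathered, sorted, and appended.
    buf.insert(buf.end(), A.begin() + ia, A.end());
    buf.insert(buf.end(), B.begin() + ib, B.end());
    std::sort(buf.begin(), buf.end());
    out.insert(out.end(), buf.begin(), buf.end());
    return out;
}
```

Emitting the lower half is safe because the buffer always contains the t/2 smallest elements loaded from each side, so its lower half is globally smallest among all unemitted elements.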
Step 3: split into small tiles
Problem of merge sort
– The number of pairs to merge decreases geometrically
– This cannot fit a massively parallel platform
Method
– Divide the large sequences into independent small tiles which satisfy:
∀a ∈ subsequence(x, i), ∀b ∈ subsequence(y, j): a ≤ b,
where 0 ≤ x < l, 0 ≤ y < l, 0 ≤ i < j < s.
Step 3: split into small tiles (cont.)
How to get the splitters?
– Sample the input sequence randomly
[Figure: the input sequence is sampled randomly; the sample sequence is sorted, and evenly spaced elements of the sorted sample are chosen as splitters]
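The sampling scheme in the diagram can be sketched as follows (host-side C++; the function name, sample size, and splitter count are illustrative choices of ours, not the paper's tuned values):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Sketch of splitter selection: draw a random sample of the input,
// sort it, and keep every (samples/s)-th element. This yields s-1
// splitters that cut the value range into s buckets of roughly equal
// population, so the resulting tiles are load-balanced.
std::vector<int> pick_splitters(const std::vector<int>& input,
                                int s, int samples, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<size_t> pick(0, input.size() - 1);
    std::vector<int> sample(samples);
    for (int& v : sample) v = input[pick(gen)];   // random sample
    std::sort(sample.begin(), sample.end());      // sorted sample sequence
    std::vector<int> splitters;
    for (int i = 1; i < s; ++i)                   // evenly spaced picks
        splitters.push_back(sample[i * samples / s]);
    return splitters;                             // sorted, size s-1
}
```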
Step 4: final merge sort
Subsequences (0,i), (1,i), …, (l-1,i) are merged into Si
Then S0, S1, …, Ss-1 are assembled into a totally sorted array
[Figure: an l × s grid of tiles — row x holds tiles (x,0) … (x,s-1); each column i is merged into Si]
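A host-side sketch of this step, assuming the tiles are stored as an l × s grid of sorted vectors (the layout and names are ours; `std::sort` on each gathered column stands in for the pairwise warp merges):

```cpp
#include <algorithm>
#include <vector>

// Sketch of the final merge: column i of the l x s tile grid is merged
// into Si. Because the splitters guarantee that every element of
// column i is <= every element of column j when i < j, the Si can
// simply be concatenated. On the GPU, each column merge proceeds
// independently, one warp per pairwise merge.
std::vector<int> final_merge(
        const std::vector<std::vector<std::vector<int>>>& tiles) {
    std::vector<int> out;
    const size_t s = tiles[0].size();       // tiles per subsequence
    for (size_t col = 0; col < s; ++col) {
        std::vector<int> Si;
        for (const auto& row : tiles)       // gather column col from all l rows
            Si.insert(Si.end(), row[col].begin(), row[col].end());
        std::sort(Si.begin(), Si.end());    // stands in for the warp merges
        out.insert(out.end(), Si.begin(), Si.end()); // concatenate Si
    }
    return out;
}
```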
Experimental setup
Host
– AMD Opteron 880 @ 2.4 GHz, 2 GB RAM
GPU
– GeForce 9800 GTX+, 512 MB
Input sequences
– Key-only and key-value configurations, 32-bit keys and values
– Sequence size: from 1M to 16M elements
– Distributions: Zero, Sorted, Uniform, Bucket, and Gaussian
Performance comparison
Mergesort
– Fastest comparison-based sorting algorithm on GPUs (Satish, IPDPS'09)
– Implementations already compared by Satish are not included
Quicksort
– Cederman, ESA'08
Radixsort
– Fastest sorting algorithm on GPUs (Satish, IPDPS'09)
Warpsort
– Our implementation
[Figure: sorting time (msec) for key-only (ko) and key-value (kv) inputs of 1M–16M elements, comparing mergesort, radixsort, warpsort, and quicksort]
[Figure: sorting rate (millions/sec) for 1M–16M elements, comparing warpsort, radixsort, and mergesort]
Performance results
Key-only
– 70% higher performance than quicksort
Key-value
– 20%+ higher performance than mergesort
– 30%+ for large sequences (>4M)
Results under different distributions
The Uniform, Bucket, and Gaussian distributions achieve almost the same performance
The Zero distribution is the fastest
Does not excel on the Sorted distribution
– Load imbalance
[Figure: sorting time (msec) and sorting rate (millions/sec) for 1M–16M elements under the Zero, Uniform, Gaussian, Bucket, and Sorted distributions]
Conclusion
We present an efficient comparison-based sorting algorithm for many-core GPUs
– Carefully maps the tasks to the GPU architecture
– Uses warp-based bitonic networks to eliminate barriers
– Provides sufficient homogeneous parallel operations for each thread, avoiding thread idling and thread divergence
– Achieves totally coalesced global memory accesses when fetching and storing sequence elements
The results demonstrate up to 30% higher performance compared with previous optimized comparison-based algorithms
Thanks