Parallel Sorting
A jungle

Illustration
https://www.youtube.com/watch?v=kPRA0W1kECg
(Sequential) Sorting
• Bubble Sort, Insertion Sort – O(n^2)
• Merge Sort, Heap Sort, Quick Sort – O(n log n) – Quick Sort best on average
• Optimal parallel time complexity – O(n log n)/P – if P = n then O(log n)
Insertion Sort
Insertion_Sort (A)
  for i from 1 to |A| - 1
    j = i
    while j > 0 and A[j-1] > A[j]
      swap A[j] and A[j-1]
      j = j - 1
  Return ( A )
Inherently sequential, so hard to parallelize! ⇒ Speedup can only be realized through pipelining.
Pipelined Insertion Sort
T_pipelined = 2n with n processors, so the maximal speedup is (n-1)(n-2)/(4n) ≈ n/4 - 3/4 (worst-case sequential time = (n-1)(n-2)/2)
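The origin of the roughly 2n steps can be seen by simulating the pipeline sequentially. The sketch below is an illustration under assumptions, not the slides' implementation: it models a linear chain of n cells, where each cell keeps the smallest value it has seen and forwards the larger one to its right neighbour. The name `pipelined_insertion_sort` and the way steps are counted are hypothetical.

```python
def pipelined_insertion_sort(values):
    """Simulate a chain of n cells; returns (sorted list, number of time steps)."""
    n = len(values)
    held = [None] * n        # value currently stored in each cell
    arriving = [None] * n    # value arriving at each cell in this time step
    pending = list(values)   # input stream, fed into cell 0 one value per step
    steps = 0
    while pending or any(v is not None for v in arriving):
        if pending:
            arriving[0] = pending.pop(0)
        forwarded = [None] * n
        # All cells act "simultaneously": each reads only its own arriving value.
        for i in range(n):
            v = arriving[i]
            if v is None:
                continue
            if held[i] is None:
                held[i] = v
            else:
                # Keep the smaller value, pass the larger one to the right.
                held[i], forwarded[i + 1] = min(held[i], v), max(held[i], v)
        arriving = forwarded
        steps += 1
    return held, steps


result, steps = pipelined_insertion_sort([3, 2, 1])
print(result, steps)
```

The last input enters at step n and may still have to travel to the last cell, which is why the step count in this model stays below 2n.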
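Parallel Merge Sort
Merge_Sort (A)
  n = |A|
  if n == 1 then return A
  halfway = floor(n/2)
  DOINPARALLEL
    A[1 … halfway] = Merge_Sort (A[1] … A[halfway])
    A[halfway+1 … n] = Merge_Sort (A[halfway+1] … A[n])
  j = 1; current = 1
  for i from 1 to halfway
    while j ≤ n-halfway and A[halfway + j] < A[i]
      X[current] = A[halfway + j]
      j = j + 1; current = current + 1
    X[current] = A[i]
    current = current + 1
  while j ≤ n-halfway          # copy what is left of the second half
    X[current] = A[halfway + j]
    j = j + 1; current = current + 1
  Return ( X )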
[Figure: array A with the index positions i, halfway, halfway + j and n marked]

In a picture
Notes Merge Sort
• Collects the sorted list onto one processor, merging as items come together
• Maps well to a tree structure: sorting locally on the leaves, then merging up the tree
• As items approach the root of the tree, processors are dropped, limiting parallelism
• O(n) if P = n: (1 + 2 + 4 + … + n/2 + n) = n(1 + 1/2 + 1/4 + …) = 2n
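The tree structure described in these notes can be sketched as follows. This is a minimal illustration under assumptions: the leaves are sorted with Python's built-in `sorted`, the number of leaves (`leaves=4`) is an arbitrary choice, and the names `merge` and `tree_merge_sort` are hypothetical. In CPython, threads mainly show the structure of the computation; they are not expected to give real speedup for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor


def merge(pair):
    """Merge two sorted lists into one sorted list."""
    left, right = pair
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])   # copy whatever is left of either input
    out.extend(right[j:])
    return out


def tree_merge_sort(data, leaves=4):
    if not data:
        return []
    size = -(-len(data) // leaves)  # ceiling division
    chunks = [data[k:k + size] for k in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        runs = list(pool.map(sorted, chunks))       # sort locally on the leaves
        while len(runs) > 1:                        # merge up the tree
            pairs = [(runs[k], runs[k + 1] if k + 1 < len(runs) else [])
                     for k in range(0, len(runs), 2)]
            runs = list(pool.map(merge, pairs))     # half as many tasks per level
    return runs[0]


print(tree_merge_sort([5, 3, 8, 1, 9, 2, 7]))
```

Each pass through the `while` loop halves the number of independent merge tasks, which mirrors the loss of parallelism near the root of the tree.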
Parallel Quick Sort
QuickSort (A)
  if |A| ≤ 1 then return A
  i = rand_int (|A|)
  p = A[i]
  DOINPARALLEL
    L = QuickSort ({a ∈ A | a < p})
    E = {a ∈ A | a = p}
    G = QuickSort ({a ∈ A | a > p})
  Return ( L || E || G )
If we assume that the pivots are chosen such that L and G are about equal in size, then
Sequential: T(n) = 2T(n/2) + O(n) = O(n log n)
In fact, it can be proven that with randomly chosen pivots this bound holds in expectation!
For parallel execution the choice of i is crucial for load balance. Even more importantly, we would like to choose multiple pivots (p-1) at the same time, so that each time we get p partitions which can be executed in parallel.
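A minimal Python sketch of the L / E / G scheme is given below. It is an illustration under assumptions: the two recursive calls are run in parallel with one extra thread per level, only for the first `depth` levels (`depth=2` is an arbitrary, hypothetical parameter), and the function name `quicksort` is chosen for this example.

```python
import random
import threading


def quicksort(a, depth=2):
    if len(a) <= 1:
        return list(a)
    p = random.choice(a)                  # random pivot
    less = [x for x in a if x < p]        # L
    equal = [x for x in a if x == p]      # E
    greater = [x for x in a if x > p]     # G
    if depth == 0:
        # Below the parallel levels, recurse sequentially.
        return quicksort(less, 0) + equal + quicksort(greater, 0)

    result = {}

    def run(key, part):
        result[key] = quicksort(part, depth - 1)

    worker = threading.Thread(target=run, args=("L", less))
    worker.start()            # L is sorted in a separate thread ...
    run("G", greater)         # ... while G is sorted in the current one
    worker.join()
    return result["L"] + equal + result["G"]


print(quicksort([4, 1, 3, 4, 2, 5]))
```

If the pivot splits the input unevenly, one of the two threads receives most of the work, which is the load-balance problem mentioned above.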
P partitions
• For a given p (number of partitions) and s (oversampling rate), first select at random p*s candidate pivots
    for i from 1 to p*s
      Cand[i] = A[rand_int (|A|)]
• Sort the list of candidate pivots: Cand[i]
• Choose Cand[s], Cand[2*s], …, Cand[(p-1)*s]
Find a good value for the oversampling rate: s > 1,
⇒ s should not lead to very long sorting times
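The pivot selection above can be sketched in Python as follows. This is a hedged illustration: the defaults `p=4` and `s=8`, the fixed `seed`, and the names `choose_pivots` and `sample_sort` are assumptions made for the example, and the buckets are simply sorted with the built-in `sorted` in a thread pool.

```python
import bisect
import random
from concurrent.futures import ThreadPoolExecutor


def choose_pivots(a, p, s, rng):
    """Draw p*s random candidates, sort them, keep every s-th one (p-1 pivots)."""
    cand = sorted(rng.choice(a) for _ in range(p * s))
    # Cand[s], Cand[2*s], ..., Cand[(p-1)*s] in the 1-based notation of the slide.
    return [cand[k * s - 1] for k in range(1, p)]


def sample_sort(a, p=4, s=8, seed=0):
    if len(a) <= 1:
        return list(a)
    pivots = choose_pivots(a, p, s, random.Random(seed))
    buckets = [[] for _ in range(p)]
    for x in a:
        # Bucket j receives the values x with pivots[j-1] <= x < pivots[j].
        buckets[bisect.bisect_right(pivots, x)].append(x)
    with ThreadPoolExecutor(max_workers=p) as pool:
        parts = list(pool.map(sorted, buckets))   # the p partitions are independent
    return [x for part in parts for x in part]


data = list(range(50))
random.Random(1).shuffle(data)
print(sample_sort(data) == sorted(data))
```

A larger `s` tends to give more evenly sized buckets, at the cost of sorting a longer candidate list.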
Parallel Radix Sort
Instead of comparing values: COMPARE DIGITS
Radix_Sort (A, b)    # Assume binary representations of keys
  for i from 0 to b-1
    FLAGS = { (a>>i) mod 2 | a ∈ A }
    NOTFLAGS = { 1 - FLAGS[a] | a ∈ A }
    R_0 = SCAN (NOTFLAGS)
    s_0 = SUM (NOTFLAGS)
    R_1 = SCAN (FLAGS)
    R = { if FLAGS[j] == 0 then R_0[j] else R_1[j] + s_0 | j ∈ [0 … |A|-1] }
    A = A sorted by R
  Return ( A )
(a>>i) mod 2: right shift i times, so e.g. 01101>>2 mod 2 = 00011 mod 2 = 1
So (a>>i) mod 2 equals the (i+1)th rightmost bit of a
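The pseudocode can be turned into a short sequential Python sketch. It assumes that SCAN denotes an exclusive prefix sum (element j holds the sum of all elements before j) and that the keys are non-negative integers of at most b bits; the helper name `exclusive_scan` is hypothetical.

```python
from itertools import accumulate


def exclusive_scan(xs):
    """Exclusive prefix sum: out[j] = xs[0] + ... + xs[j-1]."""
    if not xs:
        return []
    return [0] + list(accumulate(xs))[:-1]


def radix_sort(a, b):
    a = list(a)
    for i in range(b):                          # least significant bit first
        flags = [(x >> i) & 1 for x in a]
        notflags = [1 - f for f in flags]
        r0 = exclusive_scan(notflags)           # target positions of keys with bit 0
        s0 = sum(notflags)                      # number of keys with bit 0
        r1 = exclusive_scan(flags)              # positions of keys with bit 1, before offset
        rank = [r0[j] if flags[j] == 0 else r1[j] + s0 for j in range(len(a))]
        out = [None] * len(a)
        for j, x in enumerate(a):               # "A sorted by R": scatter by rank
            out[rank[j]] = x
        a = out
    return a


print(radix_sort([5, 3, 7, 0, 2, 6, 1, 4], b=3))
```

Every step inside the loop except the two scans is an independent per-element operation, so the scans are the only part that needs a dedicated parallel algorithm.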
LSD/MSD Radix Sort
Instead of (a>>i) mod 2
one can also implement Radix Sort with: (a<<i) div 2^(b-1), where a<<i is truncated to b bits
The first implementation is called least significant digit Radix Sort or LSD Radix Sort
The latter one is MSD Radix Sort
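The two digit-selection expressions can be compared directly. The sketch below assumes b-bit keys and emulates the fixed word width with an explicit mask, since Python integers do not overflow; the function names `lsd_bit` and `msd_bit` are hypothetical.

```python
def lsd_bit(a, i):
    """(a >> i) mod 2: the (i+1)-th bit counted from the right."""
    return (a >> i) % 2


def msd_bit(a, i, b):
    """(a << i) div 2^(b-1) on a b-bit word: the (i+1)-th bit counted from the left."""
    mask = (1 << b) - 1                     # keep only the lowest b bits after shifting
    return ((a << i) & mask) // (1 << (b - 1))


a, b = 0b01101, 5
print([lsd_bit(a, i) for i in range(b)])     # bits from right to left
print([msd_bit(a, i, b) for i in range(b)])  # bits from left to right
```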
Notes Radix Sort
• Sequential time complexity: T(n) = O(b·n): b iterations, each iteration O(n)
• Note that b ≈ log n, so a total of O(n log n)
• Instead of single digits, a block of r digits can be taken each time, resulting in b/r iterations
Illustration (LSD Radix Sort)

Sorting of each selected digit in Radix Sort, with Prefix Sum Based Sorting
Each element i of the prefix sum array holds the SUM of all elements whose index is smaller than i
What is the relationship with sorting?
• All bits which are equal to 0 are flagged with a 1
• Compute the Prefix Sum of this flag array
• Store all flagged (1) entries of x[k] in the location indicated by the prefix sum
Second stage
• All bits which are equal to 1 are flagged with a 1
• Compute the Prefix Sum of this flag array
• Store all flagged (1) entries of x[k] in the next locations indicated by the prefix sum
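The two stages can be traced on a small array. The sketch below is an illustration under assumptions: it processes a single bit position i, uses an exclusive prefix sum as described above, and the names `exclusive_prefix_sum` and `split_by_bit` are hypothetical.

```python
from itertools import accumulate


def exclusive_prefix_sum(flags):
    """Element i holds the sum of all elements with index smaller than i."""
    if not flags:
        return []
    return [0] + list(accumulate(flags))[:-1]


def split_by_bit(x, i):
    ones = [(v >> i) & 1 for v in x]        # second-stage flags: bit equals 1
    zeros = [1 - f for f in ones]           # first-stage flags: bit equals 0
    pos0 = exclusive_prefix_sum(zeros)
    pos1 = exclusive_prefix_sum(ones)
    n_zeros = sum(zeros)
    out = [None] * len(x)
    for k, v in enumerate(x):
        if zeros[k]:
            out[pos0[k]] = v                # stage one: location from the prefix sum
        else:
            out[n_zeros + pos1[k]] = v      # stage two: the next locations after the zeros
    return out


# Bit 0 of 5, 2, 7, 4 is 1, 0, 1, 0.
print(split_by_bit([5, 2, 7, 4], 0))        # [2, 4, 5, 7]
```

Entries with equal bits keep their relative order, which is the stability that LSD Radix Sort relies on from one digit to the next.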
What about parallel execution?
• Computationally, the sorting algorithm is reduced to computing the prefix sum arrays for each bit ranking.
• However, computing these prefix sum arrays seems to be inherently sequential. Or not?
Parallel Execution of Prefix Sums
Prefix_Sum (X)    # X an n-bit array
  for index from 0 to ceil(log n) - 1
    DOINPARALLEL for all k    # all k read the old values of X before any write
      if k >= 2^index then X[k] = X[k] + X[k-2^index]
  X >> 1    # Shift all entries to the right
  Return ( X )
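A sequential Python emulation of this scheme is sketched below. It assumes that all k in one iteration read the values from before that iteration, which is emulated here by building a new list in each round; the function name `parallel_prefix_sum` is hypothetical.

```python
def parallel_prefix_sum(x):
    x = list(x)
    n = len(x)
    stride = 1                       # stride = 2^index
    while stride < n:
        # One parallel step: every k reads the old x, then all writes happen at once.
        x = [x[k] + x[k - stride] if k >= stride else x[k] for k in range(n)]
        stride *= 2
    # Shift all entries one position to the right to obtain the exclusive prefix sum.
    return [0] + x[:-1] if x else []


print(parallel_prefix_sum([1, 0, 1, 1, 0, 1]))   # [0, 1, 1, 2, 3, 3]
```

The loop runs about log n times, and within each round all n updates are independent of each other.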
Illustration of parallel Prefix Sums

Improving Cache Performance
• The parallel prefix sum algorithm requires the whole array to be fetched at each iteration
• Bad cache performance
• Through Tiling Techniques the X array can be cut into slices (tiles)
• Once every number of iterations: re-tile!!
• A CUDA implementation of the overall algorithm can be found on https://github.com/debdattabasu/amp-radix-sort
[Figure: array X cut into tiles assigned to processors P1, P2 and P3; distance label 2^index]
Bitonic Sorting
Based on bitonic sequences: A[1], A[2], …, A[n-1], A[n] is bitonic iff there is a j and k such that
• A[1] … A[j] is monotonic increasing,
• A[j] … A[k] is monotonic decreasing,
• A[k] … A[n], A[1] (!!) is monotonic increasing
OR vice versa
A “better” definition of Bitonic Sequence
A bitonic sequence is a sequence with A[1] <= A[2] <= … <= A[k] >= … >= A[n-1] >= A[n]
for some k (1 <= k <= n), or a circular shift of such a sequence.
In a picture
[Figure: example sequences labelled "Bitonic" and "Not Bitonic" (if rotated: two peaks)]
A[1] >= A[2] >= … >= A[k] <= … <= A[n-1] <= A[n] leads to the same definition
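The circular-shift definition can be checked mechanically. The sketch below rests on one observation, stated here as the working assumption of the example: walking once around the sequence as a circle, a bitonic sequence switches between rising and falling at most twice. The function name `is_bitonic` is hypothetical.

```python
def is_bitonic(a):
    n = len(a)
    # Direction of every circular step a[i] -> a[i+1]; equal neighbours are ignored.
    rising = []
    for i in range(n):
        d = a[(i + 1) % n] - a[i]
        if d != 0:
            rising.append(d > 0)
    # Count direction changes around the circle (index -1 closes the circle).
    changes = sum(rising[i] != rising[i - 1] for i in range(len(rising)))
    return changes <= 2


print(is_bitonic([1, 3, 5, 4, 2]))   # rises, then falls
print(is_bitonic([3, 4, 5, 1, 2]))   # circular shift of an increasing sequence
print(is_bitonic([1, 3, 2, 4, 1]))   # two peaks
```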
Bitonic “Merge”
Bitonic_Merge (A)    # A is a bitonic sequence
  n = |A|
  if n == 1 then return A
  half_n = floor(n/2)
  for i from 1 to half_n
    c[i] = min(A[i], A[i+half_n])
    d[i] = max(A[i], A[i+half_n])
  DOINPARALLEL
    c = Bitonic_Merge (c[1] … c[half_n])
    d = Bitonic_Merge (d[1] … d[half_n])
  Return ( c || d )
Notes Bitonic Merge
• Each c and d sequence is a bitonic sequence again
• For all i and j: c[i] <= d[j]
• At the end we have sorted bitonic sequences of length 1, hence a sorted sequence
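A direct Python transcription of Bitonic Merge is sketched below. It assumes that the input is bitonic and that its length is a power of two; the two recursive calls are independent and are simply executed one after the other here. The function name `bitonic_merge` is chosen for this example.

```python
def bitonic_merge(a):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    half = n // 2
    # Compare element i with element i + half: minima go to c, maxima to d.
    c = [min(a[i], a[i + half]) for i in range(half)]
    d = [max(a[i], a[i + half]) for i in range(half)]
    # c and d are bitonic again and every element of c is <= every element of d,
    # so the two halves can be merged independently and concatenated.
    return bitonic_merge(c) + bitonic_merge(d)


print(bitonic_merge([1, 4, 6, 8, 7, 5, 3, 2]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```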
Bitonic Merge always yields bitonic sequences

Bitonic Merge Network

Bitonic Merge Network (2)

Bitonic Merge Network (3)
Parallel Bitonic Sort
Bitonic_Sort (A)
  n = |A|
  if n == 1 then return A
  for i from 0 to log(n)
    DOINPARALLEL for all k = m·2^i, k < n
      Bitonic_Merge (A[k] … A[k+2^i-1])*
  Return ( A )
* For odd values of m, interchange min and max
Notes Bitonic Sort
• Each iteration creates longer and longer bitonic sequences
• In the last iteration the whole sequence is bitonic and the final bitonic merge creates a sorted list
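The iteration structure, including the min/max interchange for odd m, can be sketched in Python. This is an illustration under assumptions: the input length must be a power of two, the blocks of one iteration are processed one after the other although they are independent, and the names `bitonic_merge_dir` and `bitonic_sort` are hypothetical.

```python
def bitonic_merge_dir(a, ascending):
    """Sort a bitonic sequence in ascending or descending order."""
    n = len(a)
    if n == 1:
        return list(a)
    half = n // 2
    lo = [min(a[i], a[i + half]) for i in range(half)]
    hi = [max(a[i], a[i + half]) for i in range(half)]
    if ascending:
        return bitonic_merge_dir(lo, True) + bitonic_merge_dir(hi, True)
    # Descending: min and max are interchanged, so the maxima come first.
    return bitonic_merge_dir(hi, False) + bitonic_merge_dir(lo, False)


def bitonic_sort(a):
    a = list(a)
    n = len(a)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    size = 2                                   # size = 2^i
    while size <= n:
        # Blocks start at k = m * size; even m sorts ascending, odd m descending,
        # so two neighbouring blocks together form a bitonic sequence.
        for m, k in enumerate(range(0, n, size)):
            a[k:k + size] = bitonic_merge_dir(a[k:k + size], ascending=(m % 2 == 0))
        size *= 2
    return a


print(bitonic_sort([7, 3, 8, 1, 6, 2, 5, 4]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```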
Bitonic Sort Network
four bitonic lists of length 2 constituting 2 bitonic lists of length 4
2 Bitonic Merge Networks
4 Bitonic Merge Networks
Why alternating max/min?
Note that at the start of each Bitonic Merge Network we have two Bitonic Sequences which constitute One Bitonic Sequence!!! If one of these sequences is (monotonic) increasing and the other is (monotonic) decreasing, then this is always the case. If both are increasing or both are decreasing, this is not necessarily the case, i.e.
[Figure: example sequence] is not bitonic
Notes Bitonic Sort Network
• Assume n = 2^k
• The bitonic merge stages have 1, 2, 3, …, k steps each, so the time to sort is
  T(n) = 1 + 2 + … + k = k(k+1)/2 = O(k^2) = O(log^2 n)
• Each step requires n/2 processors, so the total number of processors is O((n/2) log^2 n)
• The network can handle multiple pipelined lists, producing a sorted list each time step
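The step and processor counts can be checked with a few lines of Python. The sketch assumes n = 2^k and one compare-exchange unit per pair of wires in every step; the function name `network_cost` is hypothetical.

```python
def network_cost(n):
    """Return (number of steps, number of compare-exchange units) for n = 2^k inputs."""
    k = n.bit_length() - 1
    if n < 1 or (1 << k) != n:
        raise ValueError("n must be a power of two")
    steps = sum(range(1, k + 1))      # merge stages of 1, 2, ..., k steps
    units = steps * (n // 2)          # n/2 compare-exchange units in every step
    return steps, units


for n in (2, 8, 1024):
    k = n.bit_length() - 1
    steps, units = network_cost(n)
    print(n, steps, units, steps == k * (k + 1) // 2)
```

For n = 8 this gives 6 steps and 24 compare-exchange units.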