Source: liacs.leidenuniv.nl/~wijshoffhag/PPI2015_2016/Lecture_10.pdf

Parallel Sorting

A jungle

Illustration

https://www.youtube.com/watch?v=kPRA0W1kECg

(Sequential) Sorting

•  Bubble Sort, Insertion Sort: O(n^2)

•  Merge Sort, Heap Sort, QuickSort: O(n log n), with QuickSort best on average

•  Optimal parallel time complexity: O(n log n)/P; if P = n then O(log n)

Insertion Sort

Insertion_Sort (A)
  for i from 1 to |A| - 1
    j = i
    while j > 0 and A[j-1] > A[j]
      swap A[j] and A[j-1]
      j = j - 1
  Return ( A )

Inherently sequential, so hard to parallelize! → Only through pipelining can speedup be realized

Pipelined Insertion Sort

T_pipelined = 2n with n processors, so the maximal speedup ≈ n/4 - 3/4 (worst-case sequential time = (n-1)(n-2)/2, divided by 2n)
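The pipeline can be simulated cycle by cycle. The sketch below assumes a linear array of n processors in which each processor keeps the smallest value it has seen and passes the larger one to its right neighbour; the function name and this exact wiring are illustrative assumptions, not taken from the slides.

```python
def pipelined_insertion_sort(values):
    """Simulate a linear pipeline of n processors, one cycle at a time."""
    n = len(values)
    if n == 0:
        return [], 0
    held = [None] * n        # value currently kept by each processor
    incoming = [None] * n    # value arriving at each processor this cycle
    feed = iter(values)      # one new value enters processor 0 per cycle
    cycles = 0
    while True:
        incoming[0] = next(feed, None)
        if all(v is None for v in incoming):
            break            # nothing left in flight
        outgoing = [None] * n
        for i, x in enumerate(incoming):   # conceptually simultaneous
            if x is None:
                continue
            if held[i] is None:
                held[i] = x                # first value seen: keep it
            else:                          # keep the smaller, pass on the larger
                held[i], outgoing[i] = min(held[i], x), max(held[i], x)
        incoming = [None] + outgoing[:-1]  # shift one processor to the right
        cycles += 1
    return held, cycles
```

In this model the last value settles after 2n - 1 cycles, in line with the T_pipelined = 2n estimate.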

Parallel Merge Sort

Merge_Sort (A)
  n = |A|
  if n == 1 then return A
  halfway = floor(n/2)
  DOINPARALLEL
    Merge_Sort (A[1]…A[halfway])
    Merge_Sort (A[halfway+1]…A[n])
  j = 1; current = 1
  for i from 1 to halfway
    while j ≤ n-halfway and A[halfway + j] < A[i]
      X[current] = A[halfway + j]
      j = j + 1; current = current + 1
    X[current] = A[i]
    current = current + 1
  while j ≤ n-halfway          # copy what is left of the right half
    X[current] = A[halfway + j]
    j = j + 1; current = current + 1
  Return ( X )

[Figure: array A with index markers i, halfway, halfway + j and n]
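A minimal Python sketch of the same idea, assuming that the two recursive calls are handed to a small thread pool for the first few levels only. The `depth` parameter and the helper `merge` are illustrative names. CPython threads will not give a real speedup for this CPU-bound work, so the code only shows the structure.

```python
from concurrent.futures import ThreadPoolExecutor

def merge(left, right):
    """Merge two sorted lists, following the loop structure of the pseudocode."""
    out, j = [], 0
    for x in left:
        while j < len(right) and right[j] < x:
            out.append(right[j])
            j += 1
        out.append(x)
    out.extend(right[j:])    # whatever is left of the right half
    return out

def parallel_merge_sort(a, depth=2):
    if len(a) <= 1:
        return list(a)
    half = len(a) // 2
    if depth > 0:
        # DOINPARALLEL: both halves are sorted by separate workers
        with ThreadPoolExecutor(max_workers=2) as pool:
            left = pool.submit(parallel_merge_sort, a[:half], depth - 1)
            right = pool.submit(parallel_merge_sort, a[half:], depth - 1)
            return merge(left.result(), right.result())
    # below the parallel levels: plain sequential recursion
    return merge(parallel_merge_sort(a[:half], 0),
                 parallel_merge_sort(a[half:], 0))
```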

In a picture

Notes Merge Sort

•  Collects the sorted list onto one processor, merging as items come together

•  Maps well to a tree structure: sort locally on the leaves, then merge up the tree

•  As items approach the root of the tree, processors are dropped, limiting parallelism

•  O(n) if P = n: 1 + 2 + 4 + … + n/2 + n = n(1 + 1/2 + 1/4 + …) ≈ 2n

Parallel QuickSort

QuickSort (A)
  if |A| <= 1 then return A
  i = rand_int (|A|)
  p = A[i]
  DOINPARALLEL
    L = QuickSort({a ∈ A | a < p})
    E = {a ∈ A | a = p}
    G = QuickSort({a ∈ A | a > p})
  Return ( L || E || G )
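As a sketch, the three-way split can be written in Python as follows. The two recursive calls are independent of each other, which is what makes them candidates for parallel execution; here they simply run one after the other.

```python
import random

def quicksort(a):
    if len(a) <= 1:                       # also covers an empty partition
        return list(a)
    p = a[random.randrange(len(a))]       # random pivot
    less = quicksort([x for x in a if x < p])      # L
    equal = [x for x in a if x == p]               # E
    greater = quicksort([x for x in a if x > p])   # G
    return less + equal + greater                  # L || E || G
```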

If we assume that the pivots are chosen such that L and G are about equal in size, then

Sequential: T(n) = 2T(n/2) + O(n) = O(n log n)

In fact it can be proven that this bound holds in expectation when the pivots are chosen at random! For parallel execution the choice of i is crucial for load balance. Even more importantly, we would like to choose multiple pivots (p-1) at the same time, so that each time we get p partitions which can be executed in parallel.

P partitions

•  For a given p (number of partitions, which requires p-1 pivots) and s (oversampling rate), first select p*s candidate pivots at random:
     for i from 1 to p*s
       Cand[i] = rand_int (|A|)

•  Sort the list of candidate pivots Cand[i]

•  Choose Cand[s], Cand[2*s], …, Cand[(p-1)*s]

Find a good value for the oversampling rate: s > 1
→ s should not lead to very long sorting times
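The selection steps above can be sketched in Python. The function names, the `rng` parameter and the use of `bisect` to assign elements to partitions are assumptions made for illustration. Candidates are sampled as values rather than indices so that sorting them is direct.

```python
import bisect
import random

def choose_pivots(a, p, s, rng=random):
    """Pick p-1 pivots from p*s randomly sampled candidates."""
    cand = sorted(rng.choice(a) for _ in range(p * s))
    # Cand[s], Cand[2*s], ..., Cand[(p-1)*s] in 1-based indexing
    return [cand[k * s - 1] for k in range(1, p)]

def partition(a, pivots):
    """Distribute a over len(pivots)+1 partitions; each could be sorted in parallel."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:
        buckets[bisect.bisect_right(pivots, x)].append(x)
    return buckets
```

Sorting each partition independently and concatenating the results in order gives the sorted list. A larger s tends to give more evenly sized partitions, at the cost of sorting more candidates.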

Parallel Radix Sort

Instead of comparing values: COMPARE DIGITS

Radix_Sort (A, b)   # Assume binary representations of keys
  for i from 0 to b-1
    FLAGS = { (a>>i) mod 2 | a ∈ A }
    NOTFLAGS = { 1-FLAGS[a] | a ∈ A }
    R_0 = SCAN (NOTFLAGS)
    s_0 = SUM (NOTFLAGS)
    R_1 = SCAN (FLAGS)
    R = { if FLAGS[j] == 0 then R_0[j] else R_1[j] + s_0 | j ∈ [0…|A|-1] }
    A = A sorted by R
  Return ( A )

(a>>i) mod 2: right shift i times, so e.g. 01101>>2 mod 2 = 00011 mod 2 = 1

So (a>>i) mod 2 equals the (i+1)th rightmost bit of a
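A minimal Python sketch of the pseudocode, assuming that SCAN is an exclusive prefix sum (each entry is the sum of all earlier entries). The scan is written sequentially here. `exclusive_scan` is an illustrative helper name.

```python
def exclusive_scan(flags):
    """out[j] = sum of flags[0..j-1]."""
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out

def radix_sort(a, b):
    """LSD radix sort of non-negative integers that fit in b bits."""
    a = list(a)
    for i in range(b):
        flags = [(x >> i) & 1 for x in a]     # bit i of every key
        notflags = [1 - f for f in flags]
        r0 = exclusive_scan(notflags)         # rank among keys with bit 0
        s0 = sum(notflags)                    # number of keys with bit 0
        r1 = exclusive_scan(flags)            # rank among keys with bit 1
        rank = [r0[j] if flags[j] == 0 else r1[j] + s0
                for j in range(len(a))]
        out = [None] * len(a)
        for j, x in enumerate(a):             # scatter: rank is a permutation
            out[rank[j]] = x
        a = out
    return a
```

Every pass is stable (keys with equal bits keep their relative order), which is what makes processing the bits from least to most significant correct.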

LSD/MSD Radix Sort

Instead of (a>>i) mod 2 one can also implement Radix Sort with (a<<i) div 2^(b-1), where the shifted value is kept to b bits.

The first implementation is called least significant digit Radix Sort or LSD Radix Sort. The latter one is MSD Radix Sort.
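The two bit selections can be compared directly. Python integers are unbounded, so the sketch below assumes a b-bit word and masks the left shift explicitly. The function names are illustrative.

```python
def lsd_bit(a, i):
    """Bit visited in iteration i when starting from the least significant bit."""
    return (a >> i) % 2

def msd_bit(a, i, b):
    """Bit visited in iteration i when starting from the most significant bit."""
    mask = (1 << b) - 1                   # emulate a b-bit register
    return ((a << i) & mask) >> (b - 1)   # div 2^(b-1)
```

For a = 01101 and b = 5, iterating i = 0…4 reads the bits right to left with `lsd_bit` and left to right with `msd_bit`.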

Notes Radix Sort

•  Sequential time complexity: T(n) = O(b·n), i.e. b iterations, each iteration O(n)

•  Note that b ≈ log n, so a total of O(n log n)

•  Instead of single digits, a block of r digits can be taken each time, resulting in b/r iterations

Illustration (LSD Radix Sort)

Sorting of each selected digit in Radix Sort, with Prefix Sum Based Sorting

Each element i of the prefix sum array holds the SUM of all elements whose index is smaller than i

What is the relationship with sorting?

•  All bits which are equal to 0 are flagged with a 1

•  Compute the Prefix Sum of this flag array

•  Store all flagged (1) entries of x[k] in the location indicated by the prefix sum

Second stage

•  All bits which are equal to 1 are flagged with a 1

•  Compute the Prefix Sum of this flag array

•  Store all flagged (1) entries of x[k] in the next locations indicated by the prefix sum

What about parallel execution?

•  Computationally, the sorting algorithm is reduced to computing the prefix sum arrays for each bit ranking.

•  However, computing these prefix sum arrays seems to be inherently sequential. Or not?

Parallel Execution of Prefix Sums

Prefix_Sum (X)   # X is an n-bit array
  for index from 0 to log n
    DOINPARALLEL for all k
      if k >= 2^index then X[k] = X[k] + X[k - 2^index]   # all reads use the values from before this step
  X >> 1   # Shift all entries one position to the right
  Return ( X )
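The doubling scheme can be simulated in Python. This sketch assumes that every parallel step reads the array as it was before the step, which is modelled with a snapshot copy. The inner loop stands in for the DOINPARALLEL over all k.

```python
def parallel_prefix_sum(x):
    """Exclusive prefix sum in about log2(n) doubling steps."""
    x = list(x)
    n = len(x)
    stride = 1                        # 2^index
    while stride < n:
        prev = x[:]                   # snapshot: reads see pre-step values
        for k in range(stride, n):    # conceptually one parallel step
            x[k] = prev[k] + prev[k - stride]
        stride *= 2
    # shift right by one so that entry k sums only the entries before k
    return [0] + x[:-1] if n else []
```

Each of the roughly log n steps does O(n) additions, so the total work is O(n log n), but with n processors every step takes constant time.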

Illustration of parallel Prefix Sums

Improving Cache Performance

•  The parallel prefix sum algorithm requires the whole array to be fetched at each iteration

•  Bad cache performance

•  Through Tiling Techniques the X array can be cut into slices (tiles)

•  Once every number of iterations: re-tile!

•  A CUDA implementation of the overall algorithm can be found on https://github.com/debdattabasu/amp-radix-sort

[Figure: array X cut into tiles assigned to processors P1, P2 and P3, with stride 2^index]

Bitonic Sorting

Based on bitonic sequences: A[1], A[2], …, A[n-1], A[n] is bitonic iff there is a j and k such that

•  A[1] … A[j] is monotonic increasing,

•  A[j] … A[k] is monotonic decreasing,

•  A[k] … A[n], A[1] is monotonic increasing (note the wrap-around to A[1])

OR vice versa

A "better" definition of Bitonic Sequence

A bitonic sequence is a sequence with A[1] <= A[2] <= … <= A[k] >= … >= A[n-1] >= A[n] for some k (1 <= k <= n), or a circular shift of such a sequence.
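The circular-shift definition suggests a simple test: walk around the sequence as a cycle and count how often the direction changes, ignoring equal neighbours. The following Python sketch rests on the observation that a sequence is a circular shift of an increasing-then-decreasing sequence exactly when there are at most two such changes. The function name is illustrative.

```python
def is_bitonic(a):
    n = len(a)
    signs = []                            # +1 for a rise, -1 for a fall
    for i in range(n):
        d = a[(i + 1) % n] - a[i]         # includes the step from A[n] back to A[1]
        if d != 0:
            signs.append(1 if d > 0 else -1)
    # signs[i - 1] wraps around for i == 0, so the count is circular
    changes = sum(1 for i in range(len(signs)) if signs[i] != signs[i - 1])
    return changes <= 2
```

A sequence with two peaks has four direction changes around the cycle, however it is rotated.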

In a picture

[Figure with two panels: "Bitonic", and "Not Bitonic" (if rotated: two peaks)]

A[1] >= A[2] >= … >= A[k] <= … <= A[n-1] <= A[n] leads to the same definition

Bitonic "Merge"

Bitonic_Merge (A)   # A is a bitonic sequence
  n = |A|
  if n == 1 then return A
  half_n = floor(n/2)
  for i from 1 to half_n
    c[i] = min(A[i], A[i+half_n])
    d[i] = max(A[i], A[i+half_n])
  DOINPARALLEL
    Bitonic_Merge (c[1]…c[half_n])
    Bitonic_Merge (d[1]…d[half_n])
  Return ( )

Notes Bitonic Merge

•  Each c and d sequence is a bitonic sequence again

•  For all i: c[i] <= d[i]

•  At the end we are left with sorted bitonic sequences of length 1, hence a sorted sequence
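A minimal Python sketch of the merge, assuming that the length is a power of two and that the result is the merged c half followed by the merged d half. The slide leaves the return value blank, so that concatenation is an assumption.

```python
def bitonic_merge(a):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    # all min/max pairs are independent: one parallel step
    c = [min(a[i], a[i + half]) for i in range(half)]
    d = [max(a[i], a[i + half]) for i in range(half)]
    # the two recursive merges are independent as well (DOINPARALLEL)
    return bitonic_merge(c) + bitonic_merge(d)
```

For the bitonic input 1, 4, 6, 8, 7, 5, 3, 2 the first step gives c = 1, 4, 3, 2 and d = 7, 5, 6, 8: both are bitonic, and no element of c is larger than any element of d.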

Bitonic Merge always yields bitonic sequences

Bitonic Merge Network

[Figures: the bitonic merge network, shown over three slides]

Parallel Bitonic Sort

Bitonic_Sort (A)
  n = |A|
  if n == 1 then return A
  for i from 0 to log(n)
    DOINPARALLEL for all k = m·2^i, k < n
      Bitonic_Merge (A[k]…A[k+2^i-1]) *
  Return ( )

* For odd values of m, interchange min and max
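The complete sort can be sketched in Python as an in-place comparator network, assuming that the length is a power of two. This is the common iterative formulation rather than a literal transcription of the pseudocode: `size` plays the role of 2^i, and the direction test plays the role of interchanging min and max for odd m.

```python
def bitonic_sort(a):
    a = list(a)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    size = 2
    while size <= n:                  # merge blocks of this size
        stride = size // 2
        while stride >= 1:            # one step of the merge network
            for i in range(n):        # conceptually parallel
                partner = i ^ stride
                if partner > i:
                    # blocks with an odd block number sort descending
                    ascending = (i & size) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            stride //= 2
        size *= 2
    return a
```

Neighbouring blocks are sorted in opposite directions, so each pair of blocks forms one bitonic sequence for the next round.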

Notes Bitonic Sort

•  Each iteration creates longer and longer bitonic sequences

•  In the last iteration the whole sequence is bitonic and the final bitonic merge creates a sorted list

Bitonic Sort Network

[Figure: bitonic sort network, with the labels below]

•  Four bitonic lists of length 2, constituting 2 bitonic lists of length 4

•  2 Bitonic Merge Networks

•  4 Bitonic Merge Networks

Why alternating max/min?

Note that at the start of each Bitonic Merge Network we have two bitonic sequences which together must constitute one bitonic sequence! If one of these sequences is (monotonic) increasing and the other is (monotonic) decreasing, then this is always the case. If both are increasing or both are decreasing, this is not necessarily the case, i.e. [figure: example sequence] is not bitonic

Notes Bitonic Sort Network

•  Assume n = 2^k

•  The bitonic merge stages have 1, 2, 3, …, k steps each, so the time to sort is
   T(n) = 1 + 2 + … + k = k(k+1)/2 = O(k^2) = O(log^2 n)

•  Each step requires n/2 processors, so the total number of processors is O((n/2) log^2 n)

•  The network can handle multiple pipelined lists, producing a sorted list each time step
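The step count can be checked by generating the network explicitly. The sketch below assumes n = 2^k and the usual bitonic wiring in which element i is compared with element i XOR stride. The function name is illustrative.

```python
def bitonic_network_steps(n):
    """Return the comparator pairs of every step of the sort network."""
    steps = []
    size = 2
    while size <= n:                  # merge stage for blocks of this size
        stride = size // 2
        while stride >= 1:            # the steps inside one merge stage
            steps.append([(i, i ^ stride) for i in range(n) if i ^ stride > i])
            stride //= 2
        size *= 2
    return steps
```

For n = 8 (k = 3) this gives 1 + 2 + 3 = 6 steps of 4 comparators each, i.e. k(k+1)/2 steps with n/2 processors per step.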