Data Structures of the Future: Concurrent, Optimistic, and Relaxed
Dan Alistarh, ETH Zurich / IST Austria
Why Concurrent?
Simple: to get speedup on newer hardware. Scaling: more threads should imply more useful work.
The Problem with Concurrency
Concurrency can be very bad value for money.
Is this problem inherent?
[Figure: Throughput of a Lock-Free Queue (Packet Processing) — throughput (events/second, 0 to 6M) vs. number of threads (0 to 70), on machines costing <$1000 and >$10000.]
Inherent Sequential Bottlenecks
Data structures with strong ordering semantics:
• Stacks, Queues, Priority Queues, Exact Counters
This is bad news because of Amdahl's Law:
• Programs whose critical path contains contended data structures won't parallelize well.
Theorem: Given n threads, any deterministic, strongly ordered data structure has an execution in which a processor takes time linear in n to return. [Ellen, Hendler, Shavit, SICOMP 2013] [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
To get performance, it is critical to speed up shared data structures.
Today's Talk
How can we scale such data structures?
Theory ↔ Software ↔ Hardware
New Hardware Instructions!
New Data Structure Designs!
Lock-Free Data Structures 101
• Optimistic programming patterns
• Do not use locks, but atomic instructions (Compare&Swap)
• Blocking of one thread shouldn't stop the whole system
• Lots of implementations: Hash Tables, Lists, Trees, Queues, Stacks, etc.

Example: lock-free counter.

    Memory location R;

    int fetch-and-increment() {
        int val, new_val;
        do {
            val = Read(R);
            new_val = val + 1;
        } while (!Compare&Swap(&R, val, new_val));
        return val;
    }
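The same optimistic pattern can be sketched in runnable Python. Since Python has no hardware CAS, a lock inside `cas` only models the atomicity of the single instruction; the increment loop itself takes no locks:

```python
import threading

class AtomicCell:
    """Models a memory location with an atomic compare-and-swap."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity of CAS

    def read(self):
        return self.value

    def cas(self, expected, new):
        # Atomically: if value == expected, install new and report success.
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False

def fetch_and_increment(cell):
    # Optimistic lock-free pattern: read, compute, try to commit, retry on failure.
    while True:
        val = cell.read()
        if cell.cas(val, val + 1):
            return val

R = AtomicCell()
threads = [threading.Thread(target=lambda: [fetch_and_increment(R) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(R.read())  # 4000: every increment eventually succeeds, no update is lost
```

Individual CAS attempts may fail under contention, but some thread always makes progress, so the final count is exact.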
The Lock-Free Paradox
In theory, threads could starve in optimistic lock-free implementations: a thread's Compare&Swap can fail forever while other threads keep winning.
Practice: this doesn't always happen. Threads rarely starve.
Why?
(One alternative: use more complex wait-free algorithms.)
Analyzing Lock-Free Patterns
• Stochastic scheduler [STOC14, Transact15]: at each scheduling step, the next scheduled thread is picked from a distribution p = (p1, p2, …, pn) with pi > 0 for all i.
Theorem 1: Under any stochastic scheduler, any lock-free algorithm is wait-free with probability 1. [Alistarh, Censor-Hillel, Shavit, STOC14 / JACM16]
Theorem 2: Under high contention, roughly one in Θ(1/‖p‖₂) ops succeeds. [Alistarh, Sauerwald, Vojnovic, PODC15]
Lock-Free Algorithm + Stochastic Scheduler = Stochastic Contention Game
The Contention Game
Each thread repeatedly runs: READ(R); CAS(R, old, old + 1); success. The shared register (location) R holds the counter value (0, 1, 2, …), and at each step the scheduler picks the next thread to run from the distribution (p1, p2, …, pn).
Given arbitrary p, what is the stationary behaviour of this system?
The Contention Game, Balls & Bins View
• Bins = threads; balls = scheduler steps (READ, CAS, success), placed according to the distribution (p1, p2, …, pn)
• To complete the operation, a bin must receive 3 balls before the others
• A completion resets all bins holding 2 balls (their pending CAS fails)
• The winner keeps one ball
How many balls does a bin receive on average between two wins? → Step complexity.
How many total balls are distributed between two wins, on average? → System latency.
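A small Monte Carlo sketch of this game (the exact reset rule below follows the slide's description; treat it as an illustrative assumption, not the paper's formal model):

```python
import random

def contention_game(p, successes=2000, seed=1):
    """Simulate the balls-and-bins contention game.

    Each thread needs 3 scheduler steps (read, CAS, success) to win;
    a win resets every thread currently at 2 steps (their CAS fails),
    and the winner keeps one ball. Returns the average number of
    scheduler steps between two wins, i.e. the system latency."""
    rng = random.Random(seed)
    n = len(p)
    balls = [0] * n
    steps, wins = 0, 0
    while wins < successes:
        i = rng.choices(range(n), weights=p)[0]
        balls[i] += 1
        steps += 1
        if balls[i] == 3:  # this thread's CAS succeeds
            wins += 1
            balls = [0 if b == 2 else b for b in balls]  # pending CASes fail
            balls[i] = 1  # winner keeps one ball
    return steps / wins

n = 64
uniform = [1.0 / n] * n
latency = contention_game(uniform)
# Theory predicts system latency Theta(1/||p||_2) = Theta(sqrt(n)), i.e. ~8 here.
print(round(latency, 2))
```

Running it with a skewed distribution instead of `uniform` shows the fairness-throughput trade-off: latency drops, but the favored thread wins almost every round.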
The Result
Theorem: Given an arbitrary distribution p and a constant-length lock-free algorithm, the following hold:
• System latency is Θ(1/‖p‖₂)
• Individual latency of thread i is Θ(‖p‖₂ / pi²)
Examples:
1. Uniform p = (1/n, 1/n, …, 1/n): system latency is Θ(√n) [ACHS, JACM16]; individual latency is Θ(n√n).
2. Non-uniform p = (p1 ↗ 1, p2 ↘ 0, …, pn ↘ 0): system latency is (close to) constant; individual latency is either constant (for the favored thread) or unbounded (for threads whose pi ↘ 0).
3. Given threads i and j, relative throughputs are (pi/pj)².
Fairness-Throughput Trade-off!
Other game types are covered as well, e.g. obstruction-free algorithms.
Moral: Under high contention, roughly one in √n ops succeeds.
Why does this graph look so bad?
[Figure: Throughput of a Parallel Event Processing Queue — throughput (events/second, 0 to 6M) vs. number of threads (0 to 70).]
What Happens at the Hardware Level?
Directory-based cache coherence (Intel, AMD): thread 0 issues Read(R) and then CAS(R, old, new); meanwhile thread 1's Read(R) pulls the cache line away, so thread 0's CAS fails, and the pattern repeats.
We waste time because ownership of R circulates without useful work! Example: at 64 threads, only one in 8 message exchanges is useful.
Fixing It: Lease/Release [Alistarh, Haider, Hasenplaugh, PPOPP 2016]
Directory-based cache coherence (Intel, AMD): core 0 leases R for an interval T, runs Read(R) and CAS(R, old, new) to success; core 1's competing Read(R)/CAS(R, old, new) is delayed until the lease ends.
Each transfer results in at least one useful operation!
Doubling down on optimism!
Lease/Release, More Precisely
• Programmer optimistically leases variables for bounded time
  • void ReqLease(void* address, int data_size, time T);
  • void ReqRelease(void* address, int data_size, time T);
  • Lease time on the order of 1000 cycles
• Performance penalty if leases expire before operation completion
  • Usually occurs <5% of the time
• Prototype in the MIT Graphite processor simulator
  • Directory-based MESI cache coherence protocol
  • Protocol remains provably correct
  • Minimal changes to the architecture
Does it work?
Packet Processing Queue with Lease/Release (simulated in Graphite)
[Figure: Queue throughput (0 to 7M ops/s) vs. number of threads (0 to 70), NO_LEASE vs. SINGLE_LEASE; SINGLE_LEASE is up to 4.5X faster.]
• Dequeue operation:
  1. Top_Node = Lease&Read(Head)
  2. Next_Node = Read(Top_Node.ptr)
  3. ATOMIC {
       if (Read(Head) == Top_Node)
         Write&Release(Head, Next_Node)
       else
         Release and goto 1
     }
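A software model of this dequeue, as a sketch: the hardware lease is imitated by a short exclusive hold on Head, so the atomic compare in step 3 always succeeds (this is an analogy for intuition, not the real mechanism, which keeps the operation lock-free):

```python
import threading

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.ptr = nxt

class LeasedQueue:
    """Model of the leased dequeue: holding the lease on Head is modeled
    as holding a lock, so no competing core can move Head mid-operation."""
    def __init__(self, items):
        self.head = None
        for v in reversed(items):
            self.head = Node(v, self.head)
        self._lease = threading.Lock()  # stands in for the hardware lease on Head

    def dequeue(self):
        with self._lease:          # Lease&Read(Head)
            top = self.head
            if top is None:
                return None
            nxt = top.ptr          # Read(Top_Node.ptr)
            # Under the lease nobody else touched Head, so the atomic
            # check "Read(Head) == Top_Node" holds and the write commits:
            self.head = nxt        # Write&Release(Head, nxt)
            return top.value

q = LeasedQueue([1, 2, 3])
print([q.dequeue() for _ in range(4)])  # [1, 2, 3, None]
```

In the real design the lease expires after a bounded time T, so a stalled core cannot block others forever; the lock here has no such bound and is only for illustration.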
[Figure: Energy for the queue (nJ/operation, 0 to 25,000) vs. number of threads (0 to 70), NO_LEASE vs. SINGLE_LEASE.]
What Else? Locks
Can we avoid the wasted coherence messages? Under a contended spinlock, core 1 spins on CAS(L) while core 0 runs Acquire(L), Write(L), Unlock(L); each Req(R, EX)/Resp(R) exchange in the directory-based cache (Intel, AMD) migrates the lock's cache line without making progress.
Simply lease the lock on acquire! Core 0 leases L for an interval T covering its critical section; core 1's competing CAS(L) is delayed until the lease ends.
PageRank with L/R
• Works with lock-based programs as well
  • Lease the lock before acquiring it
  • Release it before giving it up
[Figure: Parallel PageRank running time (completion time in ns, lower is better) at 2, 4, 8, 16, and 32 threads, NO_LEASE vs. WITH_LEASE; up to 9.5X faster with leases.]
Lease/Release
• Hardware lock queues [iQOLB: Rajwar, Kaegi, Goodman; HPCA 2000]
  • Locks using Load-Linked/Store-Conditional
  • Load-Linked takes a "lease" on the lock, Store-Conditional "releases" it
  • Applied automatically by the processor's speculation mechanism
• Transient Blocking Synchronization [Shalev, Shavit; Sun Tech Report 2004]
  • Proposes Load&Lease / Store&Release instructions for non-coherent DSM machines
  • Different semantics, never implemented
• The paper also contains:
  • Hardware implementation details (no directory modifications!)
  • Blueprint for implementing multiple concurrent leases (transactions)
  • Lots of experiments
The High-Level View
• The problem with concurrency: inherent bottlenecks lead to meltdowns
• Why? Contention hurts optimistic patterns, quantifiably so
• Lease/Release: we can now scale bottlenecks, within reason; optimism enforced at the hardware level
Can we scale beyond bottlenecks?
Let's Relax!
Concurrent Priority Queues
PriorityQueue<key, value>, e.g. holding tasks with keys 1, 3, 4, 5, 7, 8, 11, 15, 18.
Methods:
• DeleteMin(): get the top task
• Insert/Delete(k, v): insert or delete a task
• Search(key): search for a task
Extremely useful:
• Graph operations (shortest paths)
• Operating system kernel
• Time-based simulations
We are looking for a fast concurrent priority queue.
The Problem
Target: a fast, concurrent priority queue.
Lots of work on the topic: [Sanders 97], [Lotan & Shavit 00], [Sundell & Tsigas 07], [Linden & Jonsson 13], [Lenhart et al. 14], [Wimmer et al. 14].
Current solutions are hard to scale: DeleteMin is highly contended. Everyone wants the same element!
Concurrent Solution: the SkipList [Pugh90]

    H  1  3  4  5  9  …  T    (head … tail)

• Linked list, sorted by priority
• Each node has a random "height" (geometrically distributed with parameter ½)
• Elements at the same height form their own lists
• Average time for Search, Insert, and Delete is logarithmic, and operations work concurrently [Pugh98, Fraser04]
Example: Search(5) descends from the head's top level: [H,9] → [1,9] → [5,9] → stop: found!
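A compact sequential sketch of a skip list (geometric heights plus top-down search; all concurrency control is omitted, so this only illustrates the structure, not the lock-free version):

```python
import random

class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[l] = successor at level l

class SkipList:
    """Sequential skip list sketch: random geometric heights, O(log n) search."""
    MAX_HEIGHT = 16

    def __init__(self):
        self.head = SkipNode(float('-inf'), self.MAX_HEIGHT)

    def _random_height(self):
        # Geometric with parameter 1/2: height h with probability 2^-h.
        h = 1
        while h < self.MAX_HEIGHT and random.random() < 0.5:
            h += 1
        return h

    def insert(self, key):
        # Record, at each level, the last node strictly before key.
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
            update[level] = node
        new = SkipNode(key, self._random_height())
        for level in range(len(new.next)):
            new.next[level] = update[level].next[level]
            update[level].next[level] = new

    def search(self, key):
        # Descend from the top level, moving right while keys are smaller.
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
        node = node.next[0]
        return node is not None and node.key == key

s = SkipList()
for k in [1, 3, 4, 5, 9]:
    s.insert(k)
print(s.search(5), s.search(2))  # True False
```

The concurrent versions [Pugh98, Fraser04] replace the pointer updates with CAS operations so that searches, inserts, and deletes can proceed in parallel.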
The SkipList as a PQ
[I. Lotan and N. Shavit. Skiplist-Based Concurrent Priority Queues. 2000.]
• DeleteMin: simply remove the smallest element from the bottom list
• All processors compete for the smallest element
• Does not scale!
The Idea: Relax!
• We want to choose an item at random with 'good' guarantees
• Minimize loss of exactness by only choosing items near the front of the list
• Minimize contention by keeping collision probability low
DeleteMin: The Spray [Alistarh, Kopinsky, Li, Shavit, PPoPP 2015]
procedure Spray():
• At each skip list level, flip a coin to stay or jump forward
• Repeat for each level from log n down to 1 (the bottom)
• As if removing a random-priority element near the head
(The slide shows two example spray walks from starting height 4.)
SprayList Probabilistic Guarantees
Spray and pray? Let p(x) = the probability that a spray returns the value at index x.
✓ The maximum value returned by a Spray has rank O(n log³n) — sprays aren't too wide
✓ For all x up to Õ(n), p(x) = Õ(1/n) — sprays don't cluster too much
✓ If x > y is returned by some Spray, then p(y) = Ω(1/n) — elements do not starve in the list
One Benchmark
• Discrete Event Simulation
• Exact algorithms have negative scaling after 8 threads
• SprayList is competitive with the random remover (no guarantees, incorrect execution)
In many practical settings (D.E.S., shortest paths), priority inversions are not expensive.
The MultiQueue [Rihani, Dementiev, Sanders, SPAA15]
• n lock-free or lock-based queues
• Insert: pick a random queue, lock it, and insert into it
• Remove: pick two queues at random, lock and remove the better element
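The scheme above can be sketched in a few lines; this single-threaded model uses sequential heaps and omits the per-queue locks (a sketch of the idea, not the concurrent implementation):

```python
import heapq
import random

class MultiQueue:
    """MultiQueue sketch: many heaps; insert into one at random,
    remove the better of two random tops."""
    def __init__(self, n_queues, rng=random):
        self.queues = [[] for _ in range(n_queues)]
        self.rng = rng

    def insert(self, key):
        # Pick a random queue and insert into it.
        heapq.heappush(self.rng.choice(self.queues), key)

    def delete_min(self):
        # Pick two queues at random; remove the smaller of their minima.
        if not any(self.queues):
            return None
        while True:
            a, b = self.rng.sample(self.queues, 2)
            if a or b:
                break
        if not a or (b and b[0] < a[0]):
            a = b
        return heapq.heappop(a)

rng = random.Random(0)
mq = MultiQueue(8, rng)
keys = rng.sample(range(1000), 100)
for k in keys:
    mq.insert(k)
out = [mq.delete_min() for _ in range(100)]
# Every element is removed exactly once; the order is only
# approximately sorted -- each removal is near the current minimum.
print(sorted(out) == sorted(keys))  # True
```

Because each operation touches only one or two random queues, contention is spread evenly instead of focusing on a single minimum.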
[Figure 3: Throughput (MOps/s, 0 to 80) of 50% insert / 50% deleteMin operations with uniformly distributed keys, at 0 to 56 threads: MultiQ c=2, MultiQ HT c=2, MultiQ c=4, Spraylist, Linden, Lotan.]
[Figure 4: Throughput (MOps/s, 0 to 80) of 50% insert / 50% deleteMin operations with monotonic keys, at 0 to 56 threads: MultiQ c=2, MultiQ HT c=2, MultiQ c=4, Spray.]
Looks good, but does it actually guarantee anything?
The Random Process
Example state: Q1 = {1, 6, 10, 13}, Q2 = {4, 7, 12, 16}, Q3 = {2, 3, 8, 15}, Q4 = {5, 9, 11, 14}. WLOG, elements are consecutive labels.
1. Insert elements u.a.r.
2. Remove using two choices
• Cost = rank of the element removed among the remaining elements, e.g. Cost(2) = 2, Cost(4) = 3, Cost(1) = 1. Intuitively, the distance from optimal.
We are interested in the average rank removed at each step.
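The random process above can be simulated directly (here all elements are inserted up front and then drained, a simplification of the interleaved process):

```python
import heapq
import random

def rank_cost_process(n_queues, n_elems, rng):
    """Simulate the random process: insert consecutive labels u.a.r.
    into n queues, then repeatedly remove the smaller of two random
    queues' minima; cost = rank of the removed element among those
    still remaining. Returns the average cost over all removals."""
    queues = [[] for _ in range(n_queues)]
    for label in range(n_elems):
        heapq.heappush(rng.choice(queues), label)
    remaining = list(range(n_elems))  # sorted list of live labels
    total_cost = 0
    for _ in range(n_elems):
        while True:
            a, b = rng.sample(queues, 2)
            if a or b:
                break
        if not a or (b and b[0] < a[0]):
            a = b
        elem = heapq.heappop(a)
        # Rank among remaining elements (1-based); a linear scan is
        # fine for a sketch, a real run would track ranks incrementally.
        total_cost += remaining.index(elem) + 1
        remaining.remove(elem)
    return total_cost / n_elems

rng = random.Random(0)
avg = rank_cost_process(n_queues=16, n_elems=2000, rng=rng)
# The theorem below predicts average cost O(n); here n = 16.
print(avg < 5 * 16)
```

The simulation's average cost stays a small multiple of the number of queues, matching the O(n) expectation bound.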
The Result
Theorem: Given n queues, for any t > 0, the cost at t is O(n) in expectation, and O(n log n) w.h.p.
• Strategy 1: reduction to the power-of-two-choices analysis? [Azar et al., SICOMP99]
  • Would apply if we could equate queue size with top label (round-robin insert)
  • The reduction does not hold in general, and in fact, experimentally, queue size and top priority appear to be uncorrelated.
• Strategy 2: some simple sort of induction
  • The initial cost distribution is nice; can we prove it always stays nice?
The Result
Theorem: For any t > 0, the cost at t is O(n) in expectation, and O(n log n) w.h.p.
Hard case: over time, we'll eventually get arbitrary distributions, e.g. one queue holding 1, 2, …, and the others holding only K, K+1, K+2, K+3. We have to prove that the algorithm gets out of those reasonably fast.
• Strategy 3: some "simple" complicated sort of induction / potential argument
  • Idea: characterize what's going on step-by-step
The Result
Theorem 1: For any t > 0, the cost at t is O(n) in expectation, and O(n log n) w.h.p.
In expectation, the increment is n.
Problem: the behavior at a step is highly correlated with what happened in previous steps.
Proof Strategy
Theorem 1: For any t > 0, the cost at t is O(n) in expectation, and O(n log n) w.h.p.
• Step 1: reduce to an uncorrelated exponential process
  • Prove that the rank distribution is preserved
• Step 2: characterize the exponential process
  • Characterize the average weight on top of the queues via a potential argument
• Step 3: characterize the rank distribution of the exponential process
  • Prove that the average rank is O(n)
Step 1: The Exponential Process
• Insert: pick a random queue; insert an exponentially distributed increment with expected value n into it
• Remove: pick two queues at random, remove the lower label
• Cost: the rank of the element removed (still)
Example state: Q1 = {1.8, 5.9, 10.2, 13.2}, Q2 = {4.7, 7.3, 12.5, 16.8}, Q3 = {2.2, 3.2, 8.3, 15.2}, Q4 = {5.1, 9.5, 11.7, 14.2}.
Theorem: The distribution of removed ranks is the same in the discrete process and in the exponential process.
Key fact: Pr[rank k is in queue j] = 1/n. This holds since the exponential distribution is memoryless.
Uses tools from [Peres, Talwar, Wieder, R.S.A. 14]
Step 2: Analyzing the Exponential Process
• Fix a removal step t. Let w_i(t) be the label (real value) on top of bin i.
• Let x_i(t) = w_i(t)/n (normalized weights), and μ(t) = Σ_{i=1}^{n} x_i(t)/n.
• Let Φ(t) = Σ_{i=1}^{n} exp(x_i(t) − μ(t)) and Ψ(t) = Σ_{i=1}^{n} exp(−(x_i(t) − μ(t))).
• No more correlations: since weight increments are independent of previous steps, we can bound the expected increase in potential at each step.
• Bad configurations: Φ(t) and Ψ(t) cannot both be large at the same time. If their sum breaks the O(n) barrier, then the large potential will decrease very fast.
• Φ(t) + Ψ(t) is then a super-martingale, which implies the bound:
Theorem: For any t > 0, E[Φ(t) + Ψ(t)] = O(n).
Step 3: What does all this have to do with ranks?
• Let B_{>s}(t) be the number of bins with weight > μ + s at time t.
• Let B_{<−s}(t) be the number of bins with weight < μ − s at time t.
Theorem: For any t > 0, E[B_{>s}(t)] = O(n·e^{−Θ(s)}) and E[B_{<−s}(t)] = O(n·e^{−Θ(s)}).
Weights become "rarefied" at ranks s higher and s lower than the mean value.
But on average, we'll choose something close to the mean value! So, we conclude:
Theorem: For any t > 0, the rank cost at t is O(n) in expectation.
The worst-case bound follows in a similar way.
Applications
What if we do two choices only a β fraction of the time (one choice otherwise)?
Theorem: For any t > 0, the cost at t is O(n · poly(1/β)) in expectation, and O(n log n · poly(1/β)) w.h.p.
What if the input distribution is biased? Still works (within reason).
Works well in practice.
We can use this for approximate queues, stacks, counters, timestamps.
Concurrent Data Structures
"The data structures of our childhood are changing." — Nir Shavit
A relaxation renaissance: [KarpZhang93], [DeoP92], [Sanders98], [HenzingerKPSS13], [NguyenLP13], [WimmerCVTT14], [LenhartNP15], [RihaniSD15], [JeffreySYES16]
Data structures such as the Spraylist and the MultiQueue merge both relaxed semantics and optimistic progress to achieve scalability.
The Last Slide
Theorem: Strongly ordered data structures won't scale. [Ellen, Hendler, Shavit, SICOMP 2013] [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
How can we scale them? Theory ↔ Software ↔ Hardware
How do we specify and prove relaxed data structures correct?
What new data structures are out there?
How do these data structures interact with existing applications?
Can we prove stronger lower bounds?
Workshop Announcement
• Theory & Practice in Concurrent Data Structures
• Co-located with DISC 2017 (Vienna)
• Overall goals
  • Fostering collaboration between practically-minded conferences (PPoPP, SOSP, etc.) and the PODC/DISC community
  • New challenges in concurrent data structure design
• Precise goals
  • Better benchmarks for concurrent data structures
  • Real applications and practical issues (e.g. memory management)
  • Usefulness of relaxed designs