Data Structures of the Future: Concurrent, Optimistic, and Relaxed
Dan Alistarh, ETH Zurich / IST Austria
Why Concurrent?
Simple: to get speedup on newer hardware. Scaling: more threads should imply more useful work.
The Problem with Concurrency
Concurrency can be very bad value for money.
Is this problem inherent?
[Figure: Throughput of a Lock-Free Queue (Packet Processing) — throughput (events/second, 0 to 6M) vs. number of threads (0 to 70), on machines costing <$1000 and >$10000.]
Inherent Sequential Bottlenecks
Data structures with strong ordering semantics:
• Stacks, Queues, Priority Queues, Exact Counters
This is bad news because of Amdahl's Law:
• Programs whose critical path contains contended data structures won't parallelize well.
Theorem: Given n threads, any deterministic, strongly ordered data structure has an execution in which a processor takes time linear in n to return. [Ellen, Hendler, Shavit, SICOMP 2013] [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
To get performance, it is critical to speed up shared data structures.
Today's Talk
How can we scale such data structures?
Theory ↔ Software ↔ Hardware
New Hardware Instructions!
New Data Structure Designs!
Lock-Free Data Structures 101
• Optimistic programming patterns
• Do not use locks, but atomic instructions (Compare&Swap)
• Blocking of one thread shouldn't stop the whole system
• Lots of implementations: Hash Tables, Lists, Trees, Queues, Stacks, etc.

Example: lock-free counter.

    Memory location R;

    int fetch-and-increment() {
        int val, new_val;
        do {
            val = Read(R);
            new_val = val + 1;
        } while (!Compare&Swap(&R, val, new_val));
        return val;
    }
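The same optimistic pattern can be sketched in runnable Python. Since Python has no hardware CAS, a lock inside `cas` only models the atomicity of the single instruction; the increment loop itself takes no locks:

```python
import threading

class AtomicCell:
    """Models a memory location with an atomic compare-and-swap."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity of CAS

    def read(self):
        return self.value

    def cas(self, expected, new):
        # Atomically: if value == expected, install new and report success.
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False

def fetch_and_increment(cell):
    # Optimistic lock-free pattern: read, compute, try to commit, retry on failure.
    while True:
        val = cell.read()
        if cell.cas(val, val + 1):
            return val

R = AtomicCell()
threads = [threading.Thread(target=lambda: [fetch_and_increment(R) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(R.read())  # 4000: every increment eventually succeeds, no update is lost
```

Individual CAS attempts may fail under contention, but some thread always makes progress, so the final count is exact.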
The Lock-Free Paradox
In theory, threads could starve in optimistic lock-free implementations: a thread's Compare&Swap can fail forever while other threads keep winning.
Practice: this doesn't always happen. Threads rarely starve.
Why?
(One alternative: use more complex wait-free algorithms.)
Analyzing Lock-Free Patterns
• Stochastic scheduler [STOC14, Transact15]: at each scheduling step, the next scheduled thread is picked from a distribution p = (p1, p2, …, pn) with pi > 0 for all i.
Theorem 1: Under any stochastic scheduler, any lock-free algorithm is wait-free with probability 1. [Alistarh, Censor-Hillel, Shavit, STOC14 / JACM16]
Theorem 2: Under high contention, roughly one in Θ(1/‖p‖₂) ops succeeds. [Alistarh, Sauerwald, Vojnovic, PODC15]
Lock-Free Algorithm + Stochastic Scheduler = Stochastic Contention Game
The Contention Game
Each thread repeatedly runs: READ(R); CAS(R, old, old + 1); success. The shared register (location) R holds the counter value (0, 1, 2, …), and at each step the scheduler picks the next thread to run from the distribution (p1, p2, …, pn).
Given arbitrary p, what is the stationary behaviour of this system?
The Contention Game, Balls & Bins View
• Bins = threads; balls = scheduler steps (READ, CAS, success), placed according to the distribution (p1, p2, …, pn)
• To complete the operation, a bin must receive 3 balls before the others
• A completion resets all bins holding 2 balls (their pending CAS fails)
• The winner keeps one ball
How many balls does a bin receive on average between two wins? → Step complexity.
How many total balls are distributed between two wins, on average? → System latency.
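A small Monte Carlo sketch of this game (the exact reset rule below follows the slide's description; treat it as an illustrative assumption, not the paper's formal model):

```python
import random

def contention_game(p, successes=2000, seed=1):
    """Simulate the balls-and-bins contention game.

    Each thread needs 3 scheduler steps (read, CAS, success) to win;
    a win resets every thread currently at 2 steps (their CAS fails),
    and the winner keeps one ball. Returns the average number of
    scheduler steps between two wins, i.e. the system latency."""
    rng = random.Random(seed)
    n = len(p)
    balls = [0] * n
    steps, wins = 0, 0
    while wins < successes:
        i = rng.choices(range(n), weights=p)[0]
        balls[i] += 1
        steps += 1
        if balls[i] == 3:  # this thread's CAS succeeds
            wins += 1
            balls = [0 if b == 2 else b for b in balls]  # pending CASes fail
            balls[i] = 1  # winner keeps one ball
    return steps / wins

n = 64
uniform = [1.0 / n] * n
latency = contention_game(uniform)
# Theory predicts system latency Theta(1/||p||_2) = Theta(sqrt(n)), i.e. ~8 here.
print(round(latency, 2))
```

Running it with a skewed distribution instead of `uniform` shows the fairness-throughput trade-off: latency drops, but the favored thread wins almost every round.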
The Result
Theorem: Given an arbitrary distribution p and a constant-length lock-free algorithm, the following hold:
• System latency is Θ(1/‖p‖₂)
• Individual latency of thread i is Θ(‖p‖₂ / pi²)
Examples:
1. Uniform p = (1/n, 1/n, …, 1/n): system latency is Θ(√n) [ACHS, JACM16]; individual latency is Θ(n√n).
2. Non-uniform p = (p1 ↗ 1, p2 ↘ 0, …, pn ↘ 0): system latency is (close to) constant; individual latency is either constant (for the favored thread) or unbounded (for threads whose pi ↘ 0).
3. Given threads i and j, relative throughputs are (pi/pj)².
Fairness-Throughput Trade-off!
Other game types are covered as well, e.g. obstruction-free algorithms.
Moral: Under high contention, roughly one in √n ops succeeds.
Why does this graph look so bad?
[Figure: Throughput of a Parallel Event Processing Queue — throughput (events/second, 0 to 6M) vs. number of threads (0 to 70).]
What Happens at the Hardware Level?
Directory-based cache coherence (Intel, AMD): thread 0 issues Read(R) and then CAS(R, old, new); meanwhile thread 1's Read(R) pulls the cache line away, so thread 0's CAS fails, and the pattern repeats.
We waste time because ownership of R circulates without useful work! Example: at 64 threads, only one in 8 message exchanges is useful.
Fixing It: Lease/Release [Alistarh, Haider, Hasenplaugh, PPOPP 2016]
Directory-based cache coherence (Intel, AMD): core 0 leases R for an interval T, runs Read(R) and CAS(R, old, new) to success; core 1's competing Read(R)/CAS(R, old, new) is delayed until the lease ends.
Each transfer results in at least one useful operation!
Doubling down on optimism!
Lease/Release, More Precisely
• Programmer optimistically leases variables for bounded time
  • void ReqLease(void* address, int data_size, time T);
  • void ReqRelease(void* address, int data_size, time T);
  • Lease time on the order of 1000 cycles
• Performance penalty if leases expire before operation completion
  • Usually occurs <5% of the time
• Prototype in the MIT Graphite processor simulator
  • Directory-based MESI cache coherence protocol
  • Protocol remains provably correct
  • Minimal changes to the architecture
Does it work?
Packet Processing Queue with Lease/Release (simulated in Graphite)
[Figure: Queue throughput (0 to 7M ops/s) vs. number of threads (0 to 70), NO_LEASE vs. SINGLE_LEASE; SINGLE_LEASE is up to 4.5X faster.]
• Dequeue operation:
  1. Top_Node = Lease&Read(Head)
  2. Next_Node = Read(Top_Node.ptr)
  3. ATOMIC {
       if (Read(Head) == Top_Node)
         Write&Release(Head, Next_Node)
       else
         Release and goto 1
     }
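A software model of this dequeue, as a sketch: the hardware lease is imitated by a short exclusive hold on Head, so the atomic compare in step 3 always succeeds (this is an analogy for intuition, not the real mechanism, which keeps the operation lock-free):

```python
import threading

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.ptr = nxt

class LeasedQueue:
    """Model of the leased dequeue: holding the lease on Head is modeled
    as holding a lock, so no competing core can move Head mid-operation."""
    def __init__(self, items):
        self.head = None
        for v in reversed(items):
            self.head = Node(v, self.head)
        self._lease = threading.Lock()  # stands in for the hardware lease on Head

    def dequeue(self):
        with self._lease:          # Lease&Read(Head)
            top = self.head
            if top is None:
                return None
            nxt = top.ptr          # Read(Top_Node.ptr)
            # Under the lease nobody else touched Head, so the atomic
            # check "Read(Head) == Top_Node" holds and the write commits:
            self.head = nxt        # Write&Release(Head, nxt)
            return top.value

q = LeasedQueue([1, 2, 3])
print([q.dequeue() for _ in range(4)])  # [1, 2, 3, None]
```

In the real design the lease expires after a bounded time T, so a stalled core cannot block others forever; the lock here has no such bound and is only for illustration.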
[Figure: Energy for the queue (nJ/operation, 0 to 25,000) vs. number of threads (0 to 70), NO_LEASE vs. SINGLE_LEASE.]
What Else? Locks
Can we avoid the wasted coherence messages? Under a contended spinlock, core 1 spins on CAS(L) while core 0 runs Acquire(L), Write(L), Unlock(L); each Req(R, EX)/Resp(R) exchange in the directory-based cache (Intel, AMD) migrates the lock's cache line without making progress.
Simply lease the lock on acquire! Core 0 leases L for an interval T covering its critical section; core 1's competing CAS(L) is delayed until the lease ends.
PageRank with L/R
• Works with lock-based programs as well
  • Lease the lock before acquiring it
  • Release it before giving it up
[Figure: Parallel PageRank running time (completion time in ns, lower is better) at 2, 4, 8, 16, and 32 threads, NO_LEASE vs. WITH_LEASE; up to 9.5X faster with leases.]
Lease/Release
• Hardware lock queues [iQOLB: Rajwar, Kaegi, Goodman; HPCA 2000]
  • Locks using Load-Linked/Store-Conditional
  • Load-Linked takes a "lease" on the lock, Store-Conditional "releases" it
  • Applied automatically by the processor's speculation mechanism
• Transient Blocking Synchronization [Shalev, Shavit; Sun Tech Report 2004]
  • Proposes Load&Lease / Store&Release instructions for non-coherent DSM machines
  • Different semantics, never implemented
• The paper also contains:
  • Hardware implementation details (no directory modifications!)
  • Blueprint for implementing multiple concurrent leases (transactions)
  • Lots of experiments
The High-Level View
• The problem with concurrency: inherent bottlenecks lead to meltdowns
• Why? Contention hurts optimistic patterns, quantifiably so
• Lease/Release: we can now scale bottlenecks, within reason; optimism enforced at the hardware level
Can we scale beyond bottlenecks?
Let's Relax!
Concurrent Priority Queues
PriorityQueue<key, value>, e.g. holding tasks with keys 1, 3, 4, 5, 7, 8, 11, 15, 18.
Methods:
• DeleteMin(): get the top task
• Insert/Delete(k, v): insert or delete a task
• Search(key): search for a task
Extremely useful:
• Graph operations (shortest paths)
• Operating system kernel
• Time-based simulations
We are looking for a fast concurrent priority queue.
The Problem
Target: a fast, concurrent priority queue.
Lots of work on the topic: [Sanders 97], [Lotan & Shavit 00], [Sundell & Tsigas 07], [Linden & Jonsson 13], [Lenhart et al. 14], [Wimmer et al. 14].
Current solutions are hard to scale: DeleteMin is highly contended. Everyone wants the same element!
Concurrent Solution: the SkipList [Pugh90]

    H  1  3  4  5  9  …  T    (head … tail)

• Linked list, sorted by priority
• Each node has a random "height" (geometrically distributed with parameter ½)
• Elements at the same height form their own lists
• Average time for Search, Insert, and Delete is logarithmic, and operations work concurrently [Pugh98, Fraser04]
Example: Search(5) descends from the head's top level: [H,9] → [1,9] → [5,9] → stop: found!
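A compact sequential sketch of a skip list (geometric heights plus top-down search; all concurrency control is omitted, so this only illustrates the structure, not the lock-free version):

```python
import random

class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[l] = successor at level l

class SkipList:
    """Sequential skip list sketch: random geometric heights, O(log n) search."""
    MAX_HEIGHT = 16

    def __init__(self):
        self.head = SkipNode(float('-inf'), self.MAX_HEIGHT)

    def _random_height(self):
        # Geometric with parameter 1/2: height h with probability 2^-h.
        h = 1
        while h < self.MAX_HEIGHT and random.random() < 0.5:
            h += 1
        return h

    def insert(self, key):
        # Record, at each level, the last node strictly before key.
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
            update[level] = node
        new = SkipNode(key, self._random_height())
        for level in range(len(new.next)):
            new.next[level] = update[level].next[level]
            update[level].next[level] = new

    def search(self, key):
        # Descend from the top level, moving right while keys are smaller.
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
        node = node.next[0]
        return node is not None and node.key == key

s = SkipList()
for k in [1, 3, 4, 5, 9]:
    s.insert(k)
print(s.search(5), s.search(2))  # True False
```

The concurrent versions [Pugh98, Fraser04] replace the pointer updates with CAS operations so that searches, inserts, and deletes can proceed in parallel.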
The SkipList as a PQ
[I. Lotan and N. Shavit. Skiplist-Based Concurrent Priority Queues. 2000.]
• DeleteMin: simply remove the smallest element from the bottom list
• All processors compete for the smallest element
• Does not scale!
The Idea: Relax!
• We want to choose an item at random with 'good' guarantees
• Minimize loss of exactness by only choosing items near the front of the list
• Minimize contention by keeping collision probability low
DeleteMin: The Spray [Alistarh, Kopinsky, Li, Shavit, PPoPP 2015]
procedure Spray():
• At each skip list level, flip a coin to stay or jump forward
• Repeat for each level from log n down to 1 (the bottom)
• As if removing a random-priority element near the head
(The slide shows two example spray walks from starting height 4.)
SprayList Probabilistic Guarantees
Spray and pray? Let p(x) = the probability that a spray returns the value at index x.
✓ The maximum value returned by a Spray has rank O(n log³n) — sprays aren't too wide
✓ For all x up to Õ(n), p(x) = Õ(1/n) — sprays don't cluster too much
✓ If x > y is returned by some Spray, then p(y) = Ω(1/n) — elements do not starve in the list
One Benchmark
• Discrete Event Simulation
• Exact algorithms have negative scaling after 8 threads
• SprayList is competitive with the random remover (no guarantees, incorrect execution)
In many practical settings (D.E.S., shortest paths), priority inversions are not expensive.
The MultiQueue [Rihani, Dementiev, Sanders, SPAA15]
• n lock-free or lock-based queues
• Insert: pick a random queue, lock it, and insert into it
• Remove: pick two queues at random, lock and remove the better element
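The scheme above can be sketched in a few lines; this single-threaded model uses sequential heaps and omits the per-queue locks (a sketch of the idea, not the concurrent implementation):

```python
import heapq
import random

class MultiQueue:
    """MultiQueue sketch: many heaps; insert into one at random,
    remove the better of two random tops."""
    def __init__(self, n_queues, rng=random):
        self.queues = [[] for _ in range(n_queues)]
        self.rng = rng

    def insert(self, key):
        # Pick a random queue and insert into it.
        heapq.heappush(self.rng.choice(self.queues), key)

    def delete_min(self):
        # Pick two queues at random; remove the smaller of their minima.
        if not any(self.queues):
            return None
        while True:
            a, b = self.rng.sample(self.queues, 2)
            if a or b:
                break
        if not a or (b and b[0] < a[0]):
            a = b
        return heapq.heappop(a)

rng = random.Random(0)
mq = MultiQueue(8, rng)
keys = rng.sample(range(1000), 100)
for k in keys:
    mq.insert(k)
out = [mq.delete_min() for _ in range(100)]
# Every element is removed exactly once; the order is only
# approximately sorted -- each removal is near the current minimum.
print(sorted(out) == sorted(keys))  # True
```

Because each operation touches only one or two random queues, contention is spread evenly instead of focusing on a single minimum.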
[Figure 3: Throughput (MOps/s, 0 to 80) of 50% insert / 50% deleteMin operations with uniformly distributed keys, at 0 to 56 threads: MultiQ c=2, MultiQ HT c=2, MultiQ c=4, Spraylist, Linden, Lotan.]
[Figure 4: Throughput (MOps/s, 0 to 80) of 50% insert / 50% deleteMin operations with monotonic keys, at 0 to 56 threads: MultiQ c=2, MultiQ HT c=2, MultiQ c=4, Spray.]
Looks good, but does it actually guarantee anything?
The Random Process
Example state: Q1 = {1, 6, 10, 13}, Q2 = {4, 7, 12, 16}, Q3 = {2, 3, 8, 15}, Q4 = {5, 9, 11, 14}. WLOG, elements are consecutive labels.
1. Insert elements u.a.r.
2. Remove using two choices
• Cost = rank of the element removed among the remaining elements, e.g. Cost(2) = 2, Cost(4) = 3, Cost(1) = 1. Intuitively, the distance from optimal.
We are interested in the average rank removed at each step.
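The random process above can be simulated directly (here all elements are inserted up front and then drained, a simplification of the interleaved process):

```python
import heapq
import random

def rank_cost_process(n_queues, n_elems, rng):
    """Simulate the random process: insert consecutive labels u.a.r.
    into n queues, then repeatedly remove the smaller of two random
    queues' minima; cost = rank of the removed element among those
    still remaining. Returns the average cost over all removals."""
    queues = [[] for _ in range(n_queues)]
    for label in range(n_elems):
        heapq.heappush(rng.choice(queues), label)
    remaining = list(range(n_elems))  # sorted list of live labels
    total_cost = 0
    for _ in range(n_elems):
        while True:
            a, b = rng.sample(queues, 2)
            if a or b:
                break
        if not a or (b and b[0] < a[0]):
            a = b
        elem = heapq.heappop(a)
        # Rank among remaining elements (1-based); a linear scan is
        # fine for a sketch, a real run would track ranks incrementally.
        total_cost += remaining.index(elem) + 1
        remaining.remove(elem)
    return total_cost / n_elems

rng = random.Random(0)
avg = rank_cost_process(n_queues=16, n_elems=2000, rng=rng)
# The theorem below predicts average cost O(n); here n = 16.
print(avg < 5 * 16)
```

The simulation's average cost stays a small multiple of the number of queues, matching the O(n) expectation bound.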
The Result
Theorem: Given n queues, for any t > 0, the cost at t is O(n) in expectation, and O(n log n) w.h.p.
• Strategy 1: reduction to the power-of-two-choices analysis? [Azar et al., SICOMP99]
  • Would apply if we could equate queue size with top label (round-robin insert)
  • The reduction does not hold in general, and in fact, experimentally, queue size and top priority appear to be uncorrelated.
• Strategy 2: some simple sort of induction
  • The initial cost distribution is nice; can we prove it always stays nice?
The Result
Theorem: For any t > 0, the cost at t is O(n) in expectation, and O(n log n) w.h.p.
Hard case: over time, we'll eventually get arbitrary distributions, e.g. one queue holding 1, 2, …, and the others holding only K, K+1, K+2, K+3. We have to prove that the algorithm gets out of those reasonably fast.
• Strategy 3: some "simple" complicated sort of induction / potential argument
  • Idea: characterize what's going on step-by-step
The Result
Theorem 1: For any t > 0, the cost at t is O(n) in expectation, and O(n log n) w.h.p.
In expectation, the increment is n.
Problem: the behavior at a step is highly correlated with what happened in previous steps.
Proof Strategy
Theorem 1: For any t > 0, the cost at t is O(n) in expectation, and O(n log n) w.h.p.
• Step 1: reduce to an uncorrelated exponential process
  • Prove that the rank distribution is preserved
• Step 2: characterize the exponential process
  • Characterize the average weight on top of the queues via a potential argument
• Step 3: characterize the rank distribution of the exponential process
  • Prove that the average rank is O(n)
Step 1: The Exponential Process
• Insert: pick a random queue; insert an exponentially distributed increment with expected value n into it
• Remove: pick two queues at random, remove the lower label
• Cost: the rank of the element removed (still)
Example state: Q1 = {1.8, 5.9, 10.2, 13.2}, Q2 = {4.7, 7.3, 12.5, 16.8}, Q3 = {2.2, 3.2, 8.3, 15.2}, Q4 = {5.1, 9.5, 11.7, 14.2}.
Theorem: The distribution of removed ranks is the same in the discrete process and in the exponential process.
Key fact: Pr[rank k is in queue j] = 1/n. This holds since the exponential distribution is memoryless.
Uses tools from [Peres, Talwar, Wieder, R.S.A. 14]
Step 2: Analyzing the Exponential Process
• Fix a removal step t. Let w_i(t) be the label (real value) on top of bin i.
• Let x_i(t) = w_i(t)/n (normalized weights), and μ(t) = Σ_{i=1}^{n} x_i(t)/n.
• Let Φ(t) = Σ_{i=1}^{n} exp(x_i(t) − μ(t)) and Ψ(t) = Σ_{i=1}^{n} exp(−(x_i(t) − μ(t))).
• No more correlations: since weight increments are independent of previous steps, we can bound the expected increase in potential at each step.
• Bad configurations: Φ(t) and Ψ(t) cannot both be large at the same time. If their sum breaks the O(n) barrier, then the large potential will decrease very fast.
• Φ(t) + Ψ(t) is then a super-martingale, which implies the bound:
Theorem: For any t > 0, E[Φ(t) + Ψ(t)] = O(n).
Step 3: What does all this have to do with ranks?
• Let B_{>s}(t) be the number of bins with weight > μ + s at time t.
• Let B_{<−s}(t) be the number of bins with weight < μ − s at time t.
Theorem: For any t > 0, E[B_{>s}(t)] = O(n·e^{−Θ(s)}) and E[B_{<−s}(t)] = O(n·e^{−Θ(s)}).
Weights become "rarefied" at ranks s higher and s lower than the mean value.
But on average, we'll choose something close to the mean value! So, we conclude:
Theorem: For any t > 0, the rank cost at t is O(n) in expectation.
The worst-case bound follows in a similar way.
Applications
What if we do two choices only a β fraction of the time (one choice otherwise)?
Theorem: For any t > 0, the cost at t is O(n · poly(1/β)) in expectation, and O(n log n · poly(1/β)) w.h.p.
What if the input distribution is biased? Still works (within reason).
Works well in practice.
We can use this for approximate queues, stacks, counters, timestamps.
Concurrent Data Structures
"The data structures of our childhood are changing." — Nir Shavit
A relaxation renaissance: [KarpZhang93], [DeoP92], [Sanders98], [HenzingerKPSS13], [NguyenLP13], [WimmerCVTT14], [LenhartNP15], [RihaniSD15], [JeffreySYES16]
Data structures such as the Spraylist and the MultiQueue merge both relaxed semantics and optimistic progress to achieve scalability.
The Last Slide
Theorem: Strongly ordered data structures won't scale. [Ellen, Hendler, Shavit, SICOMP 2013] [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
How can we scale them? Theory ↔ Software ↔ Hardware
How do we specify and prove relaxed data structures correct?
What new data structures are out there?
How do these data structures interact with existing applications?
Can we prove stronger lower bounds?
Workshop Announcement
• Theory & Practice in Concurrent Data Structures
• Co-located with DISC 2017 (Vienna)
• Overall goals
  • Fostering collaboration between practically-minded conferences (PPoPP, SOSP, etc.) and the PODC/DISC community
  • New challenges in concurrent data structure design
• Precise goals
  • Better benchmarks for concurrent data structures
  • Real applications and practical issues (e.g. memory management)
  • Usefulness of relaxed designs