© 2015 60East Technologies, Inc.
High Performance on Wall Street 2016
Challenges of HPC Code Writing – Capacity, Performance, Speed and Cost
www.crankuptheamps.com
Jeffrey M. Birnbaum, CEO and Co-Founder of 60East Technologies, Inc.
Makers of AMPS: The Advanced Messaging Processing System
Premise
By considering how our AMPS Technology outperforms popular systems, we see how much those systems leave on the table in terms of H/W resources.
• NoSQL: 30X Performance over MongoDB on Ingestion and Queries
• Queue: Over 25X the Throughput of RabbitMQ at up to 60X lower latency
• Reducing Footprint from 11 machines to 1
• 4X More Throughput than a Pub/Sub System with 0 message loss tolerance
• 2.5X More Durable Throughput than a Hardware Messaging Appliance
For more context, please see the following blog articles:
http://www.crankuptheamps.com//blog/posts/2016/02/26/rabbitmq-comparison-to-amps/
http://www.crankuptheamps.com//blog/posts/2015/07/22/reality-check-pure-software-beats-hardware/
http://www.crankuptheamps.com//blog/posts/2014/09/24/ultimate-shock-absorber/
Performance Drivers
• Traditional Drivers for HPC
– Critical to Market Making and other low latency markets
– Handle Larger and Larger Data Volumes and Load
– Handle Peak Loads by leveraging the full extent of Machine resources
• Why is HPC now needed everywhere?
– Significantly Reduce H/W Footprint Costs by scaling out less
– Lowering Latency enables one to add value-added capabilities that were previously cost prohibitive (e.g. streaming analytics for HPC, or content filtering)
Truth or Myth: “No More Free Lunch”?
See Scale Up and Out: http://www.crankuptheamps.com/downloads/documentation/CXO-Insight-Mar-2015.pdf
But, We are Leaving Too Much on the Table…
Twenty years ago, apps were designed for slow disks, slow networks and a single thread. It seems like they still are.
Architectural Patterns still assume that multi-threading is too hard, that disks and networks are too slow, and that scaling out and out is the norm. We are learning that big data and real-time problems require better thinking: http://www.crankuptheamps.com//blog/posts/2015/10/01/nba-of-data-science/
Let’s Review Some Important #s
http://static.googleusercontent.com/media/research.google.com/en//people/jeff/stanford-295-talk.pdf
Evolving Elements of HPC
• CPUs – Cores from 8 to 24
• Storage – march towards SSD, NVMe, XPoint
• Network – around $3000 per port for 100Gb (e.g. Arista 100Tb switch)
• Memory – 512GB is common; Larger cache sizes
[Diagram labels: CPU, Network, Memory, Storage]
CPUs, Cores and Concurrency
• Embracing Multi-core
– Multi-Threaded
• Beyond the 80-20 Rule
– Tune Everything
• Lock-free Data Structures
– Generation Count
• NUMA tuning
– Most DB products say turn NUMA off. We keep it on because we did the hard work.
• Throughput:
– Message Pipelines
– Divide and Conquer
http://www.extremetech.com/computing/188911-intel-haswell-e-review-the-best-consumer-performance-chip-you-can-buy-with-some-caveats/2
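The slide names "Generation Count" without saying how AMPS uses it. One common reading is a seqlock-style version counter: a single writer bumps the counter before and after each update, and readers retry whenever they see an odd or changed counter instead of taking a lock. The sketch below is an assumption about that general pattern, not AMPS code; `GenerationCell` and its fields are hypothetical names. Python is used only for brevity, and a real implementation in C or C++ would need atomic operations with the right memory ordering.

```python
class GenerationCell:
    """Two related fields published by a single writer, read without a lock."""

    def __init__(self, a, b):
        self.gen = 0      # even: stable, odd: a write is in progress
        self.a = a
        self.b = b

    def write(self, a, b):
        # Single-writer assumption: only one thread ever calls write().
        self.gen += 1     # odd: tells readers the fields are in flux
        self.a = a
        self.b = b
        self.gen += 1     # even: new stable generation

    def read(self):
        while True:
            before = self.gen
            if before % 2 == 1:
                continue  # writer is mid-update, try again
            a, b = self.a, self.b
            if self.gen == before:
                return a, b   # no write overlapped this read
            # otherwise the generation moved under us: retry
```

Readers never block the writer; the cost of contention is a retried read rather than a held lock.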
Leaving Things on the Table
http://www.crankuptheamps.com//blog/posts/2016/02/26/rabbitmq-comparison-to-amps/
One is not taking advantage of the resources. The AMPS Queue store-and-forward model is impacted by larger message sizes written to persistent disk. The bottleneck should be due to the H/W limits.
Leaving Things on the Table
http://www.crankuptheamps.com/downloads/documentation/hpcws-apr2013-v4.pdf
Best Practices 1
Focus on Forward Scalability: the more concurrency a design can realize, the more it gains when the next chip comes with a jump from 14 to 28 cores on a single die.
Flip Side: Attention to Detail. Single threaded can be less bad than poorly or excessively implemented locking. Systems that take that route (e.g. Redis, VoltDB) are only scaling via partitioning the data.
Best Practices 2
• Many Approaches:
– ScyllaDB/Cassandra are sending work across nodes constantly without NUMA-ness. (A Dispatch Model that partitions per core)
– Just got to do it the right way: every core is assigned a processing engine, with a wide-scale dispatch (e.g. 48 cores = 48 workers, plus a dispatch worker).
• The obvious thing isn’t always the best thing in a highly concurrent context. A “B-Tree” is common for DB index schemes, but it’s not the best data structure for highly concurrent activities. (30X performance gain over MongoDB in Ingestion + Querying.)
• e.g. AMPS doesn’t have a single model: the model for I/O is not the same as the model for querying (i.e. parallel divide and conquer). We partition what is important across the machine.
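The "one worker per core plus a dispatcher" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not how AMPS is built: `run_partitioned` is a hypothetical name, the calling thread plays the dispatch worker, and keys are routed with a CRC32 hash so that each key always lands on the same worker. Because each worker owns its partition, the per-key state needs no locks. Python threads do not run in parallel under the GIL, so the sketch shows only the structure; a native implementation would also pin each worker to a core.

```python
import os
import queue
import threading
import zlib


def run_partitioned(messages, num_workers=None):
    """Sum values per key, with each key owned by exactly one worker."""
    num_workers = num_workers or os.cpu_count() or 1
    queues = [queue.Queue() for _ in range(num_workers)]
    states = [{} for _ in range(num_workers)]  # worker-private, no locking

    def worker(idx):
        inbox, state = queues[idx], states[idx]
        while True:
            item = inbox.get()
            if item is None:          # sentinel: dispatcher has finished
                return
            key, value = item
            state[key] = state.get(key, 0) + value

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()

    # Dispatch: a stable hash picks the owning worker for every key.
    for key, value in messages:
        idx = zlib.crc32(key.encode("utf-8")) % num_workers
        queues[idx].put((key, value))
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()

    merged = {}
    for state in states:
        merged.update(state)          # partitions are disjoint by construction
    return merged
```

For example, `run_partitioned([("a", 1), ("b", 2), ("a", 3)], num_workers=4)` returns the per-key totals for "a" and "b" regardless of which workers handled them.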
Storage
• How would you write your system differently if you knew there would be a 20X to 40X improvement in storage I/O?
• Disk – Fine for Log Appending (low disk head movement/seek)
• SSD – Memory Mapped Files, Key-Value Stores
• Promise of XPoint
80ms, throughput limited: once you hit the memory limit, it is stuck at 80ms due to back pressure
* What if that goes down to 8ms?
http://www.crankuptheamps.com//blog/posts/2014/12/08/extreme_storage_performance/
http://www.crankuptheamps.com//blog/posts/2014/05/01/amps-faster-than-ever-with-memory-channel-storage/
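The two access styles named above, sequential log appending and memory-mapped reads, can be combined in a few lines. This is a sketch under assumptions: the record format (a 4-byte little-endian length prefix) and the function names are hypothetical and are not the AMPS on-disk format.

```python
import mmap
import os
import struct

# Assumed record layout: 4-byte little-endian length, then the payload.
HEADER = struct.Struct("<I")


def append_records(path, records):
    """Append length-prefixed records; writes are strictly sequential."""
    with open(path, "ab") as f:
        for payload in records:
            f.write(HEADER.pack(len(payload)))
            f.write(payload)
        f.flush()
        os.fsync(f.fileno())   # force the batch to stable storage once


def read_records(path):
    """Scan the log through a read-only memory map."""
    out = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            offset = 0
            while offset < len(m):
                (size,) = HEADER.unpack_from(m, offset)
                offset += HEADER.size
                out.append(bytes(m[offset:offset + size]))
                offset += size
    return out
```

Appending keeps the write path seek-free, which is why it suits spinning disks; the mapped read path leaves caching to the operating system, which tends to pay off on faster devices. Syncing once per batch rather than once per record is a design choice that trades a little durability latency for throughput.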
Not Leaving Things on the Table
Invest in a better storage device, and the software should reward you…
Networking Advancements
How would you write your system differently if you knew there would be a 10X improvement in network I/O?
[Chart: AMPS vs AMPS with Mellanox VMA, Subscriber Rate in msgs/sec, E5-2690 v3 @ 2.60GHz over 40Gb network and 10M 512 byte msgs. Y axis: msgs/sec, 0 to 1,000,000. Categories: 1P1S, 2P4S. Series: AMPS, AMPS_VMA. Data labels as extracted: 548K, 550K, 927K, 761K.]
These are preliminary #s – we haven’t optimized it; this was a simple LD_PRELOAD.
Memory
• Know your cache lines – and keep things in cache whenever possible
• Keep your structures and access patterns aligned (partition read/write sections)
• Minimize Heap Memory Allocation
• Leverage Modern Allocators
• Avoid Thread Bleeding
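"Minimize Heap Memory Allocation" often comes down to allocating buffers once and recycling them on the hot path. The sketch below assumes a fixed-size pool owned by a single thread; `BufferPool` and `copy_message` are hypothetical names for illustration, and the deck does not describe the allocator strategy AMPS uses.

```python
class BufferPool:
    """Preallocated buffers that are handed out and returned, never freed."""

    def __init__(self, count, size):
        self.size = size
        self._free = [bytearray(size) for _ in range(count)]

    def acquire(self):
        if not self._free:
            # A real system might block or apply back pressure here instead.
            raise RuntimeError("buffer pool exhausted")
        return self._free.pop()

    def release(self, buf):
        self._free.append(buf)


def copy_message(pool, payload):
    """Copy payload into a pooled buffer; returns (buffer, used_length)."""
    if len(payload) > pool.size:
        raise ValueError("payload larger than pool buffer size")
    buf = pool.acquire()
    memoryview(buf)[:len(payload)] = payload   # in-place copy, no new buffer
    return buf, len(payload)
```

Giving each worker thread its own pool keeps buffers from migrating between threads, which is one way to keep a thread's working set local to its core.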
Wash, Rinse, Repeat
• Scale Up and then Out
• Scale Forward (“enjoy the free lunch”) and plan for advancements (e.g. 40X XPoint)
• Forget about 80-20; Optimize every last part of your code base
• Leverage Many Models/Approaches; Continually Improve