StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory
Hongyu Miao, Purdue ECE; Myeongjae Jeon, UNIST; Gennady Pekhimenko, UToronto; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE
http://xsel.rocks/p/streambox
Timely processing of streaming data
High Throughput & Low Latency!
On 100+ GB memory
Hybrid Memory: 3D Memory + DRAM
DRAM (100+ GB, 80 GB/s)
• Larger capacity, but lower bandwidth
3D Memory (16 GB, 375 GB/s)
• Higher bandwidth, but smaller capacity
• NO latency benefit (unlike cache: SRAM + DRAM)
• Same as DRAM without high parallelism or sequential access
• As cache of DRAM? → Poor performance…
Can hybrid memory speed up stream analytics? Yes! StreamBox-HBM
• The first stream engine optimized for 3D memory + DRAM on real hardware
• Achieves the best reported throughput on a single node (win-avg: 110 M Rec/s)
• Speeds up stream analytics by 7x
[Figure: TopK Per Key. Throughput (M rec/s) vs. #cores. Series: 3D+DRAM in-mem-index; 3D as cache full-records. Annotation: 7x speedup.]
Challenges
1. Hash Grouping performs poorly on 3D memory
2. 3D memory is capacity limited
3. How to dynamically map streaming data to hybrid memory?
Challenge 1: Hash Grouping performs poorly on 3D memory
• Operators: computations that consume/produce streams
• Pipeline: a graph of streaming operators
• Data Grouping
  • A set of very common and expensive operators that reorganize records
  • Hash with random access in existing engines → Performs poorly on 3D memory…
[Figure: example pipeline with stages Ingestion, Window (10:00-10:05), Group by key, Average per key, Top Key. Example record: Time 10:01, ID: 0x1024, Value: 200. The Grouping stage is highlighted.]
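The example pipeline above can be sketched in a few lines. This is a minimal, single-threaded illustration in Python, not StreamBox-HBM code (the engine itself is written in C++11). The names `WINDOW_SECONDS`, `window_start`, and `top_key_per_window` are hypothetical, and the sketch assumes that "Top Key" means the key with the highest average value in each window.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # assumed fixed 5-minute windows, as in the 10:00-10:05 example


def window_start(timestamp):
    """Return the start (in seconds) of the fixed window that contains timestamp."""
    return timestamp - timestamp % WINDOW_SECONDS


def top_key_per_window(records):
    """records: iterable of (timestamp_seconds, key, value) tuples.

    Returns {window_start: (key, average)} holding, for each window,
    the key with the highest average value (an assumed reading of "Top Key").
    """
    # Window + Group by key: accumulate [sum, count] per (window, key).
    sums = defaultdict(lambda: [0, 0])
    for ts, key, value in records:
        acc = sums[(window_start(ts), key)]
        acc[0] += value
        acc[1] += 1

    # Average per key, then keep the top key of each window.
    best = {}
    for (win, key), (total, count) in sums.items():
        avg = total / count
        if win not in best or avg > best[win][1]:
            best[win] = (key, avg)
    return best
```

The grouping step here uses a hash table (`defaultdict`), which is the random-access pattern the slide identifies as the problem on 3D memory.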
Challenge 2: 3D memory is capacity limited
• Streaming data
  • High data volume (100+ GB)
• 3D Memory
  • Capacity limited (~16 GB)
• 3D memory is NOT large enough to hold all streaming data…
[Figure: cores attached to 16 GB of 3D memory; the streaming data cannot fit.]
Challenge 3: managing two types of memory
• How to dynamically map data/operators to two types of memory?
• What to map? Where to map?
  • Unbounded data
  • Various queries
  • Hybrid memory: benefit & limitation
[Figure: the example pipeline with stages Ingestion, Window (10:00-10:05), Group by key, Average per key, Top Key.]
StreamBox-HBM Solutions
1. Hash grouping performs poorly on 3D memory
• → Solution 1: Use highly parallel Sort for grouping
2. 3D memory is capacity limited
• → Solution 2: Only use 3D memory to store in-memory indexes
3. How to manage two types of memory?
• → Solution 3: Balance two limited resources with a single knob
Solution 1: Parallel Sort for Grouping
Known duals of Grouping: Hash vs. Sort
• DRAM: Hash is the best [VLDB’09, VLDB’13, SIGMOD’15]
• Contribution: 3D memory reverses the debate. Sort outperforms Hash.
Sort is worse than Hash on algorithmic complexity
• O(N log N) vs. O(N)
Yet, Sort outperforms Hash after we exploit all:
• Abundant memory bandwidth
• High task parallelism
• Wide SIMD (avx512)
[VLDB’09] Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs.
[VLDB’13] Multi-core, main-memory joins: Sort vs. hash revisited.
[SIGMOD’15] Rethinking SIMD vectorization for in-memory databases.
[Figure: two panels plotted against #cores: Throughput (million pairs/sec) and Mem bandwidth (GB/sec). Series: Hash DRAM, Hash 3D mem, Sort DRAM, Sort 3D mem.]
Sort outperforms Hash on 3D memory
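To make the Hash vs. Sort duality concrete, here is a minimal Python sketch of both grouping strategies. It only illustrates that the two produce the same groups while touching memory differently (scattered hash-table updates vs. a sort followed by one sequential scan). It says nothing about speed: the advantage reported on the slides depends on 3D memory bandwidth, task parallelism, and AVX-512, none of which this sketch models. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_by_hash(pairs):
    """Group (key, value) pairs with a hash table: O(N) work, random access."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # each update may land anywhere in the table
    return dict(groups)


def group_by_sort(pairs):
    """Group (key, value) pairs by sorting on key: O(N log N) work, sequential scan."""
    # sorted() is stable, so values keep their arrival order within each key.
    ordered = sorted(pairs, key=itemgetter(0))
    # After sorting, equal keys are adjacent and one linear pass finds each run.
    return {
        key: [value for _, value in run]
        for key, run in groupby(ordered, key=itemgetter(0))
    }
```

Both functions return a dict from key to the list of its values, so they can be swapped for each other in a pipeline.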
Solution 2: Only use 3D memory for in-memory index
Streaming data
• Full Records <key, key1, v1, v2, v3…>
• Index <key, pointer>
[Figure: cores attached to 3D Memory (16 GB, 375 GB/s) and DRAM (96 GB, 80 GB/s).]
Minimize the use of precious 3D memory capacity while exploiting its high bandwidth
• Smaller, faster
• More efficient KSwapping
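The <key, pointer> idea can be sketched as follows. This is a hedged Python illustration, not the engine's data structure: the "pointer" is just a list position, Python has no notion of placing data in 3D memory vs. DRAM, and the names `extract_index`, `sort_index`, and `materialize` are hypothetical. It shows only that grouping can work on small index entries and touch full records once at the end.

```python
def extract_index(records, key_field):
    """Build a compact <key, pointer> index; the pointer is a position in records."""
    return [(rec[key_field], pos) for pos, rec in enumerate(records)]


def sort_index(index):
    """Sort only the small index entries; the full records are never moved."""
    return sorted(index)


def materialize(records, index):
    """Follow each pointer once to emit the full records in index order."""
    return [records[pos] for _, pos in index]
```

For example, with `records = [{"key": 30, "v1": "a"}, {"key": 10, "v1": "b"}]`, calling `materialize(records, sort_index(extract_index(records, "key")))` returns the records ordered by key while the sort itself only handled two-field tuples.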
Solution 3: balance two limited resources
[Figure: cores attached to 3D Memory (16 GB) and DRAM (80 GB/s), with a single knob between two limited resources: DRAM Bandwidth and 3D memory Capacity.]
• High pressure on 3D memory capacity → indexes on DRAM → pressure rebalanced
• High pressure on DRAM bandwidth → more indexes on 3D memory → pressure rebalanced
• High pressure on both… → reach hardware limit → limit data ingestion (backpressure)
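The three cases above can be read as a small control rule. The sketch below is one possible interpretation in Python, not the policy StreamBox-HBM implements: the knob is modeled as the percentage of new indexes placed on 3D memory, and the threshold `HIGH`, the step `STEP_PERCENT`, and the function `adjust_knob` are all assumptions made for illustration.

```python
HIGH = 0.9          # assumed: a resource is "under high pressure" at 90% utilization
STEP_PERCENT = 10   # assumed: how far one adjustment moves the knob


def adjust_knob(knob_percent, hbm_capacity_used, dram_bandwidth_used):
    """Return (new_knob_percent, backpressure).

    knob_percent: share (0..100) of new indexes placed on 3D memory.
    hbm_capacity_used: fraction (0..1) of 3D memory capacity in use.
    dram_bandwidth_used: fraction (0..1) of DRAM bandwidth in use.
    """
    hbm_high = hbm_capacity_used >= HIGH
    dram_high = dram_bandwidth_used >= HIGH
    if hbm_high and dram_high:
        # Both resources are at the hardware limit: keep the knob, slow ingestion.
        return knob_percent, True
    if hbm_high:
        # 3D memory is nearly full: place more indexes on DRAM.
        return max(0, knob_percent - STEP_PERCENT), False
    if dram_high:
        # DRAM bandwidth is saturated: place more indexes on 3D memory.
        return min(100, knob_percent + STEP_PERCENT), False
    return knob_percent, False
```

A runtime would call such a function periodically with fresh measurements; how the measurements are taken and how often is outside this sketch.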
Other optimizations
• Customized memory allocator
• Customized task scheduler for high pipeline and data parallelism
• Highly parallel merge-sort kernels using avx-512
• Dynamically handle key changes
• Parallel aggregation
• Co-design RDMA ingestion with memory management and task scheduling
• Task parallelism to utilize all CPU cores
• …
StreamBox-HBM Implementation
• Based on our prior work StreamBox [USENIX ATC’17]
• Implemented on real hardware (Intel KNL) with RDMA network
• 61K lines of C++11, of which 38K lines are new
• Open source: http://xsel.rocks/p/streambox
[USENIX ATC’17] StreamBox: Modern Stream Processing on a Multicore Machine, Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin, in Proc. USENIX Annual Technical Conference, 2017.
[Figure: Ninja Developer Platform (KNL) with Mellanox ConnectX-2, 40 Gb/s.]
Evaluation
• Comparing to a widely used stream analytics engine
• Validating our key system designs
StreamBox-HBM is 10x faster than Flink
[Figure: Throughput (M Rec/s) vs. #Cores. Series: Flink@x56, Flink@KNL, Ours@KNL. Annotations: RDMA ingestion limit; 5-10x.]
KNL: Intel Xeon Phi Knights Landing w/ [email protected]. $5,000
x56: [email protected]. $23,000
Benchmark: Yahoo Stream Benchmark. Output delay: 1 second
Poor performance without any key designs
[Figure: TopK Per Key. Throughput (M rec/s) vs. #cores. Series: 3D as cache full-records.]
In-mem-index performs better than full-record
[Figure: TopK Per Key, using in-mem index. Throughput (M rec/s) vs. #cores. Series: 3D as cache in-mem-index; 3D as cache full-records.]
3D memory boosts performance
[Figure: TopK Per Key, using 3D memory. Throughput (M rec/s) vs. #cores. Series: 3D as cache in-mem-index; DRAM only in-mem-index; 3D as cache full-records.]
SW manages hybrid memory better than HW
[Figure: TopK Per Key, SW manages hybrid memory. Throughput (M rec/s) vs. #cores. Series: 3D+DRAM in-mem-index; 3D as cache in-mem-index; DRAM only in-mem-index; 3D as cache full-records.]
Performance improves with all system designs
[Figure: TopK Per Key, using all key system designs. Throughput (M rec/s) vs. #cores. Series: 3D+DRAM in-mem-index; 3D as cache in-mem-index; DRAM only in-mem-index; 3D as cache full-records.]
The first stream engine optimized for 3D Memory + DRAM on real hardware
1. Grouping with Sort
• Hash → Sort
• Abundant memory
• High parallelism
• Wide SIMD (avx512)
• Sequential access
2. In-memory index in 3D Memory
• Minimize use of capacity
• Exploit high bandwidth
3. Manage hybrid memory
• Balance limited resources: DRAM Bandwidth and 3D memory Capacity
StreamBox-HBM
http://xsel.rocks/p/streambox
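Lessons on exploiting 3D memory + DRAM
[Figure: system stack of Apps, Runtime, OS kernel, RDMA network, and Hybrid Memory, annotated with the lessons below.]
• Cheap VM (huge page)
• RDMA network: bypass kernel, free CPU
• High task parallelism
• Custom mem allocator
• Sequential mem access
• Runtime: thread pool + custom task scheduler
• Wide SIMD (avx512)
• Packed data structure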