StreamBox-HBM Stream Analytics on High Bandwidth Hybrid Memory Hongyu Miao, Purdue ECE; Myeongjae Jeon, UNIST; Gennady Pekhimenko, UToronto; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE http://xsel.rocks/p/streambox
Page 1: StreamBox-HBM · KNL: Intel Xeon Phi Knights Landing w/ HBM. 64 cores@1.3GHz. $5,000 x56: Intel Xeon E7-4830v4. 4x14 cores @2.0GHz. 256GB. $23,000 Benchmark: Yahoo Stream Benchmark.


Timely processing of streaming data

High throughput & low latency!

On 100+ GB of memory

Hybrid Memory: 3D Memory + DRAM

DRAM
• Larger capacity, but lower bandwidth

3D Memory
• Higher bandwidth, but smaller capacity
• NO latency benefit (unlike cache: SRAM + DRAM)
• Same as DRAM without high parallelism or sequential access
• As a cache of DRAM? → Poor performance…

[Figure: cores attached to 3D memory (16 GB, 375 GB/s) and DRAM (100+ GB, 80 GB/s)]
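To make the bandwidth gap concrete, the following back-of-the-envelope sketch computes the ideal time for one sequential pass over a buffer at the two peak bandwidths quoted on the slide. The function name `scan_seconds` and the 16 GB buffer size are illustrative assumptions, and sustained bandwidth on real hardware is lower than these peak figures.

```python
# Peak bandwidths as quoted on the slide (GB/s); sustained rates are lower.
DRAM_BW_GBPS = 80.0
HBM_BW_GBPS = 375.0


def scan_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Ideal time for one sequential pass over size_gb of data."""
    return size_gb / bandwidth_gbps


if __name__ == "__main__":
    size_gb = 16.0  # hypothetical buffer that just fills the 3D memory
    print(f"DRAM:      {scan_seconds(size_gb, DRAM_BW_GBPS):.3f} s")
    print(f"3D memory: {scan_seconds(size_gb, HBM_BW_GBPS):.3f} s")
```

The ratio only materializes when enough cores issue sequential or highly parallel accesses to saturate the 3D memory, which is the point of the "no latency benefit" bullet.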

Can hybrid memory speed up stream analytics? Yes! StreamBox-HBM

• The first stream engine optimized for 3D memory + DRAM on real hardware
• Achieves the best reported throughput on a single node (win-avg: 110 M rec/s)
• Speeds up stream analytics by 7x

[Figure: TopK Per Key throughput (M rec/s, 0-35) vs. #cores (0-60); series: "3D+DRAM in-mem-index" and "3D as cache full-records"; annotation: 7x speedup]

Challenges

1. Hash grouping performs poorly on 3D memory

2. 3D memory is capacity limited

3. How to dynamically map streaming data to hybrid memory?

Challenge 1: Hash grouping performs poorly on 3D memory

• Operators: computations that consume/produce streams
• Pipeline: a graph of streaming operators

• Data grouping
  • A set of very common and expensive operators that reorganize records
  • Hash with random access in existing engines → performs poorly on 3D memory…

[Figure: example pipeline with operators Ingestion, Window (10:00-10:05), Group by key, Average per key, Top key; example record: Time 10:01, ID 0x1024, Value 200; the Group by key step is labeled "Grouping"]
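The example pipeline on this slide (window, group by key, average per key, top key) can be sketched in a few lines. This is a minimal illustration, not StreamBox-HBM code: the function names, the `(ts_sec, key, value)` record layout, and the 300-second window width are assumptions. It groups with a hash table, which is the conventional approach that the slide says performs poorly on 3D memory.

```python
from collections import defaultdict


def window_of(ts_sec: int, width_sec: int = 300) -> int:
    """Start time of the fixed window that contains ts_sec."""
    return ts_sec - ts_sec % width_sec


def top_key_per_window(records, width_sec: int = 300):
    """records: iterable of (ts_sec, key, value).

    Returns {window_start: (key, average)} for the key with the highest
    average value in each window.
    """
    # Grouping step: hash on (window, key). Access to `groups` is random.
    groups = defaultdict(list)
    for ts, key, value in records:
        groups[(window_of(ts, width_sec), key)].append(value)

    # Average per key, then keep the top key of each window.
    best = {}
    for (win, key), values in groups.items():
        avg = sum(values) / len(values)
        if win not in best or avg > best[win][1]:
            best[win] = (key, avg)
    return best
```

Grouping is the only step here that reorganizes records, which is why its memory access pattern dominates the cost of the pipeline.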

Challenge 2: 3D memory is capacity limited

• Streaming data
  • High data volume (100+ GB)
• 3D Memory
  • Capacity limited (~16 GB)
• 3D memory is NOT large enough to hold all streaming data…

[Figure: cores attached to 16 GB of 3D memory; the streaming data cannot fit]

Challenge 3: managing two types of memory

• How to dynamically map data/operators to two types of memory?

What to map?
• Unbounded data
• Various queries

Where to map?
• Hybrid memory: benefit & limitation

[Figure: the same example pipeline (Ingestion, Window 10:00-10:05, Group by key, Average per key, Top key)]

StreamBox-HBM Solutions

1. Hash grouping performs poorly on 3D memory
• → Solution 1: Use highly parallel Sort for grouping

2. 3D memory is capacity limited
• → Solution 2: Only use 3D memory to store in-memory indexes

3. How to manage two types of memory?
• → Solution 3: Balance two limited resources with a single knob

Solution 1: Parallel Sort for Grouping

Known duals of Grouping: Hash vs. Sort
• DRAM: Hash is the best [VLDB’09, VLDB’13, SIGMOD’15]
• Contribution: 3D memory reverses the debate. Sort outperforms Hash.

Sort is worse than Hash on algorithmic complexity
• O(N log N) vs. O(N)

Yet, Sort outperforms Hash after we exploit all:
• Abundant memory bandwidth
• High task parallelism
• Wide SIMD (avx512)

[VLDB’09] Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs.
[VLDB’13] Multi-core, main-memory joins: Sort vs. hash revisited.
[SIGMOD’15] Rethinking SIMD vectorization for in-memory databases.

Solution 1: Parallel Sort for Grouping

[Figure, built up over five slides: two panels plotted against #cores (0-60). Left panel "Throughput": million pairs/sec (0-180). Right panel "Mem bandwidth": GB/sec (0-300). Series in both panels: Hash DRAM, Hash 3D mem, Sort DRAM, Sort 3D mem.]

Sort outperforms Hash on 3D memory
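The duality between hash and sort grouping can be seen in a small sketch. Both functions below produce the same groups; the hash version scatters writes across a table (random access), while the sort version orders the pairs first and then reads each group as one contiguous run (sequential access). The function names are illustrative assumptions, and Python's built-in `sorted` merely stands in for the parallel, avx512-based merge sort that the slides describe; the sketch does not reproduce its performance.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_by_hash(pairs):
    """Group (key, value) pairs with a hash table: one random access per pair."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def group_by_sort(pairs):
    """Group (key, value) pairs by sorting on key, then scanning runs in order."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable: keeps value order per key
    return {
        key: [value for _, value in run]
        for key, run in groupby(ordered, key=itemgetter(0))
    }
```

Sorting costs O(N log N) against O(N) for hashing, so the sort version only wins when the hardware rewards its sequential, parallel-friendly access pattern, as the slides report for 3D memory.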

Solution 2: Only use 3D memory for in-memory index

Streaming data
• Full records <key, key1, v1, v2, v3…>
• Index <key, pointer>

Minimize the use of precious 3D memory capacity while exploiting its high bandwidth

[Figure: cores attached to 3D memory (16 GB, 375 GB/s) and DRAM (96 GB, 80 GB/s); figure labels: Smaller, Faster, More efficient, K, Swapping]
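The index idea can be sketched as follows: rather than moving full records, extract compact <key, pointer> pairs, reorganize only those, and follow the pointers back to the full records when values are needed. In this sketch a Python list stands in for the record storage and a list position stands in for the pointer; the function names are hypothetical and placement in 3D memory versus DRAM is not modeled.

```python
def build_index(records, key_field: int = 0):
    """Extract (key, pointer) pairs; the pointer is the record's position."""
    return [(rec[key_field], ptr) for ptr, rec in enumerate(records)]


def sort_index(index):
    """Sort the compact index by key (ties broken by pointer); records stay put."""
    return sorted(index)


def gather(records, index):
    """Dereference the pointers to read full records in index order."""
    return [records[ptr] for _, ptr in index]
```

Only the small index entries are touched by the bandwidth-hungry sort, so the wide records never need to fit in the capacity-limited memory.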

Solution 3: balance two limited resources

[Figure, animated over eight slides: cores attached to 3D memory (16 GB) and DRAM (80 GB/s), with two gauges: "3D memory capacity" and "DRAM bandwidth".]

• High pressure on 3D memory capacity → indexes on DRAM → pressure rebalanced
• High pressure on DRAM bandwidth → more indexes on 3D memory → pressure rebalanced
• High pressure on both… → reach hardware limit → limit data ingestion (backpressure)
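The three cases above can be expressed as a small control step. This is a sketch under stated assumptions, not the StreamBox-HBM policy: the slides do not define the knob, so here it is taken to be the percentage of newly created indexes placed in 3D memory, and the function name, the 90% threshold, and the 10-point step are all hypothetical.

```python
def adjust_knob(knob_pct: int, hbm_capacity_pct: int, dram_bw_pct: int,
                high_pct: int = 90, step_pct: int = 10):
    """One control step.

    knob_pct: share of new indexes placed in 3D memory (0-100, assumed meaning).
    hbm_capacity_pct / dram_bw_pct: current utilization of each resource.
    Returns (new_knob_pct, throttle_ingestion).
    """
    hbm_high = hbm_capacity_pct >= high_pct
    dram_high = dram_bw_pct >= high_pct

    if hbm_high and dram_high:
        # Both resources saturated: hardware limit reached, apply backpressure.
        return knob_pct, True
    if hbm_high:
        # 3D memory is filling up: place more indexes on DRAM.
        return max(0, knob_pct - step_pct), False
    if dram_high:
        # DRAM bandwidth is saturated: place more indexes on 3D memory.
        return min(100, knob_pct + step_pct), False
    return knob_pct, False
```

A single knob suffices because the two pressures pull in opposite directions; only when both are high does the engine have to slow ingestion.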

Other optimizations

• Customized memory allocator
• Customized task scheduler for high pipeline and data parallelism
• Highly parallel merge-sort kernels using avx-512
• Dynamically handle key changes
• Parallel aggregation
• Co-design RDMA ingestion with memory management and task scheduling
• Task parallelism to utilize all CPU cores
• …

StreamBox-HBM Implementation

• Based on our prior work StreamBox [USENIX ATC’17]
• Implemented on real hardware (Intel KNL) with an RDMA network
• 61K lines of C++11, of which 38K lines are new
• Open source: http://xsel.rocks/p/streambox

Hardware: Ninja Developer Platform (KNL), 64 cores@1.3GHz; Mellanox ConnectX-2, 40 Gb/s

[USENIX ATC’17] StreamBox: Modern Stream Processing on a Multicore Machine. Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin. In Proc. USENIX Annual Technical Conference, 2017.

Evaluation

• Comparing to a widely used stream analytics engine
• Validating our key system designs

StreamBox-HBM is 10x faster than Flink

[Figure: throughput (M rec/s, 0-60) vs. #cores (2-58); series: Flink@x56, Flink@KNL, Ours@KNL; annotations: "RDMA ingestion limit", "5-10x"]

KNL: Intel Xeon Phi Knights Landing w/ HBM. 64 cores@1.3GHz. $5,000
x56: Intel Xeon E7-4830v4. 4x14 cores@2.0GHz. 256GB. $23,000

Benchmark: Yahoo Stream Benchmark. Output delay: 1 second

[Figure, built up over five slides: TopK Per Key throughput (M rec/s, 0-35) vs. #cores (0-60). Each slide title below corresponds to one step of the build.]

• Poor performance without any key designs (series: 3D as cache, full-records)
• In-mem-index performs better than full-record (adds series: 3D as cache, in-mem-index)
• 3D memory boosts performance (adds series: DRAM only, in-mem-index)
• SW manages hybrid memory better than HW (adds series: 3D+DRAM, in-mem-index)
• Performance improves with all key system designs

StreamBox-HBM: the first stream engine optimized for 3D Memory + DRAM on real hardware

1. Grouping with Sort
• Hash → Sort
• Abundant memory, high parallelism, wide SIMD (avx512), sequential access

2. In-memory index in 3D Memory
• Minimize use of capacity
• Exploit high bandwidth

3. Manage hybrid memory
• Balance limited resources: DRAM bandwidth and 3D memory capacity

http://xsel.rocks/p/streambox

Lessons on exploiting 3D memory + DRAM

[Figure: system stack with layers Apps; Runtime (thread pool + custom task scheduler); OS kernel; RDMA network; Hybrid Memory]

• Cheap VM (huge page)
• RDMA network: bypass kernel, free CPU
• High task parallelism
• Custom memory allocator
• Sequential memory access
• Wide SIMD (avx512)
• Packed data structure

