
Energy Efficient Data Stream Processing on Ultra-Low-Power Embedded Multicore Devices

Ivan Walulya
Chalmers University of Technology

This project is part of the portfolio of the A.3 – Advanced Computing and Complex System Unit, Communications Networks, Content and Technology DG, European Commission

www.excess-project.eu
Copyright © 2013 - 2016 The EXCESS Consortium

Contract Number: 611183
Total Cost [€]: 3.31 million
Starting Date: 2013-09-01
Duration: 36 months

New World Order

I. Walulya @ CTH 2

• Traditional DBMS: data stored in finite, persistent data sets
• Data is continuously growing faster than our ability to store or index it
• Data streams: distributed, continuous, unbounded, rapid, time-varying, noisy, ...
• Data-stream sources:
  • Network monitoring and traffic engineering
  • Sensor networks
  • Telecom call-detail records
  • Financial applications
  • Manufacturing processes
  • Web logs and clickstreams
  • Others ...

Real-time Stream Processing

Motivation: Network Monitoring Queries

[Figure labels: DSL/cable networks, enterprise networks, PSTN, peer; measurement streams R1, R2, R3; Network Operations Center (NOC); DBMS (Oracle, DB2) back-end data warehouse with off-line analysis (slow, expensive).]

Example queries posed at the NOC:
• What are the top (most frequent) 1000 (source, dest) pairs seen over the last 1 hour?
• SQL join query:
    SELECT COUNT (R1.source, R2.dest)
    FROM R1, R2
    WHERE R1.dest = R2.source
• Set-expression query: How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3?

• Store-then-process is not feasible!
• Extra complexity comes from limited space and time


Motivation: Network Monitoring Queries

• Must process network streams in real time and in one pass - space
• Critical NM tasks: fraud, DoS attacks, SLA violations - latency
• Real-time traffic engineering to improve utilization
• Trade off result accuracy vs. space/time/communication
  • Fast responses, small space/time
  • Minimize use of communication resources

[Figure labels: IP network, PSTN, DSL/cable networks, BGP]


Example: ICU

H. C. M. Andrade, B. Gedik, and D. S. Turaga. "Fundamentals of Stream Processing", Cambridge University Press, 2014


Example: Cyber-Physical Systems (CPS)

http://www.kapsch.net/se/

Processing:
• On-the-fly
• Distributed
• Also parallel ...


What is Data Streaming?

• Data stream processing
  • Alternative to store-and-process
  • Data processed in real time
  • Suitable for systems processing huge amounts of data
• Data streams
  • Flow of tuples, each containing application-related data
  • Distributed, continuous, unbounded, rapid, time-varying, noisy, ...


Data Streaming: Requirements

• High throughput
• Low latency
• Determinism
  • Same output for same input, regardless of the number of cores

[Figure: pipeline of operators (Filter red → Count tuples → Alert if…) over tuples such as <2,blue>, <1,red>, <3,red> and <2,red>.]
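The operator pipeline in the figure can be sketched as chained Python generators. This is only an illustration: the function names are hypothetical, and because the slide leaves the alert condition unspecified ("Alert if…"), the condition is passed in by the caller rather than guessed.

```python
from typing import Callable, Iterable, Iterator, Tuple

StreamTuple = Tuple[int, str]  # (timestamp, colour), e.g. <1,red>


def filter_red(stream: Iterable[StreamTuple]) -> Iterator[StreamTuple]:
    """Stateless operator: forward only the red tuples."""
    return (t for t in stream if t[1] == "red")


def count_tuples(stream: Iterable[StreamTuple]) -> Iterator[int]:
    """Stateful operator: emit the running count after every tuple."""
    count = 0
    for _ in stream:
        count += 1
        yield count


def alert_if(counts: Iterable[int], condition: Callable[[int], bool]) -> Iterator[str]:
    """Emit an alert whenever the caller-supplied condition holds."""
    for c in counts:
        if condition(c):
            yield f"alert: count={c}"


source = [(1, "red"), (2, "blue"), (3, "red")]
# Hypothetical condition, chosen only for this example.
alerts = list(alert_if(count_tuples(filter_red(source)), lambda c: c >= 2))
```

Because each operator consumes its input in timestamp order, the same input always yields the same alerts, which is the determinism requirement in its simplest single-threaded form.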


What is Stream Aggregation?

• Data summarization
• General form:
  • select G, F1 from S where P group by G having F2
  • G: grouping attributes; F1, F2: aggregate expressions
• Window techniques are needed!
• Aggregate expressions:
  • Distributive: sum, count, min, max
  • Algebraic: avg
  • Holistic: count-distinct, median

Low computation costs
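The three classes of aggregate expressions differ in how partial results combine. The sketch below uses two hypothetical partial windows; it is an illustration of the classification, not code from the presented system.

```python
from statistics import median

# Two partial windows over the same group (hypothetical values).
part_a = [1, 3]
part_b = [5, 7, 9, 11]

# Distributive: the aggregate of the partial aggregates is the full aggregate.
total = sum([sum(part_a), sum(part_b)])
maximum = max(max(part_a), max(part_b))


# Algebraic: avg is not distributive, but a fixed-size partial state
# (sum, count) can be merged exactly.
def avg_state(values):
    return (sum(values), len(values))


def merge_avg(s1, s2):
    return (s1[0] + s2[0], s1[1] + s2[1])


merged_sum, merged_count = merge_avg(avg_state(part_a), avg_state(part_b))
average = merged_sum / merged_count
average_of_averages = (sum(part_a) / len(part_a) + sum(part_b) / len(part_b)) / 2

# Holistic: in general no constant-size partial state is enough;
# combining per-part medians does not give the true median.
true_median = median(part_a + part_b)
median_of_medians = median([median(part_a), median(part_b)])
```

Here the merged average (6.0) differs from the naive average of averages (5.0), and the true median (6.0) differs from the median of medians (5.0), which is why holistic aggregates need the window contents rather than a small summary.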


Multiway Stream Aggregation

• Multiple streams of incoming tuples
• Windows:
  • Time-based windows
  • Count-based windows
  • Sliding windows vs. tumbling windows
• 4 stages:
  • Add stage: fetch tuples from each input stream.
  • Merge stage: merge and sort fetched tuples according to timestamps.
  • Update stage: update the state of the windows a tuple contributes to.
  • Output stage: forward output tuples to the next aggregation stage.
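The four stages can be sketched in a few lines of Python. This is a minimal single-threaded illustration under stated assumptions: time-based sliding windows of size 4 that advance by 2 (so they cover timestamps 1-4, 3-6, 5-8, ...), a sum as the aggregate, and finite, already timestamp-ordered input queues. The function names and the tuple values are hypothetical.

```python
import heapq
from collections import defaultdict

WINDOW_SIZE = 4   # a window covers timestamps start .. start + 3
ADVANCE = 2       # a new window starts every 2 time units
FIRST_START = 1   # the first window starts at timestamp 1


def windows_for(ts):
    """Start times of every sliding window that contains timestamp ts."""
    start = ts - (ts - FIRST_START) % ADVANCE  # latest window start <= ts
    starts = []
    while start >= FIRST_START and start + WINDOW_SIZE > ts:
        starts.append(start)
        start -= ADVANCE
    return sorted(starts)


def multiway_aggregate(queues):
    # Add stage: fetch tuples from every input queue.
    # Merge stage: merge them into one timestamp-sorted sequence.
    merged = heapq.merge(*queues, key=lambda t: t[0])
    state = defaultdict(int)  # window start -> running sum
    output = []
    for ts, value in merged:
        # Output stage: windows that ended before ts can no longer change.
        for start in sorted(s for s in state if s + WINDOW_SIZE <= ts):
            output.append((start, start + WINDOW_SIZE - 1, state.pop(start)))
        # Update stage: add the value to every window the tuple contributes to.
        for start in windows_for(ts):
            state[start] += value
    # The example input is finite, so flush the windows that are still open.
    for start in sorted(state):
        output.append((start, start + WINDOW_SIZE - 1, state[start]))
    return output


# Three input queues of (timestamp, value) tuples, each ordered by timestamp.
queues = [[(1, 10), (4, 1)], [(2, 5), (6, 2)], [(3, 7), (5, 3)]]
result = multiway_aggregate(queues)
```

On an unbounded stream the merge stage can only release a tuple once every queue has supplied a tuple with an equal or later timestamp; the finite lists above sidestep that detail.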



[Figure sequence: four queues of timestamped tuples are drained by the add stage; the fetched tuples are sorted by timestamp and the update stage maintains the sliding windows 1-4, 3-6 and 5-8.]

Multiway Streaming Aggregation

• Input: raw data converted to tuples and stored in queues.
• Output: a flow of tuples with the aggregated values.

[Figure: the add, sort, update and output stages applied to the queues of tuples, with per-window values for windows 1-4, 3-6 and 5-8.]
Why Low-Power Embedded Systems?

• Salient characteristics:
  • Heavy reliance on data transfers
  • Relatively low computation per byte
  • Relatively small amounts of data at a time
• Modern multi/many-core embedded systems:
  • Low-latency programmable local storage vs. caches
  • High-bandwidth access to main memory
  • VPU and ILP enabled
  • Ultra-low power

• Communication vs. computation costs,
• memory access patterns and
• granularity of data access patterns.

Data Streaming on embedded systems

Design Challenges!

• How stream aggregation can map to the different parallel architectures is still an open problem
• Potential of such low-power processors for use in high-end computations
• Can high-performance computing techniques be deployed on these processors?
• Addressing hardware constraints
• Understanding memory access patterns in their algorithms in relation to the computation


Parallel Stream Aggregation

Concurrent data structures:
• Used between different stages of the aggregation process for communication purposes
• Share data across different threads/processes
• Allow for data parallelism
• Load balancing of the workload

• Tuples from each input stream placed in queues by multiple threads
• A consumer thread performing the merge, update and output stages
• One final aggregator used

How? Synchronization


Concurrent Data Structures: Synchronization Techniques

• Coarse-grained locking
  • Easy but slow...
• Fine-grained locking
  • Fast/scalable but: error-prone, not composable, deadlocks
• Non-blocking
  • Based on atomic hardware primitives (e.g. TAS, CAS)
  • Good progress guarantees (lock/wait-freedom)
  • Scalable

Fig. Yiannis Nikolakopoulos


Concurrent Data Structures: Queue Buffers

• Single Producer Single Consumer (SPSC)
  • Lamport 1983: Lamport Queue
  • Giacomoni et al. 2008: FastForward Queue
  • Lee et al. 2009: MCRingBuffer
  • Preud'homme et al. 2010: BatchQueue
• Multi Producer Multi Consumer (MPMC)
  • Michael & Scott 1996: MS-Queue (1-lock, 2-lock)
  • Yang & Mellor-Crummey 2016: Fetch-and-Add Queue
  • Message-passing based queues
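As an illustration of the SPSC family, the sketch below follows the idea behind the Lamport queue: a ring buffer where the producer is the only writer of the tail index and the consumer is the only writer of the head index, so no lock is needed. The class and method names are hypothetical. The sketch relies on CPython's global interpreter lock for memory ordering; on real hardware (including the SHAVE cores discussed here) the index updates would need appropriate memory barriers.

```python
import threading
import time


class LamportQueue:
    """Bounded single-producer single-consumer ring buffer (sketch)."""

    def __init__(self, capacity):
        # One slot stays empty so that "full" and "empty" are distinguishable.
        self._buf = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def try_enqueue(self, item):
        nxt = (self._tail + 1) % len(self._buf)
        if nxt == self._head:  # queue is full
            return False
        self._buf[self._tail] = item
        self._tail = nxt  # publish only after the slot has been written
        return True

    def try_dequeue(self):
        if self._head == self._tail:  # queue is empty
            return (False, None)
        item = self._buf[self._head]
        self._head = (self._head + 1) % len(self._buf)
        return (True, item)


def run_demo(n_items=200, capacity=16):
    q = LamportQueue(capacity)
    received = []

    def producer():
        for i in range(n_items):
            while not q.try_enqueue(i):
                time.sleep(0)  # queue full: yield and retry

    def consumer():
        while len(received) < n_items:
            ok, item = q.try_dequeue()
            if ok:
                received.append(item)
            else:
                time.sleep(0)  # queue empty: yield and retry

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


received = run_demo()
```

With exactly one producer and one consumer the items arrive in FIFO order; adding a second producer would break the single-writer assumption and call for one of the MPMC designs listed above.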

Target Architecture

Myriad1 architecture highlights

[Figure: Myriad1 block diagram. LEON3 RISC with 4kB 2-way I-cache and 4kB 2-way D-cache; 8 SHAVE VLIW vector processors (units PEU, BRU, VAU, IAU, LSU0, LSU1, SAU, CMU, DCU, IDC; VRF 32x128, IRF 32x32, SRF 32x32; 1kB D-cache); 128kB 2-way L2 cache (SHAVE); 32kB LRAM; 1MB CMX SRAM; DDR controller; 128-bit AXI, 128-bit AHB and 32-bit APB buses; 64-bit CMX ports and a 128-bit CMX instruction port.]

• 65nm ultra-low-power architecture (≤ 0.35 W @ 180 MHz) with 11 power islands.
• Hardware support for SIMD, matrix transpose, sparse data, sqrt@fp16, predicated execution...
• Heterogeneous SoC: 1 Leon3@fp64 + 8 Shaves@fp32.
• 32KB LRAM, 1MB CMX, 16/64MB DDR, DMAs.
• Power efficiency of 1 Tops/W (max 8-bit equivalent).
• FIFO buffers

Myriad2 architecture highlights

[Figure: Myriad2 block diagram. LEON4 RISC1 with 4kB 2-way I-cache, 4kB 2-way D-cache and 32kB 4-way L2 cache; LEON4 RISC2 with 32kB 2-way I-cache, 32kB 2-way D-cache and 256kB 4-way L2 cache; 12 SHAVE VLIW vector processors (units PEU, BRU, VAU, IAU, LSU0, LSU1, SAU, CMU, DCU, IDC; VRF 32x128, IRF 32x32; 1kB D-cache, 1kB I-cache); 256kB 2-way L2 cache (SHAVE); 2MB CMX SRAM; DDR controller with 128/256MB LPDDR2/3 stacked die; 128-bit AXI, two 128-bit AHB and 32-bit APB buses; 64-bit CMX ports and 128-bit ports.]

• 28nm ultra-low-power architecture (≤ 0.5 W @ 600 MHz) with 17 power islands.
• Extended hardware support over Myriad 1: clock-gating, hard-wired configurable accelerators for imaging and vision, etc.
• Heterogeneous SoC: 2 Leon4@fp64 + 12 Shaves@fp32.
• 256+32KB LRAM, 2MB CMX, DDR3 support, DMAs.
• Power efficiency of 2 Tops/W (max 16-bit equivalent).
• FIFO buffers

Evaluation of Streaming Aggregation Operator in Low Power Embedded Systems

Single Producer Variation

• One producer feeding tuples to ten aggregators in a round
• One final aggregator used
• All processes run on SHAVEs
• Queues placed in CMX slices of aggregators

[Figure: single producer variation]

Three Producers Variation

• Three producers feeding tuples to 8 aggregators
• One final aggregator used
• All processes run on SHAVEs
• Queues placed in CMX slices of aggregators

[Figure: three producers variation]

Streaming Aggregation Operator Customization In Embedded Systems

Streaming Aggregation Design Space

• Category A: decision trees that refer to memory configuration and allocation
• Category B: decision trees related to data movement and the means by which accesses to shared resources are synchronized


Methodology

Input:
• Application constraints
• Hardware constraints
Remove non-applicable options from the design space.

STEP 1: Design space exploration
• Exploration for all customized streaming aggregation implementations
• Step 1 output: throughput, latency, energy and scalability for each customized implementation

STEP 2: Identification of Pareto-efficient implementations
• Throughput vs. memory size
• Latency vs. energy consumption
• Scalability

Methodology output: customized streaming aggregation implementation

Example

Implementations evaluated:
• Q160: A1(loc), A2(loc), …, B4(b.w.)
• Q160: A1(loc), A2(loc), …, B4(p.s.)
• Q320: A1(loc), A2(loc), …, B4(b.w.)
• Q320: A1(loc), A2(loc), …, B4(p.s.)
• Q640: A1(loc), A2(loc), …, B4(p.s.)
• ...

Metric1: 7.4164, 7.4159, 7.3562, 7.3567, 7.3365, 7.3336
Metric2: ...

[Figure: Metric1 (axis 7.25 to 7.45) vs. Metric2 (axis 40 to 100), with points P1, P2, P3 and P4.]
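Step 2 of the methodology, identifying Pareto-efficient implementations, can be sketched as a simple dominance filter. The function name and the metric values below are hypothetical (chosen only to fall within the axis ranges of the figure), and both metrics are assumed to be minimized; a metric that should be maximized, such as throughput, can be negated first.

```python
def pareto_front(points):
    """Return the names of the points not dominated by any other point.

    Each point is (name, metric1, metric2); lower is better for both
    metrics (an assumption of this sketch).
    """
    front = []
    for name, m1, m2 in points:
        dominated = any(
            (o1 <= m1 and o2 <= m2) and (o1 < m1 or o2 < m2)
            for _, o1, o2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical measurements for five customized implementations.
candidates = [
    ("P1", 7.42, 45),
    ("P2", 7.36, 60),
    ("P3", 7.34, 80),
    ("P4", 7.33, 95),
    ("X", 7.40, 85),  # worse than P2 on both metrics, so it is dominated
]
front = pareto_front(candidates)
```

The quadratic scan is adequate for a design space of a few hundred customized implementations; only the surviving points need to be presented to the designer as trade-off options.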


Evaluation Setup

• Dataset: SoundCloud (user id, timestamp, song id, comment)
• Query: user id with the highest number of comments
• Platforms: Myriad1 (8 cores), Myriad2 (12 cores)
• Evaluation metrics: throughput, memory size, latency, energy consumption


Multiway streaming aggregation results: throughput, latency, energy and memory


Performance per Watt

                 Latency (usec)   Throughput (t/sec)   (t/sec)/watt
Myriad1          140.38           123,622              379,041
Myriad2          39.8             497,154              1,004,766
Intel Xeon E5    15               1,105,221            18,412

• Myriad1: about 20x higher performance per watt than the Intel Xeon E5
• Myriad2: about 54x higher performance per watt than the Intel Xeon E5
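The x20 and x54 factors follow directly from the table. The short calculation below, using only the table's figures, recomputes the ratios against the Intel Xeon E5 and also derives the power draw implied by dividing throughput by throughput per watt; the dictionary layout and variable names are just for illustration.

```python
# Values copied from the performance-per-watt table.
results = {
    "Myriad1": {"throughput": 123_622, "per_watt": 379_041},
    "Myriad2": {"throughput": 497_154, "per_watt": 1_004_766},
    "Intel Xeon E5": {"throughput": 1_105_221, "per_watt": 18_412},
}

baseline = results["Intel Xeon E5"]["per_watt"]

# Performance-per-watt advantage over the Xeon baseline.
ratios = {name: r["per_watt"] / baseline for name, r in results.items()}

# Power draw implied by the table: (t/sec) / ((t/sec)/watt) = watt.
implied_power = {name: r["throughput"] / r["per_watt"] for name, r in results.items()}

for name in results:
    print(f"{name}: {ratios[name]:.1f}x baseline, ~{implied_power[name]:.2f} W implied")
```

The ratios come out at roughly 20.6 and 54.6, and the implied power draws of about 0.33 W and 0.49 W are consistent with the ≤ 0.35 W and ≤ 0.5 W bounds quoted for Myriad1 and Myriad2.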

Conclusions

• Designed efficient concurrent data structure implementations for embedded system applications.
• Evaluation of a concurrent data structure implementation model based on message passing.
• Design space exploration of streaming aggregation implementation on embedded architectures.
• Data streaming: a major departure from the traditional persistent database paradigm
• Fundamental re-thinking of models, assumptions, algorithms, system architectures, ...


References

1. Lamport, L.: Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems 5 (1983) 190-222

2. Giacomoni, J., Moseley, T., Vachharajani, M.: FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM (2008) 43-52

3. Preud'homme, T., Sopena, J., Thomas, G., Folliot, B.: BatchQueue: fast and memory-thrifty core to core communication. In: 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). (2010) 215-222

4. Lee, P.P.C., Bu, T., Chandranmenon, G.: A lock-free, cache-efficient shared ring buffer for multi-core architectures. In: Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems. ANCS '09, New York, NY, USA, ACM (2009) 78-79

5. Michael, M., Scott, M.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, ACM (1996) 267-275

6. Yang, C., Mellor-Crummey, J.: A wait-free queue as fast as fetch-and-add. In: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP (2016) 16:1-16:13


1: Data are fetched from off-chip DDR into CMX using DMA. (L2 cache is enabled in normal mode.)
2: Tuples are copied from one CMX slice to another CMX slice using memcpy or DMA.
3: 4-byte pointers are transferred constantly from one SHAVE to another for determining which windows should be removed.


Three Producers Variation

[Figure: three producers variation]

Queues

[Figure: throughput (Mops/s) vs. number of SHAVEs (1, 2, 4, 6, 8, 10, 12) for the 1-lock, 2-lock and FAAQueue implementations]

Queues

[Figure: power (mW) vs. number of SHAVEs (1, 2, 4, 6, 8, 10, 12)]

Queues

[Figure: energy per operation (mJ/op) vs. number of SHAVEs (1, 2, 4, 6, 8, 10, 12)]

