Energy-Efficient Data Stream Processing on Ultra-Low-Power Embedded Multicore Devices
Ivan Walulya, Chalmers University of Technology
This project is part of the portfolio of the A.3 - Advanced Computing and Complex System Unit, Communications Networks, Content and Technology DG, European Commission
www.excess-project.eu
Copyright © 2013 - 2016 The EXCESS Consortium
Contract Number: 611183
Total Cost [€]: 3.31 million
Starting Date: 2013-09-01
Duration: 36 months
New World Order
I. Walulya @ CTH 2
• Traditional DBMS: data stored in finite, persistent data sets
• Data is growing continuously, faster than our ability to store or index it
• Data streams: distributed, continuous, unbounded, rapid, time-varying, noisy, ...
• Data-stream sources:
  • Network monitoring and traffic engineering
  • Sensor networks
  • Telecom call-detail records
  • Financial applications
  • Manufacturing processes
  • Web logs and clickstreams
  • Others ...
Real-time Stream Processing
Motivation: Network Monitoring Queries
[Figure: streams R1, R2 and R3 from the monitored networks (DSL/cable networks, enterprise networks, PSTN, peer) feed the Network Operations Center (NOC), which is backed by a data warehouse DBMS (Oracle, DB2) for off-line analysis: slow, expensive.]
• What are the top (most frequent) 1000 (source, dest) pairs seen over the last 1 hour?
• SQL join query:
  SELECT COUNT (R1.source, R2.dest) FROM R1, R2 WHERE R1.dest = R2.source
• Set-expression query: how many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3?
• Store-then-process is not feasible!
• Extra complexity comes from limited space and time
Motivation: Network Monitoring Queries
• Must process network streams in real time and in one pass (space)
• Critical NM tasks: fraud, DoS attacks, SLA violations (latency)
• Real-time traffic engineering to improve utilization
• Trade off result accuracy vs. space/time/communication
  • Fast responses, small space/time
  • Minimize use of communication resources
[Figure labels: IP network, PSTN, DSL/cable networks, BGP]
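To make the accuracy vs. space trade-off concrete, here is a minimal sketch of a one-pass frequent-items summary. It uses the Misra-Gries algorithm, which is a well-known technique swapped in for illustration and not something these slides prescribe. The function name `misra_gries`, the parameter `k` and the sample (source, dest) pairs are all assumptions.

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k - 1 counters.

    Every item whose true frequency exceeds n / k (n = stream length)
    is guaranteed to survive; its stored count may undercount the
    true frequency by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical (source, dest) pairs; ("a", "b") is the heavy hitter.
pairs = [("a", "b")] * 6 + [("c", "d")] * 3 + [("e", "f"), ("g", "h"), ("i", "j")]
summary = misra_gries(pairs, k=3)
```

The summary never holds more than `k - 1` entries regardless of stream length, which is the "small space" side of the trade-off; the price is that the stored counts are lower bounds rather than exact values.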
Example: ICU
H. C. M. Andrade, B. Gedik, and D. S. Turaga. "Fundamentals of Stream Processing", Cambridge University Press, 2014
Example: Cyber-Physical Systems (CPS)
http://www.kapsch.net/se/
Processing:
• On-the-fly
• Distributed
• Also parallel…
What is Data Streaming?
• Data stream processing
  • Alternative to store-and-process
  • Data processed in real time
  • Suitable for systems processing huge amounts of data
• Data streams
  • Flow of tuples, each containing application-related data
  • Distributed, continuous, unbounded, rapid, time-varying, noisy, ...
Data Streaming: Requirements
• High throughput
• Low latency
• Determinism
  • Same output for same input, regardless of #cores
[Figure: a pipeline of operators (Filter red, Count tuples, Alert if …) applied to the tuples <2,blue>, <1,red>, <3,red>, with <2,red> shown downstream.]
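The operator chain in the figure can be sketched with Python generators. This is only an illustration: the function names, the reading of the first tuple field as a timestamp, and the alert threshold are assumptions, since the slide elides the alert condition.

```python
def filter_red(tuples):
    """Filter operator: forward only tuples whose colour is red."""
    for ts, colour in tuples:
        if colour == "red":
            yield ts, colour


def count_tuples(tuples):
    """Count operator: emit the running count after every input tuple."""
    count = 0
    for _ts, colour in tuples:
        count += 1
        yield count, colour


def alert_if(counts, threshold):
    """Alert operator: the condition (count >= threshold) is an assumption."""
    for count, colour in counts:
        if count >= threshold:
            yield f"ALERT: {count} {colour} tuples"


stream = [(1, "red"), (2, "blue"), (3, "red")]  # in timestamp order
counts = list(count_tuples(filter_red(stream)))
alerts = list(alert_if(count_tuples(filter_red(stream)), threshold=2))
```

Because each operator consumes its input strictly in order, the same input always yields the same output. A parallel implementation has to preserve that property however many cores it runs on, which is what the determinism requirement asks for.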
What is Stream Aggregation?
• Data summarization
• General form:
  • select G, F1 from S where P group by G having F2
  • G: grouping attributes; F1, F2: aggregate expressions
• Window techniques are needed!
• Aggregate expressions:
  • distributive: sum, count, min, max
  • algebraic: avg
  • holistic: count-distinct, median
Low computation costs
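The three classes of aggregate expressions differ in what state must be kept per window. A small sketch, with hypothetical helper names (`partial`, `merge`, `avg`) and made-up values: distributive aggregates merge directly from partial results, an algebraic aggregate such as avg is derived from a fixed-size pair of them, and a holistic aggregate such as median needs the underlying values.

```python
from statistics import median


def partial(values):
    """Fixed-size partial state covering the distributive aggregates."""
    return {"sum": sum(values), "count": len(values),
            "min": min(values), "max": max(values)}


def merge(a, b):
    """Combine two partial states without revisiting the raw tuples."""
    return {"sum": a["sum"] + b["sum"], "count": a["count"] + b["count"],
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}


def avg(state):
    """Algebraic: computed from a constant number of distributive parts."""
    return state["sum"] / state["count"]


left, right = [1, 2, 9], [3, 4, 5]  # two partitions of one window
merged = merge(partial(left), partial(right))

# Holistic: the median of the partition medians is not the true median.
true_median = median(left + right)
median_of_medians = median([median(left), median(right)])
```

Here `merged` equals the state computed over the whole window, while `true_median` and `median_of_medians` differ, so a median operator cannot be maintained from constant-size partial state in the same way.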
Multiway Stream Aggregation
• Multiple streams of incoming tuples
• Windows:
  • Time-based windows
  • Count-based windows
  • Sliding windows vs. tumbling windows
• 4 stages:
  • Add stage: fetch tuples from each input stream.
  • Merge stage: merge and sort fetched tuples according to timestamps.
  • Update stage: update the state of the windows a tuple contributes to.
  • Output stage: forward output tuples to the next aggregation stage.
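The four stages can be sketched sequentially as below. This is a single-threaded illustration under stated assumptions, not the implementation evaluated later: tuples are `(timestamp, value)` pairs, each input queue is already ordered by timestamp, windows are time-based and half-open `[start, start + size)` with starts at multiples of `advance`, and the aggregate is a sum. All names are hypothetical.

```python
import heapq
from collections import defaultdict


def windows_for(ts, size, advance):
    """Start times of every window [start, start + size) that contains ts."""
    starts = []
    start = (ts // advance) * advance
    while start > ts - size and start >= 0:
        starts.append(start)
        start -= advance
    return starts


def multiway_aggregate(queues, size, advance):
    # Add + merge stages: heapq.merge fetches lazily from every queue
    # and yields tuples in timestamp order.
    merged = heapq.merge(*queues, key=lambda t: t[0])
    state = defaultdict(int)  # window start -> running sum
    for ts, value in merged:
        # Output stage: a window ending at or before ts is complete.
        for start in sorted(s for s in state if s + size <= ts):
            yield (start, start + size, state.pop(start))
        # Update stage: add the tuple to every window it contributes to.
        for start in windows_for(ts, size, advance):
            state[start] += value
    for start in sorted(state):  # flush windows still open at end of input
        yield (start, start + size, state.pop(start))


queues = [[(1, 10), (4, 1)], [(2, 5), (5, 2)]]
results = list(multiway_aggregate(queues, size=4, advance=2))
```

Setting `advance` equal to `size` gives tumbling windows; a smaller `advance` gives sliding windows, where one tuple updates several windows.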
[Figure: step-by-step animation of multiway streaming aggregation. Tuples are taken from the queues of tuples (add), ordered by timestamp (sort), folded into the sliding windows 1-4, 3-6 and 5-8 (update), and emitted as aggregated values (output).]
• Input: raw data converted to tuples and stored in queues.
• Output: a flow of tuples with the aggregated values.
Why low-power embedded systems?
• Salient characteristics:
  • Heavy reliance on data transfers
  • Relatively low computation per byte
  • Relatively small amounts of data at a time
• Modern multi/many-core embedded systems:
  • Low-latency programmable local storage vs. caches
  • High-bandwidth access to main memory
  • VPU and ILP enabled
  • Ultra-low power
• Communication vs. computation costs,
• memory access patterns, and
• granularity of data access patterns.
Data Streaming on embedded systems
Design challenges!
• How stream aggregation can map to the different parallel architectures is still an open problem
• Potential of such low-power processors for use in high-end computations
• Can high-performance computing techniques be deployed on these processors?
• Addressing hardware constraints
• Understanding the memory access patterns of the algorithms in relation to the computation
Parallel Stream Aggregation
Concurrent data structures:
• Used between different stages of the aggregation process for communication purposes
• Share data across different threads/processes
• Allow for data parallelism
• Load balancing on the workload
• Tuples from each input stream placed in queues by multiple threads
• A consumer thread performing the merge, update and output stages
• One final aggregator used
How? Synchronization
Concurrent data structures: Synchronization Techniques
• Coarse-grained locking
  • Easy but slow...
• Fine-grained locking
  • Fast/scalable but: error-prone, not composable, deadlocks
• Non-blocking
  • Based on atomic hardware primitives (e.g. TAS, CAS)
  • Good progress guarantees (lock/wait-freedom)
  • Scalable
Fig. Yiannis Nikolakopoulos
Concurrent data structures: Queue Buffers
• Single Producer Single Consumer (SPSC)
  • Lamport 1983: Lamport Queue
  • Giacomoni et al. 2008: FastForward Queue
  • Lee et al. 2009: MCRingBuffer
  • Preud'homme et al. 2010: BatchQueue
• Multi Producer Multi Consumer (MPMC)
  • Michael & Scott 1996: MS-Queue (1-lock, 2-lock)
  • Yang & Mellor-Crummey 2016: Fetch-and-Add Queue
  • Message-passing based queues
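As an illustration of the SPSC family, the sketch below follows the index discipline of a Lamport-style ring buffer: the producer is the only writer of `tail`, the consumer is the only writer of `head`, and no lock is taken. It is written in Python purely for readability; the class and method names are assumptions, and a version for a real multicore target would additionally need the memory barriers or atomic accesses that Python does not expose.

```python
import threading
import time


class LamportQueue:
    """Bounded single-producer single-consumer ring buffer (sketch)."""

    def __init__(self, capacity):
        self._buf = [None] * (capacity + 1)  # one slot stays empty: full != empty
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def try_enqueue(self, item):
        nxt = (self._tail + 1) % len(self._buf)
        if nxt == self._head:
            return False  # queue is full
        self._buf[self._tail] = item
        self._tail = nxt  # publish only after the slot has been written
        return True

    def try_dequeue(self):
        if self._head == self._tail:
            return False, None  # queue is empty
        item = self._buf[self._head]
        self._head = (self._head + 1) % len(self._buf)
        return True, item


def run_demo(n_items=200, capacity=16):
    queue, received = LamportQueue(capacity), []

    def producer():
        for i in range(n_items):
            while not queue.try_enqueue(i):
                time.sleep(0)  # full: yield to the consumer and retry

    def consumer():
        while len(received) < n_items:
            ok, item = queue.try_dequeue()
            if ok:
                received.append(item)
            else:
                time.sleep(0)  # empty: yield to the producer and retry

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


received = run_demo()
```

Because each index has exactly one writer, the queue needs no compare-and-swap. The MPMC queues listed above cannot rely on this and fall back on locks or atomic read-modify-write primitives.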
Target Architecture
Myriad1 architecture highlights
[Figure: Myriad1 block diagram: LEON3 RISC with 4kB 2-way I-cache, 4kB 2-way D-cache and 32kB LRAM; 8 SHAVE VLIW vector processors (VRF 32x128, IRF 32x32, SRF 32x32; PEU, BRU, VAU, IAU, LSU0, LSU1, SAU, CMU units); 128kB 2-way L2 cache (SHAVE); 1MB CMX SRAM; DDR controller; 128-bit AXI, 128-bit AHB and 32-bit APB buses.]
• 65nm ultra-low-power architecture (≤ 0.35W @ 180MHz) with 11 power islands.
• Hardware support for SIMD, matrix transpose, sparse data, sqrt@fp16, predicated execution...
• Heterogeneous SoC: 1 Leon3@fp64 + 8 Shaves@fp32.
• 32KB LRAM, 1MB CMX, 16/64MB DDR, DMAs.
• Power efficiency of 1Tops/W (max 8-bit equivalent).
• FIFO buffers

Myriad2 architecture highlights
[Figure: Myriad2 block diagram: two LEON4 RISC cores, each with its own I-cache, D-cache and L2 cache; 12 SHAVE VLIW vector processors (VRF 32x128, IRF 32x32; 1kB I-cache and 1kB D-cache; PEU, BRU, VAU, IAU, LSU0, LSU1, SAU, CMU units); 256kB 2-way L2 cache (SHAVE); 2MB CMX SRAM; DDR controller with 128/256MB LPDDR2/3 stacked die; 128-bit AXI, 128-bit AHB and 32-bit APB buses.]
• 28nm ultra-low-power architecture (≤ 0.5W @ 600MHz) with 17 power islands.
• Extended hardware support over Myriad 1: clock-gating, hard-wired configurable accelerators for imaging and vision, etc.
• Heterogeneous SoC: 2 Leon4@fp64 + 12 Shaves@fp32.
• 256+32KB LRAM, 2MB CMX, DDR3 support, DMAs.
• Power efficiency of 2Tops/W (max 16-bit equivalent).
• FIFO buffers
Evaluation of Streaming Aggregation Operator in Low Power Embedded Systems
Single Producer Variation
• One producer feeding tuples to ten aggregators in a round
• Three producers feeding tuples to 8 aggregators
• One final aggregator used
• All processes run on SHAVEs
• Queues placed in CMX slices of aggregators
[Figure: single producer variation]
Three Producers Variation
• All processes run on SHAVEs
• Queues placed in CMX slices of aggregators
• Three producers feeding tuples to 8 aggregators
• One final aggregator used
[Figure: three producers variation]
Streaming Aggregation Operator Customization In Embedded Systems
Streaming aggregation design space
• Category A:
  • consists of decision trees that refer to memory configuration and allocation
• Category B:
  • consists of decision trees related to data movement and the means by which accesses to shared resources are synchronized
METHODOLOGY
Input:
• Application constraints
• Hardware constraints
STEP 1: Design space exploration
• Remove non-applicable options from the design space
• Exploration for all customized streaming aggregation implementations
• Step 1 output: throughput, latency, energy, scalability for each customized implementation
STEP 2: Identification of Pareto-efficient implementations
• Throughput vs. memory size
• Latency vs. energy consumption
• Scalability
Methodology output: customized streaming aggregation implementation

EXAMPLE
Implementations evaluated:
• Q160: A1(loc), A2(loc), …, B4(b.w.)
• Q160: A1(loc), A2(loc), …, B4(p.s.)
• Q320: A1(loc), A2(loc), …, B4(b.w.)
• Q320: A1(loc), A2(loc), …, B4(p.s.)
• Q640: A1(loc), A2(loc), …, B4(p.s.)
• ...
Metric1 values listed: 7.4164, 7.4159, 7.3562, 7.3567, 7.3365, 7.3336 (Metric2 values not shown)
[Figure: scatter plot of Metric1 (axis 7.25 to 7.45) vs. Metric2 (axis 40 to 100) with points P1 to P4 marked.]
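Step 2 of the methodology amounts to filtering the explored implementations down to the Pareto-efficient ones. A minimal sketch for two metrics follows, assuming both are to be minimised (a metric to be maximised, such as throughput, can be negated first). The function name, the configuration labels and the numbers are hypothetical and are not measurements from the slides.

```python
def pareto_front(points):
    """Names of the points that no other point dominates.

    A point dominates another when it is no worse on both metrics
    and strictly better on at least one (lower is better here).
    """
    front = []
    for name, m1, m2 in points:
        dominated = any(
            o1 <= m1 and o2 <= m2 and (o1 < m1 or o2 < m2)
            for _name, o1, o2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (implementation, metric1, metric2) triples.
candidates = [
    ("cfg-a", 1.0, 9.0),
    ("cfg-b", 2.0, 4.0),
    ("cfg-c", 3.0, 5.0),  # worse than cfg-b on both metrics
    ("cfg-d", 4.0, 1.0),
]
front = pareto_front(candidates)
```

The quadratic scan is adequate for a design space of a few hundred customized implementations; the remaining points are the trade-off options handed to the designer.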
Evaluation setup
• Dataset: SoundCloud (user id, timestamp, song id, comment)
• Query: user id with the highest number of comments.
• Platforms: Myriad1 (8 cores), Myriad2 (12 cores).
• Evaluation metrics: throughput, memory size, latency, energy consumption
Multiway streaming aggregation results: throughput, latency, energy and memory
Performance per watt
Platform        Latency (usec)   Throughput (t/sec)   (t/sec)/watt
Myriad1         140.38           123,622              379,041
Myriad2         39.8             497,154              1,004,766
Intel Xeon E5   15               1,105,221            18,412
• Myriad1: about 20x the performance per watt of the Intel Xeon E5
• Myriad2: about 54x the performance per watt of the Intel Xeon E5
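The two ratios follow directly from the table. The short calculation below reproduces them and also derives the power draw implied by each row (throughput divided by throughput per watt); the dictionary layout and variable names are only for illustration.

```python
# Values copied from the performance-per-watt table.
results = {
    "Myriad1": {"throughput": 123_622, "per_watt": 379_041},
    "Myriad2": {"throughput": 497_154, "per_watt": 1_004_766},
    "Intel Xeon E5": {"throughput": 1_105_221, "per_watt": 18_412},
}

baseline = results["Intel Xeon E5"]["per_watt"]

# How many times more tuples per second per watt than the Xeon.
ratio = {name: row["per_watt"] / baseline for name, row in results.items()}

# Power draw implied by each row, in watts.
implied_watts = {
    name: row["throughput"] / row["per_watt"] for name, row in results.items()
}

for name in results:
    print(f"{name}: {ratio[name]:.1f}x, {implied_watts[name]:.2f} W")
```

The ratios come out a little above 20 and 54, so the slide rounds them down. The implied power is roughly a third of a watt for Myriad1 and half a watt for Myriad2, in line with the ≤ 0.35W and ≤ 0.5W figures in the architecture highlights, against roughly 60 W for the Xeon.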
Conclusions
• Designed efficient concurrent data structure implementations for embedded system applications.
• Evaluation of a concurrent data structure implementation model based on message-passing.
• Design space exploration of streaming aggregation implementations on embedded architectures.
• Data streaming: a major departure from the traditional persistent database paradigm
• Fundamental re-thinking of models, assumptions, algorithms, system architectures, ...
References
1. Lamport, L.: Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems 5 (1983) 190-222
2. Giacomoni, J., Moseley, T., Vachharajani, M.: FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM (2008) 43-52
3. Preud'homme, T., Sopena, J., Thomas, G., Folliot, B.: BatchQueue: fast and memory-thrifty core to core communication. In: 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). (2010) 215-222
4. Lee, P.P.C., Bu, T., Chandranmenon, G.: A lock-free, cache-efficient shared ring buffer for multi-core architectures. In: Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems. ANCS '09, New York, NY, USA, ACM (2009) 78-79
5. Michael, M., Scott, M.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, ACM (1996) 267-275
6. Yang, C., Mellor-Crummey, J.: A wait-free queue as fast as fetch-and-add. In: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP (2016) 16:1-16:13
1: Data are fetched from off-chip DDR into CMX using DMA. (L2 cache is enabled in normal mode.)
2: Tuples are copied from one CMX slice to another CMX slice using memcpy or DMA.
3: 4-byte pointers are transferred constantly from one SHAVE to another for determining which windows should be removed.
[Figure: three producers variation]
Queues
[Figure: throughput (Mops/s) vs. number of SHAVEs (1 to 12) for the 1-lock, 2-lock and FAAQueue implementations]
[Figure: power (mW) vs. number of SHAVEs (1 to 12) for the 1-lock, 2-lock and FAAQueue implementations]
[Figure: energy per operation (mJ/op) vs. number of SHAVEs (1 to 12) for the 1-lock, 2-lock and FAAQueue implementations]