University of Iowa | Mobile Sensing Laboratory
High Performance Stream Processing and Optimizations
May 8, 2015
Farley Lai
Advisor: Octav Chipara
Department of Computer Science
• A class of applications that process continuous input data streams and may produce continuous output streams
– High performance for real-time processing
– Long-term efficient resource management
Stream Applications
[Figure: speaker identification pipeline with stages Speech Recording, VAD, Feature Extraction, HTTP Upload and Speaker Identifier, plus Speaker Models]
Develop compiler optimizations and efficient runtime environments for scalable streaming systems
Introduction
• Challenges
– Multi-core architectures
– High and variable memory and I/O workloads
– Energy efficiency
• Traditional approach: programmer-specified parallelization
– Imperative languages expose limited parallelism to exploit
– Error-prone concurrency primitives
– Energy efficiency with aggressive power management
Stream Processing on Modern Architectures
• Synchronous Data Flow (SDF)
– Programs are described as directed graphs
• Nodes: processes with sequential code
• Edges: FIFO communication channels
– Pipeline and data parallelism are explicitly exposed
– Amount of data consumed/produced by a process is fixed & known
• Limited expressivity
– Periodic static schedules
– Memory requirements are bounded
– Memory behavior of entire program may be characterized
Model of Computation
• Static schedules
– Order and number of invocations of each process in one period
– Two phases: initialization phase + steady phase
– Existence of a static schedule can be determined from the production and consumption rates of all processes
• Sequential schedules
– Built by simulating the process executions iteratively and tracking the channel buffer sizes:
• A process is schedulable if there is sufficient data to consume
• The simulation continues until the initial buffer sizes are restored
– Memory requirements are determined during scheduling
• Parallel schedules
– Derived by partitioning sequential schedules
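The scheduling steps on this slide can be illustrated with a small sketch. It is not taken from the talk: the graph encoding (`nodes` in topological order, `edges` as `(src, dst, push, pop)` tuples with at most one edge per node pair), the function names, and the "fire the most downstream ready process" policy are all assumptions made for illustration. It first solves the balance equations to decide whether a static schedule exists, then simulates one period while tracking channel buffer sizes.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(nodes, edges):
    """Solve the balance equations reps[src] * push == reps[dst] * pop."""
    rate = {nodes[0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, dst, push, pop = edge
            if src in rate and dst in rate:
                if rate[src] * push != rate[dst] * pop:
                    raise ValueError("inconsistent rates: no static schedule")
            elif src in rate:
                rate[dst] = rate[src] * push / pop
            elif dst in rate:
                rate[src] = rate[dst] * pop / push
            else:
                continue  # neither endpoint solved yet, retry later
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {n: int(r * scale) for n, r in rate.items()}


def sequential_schedule(nodes, edges):
    """Simulate one period; return the firing order and peak buffer sizes."""
    remaining = repetition_vector(nodes, edges)
    buffered = {(src, dst): 0 for src, dst, _, _ in edges}
    peak = dict(buffered)
    schedule = []
    while any(remaining.values()):
        # Assumed policy: prefer the most downstream schedulable process.
        for node in reversed(nodes):
            inputs_ready = all(buffered[(s, d)] >= pop
                               for s, d, _, pop in edges if d == node)
            if remaining[node] > 0 and inputs_ready:
                break
        else:
            raise ValueError("deadlock: no process is schedulable")
        for s, d, push, pop in edges:
            if d == node:
                buffered[(s, d)] -= pop
            if s == node:
                buffered[(s, d)] += push
                peak[(s, d)] = max(peak[(s, d)], buffered[(s, d)])
        remaining[node] -= 1
        schedule.append(node)
    return schedule, peak


# Hypothetical three-stage pipeline: A pushes 2, B pops 3 and pushes 1, C pops 2.
NODES = ["A", "B", "C"]
EDGES = [("A", "B", 2, 3), ("B", "C", 1, 2)]
SCHEDULE, PEAK = sequential_schedule(NODES, EDGES)
```

The peak buffer sizes recorded during the simulation are what "memory requirements are determined during scheduling" refers to in this toy setting; a different firing policy would generally yield different peaks.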
Synchronous Data Flow
• Potential memory inefficiency
– Per channel buffer allocations and pass-by-value semantics
– Pass-by-reference only works on read-only chunks of data
• Aggregate update problem [Haven85]
• What if a process tries to insert new samples in the middle?
• Can we still prevent copying unchanged data portions by capturing the semantics of memory operations at compile-time?
Synchronous Data Flow
MY RESEARCH
ESMS: Efficient Static Memory Analysis on Stream Programs
CSense: A Stream-Processing Toolkit for Mobile Sensing Applications
• StreamIt: an SDF language
• Per-channel allocation and pass-by-value semantics lead to significant memory usage and copying
• One single global allocation
– Reuses memory and reduces memory requirements
– Avoids unnecessary memory copies
Memory Optimizations: ESMS
My Research: ESMS
• Component analysis on filter work functions in the logical space for each filter
– peek(i), pop(), push(v) are supposed to access contiguous memory allocations in FIFO channels
– Interprets push(v) in filter work functions as
• PASS: moves an unchanged data sample between channels
• UPDATE: otherwise, a new value is pushed
– Splitters, joiners and reordering filters are pass-only
Static Analysis
int->int filter appender() {
    work pop 4 push 8 {
        for (int i = 0; i < 4; i++) push(pop());       // pass
        for (int i = 0; i < 4; i++) push(compute(i));  // update
    }
}
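The PASS/UPDATE distinction can be illustrated by tracing one invocation of a work function with symbolic input tokens. This is only a sketch of the idea: ESMS performs its analysis at compile time, whereas the code below traces at run time, and the names `Sym`, `classify_pushes` and the Python rendering of `appender` are hypothetical.

```python
class Sym:
    """Symbolic token standing for the input sample at a logical position."""

    def __init__(self, pos):
        self.pos = pos


def classify_pushes(work, pop_rate):
    """Run `work` once and label every push as PASS or UPDATE."""
    inputs = [Sym(i) for i in range(pop_rate)]
    cursor = 0
    labels = []

    def pop():
        nonlocal cursor
        token = inputs[cursor]
        cursor += 1
        return token

    def push(value):
        if isinstance(value, Sym):
            labels.append(("PASS", value.pos))   # unchanged input sample
        else:
            labels.append(("UPDATE", None))      # newly computed value

    work(pop, push)
    return labels


def appender(pop, push):
    """Python rendering of the appender filter (pop 4, push 8)."""
    for _ in range(4):
        push(pop())
    for i in range(4):
        push(i * i)  # stand-in for compute(i), which the slide leaves undefined


LABELS = classify_pushes(appender, pop_rate=4)
```

For `appender`, the first four pushes are labeled PASS with logical positions 0 to 3 and the last four are labeled UPDATE.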
• Relate logical positions to physical locations in the global allocation
– Remaps peek(i), pop(), push(v) to access possibly non-contiguous memory
– Live range analysis by reference counting in a schedule period
• Layout starts with size zero and expands when necessary
• Each location represents a live variable with its live range
• The live range begins when the value is pushed for the first time (a splitter pushes a value multiple times for sharing)
• The live range ends when its value is popped for the last time
• A location is free if its live variable is out of range
– Complete memory behaviors and sound approximation
• No pointer aliasing
• Terminates in one schedule period
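A minimal sketch of the reference-counting idea follows. It assumes a flat trace of `("push", var)` and `("pop", var)` events for one schedule period and a lowest-free-slot reuse policy; both the event encoding and the function name `simulate_layout` are illustrative assumptions, not the ESMS implementation.

```python
def simulate_layout(events):
    """Assign each live variable a slot in one global layout.

    A variable's live range starts at its first push and ends when its
    reference count (pushes minus pops) drops back to zero.
    """
    slot_of = {}     # currently live variable -> slot index
    refs = {}        # currently live variable -> reference count
    free = []        # slots whose variable went out of range
    assigned = {}    # every variable ever seen -> slot it occupied
    size = 0         # layout starts empty and expands when necessary
    for op, var in events:
        if op == "push":
            if var not in slot_of:
                if free:
                    slot = min(free)
                    free.remove(slot)
                else:
                    slot = size
                    size += 1
                slot_of[var] = slot
                assigned[var] = slot
                refs[var] = 0
            refs[var] += 1       # a splitter may push the same value again
        elif op == "pop":
            refs[var] -= 1
            if refs[var] == 0:   # last pop: the location becomes free
                free.append(slot_of.pop(var))
                del refs[var]
        else:
            raise ValueError(f"unknown operation {op!r}")
    return size, assigned


# Hypothetical trace: "a" is shared (pushed twice), so its slot is only
# reused after the second pop.
TRACE = [("push", "a"), ("push", "a"), ("push", "b"), ("pop", "a"),
         ("push", "c"), ("pop", "a"), ("push", "d"), ("pop", "b"),
         ("pop", "c"), ("pop", "d")]
SIZE, ASSIGNED = simulate_layout(TRACE)
```

In this trace four variables fit into three locations because `d` reuses the slot released by `a`.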
Whole Program Analysis
• Simulate one period of the static schedule
– case PASS: reuse memory locations in the layout
– case UPDATE: follow one of the three strategies
• Always-Append (AA) | Append-on-Conflict (AoC) | Insert-in-Place (IP)
Layout Stitching
[Figure: two layout stitching examples, each showing an input layout that receives 2 updates. Panels of example 1: Input (2 updates), AA, AoC & IP. Panels of example 2: Input (2 updates), AA & AoC, IP. The MEM layouts are listed below in extraction order.]
Example 1
MEM[0, 0]: D0 | MEM[1, 0]: D1
MEM[0, 0]: I0 | MEM[1, 0]: I1
MEM[0, 0]: D0 | MEM[1, 0]: D1 | MEM[2, 1]: I0 | MEM[3, 1]: I1
Example 2
MEM[0, 0]: D0 | MEM[1, 1]: D1
MEM[0, 1]: I0 | MEM[1, 1]: I1 | MEM[2, 1]: D1
MEM[0, 0]: D0 | MEM[1, 1]: D1 | MEM[2, 1]: I0 | MEM[3, 1]: I1
– ESMS reduces both channel buffer sizes and the number of memory operations for reordering and duplicating data streams (splitters, joiners, reordering filters).
Memory Usage Reductions
45% to 96% reductions, 73% on average
– The average speedups of AA, AoC, and IP are 3, 3.1, and 3, whereas the average speedup of CacheOpt is only 1.07.
– ESMS improves performance by eliminating unnecessary memory operations and by shrinking the working set so that it fits in the cache.
Speedup
• Challenges
– Mobile sensing applications are difficult to implement on Android devices
• High frame rates
• Concurrency
• Robustness
• Energy efficiency
– Resource limitations and the Java VM worsen these problems
• Additional cost of virtualization
• Significant overhead of garbage collection
• Integrates SDF with dynamic scheduling
– Conditional dataflow paths by partitioning the SDF
– Asynchronous event processing, e.g., network access and UI
– Android-specific power management
Mobile Sensing Applications: CSense
My Research: CSense
• Speaker Identifier
– Conditional dataflow paths result in SDF subgraphs
– Bounded memory requirements
Example Application
// create components
addComponent("audio", new AudioComponentC(rateInHz, 16));
addComponent("rmsClassifier", new RMSClassifierC(rms));
addComponent("mfcc", new MFCCFeaturesG(speechT, featureT));
...
// wire components
link("audio", "rmsClassifier");
toTap("rmsClassifier::below");
link("rmsClassifier::above", "mfcc::sin");
fromMemory("mfcc::fin");
...
Concurrency Model
getComponent("audio").setThreading(Threading.NEW_DOMAIN);
getComponent("httpPost").setThreading(Threading.NEW_DOMAIN);
getComponent("mfcc").setThreading(Threading.SAME_DOMAIN);
Compiler transformation
• Static analysis
– Composition errors, memory usage errors, race conditions
• Flow analysis
– Whole-application configuration and optimization
• Stream Flow Graph transformations
– Domain partitioning, type conversions, MATLAB component coalescing
• Code generation
– Android application/service, MATLAB (C code + JNI stubs)
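Two of the compiler steps listed above, composition checking and domain partitioning, can be sketched on a toy graph. This is not the CSense compiler: the function name, the string-valued threading annotations, and the rule that an unannotated or SAME_DOMAIN component joins the domain of its first upstream component are assumptions, and the `mfcc` to `httpPost` link is hypothetical because the talk elides that part of the wiring.

```python
def partition_domains(components, links, threading):
    """Check link endpoints, then assign every component to a domain.

    `components` is assumed to be listed in dataflow order.
    """
    for src, dst in links:
        for name in (src, dst):
            if name not in components:
                raise ValueError(
                    f"composition error: unknown component {name!r}")
    domain_of = {}
    next_domain = 0
    for name in components:
        upstream = [domain_of[s] for s, d in links
                    if d == name and s in domain_of]
        if threading.get(name) == "NEW_DOMAIN" or not upstream:
            domain_of[name] = next_domain   # start a new scheduling domain
            next_domain += 1
        else:
            domain_of[name] = upstream[0]   # run in the upstream domain
    return domain_of


COMPONENTS = ["audio", "rmsClassifier", "mfcc", "httpPost"]
LINKS = [("audio", "rmsClassifier"), ("rmsClassifier", "mfcc"),
         ("mfcc", "httpPost")]  # last link is hypothetical
THREADING = {"audio": "NEW_DOMAIN", "httpPost": "NEW_DOMAIN",
             "mfcc": "SAME_DOMAIN"}
DOMAINS = partition_domains(COMPONENTS, LINKS, THREADING)
```

Under these assumptions the audio path shares one domain and the upload component gets its own, so blocking network access would not stall audio processing.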
CSense Compiler
• Components exchange data using push/pull semantics
• Runtime includes a scheduler for each domain
– task queue + event queue
– wake lock for power management
CSense Runtime
[Figure: two schedulers (Scheduler 1 and Scheduler 2), each with a task queue and an event queue, and a memory pool]
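The runtime structure on this slide (per-domain schedulers with a task queue and an event queue, plus a memory pool) can be sketched in a single-threaded simulation. Everything here is illustrative: the class and method names are hypothetical, events are assumed to take priority over tasks, and the wake lock is reduced to a boolean flag held while a scheduler has work.

```python
from collections import deque


class MemoryPool:
    """Fixed set of preallocated buffers, recycled instead of reallocated."""

    def __init__(self, count, size):
        self._free = deque(bytearray(size) for _ in range(count))

    def acquire(self):
        if not self._free:
            raise MemoryError("pool exhausted")
        return self._free.popleft()

    def release(self, buf):
        self._free.append(buf)

    def available(self):
        return len(self._free)


class Scheduler:
    """One scheduler per domain: a task queue plus an event queue."""

    def __init__(self):
        self.tasks = deque()
        self.events = deque()
        self.wake_lock_held = False

    def post_task(self, fn):
        self.tasks.append(fn)

    def post_event(self, fn):
        self.events.append(fn)

    def run_until_idle(self):
        executed = 0
        self.wake_lock_held = True        # stay awake while work is pending
        while self.tasks or self.events:
            queue = self.events if self.events else self.tasks
            queue.popleft()()             # assumed policy: events first
            executed += 1
        self.wake_lock_held = False       # idle: allow the device to sleep
        return executed


POOL = MemoryPool(count=2, size=4)
DOMAIN1, DOMAIN2 = Scheduler(), Scheduler()
LOG = []


def consume(buf):
    LOG.append(buf[0])
    POOL.release(buf)                     # buffer returns to the pool


def produce():
    buf = POOL.acquire()
    buf[0] = 42
    DOMAIN2.post_task(lambda: consume(buf))  # hand over to the other domain


DOMAIN1.post_task(produce)
DOMAIN1.run_until_idle()
DOMAIN2.run_until_idle()
```

Recycling buffers through the pool is one way to keep allocation off the hot path, which is relevant to the garbage collection overhead discussed on the next slide.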
• Garbage collection overhead limits scalability
• Concurrency primitives have a significant impact on performance
Producer-Consumer Throughput
[Figure: producer-consumer throughput chart; annotated values: 30%, 13.8x, 19x]
MY RESEARCH IN CONTEXT
More energy savings in the dataflow model
Dynamic optimizations in cloud stream processing
• Energy consumption in mobile sensing applications
– Energy bugs introduced by power management primitives
• Program analysis on code paths and potential races
– Improper usage of I/O components prolongs tail power states
• Defer and batch I/O operations to execute in a short interval
– Intensive computations
• Code offloading to cloud, AMP cores or GPGPU
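The benefit of deferring and batching I/O can be shown with back-of-the-envelope arithmetic. The model and all numbers below are assumptions for illustration: each isolated transfer is followed by a full high-power tail, while a batch of transfers pays for only one tail.

```python
def separate_energy(n, e_transfer, p_tail, t_tail):
    """n isolated transfers: every transfer is followed by its own tail."""
    return n * (e_transfer + p_tail * t_tail)


def batched_energy(n, e_transfer, p_tail, t_tail):
    """n transfers issued back to back: only one tail after the batch."""
    return n * e_transfer + p_tail * t_tail


# Assumed figures: 10 uploads, 1 J per upload, 0.5 W tail power, 4 s tail.
SEPARATE = separate_energy(10, 1.0, 0.5, 4.0)
BATCHED = batched_energy(10, 1.0, 0.5, 4.0)
```

With these assumed figures the isolated uploads cost 30 J and the batched uploads cost 12 J; the saving grows with the number of transfers and with the tail duration.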
Energy Efficiency and Stream Processing
Research of Interest: Energy Efficiency
• Inconsistent throughputs
– Critical path with bottleneck processes
• Dynamic Voltage and Frequency Scaling (DVFS)
– Dispatch partitions to cores at different frequencies
• Save energy while maintaining performance
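The idea of running non-bottleneck partitions at lower frequencies can be sketched as follows. This is a generic illustration rather than the GreenStreams algorithm: the partition workloads, the discrete frequency levels, and the energy model (energy per item proportional to cycles times frequency squared, a common approximation when voltage scales with frequency) are all assumptions.

```python
def scale_frequencies(work, levels):
    """Pick the lowest frequency per partition that keeps the bottleneck pace.

    `work` maps a partition to its cycles per item; `levels` lists the
    available core frequencies in cycles per time unit.
    """
    period = max(work.values()) / max(levels)  # bottleneck at top frequency
    return {name: min(f for f in levels if cycles / f <= period)
            for name, cycles in work.items()}


def relative_energy(work, freq):
    """Energy per item under the assumed cycles * f^2 model."""
    return sum(cycles * freq[name] ** 2 for name, cycles in work.items())


WORK = {"A": 2, "B": 8, "C": 4}   # hypothetical partitions, B is the bottleneck
LEVELS = [1, 2, 4]
CHOSEN = scale_frequencies(WORK, LEVELS)
BASELINE = relative_energy(WORK, {name: max(LEVELS) for name in WORK})
SCALED = relative_energy(WORK, CHOSEN)
```

In this example throughput is unchanged because every partition still finishes an item within the bottleneck's period, while the relative energy drops from 224 to 146 units.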
DVFS: GreenStreams
[Bartenstein2013]
• New challenges
– Changing input structure
• Computations on the graph of tweet mentions
• Distributed storage of the graph
• Consistent graph representation
– Communication overhead
• Conventional stream processing focuses on local computations and is less concerned with large-scale communication
– Changing performance criteria
• Statically made decisions cannot be optimal the whole time
• Better performance requires making decisions dynamically
– Fault-tolerance
Cloud Stream Processing
Research of Interest: Cloud Computing
• Goal
– Compute timely properties on the changing graph input
• Challenges
– High rate of graph updates
– Consistent graph structure
– Static graph mining algorithms
• Global progress tracking protocol
– Graph updates are queued and progress is tracked in a global table
– Progress snapshots are taken and distributed to perform transactions of graph updates and associated computations
• Pros
– Decouples graph updates from graph computations
• Cons
– Centralized progress tracking
– No analysis of updates that may cancel each other, nor aggregation of the communications they may propagate
Changing Graph: Kineograph
• Goal
– High-throughput, timely, low-latency processing
• Challenges
– Communication overhead dominates local computing resources
– Massive communications cause media contention and high latency
• Efficient batching and localized communications
– Users decide whether to process input synchronously or asynchronously
– Distributed progress tracking based on partially ordered timestamps
• Pros
– Effective aggregation of communications
• Cons
– Flow control might be a concern for asynchronous delivery
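Progress tracking over partially ordered timestamps can be illustrated with a simplified sketch. It assumes a timestamp is a pair of an input epoch and a tuple of loop counters, compared epoch-wise and lexicographically on the counters; it ignores graph locations, which a real progress tracker must also take into account. The function names are hypothetical.

```python
def precedes(t1, t2):
    """Partial order: t1 <= t2 iff both the epoch and the counters are <=."""
    (epoch1, counters1), (epoch2, counters2) = t1, t2
    # Python compares tuples lexicographically, which matches the assumption.
    return epoch1 <= epoch2 and counters1 <= counters2


def frontier(outstanding):
    """Minimal outstanding timestamps: nothing earlier can still arrive."""
    return [t for t in outstanding
            if not any(o != t and precedes(o, t) for o in outstanding)]


A = (1, (0,))      # epoch 1, first loop iteration
B = (0, (1,))      # epoch 0, second loop iteration
C = (1, (2,))      # epoch 1, third loop iteration
FRONTIER = frontier([A, B, C])
```

Here `A` and `B` are incomparable, so both sit on the frontier, while `C` is preceded by each of them and has to wait until they complete.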
Timely Dataflow: Naiad
• Goal
– Efficiently switching between Sync and Async modes for better performance and early termination
• Predict the next better execution mode online
• Pros
– Frees programmers from coding the execution modes explicitly
• Cons
– Separate snapshots and checkpointing under both modes
• Additional space usage
• Unclear performance impact
Execution Modes: PowerSwitch
• Energy efficiency in stream processing
– Dataflow paths as code paths facilitate program analysis
– Graph manipulation for better I/O component access aggregations
– Precise workload requests for computing resources
– Adapting to runtime information, e.g., user activity predictions
• Incorporate dynamic optimizations
– Static optimizations based on fixed resources
– Periodic reevaluations, e.g., execution mode switching
– Changing input structures and distributed state sharing
• Less frequently changing information flow paths for aggregating communications
• Localized multicast for reconfiguration and termination in partial order
Conclusions and Future Work
• Potential limitation on stream processing optimizations
– Non-trivial transformation between SDF graphs at different granularities
Conclusions and Future Work
[Figure: two SDF graphs of the same FFT computation at different granularities; every node is annotated with a triple of numbers such as 8,8,8.]
Coarse-grained FFT: FFTTestSource0, split0, FFTReorderSimple0, FFTReorderSimple1, CombineDFT0, CombineDFT1, CombineDFT2, CombineDFT3, join0, FloatPrinter0
Fine-grained FFT: FloatSource0, split0, Butterfly0, Butterfly1, join0, split1, Butterfly2, Butterfly3, join1, BitReverse0, FloatPrinter0