Indian Institute of ScienceBangalore, India
भारतीय विज्ञान संस्थान
बंगलौर, भारत
Department of Computational and Data Sciences
©Yogesh Simmhan & Partha Talukdar, 2016This work is licensed under a Creative Commons Attribution 4.0 International LicenseCopyright for external content used with attribution is retained by their original authors
L9,10:Distributed Stream Processing
Yogesh Simmhan1 4 F e b , 2 0 1 7
DS256:Jan17 (3:1)
CDS.IISc.in | Department of Computational and Data Sciences
Stream are Commonplace (too)
Web & Social Networks‣ Twitter, Facebook, Internet packets
Cybersecurity‣ Telecom call logs, financial transactions, Malware
Internet of Things‣ Smart Transport/Power/Water networks
‣ Smart watch/phone/TV/…
14-Feb-17 2
CDS.IISc.in | Department of Computational and Data Sciences
IISc Smart Campus: Water Management
Plan pumping operations for reliability‣ Avoid underflow/overflow of water‣ 12 hrs to fill a large OHT, scarcity in summer weeks
Provide safer water‣ Leakages, contamination from decades old network
Reduce water usage for sustainability‣ IISc average: 400 Lit/day, Global standard: 135 Lit/day‣ Lack of visibility on usage footprint, sources‣ Opportunities for water harvesting, recycling
Lower the cost‣ Reduce cost for water use & electricity for pumping
14-Feb-17 3
Over Head Tank (OHT)
Ground Level Reservoir (GLR)
BWSSB Main Inlet
OHT 8
GLR 13
Inlet 4+3
IISc Campus• 440 Acres, 8 Km Perimeter• 50 buildings: Office, Hotel,
Residence, Stores• 10,000 people• Water Use: 40 Lakh Lit/Day• 10MW Power Consumed
14-Feb-17 4
CDS.IISc.in | Department of Computational and Data SciencesOver Head Tanks (OHT)
TPH (near Mechanical) JNT Auditorium Chemical Stores Opposite to CENSE
Opposite to NESARA
Behind old C-Mess
Opposite to Cense (new)
E Type Quarters 14-Feb-17 5
CDS.IISc.in | Department of Computational and Data Sciences
Custom Level + Quality Sensor
Fig. 7. System block diagram
Fig. 8. Scatter plot of measured vs actual distance
Fifty readings are taken every 60cm, and histograms are
plotted at each distance. Finally, all the histograms are merged
into a single scatter plot as shown in figure 8. The spots in this
graph indicate the concentration of readings in an area. Clearly,
accuracy is better at shorter distances. We determined that our
sensor has an accuracy within ±1.5% of full scale, with improved
accuracy at shorter distances. Further experiments at a
swimming pool indicated the sensors performed with the same
accuracy as that observed with the Aluminium board. A picture
of the installation of the sensor on a tank is shown in figure 8.
The system is installed in a Ground Level Reservoir (GLR)
in the campus, with a repeater placed on a nearby Overhead Tank
(OHT) and an access point setup a few hundred metres away.
The data collected every minute from the tank is encapsulated
into packets and transmitted to the repeater, which then forwards
it to the access point. The AP unpacks the received packet and
uploads the level data online to a Google App Engine
application. Figure 10 shows the collected data for half a day. A
few glitches in the data can be seen that can be smoothed after
acquisition. Jitter in readings is only 1-2cm.
Fig. 9. Installation on a tank
Fig. 10. Level data collected from a tank
VII. CONCLUSIONS
We have described an IoT system consisting of custom
sensors interconnected via a sub-GHz based wireless network.
The sensor sub-system uses ultrasonic ranging to determine
water levels in large tanks in the distribution system. Optimised
circuitry and algorithm allowed achieving a range of up to 10m
with an accuracy within ±1.5% of full scale, using low cost
transducers. Initial deployment results are encouraging and as a
next step, we will instrument all the tanks. A further addition is
to include control valves so that we can create a smart
distribution system which distributes just enough water to each
tank to satisfy the local demands.
ACKNOWLEDGMENT
We thank Mr. Alok Rawat for the design of the enclosure.
We thank Mr. Sheetal Kumar and Ms. Anjana of Dept. of Civil
Engineering, Indian Institute of Science, for discussions. We
thank the Robert Bosch Centre for Cyber Physical Systems at
Indian Institute of Science, Bangalore, for funding the project.
REFERENCES
[1] UNDP. Human Development Report, Beyond Scarcity: Power,
Poverty and the Global Water Crisis. United Nations
Development Programme, 2006.
[2] UNDP. Human Development Report, Sustaining Human
Progress: Reducing Vulnerability and Building Resilience.
United Nations Development Programme, 2014.
14-Feb-17 6
CDS.IISc.in | Department of Computational and Data Sciences
Aadhaar Enrolment
Input is stream of identity enrolment packets
Output is a UIDAI ID (success) or rejection
Each task tagged with Latency (ms)
Each edge tagged with Selectivity‣ Input:output rate, probability of path taken
14-Feb-17Benchmarking Fast Data Platforms for the ‘Aadhaar’ Biometric Database, Shukla, et al, Workshop on Big Data Benchmarking (WBDB), 2015 8
CDS.IISc.in | Department of Computational and Data Sciences
Aadhaar Authentication
14-Feb-17Benchmarking Fast Data Platforms for the ‘Aadhaar’ Biometric Database, Shukla, et al, Workshop on Big Data Benchmarking (WBDB), 2015 9
CDS.IISc.in | Department of Computational and Data Sciences
Size vs. Latency
14-Feb-17 11Big Data Analytics Platforms for Real-time Applications in IoT, Yogesh Simmhan & Srinath Perera, Big Data Analytics: Methods and Applications, Eds. Saumyadipta Pyne, B.L.S. Prakasa Rao, S.B. Rao, 2016
CDS.IISc.in | Department of Computational and Data Sciences
Application Model Application composed as a Directed Acyclic Graph
(DAG)
Tasks are user-defined logic, vertices of the DAG
Streams are channels between Tasks, edges of DAG
Streams carry “infinite” number of tuples‣ Also called events, messages
Tuples may be opaque, or have Name/Value/Type
14-Feb-17 12
A
B
C
D E
CDS.IISc.in | Department of Computational and Data Sciences
Application Model Tasks executed once for each input tuple
‣ Tasks may emit zero or more tuples for each input
Latency: Time taken to process a single tuple by a task.
Selectivity: Ratio of average number of output tuples
expected for each input tuple (in:out or𝒐𝒖𝒕
𝒊𝒏)
Can be used to calculate input rate at each task, given DAG input rate‣ e.g. i_d = i*s_a*s_b + i*s_a*s_c
14-Feb-17 13
A
B
C
D E
CDS.IISc.in | Department of Computational and Data Sciences
Application Model
Different degrees of parallelism helps exploit multiple resources to complete the execution faster, reduce latency
Task parallelism due to multiple tasks composed and executing in parallel (B&C || D)‣ Two tuples can be concurrently executed on two different
tasks that are independent of each other‣ Different from data parallelism (later…)
Pipelining due to streaming execution‣ Different parts of the infinite stream can be executing at the
same time on different tasks‣ All tasks can execute at the same time, once pipeline filled
Orthogonal concepts, can have one without other
14-Feb-17 14
CDS.IISc.in | Department of Computational and Data Sciences
Common Task Patterns
14-Feb-17 15Benchmarking Distributed Stream Processing Platforms for IoT Applications, Anshu Shukla and Yogesh Simmhan, TPCTC, 2016
CDS.IISc.in | Department of Computational and Data Sciences
Routing Semantics
Multiple outgoing edges‣ Duplicate, Round-robin or Hash
Multiple input edges‣ Interleave: Does a union of tuples entering an input queue.
Number of tuples is the sum of number of tuples from each input stream.
‣ Join: Merges one tuple from each input stream into a single tuple, that is given to the task. Number of tuples is the minimum of all tuples that enter on any input stream.
14-Feb-17 16
A
B
D
E F
C
CDS.IISc.in | Department of Computational and Data Sciences
Application Model Multiple Source/Sink tasks can be present
Count: Number of tasks in the DAG. Determines resource needs.
Width: Widest number of parallel tasks. Task parallelism. (e.g. 3)
Length: Longest number of tasks from a source to a sink (e.g. 4). Similar to Critical path…
Average Edge Degree: Hotspots, affects selectivity
14-Feb-17 17
B
D
E
F H
GA
C
CDS.IISc.in | Department of Computational and Data Sciences
Application Model Causally Dependent Messages: Set of output tuple
generated as a result of an input tuple.‣ What happens with aggregation? Sliding window?
Critical Path: Longest latency from the source to the sink. ‣ Determines the slowest causally dependent output, for a
given input.
‣ Includes task latency, I/O queue delay, and NW time
14-Feb-17 18
A
B
D
E F
C
CDS.IISc.in | Department of Computational and Data Sciences
Application Model Tasks may have one or more input and output
“ports”
Makes the routing semantics explicit in the composition‣ Join between tuples on different ports‣ Hash by writing to explicit output port
14-Feb-17 19
A
B
C
D
F
E
“hash”
“duplicate”
CDS.IISc.in | Department of Computational and Data Sciences
Application Model
Visual DAG composition
Programming abstractions‣ Task centric view (Storm)
‣ Data centric view (Spark)
14-Feb-17 20
val wordsDs = src.flatMap(value => value.split("\\s+"))val wordsPairDs = wordsDs.groupByKey(value => value)val wordCountDs = wordsPairDs.count()
filteredDS = wordsDs.filter(value => value =="hello")
src Split
Filter
Count
Hash
http://blog.madhukaraphatak.com/introduction-to-spark-two-part-3/
CDS.IISc.in | Department of Computational and Data Sciences
Execution Model
Input and output queue for each task‣ Buffers tuples
Multiple Threads for same Task‣ Allows and controls data parallel execution
Each thread can operate on one of the tuples in input queue‣ What happens to ordering?
14-Feb-17 22
CDS.IISc.in | Department of Computational and Data Sciences
Execution Model
Tuple Ordering
Can we guarantee that tuples are processing in specific order?‣ Difficult
‣ Needs logical timestamps
‣ Physical time-stamps vs. time skew
Guarantee at the source vs. each task in the DAG
14-Feb-17 23
CDS.IISc.in | Department of Computational and Data Sciences
Execution Model
Stateful vs. Stateless
Do tasks retain state?
Is state shared across threads?
What is the impact on aggregation operations? Hash keys
14-Feb-17 24
CDS.IISc.in | Department of Computational and Data Sciences
Execution Model
Delivery Guarantees
Best effort
At least once delivery
Exactly once delivery
Need to keep track of progress. Replay if necessary.
14-Feb-17 25
CDS.IISc.in | Department of Computational and Data Sciences
Distributed Stream Processing Systems
Aurora – Early Research System
Borealis – Early Research System
Apache Storm
Apache S4
Apache Samza
Google MillWheel
Amazon Kinesis
LinkedIn Databus
Facebook Puma/Ptail/Scribe/ODS
Azure Stream Analytics
Apache Flink
FlumeJava
NiFi
Google Dataflow
Spark Streaming
Apache Beam
© Programming Models for IoT and Streaming Data, Qiu14-Feb-17 26
CDS.IISc.in | Department of Computational and Data Sciences
Event Processing vs. Stream Processing Tuples are “transparent”‣ Columns, values
Query Based‣ Complex Event processing
‣ SQL like languages over continuous tuples
‣ Tasks are operators with have specific semantic meaning
Time operators included with ‣ window, sequence, group, merge, trigger
The Dataflow Model, Akidau, et al., VLDB 201514-Feb-17 27
CDS.IISc.in | Department of Computational and Data Sciences
Reading
Ankit Toshniwal, et al. Storm@twitter. In ACM SIGMOD, 2014
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, Zaharia, et al, USENIX HotCloud, 2012, https://www.usenix.org/conference/hotcloud12/workshop-program/presentation/zaharia
Leonardo Neumeyer, et al, S4: Distributed Stream Computing Platform. In ICDMW 2010
14-Feb-17 28