L9,10:Distributed Stream Processingcds.iisc.ac.in/wp-content/uploads/DS256.2017.L9-10... · The...

Indian Institute of ScienceBangalore, India

भारतीय विज्ञान संस्थान

बंगलौर, भारत

Department of Computational and Data Sciences

©Yogesh Simmhan & Partha Talukdar, 2016This work is licensed under a Creative Commons Attribution 4.0 International LicenseCopyright for external content used with attribution is retained by their original authors

L9,10:Distributed Stream Processing

Yogesh Simmhan1 4 F e b , 2 0 1 7

DS256:Jan17 (3:1)

http://creativecommons.org/licenses/by/4.0/deed.en_US

CDS.IISc.in | Department of Computational and Data Sciences

Stream are Commonplace (too)

Web & Social Networks‣ Twitter, Facebook, Internet packets

Cybersecurity‣ Telecom call logs, financial transactions, Malware

Internet of Things‣ Smart Transport/Power/Water networks

‣ Smart watch/phone/TV/…

14-Feb-17 2


IISc Smart Campus: Water Management

Plan pumping operations for reliability‣ Avoid underflow/overflow of water‣ 12 hrs to fill a large OHT, scarcity in summer weeks

Provide safer water‣ Leakages, contamination from decades old network

Reduce water usage for sustainability‣ IISc average: 400 Lit/day, Global standard: 135 Lit/day‣ Lack of visibility on usage footprint, sources‣ Opportunities for water harvesting, recycling

Lower the cost‣ Reduce cost for water use & electricity for pumping

14-Feb-17 3

Over Head Tank (OHT)

Ground Level Reservoir (GLR)

BWSSB Main Inlet

OHT 8

GLR 13

Inlet 4+3

IISc Campus• 440 Acres, 8 Km Perimeter• 50 buildings: Office, Hotel,

Residence, Stores• 10,000 people• Water Use: 40 Lakh Lit/Day• 10MW Power Consumed

14-Feb-17 4

CDS.IISc.in | Department of Computational and Data SciencesOver Head Tanks (OHT)

TPH (near Mechanical) JNT Auditorium Chemical Stores Opposite to CENSE

Opposite to NESARA

Behind old C-Mess

Opposite to Cense (new)

E Type Quarters 14-Feb-17 5


Custom Level + Quality Sensor

Fig. 7. System block diagram

Fig. 8. Scatter plot of measured vs actual distance

Fifty readings are taken every 60cm, and histograms are

plotted at each distance. Finally, all the histograms are merged

into a single scatter plot as shown in figure 8. The spots in this

graph indicate the concentration of readings in an area. Clearly,

accuracy is better at shorter distances. We determined that our

sensor has an accuracy within ±1.5% of full scale, with improved

accuracy at shorter distances. Further experiments at a

swimming pool indicated the sensors performed with the same

accuracy as that observed with the Aluminium board. A picture

of the installation of the sensor on a tank is shown in figure 8.

The system is installed in a Ground Level Reservoir (GLR)

in the campus, with a repeater placed on a nearby Overhead Tank

(OHT) and an access point setup a few hundred metres away.

The data collected every minute from the tank is encapsulated

into packets and transmitted to the repeater, which then forwards

it to the access point. The AP unpacks the received packet and

uploads the level data online to a Google App Engine

application. Figure 10 shows the collected data for half a day. A

few glitches in the data can be seen that can be smoothed after

acquisition. Jitter in readings is only 1-2cm.

Fig. 9. Installation on a tank

Fig. 10. Level data collected from a tank

VII. CONCLUSIONS

We have described an IoT system consisting of custom

sensors interconnected via a sub-GHz based wireless network.

The sensor sub-system uses ultrasonic ranging to determine

water levels in large tanks in the distribution system. Optimised

circuitry and algorithm allowed achieving a range of up to 10m

with an accuracy within ±1.5% of full scale, using low cost

transducers. Initial deployment results are encouraging and as a

next step, we will instrument all the tanks. A further addition is

to include control valves so that we can create a smart

distribution system which distributes just enough water to each

tank to satisfy the local demands.

ACKNOWLEDGMENT

We thank Mr. Alok Rawat for the design of the enclosure.

We thank Mr. Sheetal Kumar and Ms. Anjana of Dept. of Civil

Engineering, Indian Institute of Science, for discussions. We

thank the Robert Bosch Centre for Cyber Physical Systems at

Indian Institute of Science, Bangalore, for funding the project.

REFERENCES

[1] UNDP. Human Development Report, Beyond Scarcity: Power,

Poverty and the Global Water Crisis. United Nations

Development Programme, 2006.

[2] UNDP. Human Development Report, Sustaining Human

Progress: Reducing Vulnerability and Building Resilience.

United Nations Development Programme, 2014.

14-Feb-17 6


Backend

14-Feb-17

ECE BuildingW

W

7


Aadhaar Enrolment

Input is stream of identity enrolment packets

Output is a UIDAI ID (success) or rejection

Each task tagged with Latency (ms)

Each edge tagged with Selectivity‣ Input:output rate, probability of path taken

14-Feb-17Benchmarking Fast Data Platforms for the ‘Aadhaar’ Biometric Database, Shukla, et al, Workshop on Big Data Benchmarking (WBDB), 2015 8


Aadhaar Authentication

14-Feb-17Benchmarking Fast Data Platforms for the ‘Aadhaar’ Biometric Database, Shukla, et al, Workshop on Big Data Benchmarking (WBDB), 2015 9


Fast Data Processing

14-Feb-17 10


Size vs. Latency

14-Feb-17 11Big Data Analytics Platforms for Real-time Applications in IoT, Yogesh Simmhan & Srinath Perera, Big Data Analytics: Methods and Applications, Eds. Saumyadipta Pyne, B.L.S. Prakasa Rao, S.B. Rao, 2016


Application Model Application composed as a Directed Acyclic Graph

(DAG)

Tasks are user-defined logic, vertices of the DAG

Streams are channels between Tasks, edges of DAG

Streams carry “infinite” number of tuples‣ Also called events, messages

Tuples may be opaque, or have Name/Value/Type

14-Feb-17 12

A

B

C

D E


Application Model Tasks executed once for each input tuple

‣ Tasks may emit zero or more tuples for each input

Latency: Time taken to process a single tuple by a task.

Selectivity: Ratio of average number of output tuples

expected for each input tuple (in:out or𝒐𝒖𝒕

𝒊𝒏)

Can be used to calculate input rate at each task, given DAG input rate‣ e.g. i_d = i*s_a*s_b + i*s_a*s_c

14-Feb-17 13

A

B

C

D E


Application Model

Different degrees of parallelism helps exploit multiple resources to complete the execution faster, reduce latency

Task parallelism due to multiple tasks composed and executing in parallel (B&C || D)‣ Two tuples can be concurrently executed on two different

tasks that are independent of each other‣ Different from data parallelism (later…)

Pipelining due to streaming execution‣ Different parts of the infinite stream can be executing at the

same time on different tasks‣ All tasks can execute at the same time, once pipeline filled

Orthogonal concepts, can have one without other

14-Feb-17 14


Common Task Patterns

14-Feb-17 15Benchmarking Distributed Stream Processing Platforms for IoT Applications, Anshu Shukla and Yogesh Simmhan, TPCTC, 2016


Routing Semantics

Multiple outgoing edges‣ Duplicate, Round-robin or Hash

Multiple input edges‣ Interleave: Does a union of tuples entering an input queue.

Number of tuples is the sum of number of tuples from each input stream.

‣ Join: Merges one tuple from each input stream into a single tuple, that is given to the task. Number of tuples is the minimum of all tuples that enter on any input stream.

14-Feb-17 16

A

B

D

E F

C


Application Model Multiple Source/Sink tasks can be present

Count: Number of tasks in the DAG. Determines resource needs.

Width: Widest number of parallel tasks. Task parallelism. (e.g. 3)

Length: Longest number of tasks from a source to a sink (e.g. 4). Similar to Critical path…

Average Edge Degree: Hotspots, affects selectivity

14-Feb-17 17

B

D

E

F H

GA

C


Application Model Causally Dependent Messages: Set of output tuple

generated as a result of an input tuple.‣ What happens with aggregation? Sliding window?

Critical Path: Longest latency from the source to the sink. ‣ Determines the slowest causally dependent output, for a

given input.

‣ Includes task latency, I/O queue delay, and NW time

14-Feb-17 18

A

B

D

E F

C


Application Model Tasks may have one or more input and output

“ports”

Makes the routing semantics explicit in the composition‣ Join between tuples on different ports‣ Hash by writing to explicit output port

14-Feb-17 19

A

B

C

D

F

E

“hash”

“duplicate”


Application Model

Visual DAG composition

Programming abstractions‣ Task centric view (Storm)

‣ Data centric view (Spark)

14-Feb-17 20

val wordsDs = src.flatMap(value => value.split("\\s+"))val wordsPairDs = wordsDs.groupByKey(value => value)val wordCountDs = wordsPairDs.count()

filteredDS = wordsDs.filter(value => value =="hello")

src Split

Filter

Count

Hash

http://blog.madhukaraphatak.com/introduction-to-spark-two-part-3/


14-Feb-17 21


Execution Model

Input and output queue for each task‣ Buffers tuples

Multiple Threads for same Task‣ Allows and controls data parallel execution

Each thread can operate on one of the tuples in input queue‣ What happens to ordering?

14-Feb-17 22


Execution Model

Tuple Ordering

Can we guarantee that tuples are processing in specific order?‣ Difficult

‣ Needs logical timestamps

‣ Physical time-stamps vs. time skew

Guarantee at the source vs. each task in the DAG

14-Feb-17 23


Execution Model

Stateful vs. Stateless

Do tasks retain state?

Is state shared across threads?

What is the impact on aggregation operations? Hash keys

14-Feb-17 24


Execution Model

Delivery Guarantees

Best effort

At least once delivery

Exactly once delivery

Need to keep track of progress. Replay if necessary.

14-Feb-17 25


Distributed Stream Processing Systems

Aurora – Early Research System

Borealis – Early Research System

Apache Storm

Apache S4

Apache Samza

Google MillWheel

Amazon Kinesis

LinkedIn Databus

Facebook Puma/Ptail/Scribe/ODS

Azure Stream Analytics

Apache Flink

FlumeJava

NiFi

Google Dataflow

Spark Streaming

Apache Beam

© Programming Models for IoT and Streaming Data, Qiu14-Feb-17 26


Event Processing vs. Stream Processing Tuples are “transparent”‣ Columns, values

Query Based‣ Complex Event processing

‣ SQL like languages over continuous tuples

‣ Tasks are operators with have specific semantic meaning

Time operators included with ‣ window, sequence, group, merge, trigger

The Dataflow Model, Akidau, et al., VLDB 201514-Feb-17 27


Reading

Ankit Toshniwal, et al. Storm@twitter. In ACM SIGMOD, 2014

Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, Zaharia, et al, USENIX HotCloud, 2012, https://www.usenix.org/conference/hotcloud12/workshop-program/presentation/zaharia

Leonardo Neumeyer, et al, S4: Distributed Stream Computing Platform. In ICDMW 2010

14-Feb-17 28

https://www.usenix.org/conference/hotcloud12/workshop-program/presentation/zaharia

Date post:	20-May-2020
Category:	Documents
Upload:	others
View:	1 times
Download:	0 times

L9,10:Distributed Stream Processingcds.iisc.ac.in/wp-content/uploads/DS256.2017.L9-10... · The...

Documents