© 2016 IBM Corporation
Hybrid solution analysis of streaming
sensor data with Spark Streaming & Kafka
Virender Thakur, IBM Big Data Specialist
Big Data Developers, NY City
April 27, 2016
© 2015 IBM Corporation2
Agenda
Sensor data, the Internet of Things, and hybrid clouds
Overview of technology components used in the solution
Spark Streaming
Apache Kafka
IBM Bluemix
Node-RED
Secure Gateway
Demo Scenario Overview
Demo
Questions
© 2015 IBM Corporation3
Agenda
Sensor data, the Internet of Things, and hybrid clouds
Overview of technology components used in the solution
Spark Streaming
Apache Kafka
IBM Bluemix
Node-RED
Secure Gateway
Demo Scenario Overview
Demo
Questions
© 2015 IBM Corporation4
Internet of Things (IoT)
• Connecting devices through the Internet
• computing devices, appliances, humans and other living beings
• Insight gained by analyzing the data produced by IoT devices can be used to
improve a vast array of items and experiences throughout the world
• The following graphic is from a 2015 report published by IBM
© 2015 IBM Corporation5
Hybrid Cloud
The Internet of Things is helping to drive an explosion in the use of hybrid clouds
© 2015 IBM Corporation7
Agenda
Sensor data, the Internet of Things, and hybrid clouds
Overview of technology components used in the solution
Spark Streaming
Apache Kafka
IBM Bluemix
Node-RED
Secure Gateway
Demo Scenario Overview
Demo
Questions
© 2015 IBM Corporation9
Spark Streaming
• Scalable, high-throughput, fault-tolerant stream processing
• Data can be ingested from many sources
• Data can be processed using complex algorithms,
including Spark machine learning and graph processing
• Processed data can be pushed out to filesystems, databases and dashboards.
© 2015 IBM Corporation10
Spark Streaming
• Receives live input data streams
• Divides the data into batches
• Batches are processed by the Spark engine
• Stream of results in batches are generated
© 2015 IBM Corporation11
Discretized Streams (Dstreams)
• Basic abstraction provided by Spark Streaming
• Represents a continuous stream of data
• Input data stream received from source
• Processed data stream generated by transforming the input stream
• Represented by a continuous series of RDDs
• RDD is Spark’s abstraction of an immutable, distributed dataset
• Each RDD in a Dstream contains data from a certain interval
© 2015 IBM Corporation12
Window Operations
• Allow transformations to be applied over a sliding window of data
• Window slides over a source Dstream
• Source RDDs that fall in the window are combined and operated upon to
produce RDDs of the windowed Dstream
• Window operation requires two parameters
• Window length = the duration of the window
• sliding interval = the interval at which the window operation is performed
© 2015 IBM Corporation13
Apache Kafka
• Distributed, partitioned, replicated commit log service
• Maintains feeds of messages in categories called topics
• Producers are processes that publish messages to a Kafka topic
• Consumers subscribe to topics and process the feed of published messages
• Runs as a cluster comprised of one or more broker servers
• TCP protocol communication between clients and servers
© 2015 IBM Corporation14
Anatomy of a Topic
• Kafka maintains a partitioned log for each topic
• Partition
• an ordered, immutable sequence of messages that is continually appended to
• distributed over the servers in the Kafka cluster and replicated for fault
tolerance
• Messages in the partitions are each assigned a unique sequential id number
(offset)
• Kafka retains all published messages for a configurable period of time
– whether or not they have been consumed
© 2015 IBM Corporation1616
Open-standards, cloud-based platform for building,
running, and managing applications
Build your apps, your way
Use the most prominent
compute technologies to
power your app: Cloud
Foundry, Docker,
OpenStack.
Extend apps with services
A catalog of IBM, third party,
and open source services
allow the developer to stitch
an application together
quickly.
Scale more than just
instances
Development, monitoring,
deployment, and logging
tools allow the developer to
run and manage the entire
application.
Layered Security
IBM secures the platform and
infrastructure and provides
you with the tools to secure
your apps.
Deploy and manage hybrid
apps seamlessly
Get a seamless dev and
management experience
across a number of hybrid
implementations options.
Flexible Pricing
Try compute options and
services for free and, when
you’re ready, pay only for what
you use. Pay as you go and
subscription models offer
choice and flexibility.Coming Summer 2015
IBM Bluemix
© 2015 IBM Corporation17
Underlined by three key open compute technologies: Cloud Foundry, Docker, and OpenStack.
It extends each of these with a growing number of services, robust DevOps tooling, integration
capabilities, and a seamless developer experience.
Flexible Compute Options to Run Apps / Services
Instant Runtimes Containers Virtual Machines
Platform Deployment Options that Meet Your Workload Requirements
Bluemix
Public
Bluemix
Dedicated
Bluemix
Local*
DevOps
Tooling Your Own Hosted Apps / Services
Integration and
API Mgmt
Powered by IBM SoftLayer In Your Data Center
+ + +
+ +
Catalog of Services that Extend Apps’ Functionality
Web Data Mobile AnalyticsCognitive IoT Security Yours
+
IBM Bluemix
© 2015 IBM Corporation18
Node-RED
• Browser-based flow editor
• Visually wire together hardware devices, APIs and online services
• Built on Node.js and its event-driven, non-blocking model
• Includes a comprehensive built-in library of nodes
• Flows are deployed to the runtime in a single-click
• JavaScript functions can be created within the editor using a rich text editor
© 2015 IBM Corporation19
Bluemix Secure Gateway
• Provides secure connectivity and establishes a tunnel between Bluemix and a
remote location (on-premises or cloud)
• The Secure Gateway UI or REST API is used to connect to your client and
create a destination point
• To increase security, you can add TLS to encrypt the data
• The behavior of your gateways and destinations can be monitored in the Secure
Gateway Dashboard
© 2015 IBM Corporation20
Agenda
Sensor data, the Internet of Things, and hybrid clouds
Overview of technology components used in the solution
Spark Streaming
Apache Kafka
IBM Bluemix
Node-RED
Secure Gateway
Demo Scenario Overview
Demo
Questions
© 2015 IBM Corporation24
Kafka Setup
# /usr/iop/current/kafka-broker/bin/kafka-topics.sh --create --zookeeper
rvm.svl.ibm.com:2181 --replication-factor 1 --partitions 1 --topic measures
# /usr/iop/current/kafka-broker/bin/kafka-console-producer.sh --broker-list
rvm.svl.ibm.com:6667 --topic measures > /dev/null 2>&1
# /usr/iop/current/kafka-broker/bin/kafka-console-consumer.sh --zookeeper
rvm.svl.ibm.com:2181 --from-beginning --topic measures
© 2015 IBM Corporation25
Spark Streaming
time = 10 sec
reading from sensors
every 2 sec
Window Length = 30 sec (average over 30 second rolling window)
Slide Interval = 20 sec (report average every 20 seconds)
Report if object temperature exceeds 25 C
© 2015 IBM Corporation26
Spark Stream Application – Scala Main Class
…
object StreamingSensor {
def main(args: Array[String]): Unit = {
val streamingRateSeconds = 10
val conf = new SparkConf().setAppName("SteamingSensor")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(streamingRateSeconds))
}
…
© 2015 IBM Corporation27
Spark Stream Application – Kafka Integration
…
val zkQuorum = "rvm.svl.ibm.com:2181“
val inputGroup = “MyGroup"
val topic = "measures"
val topicMap = Map( topic -> 1 )
val kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, inputGroup,
topicMap).map(_._2)
…
© 2015 IBM Corporation28
Spark Stream Application – Ingest CSV Data
…
//define the schema using a case class
case class Sensor(sensorid: String, temp: Integer, humidity: Integer, objectTemp:
Integer)
object SensorObject {
// function to parse line of csv data into Sensor class
def parseSensor(str: String): Sensor = {
val p = str.split(",")
Sensor(p(0), p(1).toInt, p(2).toInt, p(3).toInt)
}
}
val SensorReadings = kafkaStream.map(SensorObject.parseSensor)
…
© 2015 IBM Corporation29
Spark Stream Application – Calculate Average Sensor Data
…
val windowLength = 30
val slideInterval = 20
val SensorWindow =
SensorReadings.window(Seconds(windowLength),Seconds(slideInterval))
val SensorCountByKey = SensorWindow.map(sensor => (sensor.sensorid,
(sensor.objectTemp, 1)))
val SensorAddByKey = SensorCountByKey.reduceByKey((x, y) => (x._1 + y._1,
x._2+ y._2))
val SensorAverage = SensorAddByKey.map(x => (x._1, x._2._1.toFloat/
x._2._2.toFloat) )
…
© 2015 IBM Corporation30
Spark Stream Application – Look for Data Exceeding Threshold
…
val objectTempThreshold = 25
val numexceptions = 5
// foreachRDD performs function on each RDD in DStream
SensorWindow.foreachRDD { rdd =>
// filter sensor data for objectTemperature above threshold
val alertRDD = rdd.filter(sensor => sensor.objectTemp >
objectTempThreshold).map(sensor => (sensor.sensorid,
sensor.objectTemp)).takeSample(false, numexceptions).foreach(println)
}
…
© 2015 IBM Corporation32
Submit Spark Streaming Application
# spark-submit --class "StreamingSensor" --master yarn-client \
target/scala-2.10/Spark-Steaming-of-Sensor-Data-assembly-0.0.1.jar
Notes:
• spark-submit does not automatically include the package containing KafkaUtils
• You need to include KafkaUtils in your project JAR
• For that you need to create an all inclusive uber-jar (ex. using sbt-assembly)
• sbt-assembly is a sbt plugin to create a fat JAR of an sbt project with all of
its dependencies.
© 2015 IBM Corporation33
Agenda
Sensor data, the Internet of Things, and hybrid clouds
Overview of technology components used in the solution
Spark Streaming
Apache Kafka
IBM Bluemix
Node-RED
Secure Gateway
Demo Scenario Overview
Demo
Questions
© 2015 IBM Corporation34
Demo Review
IBM
Secure
Gateway
browser
Bluemixreading from sensors
every 2 sec
CSV
• average data over 30 second rolling window
• report average every 20 seconds
• report if object temperature exceeds 25 C
© 2015 IBM Corporation35
Agenda
Sensor data, the Internet of Things, and hybrid clouds
Overview of technology components used in the solution
Spark Streaming
Apache Kafka
IBM Bluemix
Node-RED
Secure Gateway
Demo Scenario Overview
Demo
Questions