Date post: | 01-Dec-2014 |
Category: | Technology |
Upload: | stephan-reimann |
© 2014 IBM Corporation
Take action on sensor data in real-time based on analytics in R
Easy streaming analytics with InfoSphere Streams
Stephan Reimann – IT Specialist Big Data – [email protected]
Wilfried Hoge – IT Architect Big Data – [email protected]
The sensor data challenge
Innovations (not only) for the Internet of Things
Big Data Meetup Berlin – 2014-10-23
Stephan Reimann – IT Specialist Big Data – [email protected]
@stereimann
de.linkedin.com/in/stephanreimann/
Sensor data present an enormous business opportunity across industries
Manufacturing
IBM uses automated quality testing data to detect anomalies in the semiconductor wafer manufacturing process and minimize wafer loss
Source and more details
Connected Car
PSA Peugeot Citroën uses IBM Big Data technologies, including InfoSphere BigInsights, as the basis of its connected services initiative to bring additional services to vehicle owners
Source
Industrial / Transport
Pratt & Whitney
“reduce maintenance costs by up to 20 percent”
“... less disruptions, and removals, and when the engine is in the shop, targeted repairs so the engine can come out of the shop quickly”
Source
Industrie 4.0 Energy & Utilities Connected Car Healthcare
... It improves efficiency & quality and enables new business models ... and there are many more opportunities
Your fitness devices are also part of it!
Making sense means ...
– detecting hidden correlations
– predicting future behavior (e.g. predictive maintenance)
– detecting outliers
It is not about having the data, it is about using analytics to make sense of it and creating value
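As a small illustration of "detecting outliers", here is a minimal Python sketch. It assumes a simple z-score rule; the function name, the threshold of 2.5 and the temperature values are made up for this example and are not taken from any of the projects mentioned above.

```python
from statistics import mean, stdev


def find_outliers(readings, threshold=2.5):
    """Return (index, value) pairs whose z-score exceeds the threshold.

    Hypothetical helper: a plain z-score rule, chosen only for illustration.
    """
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        # All readings are identical, so nothing can be an outlier.
        return []
    return [(i, x) for i, x in enumerate(readings)
            if abs(x - mu) / sigma > threshold]


# Made-up temperature readings with one obvious spike.
temperatures = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 95.0, 70.1, 69.7, 70.0]
print(find_outliers(temperatures))
```

With very short series a single spike inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.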
The hard part of making sense is doing it, because ...
– the topic is relatively new and there are not many out-of-the-box solutions, so you
probably have to create your own solution
– it is a great opportunity to be innovative and gain competitive advantage
– creating your own solution will typically require using tools such as the Hadoop
framework, R, probably something for data preparation and reporting, ...
– this often means heavy programming, ... Think of available skills and time to market
Analyzing large historical sensor data sets
requires flexible and easy-to-use tools
Innovation #1: SQL on Hadoop
Sensor data is a very special kind of structured data
• A lot of different sources
• Structure differs between sources (e.g. number
of attributes, value encodings, ...) and is usually
evolving
• (very) high volume
Use SQL on Hadoop -> Big SQL
– widely used, leverage existing skills
– Declarative: what you want vs. how to get it
– Use your existing tools
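The "what you want vs. how to get it" point can be sketched in a few lines of Python. This sketch uses the standard-library `sqlite3` module purely as a stand-in SQL engine; it is not Big SQL, and the table name, column names and readings are assumptions made up for the example.

```python
import sqlite3

# Made-up readings: (sensor_id, value)
readings = [
    ("sensor-a", 21.5), ("sensor-a", 22.0), ("sensor-b", 19.0),
    ("sensor-b", 18.5), ("sensor-a", 22.5),
]


def average_by_sensor_imperative(rows):
    """'How to get it': a hand-written aggregation loop."""
    totals = {}
    for sensor_id, value in rows:
        count, total = totals.get(sensor_id, (0, 0.0))
        totals[sensor_id] = (count + 1, total + value)
    return {s: total / count for s, (count, total) in totals.items()}


def average_by_sensor_sql(rows):
    """'What you want': one declarative statement, the engine decides how."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, not Big SQL
    conn.execute("CREATE TABLE readings (sensor_id TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    result = dict(conn.execute(
        "SELECT sensor_id, AVG(value) FROM readings GROUP BY sensor_id"))
    conn.close()
    return result


print(average_by_sensor_imperative(readings))
print(average_by_sensor_sql(readings))
```

Both functions return the same averages; the SQL version states the result and leaves the execution strategy to the engine.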
Sensor data usually requires flexible schemas
Analytics on sensor data isn’t special: it should be as simple as always
Databases are not the primary choice, due to the flexible schema
Hadoop is pretty well suited to analyze sensor data in its raw format
But databases have SQL, which offers a very easy way to prepare and analyze structured data
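A flexible schema on raw data can be pictured with a short Python sketch of schema-on-read: records from different sources are stored as they arrive, and a structure is applied only when they are read. The field names (`temp_c`, `temp_f`, `humidity`, `battery`) and the sample records are assumptions made up for this illustration.

```python
import json

# Raw records as they might arrive from two sources with different attributes.
RAW_LINES = [
    '{"source": "A", "ts": 1, "temp_c": 21.5}',
    '{"source": "B", "ts": 1, "temp_f": 70.7, "humidity": 40}',
    '{"source": "A", "ts": 2, "temp_c": 22.0, "battery": 0.9}',
]


def read_temperature_c(line):
    """Apply a schema at read time: keep only the fields this query needs."""
    record = json.loads(line)
    if "temp_c" in record:
        temp_c = record["temp_c"]
    elif "temp_f" in record:
        # Source B encodes temperature in Fahrenheit; normalize on read.
        temp_c = (record["temp_f"] - 32) * 5 / 9
    else:
        temp_c = None
    return {"source": record["source"], "ts": record["ts"], "temp_c": temp_c}


rows = [read_temperature_c(line) for line in RAW_LINES]
for row in rows:
    print(row)
```

New attributes such as `battery` do not break the reader; they are simply ignored until a query asks for them.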
Big SQL combines the simplicity of SQL with the flexibility of Hadoop
Big SQL is an IBM innovation that provides rich, robust,
standards-based SQL support for data stored in
InfoSphere BigInsights (IBM’s Hadoop distribution)
– Full support for subqueries
– OLAP operations, grouping sets, analytic
aggregates, ...
– All standard join operators (get value from
combining data)
– Use your existing queries and tools
No proprietary storage format
– Never need to copy data to a proprietary
representation
– It is not a database, it is running in Hadoop, on
standard data formats
Big SQL = easy to do SQL combined with the
flexibility of Hadoop (like schema-on-read)
Architecture: SQL-based application → Big SQL (SQL MPP runtime) in InfoSphere BigInsights → data sources (Parquet, CSV, Seq, RC, Avro, ORC, JSON, custom)
Big SQL is architected for performance
Uses its own engine, replacing MapReduce with a
modern MPP architecture
– Compiler and runtime are native code (not Java)
– Big SQL worker daemons live directly on the cluster
– Continuously running (no startup latency)
– Processing happens locally at the data
Architected from the ground up for performance
– low latency and high throughput
– Comprehensive query rewrite and
optimization (cost based optimizer)
Operations occur in memory with the ability
to spill to disk
– Supports aggregations and sorts larger than
available RAM
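The general idea of sorting more data than fits in memory by spilling to disk can be sketched in Python as a classic external merge sort. This is a generic textbook technique shown under stated assumptions, not Big SQL's actual implementation; the function names and the `max_in_memory` limit (a count of values, not bytes) are made up for the example.

```python
import heapq
import os
import tempfile


def _spill(sorted_chunk):
    """Write one sorted chunk to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        for value in sorted_chunk:
            f.write(f"{value}\n")
    return path


def _read_run(path):
    """Stream the values of one spilled run back, one at a time."""
    with open(path) as f:
        for line in f:
            yield float(line)


def external_sort(values, max_in_memory=1000):
    """Yield values in sorted order, buffering at most max_in_memory at once."""
    runs, chunk = [], []
    for value in values:
        chunk.append(value)
        if len(chunk) >= max_in_memory:
            # Memory budget reached: sort the chunk and spill it to disk.
            runs.append(_spill(sorted(chunk)))
            chunk = []
    chunk.sort()
    if not runs:
        # Everything fit in memory, no spill needed.
        yield from chunk
        return
    try:
        # Merge the in-memory remainder with all spilled runs.
        yield from heapq.merge(chunk, *(_read_run(p) for p in runs))
    finally:
        for path in runs:
            os.remove(path)


print(list(external_sort([5, 3, 9, 1, 7, 2, 8], max_in_memory=3)))
```

Values that were spilled come back as floats because they are parsed from text; a real engine would use a typed binary format instead.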
A free QuickStart Edition, labs, tutorials and a developer community provide a fast start with Big SQL
Hadoop Dev: links to videos, white papers, labs, ...
https://developer.ibm.com/hadoop/
Innovation #2: Streaming Analytics
Data don’t have to be stored to be analyzed
Streaming analytics is the key enabler for
real time use cases with sensor data
Traditional approach
– Historical fact finding
– Analyze persisted data
– (Micro-) Batch philosophy
– PULL approach
Streaming analytics
– Analyze the current moment / the now
– Analyze data directly “in Motion” – without
storing it
– Analyze data at the speed it is created
– PUSH approach
Data don’t need to be persisted to be analyzed: streaming analytics represents a paradigm shift that enables real-time use cases
Traditional: Data → Repository → Analysis → Insight | Streaming analytics: Data → Analysis → Insight
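Analyzing data "in motion" without storing it can be sketched in Python with running statistics that are updated as each reading is pushed in. The sketch uses Welford's online algorithm for mean and variance; the class name and the sample values are assumptions for this example, and it is plain Python rather than InfoSphere Streams code.

```python
class RunningStats:
    """Mean and variance updated per event, without keeping the events."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, value):
        """PUSH approach: called once for each arriving reading."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # made-up sensor values
    stats.push(reading)
print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant no matter how many readings arrive, which is what makes this style suitable for unbounded streams.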
InfoSphere Streams is the result of an IBM research project, designed for high throughput and low latency, and to make streaming analytics easy
InfoSphere Streams capabilities
Scale out
– Millions of events per second
Complex data & analytics
– All kinds of data
– Complex analytics: everything you can express via an algorithm
Low latency
– Analyzes data at the speed it is created
– Latencies down to µs
– Immediate action in real time
How it works
– Define apps as flow graphs consisting of
sources (inputs), operators & sinks (outputs)
– Extend the functionality with your code if
required for full flexibility
– The clustered, distributed runtime on
commodity HW scales almost without limit
– GUIs for rapid development and
operations make streaming analytics easy
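The flow-graph idea (sources, operators, sinks) can be pictured with a small Python sketch built from chained generators. It is only an analogy in plain Python and does not use the InfoSphere Streams runtime or its own language; all function names, thresholds and readings are made up for this example.

```python
def sensor_source(readings):
    """Source: emits tuples one at a time (stands in for a live feed)."""
    for sensor_id, value in readings:
        yield {"sensor": sensor_id, "value": value}


def threshold_filter(stream, limit):
    """Operator: passes on only tuples whose value exceeds the limit."""
    for event in stream:
        if event["value"] > limit:
            yield event


def add_severity(stream, critical):
    """Operator: enriches each tuple with a severity label (custom logic)."""
    for event in stream:
        level = "critical" if event["value"] >= critical else "warning"
        yield {**event, "severity": level}


def alert_sink(stream, alerts):
    """Sink: consumes the stream; here it just collects alerts in a list."""
    for event in stream:
        alerts.append(event)


def run_flow(readings):
    """Wire source -> operators -> sink and run the graph to completion."""
    alerts = []
    filtered = threshold_filter(sensor_source(readings), limit=80.0)
    alert_sink(add_severity(filtered, critical=95.0), alerts)
    return alerts


print(run_flow([("a", 70.0), ("b", 85.0), ("a", 99.0)]))
```

Each tuple flows through the whole graph before the next one is read, so nothing is stored along the way except the alerts the sink chooses to keep.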
Streaming analytics is about analyzing all the data, continuously and just in time. It enables a completely new generation of big data apps
Stop just dreaming of real time big data
Start with streaming analytics!!!
Streaming Analytics is already reality ... and is a key component of many innovations
Radio astronomy, Healthcare, TelCo, Transport, Smart Grid, IoT, Connected Car, ...
Free Quickstart Edition: ibm.co/streamsqs
Developer Community (Tutorials, Labs, Forum, ...): www.ibmdw.net/streamsdev/
GitHub Community (Toolkits, Toolkits, Toolkits): github.com/IBMStreams
Where technology meets business potential: Start making sense of your sensor data, everything is prepared!
Big SQL
Easy to do SQL analytics combined with fully flexible schema-on-read and Hadoop capabilities
InfoSphere Streams
Analyzes data at the speed it is created with maximum simplicity and minimum latency
Many more, such as
• Time series functionality
• Efficient transport protocols
• Cloud services (Bluemix)
Gain value from your data
Technology
Innovations make it easy
There are many opportunities to gain value from (not only) sensor data.
Let’s talk about how to make sense of your data! http://www-05.ibm.com/de/events/workshop/bigdata/