
Storm Demo Talk - Denver Apr 2015

Date post: 18-Jul-2015

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Real-Time Processing in Hadoop Big Data for Business

Shane Kumpf & Mac Moore, Solutions Engineers, Hortonworks, April 2015

Page 2 © Hortonworks Inc. 2012 Professional Services

Agenda

§ Introduction & about Hortonworks HDP
§ Overview of logistics industry scenario
§ Overview of streaming architecture on HDP
§ Streaming Demo #1
§ Integrating Predictive Analytics in streaming scenarios
§ Streaming Demo with Predictive additions
§ Q & A

Page 3

Preface: Enabling Technologies

• Problems solved at scale, via fundamentally new approaches…
• Make it possible, even simple, to produce new products/applications that would have been too cost prohibitive, or simply impossible, beforehand.

• Where foundation tech like Li-Ion batteries, retina displays, & tiny HD cameras (from smartphones) have enabled electric cars, quad-copters, VR displays, & more…

• Hadoop has similarly led to breakthroughs in big data capability, and enables new real-time advanced analytic applications.

Page 4

Why did Hadoop emerge?

April 2015

Page 5

Traditional systems under pressure

Challenges
• Constrains data to app
• Can't manage new data
• Costly to scale

[Chart: Business Value against data growth from 2.8 zettabytes (2012) to 40 zettabytes (2020), with curves labeled Laggards and Industry Leaders. 1: Traditional data (ERP, CRM, SCM). 2: New data (Clickstream, Geolocation, Web Data, Internet of Things, Docs and emails, Server logs).]

Page 6

Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP

Spring 2015

Hortonworks. We do Hadoop.

Page 7

Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP

Customer Momentum
• 330+ customers (as of year-end 2014)

Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.

Partner for Customer Success
• Open source community leadership focus on enterprise needs
• Unrivaled world class support

• Founded in 2011
• Original 24 architects, developers, operators of Hadoop from Yahoo!
• 600+ Employees
• 1000+ Ecosystem Partners

Page 8

Customer Partnerships matter
Driving our innovation through Apache Software Foundation Projects

Apache Project   Committers   PMC Members
Hadoop           27           21
Pig              5            5
Hive             18           6
Tez              16           15
HBase            6            4
Phoenix          4            4
Accumulo         2            2
Storm            3            2
Slider           11           11
Falcon           5            3
Flume            1            1
Sqoop            1            1
Ambari           34           27
Oozie            3            2
ZooKeeper        2            1
Knox             13           3
Ranger           10           n/a
TOTAL            161          108
Source: Apache Software Foundation. As of 11/7/2014.

Hortonworkers are the architects and engineers that lead development of open source Apache Hadoop at the ASF

• Expertise: Uniquely capable to solve the most complex issues & ensure success with latest features
• Connection: Provide customers & partners direct input into the community roadmap
• Partnership: We partner with customers with subscription offering. Our success is predicated on yours.

[Chart: Hadoop committers by employer. Hortonworks: 27, Cloudera: 11, Yahoo: 10, Facebook: 5, LinkedIn: 2, IBM: 2, Others: 23.]

Page 9

Technology Partnerships matter

Columns: Apache Project; Hortonworks Relationship (Named Partner, Certified Solution, Resells, Joint Engr)

Microsoft     ✓ ✓ ✓ ✓
HP            ✓ ✓ ✓ ✓
SAS           ✓ ✓ ✓
SAP           ✓ ✓ ✓ ✓
IBM           ✓ ✓ ✓
Pivotal       ✓ ✓ ✓
Red Hat       ✓ ✓ ✓
Teradata      ✓ ✓ ✓ ✓
Informatica   ✓ ✓ ✓
Oracle        ✓ ✓

It is not just about packaging and certifying software… Our joint engineering with our partners drives open source standards for Apache Hadoop. HDP is Apache Hadoop.

Page 10

HDP delivers a Centralized Architecture

Modern Data Architecture
• Unifies data and processing.
• Enables applications to have access to all your enterprise data through an efficient centralized platform
• Supported with a centralized approach to governance, security and operations
• Versatile to handle any applications and datasets no matter the size or type

[Diagram labels. SOURCES: Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured; Existing Systems (ERP, CRM, SCM). Platform: HDFS (Hadoop Distributed File System); YARN: Data Operating System; workloads Interactive, Real-Time, Batch, Partner ISV; MPP; EDW. ANALYTICS: Data Marts, Applications, Business Analytics, Visualization & Dashboards.]

Page 11

Real World Use Case: Trucking Company

Spring  2015  

Hortonworks. We do Hadoop.

Page 12

Scenario Overview

Page 13

Trucking company w/ large fleet of trucks in Midwest

A truck generates millions of events for a given route; an event could be:
§ 'Normal' events: starting / stopping of the vehicle
§ 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance

Company uses an application that monitors truck locations and violations from the truck/driver in real-time

Route? Truck? Driver? Analysts query a broad history to understand if today's violations are part of a larger problem with specific routes, trucks, or drivers

Page 14

Trucking Company's YARN-enabled Architecture

[Diagram labels:]
Truck Sensors
Inbound Messaging (Kafka)
Stream Processing (Storm)
Real-time Serving (HBase)
Alerts & Events (ActiveMQ)
Real-Time User Interface
SQL: Interactive Query (Hive on Tez)
Many Workloads: YARN
Distributed Storage: HDFS
One cluster with consistent security, governance & operations

Page 15

Trucking Company's YARN-enabled Architecture (diagram repeated from Page 14)

Page 16

What is Kafka?
APACHE KAFKA

§ High throughput distributed messaging system
§ Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes
§ Kafka @LinkedIn:
  § 800 billion messages per day
  § 175 terabytes of data written per day
  § 650 terabytes of data read per day
  § Over 13 million messages/2.75GB of data per second

[Diagram: three producers publish to the Kafka Cluster; three consumers read from it.]

Page 17

Kafka: Anatomy of a Topic

[Diagram: a topic split into Partition 0, Partition 1 and Partition 2. Each partition is an ordered sequence of messages numbered from 0 (Old) upward (New); the three partitions end at offsets 9, 11 and 12. Writes are appended at the new end.]

§ Partitioning allows topics to scale beyond a single machine/node
§ Topics can also be replicated, for high availability.
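The partition layout described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the Kafka client API: the `Topic` class, the CRC32 key hashing and the `truck-17` key are all hypothetical, and replication is not modeled.

```python
import zlib


class Topic:
    """Toy in-memory model of a partitioned topic (not the Kafka client API)."""

    def __init__(self, num_partitions=3):
        # Each partition is an append-only list; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # A stable hash of the key picks the partition, so one truck's
        # events stay in order within a single partition (an assumption
        # of this sketch, not something the slides specify).
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(message)
        return partition, len(log) - 1  # (partition, offset of the new message)

    def read(self, partition, from_offset):
        # Consumers remember their own offset and read forward from it.
        return self.partitions[partition][from_offset:]


topic = Topic(num_partitions=3)
p, first = topic.append("truck-17", "start")
_, second = topic.append("truck-17", "speeding")
print(p, first, second, topic.read(p, first))
```

Because each partition is an independent log, partitions can live on different machines, which is the scaling property the slide points to.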

Page 18

Trucking Company's YARN-enabled Architecture (diagram repeated from Page 14)

Page 19

Apache Storm

• Distributed, real time, fault tolerant Stream Processing platform.
• Provides processing guarantees.
• Key concepts include:
  • Tuples
  • Streams
  • Spouts
  • Bolts
  • Topology

Page 20

Tuples and Streams

• What is a Tuple?
  – Fundamental data structure in Storm: a named list of values that can be of any data type.

• What is a Stream?
  – An unbounded sequence of tuples.
  – The core abstraction in Storm; streams are what you "process" in Storm.

Page 21

Spouts

• What is a Spout?
  – Generates, or is a source of, streams
  – E.g.: JMS, Twitter, Log, Kafka Spout
  – Can spin up multiple instances of a Spout and dynamically adjust as needed

Page 22

Bolts

• What is a Bolt?
  – Processes any number of input streams and produces output streams
  – Common processing in bolts: functions, aggregations, joins, reads/writes to data stores, alerting logic
  – Can spin up multiple instances of a Bolt and dynamically adjust as needed

• Bolts used in the Use Case:
  1. HBaseBolt: persisting and counting in HBase
  2. HDFSBolt: persisting into HDFS as Avro files using Flume
  3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
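The counting and alerting behavior described for the HBaseBolt and MonitoringBolt can be sketched as follows. This is a plain-Python illustration, not Storm or HBase code: the `IncidentMonitor` class, the in-memory `Counter` standing in for HBase, and the `alerts` list standing in for email/ActiveMQ are assumptions made for the example.

```python
from collections import Counter


class IncidentMonitor:
    """Counts violations per driver and records an alert past a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()   # stand-in for the counts persisted in HBase
        self.alerts = []          # stand-in for email / ActiveMQ messages

    def on_event(self, driver_id, event_type):
        if event_type == "Normal":
            return  # only violation events are counted
        self.counts[driver_id] += 1
        # Alert only once the count exceeds (not merely reaches) the threshold.
        if self.counts[driver_id] > self.threshold:
            self.alerts.append((driver_id, self.counts[driver_id]))


monitor = IncidentMonitor(threshold=2)
for event_type in ["Overspeed", "Normal", "Overspeed", "Overspeed"]:
    monitor.on_event("driver-11", event_type)
print(monitor.alerts)
```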

Page 23

Topology

• What is a Topology?
  – A network of spouts and bolts wired together into a workflow

Truck-Event-Processor Topology

[Diagram: Kafka Spout, HBase Bolt, Monitoring Bolt, HDFS Bolt and WebSocket Bolt, connected by streams.]
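To make "a network of spouts and bolts wired together" concrete, here is a minimal single-process sketch. It does not use the Storm API; `run_topology`, the bolt functions and the particular wiring (spout to HBase and HDFS bolts, HBase bolt to monitoring bolt) are assumptions chosen for illustration, since the transcript does not preserve the exact arrows of the diagram.

```python
from collections import deque


def run_topology(spout, wiring, bolts):
    """Push every tuple from the spout through the wired bolts.

    spout:  iterable of tuples (here, dicts of named values)
    wiring: component name -> list of downstream bolt names
    bolts:  bolt name -> function(tuple) returning a list of emitted tuples
    """
    for tup in spout:
        pending = deque((name, tup) for name in wiring.get("spout", []))
        while pending:
            name, current = pending.popleft()
            for emitted in bolts[name](current):
                for downstream in wiring.get(name, []):
                    pending.append((downstream, emitted))


stored, archived, alerts = [], [], []


def hbase_bolt(tup):
    stored.append(tup)
    return [tup]  # pass the tuple on to the next bolt


def hdfs_bolt(tup):
    archived.append(tup)
    return []  # terminal bolt: emits nothing


def monitoring_bolt(tup):
    if tup["event_type"] != "Normal":
        alerts.append(tup["driver_id"])
    return []


events = [
    {"driver_id": 11, "event_type": "Normal"},
    {"driver_id": 12, "event_type": "Overspeed"},
]
wiring = {"spout": ["hbase", "hdfs"], "hbase": ["monitoring"]}
bolts = {"hbase": hbase_bolt, "hdfs": hdfs_bolt, "monitoring": monitoring_bolt}
run_topology(events, wiring, bolts)
print(len(stored), len(archived), alerts)
```

In Storm itself the spout and each bolt run as parallel tasks across the cluster; the loop above only shows how tuples flow along the wiring.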

Page 24

Trucking Company's YARN-enabled Architecture (diagram repeated from Page 14)

Page 25

Key Constructs in Apache HBase

• HBase = Key / Value store
• Designed for petabyte scale
• Supports low latency reads, writes and updates

• Key features
  – Updateable records
  – Versioned records
  – Distributed across a cluster of machines
  – Low latency
  – Caching

• Popular use cases:
  – User profiles and session state
  – Object store
  – Sensor apps

Page 26

Data Assignment

[Diagram: HBase Table. Keys within HBase are divided among different RegionServers.]
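One way to picture keys being divided among RegionServers is a sorted list of split points, where each contiguous key range is served by one server. The sketch below assumes exactly one range per server and uses made-up split keys and server names; in a real cluster a RegionServer typically hosts many regions, and none of this is the HBase client API.

```python
import bisect


class RegionMap:
    """Maps a rowkey to the server that holds its key range."""

    def __init__(self, split_keys, servers):
        # n split keys define n + 1 contiguous key ranges.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per key range")
        self.split_keys = sorted(split_keys)
        self.servers = list(servers)

    def server_for(self, rowkey):
        # bisect_right makes each range include its start key
        # and exclude its end key.
        return self.servers[bisect.bisect_right(self.split_keys, rowkey)]


regions = RegionMap(split_keys=["g", "p"], servers=["rs1", "rs2", "rs3"])
print(regions.server_for("driver-11"), regions.server_for("truck-7"))
```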

Page 27

Data Access

• Get
  – Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey

• Put
  – Inserts a new version of a cell.

• Scan
  – Reads the whole table, row by row, or a section of that table starting at a particular start key and ending at a particular end key

• Delete
  – Actually a version of put (adds a new version carrying a deletion marker)

• SQL via Apache Phoenix
  – Unique capability in the NoSQL market
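The four operations above, including delete as "a put with a deletion marker", can be sketched with a toy versioned store. This is an illustration only, not the HBase API: `ToyTable`, the `TOMBSTONE` marker and the counter used as a version clock are assumptions, and column families are omitted.

```python
import itertools

TOMBSTONE = object()  # hypothetical deletion marker


class ToyTable:
    """Toy versioned key/value table keyed by (rowkey, column)."""

    def __init__(self):
        self.cells = {}  # (rowkey, column) -> list of (version, value), newest last
        self._clock = itertools.count(1)

    def put(self, rowkey, column, value):
        # A put never overwrites; it appends a new version of the cell.
        self.cells.setdefault((rowkey, column), []).append((next(self._clock), value))

    def delete(self, rowkey, column):
        # Delete is a put whose value is the deletion marker.
        self.put(rowkey, column, TOMBSTONE)

    def get(self, rowkey, column):
        versions = self.cells.get((rowkey, column))
        if not versions or versions[-1][1] is TOMBSTONE:
            return None
        return versions[-1][1]

    def scan(self, start_key, stop_key):
        # Rows come back in sorted rowkey order, start inclusive, stop exclusive.
        rowkeys = sorted({r for (r, _) in self.cells if start_key <= r < stop_key})
        for r in rowkeys:
            row = {c: self.get(r, c) for (rk, c) in self.cells if rk == r}
            row = {c: v for c, v in row.items() if v is not None}
            if row:
                yield r, row


table = ToyTable()
table.put("driver-11", "speed", 70)
table.put("driver-11", "speed", 85)   # second version of the same cell
table.put("driver-12", "speed", 60)
table.delete("driver-11", "speed")
print(table.get("driver-11", "speed"), list(table.scan("driver-", "driver-99")))
```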

Page 28

Trucking Company's YARN-enabled Architecture (diagram repeated from Page 14)

Page 29

[Timeline diagram, labels: 2006, 2009, October 23, 2013.]

Hadoop w/ MapReduce: MapReduce (largely batch processing) over HDFS (Hadoop Distributed File System) on nodes 1 … N
• Silo'd clusters
• Largely batch system
• Difficult to integrate

MR-279: YARN

Hadoop 2 & YARN: Hadoop2 & YARN based architecture, with Interactive, Real-Time and Batch workloads on YARN: Data Operating System, over HDFS (Hadoop Distributed File System) on nodes 1 … N

Architected & led development of YARN to enable the Modern Data Architecture

Page 30

Benefits of YARN as the Data Operating System

• The container based model allows for running nearly any workload.
  – Enables the centralized architecture.
  – No longer is MapReduce the only data processing engine.
  – Docker containers managed by YARN. Yes Please!

• Decouples resource scheduling from application lifecycle.
  – Improved scalability and fault tolerance

• Dynamically allocated resources, resulting in HUGE utilization gains
  – Versus static allocation of "slots" in Hadoop 1.0

Yahoo has over 30000 nodes running YARN across over 365PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time.

They also have estimated a 60% – 150% improvement on node usage per day since moving to YARN.

Page 31

Trucking Company's YARN-enabled Architecture (diagram repeated from Page 14)

Page 32

Apache HDFS – Hadoop Distributed File System

• Very large scale distributed file system
  • 10K nodes, tens of millions of files and PBs of data
  • Supports large files

• Designed to run on commodity hardware, assumes hardware failures
  • Files are replicated to handle hardware failure
  • Detects failures and recovers from them automatically

• Optimized for Large Scale Processing
  • Data locations are exposed so that the computations can move to where data resides

• Data Coherency
  • Write once and read many times access pattern

• Files are broken up in chunks called 'blocks'
  • Blocks are distributed over nodes
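Splitting a file into blocks and spreading replicated blocks over nodes can be sketched in a few lines. The block size, node names and round-robin placement below are assumptions for illustration; real HDFS uses much larger blocks and a rack-aware placement policy, and this is not the HDFS API.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), nodes=["n1", "n2", "n3", "n4"])
print(blocks, placement)
```

Because the placement map records where every block lives, a scheduler can send computation to a node that already holds the block, which is the "move computation to the data" point above.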

Page 33

Streaming Demo - High Level Architecture

[Diagram components: Truck Streaming Data from trucks T(1), T(2) … T(N); Inbound Messaging (Kafka) with the Truck Events Topic; Storm Stream Processing with Kafka Spout, HBase Bolt, HDFS Bolt and Monitoring Bolt; HBase with the Dangerous Events Table; Truck Events in Distributed Storage: HDFS; ActiveMQ; Web App; all running on YARN.]

Page 34

Demo – Streaming Dashboard

Page 35

A New Challenge

Page 36

CDO's vision: Build a Predictive Business, not a Reactive one

CDO's Requirements
§ Offline predictions
  § Identify investments that will increase safety and reduce company's liabilities
§ Real-time predictions
  § Anticipate driver violations before they happen and take precautionary actions

Data Scientist's Response
§ Need to explore data & form a hypothesis
§ Verify trends against TBs of events data via machine learning
§ Generate predictive models with Spark MLlib on HDP
§ Plug models into the Storm topology to predict driver violations in real-time

♬ I've been waiting for this moment all my life ♬

Page 37

Demo – Analyzing Events with Tableau

Page 38

Analyzing Raw Events – dangerous drivers

Page 39

Analyzing Raw Events – dangerous routes

Page 40

Analyzing Raw Events – violations by location

Page 41

Enriching truck events for analysis with Pig

[Diagram: Pig pipeline over data in HDFS, with metadata in HCatalog.
Inputs: Raw Truck Events; Raw Weather Data (Weather Data Sets); Payroll Data (HR & Payroll DBs).
Steps: Load Raw Truck Events → Clean & Filter (Cleaned Events) → Transform (Transformed Events) → Join with HR & weather data (Enriched Events) → Store.
Output: Enriched Events, analyzed in Tableau.]
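The clean, transform and join steps of this pipeline can be sketched in plain Python to show the data flow. This is not the Pig script from the demo: the field names (`driver_id`, `date`, `event_type`), the lookup tables and the inner-join behavior are all assumptions made for the example.

```python
def enrich(raw_events, payroll_by_driver, weather_by_date):
    """Clean, transform and join raw truck events into enriched events."""
    enriched = []
    for event in raw_events:
        # Clean & filter: drop records missing any field used downstream.
        if not all(event.get(k) for k in ("driver_id", "date", "event_type")):
            continue
        # Transform: normalize the event type text.
        row = dict(event, event_type=event["event_type"].strip().title())
        # Join with HR/payroll and weather data; unmatched rows are dropped.
        payroll = payroll_by_driver.get(row["driver_id"])
        weather = weather_by_date.get(row["date"])
        if payroll is None or weather is None:
            continue
        row.update(payroll)
        row.update(weather)
        enriched.append(row)
    return enriched


raw = [
    {"driver_id": "11", "date": "2015-04-01", "event_type": " overspeed "},
    {"driver_id": None, "date": "2015-04-01", "event_type": "normal"},
]
payroll = {"11": {"certified": "No", "wage_plan": "Miles"}}
weather = {"2015-04-01": {"foggy": "Yes"}}
enriched_events = enrich(raw, payroll, weather)
print(enriched_events)
```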

Page 42

Analyzing Enriched Events – noncertified and fatigued drivers more dangerous

Page 43

Analyzing Enriched Events – top 3 dangerous routes seem to be driven by fatigued drivers

Page 44

Analyzing Enriched Events – foggy weather leads to violations

Page 45

Analyzing Enriched Events – but top 3 safest routes are also foggy

Page 46

Integrating Predictive Analytics

Page 47

Building the Predictive Model on HDP

1. Explore a small subset of events in Tableau to identify predictive features and make a hypothesis. E.g. hypothesis: "foggy weather causes driver violations"
2. Identify suitable ML algorithms to train a model. We will use classification algorithms as we have labeled events data
3. Transform enriched events data to a format that is friendly to Spark MLlib. Many ML libs expect training data in a certain format
4. Train a logistic regression model in Spark on YARN, with above events as training input, and iterate to fine tune the generated model
5. Integrate Spark MLlib model in a Storm bolt to predict violations in real time

Page 48

Integrate Predictive Analytics in Stream Processing

[Diagram components: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm) with a Prediction Bolt; Real-time Serving (HBase); Interactive Query (Hive on Tez); Machine Learning (Spark); Millions of Enriched Truck Events; YARN; HDFS.]
• Train Spark ML model with millions of truck events
• Plug Spark model into Storm bolt

Page 49

Streaming Demo - Updated Architecture

[Diagram components: Truck Streaming Data from trucks T(1), T(2) … T(N); Inbound Messaging (Kafka) with the Truck Events Topic; Storm Stream Processing with Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt and Prediction Bolt; HBase with the PayRoll Table; Truck Events in Distributed Storage: HDFS; ActiveMQ; Web App; all running on YARN.]
• Enrich event
• Predict violation in real time & alert via MQ
• Render real time predictions on UI

Page 50

Transforming training data for Spark MLlib

Enriched Events Data

Event Type   Is Driver Certified?   Wage Plan   Hours Driven   Miles Driven   Longitude   Latitude   Weather Foggy   Weather Rainy   Weather Windy
Normal       Yes                    Hourly      45             2721           -91.3       38.14      No              No              No
Overspeed    No                     Miles       72             4152           -94.23      37.09      Yes             Yes             No
…            …                      …           …              …              …           …          …               …               …

Spark MLlib Training Data

Label   Is Driver Certified?   Wage Plan   Hours Driven   Miles Driven   Weather Foggy   Weather Rainy   Weather Windy
0       1                      1           0.45           0.2721         0               0               0
1       0                      0           0.72           0.4152         1               1               0
…       …                      …           …              …              …               …               …

• Normal events labeled as 0 and violation events as 1
• Feature scaling applied to hours and miles to improve algorithm performance
• Features with binary values denoted as 0 and 1
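The transformation between the two tables can be written as a small function. This is a sketch, not the demo's code: the dictionary field names are hypothetical, and the scaling divisors (100 for hours, 10000 for miles) are inferred from the two example rows rather than stated on the slide. Longitude and latitude are dropped, as in the training table.

```python
def to_training_row(event):
    """Map one enriched event (a dict) to (label, features) for a binary classifier."""

    def flag(value):
        return 1 if value == "Yes" else 0

    # Normal events are labeled 0, any violation event 1.
    label = 0 if event["event_type"] == "Normal" else 1
    features = [
        flag(event["certified"]),
        1 if event["wage_plan"] == "Hourly" else 0,
        event["hours_driven"] / 100.0,      # assumed scaling: 45 -> 0.45
        event["miles_driven"] / 10000.0,    # assumed scaling: 2721 -> 0.2721
        flag(event["foggy"]),
        flag(event["rainy"]),
        flag(event["windy"]),
    ]
    return label, features


overspeed = {
    "event_type": "Overspeed", "certified": "No", "wage_plan": "Miles",
    "hours_driven": 72, "miles_driven": 4152,
    "foggy": "Yes", "rainy": "Yes", "windy": "No",
}
print(to_training_row(overspeed))
```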

Page 51

Running Spark ML on YARN

1. Run spark-submit script to launch a Spark job on YARN.

spark-submit --class org.apache.spark.examples.mllib.BinaryClassification --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 truckml.jar --algorithm LR --regType L2 --regParam 1.0 /user/root/truck_training --numIterations 100

/user/root/truck_training is the training data location on HDFS.

2. Monitor progress of Spark job in YARN Resource Mgr UI

Page 52

Interpreting Spark Logistic Regression Results

Precision: 87.5%   Recall: 88%

Top three predictors of violations
1. Foggy Weather
2. Rainy Weather
3. Driver Certification
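For readers less familiar with the two metrics, precision is the share of predicted violations that were real, and recall is the share of real violations that were predicted. The sketch below uses the standard definitions; the counts in the example are hypothetical and chosen only so the arithmetic is easy to follow, they are not the demo's confusion matrix.

```python
def precision(true_positives, false_positives):
    """Fraction of predicted violations that were actual violations."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of actual violations that the model predicted."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts, for illustration only.
print(precision(true_positives=7, false_positives=1))
print(recall(true_positives=22, false_negatives=3))
```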

Page 53

Integrating Spark model in Storm

[Diagram components: Kafka Spout; Storm Prediction Bolt; Real-time Serving (HBase); ActiveMQ; Ops Center; LOB Dashboards.]

Storm Prediction Bolt
§ Initialize Spark model
§ Parse truck event
§ Enrich event with HBase data
§ Predict violation with model
§ Send alert if violation predicted
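The five steps of the prediction bolt can be sketched as one small class. This is plain Python rather than Storm or Spark code, and nearly everything in it is an assumption for illustration: the comma-separated event format, the feature names, the weights and intercept, the dictionary standing in for the HBase lookup, and the 0.5 decision threshold. It only shows how a trained logistic regression model reduces to a weighted sum and a sigmoid at prediction time.

```python
import math


class PredictionBolt:
    def __init__(self, weights, intercept, driver_table, send_alert):
        # Initialize model: here just the learned coefficients (hypothetical values).
        self.weights = weights
        self.intercept = intercept
        self.driver_table = driver_table   # stand-in for the HBase lookup
        self.send_alert = send_alert       # stand-in for the ActiveMQ publisher

    def process(self, line):
        # Parse truck event: "driver_id,foggy,rainy" with 0/1 weather flags.
        driver_id, foggy, rainy = line.split(",")
        features = {"foggy": int(foggy), "rainy": int(rainy)}
        # Enrich event with per-driver data.
        features["certified"] = self.driver_table[driver_id]["certified"]
        # Predict violation: logistic function of the weighted feature sum.
        z = self.intercept + sum(self.weights[k] * v for k, v in features.items())
        probability = 1.0 / (1.0 + math.exp(-z))
        # Send alert if a violation is predicted.
        if probability >= 0.5:
            self.send_alert(driver_id, probability)
        return probability


alerted = []
bolt = PredictionBolt(
    weights={"certified": -1.0, "foggy": 2.0, "rainy": 1.0},
    intercept=-1.5,
    driver_table={"11": {"certified": 1}, "12": {"certified": 0}},
    send_alert=lambda driver_id, p: alerted.append(driver_id),
)
bolt.process("11,0,0")   # certified driver, clear weather
bolt.process("12,1,1")   # uncertified driver, fog and rain
print(alerted)
```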

Page 54

Summary: Solution Value

Page 55

Value of large scale ML on HDP

§ Accelerate time to market/value
  § Test out multiple ML algorithms against TBs of training data in reasonable time frames
§ Confirm hypothesis against TBs of training data with confidence
  § We confirmed that fog does impact safety and wage plans do not, whereas BI tools indicated otherwise
§ Easily integrate predictive models in data driven apps
  § Run predictive models in Storm or any other app in your enterprise
§ Run all of the above in a multi-tenant YARN cluster
  § Large scale ML on YARN respects other tenants in an HDP cluster

 

Page 56

Recommendations to CDO

§ Investment recommendations, in order of priority
  1. Invest in visibility sensors and auto braking systems to deal with foggy conditions
  2. Invest in slip resistant tires to fight rainy conditions
  3. Invest in certifying drivers to reduce violation probability

§ Power of real time predictions
  § 40% reduction in violation rates by predicting high risk situations in real-time and sending immediate alerts to drivers

 

Page 57

Predictive Demo

Page 58

Q & A Big Data for Business

