
Zementis hortonworks-webinar-2014-09

Date post: 10-Dec-2014
Upload: hortonworks
Description:
Ofer Mendelevitch, Director of Data Science at Hortonworks, and Michael Zeller, Founder and CEO of Zementis, present key learnings about what drives successful implementations of big data analytics projects. Their knowledge comes from working with dozens of companies, from small cloud-based start-ups to some of the largest companies in the world. See the recording: http://youtu.be/qdhF1sfef10
Transcript
Page 1: Zementis hortonworks-webinar-2014-09

Copyright 2014 Zementis, Inc. All rights reserved.

Presented by Hortonworks & Zementis

September 10, 2014

Webinar will begin shortly…

Hadoop’s Advantages for Machine Learning and Predictive Analytics

Page 2: Zementis hortonworks-webinar-2014-09


Hadoop’s Advantages for Machine Learning and Predictive Analytics

Moderator

Mark Rabkin, Director Business Development, Zementis

Presenters

Ofer Mendelevitch, Director of Data Science, Hortonworks

Michael Zeller, CEO, Zementis

Page 3: Zementis hortonworks-webinar-2014-09


The Speakers

Michael Zeller CEO & Founder Zementis

Ofer Mendelevitch Director of Data Science Hortonworks

Ofer Mendelevitch is Director of Data Science at Hortonworks, where he is responsible for professional services involving data science with Hadoop, including use cases like recommender systems, prediction, classification and search. Prior to joining Hortonworks, Ofer held a number of positions, including Entrepreneur in Residence at XSeed Capital, VP of Engineering at Nor1, and Director of Engineering at Yahoo!, where he led multiple engineering and data science teams.

Michael Zeller is the CEO and Co-Founder of Zementis. His vision is to help companies deepen and accelerate insights from big data through the power of predictive analytics. Michael also serves on the Board of Directors of Software San Diego and as Secretary/Treasurer on the Executive Committee of ACM SIGKDD, which is the premier international organization for data mining researchers and practitioners from academia, industry, and government.

Page 4: Zementis hortonworks-webinar-2014-09

Hortonworks & Zementis

Reseller Partners

Zementis provides software for operational deployment of predictive analytics

Products & Capabilities:

• Vendor-neutral architecture for
  - Data mining tools
  - Analytics and data warehouse platforms

• Supports PMML industry standard and wide range of predictive modeling techniques

• Rapidly deploys and executes predictive models

• Accelerates business insight

Hortonworks: We Do Hadoop. Our mission is to power your Modern Data Architecture by delivering Enterprise Apache Hadoop

Our Commitment:

• Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process

• Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind

• Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills

Page 5: Zementis hortonworks-webinar-2014-09

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

A data architecture under pressure from new data

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications

DATA SYSTEM: Repositories (RDBMS, EDW, MPP)

SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)

• OLTP, ERP, CRM Systems

• Unstructured documents, emails

• Clickstream

• Server logs

• Sentiment, Web Data

• Sensor, Machine Data

• Geolocation

Source: IDC

• 2.8 ZB in 2012

• 85% from New Data Types

• 15x Machine Data by 2020

• 40 ZB by 2020

Page 6: Zementis hortonworks-webinar-2014-09

Hadoop within an emerging Modern Data Architecture

OPERATIONS TOOLS: Provision, Manage & Monitor

DEV & DATA TOOLS: Build & Test

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications

DATA SYSTEM: Repositories (RDBMS, EDW, MPP)

• Governance & Integration

• Security

• Operations

• Data Access

• Data Management

SOURCES:

• OLTP, ERP, CRM Systems

• Documents, Emails

• Web Logs, Click Streams

• Social Networks

• Machine Generated

• Sensor Data

• Geolocation Data

Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale

Page 7: Zementis hortonworks-webinar-2014-09

Hadoop unlocks a new approach: Iterative Analytics

Current Reality: Apply schema on write; Dependent on IT

SQL: Single Query Engine, Repeatable Linear Process

• Determine list of questions

• Design solutions

• Collect structured data

• Ask questions from list

• Detect additional questions

Augment w/ Hadoop: Apply schema on read; Support range of access patterns to data stored in HDFS: polymorphic access

Hadoop: Multiple Query Engines, Iterative Process: Explore, Transform, Analyze

• Batch

• Interactive

• Real-time

• Streaming

Page 8: Zementis hortonworks-webinar-2014-09


A (partial) map of machine learning “tasks”

Discovery

• Clustering: Detect natural groupings

• Outlier detection: Detect anomalies

• Association rule mining: Co-occurrence patterns

Prediction

• Classification: Predict a category

• Regression: Predict a value

• Recommendation: Predict a preference

Page 9: Zementis hortonworks-webinar-2014-09


Typical iterative flow in machine learning modeling


Visualize, Explore

Hypothesize; Model

Measure/Evaluate

Acquire Data

Clean Data

Deploy & Monitor

Page 10: Zementis hortonworks-webinar-2014-09


Why Apache Hadoop for Data Science?

• Hadoop’s schema-on-read reduces cycle time

• Hadoop is ideal for pre-processing of raw data
  - Structured & unstructured

• Larger datasets enable better models

• Large-scale parallel scoring

Page 11: Zementis hortonworks-webinar-2014-09

Hadoop’s schema-on-read accelerates innovation

Timeline: Start, 3 months, 6 months, 9 months

“Schema change” project: “I need new data” → “Finally, we start collecting” → “Let me see… is it any good?”

Schema-on-read: “Let’s just put it in a folder on HDFS” → “Let me see… is it any good?” → “My model is awesome!”
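
The schema-on-read idea on this slide can be sketched in a few lines of Python. This is an illustration only, not Hadoop or Hive code: the field names (`user`, `amount`, `channel`) and the `read_with_schema` helper are hypothetical, and an in-memory list stands in for files dropped into a folder on HDFS.

```python
import json

# Raw records are stored exactly as they arrive; no schema is enforced at write time.
# (A plain list stands in for files dropped into a folder on HDFS.)
raw_lines = [
    '{"user": "a1", "amount": "12.50", "channel": "web"}',
    '{"user": "b2", "amount": "7"}',
    'not even json',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: keep the requested fields and cast them.

    `schema` maps a field name to a casting function. Records that cannot be
    parsed or cast are skipped rather than rejected at load time.
    """
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            continue
    return rows

# Two different schemas over the same raw data, each chosen by the reader.
amounts = read_with_schema(raw_lines, {"user": str, "amount": float})
channels = read_with_schema(raw_lines, {"user": str, "channel": str})
```

The point of the sketch is that a new question (here, the `channel` field) needs only a new read-time schema, not a change to how the data was stored.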

Page 12: Zementis hortonworks-webinar-2014-09

Hadoop is ideal for large scale pre-processing

Raw Data → Feature Matrix

Pre-processing steps: Join, Normalize, OCR, Sample, Aggregate, NLP, Transform
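
A few of the pre-processing steps named on this slide (aggregate, join, normalize) can be sketched on a tiny in-memory dataset. This is a single-machine illustration under assumed data: the customer and transaction records and the `min_max_normalize` helper are hypothetical, and on Hadoop the same steps would run as distributed jobs over much larger inputs.

```python
from collections import defaultdict

# Hypothetical raw inputs: a customer table and a transaction log.
customers = {"c1": {"age": 30}, "c2": {"age": 50}}
transactions = [("c1", 20.0), ("c1", 40.0), ("c2", 10.0)]

# Aggregate: total spend and transaction count per customer.
totals = defaultdict(float)
counts = defaultdict(int)
for customer_id, amount in transactions:
    totals[customer_id] += amount
    counts[customer_id] += 1

# Join: combine the aggregates with the customer attributes.
rows = [
    (cid, [float(attrs["age"]), totals[cid], float(counts[cid])])
    for cid, attrs in sorted(customers.items())
]

def min_max_normalize(matrix):
    """Scale each column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*matrix))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in matrix
    ]

# Normalize: one row per customer, one column per feature.
feature_matrix = min_max_normalize([features for _, features in rows])
```

The result has one row per customer and one column per feature (age, total spend, transaction count), which is the "feature matrix" shape a modeling tool expects.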

Page 13: Zementis hortonworks-webinar-2014-09

Hadoop enables modeling with larger datasets

Larger datasets → better outcomes

Banko & Brill, 2001

• More examples

• More features

Page 14: Zementis hortonworks-webinar-2014-09

Hadoop enables large-scale parallelized “scoring”

Training set → Learning → Model

Test set → Scoring → Output

Model formats: PMML, Native

Embarrassingly Parallel: Using Hadoop as grid compute infrastructure
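
Why scoring is "embarrassingly parallel" can be shown with a small sketch: once a model is fixed, each partition of the data can be scored without reference to any other partition. The model coefficients and helper names below are hypothetical, and a local thread pool stands in for the nodes of a Hadoop cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def score(record):
    """Hypothetical fixed model: a linear score over two features."""
    return 0.5 * record["x1"] + 2.0 * record["x2"]

def score_partition(partition):
    """Score one partition independently of all others."""
    return [score(record) for record in partition]

def partitions_of(records, size):
    """Split the records into consecutive chunks of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]

records = [{"x1": float(i), "x2": 1.0} for i in range(10)]
partitions = partitions_of(records, size=3)

# No partition needs data from any other partition, so the work can be
# spread across workers with no coordination beyond collecting the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    scored_partitions = list(pool.map(score_partition, partitions))

scores = [s for part in scored_partitions for s in part]
```

Because `pool.map` returns results in input order, the flattened scores line up with the original records; adding workers changes only how quickly the partitions are processed, not the output.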

Page 15: Zementis hortonworks-webinar-2014-09

What is PMML?

• Mature standard developed by the DMG (Data Mining Group) to avoid proprietary issues and incompatibilities and to deploy models

• XML-based language used to define statistical and data mining models and to share these between compliant applications

• Supported by most leading data mining tools, commercial and open-source

• Data handling and transformations (pre- and post-processing) are a core component of the PMML standard

• Allows for the clear separation of tasks: Model development vs. model deployment

• Eliminates the need for custom code and proprietary model deployment solutions

Predictive Model Markup Language (PMML) industry standard reduces the complexity of operationalizing models
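
To make the "XML-based language" point concrete, here is a small hand-written PMML 4.2 regression model and a toy Python reader for it. This is a sketch under assumptions, not the Zementis engine and not a conforming PMML consumer: the field names and coefficients are made up, and the reader handles only an intercept plus `NumericPredictor` terms with the default exponent.

```python
import xml.etree.ElementTree as ET

# A minimal PMML regression model, hand-written for illustration.
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Toy linear regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income"/>
      <MiningField name="spend" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}

def load_linear_model(pmml_text):
    """Read the intercept and per-field coefficients from the regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients

def predict(model, record):
    """Evaluate intercept + sum(coefficient * value) for one input record."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())

model = load_linear_model(PMML_DOC)
prediction = predict(model, {"age": 40.0, "income": 50000.0})
```

The model file carries everything needed to score a record, so the tool that trained the model and the system that executes it do not have to be the same; that is the separation of development and deployment the slide describes.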

Page 16: Zementis hortonworks-webinar-2014-09

Predictive Analytics Workflow

Raw Inputs → Data Pre-Processing → Derived Model Inputs → Predictive Model → Model Outputs → Data Post-Processing → Prediction

PMML File:

• Model Signature: Data and operational types

• Input Validation: Outliers, Missing Values, Invalid Values

• Data Pre-Processing: Normalize, Discretize, Bin, Map, etc.

• Data Post-Processing: Scaling, Business Decisions, Thresholds, etc.

PMML in action, covering a complete workflow from raw data input to decision output
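
The stages of this workflow can be sketched as a chain of plain Python functions. Everything specific here is an assumption made for illustration: the input fields, the replacement and clamping rules, the coefficients, and the decision threshold are hypothetical, and in a real deployment these steps would be declared in the PMML file rather than hand-coded.

```python
import math

def validate(raw):
    """Input validation: replace missing values and clamp outliers (assumed rules)."""
    age = raw.get("age")
    if age is None:
        age = 40.0  # hypothetical missing-value replacement
    age = min(max(age, 18.0), 90.0)  # hypothetical outlier treatment
    return {"age": age, "balance": raw.get("balance", 0.0)}

def pre_process(inputs):
    """Pre-processing: normalize and bin raw inputs into derived model inputs."""
    return {
        "age_norm": (inputs["age"] - 18.0) / (90.0 - 18.0),
        "low_balance": 1.0 if inputs["balance"] < 100.0 else 0.0,
    }

def model(derived):
    """Predictive model: a toy logistic regression with made-up coefficients."""
    z = -1.0 + 0.5 * derived["age_norm"] + 2.0 * derived["low_balance"]
    return 1.0 / (1.0 + math.exp(-z))

def post_process(probability, threshold=0.5):
    """Post-processing: turn the model output into a business decision."""
    return "contact customer" if probability >= threshold else "no action"

def run_workflow(raw):
    """Raw inputs -> validation -> pre-processing -> model -> post-processing."""
    return post_process(model(pre_process(validate(raw))))

decision = run_workflow({"age": 54.0, "balance": 50.0})
```

Keeping each stage as its own function mirrors the slide: validation, pre-processing, the model, and post-processing are separate concerns that travel together in one artifact.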

Page 17: Zementis hortonworks-webinar-2014-09

Predictive Analytics: Path to Business Value

Big Data: Applications, Cloud, Databases, Log Files, RSS Feeds, Other Sources

Predictive Analytics: Predictive Models, Machine Learning Techniques, Data Mining Tools

Business Insights

• More relevant

• More accurate

• More comprehensive

• More nuanced

Decisions & Actions

• Faster

• Lower risk

• Greater positive impact

Business Value

• Accelerated time-to-market

• More precise targeting

• Real-time responsiveness

• Enhanced operational agility

• Competitive advantage

• Higher revenue growth rates

• Greater profitability

Predictive analytics helps organizations unlock the value of their big data …

Page 18: Zementis hortonworks-webinar-2014-09

Predictive model deployment becomes a rework cycle

• Extensive manual coding

• Cross-checking

• Fixing coding errors

• Delayed insight

• Less accurate decisions

• Missed opportunities

• Loss of value

Traditional Deployment Cycle

Develop Operationalize Utilize

Data Scientist IT Engineer

Business Decisions

Business Professional


… but model deployment challenges can often erode much of the value that predictive analytics can deliver

Page 19: Zementis hortonworks-webinar-2014-09

Deployment with Zementis & PMML

Model Deployment Cycle Time (chart axes: Economic Value, Time-to-insight)

• Without Zementis: ~ 6 months

• With Zementis: Within 2 days *

* And sometimes even within a few hours!

• Accelerated deployment timeline

• Reduced model deployment cycle time

• Reduced model deployment expense

• Increased model throughput

• Enhanced accuracy

• Minimal rework, if any

Rapid insight = Rapid time-to-value from predictive analytics

Enter Zementis, whose solutions accelerate time-to-insight for predictive analytics

Page 20: Zementis hortonworks-webinar-2014-09

Universal PMML Plug-in (UPPI)

Model Deployment, Integration/Execution

Data Mining Tools

• Commercial Vendors (e.g. IBM SPSS, SAS)

• Open Source Tools (R, KNIME, ...)

Predictive Algorithms

• Decision Trees

• Neural Networks

• Support Vector Machines

• Linear and Logistic Regression

• Naive Bayes Classifiers

• General and Generalized Linear Models

• Cox Regression

• Rule Set Models

• Clustering

• Scorecards

• Association Rules

• Multiple Models (Segmentation, Chaining, Composition and Ensemble, including Random Forest Models)

PMML

Zementis UPPI for Hive/Hadoop

Simple Deployment & Execution

• Upload PMML file(s) in Hive

• PMML turns into HiveQL functions

• Seamlessly score data on Hadoop

Page 21: Zementis hortonworks-webinar-2014-09

Hive 0.13: Now faster than ever, up to 100x performance improvements and more to come…

Page 22: Zementis hortonworks-webinar-2014-09

Speeding Up Performance with Tez & ORC

UPPI for Hive 0.13 Performance

[Chart: Scaling by Hadoop Cluster Size; Time for 10 Nodes vs. 20 Nodes]

[Chart: Time for Hive 0.13, Tez, and Tez & ORC; bars labeled 21% and 29%]

Performance executing a complex PMML model as UDF (User-Defined Function) using Hive 0.13

29% performance improvement when executing the same model and data by enabling Tez & ORC

Page 23: Zementis hortonworks-webinar-2014-09

DEMO

Zementis UPPI for Hive

1. PMML Sample Models > Hive UDFs

2. Run Customer Churn Example

Zementis Universal PMML Plug-in (UPPI) demo on Hortonworks Sandbox

Page 24: Zementis hortonworks-webinar-2014-09

Broad Applicability

Fraud & Risk Scoring

• Financial institutions

• Scoring bureaus

• Fraud detection

• Advanced decision management

Marketing & Sales

• Up- /cross-sell and next-best-offer

• Marketing campaign optimization

• Real-time recommendations

Sensor & Device Data Processing

• Rotating equipment

• Energy

• Biometrics

• IP network security

Hortonworks and Zementis products accelerate predictive model insights for multiple industries and business use cases

Page 25: Zementis hortonworks-webinar-2014-09


Thank You!

Questions?

