Date posted: 26-Jan-2015
Category: Technology
Uploaded by: srisatish-ambati
H2O the Prediction Engine
Better predictions
https://github.com/0xdata/h2o
H2O makes Hadoop do math
Hadoop = opportunity
Not enough data scientists
Analysts won't code Java
H2O the Prediction Engine Exploration Modeling Scoring
Big Data
H2O the Prediction Engine
Adhoc Exploration
Math Modeling
Real-time Scoring
Big Data Velocity Volume
Big Data: messy
Math Modeling: clustering, classification, regression, ensembles
Real-time Scoring: models score in 100s of nanoseconds
No New API
Approximate results each step
More Data beats Better Algorithms
More Data and Better Algorithms Scale & Parallelism
fraud detection
Apps
reco engine
H2O the Prediction Engine
Intellectual Legacy
Math needs to be free
Open Source
Support and Innovation
https://github.com/0xdata/h2o
Credits & Team
SriSatish Ambati, CEO & Co-founder: Director of Engineering, DataStax, Cassandra & Hadoop; Customers & Platform Marketing, Azul
Cliff Click, CTO & Co-founder: Chief JVM Architect, Azul; Sun, HP, Motorola; JIT & HotSpot
Tomas Nykodym, PhD: Security, Intrusion Detection
Cyprien Noel: Founder, ObjectFabric; TradeWeb, SmartTrade
Michal Malohlava, PhD: DSLs, Compilers
Jan Vitek: Full Professor, Purdue (on sabbatical); Real-time VM, R/stats Compiler
Kevin Normoyle: AMD Fellow; Distinguished Engineer, Sun; Consistency Models
Tom Kraljevic, VP of Engineering: founder, Luminix; Azul, PMC-Sierra, Chromatic
Data Science & Advisors
Stephen Boyd: Professor of Mathematical Engineering, Stanford; Convex Optimization
Trevor Hastie: Professor of Statistics, Stanford; Generalized Additive Models
Rob Tibshirani: Professor of Statistics, Stanford; GLMNet, Lasso
Doug Lea: malloc for C, fork-join, Java memory model; SUNY Oswego
Dhruba Borthakur: HDFS, Hive, Facebook
Niall Dalton: Time-Series DB, KX, High-frequency Trading, Cantor-Fitz
Charles Zedlewski: VP Products, Cloudera
Distributed! Extensible, reconfigurable!
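Math at Scale – Simple Legos
[Diagram: H2O building blocks. Primitives: sum (+), product (*), count (n), mean (µ), σ, covariance, random shuffle, histogram. Algorithms built from them: OLS, GLM / logistic regression, k-means, random decision trees.]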
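The "simple legos" idea can be illustrated with a minimal sketch: the mean, σ and covariance all reduce to a handful of sums, so each data chunk can compute its partial sums locally and the results simply add up. The function names below are hypothetical and this is not H2O's API; it only assumes paired numeric columns split into chunks.

```python
import math
from functools import reduce

def chunk_stats(xs, ys):
    """Map step: partial sums (n, Σx, Σy, Σx², Σxy) over one chunk."""
    return (len(xs), sum(xs), sum(ys),
            sum(x * x for x in xs),
            sum(x * y for x, y in zip(xs, ys)))

def combine(a, b):
    """Reduce step: partial sums from two chunks add component-wise."""
    return tuple(p + q for p, q in zip(a, b))

def summarize(stats):
    """Turn the combined sums into mean(x), sigma(x) and cov(x, y)."""
    n, sx, sy, sxx, sxy = stats
    mean_x, mean_y = sx / n, sy / n
    var_x = sxx / n - mean_x ** 2        # population variance
    cov_xy = sxy / n - mean_x * mean_y   # population covariance
    return mean_x, math.sqrt(var_x), cov_xy

# Two chunks standing in for data held on two different nodes.
chunks = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]
total = reduce(combine, (chunk_stats(xs, ys) for xs, ys in chunks))
mean_x, sigma_x, cov_xy = summarize(total)
```

The sum-of-squares form is the simplest to combine across chunks but can lose precision on large, badly scaled data; a production system would likely use a more numerically stable update.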
Before H2O
[Diagram: workflow before H2O. Roles: data scientist, engineer, business analyst. Stages: exploration, modeling, offline scoring, online scoring. Data: volume (HDFS, Hive/SQL) and velocity (events). Tasks: munging, slice-and-dice, features; classification, regression, clustering, optimal model; ensemble models, low latency. Outputs: applications, predictions, rule engine.]
Product Road Map

1.15.2013 (in 4 pilots)
algos: RandomForest, GLM, ADMM, GLMnet, k-means
data: dense, categorical
api: REST, JSON, R-like console
Scale, Single-Execution, GridSearch

5.15.2013 (in production)
algos: GroupBy, Grep, Unbalanced
App: Fraud Detection
data: sparse
api: R, math, string
Adhoc Analytics, Multi-Execution, Scoring Engine, Event Ingest

8.15.2013 (big adoption)
algos: GBM, SVM, KNN, Optimization
App: RecoEngine
data: sparse
api: Tableau, Visualization
Multi-tenant, Library
Secret sauce: move code, not data
Linear Regression
Fork/join, data partitioning, fine-grained parallelism
Phase 1: sums. Phase 2: distance. Phase 3: validate.
Arraylets: leaf computes, parent aggregates
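The "leaf computes, parent aggregates" pattern can be sketched for simple linear regression as below. This is an illustrative, single-process recursion with hypothetical function names, not H2O's fork/join code, and it folds the work into one pass of sums rather than reproducing the three phases named on the slide. In a real fork/join setting the two recursive calls would run as parallel tasks, each next to its own slice of the data.

```python
def partial_sums(xs, ys, lo, hi, leaf_size=2):
    """Return (n, Σx, Σy, Σxy, Σx²) over rows [lo, hi)."""
    if hi - lo <= leaf_size:
        # Leaf: compute directly on its own slice of the data.
        n = sx = sy = sxy = sxx = 0
        for i in range(lo, hi):
            n += 1
            sx += xs[i]
            sy += ys[i]
            sxy += xs[i] * ys[i]
            sxx += xs[i] * xs[i]
        return (n, sx, sy, sxy, sxx)
    mid = (lo + hi) // 2
    left = partial_sums(xs, ys, lo, mid, leaf_size)    # would be forked
    right = partial_sums(xs, ys, mid, hi, leaf_size)   # would be forked
    # Parent: aggregate the children's results component-wise.
    return tuple(a + b for a, b in zip(left, right))

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n, sx, sy, sxy, sxx = partial_sums(xs, ys, 0, len(xs))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

slope, intercept = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

Only the five-number summaries travel up the tree, which is the sense in which the code moves to the data rather than the data to the code.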
Company confidential. Copyright 2012.
Use Cases

Fraud Detection
Scoring: event stream on a scorecard model
Modeling: Random Forest for outlier detection
Modeling: event sequence patterns

Customer Behavior & Merchant Analytics
Scoring: purchase event stream scoring on ensemble models
Modeling: logistic regression models for customer engagement

Failure Prediction from Sensor Data
Model device failures and rank vendor graphs.

Upstream Oil Exploration
Distance & regression on 1 TB of data
MLS for oil fields
Math & Hadoop users recommend us!
Data & Algorithms
[Diagram: H2O – Real Time. Data sources: SQL | HDFS | S3 | NoSQL. Core: distributed collections and execution; patterns, sequences. Interfaces: REST, JSON, R, Excel, Java API.]
Hadoop Ecosystem
[Diagram: H2O shown alongside MapReduce, Hive, Pig, Impala and Drill, with HDFS as storage, across batch and interactive workloads.]
Generalized Linear Modeling
l1 norm regularization
• Alternating Direction Method of Multipliers (Boyd)
• Decomposition-coordination
• Small Local Sub-Problems and Global Coordination
• Broadcast & Gather
• Decomposability: Dual Ascent + Convergence of Multipliers
• Block & Component Separability
• Generalized Gradients (Hastie, Tibshirani, et al.)
https://github.com/0xdata/h2o/blob/master/src/main/java/hex/DLSM.java
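The bullets above (small local sub-problems, global coordination, broadcast and gather, l1 regularization) can be made concrete with a toy global-consensus ADMM for a lasso problem with a single coefficient. This is a hedged sketch, not the code at the linked file: the function names, the scalar setting and the parameters `rho` and `iters` are assumptions chosen to keep the example short.

```python
def soft_threshold(v, kappa):
    """Proximal operator of kappa * |x|: shrink v toward zero."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0

def consensus_admm_lasso(chunks, lam, rho=1.0, iters=200):
    """Minimize 0.5 * sum_j (a_j * x - b_j)**2 + lam * |x| for scalar x.

    chunks: list of (a, b) pairs, one per hypothetical node, where a and b
    are equal-length lists of floats held by that node.
    """
    n = len(chunks)
    # Each node reduces its rows to (Σa², Σab) once, where the data lives.
    stats = [(sum(ai * ai for ai in a), sum(ai * bi for ai, bi in zip(a, b)))
             for a, b in chunks]
    z = 0.0            # global (consensus) coefficient
    u = [0.0] * n      # scaled multipliers, one per node
    for _ in range(iters):
        # Broadcast z: every node solves its small local sub-problem.
        x = [(ab + rho * (z - ui)) / (aa + rho)
             for (aa, ab), ui in zip(stats, u)]
        # Gather: global coordination with l1 shrinkage.
        avg = sum(xi + ui for xi, ui in zip(x, u)) / n
        z = soft_threshold(avg, lam / (rho * n))
        # Dual ascent on the multipliers.
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z

chunks = [([1.0, 2.0], [1.1, 1.9]), ([3.0], [3.2]), ([1.0, 1.0], [0.8, 1.2])]
x_hat = consensus_admm_lasso(chunks, lam=0.5)
```

For this scalar problem the exact answer is known in closed form, soft_threshold(Σab, λ) / Σa², which makes it easy to check that the iterations converge to the right value.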
Random Forest
• Textbook implementation from Breiman's paper
• Data is distributed upon ingest
• Splits on random selection of features
• Gini & Entropy
• Handle NAs (during training)
• Class-Weighting
• Stratified Sampling (local)
https://github.com/0xdata/h2o/tree/master/src/main/java/hex/rf
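Two of the bullets above, "splits on random selection of features" and "Gini & Entropy", can be sketched in a few lines. The helper names are hypothetical and the exhaustive threshold search is an assumption for clarity; this is not the implementation at the linked directory.

```python
import math
import random
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, n_features, rng, impurity=gini):
    """Search thresholds on a random subset of features.

    Returns (weighted impurity, feature index, threshold), or None if no
    candidate feature separates the rows.
    """
    candidates = rng.sample(range(len(rows[0])), n_features)
    best = None
    for f in candidates:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # threshold does not split the node
            score = (len(left) * impurity(left)
                     + len(right) * impurity(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

rows = [[1, 5], [2, 5], [3, 5], [4, 5]]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels, n_features=2, rng=random.Random(0))
```

A tree in the forest would call `best_split` at every node with a fresh random feature subset and recurse on the two resulting partitions.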
forest for the tree.. iris dataset
Models unlock value in data
• A 1% increase in predictive power is worth $11M at a major online payment system.
• Each fraud scored accurately has an expected value of tens of thousands of dollars.
• Leads cost $10-100 per lead, so accurately predicting conversion and lead quality goes directly to the bottom line.
• Competitive advantage in predicting which assets to acquire.
Deployment - commodity / cloud
H2O
x86
H2O is pure Java and easy to install
H2O the Prediction Engine
Big Data Science
Modeling & Scoring Engine
Approximate results each step
No new API
Use R, Excel & SAS
Scale & Parallelism