DF1 - R - Roark - H2O Overview

Post on 16-Jan-2017

H2O.ai Machine Intelligence

ML is the new SQL

Prediction is the new Search

Company Overview

Company
• Founded: 2011, venture-backed
• Team: 40
• Distributed systems engineers doing machine learning
• HQ: Mountain View, CA

Product
• Fast, scalable machine and deep learning
• Predictive analytics
• Open Source Applications

Executive Team

• Sri Satish Ambati, CEO & Co-founder
• Cliff Click, CTO & Co-founder
• Tom Kraljevic, VP of Engineering
• Oleg Rogynskyy, VP of Marketing

Affiliations listed on the slide, in order: DataStax; Sun, Java HotSpot; Abrizio, Intel; Lexalytics

Board of Directors
• Jishnu Bhattacharjee // Nexus Ventures
• Ash Bhardwaj // Flextronics

Scientific Advisory Council
• Trevor Hastie
• Stephen Boyd
• Rob Tibshirani

H2O Product Overview

H2O is: Open Source, Distributed, In-Memory, Predictive Analytics Platform!

Speed Matters!
• Time is valuable
• In-memory is faster
• Intelligence as a service
• High speed AND accuracy

No Sampling
• Scale to big data
• Access data links
• Use all data without sampling

Interactive UI
• Online modeling with H2O Flow
• Model comparison
• R, Python, Web, REST API

Cutting-Edge Algos
• Suite of cutting-edge algorithms
• Deep Learning
• NanoFast Scoring Engine
• Move from model to production extremely fast

Algorithms on H2O

Supervised Learning

Statistical Analysis
• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie
• Cox Proportional Hazards Models
• Naïve Bayes

Ensembles
• Distributed Random Forest: classification or regression models
• Gradient Boosting Machine: produces an ensemble of decision trees with increasingly refined approximations

Deep Neural Networks
• Deep Learning: creates multi-layer feed-forward neural networks, starting with an input layer followed by multiple layers of nonlinear transformations
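To make "an ensemble of decision trees with increasingly refined approximations" concrete, here is a minimal sketch of gradient boosting for regression. It is plain Python, not H2O's implementation: the trees are depth-1 stumps on a single feature, the loss is squared error, and the names fit_stump, boost, predict and mse, the learning rate and the toy data are all assumptions made for illustration.

```python
# Toy gradient boosting: every round fits a stump to the current residuals
# and adds a damped copy of it to the ensemble (illustration only, not H2O code).

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]

def boost(xs, ys, rounds, learning_rate=0.5):
    """Build the ensemble: start from the mean, then correct the residuals round by round."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lv if x <= t else rv for t, lv, rv in stumps)

def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x * x for x in xs]
# Training error for ensembles of 1, 5 and 25 stumps.
errors = [mse(boost(xs, ys, rounds), xs, ys) for rounds in (1, 5, 25)]
```

Each extra round refines the previous approximation, so the training error in errors shrinks as the number of stumps grows.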

Algorithms on H2O

Unsupervised Learning

Clustering
• K-means: partitions observations into k clusters/groups of comparable spatial extent

Dimensionality Reduction
• Principal Component Analysis: linearly transforms correlated variables into uncorrelated components

Anomaly Detection
• Autoencoders: find outliers using nonlinear dimensionality reduction with deep learning
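As an illustration of the K-means entry, the following is a small sketch of Lloyd's algorithm in plain Python. It is not H2O's distributed implementation; the function names, the fixed initial centers and the six sample points are assumptions chosen to keep the example deterministic.

```python
# Lloyd's algorithm for K-means on 2-D points (illustration only, not H2O code).

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))

def assign(points, centers):
    """Put every point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centers, iterations=10):
    """Alternate assignment and center update for a fixed number of iterations."""
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        centers = [mean(m) if m else c for m, c in zip(clusters, centers)]
    return centers, assign(points, centers)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
```

With these inputs the two groups of three points end up in separate clusters, and each center settles on the mean of its group.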

Accuracy with Speed and Scale

[Diagram: data sources (HDFS, S3, SQL, NoSQL) feed an in-memory Fast Modeling Engine built on Map Reduce/Fork Join and Columnar Compression. The engine covers Classification, Regression, Feature Engineering, Deep Learning, PCA, GLM, Cox, Random Forest / GBM Ensembles, Matrix Factorization, Clustering and Munging. Outputs: Streaming and Nano Fast Java Scoring Engines.]
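The "Map Reduce/Fork Join" box can be pictured with a small sketch: a column is split into chunks, a map step runs on each chunk in parallel, and a reduce step combines the partial results. This is plain Python using a thread pool as a stand-in, not H2O's Java engine; the chunk size and the function names are assumptions.

```python
# Map/reduce over column chunks: compute a mean without one pass over a single big list.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def split_into_chunks(column, chunk_size):
    """Cut a column into consecutive chunks of at most chunk_size values."""
    return [column[i:i + chunk_size] for i in range(0, len(column), chunk_size)]

def map_chunk(chunk):
    """Map step: summarize one chunk as (sum, count)."""
    return (sum(chunk), len(chunk))

def reduce_pair(a, b):
    """Reduce step: merge two partial (sum, count) results."""
    return (a[0] + b[0], a[1] + b[1])

def chunked_mean(column, chunk_size=4):
    chunks = split_into_chunks(column, chunk_size)
    with ThreadPoolExecutor() as pool:  # chunks are mapped in parallel
        partials = list(pool.map(map_chunk, chunks))
    total, count = reduce(reduce_pair, partials)
    return total / count

column = list(range(1, 11))
result = chunked_mean(column)
```

Because the reduce step is associative, the partial results can be merged in any grouping, which is what lets this pattern spread across cores or nodes.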

H2O – The Killer-App for Spark

[Diagram: MLlib, H2O and SQL sit on top of H2ORDD; HDFS = DATA]

• In-Memory: Big Data, Columnar
• ML: 100x faster Algos
• R: CRAN, API, fast engine
• API: Spark API, Java MM
• Community: Devs, Data Science

H2O Flow Interface

Adoption and Growth

• 125 Meetups
• 15,000 Attendees
• 13,200+ Installations
• 2,300+ Corporations
• 1st annual H2O World Conference

H2O’s Install Base

            Mar 2014    July 2014    Mar 2015
Companies   103         634          2,329
Users       463         2,887        13,237

• Open Source
• 135+ Meetups
• Word-of-Mouth

[Chart: Active Users]

Sparkling Water cluster of size 3 on YARN

[Diagram: Hadoop + HDFS with three YARN node managers. Each node manager hosts a YARN container running a Spark executor, and each executor holds a worker plus an MLlib worker. The driver runs the Scala main program with a client plus an MLlib client.]

H2O Software Stack

[Diagram, layers from top to bottom]
• JavaScript, R, Python, Excel/Tableau, Flow
• Network
• Rapids Expression Evaluation Engine, Scala, Customer Algorithm
• Parse, GLM, GBM, RF, Deep Learning, K-Means, PCA, Customer Algorithm, In-H2O Prediction Engine
• Fluid Vector Frame, Distributed K/V Store, Non-blocking Hash Map
• Job, MRTask, Fork/Join
• Spark, Hadoop, Standalone H2O

Reading Data from HDFS into H2O with R

STEP 1

R user

h2o_df = h2o.importFile("hdfs://path/to/data.csv")

STEP 2

2.1 R function call: h2o.importFile()
2.2 HTTP REST API request to H2O has HDFS path
2.3 H2O cluster initiates distributed ingest
2.4 H2O cluster requests data (data.csv) from HDFS
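To show what "the request has the HDFS path" can look like on the wire, here is a minimal sketch that only builds the URL and sends nothing. The endpoint name /3/ImportFiles, its path query parameter, the host and port 54321 are assumptions for illustration; the slide itself only states that the REST request carries the HDFS path.

```python
# Build (but do not send) a REST URL that carries an HDFS path as a query parameter.
from urllib.parse import urlencode

def import_files_url(host, port, hdfs_path):
    """Return the request URL; the endpoint and parameter name are assumed, not taken from the slide."""
    query = urlencode({"path": hdfs_path})  # percent-encodes ':' and '/'
    return f"http://{host}:{port}/3/ImportFiles?{query}"

url = import_files_url("localhost", 54321, "hdfs://path/to/data.csv")
```

The R client only ships the path string; the cluster nodes are the ones that read the file from HDFS.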

STEP 3

3.1 HDFS provides data (data.csv)
3.2 Distributed H2OFrame in DKV (H2O cluster)
3.3 Return pointer to data in REST API JSON response
3.4 h2o_df object created in R

The h2o_df H2OFrame object in R holds:
• Cluster IP
• Cluster port
• Pointer to data
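The key point of step 3 is that the client-side object is a handle, not a copy of the data. A minimal sketch of such a handle follows, with the three fields the slide lists. The class name FrameHandle, the key data.hex and the /3/Frames/ URL layout are assumptions for illustration, not the R package's actual internals.

```python
# A client-side handle: it records where the data lives, not the rows themselves.
from dataclasses import dataclass

@dataclass(frozen=True)
class FrameHandle:
    cluster_ip: str    # "Cluster IP" on the slide
    cluster_port: int  # "Cluster Port" on the slide
    frame_key: str     # "Pointer to Data": the key of the frame in the cluster's K/V store

    def url(self):
        """Address under which the cluster would be asked about this frame (assumed layout)."""
        return f"http://{self.cluster_ip}:{self.cluster_port}/3/Frames/{self.frame_key}"

h2o_df = FrameHandle(cluster_ip="127.0.0.1", cluster_port=54321, frame_key="data.hex")
```

Operations on such a handle turn into further REST calls, so the data itself stays in the cluster's memory.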

R Script Starting H2O GLM

User process (standard R process):
• R script calls h2o.glm()
• h2o.glm() calls .h2o.startModelJob()
• .h2o.startModelJob() sends POST /3/ModelBuilders/glm as HTTP REST/JSON over TCP/IP

H2O process:
• Network layer: TCP/IP, HTTP
• REST layer: REST/JSON, /3/ModelBuilders/glm endpoint
• H2O - algos: Job, GLM algorithm, GLM tasks
• H2O - core: Fork/Join framework, K/V store framework

Legend: User process, H2O process
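The client half of this flow can be sketched as building the POST request that starts a GLM job. The request is only constructed, never sent. The endpoint path comes from the slide; the form-encoded body, the parameter names training_frame, response_column and family, and their values are assumptions for illustration.

```python
# Build (but do not send) the POST request that would start a GLM model build.
from urllib.parse import urlencode
from urllib.request import Request

def glm_build_request(base_url, training_frame, response_column, family):
    """Return a urllib Request for POST /3/ModelBuilders/glm with form-encoded parameters."""
    body = urlencode({
        "training_frame": training_frame,    # key of a frame already in the cluster (assumed name)
        "response_column": response_column,  # column to predict (assumed name)
        "family": family,                    # e.g. binomial or gaussian
    }).encode("ascii")
    return Request(f"{base_url}/3/ModelBuilders/glm", data=body, method="POST")

req = glm_build_request("http://localhost:54321", "data.hex", "y", "binomial")
```

Passing req to urllib.request.urlopen would send it to a running cluster. On the server side the job then moves through the REST, algos and core layers listed above.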

R Script Retrieving H2O GLM Result

User process (standard R process):
• R script calls h2o.glm()
• h2o.getModel() sends GET /3/Models/glm_model_id as HTTP REST/JSON over TCP/IP

H2O process:
• Network layer: TCP/IP, HTTP
• REST layer: REST/JSON, /3/Models endpoint
• H2O - algos
• H2O - core: Fork/Join framework, K/V store framework

Legend: User process, H2O process
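The server half of the retrieval can be pictured as a REST layer that maps GET /3/Models/<id> to a lookup in the K/V store. The sketch below is an in-process toy. The dictionary standing in for the K/V store, the stored fields and the JSON shape are hypothetical and do not reproduce H2O's actual response schema.

```python
# Toy REST layer: route GET /3/Models/<id> to a key/value lookup and answer with JSON.
import json

# Hypothetical stand-in for the K/V store; a real cluster would hold the trained model here.
kv_store = {"glm_model_id": {"algo": "glm", "status": "DONE"}}

def handle(method, path):
    """Return (status_code, json_body) for a request; only GET /3/Models/<id> is routed."""
    prefix = "/3/Models/"
    if method == "GET" and path.startswith(prefix):
        model_id = path[len(prefix):]
        model = kv_store.get(model_id)
        if model is not None:
            return 200, json.dumps({"model_id": model_id, **model})
    return 404, json.dumps({"error": "not found"})

status, body = handle("GET", "/3/Models/glm_model_id")
model = json.loads(body)
```

The R side would then turn such a JSON body into a model object for the script.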