H2O Rains with Databricks Cloud - NY 02.16.16

Date post: 06-Jan-2017
Upload: srisatish-ambati
Transcript
Page 1: H2O Rains with Databricks Cloud - NY 02.16.16

H2O Rains with Databricks Cloud

Michal Malohlava @mmalohlava

NYC Meetup 2016/02/16, SF

Page 2: H2O Rains with Databricks Cloud - NY 02.16.16

Who Am I?

Background

• PhD in CS from Charles University in Prague, Czech Republic

• Postdoc at Purdue University experimenting with algos for large-scale computation

• Now SW engineer at H2O.ai

Experience with domain-specific languages, distributed systems, software engineering, and big data.

Page 3: H2O Rains with Databricks Cloud - NY 02.16.16

H2O.ai

H2O team

Co-Founders: Sri Ambati, Cliff Click

Scientific Advisory Council: Stephen Boyd, Rob Tibshirani, Trevor Hastie

Page 4: H2O Rains with Databricks Cloud - NY 02.16.16

H2O

Open-Source In-Memory Data Science Platform

• Highly optimized Java code (in-house)

• Distributed in-memory K-V store and map/reduce computation framework

• Data parser (HDFS, S3, NFS, HTTP, local drives, etc.)

• Read/write access to distributed data frames (R/Pandas-style)

• ML algos - Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles

• REST API with clients: interactive UI, R, Python
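
The map/reduce bullet above can be illustrated with a minimal sketch. It does not use H2O's own Java API; the chunking scheme and the names `split_into_chunks`, `map_chunk`, `reduce_partials`, and `column_mean` are hypothetical and chosen only for illustration. The idea shown is that a column is split into chunks, each chunk is summarized independently (map), and the partial results are merged pairwise (reduce).

```python
from functools import reduce
from typing import List, Tuple


def split_into_chunks(column: List[float], chunk_size: int) -> List[List[float]]:
    """Split a column into fixed-size chunks (a stand-in for distributed partitions)."""
    return [column[i:i + chunk_size] for i in range(0, len(column), chunk_size)]


def map_chunk(chunk: List[float]) -> Tuple[float, int]:
    """Map step: summarize one chunk as (sum, count) without looking at other chunks."""
    return (sum(chunk), len(chunk))


def reduce_partials(a: Tuple[float, int], b: Tuple[float, int]) -> Tuple[float, int]:
    """Reduce step: merge two partial summaries; the merge is associative."""
    return (a[0] + b[0], a[1] + b[1])


def column_mean(column: List[float], chunk_size: int = 3) -> float:
    """Compute the mean of a column by mapping over chunks and reducing the partials."""
    partials = [map_chunk(chunk) for chunk in split_into_chunks(column, chunk_size)]
    total, count = reduce(reduce_partials, partials, (0.0, 0))
    return total / count if count else float("nan")
```

Because the reduce step is associative, the partial summaries could in principle be merged in any order, which is what makes this pattern suitable for data spread over several nodes.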

Page 5: H2O Rains with Databricks Cloud - NY 02.16.16

H2O + Spark = Sparkling Water

Page 6: H2O Rains with Databricks Cloud - NY 02.16.16

Spark

Open-source distributed execution platform

User-friendly API for data transformation based on RDDs, DataFrames (from 1.4) and DataSets (from 1.6)

Platform components - SQL, MLlib, text mining, Avro, Redshift, Kinesis.

Easily extendable by 3rd party packages

Interactive shell

Current release 1.6

Supported releases 1.3, 1.4, 1.5

Page 7: H2O Rains with Databricks Cloud - NY 02.16.16

Databricks

Databricks

• founded by the creators of Apache Spark

• still contribute 75% of the code to the Spark project

• cloud platform for running Spark in your AWS account

Databricks Platform

• integrated collaborative data science workspace

• notebook interface inspired by IPython and Zeppelin but purpose built for Spark

• self service cluster manager and job scheduler for production Spark workloads

Page 8: H2O Rains with Databricks Cloud - NY 02.16.16

Sparkling Water

Provides

Transparent integration of H2O with Spark ecosystem

Transparent use of H2O data structures and algorithms with Spark API

Platform for building Smarter Applications

Excels in existing Spark workflows requiring advanced Machine Learning algorithms

Functionality missing in H2O can be replaced by Spark and vice versa

Page 9: H2O Rains with Databricks Cloud - NY 02.16.16

How to use Sparkling Water?

Page 10: H2O Rains with Databricks Cloud - NY 02.16.16

Model Building

Data Source

Data munging

Modelling

Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles

Prediction processing

Page 11: H2O Rains with Databricks Cloud - NY 02.16.16

Data Munging

Data Source

Data load/munging/exploration

Modelling

Page 12: H2O Rains with Databricks Cloud - NY 02.16.16

Stream processing

Data Source

Off-line model training

Data munging

Model prediction

Deploy the model

Stream processing

Data Stream

Spark Streaming/Storm

Export model in a binary format or as code

Modelling
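
The flow on this slide (train off-line, export the model, then score a data stream) can be sketched roughly as follows. This is not H2O's export mechanism: `pickle` stands in for "a binary format", a tiny nearest-centroid classifier stands in for the real model, and all function names (`train_offline`, `export_model`, `load_model`, `predict`, `score_stream`) are hypothetical.

```python
import pickle
from typing import Dict, Iterable, Iterator, List, Tuple

Vector = List[float]
Model = Dict[str, Vector]  # label -> centroid


def train_offline(rows: List[Tuple[Vector, str]]) -> Model:
    """Off-line training: compute one centroid per label."""
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def export_model(model: Model) -> bytes:
    """Export step: serialize the trained model to bytes (assumed binary format)."""
    return pickle.dumps(model)


def load_model(blob: bytes) -> Model:
    """Deployment step: restore the model inside the stream-processing job."""
    return pickle.loads(blob)


def predict(model: Model, features: Vector) -> str:
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def score_stream(model: Model, stream: Iterable[Vector]) -> Iterator[str]:
    """Stream processing: score records one at a time as they arrive."""
    for features in stream:
        yield predict(model, features)
```

The point of the split is that training and scoring run in different places: only the serialized model crosses the boundary, so the streaming side needs no access to the training data.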

Page 13: H2O Rains with Databricks Cloud - NY 02.16.16

What is inside?

Page 14: H2O Rains with Databricks Cloud - NY 02.16.16

Databricks

Worker node

Spark executor

Scala/Py main program

Driver node

H2OContext

SparkContext

Worker node

Spark executor

Worker node

Spark executor

Page 15: H2O Rains with Databricks Cloud - NY 02.16.16

Spark Cluster

Spark Executor / H2O Services

Spark Executor / H2O Services

Spark Executor / H2O Services

Data Source

DataFrame

Data Source

H2OFrame

h2oContext.asDataFrame

h2oContext.asH2OFrame

Page 16: H2O Rains with Databricks Cloud - NY 02.16.16

DEMO Time!

Page 17: H2O Rains with Databricks Cloud - NY 02.16.16

What do we need?

Databricks account (14-day free trial at www.databricks.com)

AWS account

Sparkling Water coordinates: ai.h2o:sparkling-water-examples_2.10:1.5.10

And some cool machine learning idea!

Page 18: H2O Rains with Databricks Cloud - NY 02.16.16

OR

Detect spam text messages

Page 19: H2O Rains with Databricks Cloud - NY 02.16.16

Data sample

Page 20: H2O Rains with Databricks Cloud - NY 02.16.16

Goal

For a given text message

identify if it is spam or not

Page 21: H2O Rains with Databricks Cloud - NY 02.16.16

Machine Learning Workflow

1. Extract data

2. Transform, tokenize messages

3. Build TF-IDF model

4. Create and evaluate Deep Learning model

5. Use the model to detect spam
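
Steps 2 and 3 of the workflow above can be sketched in plain Python. This is not the demo's code, which runs on Spark and H2O; the tokenization rule, the smoothed IDF formula log((N + 1) / (df + 1)), and the function names are assumptions made for illustration, and the three sample messages are invented.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(message: str) -> List[str]:
    """Lowercase, split on non-alphanumeric characters, drop tokens shorter than 2 chars."""
    return [t for t in re.split(r"[^a-z0-9]+", message.lower()) if len(t) >= 2]


def inverse_document_frequency(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed IDF: log((N + 1) / (df + 1)), where df counts documents containing the term."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((n + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(doc: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Weight each term by its raw count in the document times its IDF."""
    term_freq = Counter(doc)
    return {term: count * idf.get(term, 0.0) for term, count in term_freq.items()}


if __name__ == "__main__":
    # Invented sample messages, for illustration only.
    messages = ["Free prize! Call now", "Are we meeting now?", "Call me later"]
    docs = [tokenize(m) for m in messages]
    idf = inverse_document_frequency(docs)
    for message, doc in zip(messages, docs):
        print(message, "->", tf_idf(doc, idf))
```

Terms that occur in few messages (such as "free") get a higher weight than terms shared across messages (such as "now"); the resulting vectors are the kind of input a classifier in step 4 would be trained on.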

Page 22: H2O Rains with Databricks Cloud - NY 02.16.16

More info

Check out H2O.ai Training Books

http://learn.h2o.ai/

Check out H2O.ai Blog

http://h2o.ai/blog/

Check out H2O.ai Youtube Channel

https://www.youtube.com/user/0xdata

Check out GitHub

https://github.com/h2oai/sparkling-water

Meetups

https://meetup.com/

Page 23: H2O Rains with Databricks Cloud - NY 02.16.16

Learn more at h2o.ai

Follow us at @h2oai

Thank you!

Sparkling Water is an open-source ML application platform combining the power of Spark and H2O

