
H2O Rains with Databricks Cloud - Parisoma SF

Date post: 06-Jan-2017
Upload: srisatish-ambati
H2O Rains with Databricks Cloud Michal Malohlava @mmalohlava Meetup 2016/02/04, SF
Transcript
Page 1: H2O Rains with Databricks Cloud - Parisoma SF

H2O Rains with Databricks Cloud

Michal Malohlava @mmalohlava

Meetup 2016/02/04, SF

Page 2: H2O Rains with Databricks Cloud - Parisoma SF

Who Am I?

Background

• PhD in CS from Charles University in Prague, Czech Republic

• Postdoc at Purdue University experimenting with algos for large-scale computation

• Now SW engineer at H2O.ai

Experience with domain-specific languages, distributed systems, software engineering, and big data.

Page 3: H2O Rains with Databricks Cloud - Parisoma SF

H2O.ai

H2O team

Co-Founders: Sri Ambati, Cliff Click

Scientific Advisory Council: Stephen Boyd, Rob Tibshirani, Trevor Hastie

Page 4: H2O Rains with Databricks Cloud - Parisoma SF

H2O

Open-Source In-Memory Data Science Platform

• Highly optimized Java code (in-house)

• Distributed in-memory K-V store and map/reduce computation framework

• Data parser (HDFS, S3, NFS, HTTP, local drives, etc.)

• Read/write access to distributed data frames (R/Pandas-style)

• ML algos - Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles

• REST API with clients: interactive UI, R, Python
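
The map/reduce idea in the list above can be illustrated with a toy, single-process sketch. It is not H2O's implementation (which the slide describes as in-house Java running across a cluster); the helper names `chunked` and `map_reduce` are hypothetical and exist only for this example. The point is that each chunk is processed independently and only small partial results are combined.

```python
from functools import reduce
from typing import Callable, List, Sequence, Tuple


def chunked(column: Sequence[float], chunk_size: int) -> List[Sequence[float]]:
    """Split one column into fixed-size chunks (a stand-in for distributed chunks)."""
    return [column[i:i + chunk_size] for i in range(0, len(column), chunk_size)]


def map_reduce(
    chunks: List[Sequence[float]],
    map_fn: Callable[[Sequence[float]], Tuple[float, int]],
    reduce_fn: Callable[[Tuple[float, int], Tuple[float, int]], Tuple[float, int]],
) -> Tuple[float, int]:
    """Apply map_fn to every chunk independently, then combine the partial results."""
    partials = [map_fn(chunk) for chunk in chunks]
    return reduce(reduce_fn, partials)


# Mean of a column computed from per-chunk (sum, count) pairs.
column = [3.0, 5.0, 7.0, 9.0, 11.0]
chunks = chunked(column, chunk_size=2)
total, count = map_reduce(
    chunks,
    map_fn=lambda c: (sum(c), len(c)),
    reduce_fn=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean = total / count
```

In a real cluster the map step would run where each chunk lives, so only the `(sum, count)` pairs travel over the network.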

Page 5: H2O Rains with Databricks Cloud - Parisoma SF

H2O + Spark = Sparkling Water

Page 6: H2O Rains with Databricks Cloud - Parisoma SF

Apache Spark: open-source distributed execution platform

User-friendly API for data transformation based on RDDs, DataFrames (from 1.4) and DataSets (from 1.6)

Platform components - SQL, MLlib, text mining, Avro, Redshift, Kinesis.

Easily extendable by 3rd party packages

Interactive shell

Current release 1.6

Supported releases 1.3, 1.4, 1.5

Page 7: H2O Rains with Databricks Cloud - Parisoma SF

Databricks

• founded by the creators of Apache Spark

• still contribute 75% of the code to the Spark project

• cloud platform for running Spark in your AWS account

Databricks Platform

• integrated collaborative data science workspace

• notebook interface inspired by IPython and Zeppelin but purpose-built for Spark

• self-service cluster manager and job scheduler for production Spark workloads

Page 8: H2O Rains with Databricks Cloud - Parisoma SF

Sparkling Water

Provides

Transparent integration of H2O with Spark ecosystem

Transparent use of H2O data structures and algorithms with Spark API

Platform for building Smarter Applications

Excels in existing Spark workflows requiring advanced Machine Learning algorithms

Functionality missing in H2O can be replaced by Spark and vice versa

Page 9: H2O Rains with Databricks Cloud - Parisoma SF

How to use Sparkling Water?

Page 10: H2O Rains with Databricks Cloud - Parisoma SF

Model Building

Data Source

Data munging

Modelling

Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles

Prediction processing

Page 11: H2O Rains with Databricks Cloud - Parisoma SF

Data Munging

Data Source

Data load/munging/exploration

Modelling

Page 12: H2O Rains with Databricks Cloud - Parisoma SF

Stream processing

Off-line model training: Data Source, Data munging, Modelling

Export model in a binary format or as code

Deploy the model

Stream processing: Data Stream, Spark Streaming/Storm, Model prediction
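
The flow on this slide (train off-line, export the model, deploy it, score a stream) can be sketched in a few lines. This is a toy under stated assumptions, not the H2O or Spark Streaming API: the "model" is a hypothetical word-weight table, the JSON bytes stand in for an exported binary model, and a plain Python iterator stands in for the data stream. All function names are made up for the example.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def train_offline(rows: List[Tuple[str, str]]) -> Dict[str, int]:
    """Toy training: each word gets +1 per spam message and -1 per ham message."""
    weights: Counter = Counter()
    for label, text in rows:
        step = 1 if label == "spam" else -1
        for word in set(text.lower().split()):
            weights[word] += step
    return dict(weights)


def export_model(model: Dict[str, int]) -> bytes:
    """Serialize the model to bytes (a stand-in for a binary export)."""
    return json.dumps(model, sort_keys=True).encode("utf-8")


def load_model(blob: bytes) -> Dict[str, int]:
    """Deploy step: rebuild the model from the exported bytes."""
    return json.loads(blob.decode("utf-8"))


def score_stream(model: Dict[str, int], stream: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Score messages one by one as they arrive; unknown words contribute 0."""
    for text in stream:
        score = sum(model.get(word, 0) for word in text.lower().split())
        yield text, ("spam" if score > 0 else "ham")


training = [
    ("spam", "win a free prize now"),
    ("ham", "see you at lunch"),
    ("spam", "free entry win cash"),
    ("ham", "are you free for lunch"),
]
blob = export_model(train_offline(training))   # off-line part
deployed = load_model(blob)                    # deployment
predictions = list(score_stream(deployed, iter(["win cash now", "lunch at noon"])))
```

The training code and the scoring code share only the exported bytes, which is what lets the scoring side run in a separate streaming job.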

Page 13: H2O Rains with Databricks Cloud - Parisoma SF

What is inside?

Page 14: H2O Rains with Databricks Cloud - Parisoma SF

Databricks

Driver node: Scala/Py main program with H2OContext and SparkContext

Worker node (×3): Spark executor

Page 15: H2O Rains with Databricks Cloud - Parisoma SF

Spark Cluster: three Spark Executors, each running H2O Services

Data Source → DataFrame

Data Source → H2OFrame

DataFrame → H2OFrame: h2oContext.asH2OFrame

H2OFrame → DataFrame: h2oContext.asDataFrame
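
The two conversion calls named on this slide, `h2oContext.asH2OFrame` and `h2oContext.asDataFrame`, move the same data between two representations. The sketch below is only a conceptual illustration in plain Python and assumes, roughly, a row-oriented view on one side and a column-oriented view on the other. The functions `as_columnar` and `as_rows` are hypothetical and say nothing about how Sparkling Water actually implements the conversion.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Columns = Dict[str, List[Any]]


def as_columnar(rows: List[Row]) -> Columns:
    """Turn a list of records into one list per column."""
    if not rows:
        return {}
    names = list(rows[0].keys())
    return {name: [row[name] for row in rows] for name in names}


def as_rows(columns: Columns) -> List[Row]:
    """Turn per-column lists back into a list of records."""
    if not columns:
        return []
    names = list(columns.keys())
    length = len(columns[names[0]])
    return [{name: columns[name][i] for name in names} for i in range(length)]


rows = [
    {"label": "ham", "length": 16},
    {"label": "spam", "length": 42},
]
columns = as_columnar(rows)
round_trip = as_rows(columns)
```

The round trip returns the original records, which is the property that lets a workflow hand data to the other engine and take results back.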

Page 16: H2O Rains with Databricks Cloud - Parisoma SF

DEMO Time!

Page 17: H2O Rains with Databricks Cloud - Parisoma SF

What do we need?

Databricks account (14 day free trial at www.databricks.com)

AWS account

Sparkling Water coordinates: ai.h2o:sparkling-water-examples_2.10:1.5.10

And some cool machine learning idea!
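
The Sparkling Water coordinates listed above follow the usual Maven `group:artifact:version` layout, and the `_2.10` suffix is the conventional marker for the Scala binary version the artifact was built against. A small sketch that splits the string into those parts follows; the `Coordinate` type and `parse_coordinate` function are hypothetical helpers for illustration, not part of any Databricks or H2O tooling.

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    group_id: str
    artifact: str
    scala_version: str  # empty string when the artifact has no _<scala> suffix
    version: str


def parse_coordinate(text: str) -> Coordinate:
    """Split 'group:artifact_scala:version' into its parts."""
    group_id, artifact_id, version = text.split(":")
    artifact, sep, scala_version = artifact_id.rpartition("_")
    if not sep:
        # No underscore suffix: treat the whole id as the artifact name.
        artifact, scala_version = artifact_id, ""
    return Coordinate(group_id, artifact, scala_version, version)


coord = parse_coordinate("ai.h2o:sparkling-water-examples_2.10:1.5.10")
```

Reading the parts separately makes it easier to check that the Scala version of the package matches the Scala version of the cluster it is attached to.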

Page 18: H2O Rains with Databricks Cloud - Parisoma SF

OR

Detect spam text messages

Page 19: H2O Rains with Databricks Cloud - Parisoma SF

Data sample

Page 20: H2O Rains with Databricks Cloud - Parisoma SF

Goal

For a given text message

identify if it is spam or not

Page 21: H2O Rains with Databricks Cloud - Parisoma SF

Machine Learning Workflow

1. Extract data

2. Transform, tokenize messages

3. Build TF-IDF model

4. Create and evaluate Deep Learning model

5. Use the model to detect spam
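
Steps 2 and 3 of the workflow above can be sketched without any cluster. The example below is a minimal plain-Python illustration, not the demo's notebook code: the sample messages are invented, the tokenizer regex is an assumption, and the smoothed IDF formula `log((n + 1) / (df + 1))` is chosen to match the one documented for Spark MLlib's IDF. Model training and scoring (steps 4 and 5) are left out.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(message: str) -> List[str]:
    """Lowercase the message and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", message.lower())


def inverse_document_frequency(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed IDF per term: log((n + 1) / (df + 1))."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((n + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(doc: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Weight each term's raw count in the document by its IDF."""
    counts = Counter(doc)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


# Invented sample messages, for illustration only.
messages = [
    "Free entry! Win a prize now",
    "Are you free for lunch?",
    "Win cash now",
]
docs = [tokenize(m) for m in messages]
idf = inverse_document_frequency(docs)
vectors = [tf_idf(doc, idf) for doc in docs]
```

Terms that appear in many messages (such as "free" here) get a lower weight than terms that appear in only one, which is what makes the resulting vectors useful as model input.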

Page 22: H2O Rains with Databricks Cloud - Parisoma SF

More info

Check out H2O.ai Training Books

http://learn.h2o.ai/

Check out H2O.ai Blog

http://h2o.ai/blog/

Check out H2O.ai YouTube Channel

https://www.youtube.com/user/0xdata

Check out GitHub

https://github.com/h2oai/sparkling-water

Meetups

https://meetup.com/

Page 23: H2O Rains with Databricks Cloud - Parisoma SF

Thank you!

Sparkling Water is an open-source ML application platform combining the power of Spark and H2O

Learn more at h2o.ai

Follow us at @h2oai

