Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform

Post on 07-Jan-2017

140 views 0 download

transcript

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential © 2016 MapR Technologies1

Real-World Machine Learning - Leverage the Features of MapR Converged Data PlatformMathieu Dumoulin (mdumoulin@mapr.com) Mateusz Dymczyk (mateusz@h2o.ai)

Hadoop Summit Tokyo 2016

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 2

Today’s goals• Machine Learning projects in the Enterprise

have a LOT of requirements beyond training a

good ML model

• Current options are too complex

• Need a Converged Data Platform

• Introduce specific features useful for ML: – MapR-FS, Volumes, Mirrors and Topologies

– MapR-DB and MapR Streams

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 3

Mathieu Dumoulin, Data Engineer• Master’s degree in text classification on Hadoop at Fujitsu Canada’s Innovation Lab

• In Tokyo, I’ve worked as a Data Scientist, Search Engineer and Data Engineer

• I like Scikit-Learn and H2O •日本料理が大好き。とくに鍋としゃぶしゃぶです。

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 4

Mateusz Dymczyk, Software Engineer• M.Sc. in CS (Software and

System Engineering) @ AGH

UST, Poland

• Ph.D. (Machine Learning) dropout

• Software Engineer @ H2O.ai

• Previously ML/NLP @ Fujitsu

Laboratories and en-japan inc

• I’m taking Sommelier classes

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 5

A common machine learning pipeline

*Image from scikit-learn.org

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 6

… meets the real world (Enterprise IT)

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 7

… meets the real worldData comes from many sources maybe very large

Data isn’t always labeled!

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 8

… meets the real worldData comes from many sources, maybe very large

Needs ETL and cleaning

Finding the best algorithm and parameters can use a lot of CPU

Data isn’t always labeled!

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 9

… Meets the real worldData comes from many sources, maybe very large

Needs ETL and cleaning

Finding the best algorithm and parameters can use a lot of CPU

Data isn’t always labeled!

From production systems? Is it real time?

What server will serve predictions?

The predictions are used by another system...

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 10

Machine learning here...

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 11

Is not the same when you do it here

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 12

Enterprise machine learning mattersGrowing number of ML use cases at successful companies

Anomaly Detection 異常検出

Customer 360Fraud Detection 不正検出

Log Security Analysis ログ分析

Recommender Engines

レコメンデーションSensor Data Analysis (IoT)

Personalized Offers 個人化

Ad Tech

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 13

…but it’s HARD

Ref: http://advancedspark.com/ , https://github.com/fluxcapacitor/pipeline

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 14

There must be a better way...

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 15

Big data Enterprise IT infrastructure for ML

• You can start simple and show value quickly • It just works. Easy configuration and administration.

• Works with existing systems, and tools

• Includes common basics (File storage, DB, Streams)

• Strong ecosystem support (Apache projects)

• Enterprise class (multi-tenancy, security, HA, support)

An ideal platform for ML:

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential © 2016 MapR Technologies 16

MapR Converged Data Platform

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 17

MapR Converged Data Platform

Open Source Engines & Tools Commercial Engines & Applications

Utility-Grade Platform Services

Dat

aP

roce

ssin

g

Enterprise StorageMapR-FS MapR-DB MapR Streams

Database Event Streaming

Global Namespace High Availability Data Protection Self-healing Unified Security Real-time Multi-tenancy

Search & Others

Cloud & Managed Services

Custom Apps

Unified M

anagement and M

onitoring

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 18

MapR is great for Enterprise ML projects

●MapR-FS and NFS mount

●Volumes and Topologies

●Mirrors and Snapshots

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 19

MapR Filesystem

•Native implementation in C/C++, it’s fast •Use it like your own local filesystem •Everything that can use files works as usual •Unique MapR technology

•For more info watch on Youtube: •What is MapR-FS •MapR-FS vs. HDFS

Working, battle-tested distributed read-write filesystem

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 20

NFS MountMount the cluster as a regular folder

$> sudo mount -o hard,nolock ip-10-0-0-110:/mapr /mapr $> ll /mapr/hadoopsummit/ total 3 drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:21 appsdrwxr-xr-x. 2 mapr mapr 0 Oct 13 11:12 hbasedrwxr-xr-x. 3 root root 1 Oct 13 11:21 installerdrwxr-xr-x. 2 mapr mapr 0 Oct 13 11:14 optdrwxrwxrwx. 2 mapr mapr 1 Oct 14 10:41 tmpdrwxr-xr-x. 6 mapr mapr 4 Oct 14 10:52 userdrwxr-xr-x. 3 mapr mapr 1 Oct 13 11:13 var

© 2014 MapR Technologies 21

MapR NFS and Volumes

[mapr@ip-10-0-0-110 mapr]$ pwd /mapr/hadoopsummit/user/mapr

© 2014 MapR Technologies 22

MapR NFS and Volumes

[mapr@ip-10-0-0-110 mapr]$ pwd /mapr/hadoopsummit/user/mapr

© 2014 MapR Technologies 23

MapR NFS and Volumes

[mapr@ip-10-0-0-110 mapr]$ pwd /mapr/hadoopsummit/user/mapr

© 2014 MapR Technologies 24

MapR-FS and NFS mount for ML• Get started quickly and simply • Use your favorite tool like...

– Custom code (Scikit-learn, R) – SPSS, SAS, RapidMiner – Apache Spark, Drill, Flink

• Super easy data import – Just save to file on MapR – Integrate with legacy servers

and code – Use any ecosystem (Sqoop) it

all works

• Quick and scalable roundtrip during development

– ETL/cleaning -> train/test -> predict

– Don’t copy data (cluster to cluster, local to cluster)

• Run in production direct from the cluster

– no copying around

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 25

Volumes and Topologies - Managed in MCS

© 2014 MapR Technologies 26

Volumes and TopologiesVolumes are just “regular” volumes

© 2014 MapR Technologies 27

Volumes and TopologiesVolumes are just “regular” volumes

Select what nodes for volume data = Topology

© 2014 MapR Technologies 28

Volumes and Topologies for ML

• With YARN’s Node Labels, run tasks on nodes with guaranteed data locality – Special nodes with GPU, high memory or big CPU

• Multi-Tenancy – Share cluster with business use cases in production – Data isolation guaranteed – Easy unified admin (Data scientists != Hadoop

admin) – Bigger cluster, more reliable and faster

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 29

Snapshots and Mirrors

© 2014 MapR Technologies 30

Snapshots and Mirrors

© 2014 MapR Technologies 31

Snapshots and Mirrors

© 2014 MapR Technologies 32

Snapshots - Instant point in time save

© 2014 MapR Technologies 33

Mirrors - Physical copy

© 2014 MapR Technologies 34

Snapshots

[... mateusz]$ cd .snapshot [... .snapshot]$ ll total 1 drwxr-xr-x. 2 mapr mapr 1 Oct 14 10:56 mateusz.snap1

© 2014 MapR Technologies 35

Snapshots and Mirrors for ML

• Versioned data and models = Repeatable results

– same model, same data guaranteed

– Go back in time for free

• Keep intermediate transformations

– Quickly change your mind, don’t redo work

• A/B Testing easy-mode

© 2014 MapR Technologies 36

Real-time events and DB for ML• Built-in, no config, it just works • Support next-gen use cases

– hyper-personalization of web/store content – IoT Sensor data

• easy to start small but grows with your data/use case

© 2014 MapR Technologies 37

MapR Converged Application Blueprint

• Microservices connected by real-time streams – Ideal to serve predictions from ML models

• Next-Generation large-scale architecture • Working example: https://www.mapr.com/appblueprint/

overview

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 38

Converged Data Platform 💖 Machine Learning

• Features that work together to support all phases of ML

• Supports your existing tools/code and the state of the art

large scale frameworks

• Easier to manage, more robust and secure.

• MapR is made for the enterprise and great for ML!

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 39

Demo of H2O on MapR: Features in Action

Agenda

• Why tooling matters in Machine Learning • What is H2O and Sparkling Water • Why MapR • Demo

ML project problems

• Multiple data sources • Different formats • Large volumes of data to be read • System bootstrap time • Collaboration between data scientists • Comparing models • Deployment of the model • Versioning • Too many moving parts! • etc.etc.

Successful ML platform

• Fast ingestion and manipulation of versatile data • Intuitive modeling UI/API • Easy model validation, visualisation and comparison • Easy model deployment w/ versioning for fast predictions

• Written in high performance Java - native Java API

• Supports multiple file formats and data sources

• ETL capabilities

• Highly paralleled and distributed implementation

• Fast in-memory computation on highly compressed data

• Allows you to use all your data without sampling

• Runs on top of most major Hadoop distributions

ML platform

Ingestions platform

Big data platform

What is H2O?

• Open source platform

• Exposes math and predictive algorithms

• GLM, Random Forest, GBM, Deep Learning etc.

FlowUI

• Notebook style open source interface for H2O

• Code execution, mathematics, plots, and rich media

Why H2O?

• Fast ingestion and manipulation of versatile data • Blazing fast data parsing, supports multiple formats and

data sources • Intuitive modeling UI/API

• FlowUI, R/Python/REST APIs • Easy model validation, visualisation and comparison

• Cross-validation, FlowUI graphs, comparison via Steam • Easy model deployment /w versioning for fast predictions

• Model export as POJO, deploy as service via Steam

What is Sparkl ing Water?

• Framework integrating Spark and H2O • H2O instances on Spark executors • Allows to call Spark and H2O methods together

Why MapR?

• H2O + MapR-FS = fast data ingestion made even faster • Data resilience • MapR snapshots + H2O modelling from checkpoints =

continuous and versioned modelling

Demo

Air l ine delay classif ication

Model predicting flight delays

ETL Modell ing Predict ions

Load data from CSVs Model using H2O’s GLM

* https://github.com/h2oai/sparkling-water/tree/master/examples/scripts

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 50

Q & A@mapr

mdumoulin@mapr.com

Engage with us!

mapr-technologies