Self-Service Data Science for Leveraging ML & AI on All of Your Data

© 2017 MapR Technologies. MapR Confidential

Self-Service Data Science for Leveraging ML & AI on All of Your Data: Introducing the MapR Data Science Refinery

Rachel Silver, Product Manager – Data Science & Analytics

11/16/17


Summary

• Why Companies Invest In ML/AI

• Winning With a Data First Approach

• Introducing the MapR Data Science Refinery

• Deep Dive & Demos

– Ease of Deployment

– Data Exploration

– Extensibility & Collaboration


Why Companies Invest In ML/AI


Where AI Creates Value In The Value Chain

• Project: Smarter R&D and forecasting

• Produce: Optimized production & maintenance

• Promote: Targeted sales & marketing

• Provide: Rich, personal, and convenient user experiences

Source: McKinsey Global Institute – Artificial Intelligence / The Next Digital Frontier? (2017)


Project Where The Next Threat Will Come From

Deep security analytics and advanced persistent threat (APT) detection

• Centralization and visibility of all data from an information security perspective

• Reduced risk of data breaches from DDoS and APT attacks

• Real-time insights into what is happening within the environment

OBJECTIVE

• Early detection of data breaches and suspicious activity

• Aggregate and retain all security-related data in a single central store, then build statistical models to detect abnormal activity within the environment

• Gain insight into what insiders are doing within the environment

CHALLENGES

• Existing SIEM solution could not scale

• Current solutions do not work well for “unknown” threats

SOLUTION

• Leverage MapR-DB for fast data ingestion and query performance

• MapR provided the deep storage and machine learning algorithms

• NFS enabled easy integration with the IT ecosystem

Retail Bank


Produce More Efficiently

ML aggregation and processing at the edge optimizes production

Oil & Gas company

Before Edge: Sources 1, 2, ... 1000 → MapR core cluster (Houston). Manual process. Time to insight: 48 hrs.

After Edge: 1000s of oil & drill sources, with pre-processing done locally and at the core (custom app + down sampling) → MapR core cluster (Houston). Automated process. Time to insight: <2 hrs.


Promote personalized offers in real-time

Targeting credit card customers using a Recommendation Engine

Leading Credit Card Company

A global financial services company wanted to offer real-time, localized, and personalized recommendations to its credit card holders using ML/AI.

OBJECTIVE

• Increase revenue and customer loyalty through real-time personalized offers generated by a recommendation engine

CHALLENGES

• To be accurate, data had to be updated in real time
• As a global company, their platform has to be consistent and 100% available 24x7, with no downtime
• Must be able to simultaneously ingest (stream) and update data in the same cluster

SOLUTION

• MapR was the only distribution that met the mission-critical needs of the customer and also provided the capability to ingest data continuously into the cluster
• Direct NFS allows data to be continuously ingested directly into their cluster
• MapR-XD’s self-healing capability allowed them to go into production safely


Provide Customers With a Customized Experience

Provide customers with a personalized and convenient experience

Using ML/AI to bring customer understanding to the center of business processes

OBJECTIVE

• Use full knowledge of the customer relationship to inform online interactions

CHALLENGES

• Need to store 20 trillion records
• Training sample size is 400 million records
• The decision trees contained 2 million possible pathways
• Every combination must be evaluated every time a model is used (~15 billion combinations)

SOLUTION

• The MapR Converged Data Platform centralizes analytics and operational apps on one platform, allowing Quantium to make one large infrastructure investment instead of many small siloed ones. The current cluster has 50TB of memory and 5000 CPUs to process and store 5PB of data.


A Winning Approach: Data First


Entry Points in the Data Science Journey

• Contemplators: 40%
• Experimenters: 41%
• Adopters: 20%

Chart callouts: "Uncertain about the benefits of Data Science.", "Desire easy entry", "80%!"

• Gartner estimates they solve between 3-20 business problems in three to five years.
• Gartner estimates they solve between 10-100 business problems in three to five years.
• AI adoption outside of the tech sector is stuck here, and many firms report they are uncertain of the ROI.
• Investment in AI is growing at a high rate, but adoption in 2017 remains low.
• AI is only deployed into production 12% of the time.

Source: McKinsey Global Institute – Artificial Intelligence / The Next Digital Frontier? (2017)

Source: Gartner – Magic Quadrant for Data Science Platforms (2017)

Key Traits Of A Successful Data Science Approach

• Seamless Data Access

• Technical Capabilities (a strong digital foundation)

• Leadership From The Top


Seamless Data Access

If it is ALL about the data, then it better be about ALL your data.


ML Models Improve when Trained on Larger Datasets

Instead of relying on assumptions and weak correlations, the presence of more data results in better and more accurate models.

Source: A Survey of Applications of AI Algorithms in Eco-environmental modelling (2009)
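
The effect of training-set size can be illustrated with a small simulation. The sketch below is a toy under stated assumptions, not something taken from the cited survey: it fits a one-variable least-squares slope to synthetic data (an assumed true slope of 2.0 plus Gaussian noise) and averages the estimation error over repeated trials for several sample sizes. The names `fit_slope` and `mean_slope_error` are illustrative.

```python
import random

TRUE_SLOPE = 2.0  # assumed ground truth for the synthetic data


def fit_slope(xs, ys):
    """Ordinary least-squares slope for a line through the origin."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den


def mean_slope_error(n_samples, n_trials=200, noise=1.0, seed=0):
    """Average absolute error of the fitted slope over repeated trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
        ys = [TRUE_SLOPE * x + rng.gauss(0.0, noise) for x in xs]
        total += abs(fit_slope(xs, ys) - TRUE_SLOPE)
    return total / n_trials


if __name__ == "__main__":
    # Larger training sets should give a smaller average error.
    for n in (10, 100, 1000):
        print(n, round(mean_slope_error(n), 4))
```

In this toy setting the average error shrinks roughly with the square root of the sample size; real models and real data will behave less cleanly.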


Data Growth Puts A Premium on Efficient Leverage

The amount of data is predicted to double every three years.

Source: McKinsey Global Institute: “The Age of Analytics”, Dec. 2016

Data volume over time:
• 1986: 3 Exabytes of Data
• 2011: 300 Exabytes of Data
• 2016: 2 Zettabytes of Data
• 2019: 4 Zettabytes of Data

Data Diversity: emails, call detail records, click stream, CSV, documents, PDF, billing data, metadata, JSON, network data, mobile data, XML, product catalog, medical records, text files, video, text messages, merchant listings, sensor data, server logs, set top box, social media, audio
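
The "double every three years" prediction corresponds to a simple exponential formula: volume(t) = volume(t0) × 2^((t − t0) / 3). The sketch below applies it. `project_volume` is an illustrative name, and the starting point of 2 zettabytes in 2016 is read from the slide's timeline rather than from an independent source.

```python
def project_volume(start_volume, start_year, target_year, doubling_years=3.0):
    """Project data volume assuming it doubles every `doubling_years` years."""
    elapsed = target_year - start_year
    return start_volume * 2 ** (elapsed / doubling_years)


if __name__ == "__main__":
    # Assumed starting point: 2 zettabytes in 2016 (from the slide's timeline).
    for year in (2016, 2019, 2022, 2025):
        print(year, project_volume(2.0, 2016, year), "ZB")
```

Starting from 2 zettabytes in 2016, one doubling period gives 4 zettabytes in 2019, which agrees with the figure on the slide.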


Hadoop + Vendor Approach to Data Science

Requires yet another cluster

[Diagram: separate on-premises clusters: Batch Cluster, Streaming Cluster, NoSQL Cluster, and a Data Science cluster]



A Capable Platform With a Strong Digital Foundation

MAPR CONVERGED DATA PLATFORM (on-premises, multi-cloud, IoT edge)

Diagram labels:
• Interfaces: NFS, POSIX, REST, HDFS, JSON, HBase, Kafka, SQL
• Platform services: file store, container store, metadata management
• Applications: custom file apps, Hadoop & Spark apps, real-time BI apps, streaming apps, IoT/edge
• Other labels: operational data hub, CDC, contextual user experiences, core business apps, single view, IoT


Real-time Machine Learning Pipelines

A Robust Microservices Framework

Event Streams
• Persistent
• Infinitely replicable
• Re-playable

Compare model results live!

[Diagram labels: Model A, Model B, Persistent Client & Application Containers]
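
Because the event stream is persistent and re-playable, the same events can be fed to two models and their outputs compared side by side. The sketch below illustrates only that idea. A Python list stands in for the stream (a real deployment would read from MapR-ES through a stream consumer, which is not shown), and `model_a` and `model_b` are hypothetical threshold rules, not models from the talk.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, float]


def model_a(event: Event) -> bool:
    """Hypothetical model A: flag readings above a fixed threshold."""
    return event["value"] > 0.8


def model_b(event: Event) -> bool:
    """Hypothetical model B: the same rule with a stricter threshold."""
    return event["value"] > 0.9


def replay_and_compare(
    stream: Iterable[Event],
    first: Callable[[Event], bool],
    second: Callable[[Event], bool],
) -> List[Event]:
    """Replay every event through both models and collect disagreements."""
    disagreements = []
    for event in stream:
        if first(event) != second(event):
            disagreements.append(event)
    return disagreements


if __name__ == "__main__":
    # A list stands in for a persisted stream; it can be replayed repeatedly.
    persisted = [{"value": 0.5}, {"value": 0.85}, {"value": 0.95}]
    print(replay_and_compare(persisted, model_a, model_b))
```

Keeping the models as plain callables means a third candidate can be compared against the same replayed events without touching the stream itself.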


Advice For Leadership

Avoid
• Creating new silos
• Looking for a one-trick pony
• Adopting tools that have unwieldy install, integration, and configuration processes
• Tools that don’t scale to broader enterprise use

Important
• Ensure secure role-based access to all data
• Adopt tools that meet the needs of a broad range of Data Science Teams
• Encourage adoption by making things easy, secure, and complete


Data Science @ MapR


The MapR Data Science Vision

A Holistic Approach To Self-Service Data Science

MAPR DATA SCIENCE REFINERY
An easy-to-deploy, secure, and extensible data science offering that leverages all existing platform assets

REFINERY DATA SCIENTISTS
Data Scientist led product-and-services offerings including Quick Start Solutions (QSS) & Training

REFINERY PARTNERSHIPS
Expand on what we offer in-product to meet the needs of all data science teams



MapR Data Science Refinery

An Enterprise-ready Data Science Notebook

Provides the ability to work across many engines in one visual space

• Apache Spark: Spark Streaming, SparkSQL, SparkR, and PySpark
• Apache Hive
• Apache Pig
• Apache Drill
• Python
• Shell access to MapR-FS
• Programmatic access to MapR-DB and MapR-ES in Spark

Pluggable Visualization Available via Helium!

[Diagram labels: MapR POSIX Client for Containers, MapR Converged Client for Containers]


MapR Data Science Refinery Benefits

Leverage Locally, On-premise, or in Cloud

Easy to Deploy
• A Docker image includes all the necessary bits (no more, no less) required to leverage MapR as a persistent data store for your data science output.
• Available on DockerHub

Secure
• Authentication occurs at a container level to ensure containerized applications only have access to data for which they are authorized.
• Communications are encrypted to ensure privacy when accessing data in MapR.

Extensible
• A Dockerfile is also available on GitHub, allowing you to further customize the image as needed to support your specific application needs.
• The Helium Framework enables pluggable visualization

[Diagram labels: MapR-XD (cloud-scale data store), MapR-DB (operational database), MapR-ES (event streaming); High Availability, Real-time, Unified Security, Multi-Tenancy, Disaster Recovery, Global Namespace]



Partner Integration: An Example

We’re enabling our partners to integrate with and use this product

DataScience.com Platform

Services

MapR DSR

Zeppelin Livy

JDBC

MapR Clients



Demo: Ease of Deployment & Data Exploration


Demo: Ease of Deployment

What’s in the command

docker run --rm -it \
  --cap-add SYS_ADMIN --cap-add SYS_RESOURCE \
  --device /dev/fuse \
  --memory 0 \
  -e MAPR_CLUSTER=my.cluster.com \
  -e MAPR_MEMORY=0 \
  -e MAPR_MOUNT_PATH=/mapr \
  -e MAPR_TZ=America/Los_Angeles \
  -e MAPR_CONTAINER_USER=mapr \
  -e MAPR_CONTAINER_UID=5000 \
  -e MAPR_CONTAINER_GROUP=mapr \
  -e MAPR_CONTAINER_GID=5000 \
  -e MAPR_CLDB_HOSTS=172.24.8.195,172.24.11.200,172.24.10.4 \
  -e MAPR_TICKETFILE_LOCATION=/tmp/maprticket_5000 \
  -e ZEPPELIN_SSL_PORT=9995 \
  -e HOST_IP=172.24.11.62 \
  -e MAPR_HS_HOST=172.24.8.195 \
  -p 9995:9995 -p 10000-10010:10000-10010 \
  -v /tmp/maprticket_5000:/tmp/maprticket_5000:ro \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  maprtech/data-science-refinery:v1.0_6.0.0_4.0.0_centos7
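
To make the structure of the command easier to see, the sketch below assembles the same kind of `docker run` argument list from a dictionary of environment settings. It is illustrative only: `build_docker_command` is a hypothetical helper, only a subset of the slide's flags and variables is reproduced, and nothing is executed.

```python
from typing import Dict, List

IMAGE = "maprtech/data-science-refinery:v1.0_6.0.0_4.0.0_centos7"


def build_docker_command(env: Dict[str, str], ticket_path: str, ssl_port: int) -> List[str]:
    """Assemble a docker run argument list in the shape used on the slide."""
    cmd = [
        "docker", "run", "--rm", "-it",
        # Capabilities and FUSE device, as passed in the slide's command.
        "--cap-add", "SYS_ADMIN", "--cap-add", "SYS_RESOURCE",
        "--device", "/dev/fuse",
    ]
    # One -e flag per cluster/user setting.
    for name, value in env.items():
        cmd += ["-e", f"{name}={value}"]
    cmd += [
        "-e", f"MAPR_TICKETFILE_LOCATION={ticket_path}",
        "-e", f"ZEPPELIN_SSL_PORT={ssl_port}",
        # Publish the Zeppelin port on the host.
        "-p", f"{ssl_port}:{ssl_port}",
        # Mount the MapR ticket into the container read-only.
        "-v", f"{ticket_path}:{ticket_path}:ro",
        IMAGE,
    ]
    return cmd


if __name__ == "__main__":
    settings = {
        "MAPR_CLUSTER": "my.cluster.com",
        "MAPR_CONTAINER_USER": "mapr",
        "MAPR_CLDB_HOSTS": "172.24.8.195,172.24.11.200,172.24.10.4",
    }
    print(" ".join(build_docker_command(settings, "/tmp/maprticket_5000", 9995)))
```

Grouping the settings this way separates what identifies the cluster and user (the `-e` variables) from what connects the container to the host (the published port and the ticket mount).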
















How is Security Handled?

$ maprlogin password

[Password for user 'jane' at cluster 'my.cluster.com': ]

MapR credentials of user 'jane' for cluster 'my.cluster.com' are written to '/tmp/janes_ticket'

Job submits as 'jane'



Why Livy?

[Diagram labels: HTTP (RPC); MAPR CONVERGED DATA PLATFORM: MapR-XD (cloud-scale data store), MapR-DB (operational database), MapR-ES (event streaming)]

Advantages over native Spark Interpreter:

• Jobs are submitted in YARN cluster mode

• Spark context can be shared

• Support for Spark Dynamic Resource Allocation
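
Livy exposes Spark through a REST interface over HTTP. As a rough sketch of what a client exchange looks like, the code below builds (but does not send) the two requests that create an interactive session and submit a statement, using Livy's `/sessions` and `/sessions/{id}/statements` endpoints. The base URL `http://localhost:8998` and the helper names are assumptions for illustration; in a notebook, Zeppelin's Livy interpreter would normally make these calls for you.

```python
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumed address; 8998 is Livy's default port


def new_session_request(kind: str = "pyspark") -> urllib.request.Request:
    """Build the POST that asks Livy to start an interactive session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        f"{LIVY_URL}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def statement_request(session_id: int, code: str) -> urllib.request.Request:
    """Build the POST that runs a snippet of code in an existing session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Requests are only constructed here; pass them to urllib.request.urlopen to send.
    req = statement_request(0, "spark.range(10).count()")
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Because the session lives on the Livy server rather than in the client process, several clients can submit statements to the same session ID, which is one way the shared Spark context mentioned above becomes possible.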


Demo: Extensibility & Collaboration



Collaboration

[Diagram labels: MapR POSIX Client for Containers; MapR-XD (cloud-scale data store), MapR-DB (operational database), MapR-ES (event streaming)]




docker run --rm -it \
  --cap-add SYS_ADMIN --cap-add SYS_RESOURCE \
  --device /dev/fuse \
  --memory 0 \
  -e MAPR_CLUSTER=my.cluster.com \
  -e MAPR_MEMORY=0 \
  -e MAPR_MOUNT_PATH=/mapr \
  -e ZEPPELIN_NOTEBOOK_DIR=/mapr/my.cluster.com/user/mapr/zeppelin/shared-notebooks/ \
  -e MAPR_TZ=America/Los_Angeles \
  -e MAPR_CONTAINER_USER=mapr \
  -e MAPR_CONTAINER_UID=5000 \
  -e MAPR_CONTAINER_GROUP=mapr \
  -e MAPR_CONTAINER_GID=5000 \
  -e MAPR_CLDB_HOSTS=172.24.8.195,172.24.11.200,172.24.10.4 \
  -e MAPR_TICKETFILE_LOCATION=/tmp/maprticket_5000 \
  -e ZEPPELIN_SSL_PORT=9995 \
  -e HOST_IP=172.24.11.62 \
  -e MAPR_HS_HOST=172.24.8.195 \
  -p 9995:9995 -p 10000-10010:10000-10010 \
  -v /tmp/maprticket_5000:/tmp/maprticket_5000:ro \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  maprtech/data-science-refinery:v1.0_6.0.0_4.0.0_centos7


Demo: Extensibility

Adding Deep Learning libraries to the container
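
Once the image has been extended, a quick check from a Python paragraph in the notebook can confirm which libraries are actually importable. The sketch below is generic Python rather than a MapR tool, and the library names in the example are placeholders for whatever was added to the image.

```python
import importlib.util
from typing import Dict, Iterable


def available_libraries(names: Iterable[str]) -> Dict[str, bool]:
    """Report, for each top-level module name, whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Placeholder names; substitute the libraries installed in your image.
    print(available_libraries(["tensorflow", "keras", "json"]))
```

The check only looks up each module without importing it, so it stays fast even for large frameworks.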


Demo: Extensibility

[Diagram labels: Compute; Persistent Storage: MapR-XD (cloud-scale data store), MapR-DB (operational database), MapR-ES (event streaming)]

What if this was a box of GPUs?


A Final Comparison

Traditional Hadoop Vendor

[Diagram: separate on-premises clusters: Batch Cluster, Streaming Cluster, NoSQL Cluster, and a Data Science cluster]


Q&A

ENGAGE WITH US

@mapr

[email protected]

Date post:	21-Jan-2018
Category:	Data & Analytics
Upload:	mapr-data-technologies