
Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging...

Date post: 07-Jul-2020
May 20, 2019 Machine Learning Models as a Service Tobias Wenzel Vigith Maurice
Page 1: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks

May 20, 2019

Machine Learning Models as a Service

Tobias Wenzel, Vigith Maurice


“We spend more time bringing the model to production than developing and training it”

Data Science Operations is not easy


Let's build a platform

Centralized or Decentralized?


Let's build a platform

Centralized!


Let's build a centralized platform

The platform controls the environment in which models are deployed


Let's build a centralized platform

The platform can implement actions of the machine learning model life cycle centrally


Let's build a centralized platform

Create self service APIs for every interaction with the platform


Security for an ML Platform

https://news.ycombinator.com/item?id=15256121


Security for an ML Platform

https://medium.com/@bertusk/detecting-cyber-attacks-in-the-python-package-index-pypi-61ab2b585c67


Let's build a centralized platform

Runtime security can be taken care of by the platform


Let's build a centralized platform

Updates can be rolled out to every running model by updating the platform


Machine Learning Platform

Model Training, Model Execution

Secure Data Access (Intuit Data Lake)

Continuous Training (Argo)

Access Control (Authn, Authz)

Logging, Tracing (Splunk)

Monitoring (Wavefront)

ML Compute (via Sagemaker) (GPUs etc.)

Notebooks (via Sagemaker) (Data Exploration, Visualization)

Data Aggregation (Pre-fetch, Cache, Optimize)

Auto Scaling (Cost Control)

Billing (Chargeback)

Prediction Quality Feedback (Beacons)

Self Service (Model Lifecycle Management)

Storage (Training Data, Model Artifacts)

User Interfaces (Web, API, CLI)

Intuit’s ML Platform


And Now for Something Completely Different


Observability: Beaconing / Monitoring / Logging

Let's build a centralized platform


Introspection: Inspect prediction response (Monitoring Contd.)

Let's build a centralized platform


Meaning of life = 43

I expected 42!

Introspection
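
A minimal sketch of the introspection idea on this slide: compare what a model answered with what was expected and report the mismatch. The function name `introspect` and the source of the expected value are assumptions for illustration; the slides do not say how expected values are obtained.

```python
from typing import Optional


def introspect(response_body: str, expected: str) -> Optional[str]:
    """Return a finding when a prediction differs from the expected value, else None."""
    if response_body == expected:
        return None
    return f"Meaning of life = {response_body}, expected {expected}"


# Hypothetical check using the values from the slide.
finding = introspect("43", "42")
no_finding = introspect("42", "42")
```

In practice the expected value would likely arrive later than the prediction, which is one reason to record every request and response pair.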


Beaconing: Re-training and Monitoring

Introspection


What is a beacon?

{
  "modelName": "meaning-of-life",
  "modelVersion": "1",
  "environment": "PROD",
  "requestBody": "What is the meaning of life?",
  "responseBody": "42"
  … Metadata …
}
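
The beacon above could be modelled as a small record type. This is only a sketch: the five field names come from the slide, while the `Beacon` class, the `metadata` dict (the slide elides what the metadata contains) and the `to_bytes` helper are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class Beacon:
    """One prediction request/response pair, using the field names from the slide."""
    modelName: str
    modelVersion: str
    environment: str
    requestBody: str
    responseBody: str
    # The slide elides the metadata; a free-form dict is a placeholder assumption.
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_bytes(self) -> bytes:
        """Serialize to UTF-8 encoded JSON, ready to hand to a message producer."""
        return json.dumps(asdict(self)).encode("utf-8")


beacon = Beacon(
    modelName="meaning-of-life",
    modelVersion="1",
    environment="PROD",
    requestBody="What is the meaning of life?",
    responseBody="42",
)
payload = beacon.to_bytes()
```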


Realtime Beacons for Predict Request and Responses

Beaconing - BYOC

[Diagram labels: Kafka; Data Lake; Hive, MR, Spark, Flink; Kafka Consumer; Realtime; Micro Batching; encrypt; Labelling / Re-training; Anomaly / Alerting; Near Real Time; Long Term; Monitoring Feedback Loop]

{"modelName": "meaning-of-life","modelVersion": "1","environment": "PROD","requestBody": "What is the meaning of life?","responseBody": "42"}.encode()
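
The emit and micro-batching steps named on this slide can be sketched as follows. An in-memory queue stands in for the Kafka topic so the example runs without a broker, encryption is left out, and the names `topic`, `emit` and `micro_batches` are hypothetical.

```python
import json
import queue
from typing import Iterator, List

# Stand-in for a Kafka topic (assumption): keeps the sketch runnable without a broker.
topic: "queue.Queue[bytes]" = queue.Queue()


def emit(beacon: dict) -> None:
    """Encode a beacon as JSON bytes, as the slide's .encode() suggests, and publish it."""
    topic.put(json.dumps(beacon).encode())


def micro_batches(source: "queue.Queue[bytes]", batch_size: int) -> Iterator[List[dict]]:
    """Drain the queue and yield decoded beacons in batches of at most batch_size."""
    batch: List[dict] = []
    while True:
        try:
            raw = source.get_nowait()
        except queue.Empty:
            break
        batch.append(json.loads(raw.decode()))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


for answer in ("42", "42", "43"):
    emit({
        "modelName": "meaning-of-life",
        "modelVersion": "1",
        "environment": "PROD",
        "requestBody": "What is the meaning of life?",
        "responseBody": answer,
    })

batches = list(micro_batches(topic, batch_size=2))
```

A real-time consumer would handle each beacon as it arrives instead of batching; the batched form suits bulk writes toward long-term storage.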


Monitoring, the bottom layer of the Hierarchy of Production Needs, is fundamental to running a stable service [1].

Observability


Core Building Blocks

Model Hosting Service

Hosted Models

Store

Beacon

Cache

Counters         Timers                 Distribution
Requests         Core Process Time      Request Size
HTTP Status      Total Response Time    Response Size
Store Requests   Store Process Time     Store get/put size
Cache Hit/Miss   Cache Process Time     Cache get/put size
Beacons          Beacon Emit Time       Beacon size
Authentication   LMA Process Time

● Cardinality of each is increased by tagging

● Availability is formulated from granular metrics

● Most alerting is based on P99

Monitoring - Platform on Platform
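
The three points on this slide (tagged metrics, availability derived from granular counters, alerting on P99) can be illustrated with a tiny in-memory registry. Everything here is an assumption for illustration: the `Metrics` class, the metric names `http_status` and `total_response_time`, and the definition of availability as the share of requests without a 5xx status. The platform itself reports to Wavefront, which this sketch does not use.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Tags = Tuple[Tuple[str, str], ...]
Key = Tuple[str, Tags]


class Metrics:
    """Tiny in-memory registry: counters and timers keyed by (name, tags)."""

    def __init__(self) -> None:
        self.counters: Dict[Key, int] = defaultdict(int)
        self.timers: Dict[Key, List[float]] = defaultdict(list)

    @staticmethod
    def _key(name: str, tags: Dict[str, str]) -> Key:
        # Sorting makes the key independent of keyword order.
        return name, tuple(sorted(tags.items()))

    def count(self, name: str, **tags: str) -> None:
        self.counters[self._key(name, tags)] += 1

    def time(self, name: str, millis: float, **tags: str) -> None:
        self.timers[self._key(name, tags)].append(millis)

    def availability(self, model: str) -> float:
        """Share of one model's requests that did not end in a 5xx status (assumed definition)."""
        total = failed = 0
        for (name, tags), n in self.counters.items():
            tag_map = dict(tags)
            if name == "http_status" and tag_map.get("model") == model:
                total += n
                if tag_map.get("status", "").startswith("5"):
                    failed += n
        return 1.0 if total == 0 else (total - failed) / total

    def p99(self, name: str, **tags: str) -> float:
        """Nearest-rank 99th percentile of the recorded timer samples."""
        samples = sorted(self.timers.get(self._key(name, tags), []))
        if not samples:
            raise ValueError("no samples recorded")
        rank = (99 * len(samples) + 99) // 100  # integer ceil(0.99 * n)
        return samples[rank - 1]


metrics = Metrics()
for i in range(1, 101):
    status = "500" if i == 100 else "200"
    metrics.count("http_status", model="meaning-of-life", status=status)
    metrics.time("total_response_time", float(i), model="meaning-of-life")

availability = metrics.availability("meaning-of-life")
latency_p99 = metrics.p99("total_response_time", model="meaning-of-life")
```

Each extra tag value creates a new counter or timer series, which is how tagging raises cardinality.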


Monitoring - Request Latency


Logging

Let's build a centralized platform


● Accessibility (easy to access)

● Compartmentalized (logs are grouped by Model)

● Traceability (transactional id flows from end user app all the way to Model)

● Near real time

● Log Retention

Centralized Logging
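
Two of the properties above, grouping logs by model and carrying a transactional id through to the model, can be sketched with Python's standard `logging` and `contextvars` modules. The logger naming scheme, the `tid` field and the `model_logger` helper are assumptions; the platform ships its logs to Splunk, which is not shown here.

```python
import contextvars
import io
import logging

# Hypothetical context variable holding the transaction id of the current request.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionFilter(logging.Filter):
    """Copy the current transaction id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.tid = transaction_id.get()
        return True


def model_logger(model_name: str, stream) -> logging.Logger:
    """Create one logger per model so log lines can be grouped by model name."""
    logger = logging.getLogger(f"models.{model_name}")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s tid=%(tid)s %(levelname)s %(message)s")
    )
    handler.addFilter(TransactionFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = model_logger("meaning-of-life", stream)

# The id would normally arrive with the request, for example in a header (assumption).
token = transaction_id.set("abc-123")
log.info("prediction served")
transaction_id.reset(token)

output = stream.getvalue()
```

Because the id lives in a context variable, model code can log without passing the id through every function call.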


Machine Learning Models as a Service


1. Make actions in ML lifecycle self service

2. Take the operations burden off of the data scientist

3. Make sure models are run securely

4. Provide common functionalities to all models at scale

5. Provide logging, tracing and monitoring out of the box

Running ML models is just like running a Service

