Date post: 21-Jan-2018
Upload: anya-bida

Just Enough DevOps for Data Scientists

@anyabida1

Anya Bida, SRE at Salesforce

About Anya

Sr. Member of Technical Staff (SRE)

Salesforce Production Engineering

Salesforce Einstein Platform

Co-organizer SF Big Analytics

Spark Tuning

• Cheat-sheet

• Talks

Previously at Alpine Data, SRI

PhD Mayo Clinic, BS Johns Hopkins

@anyabida1

What I am going to talk about

What is DevOps

Salesforce Einstein Scales

Our goal

Top 10 tips

What’s next?

What is DevOps?

Software Development

Network & Security

Infrastructure

Build & Release

Data Science

• Awesome library on SparkML

• Spark clusters

• Microservices

• Cluster, Containers

Fastest Growing Top 5 Enterprise Software Company

$1.7B FY11

$2.3B FY12

$3.1B FY13

$4.1B FY14

$5.4B FY15

$6.7B FY16

$8.4B FY17 revenue

$2.56B FY18 Q2 revenue

2009 • 2010 • 2011 • 2012 • 2013 • 2014 • 2015 • 2016 • 2017

September 2016

2011 • 2012 • 2013 • 2014 • 2015 • 2016 • 2017

The world’s most innovative companies

“Innovator of the Decade”

Our Goal

[Chart: Number of Predictions and Infrastructure Costs over Time]

Tip 1: Plan for Failure

Take off that Data Scientist hat now.

https://www.slideshare.net/jiboumans/how-to-measure-everything-a-million-metrics-per-second-with-minimal-developer-overhead

Simple Dashboard with KPIs

• Request & error rates

• Longest response times - upper 95th & 99th percentile

• Capacity

• Events

Jos Boumans, Salesforce DMP slides
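The KPIs listed above can be sketched in a few lines. This is only an illustration, not taken from the linked slides: the Request record, the dashboard_kpis function, and the nearest-rank percentile method are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    latency_ms: float
    is_error: bool


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def dashboard_kpis(requests, window_seconds):
    """Request rate, error rate, and upper latency percentiles for one time window."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if r.is_error)
    return {
        "request_rate_per_s": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

Tracking the 95th and 99th percentiles rather than the mean keeps the slowest requests visible, which is where user-facing failures tend to show up first.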

Collect metrics from every machine.

Troubleshoot with all the metrics at your disposal.

Tip 2: Blue Green Deployments

https://docs.mobingi.com/official/guide/bg-deploy

Blue Machine (old)

Green Machine (new)

Users
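A minimal sketch of the blue/green idea, assuming the two machines can be modeled as plain callables and that a health check gates the switch. BlueGreenRouter and its methods are hypothetical names for illustration, not part of any deployment tool.

```python
class BlueGreenRouter:
    """Routes all user traffic to one of two backends and can switch between them."""

    def __init__(self, blue, green):
        self.backends = {"blue": blue, "green": green}
        self.live = "blue"  # users start on the old (blue) machine

    def idle(self):
        """Name of the backend that is currently not serving users."""
        return "green" if self.live == "blue" else "blue"

    def handle(self, request):
        """Serve a request from whichever backend is live."""
        return self.backends[self.live](request)

    def cut_over(self, health_check):
        """Switch users to the idle backend only if it passes the health check."""
        candidate = self.idle()
        if not health_check(self.backends[candidate]):
            return False
        self.live = candidate
        return True

    def roll_back(self):
        """Send users back to the previous backend, which was left running."""
        self.live = self.idle()
```

Because the old machine keeps running after the cut-over, rolling back is just another pointer switch rather than a redeploy.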

Tip 3: Assume people make mistakes

Technical debt

• Every manual change

• Duplicate metrics

Scale down resources

• Terminate unused machines

• Janitor Monkey

• Understand the cost per job

• Jobs should not accumulate files on disk
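One way to keep jobs from accumulating files on disk is to give each run a scratch directory that is removed automatically. The sketch below uses Python's standard tempfile module; run_job and count_words are hypothetical names chosen for the example.

```python
import tempfile
from pathlib import Path


def run_job(job, *args):
    """Run job(scratch_dir, *args); the scratch directory is deleted afterwards, even on failure."""
    with tempfile.TemporaryDirectory(prefix="job-") as scratch:
        return job(Path(scratch), *args)


def count_words(scratch, text):
    """Example job that writes an intermediate file and returns a result."""
    tokens_file = scratch / "tokens.txt"
    tokens_file.write_text("\n".join(text.split()))
    return len(tokens_file.read_text().splitlines())
```

Calling run_job(count_words, "just enough devops") returns the word count and leaves nothing behind, so nobody has to remember a manual cleanup step.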

Tip 4: Changes should be auditable

Schaper - the tool to compare schemas

https://www.linkedin.com/in/huqixiu/

Qixiu “Q” Hu

-- Before
CREATE TABLE myConferences (
    name text,
    city text,
    early_bird timeuuid,
    late_bird timeuuid,
    PRIMARY KEY ((name, city), early_bird)
) WITH CLUSTERING ORDER BY (early_bird DESC);

-- After: adds the discount_code column
CREATE TABLE myConferences (
    name text,
    city text,
    early_bird timeuuid,
    late_bird timeuuid,
    discount_code text,
    PRIMARY KEY ((name, city), early_bird)
) WITH CLUSTERING ORDER BY (early_bird DESC);
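The comparison between the two table definitions can be sketched as a dictionary diff. This is not Schaper itself, whose implementation is not shown in the slides. diff_schemas is a hypothetical helper, and the column maps are transcribed by hand from the tables above, with discount_code assumed to be a text column.

```python
def diff_schemas(old, new):
    """Compare two {column: type} maps and report what was added, removed, or retyped."""
    added = {col: typ for col, typ in new.items() if col not in old}
    removed = {col: typ for col, typ in old.items() if col not in new}
    changed = {
        col: (old[col], new[col])
        for col in old.keys() & new.keys()
        if old[col] != new[col]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Column maps for the two myConferences definitions (assumed transcription).
before = {
    "name": "text",
    "city": "text",
    "early_bird": "timeuuid",
    "late_bird": "timeuuid",
}
after = dict(before, discount_code="text")

schema_change = diff_schemas(before, after)
```

Storing the result of each comparison alongside the deployment record gives an audit trail of exactly which columns changed and when.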

Tip 5: Configuration management

Network Connectivity

• 20 parameters

User Access

• 50 parameters

Deploy cluster (e.g. Mesos)

• 20 non-default parameters

Deploy a microservice

• 50 parameters

Schedule a job

• 3 parameters

SUM x 3 regions x 20 metrics

Approx. 6000

Templates for Automation

Service discovery

Creating dashboards

• Prod, non-prod, …

Log queries

Cost analysis
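A small sketch of template-driven configuration, assuming one template is rendered once per environment and region. The field names, the log query syntax, and the example service name are invented for illustration.

```python
from string import Template

# Hypothetical per-deployment config; only the placeholders vary between copies.
SERVICE_TEMPLATE = Template(
    "service=$service\n"
    "environment=$environment\n"
    "region=$region\n"
    "log_query=service:$service AND env:$environment AND region:$region\n"
)


def render_configs(service, environments, regions):
    """Render one config per (environment, region) pair from the single template."""
    return {
        (env, region): SERVICE_TEMPLATE.substitute(
            service=service, environment=env, region=region
        )
        for env in environments
        for region in regions
    }


configs = render_configs("scoring", ["prod", "non-prod"], ["us-west", "us-east", "eu"])
```

Six configs come out of one template here, so a change to the log query is made once instead of being hand-edited in every copy.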

Tip 6: Pick a naming convention

<service>.<environment>.<region>.<hostname>.<metric>
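The naming convention can be enforced in code so every metric is built and parsed the same way. In this sketch MetricName and parse_metric_name are hypothetical helpers, and it assumes that the first four fields contain no dots (so short hostnames rather than fully qualified ones), while the metric itself may.

```python
from typing import NamedTuple


class MetricName(NamedTuple):
    """The five fields of the convention, in order."""
    service: str
    environment: str
    region: str
    hostname: str
    metric: str

    def dotted(self):
        """Render the fields as a single dot-separated metric name."""
        return ".".join(self)


def parse_metric_name(name):
    """Split a dotted name into its five fields; any extra dots stay in the metric field."""
    parts = name.split(".", 4)
    if len(parts) != 5 or not all(parts):
        raise ValueError(f"expected 5 non-empty dot-separated fields: {name!r}")
    return MetricName(*parts)
```

With a fixed field order, dashboards and log queries can filter on a service, environment, or region prefix without knowing anything about individual hosts.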

Tip 7: Permissions

Every user, service, & job should have specific, auditable permissions.

Cluster Manager

Scheduler

IAM

IAM Roles

• User has an IAM Role

• Job has an IAM Role

• IAM Roles determine read / write access to data

[Diagram labels: IAM, Out, Logs, IAM, In]
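The role idea can be illustrated with a toy permission check. This is not the API of any real IAM service: the role names, principals, datasets, and the is_allowed function are all made up for the sketch.

```python
# Hypothetical grants: each role lists the (action, dataset) pairs it permits.
ROLE_GRANTS = {
    "analyst-role": {("read", "training-data")},
    "scoring-job-role": {("read", "training-data"), ("write", "predictions")},
}

# Every user and every job is mapped to exactly one role.
PRINCIPAL_ROLE = {
    "user:alice": "analyst-role",
    "job:nightly-scoring": "scoring-job-role",
}

audit_log = []  # every decision is recorded, whether allowed or denied


def is_allowed(principal, action, dataset):
    """Check one access request against the principal's role and log the decision."""
    role = PRINCIPAL_ROLE.get(principal)
    allowed = (action, dataset) in ROLE_GRANTS.get(role, set())
    audit_log.append((principal, role, action, dataset, allowed))
    return allowed
```

Unknown principals fall through to an empty grant set, so the default is deny, and the denial still lands in the audit log.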

Understanding Memory Management in Spark For Fun And Profit

Shivnath Babu (Duke University, Unravel Data Systems), Mayuresh Kunjir (Duke University)

Tip 8: Understand resource allocation

Node Memory

Container Memory: 8 GB

[Diagram: three Node Memory boxes, labeled 4 GB used, 8 GB total]

Can my 8 GB container launch on this cluster?
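The question on the slide can be worked through in code. The sketch reads the diagram as three nodes, each with 4 GB used out of 8 GB total, which is an assumption, and relies on the fact that a container has to fit on a single node. can_launch is a hypothetical helper.

```python
def can_launch(container_gb, nodes):
    """True if at least one node has enough free memory for the whole container.

    nodes is a list of (total_gb, used_gb) pairs, one per node.
    """
    return any(total - used >= container_gb for total, used in nodes)


# Assumed reading of the diagram: three nodes, each 4 GB used of 8 GB total.
cluster = [(8, 4), (8, 4), (8, 4)]

total_free_gb = sum(total - used for total, used in cluster)
fits_8gb = can_launch(8, cluster)
fits_4gb = can_launch(4, cluster)
```

Under that reading the cluster has 12 GB free in aggregate, yet an 8 GB container cannot launch because no single node has more than 4 GB free, while a 4 GB container can.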

Tip 9: Monitor multiple viewpoints

https://light.co/camera

Connectivity Viewer

https://www.linkedin.com/in/vaibhavt/

Vaibhav Tandon
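The slides do not show how the Connectivity Viewer works, so the following is only a generic sketch of monitoring from several viewpoints. It assumes a probe(viewpoint, target) callable that reports whether a target is reachable from a given vantage point. All names are hypothetical.

```python
def connectivity_matrix(viewpoints, targets, probe):
    """Probe every target from every viewpoint; returns {viewpoint: {target: reachable}}."""
    return {vp: {t: bool(probe(vp, t)) for t in targets} for vp in viewpoints}


def summarize(matrix, targets):
    """Classify each target as up, down, or partial (reachable from only some viewpoints)."""
    summary = {}
    for target in targets:
        results = [row[target] for row in matrix.values()]
        if all(results):
            summary[target] = "up"
        elif not any(results):
            summary[target] = "down"
        else:
            summary[target] = "partial"
    return summary
```

A target marked partial is the interesting case: the service itself is running, but a network path or permission between one viewpoint and the target is broken, which a single-viewpoint check would either miss or misreport as a full outage.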

Getting started tips:

1. Plan for failure

2. Blue / Green Deployments

3. Assume people make mistakes

4. Changes should be auditable

5. Configuration management

6. Pick a naming convention

7. Permissions

• user, service, job

8. Understand resource allocation

9. Monitor multiple viewpoints

10. Infrastructure as Code

Did we just automate ourselves out of our jobs?

Nope. Now we have time to take on new projects and grow…

More info:

Jos Boumans, Salesforce DMP slides

SRE: How Google Runs Production Systems (book)

James Ward, Engineering & Open Source Ambassador at Salesforce

High Performance Spark (book)

More info:

Real Time ML Pipelines in Multi-Tenant Environments

Director of Engineering Karl Skucha & Lead Engineer Yan Yang

Introduction to Machine Learning

Engineering & Open Source Ambassador James Ward

Fantastic ML apps and how to build them

Principal Engineer, Matthew Tovbin

Fireworks - lighting up the sky with millions of Sparks

Director of Engineering Thomas Gerber

Functional Linear Algebra in Scala

Engineer & Professor Vlad Patryshev

Panel: Functional Programming for Machine Learning

Saturday @ 2:10pm: Complex Machine Learning Pipelines Made Easy

Machine Learning Engineers Till Bergmann & Chris Rupley

@anyabida1

Anya Bida, SRE at Salesforce

Questions?

Extra, unused slides