DataEngConf SF16 - Running simulations at scale

Date post: 15-Apr-2017
Upload: hakka-labs
Transcript

Simulations at Scale

Saurabh Bajaj, 2016

Lyft runs multiple services to power a Lyft ride

1 Pricing Service: Calculate the price of a ride

2 Dispatching: Pick driver to dispatch for a given ride request

3 Lyft Line Matching: Match passengers going in the same direction efficiently

Pricing service determines how much a ride should cost

© OpenStreetMap contributors

Price = (a * distance) + (b * time) + (c * demand^2) + (d * matching_coefficient) + (e * hour_of_day) + …

1 Pricing
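
The pricing formula above is a weighted sum of ride features. A minimal Python sketch of that shape follows. The class and function names are hypothetical and the coefficient values in the usage note are invented for illustration; the slide does not give real coefficients, and the terms it elides with "…" are left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceCoefficients:
    """Weights a..e from the slide's formula (values are NOT from the talk)."""
    a: float  # weight on distance
    b: float  # weight on time
    c: float  # weight on demand squared
    d: float  # weight on matching_coefficient
    e: float  # weight on hour_of_day


def estimate_price(coef: PriceCoefficients, distance: float, time: float,
                   demand: float, matching_coefficient: float,
                   hour_of_day: float) -> float:
    """Evaluate only the five terms that are written out on the slide."""
    return (coef.a * distance
            + coef.b * time
            + coef.c * demand ** 2
            + coef.d * matching_coefficient
            + coef.e * hour_of_day)
```

For example, `estimate_price(PriceCoefficients(1.0, 0.5, 2.0, -1.0, 0.0), 10, 20, 3, 4, 17)` adds up the five weighted terms for one ride. Testing a new pricing model then amounts to swapping in a different set of coefficients.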

Dispatching Service determines the best driver to dispatch for a new ride

Nearest driver

Lowest time to arrival

Best Lyft Line match

2 Dispatching

Lyft Line service matches riders going in the same direction for maximum efficiency

Maximize system efficiency

Improve passenger experience

3 Lyft Line Matching

We optimize these services to perform well under varying conditions

Traffic Conditions

Commute Hours

Ride Type

Demand Density

Day of Week

Factors like weather, specific events, interference

Traditional experimentation does not work well for optimizing these services

Test many variations of a model quickly

New product launch

Combinations

External factors

E.g., impact of Lyft Line launch on driver availability

We use simulations to solve these shortcomings

Make ride requests based on past activity

Dispatch driver based on new rules

Run for 1 week of historical data

Measure overall efficiency changes

Images for illustration only

Load ride data for given time period

Load driver locations for given time period

Post results to the DB

A simulator process replays Lyft rides for a given time period for a given market

Simulate ride activity

Fetch job config

Install pip requirements

Fetch microservices from github

Report stacktrace via email

Dispatch drivers with new model

Anatomy of a simulator
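
The replay step of the simulator described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: rides and drivers are plain in-memory records, each driver serves a single ride, distance is straight-line, and the metric is total pickup distance. The names `replay_rides`, `nearest_driver`, `requested_at`, and `pickup` are hypothetical and not taken from the talk.

```python
import math
from typing import Callable, Dict, List, Tuple

Location = Tuple[float, float]
DispatchRule = Callable[[Location, Dict[str, Location]], str]


def nearest_driver(pickup: Location, idle: Dict[str, Location]) -> str:
    """Baseline dispatch rule: choose the idle driver closest to the pickup."""
    return min(idle, key=lambda driver_id: math.dist(pickup, idle[driver_id]))


def replay_rides(rides: List[dict], drivers: Dict[str, Location],
                 dispatch: DispatchRule) -> dict:
    """Replay historical ride requests in time order under a dispatch rule."""
    idle = dict(drivers)  # copy so the caller's driver locations are untouched
    served = 0
    total_pickup_distance = 0.0
    for ride in sorted(rides, key=lambda r: r["requested_at"]):
        if not idle:
            break  # simplification: stop once every driver is busy
        driver_id = dispatch(ride["pickup"], idle)
        total_pickup_distance += math.dist(ride["pickup"], idle[driver_id])
        del idle[driver_id]  # simplification: a driver serves one ride only
        served += 1
    return {"rides_served": served,
            "total_pickup_distance": total_pickup_distance}
```

Passing a different `dispatch` function while keeping the same historical rides is what lets two dispatch models be compared on equal input.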

Execute jobs that are available to run

EC2

Simulator workers are deployed to EC2 and run new jobs asynchronously

EC2

EC2

DB

Register new simulations to run

Fetch new models

S3

Get new job

Ride activity and driver locations

EC2 auto scaling helps run thousands of simulations in parallel

A request for 10,000 new simulations causes a CPU spike across existing nodes

No more new models to run

Race conditions between workers can arise and should be avoided

PostgreSQL docs

Sequence diagram (Client, Database, Worker 1, Worker 2): the client registers one job; both workers ask "Any new simulations?"; both are told "Yes, 1 new job"; both answer "OK, running"

Use exclusive lock on select to prevent race conditions
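
The slide cites the PostgreSQL docs, where a row lock on select is written as `SELECT ... FOR UPDATE`. The sketch below shows the same claim-a-job idea with a swapped-in technique: SQLite's `BEGIN IMMEDIATE`, which takes the database write lock before the select, so the example runs without a database server. The `jobs` table, its columns, and the function names are assumptions for illustration only.

```python
import sqlite3
from typing import Optional


def open_db(path: str) -> sqlite3.Connection:
    # isolation_level=None: autocommit mode, so the explicit BEGIN/COMMIT
    # statements below are the only transaction boundaries.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, worker TEXT)"
    )
    return conn


def register_job(conn: sqlite3.Connection) -> int:
    """Client side: register one pending simulation job."""
    return conn.execute("INSERT INTO jobs (status) VALUES ('pending')").lastrowid


def claim_job(conn: sqlite3.Connection, worker: str) -> Optional[int]:
    """Worker side: take one pending job, or return None if there is none."""
    # The write lock is acquired here, before the SELECT, so two workers
    # cannot both read the same pending row and both start running it.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running', worker = ? WHERE id = ?",
                (worker, row[0]),
            )
        conn.execute("COMMIT")
        return None if row is None else row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With one registered job, the first worker to call `claim_job` gets its id and the second gets `None`, instead of both being told there is one new job.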

Create base env and clone to speed up installation

Setting up a new environment for each job is really slow; conda can speed it up

Diagram: base virtualenv versus base conda env

Use conda to skip large build times
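
Cloning a base environment per job can be sketched as below, using conda's `conda create --name <new> --clone <existing>` form. The environment names and the helper functions are hypothetical; how the simulator workers actually invoke conda is not shown on the slides.

```python
import subprocess
from typing import List


def clone_env_command(base_env: str, job_env: str) -> List[str]:
    """Build the conda command that clones a prepared base env for one job."""
    return ["conda", "create", "--yes", "--name", job_env, "--clone", base_env]


def create_job_env(base_env: str, job_env: str) -> None:
    """Run the clone; raises CalledProcessError if conda exits non-zero."""
    subprocess.run(clone_env_command(base_env, job_env), check=True)
```

A worker might call `create_job_env("sim-base", "sim-job-42")` and then install only the job's extra pip requirements on top of the clone, so the heavy packages in the base environment are not rebuilt for every job.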

Conda Docs

Failures are expensive; they need to be isolated and reported

DB

Register new job

Lock and start the job run

Job failed

Record failure for retry

Email stacktrace

Lock and retry run
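
The failure flow above (run, record the failure, email the stacktrace, retry) could look roughly like the following sketch. The retry limit, the `notify` callback standing in for the email step, and the shape of the returned record are all assumptions; the slides do not say how many retries are allowed or how failures are stored.

```python
import traceback
from typing import Any, Callable, Dict, List


def run_with_retry(job_id: str,
                   run: Callable[[], Any],
                   notify: Callable[[str, int, str], None],
                   max_attempts: int = 3) -> Dict[str, Any]:
    """Run a job, isolating each failure and reporting its stacktrace."""
    failures: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = run()
        except Exception:
            # Isolate the failure: capture the stacktrace, record it for the
            # retry, and report it, without crashing the worker process.
            stacktrace = traceback.format_exc()
            failures.append(stacktrace)
            notify(job_id, attempt, stacktrace)
            continue
        return {"job_id": job_id, "status": "succeeded",
                "attempts": attempt, "result": result, "failures": failures}
    return {"job_id": job_id, "status": "failed",
            "attempts": max_attempts, "result": None, "failures": failures}
```

Keeping the exception inside the loop means one bad model only costs its own job, while other jobs on the same worker keep running.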

Thanks

Resilient, Async, Responsive, Elastic

Reactive Manifesto

Build services that scale

saurabh@lyft.com
