Date post: | 15-Apr-2017 |
Category: | Technology |
Upload: | hakka-labs |
Simulations at Scale
Saurabh Bajaj
2016
Lyft runs multiple services to power a Lyft ride
1 Pricing Service
Calculate the price of a ride
2 Dispatching
Pick driver to dispatch for a given ride request
3 Lyft Line Matching
Match passengers going in the same direction efficiently
Pricing service determines how much a ride should cost
© OpenStreetMap contributors
Price = (a × distance) + (b × time) + (c × demand^2) + (d × matching_coefficient) + (e × hour_of_day) + …
1 Pricing
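The pricing formula on this slide can be sketched as a plain weighted sum. This is only an illustration: it assumes each coefficient multiplies its feature, it covers just the five terms the slide names (the slide's trailing "…" is left out), and the class name, function name and coefficient values are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class PricingCoefficients:
    """Hypothetical container for the weights a..e named on the slide."""
    a: float  # weight on distance
    b: float  # weight on time
    c: float  # weight on demand squared
    d: float  # weight on the matching coefficient
    e: float  # weight on hour of day


def price(coef, distance, time, demand, matching_coefficient, hour_of_day):
    """Weighted sum of the five terms shown on the slide."""
    return (
        coef.a * distance
        + coef.b * time
        + coef.c * demand ** 2
        + coef.d * matching_coefficient
        + coef.e * hour_of_day
    )


# Illustrative values only; real coefficients are not given in the talk.
example = price(PricingCoefficients(1.0, 0.5, 0.1, -2.0, 0.0), 10, 20, 3, 1, 8)
```

Changing a single coefficient and recomputing the price over historical rides is the kind of model variation the later slides evaluate through simulation.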
Dispatching Service determines the best driver to dispatch for a new ride
© OpenStreetMap contributors
Nearest driver
Lowest time to arrival
Best Lyft Line match
2 Dispatching
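As an illustration of the first criterion on this slide, nearest driver, here is a minimal sketch that picks the driver with the smallest great-circle distance to the pickup point. The data layout (a dict of driver id to latitude and longitude) and the function names are assumptions for the example, not Lyft's dispatch code.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def nearest_driver(pickup, drivers):
    """Return the id of the closest driver, or None if no driver is available.

    pickup: (lat, lon); drivers: dict mapping driver id -> (lat, lon).
    """
    if not drivers:
        return None
    return min(
        drivers,
        key=lambda driver_id: haversine_km(pickup[0], pickup[1], *drivers[driver_id]),
    )
```

Swapping the `key` function for an estimated time to arrival, or for a Lyft Line match score, gives the other two criteria; which one performs best is what the simulations are meant to measure.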
Lyft Line service matches riders going in the same direction for maximum efficiency
© OpenStreetMap contributors
Maximize system efficiency
Improve passenger experience
3 Lyft Line Matching
We optimize these services to perform well under varying conditions
Traffic Conditions
Commute Hours
Ride Type
Demand Density
Day of Week
Factors like weather
Specific events
Interference
Traditional experimentation does not work well for optimizing these services
Test many variations of a model quickly
New product launch
Combinations
External factors
E.g., the impact of the Lyft Line launch on driver availability
We use simulations to solve these shortcomings
© OpenStreetMap contributors
Make ride requests based on past activity
Dispatch driver based on new rules
Run for 1 week of historical data
Measure overall efficiency changes
Images for illustration only
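The replay idea on this slide can be sketched in a few lines: feed historical ride requests, in time order, to a dispatch rule and accumulate an efficiency metric. Everything here is a simplification for illustration: positions are planar (x, y) points, the metric is total pickup distance, drivers become free immediately at the pickup point, and the two rules are hypothetical stand-ins for old and new dispatch models.

```python
import math


def distance(p, q):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def nearest_rule(pickup, positions):
    """Candidate rule: dispatch the closest driver."""
    return min(positions, key=lambda d: distance(positions[d], pickup))


def first_rule(pickup, positions):
    """Baseline rule: always dispatch the first driver id alphabetically."""
    return sorted(positions)[0]


def replay(requests, start_positions, rule):
    """Replay (timestamp, pickup) requests and return total pickup distance."""
    positions = dict(start_positions)  # copy so each run starts from the same state
    total = 0.0
    for _, pickup in sorted(requests, key=lambda r: r[0]):
        driver = rule(pickup, positions)
        total += distance(positions[driver], pickup)
        positions[driver] = pickup  # simplification: driver is free again at pickup
    return total


requests = [(1, (0, 0)), (2, (10, 0)), (3, (0, 1))]
start = {"a": (0, 1), "b": (10, 1)}
baseline = replay(requests, start, first_rule)
candidate = replay(requests, start, nearest_rule)
```

Comparing `candidate` with `baseline` over the same historical window is the "measure overall efficiency changes" step; a real run would cover a week of data for a market rather than three requests.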
Load ride data for given time period
Load driver locations for given time period
Post results to the DB
A simulator process replays Lyft rides for a given time period for a given market
Simulate ride activity
Fetch job config
Install pip requirements
Fetch microservices from github
Report stacktrace via email
Dispatch drivers with new model
Anatomy of a simulator
Execute jobs that are available to run
EC2
Simulator workers are deployed to EC2 and run new jobs asynchronously
EC2
EC2
DB
Register new simulations to run
Fetch new models
S3
Get new job
Ride activity and driver locations
EC2 auto scaling helps run thousands of simulations in parallel
Request for 10000 new simulations causing CPU spike across existing nodes
No more new models to run
Race conditions between workers can arise and must be avoided
PostgreSQL docs
[Sequence diagram along a time axis. Participants: Client, Worker 1, Worker 2, Database]
Client: Register one job
Worker 1 and Worker 2: Any new simulations?
Database, to each worker: Yes, 1 new job
Worker 1 and Worker 2: OK, running
Use an exclusive lock on SELECT to prevent race conditions
Create base env and clone to speed up installation
Setting up a new environment for each job is slow; conda speeds this up
[Diagram labels: Base virtualenv, Base conda env]
Use conda to skip large build times
Conda Docs
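A minimal sketch of the "create base env and clone" idea: build the heavy dependencies once into a base conda environment, then clone it for each job instead of reinstalling. `conda create --name ... --clone ...` is a standard conda command; the environment names and the wrapper functions below are hypothetical.

```python
import subprocess


def clone_env_command(base_env, job_env):
    """Build the conda command that clones base_env into job_env."""
    return ["conda", "create", "--yes", "--name", job_env, "--clone", base_env]


def create_job_env(base_env, job_env):
    """Clone the base environment; raises CalledProcessError on failure.

    Requires conda on PATH and an existing base_env.
    """
    subprocess.run(clone_env_command(base_env, job_env), check=True)


# Hypothetical names: one prebuilt base env, one env per simulation job.
command = clone_env_command("sim-base", "sim-job-42")
```

Job-specific pip requirements can then be installed on top of the clone, so only the packages that differ from the base are downloaded or built.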
Failures are expensive and need to be isolated and notified
DB
Register new job
Lock and start the job run
Job failed
Record failure for retry
Email stacktrace
Lock and retry run
TIME
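The failure flow on this slide (record the failure, email the stack trace, retry) can be sketched as a small wrapper around a job. The `notify` callback stands in for the email step and is an assumption, as are the function name and the retry limit; a real worker would also persist each failure to the DB before retrying.

```python
import traceback


def run_with_retry(job, max_attempts, notify):
    """Run job(); on failure record the stack trace, notify, and retry.

    Returns (result, failures), where failures is a list of
    (attempt_number, stack_trace) tuples. result is None if every attempt failed.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), failures
        except Exception:
            trace = traceback.format_exc()
            failures.append((attempt, trace))  # record failure for retry
            notify(trace)                      # e.g. email the stack trace
    return None, failures
```

Catching the exception inside the wrapper keeps one failing simulation from taking down the worker, which is the isolation the slide asks for.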
Thanks
Resilient Async Responsive Elastic
Reactive Manifesto
Build services that scale
saurabh@lyft.com