PerfIso: Performance Isolation for Commercial Latency-Sensitive Services
Călin Iorgulescu (EPFL), Reza Azimi (Brown University), Youngjin Kwon (University of Texas)
Sameh Elnikety, Manoj Syamala, Vivek Narasayya (Microsoft Research), Herodotus Herodotou (Cyprus University of Technology)
Paulo Tomita, Alex Chen, Jack Zhang, Junhua Wang (Microsoft Bing)
7/12/18
Interactive services must feel instantaneous
≤ 0.1 s
A single query involves hundreds of machines!
≤ 0.1 s
Web Index
Embarrassingly parallel search
Slowest response must be << 0.1 s
Multiple layers of aggregation! Just one service out of many!
Machines are provisioned for peak load
[Figure: Request-rate variation for a Microsoft Bing sub-cluster over 1 week in 2017. x-axis: day of week (Wednesday to Tuesday); y-axis: query arrival rate. Annotation: peak load >> average load.]
Datacenters have spare resources
How can we leverage this?
Solution: colocate batch jobs with online services
• Get spare resources to do useful work
• Primary tenant – guaranteed performance
  • e.g., Bing IndexServe
• Secondary tenant – best-effort performance
  • e.g., Apache Spark
[Figure: Without colocation: Primary, Idle. With colocation + PerfIso: Primary, Batch Job.]
PerfIso: performance isolation for online services
• Maintains P99 of response-times (10s of ms) under colocation
Provides performance isolation of Primary
• 45% of the CPU is used to do useful batch work
Increases system efficiency
• Many different interactive services and hardware setups
Deployed on over 90,000 servers
Many papers published on performance isolation
Quasar [ASPLOS '14], Heracles [ISCA '15], Elfen [USENIX ATC '16]
Existing solutions do not fit our requirements
PerfIso: Requirements
1. “Black-box”: Fewest assumptions about tenants (wider applicability)
2. “Standalone”: Primary acts like it runs alone (negligible interference)
3. “Integrability”: Minimize software-stack changes (easy deployment)
Why is Performance Isolation hard?
Interactive services – highly sensitive to interference!
Leaf-servers keep 99th percentile low
• Over 10 years of optimization work!
  • e.g., compression, adaptive parallelism, etc.
How often does the 99th percentile occur?
• For 10,000 queries / s → 100 times / s
What happens in a 100-node fanout?
• On average, every query has one leaf responding at the 99th percentile!
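The two back-of-the-envelope figures above can be checked with a few lines of Python. This is a minimal sketch that assumes leaf latencies are independent, which the slides do not state; the function names are illustrative only.

```python
def slow_queries_per_second(query_rate: float, percentile: float = 0.99) -> float:
    """Queries per second that land at or beyond the given percentile."""
    return query_rate * (1.0 - percentile)


def expected_slow_leaves(fanout: int, percentile: float = 0.99) -> float:
    """Expected number of leaves per query that respond beyond the percentile."""
    return fanout * (1.0 - percentile)


def prob_any_leaf_slow(fanout: int, percentile: float = 0.99) -> float:
    """Probability that at least one of `fanout` leaves is beyond the percentile.

    Assumes independent leaf latencies (an assumption, not from the slides).
    """
    return 1.0 - percentile ** fanout


if __name__ == "__main__":
    print(f"slow queries/s at 10,000 Q/s: {slow_queries_per_second(10_000):.0f}")
    print(f"expected slow leaves, fanout 100: {expected_slow_leaves(100):.2f}")
    print(f"P(at least one slow leaf), fanout 100: {prob_any_leaf_slow(100):.3f}")
```

Under the independence assumption, roughly 63% of queries with a 100-node fanout wait on at least one leaf at or beyond its 99th percentile, and the expected number of such leaves per query is one.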
The Primary demands many resources quickly
• Bing IndexServe: multi-threaded web-index server
  ➢ Up to 15 threads wake up in 5 μs¹
• Burstiness due to query-processing optimizations!
  • some queries will spawn many workers
• Workload arrives in bursts – exacerbates problem
¹ Constant query rate 4,000 Q/s, 500k queries experiment
The Primary must behave as if it were standalone
• Primary’s resource demands must be fulfilled instantly.
• Any delays → performance penalties incurred
• Any resource can become a performance bottleneck.
If a query is delayed, it is already too late!
PerfIso
PerfIso: Implemented as a user-mode service
[Diagram: Primary, PerfIso, and Secondary all run on top of the OS.]
• Only keeps track of Secondary’s PID
PerfIso: Managed resources
• CPU: Blind Isolation
• Disk: I/O throttling
• Memory: Restrict footprint
• Network: Throttle egress packets
PerfIso: CPU is the most important resource
CPU sharing without PerfIso
• Primary and Secondary compete for cores.
• Secondary is aggressive: no idle cores exist.
[Figure: machine with 12 cores, shared by Primary and Secondary.]
CPU Blind Isolation: Keep a “buffer” of idle cores
• PerfIso only knows the Secondary.
• Restrict Secondary by changing core affinities.
• Restrict Secondary to create a buffer of idle cores.
• Primary is unrestricted. Secondary is restricted.
• Primary can expand into the buffer!
[Figure: machine with 12 cores, each marked Primary, Secondary, or Idle; the Secondary is confined to a restricted set of cores, leaving a buffer of idle cores.]
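Restricting the Secondary by core affinity can be sketched as follows. The slides do not name the OS API PerfIso uses, so this example is an assumption throughout: it uses Linux's `os.sched_setaffinity` from the Python standard library, and the helper names `secondary_cores` and `restrict_secondary` are hypothetical.

```python
from __future__ import annotations

import os


def secondary_cores(total_cores: int, allowed: int) -> set[int]:
    """Pick the highest-numbered `allowed` cores for the Secondary.

    Which cores to hand out is an arbitrary choice made for this sketch.
    """
    if not 0 <= allowed <= total_cores:
        raise ValueError("allowed must be between 0 and total_cores")
    return set(range(total_cores - allowed, total_cores))


def restrict_secondary(pid: int, cores: set[int]) -> bool:
    """Pin process `pid` to `cores`; the Primary is left untouched.

    Returns False on platforms without os.sched_setaffinity (non-Linux).
    """
    if not cores:
        raise ValueError("an affinity set cannot be empty")
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(pid, cores)
        return True
    return False


if __name__ == "__main__":
    # 12-core machine from the slides, Secondary limited to 4 cores.
    print(sorted(secondary_cores(12, 4)))  # prints [8, 9, 10, 11]
```

Only the Secondary's PID is needed here, which matches the "blind" design: nothing about the Primary is read or changed.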
CPU Blind Isolation: React to bursts from Primary
• Continuously read idle core status.
• Adjust Secondary “slice” to maintain buffer.
[Figure: machine with 12 cores, each marked Primary, Secondary, or Idle, showing the restricted Secondary and the buffer of idle cores as the Primary bursts.]
CPU Blind Isolation: Secondary gets spare cores
• Allow Secondary to use spare idle cores.
• Release spare cores incrementally.
[Figure: machine with 12 cores, each marked Primary, Secondary, or Idle; the restricted Secondary grows into spare idle cores step by step while the buffer of idle cores is kept.]
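The polling behaviour described on these slides (shrink the Secondary when the buffer is eaten into, release spare cores incrementally) could look roughly like the step function below. This is a sketch under assumptions, not PerfIso's actual policy: the shrink-by-deficit rule, the parameter names, and the single reserved core for the controller are all illustrative choices.

```python
def next_secondary_cores(current: int, idle: int, buffer: int,
                         total: int, reserved: int = 1) -> int:
    """One polling step: return the Secondary's new core count.

    current  -- cores the Secondary may use right now
    idle     -- idle cores observed in this poll
    buffer   -- idle cores we want to keep free for Primary bursts
    total    -- cores in the machine
    reserved -- cores set aside for the controller itself (assumed 1)
    """
    # The Secondary may never grow into the buffer or the reserved core.
    max_secondary = max(0, total - reserved - buffer)
    if idle < buffer:
        # Primary burst into the buffer: give back the deficit at once.
        target = current - (buffer - idle)
    elif idle > buffer:
        # Spare idle cores exist: release them one at a time.
        target = current + 1
    else:
        target = current
    return max(0, min(target, max_secondary))


if __name__ == "__main__":
    cores = 4
    for idle in [2, 0, 3, 5, 5]:  # hypothetical idle-core readings
        cores = next_secondary_cores(cores, idle, buffer=2, total=12)
        print(f"idle={idle} -> secondary cores={cores}")
```

Shrinking is immediate while growth is gradual, so a short Primary burst costs the Secondary some throughput but should not leave the Primary waiting for a core.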
CPU Blind Isolation: We dedicate 1 core to PerfIso
• PerfIso does continuous polling → we affinitize it to 1 core.
[Figure: machine with 12 cores; one core runs PerfIso, the others are Primary, restricted Secondary, or the buffer of idle cores.]
Evaluation
Experiment testbed
Hardware
• Intel Xeon E5 – 24 cores (48 w/ HT)
• 128GB RAM
Primary: Bing IndexServe
• 569 GB index-slice
• Open-loop client
• 500,000 queries @ 2,000 Q / s
Secondary: CPU micro-benchmark
Single server: PerfIso protects tail-latency
Secondary: CPU-intensive micro-benchmark
[Figure: P99 latency (ms), log scale from 1 to 1000, with the SLO marked.]
• No isolation: Standalone 11.65 ms, Colocated 349.08 ms
• PerfIso: Standalone 11.65 ms, Colocated 12.07 ms
No isolation: more than one order of magnitude worse (≈30x)!
Single server: CPU utilization 3x higher!
Secondary: CPU-intensive micro-benchmark
[Figure: CPU utilization (%), split into Primary and Secondary. Inset: P99 latency of 11.65 ms standalone vs 12.07 ms colocated.]
• No colocation: 21%
• PerfIso: 67%
46% of CPU time → useful work
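The "3x" and "46%" figures follow directly from the two utilization numbers on this slide. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def colocation_gain(util_alone_pct: float, util_colocated_pct: float):
    """Return (extra utilization in percentage points, utilization ratio)."""
    extra_points = util_colocated_pct - util_alone_pct
    ratio = util_colocated_pct / util_alone_pct
    return extra_points, ratio


if __name__ == "__main__":
    # 21% without colocation, 67% with PerfIso (values from the slide).
    extra, ratio = colocation_gain(21.0, 67.0)
    print(f"{extra:.0f} points go to the Secondary, {ratio:.1f}x utilization")
```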
Restricting CPU cycles does not work
Secondary: CPU-intensive micro-benchmark
[Figure: P99 latency (ms), log scale from 1 to 1000, with the SLO marked.]
• Standalone: 11.65 ms
• No isolation: 349.08 ms
• PerfIso: 12.07 ms
• Restrict cycles: 33.74 ms
Secondary → 5% of CPU cycles
P99 latency – 3x higher than SLO!
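To compare the four configurations on this slide, the P99 values can be normalized against the standalone run. The slide does not give the SLO value itself, so this sketch uses the standalone latency as the baseline, which is an assumption; the names are illustrative.

```python
# P99 latencies in milliseconds, as read off the slide.
P99_MS = {
    "Standalone": 11.65,
    "No isolation": 349.08,
    "PerfIso": 12.07,
    "Restrict cycles": 33.74,
}


def slowdown(p99_ms: dict, baseline: str = "Standalone") -> dict:
    """Ratio of each configuration's P99 to the baseline's P99."""
    base = p99_ms[baseline]
    return {name: value / base for name, value in p99_ms.items()}


if __name__ == "__main__":
    for name, ratio in slowdown(P99_MS).items():
        print(f"{name:>16}: {ratio:6.2f}x standalone")
```

Relative to standalone, restricting the Secondary to 5% of CPU cycles still leaves P99 close to three times higher, while PerfIso stays within a few percent.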
Restricting CPU cores does not work
Secondary: CPU-intensive micro-benchmark
[Figure: P99 latency (ms), log scale from 1 to 1000, with the SLO marked.]
• Standalone: 11.65 ms
• No isolation: 349.08 ms
• PerfIso: 12.07 ms
• Restrict cores: 11.63 ms
[Figure: CPU utilization (%), split into Primary and Secondary.]
• Standalone: 21%
• PerfIso: 67%
• Restrict cores: 38%
Provisioned for peak load → CPU utilization ~30 percentage points lower!
1-hour run of 650-machine cluster
Secondary: Machine-Learning computation
[Figure, top panel: average CPU utilization (%) over time. Bottom panel: Top-Level Aggregator P99 latency (ms) and queries / s over time (minutes, 0 to 60).]
Average CPU utilization is 50% - 80%!
Interesting details in the paper
• Effectiveness of static CPU isolation methods
  • Restricting CPU cycles
  • Restricting CPU cores
• Comparison of state-of-the-art techniques
• Managing disk, memory, and network
PerfIso: colocate batch jobs with online services
• Black-box: do not tailor to one specific service
• Robustness: favor user-mode over kernel implementation
• Headroom: some core-slack makes Primary behave like standalone
• CPU Blind Isolation → colocation without impacting service performance