Measuring and Optimizing Tail Latency · 2017-10-23


Measuring and Optimizing Tail Latency
Kathryn S. McKinley (Google), Xi Yang, Stephen M. Blackburn, Md Haque, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Tail Latency Matters

3

Two-second slowdown reduced revenue/user by 4.3%. [Eric Schurman, Bing]

400 millisecond delay decreased searches/user by 0.59%. [Jake Brutlag, Google]

TOP PRIORITY

4

Photo: Google / Connie Zhou

Servers in US datacenters

5

[Figure: servers in US datacenters (millions), 2006-2020, by type: unbranded 2+ sockets, unbranded 1 socket, branded 2+ sockets, branded 1 socket]

*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.

[Figure: electricity use in US datacenters (billion kWh / year), 2000-2020, under scenarios: 2010 trend, current trend, better, bigger, B+B, best practices, BP + bigger; annotated: actual, $1 to $6 billion]

Electricity in US datacenters

*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.

Datacenter economics quick facts*

7

~ $500,000: cost of a small datacenter
~ 3,000,000: US datacenters in 2016
~ $1.5 trillion: US capital investment to date
~ $3,000,000,000: kW-hour dollars / year
~ $30,000,000: savings from 1% less work; lots more by not building a datacenter

*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.

8

Efficiency TOP PRIORITY

Tail Latency


10

Efficiency BOTH ?!

11

Server architecture

[Diagram: client sends a request to an aggregator, which fans out to workers]

12

Characteristics of interactive services

[Figure: CDF of request latency for a latency-critical (LC) service; x-axis: latency (ms), y-axis: percentage of requests]

Bursty, diurnal load; the CDF changes slowly. The slowest server dictates the tail. Orders of magnitude difference between average & tail (99th %ile).

13

What is in the tail?

[Figure: the same latency CDF with the tail region marked '?'; x-axis: latency (ms), y-axis: percentage of requests]

Roadmap

Diagnosing the tail with continuous profiling
Noise: systems are not perfect
Queuing: too much load is bad, but so is over-provisioning
Work: many requests are long

Insights: use the CDF off-line.

Long requests reveal themselves, treat them specially

14

15

[Diagram: three workers, each running the application on a Java VM on top of the worker OS / VM]

Simplified life of a request

request response

Prior state of the art: Dick Sites, Google. https://www.youtube.com/watch?v=QBu2Ae8-8LM

16

@ Google

17

Hand instrument system
1% on-line budget: sample – but tails are rare…
Off-line schematics
Have insight
Improve the system

Request profiling


19

Automated instrumentation
1% on-line budget: continuous on-line profiling
Off-line schematics
Have insight
Improve the system
+ On-line optimization

✗ Hand instrument system: 1% on-line budget, sample – but tails are rare…; off-line schematics; have insight; improve the system

counters tags

Automated cycle-level on-line profiling

Insight: hardware & software generate signals

20

[ISCA’15 (Top Picks HM), ATC’16]

hardware signals: performance counters ✓
software signals: memory locations ✓

SHIM Design ISCA’15 (Top Picks HM), ATC’16

21

Observe global state from other core

22


LLC misses per cycle

while (true):
    for counter in [LLC misses, cycles]:
        buf[i++] = readCounter(counter)

Observe local state with SMT hardware

23


[Figure: per-interval IPC of HT1, the whole core, and HT2 (SHIM), each on a 0-4 scale]
HT1 IPC = Core IPC - HT2 SHIM IPC

while (true):
    for counter in [HT2 SHIM, Core, cycles]:
        buf[i++] = readCounter(counter)
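To make the identity above concrete, here is a minimal sketch (not SHIM itself; the counter names are illustrative) of deriving the application hyperthread's IPC from whole-core and SHIM-thread counts over one sampling interval:

# Sketch only: 'core_insts' counts instructions retired by both hyperthreads,
# 'shim_insts' counts instructions retired by the SHIM observer hyperthread,
# and 'cycles' counts core cycles over the same interval.
def ht1_ipc(core_insts, shim_insts, cycles):
    core_ipc = core_insts / cycles      # both hyperthreads together
    shim_ipc = shim_insts / cycles      # SHIM's own hyperthread
    return core_ipc - shim_ipc          # HT1 IPC = Core IPC - HT2 SHIM IPC

# Example: core-wide IPC of 1.8 with SHIM contributing 0.3 leaves 1.5 for the app.
assert abs(ht1_ipc(1800, 300, 1000) - 1.5) < 1e-9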

Correlate hardware & software events

[Figure: IPC of HT1, the whole core, and HT2 (SHIM) over time, aligned with methods A(), B(), C() running on HT1]

while (true):
    for counter in [HT2 SHIM, Core, cycles]:
        buf[i++] = readCounter(counter)
    tid = thread on HT1
    buf[i++] = tid.method

Fidelity

25

26

[Figure: raw IPC samples; x-axis: IPC (log scale), y-axis: % of samples (log scale)]

27

[Timeline: counter samples (R0, C0), (R1, C1), (R2, C2), (R3, C3) taken over time; per-interval IPC1, IPC2, IPC3, one marked ✗ and two marked ✓]
IPC = (R_t - R_t-1) / (C_t - C_t-1)
Counters: C = cycles, R = retired instructions

Problem: samples are not atomic

28

[Timeline: samples (Cs0, R0, C0, Ce0), (Cs1, R1, C1, Ce1), (Cs2, R2, C2, Ce2), (Cs3, R3, C3, Ce3); per-interval IPC1, IPC2, IPC3, one marked ✗ and two marked ✓]
Solution: use the clock as ground truth. CPC = (Ce_t - Ce_t-1) / (Cs_t - Cs_t-1); this should be 1!
CPC1 = 1.0 ± 1%, CPC2 = 1.0 ± 1%, CPC3 ≠ 1.0 ± 1%
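A small sketch of that filter, assuming each sample is recorded as a tuple (Cs, R, C, Ce) of start clock, retired instructions, cycles, and end clock (the layout is assumed, not SHIM's actual buffer format):

def filter_ipc(samples, tol=0.01):
    """Keep per-interval IPC only when the clock-vs-clock ratio (CPC) is ~1,
    i.e. the two non-atomic counter reads bracket a consistent interval."""
    kept = []
    for (cs0, r0, c0, ce0), (cs1, r1, c1, ce1) in zip(samples, samples[1:]):
        cpc = (ce1 - ce0) / (cs1 - cs0)          # should be 1.0
        if abs(cpc - 1.0) <= tol:                # e.g. CPC in [0.99, 1.01]
            kept.append((r1 - r0) / (c1 - c0))   # IPC for this interval
    return kept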

29

[Figure: distribution of Lusearch IPC samples on log-log axes: raw IPC, raw CPC, filtered IPC, and filtered CPC in [0.99, 1.01]; x-axis: IPC (log scale), y-axis: % of samples (log scale)]

Filtering Lusearch IPC samples

30

Top 10 methods (74% of total execution time)
[Figure: IPC of each of the top 10 methods under default 1 KHz, maximum 100 KHz, and SHIM 10 MHz sampling]

IPC of individual methods in Lucene

31

[Figure: overhead normalized to running without SHIM, for method and loop IDs, at 30-cycle and 1213-cycle sample periods]
Overheads from write invalidations
3 MHz: 1+ order of magnitude over interrupt 'maximum'
113 MHz: 3+ orders of magnitude over interrupt 'maximum'
Overheads from other core

Understanding Tail Latency

32

SHIM signals

Requests • thread ids • request id (configure) • time stamps, PC
System threads • thread ids • time stamp, PC

33

All requests

34

[Figure: latency (ms) of request groups, from the slowest 1% to the fastest 1%; series: client latency, average queueing time]

The Tail: longest 200 requests

35

[Figure: latency (ms) breakdown of the top 200 requests: network and network queueing time, idle time, CPU time, dispatch queueing time]

Network & other, Idle ← network imperfections, OS imperfections } noise
CPU work, Queuing at worker ← long requests, overload } not noise

Insight: long requests reveal themselves, regardless of the cause.

36

Noise: replicate & reissue [The Tail at Scale, Dean & Barroso, CACM’13]

37

[Figure: CDF of request latency; x-axis: latency (ms), y-axis: percentage of requests]

All requests? The CDF gives cost & potential. Fixed reissue time.

10% reissued, 5% reissued

noise
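A minimal sketch of replicate & reissue with a fixed reissue time, in the spirit of the hedged requests described in The Tail at Scale; the send function and replica list are placeholders rather than an API from the talk:

import concurrent.futures as cf

def hedged_request(send, replicas, reissue_after_s):
    """Issue to one replica; if no reply within reissue_after_s seconds,
    reissue to a second replica and return whichever answer arrives first."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(send, replicas[0])]
        done, _ = cf.wait(futures, timeout=reissue_after_s)
        if not done:                      # primary is in the tail: reissue
            futures.append(pool.submit(send, replicas[1]))
            done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        # Leaving the with-block waits for any straggler; a real system
        # would cancel the losing duplicate instead.
        return next(iter(done)).result()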

Probabilistic reissue [Optimal Reissue Policies for Reducing Tail Latencies, Kaler, He & Elnikety, SPAA’17]

38

[Figure: CDF of request latency; x-axis: latency (ms), y-axis: percentage of requests]

Adding randomness to reissue makes a single, earlier reissue time d (vs. n reissue times) optimal. The reissue probability is proportional to the reissue budget & the noise in the tail.

1-3% reissue w/ prob. p

noise

5% reissued
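A sketch of that single-time probabilistic policy, reusing the hedged_request sketch above; with probability p the request may be reissued at the earlier time d, so the expected extra load is p times the fraction of requests still outstanding at d:

import random

def probabilistic_reissue(send, replicas, d_s, p, rng=random):
    """Single reissue time d_s, taken only with probability p (e.g. p chosen so
    that roughly 1-3% of requests end up reissued)."""
    if rng.random() < p:
        return hedged_request(send, replicas, d_s)   # may reissue at d_s
    return send(replicas[0])                         # never reissue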

39

Single-R probabilistic reissue [Optimal Reissue Policies for Reducing Tail Latencies, Kaler, He & Elnikety, SPAA’17]

40

Work: speed up the tail efficiently

[Figure: CDF of request latency with the tail marked as 'work'; x-axis: latency (ms), y-axis: percentage of requests]

Judicious parallelism [ASPLOS’15]

DVFS faster on the tail [DISC’14, MICRO’17]

Asymmetric multicore [DISC’14, MICRO’17]


Work: parallelism. Parallelism historically for throughput.
Idea: parallelism for tail latency.

Queuing theory: optimizing average latency maximizes throughput, but not the tail! Shortening the tail reduces queuing latency.

42
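A toy single-server simulation of that point: two service-time distributions with the same mean (1.0) but different tails, fed by the same Poisson arrivals. By Pollaczek-Khinchine, mean queueing delay scales with the second moment of service time, so the heavy-tailed workload queues roughly 3x longer here (all numbers are made up for illustration):

import random

def mean_wait(service, arrival_rate, n=200_000, seed=1):
    """Lindley recursion for a FIFO single server: W_next = max(0, W + S - A)."""
    rng, wait, total = random.Random(seed), 0.0, 0.0
    for _ in range(n):
        s = service(rng)                    # service time of this request
        a = rng.expovariate(arrival_rate)   # gap until the next arrival
        wait = max(0.0, wait + s - a)
        total += wait
    return total / n

constant   = lambda rng: 1.0                                  # mean 1.0, no tail
heavy_tail = lambda rng: 0.5 if rng.random() < 0.9 else 5.5   # mean 1.0, 10% long
print(mean_wait(constant, 0.7), mean_wait(heavy_tail, 0.7))   # heavy tail waits far longer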

43

Parallelism

Insight: long requests reveal themselves.
Approach: incrementally add parallelism to long requests – the tail – based on request progress & load.
Parallelism historically for throughput; idea: parallelism for tail latency.

Few-to-Many Dynamic Parallelism [ASPLOS’15]


[Figure: tail latency (ms) vs Lucene RPS: sequential, 4-way, and fixed intervals of 20 ms, 100 ms, and 500 ms]

Few-to-Many at fixed delay d: add a thread every d ms (sketched below)


Long delay good at high load

Short delay good at low load

best at all loads?
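A sketch of that fixed-interval policy: each in-flight request starts sequential and is granted one more software thread every d ms, up to a hardware cap (illustrative code, not the ASPLOS’15 implementation):

def target_parallelism(elapsed_ms, d_ms, max_threads):
    """Few-to-Many at fixed delay d: 1 thread at first, +1 every d ms."""
    return min(max_threads, 1 + int(elapsed_ms // d_ms))

# With d = 20 ms and 4 hardware contexts:
# 0-19 ms -> 1 thread, 20-39 ms -> 2, 40-59 ms -> 3, 60+ ms -> 4.
assert [target_parallelism(t, 20, 4) for t in (0, 25, 45, 90)] == [1, 2, 3, 4]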

Offline

Profiles: sequential & parallel demand distribution; efficiency of parallelism

Choose maximum target parallelism: utilize available hardware resources

Exhaustively explore parallelism: given a set of time intervals t & load, find the best tail latency & parallelism
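One way to read that offline step, as a hedged sketch: enumerate candidate schedules (a parallelism degree per interval, only ever increasing, in the Few-to-Many spirit) and keep the one with the best measured or simulated tail. The evaluate_tail_latency callback is a hypothetical stand-in for replaying the profiled workload:

import itertools

def best_schedule(intervals_ms, max_parallelism, evaluate_tail_latency):
    """Exhaustive offline search over non-decreasing parallelism assignments."""
    best, best_lat = None, float("inf")
    for degrees in itertools.product(range(1, max_parallelism + 1),
                                     repeat=len(intervals_ms)):
        if list(degrees) != sorted(degrees):   # FM only ever adds parallelism
            continue
        schedule = dict(zip(intervals_ms, degrees))
        lat = evaluate_tail_latency(schedule)  # e.g. 99th percentile from replay
        if lat < best_lat:
            best, best_lat = schedule, lat
    return best, best_lat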

45

Interval Table

Online self scheduling


|requests|   Interval 0 = 0          Intervals 1, 2 = 50, 100
≤ 2          @0 parallelism = 3
3            @0 parallelism = 1       @50 parallelism = 3
4-6          @50 parallelism = 1      @100 parallelism = 3
≥ 7          @exit parallelism = 1    @100 parallelism = 3
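A sketch of how a request might consult such a table online, at each interval boundary, given the current number of requests in the system; the bucket boundaries and parallelism values below are illustrative stand-ins, not the measured table from the talk:

INTERVALS_MS = (0, 50, 100)          # decision points (interval boundaries)
TABLE = {                            # load bucket -> parallelism per interval
    "<=2": (3, 3, 3),                # light load: go parallel immediately
    "3":   (1, 3, 3),
    "4-6": (1, 1, 3),
    ">=7": (1, 1, 3),                # heavy load: stay sequential longer
}

def load_bucket(n_requests):
    if n_requests <= 2: return "<=2"
    if n_requests == 3: return "3"
    if n_requests <= 6: return "4-6"
    return ">=7"

def parallelism(n_requests, elapsed_ms):
    """Online self-scheduling: use the entry for the current load and the
    latest interval boundary this request has passed."""
    row = TABLE[load_bucket(n_requests)]
    idx = sum(1 for t in INTERVALS_MS if elapsed_ms >= t) - 1
    return row[idx]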

Evaluation: 2×8-core, 64-bit, 2.3 GHz Xeon, 64 GB

[Figure: tail latency (ms) vs requests per second, Few-to-Many vs Sequential: 21% fewer servers, or reduce the tail by 28%]

48

Work: speed up the tail efficiently

[Figure: CDF of request latency with the tail marked as 'work'; x-axis: latency (ms), y-axis: percentage of requests]

Judicious parallelism [ASPLOS’15]

49

Work: speed up the tail efficiently

[Figure: CDF of request latency with the tail marked as 'work'; x-axis: latency (ms), y-axis: percentage of requests]

DVFS faster on the tail [DISC’14, MICRO’17]

Asymmetric multicore (AMP) [DISC’14, MICRO’17]

all requests @ 2.3 GHz

50

Speed up the tail efficiently

[Figure: CDF of request latency at 2.3 GHz vs 0.5 GHz; x-axis: latency (ms), y-axis: percentage of requests]

DVFS faster on the tail [DISC’14, MICRO’17]

+ available in servers today

51

Speed up the tail efficiently

[Figure: CDF of request latency; x-axis: latency (ms), y-axis: percentage of requests]

DVFS faster on the tail [MICRO’17]

+ available in servers today
Asymmetric multicore (AMP) [DISC’14, MICRO’17]

+ much more energy efficient
+ hyper-threading is a form
- core competition


Adaptive Slow to Fast Framework

Slow to fast migration is optimal [ICAC’13]

Goal: minimize energy consumption and satisfy a tail latency target

Challenges: When to migrate? What if the core speed is not available?

Insight: use the big core just enough, running on the slow core up to a threshold t_h such that t_h + (l99 - t_h) / sp ≤ target, where sp is the big-core speedup. Migrate oldest first, and migrate early under load!

52
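Solving the constraint above for the threshold gives t_h ≤ (sp * target - l99) / (sp - 1); a small helper, with hypothetical numbers in the example:

def max_slow_core_time(target_ms, l99_ms, speedup):
    """Longest a request may stay on the slow core and still meet the target,
    from t_h + (l99 - t_h) / sp <= target (requires speedup > 1)."""
    return (speedup * target_ms - l99_ms) / (speedup - 1)

# E.g. if l99 is 120 ms when run slow, the big core is 2x faster, and the
# target is 80 ms, migrate by 40 ms: 40 + (120 - 40) / 2 = 80 ms.
assert max_slow_core_time(80, 120, 2.0) == 40.0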

Controller design

53

[Block diagram: the target tail latency and the measured tail latency produce an error; the controller adjusts the threshold; the system's tail latency is fed back]
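A minimal sketch of the loop that diagram implies: a proportional controller nudges the migration threshold using the error between target and measured 99th percentile latency (the gain and the numbers in the example are assumptions, not values from the talk):

def adjust_threshold(threshold_ms, measured_p99_ms, target_ms, gain=0.5):
    """If the measured tail is over target, shrink the threshold (migrate to the
    fast core earlier); if under, grow it to save energy."""
    error = target_ms - measured_p99_ms            # negative when missing the target
    return max(0.0, threshold_ms + gain * error)

# E.g. target 80 ms but measured 100 ms: the threshold drops from 40 ms to 30 ms.
assert adjust_threshold(40.0, 100.0, 80.0) == 30.0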

[Figures: CDFs of request latency shifting with load; x-axis: latency (ms), y-axis: percentage of requests]

All cores: Pegasus adjusts all-core frequencies for load [Towards Energy Proportional…, Google & Stanford, ISCA’14]

Per-core approaches: TL minimizes tail latency using a 0 threshold; EETL is energy efficient with a target tail latency

54

Policies

55

[Figures: 99th percentile latency (ms) vs RPS, and normalized average energy vs RPS, for TL, Pegasus, and EETL]

Lucene with DVFS on Broadwell

Lucene on emulated AMP and DVFS

56

[Figures: 99th percentile latency (ms) vs RPS, and normalized average energy vs RPS, for TL_DVFS, Pegasus, EETL_DVFS, TL_AMP, and EETL_AMP]

Tail Latency

57

Efficiency BOTH !

Efficiency at scale for interactive workloads

Diagnosing the tail with continuous profiling
Noise: replicate; systems are not perfect
Queuing: not today!
Work: judicious use of resources on long requests

Request latency CDF is a powerful tool
Tail efficiency ≠ average or throughput
Hardware heterogeneity

58

Thank you

Heterogeneous hardware dominates homogeneous hardware for throughput, performance, and energy with a fixed power budget & variable request demand. Slow-to-Fast: sacrifice the average a bit to reduce energy & tail latency.

59

Requirements pull for heterogeneity! [DISC’14, ICAC’13, submission]

60

61

[Diagram: big, little, and custom core configurations]

Hardware heterogeneity – opportunity & challenge

Processors: big, little, custom cores. Memory: DDR, NVM, flash, PIM-paired.


Parallelism

[Figure: 99th percentile latency (ms) vs Lucene RPS, sequential vs 4-way]

improves at low load

degrades at high load

Parallelism historically for throughput.
Idea: parallelism for tail latency.

Software & hardware

Lucene: open-source enterprise search; English Wikipedia, a 10 GB index of 33 million pages; 10k queries from Lucene nightly tests

Bing: web search with one Index Serving Node (ISN); 160 GB web index on SSD, 17 GB cache; 30k Bing user queries

Hardware: 2×8-core, 64-bit, 2.3 GHz Xeon, 64 GB, Windows; 15 request servers, 1 core issues requests; target parallelism = 24 threads

63

Policies
Sequential
N-way: a single degree of parallelism for each request
Adaptive: select the parallelism degree when a request starts, using system load [EUROSYS’13]
Request Clairvoyant: parallelizes long requests by perfect prediction of the tail
FM (Few-to-Many): incrementally add parallelism

64

Load variation

Alternate between high & low load; FM adapts to bursts with low variance

65

[Figure: tail latency (ms) vs Lucene RPS alternating between low and high load: Sequential, 2-way, 4-way, FM]

Fewer servers: Total Cost of ownership

[Figures: tail latency (ms) vs Lucene RPS. Sequential vs FM: 21% fewer servers; Adaptive vs FM: 9% fewer servers]