Profiling a warehouse-scale computer · “microservice architecture” thousands of services are...

Date post: 23-May-2020
Profiling a warehouse-scale computer Svilen Kanev Juan Pablo Darago Kim Hazelwood Parthasarathy Ranganathan, Tipp Moseley Gu-Yeon Wei, David Brooks Harvard University Universidad de Buenos Aires Yahoo Labs Google Inc. Harvard University
Transcript
Page 1: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”

Profiling a warehouse-scale computer

Svilen Kanev (Harvard University)
Juan Pablo Darago (Universidad de Buenos Aires)
Kim Hazelwood (Yahoo Labs)
Parthasarathy Ranganathan, Tipp Moseley (Google Inc.)
Gu-Yeon Wei, David Brooks (Harvard University)

The cloud is here to stay

[http://google.com/trends, 2015]

Warehouse-scale computers (of yore)

datacenters built around a few “killer workloads”

problem sizes >> 1 machine

...distributed, but tightly interconnected services

communication through remote-procedure calls (RPCs)

Now “the datacenter is the computer” (the WSC model has caught on)

“microservice architecture”

thousands of services are “one RPC away”

“... about a hundred of services that comprise Siri’s backend...”

[Apple, Mesos meetup 2015]

Did you mean: #pldi15

frequency[“#isca15”]++

How do modern WSC applications interact with hardware?

And what does that imply for future server processors?

Traditional profiling: load testing

Isolate a service

Find representative inputs

Find representative operating point

Profile / optimize

Repeat

Live datacenter-scale profiling (Google-wide profiling)

Select random production machines

~20,000 / day

GWP DB

[Ren et al. Google-wide profiling, 2010]

Profile each one (for a while) without isolation

while running live traffic for billions of users

Aggregate days, weeks, years' worth of execution
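
The sampling loop on this slide can be sketched roughly as follows. This is a minimal illustration, not GWP's actual implementation: `sample_machines`, `collect_profile`, and `aggregate_profiles` are hypothetical names, and the per-machine profile is simulated with random numbers where a real collector would read hardware performance counters.

```python
import random
from collections import Counter


def sample_machines(fleet, k, rng):
    """Pick k distinct machines uniformly at random from the fleet."""
    return rng.sample(fleet, k)


def collect_profile(machine, binaries, rng):
    """Hypothetical stand-in for a short, non-isolated profiling run.

    Returns simulated cycle samples per binary. A real collector would
    read performance counters while the machine serves live traffic.
    """
    return Counter({b: rng.randint(0, 100) for b in rng.sample(binaries, 2)})


def aggregate_profiles(profiles):
    """Sum per-binary cycle samples over many machine profiles."""
    total = Counter()
    for profile in profiles:
        total.update(profile)
    return total


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    fleet = [f"machine-{i}" for i in range(1000)]  # invented fleet
    binaries = ["search", "ads", "mail", "video"]  # invented binary names
    picked = sample_machines(fleet, 20, rng)
    db = aggregate_profiles(collect_profile(m, binaries, rng) for m in picked)
    print(db.most_common())
```

Because machines are picked at random and profiled without isolation, the aggregate approximates where fleet-wide cycles go rather than how any single service behaves under a synthetic load.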

Live WSC profiling insights

Where are cycles spent in a datacenter?

Are there really no killer applications?

How do WSC applications interact with instruction caches?

How much ILP is there? Big / small cores?

DRAM latency vs. bandwidth?

Hyperthreading?

Where are WSC cycles spent?

No “killer” application to optimize for

Instead: a long tail of various different services

[1 week of sampled WSC cycles]
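
One way to quantify a long tail like this is to sort binaries by cycle count and ask how many are needed to cover a given fraction of all cycles. The sketch below assumes a hypothetical dictionary of per-binary cycle samples; the function names and numbers are made up for illustration and do not come from the talk.

```python
def cumulative_coverage(cycles_by_binary):
    """Return the running fraction of cycles covered by the top-1, top-2, ... binaries."""
    total = sum(cycles_by_binary.values())
    if total == 0:
        return []
    running = 0
    coverage = []
    # Hottest binaries first, so entry n-1 is the share of the top n.
    for cycles in sorted(cycles_by_binary.values(), reverse=True):
        running += cycles
        coverage.append(running / total)
    return coverage


def binaries_needed(cycles_by_binary, target):
    """Smallest number of top binaries whose cycles reach the target fraction."""
    for n, fraction in enumerate(cumulative_coverage(cycles_by_binary), start=1):
        if fraction >= target:
            return n
    return len(cycles_by_binary)


if __name__ == "__main__":
    # Invented sample counts, purely illustrative.
    samples = {"svc-a": 50, "svc-b": 30, "svc-c": 20}
    print(cumulative_coverage(samples))
    print(binaries_needed(samples, 0.75))
```

A flat coverage curve, where many binaries are needed to reach a modest fraction of cycles, is what "no killer application" looks like in this representation.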

Ongoing application diversification

[~3 years of sampled WSC cycles]

Optimizing hardware one-application-at-a-time has diminishing returns

Within applications: no hotspots

Corollary: hunting for per-application hotspots is not justified

[search leaf node; 1 week of cycles]

Hotspots across applications: “datacenter tax”

Shared low-level routines; typical for larger-than-1-server problems

Hotspots across applications: “datacenter tax”

Prime candidates for accelerators in server SoCs

Only 6 self-contained routines account for ~30% of WSC cycles
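
A cross-application hotspot only becomes visible when samples are grouped by routine rather than by binary. The sketch below shows that grouping under stated assumptions: the prefix table, category names, and sample tuples are all hypothetical and are not the classification used in the talk.

```python
from collections import defaultdict

# Hypothetical mapping from symbol-name prefixes to shared-routine categories.
TAX_PREFIXES = {
    "rpc::": "rpc",
    "proto::": "serialization",
    "zip::": "compression",
    "alloc::": "allocation",
}


def classify(symbol):
    """Return the shared-routine category for a symbol, or 'application'."""
    for prefix, category in TAX_PREFIXES.items():
        if symbol.startswith(prefix):
            return category
    return "application"


def tax_share(samples):
    """Fraction of cycles per category, summed across all applications.

    samples: iterable of (application, symbol, cycles) tuples.
    """
    by_category = defaultdict(int)
    total = 0
    for _application, symbol, cycles in samples:
        by_category[classify(symbol)] += cycles
        total += cycles
    if total == 0:
        return {}
    return {category: c / total for category, c in by_category.items()}


if __name__ == "__main__":
    # Invented samples: no single application dominates, but rpc:: does.
    samples = [
        ("search", "rpc::Send", 10),
        ("ads", "rpc::Recv", 20),
        ("ads", "Score", 70),
    ]
    print(tax_share(samples))
```

Grouping by routine is a design choice: the same samples grouped by application would show a flat profile, while grouping by shared routine surfaces the common cost.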

Live WSC profiling insights

Where are cycles spent in a datacenter? Everywhere.

Are there really no killer applications? Datacenter tax.

How do WSC applications interact with instruction caches?

How much ILP is there? Big / small cores?

DRAM latency vs. bandwidth?

Hyperthreading?

Microarchitecture: WSC i-cache pressure

Severe instruction cache bottlenecks

15-30% of core cycles wasted on instruction-supply stalls

20,000 Intel Ivy Bridge servers
2 days
Top-Down analysis [Yasin 2014]
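
Top-Down analysis [Yasin 2014] attributes each pipeline issue slot to one of four top-level categories; instruction-supply stalls show up as the front-end-bound fraction. The sketch below shows the level-1 arithmetic for a 4-wide core. The function and parameter names are hypothetical, the event names in the comments are assumptions to check against the processor's event list, and the counter values in the usage example are invented.

```python
def top_down_level1(cycles, uops_not_delivered, uops_issued,
                    uops_retired_slots, recovery_cycles, width=4):
    """Split issue slots into the four Top-Down level-1 categories.

    Assumed counter sources (verify for the target CPU):
      cycles              ~ CPU_CLK_UNHALTED.THREAD
      uops_not_delivered  ~ IDQ_UOPS_NOT_DELIVERED.CORE
      uops_issued         ~ UOPS_ISSUED.ANY
      uops_retired_slots  ~ UOPS_RETIRED.RETIRE_SLOTS
      recovery_cycles     ~ INT_MISC.RECOVERY_CYCLES
    """
    slots = width * cycles  # total issue slots available
    frontend_bound = uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    # Whatever is left was stalled on the back end.
    backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)
    return {
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend_bound,
    }


if __name__ == "__main__":
    # Invented counter values, chosen only to make the arithmetic easy to follow.
    breakdown = top_down_level1(cycles=1000, uops_not_delivered=800,
                                uops_issued=2200, uops_retired_slots=2000,
                                recovery_cycles=50)
    for category, fraction in breakdown.items():
        print(f"{category}: {fraction:.1%}")
```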

Severe instruction cache bottlenecks

Fetching instructions from L3 caches

Very high i-cache miss rates
10x the highest in SPEC
50% higher than CloudSuite

15-30% of core cycles wasted on instruction-supply stalls

Lots of lukewarm code
100s of MBs of instructions per binary; no hotspots

A problem in the making

I-cache working sets 4-5x larger than largest in SPEC

Growing almost 30% / year

significantly faster than i-caches

One solution: L2 i/d partitioning
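
To get a feel for what growth of almost 30% per year means, the compound-growth arithmetic can be worked out directly. This is a back-of-the-envelope sketch: it treats the rate as exactly 30% and constant, which is an assumption, and the function names are hypothetical.

```python
import math


def doubling_time(annual_growth):
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def projected_size(initial, annual_growth, years):
    """Size after compounding the annual growth rate for the given years."""
    return initial * (1 + annual_growth) ** years


if __name__ == "__main__":
    rate = 0.30  # assumed constant; the slide says "almost 30% / year"
    print(f"doubling time: {doubling_time(rate):.2f} years")
    print(f"relative size after 5 years: {projected_size(1.0, rate, 5):.2f}x")
```

At that rate the working set doubles in roughly 2.6 years, so an i-cache of fixed size falls further behind with every processor generation.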

Live WSC profiling insights

Where are cycles spent in a datacenter? Everywhere.

Are there really no killer applications? Datacenter tax.

How do WSC applications interact with instruction caches? Poorly.

How much ILP is there? Big / small cores? Bimodal.

DRAM latency vs. bandwidth? Latency.

Hyperthreading? Yes.

To sum up

A growing number of programs cover “the world’s WSC cycles”. There is no “killer application”, and hand-optimizing each program is suboptimal.

Low-level routines (datacenter tax) are a surprisingly high fraction of cycles. Good candidates for accelerators in future server processors.

Common microarchitectural footprint: working sets too large for i-caches; many d-cache stalls; generally low IPC; bimodal ILP; low memory bandwidth utilization.

