Latency Trumps All
Web 2.0 Expo, Thursday, Nov. 19th, 2009, by Chris Saari
Transcript
Page 1: Latency Trumps All

Latency Trumps All
Chris Saari
twitter.com/chrissaari
blog.chrissaari.com
[email protected]

Page 2: Latency Trumps All

Packet Latency

Time for a packet to get from point A to point B: physical distance, plus time queued in devices along the way.

~60ms
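To get a feel for this number, you can time a round trip yourself. A minimal illustrative sketch (not from the talk) using only the Python standard library; a TCP connect costs roughly one network round trip, and the host is an arbitrary choice:

```python
import socket
import time

def connect_latency_ms(host, port=80):
    """Time a TCP handshake, which costs about one network round trip."""
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=5)
    sock.close()
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    print("%.1f ms" % connect_latency_ms("example.com"))
```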

Page 3: Latency Trumps All

...


Page 4: Latency Trumps All

Anytime...

...the system is waiting for data. The system is end to end:
- Human response time
- Network card buffering
- System bus/interconnect speed
- Interrupt handling
- Network stacks
- Process scheduling delays
- Application process waiting for data from memory to get to CPU, or from disk to memory to CPU
- Routers, modems, last mile speeds
- Backbone speed and operating condition
- Inter-cluster/colo performance

Page 5: Latency Trumps All

Big Picture

[Diagram: User, CPU, Memory, Disk, Network]

Page 6: Latency Trumps All

Tubes?


Page 7: Latency Trumps All

Latency vs. Bandwidth

[Diagram: latency is measured in time; bandwidth in bits per second]

Page 8: Latency Trumps All

Bandwidth of a Truck Full of Tape


Page 9: Latency Trumps All

Latency Lags Bandwidth - David Patterson

Given the record of advances in bandwidth versus latency, the logical question is why? Here are five technical reasons and one marketing reason.

1. Moore's Law helps bandwidth more than latency. The scaling of semiconductor processes provides both faster transistors and many more on a chip. Moore's Law predicts a periodic doubling in the number of transistors per chip, due to scaling and in part to larger chips; recently, that rate has been 22-24 months [6]. Bandwidth is helped by faster transistors, more transistors, and more pins operating in parallel. The faster transistors help latency, but the larger number of transistors and the relatively longer distances on the actually larger chips limit the benefits of scaling to latency. For example, processors in Table 1 grew by more than a factor of 300 in transistors, and by more than a factor of 6 in pins, but area increased by almost a factor of 5. Since distance grows by the square root of the area, distance in Table 1 doubled.

2. Distance limits latency. Distance sets a lower bound to latency. The delay on the long word lines and bit lines are the largest part of the row access time of a DRAM. The speed of light tells us that if the other computer on the network is 300 meters away, its latency can never be less than one microsecond.

3. Bandwidth is generally easier to sell. The nontechnical reason that latency lags bandwidth is the marketing of performance: it is easier to sell higher bandwidth than to sell lower latency. For example, the benefits of a 10Gbps bandwidth Ethernet are likely easier to explain to customers today than a 10-microsecond latency Ethernet, no matter which actually provides better value. One can argue that greater advances in bandwidth led to marketing techniques to sell bandwidth that in turn trained customers to desire it. No matter what the real chain of events, unquestionably higher bandwidth for processors, memories, or the networks is easier to sell today than latency. Since bandwidth sells, engineering resources tend to be thrown at bandwidth, which further tips the balance.

4. Latency helps bandwidth. Technology improvements that help latency usually also help bandwidth, but not vice versa. For example, DRAM latency determines the number of accesses per second, so lower latency means more accesses per second and hence higher bandwidth. Also, spinning disks faster reduces the rotational latency, but the read head must read data at the new faster rate as well. Thus, spinning the disk faster improves both bandwidth and rotational latency. However, increasing the linear density of bits per inch on a track helps bandwidth but offers no help to latency.

5. Bandwidth hurts latency. It is often easy to improve bandwidth at the expense of latency. Queuing theory quantifies how buffers help bandwidth but hurt latency. As a second example, adding chips to widen a memory module increases bandwidth but the higher fan-out on address lines may increase latency.

6. Operating system overhead hurts latency. A user program that wants to send a message invokes the ...

(Communications of the ACM, October 2004, Vol. 47, No. 10. Figure 1, a log-log plot of bandwidth and latency milestones from Table 1 relative to the first milestone, and Table 2, a summary of annual improvements in latency, capacity, and bandwidth in Table 1, are not reproduced here.)

Page 10: Latency Trumps All

The Problem

Relative Data Access Latencies, Fastest to Slowest
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-100)
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)

--- don't cross this line, don't go off the motherboard! ---


Page 13: Latency Trumps All

Relative Data Access Latency

[Diagram: Register, L1, L2, RAM, Hard Disk, Floppy/CD-ROM, LAN, WAN, from lower to higher latency]

Page 14: Latency Trumps All

CPU Register

CPU Register Latency - Average Human Height


Page 15: Latency Trumps All

L1 Cache


Page 16: Latency Trumps All

L2 Cache

x 6 to x 10

Page 17: Latency Trumps All

RAM

x 25 to x 100


Page 18: Latency Trumps All

Hard Drive

0.4 x equatorial circumference of Earth

x 10 M


Page 19: Latency Trumps All

WAN

x 100 M

0.42 x Earth to Moon Distance


Page 20: Latency Trumps All

To experience pain...

Mobile phone network latency is 2-10x that of wired
- iPhone 3G: 500ms ping

x 500 M

2 x Earth to Moon Distance

Page 21: Latency Trumps All

500ms isn’t that long...


Page 22: Latency Trumps All

Google SPDY

“It is designed specifically for minimizing latency through features such as multiplexed streams, request prioritization and HTTP header compression.”


Page 23: Latency Trumps All

Strategy Pattern: Move Data Up

Relative Data Access Latencies
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-50)
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)

Page 24: Latency Trumps All

Batching: Do it Once


Page 25: Latency Trumps All

Batching: Maximize Data Locality


Page 26: Latency Trumps All

Let’s Dig In

Relative Data Access Latencies, Fastest to Slowest
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-100)
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)


Page 30: Latency Trumps All

Network

If you can't Move Data Up, minimize accesses

Souders Performance Rules
1) Make fewer HTTP requests
- Avoid going halfway to the moon whenever possible
2) Use a content delivery network
- Edge caching gets data physically closer to the user
3) Add an expires header
- Instead of going halfway to the moon (Network), climb Godzilla (RAM) or go 40% of the way around the Earth (Disk) instead
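A far-future expires header tells the browser it may reuse its local copy without any network round trip at all. A minimal illustrative sketch (not from the talk) using Python's standard library; the one-year lifetime is an arbitrary assumption:

```python
import time
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

ONE_YEAR = 365 * 24 * 3600  # seconds; arbitrary far-future lifetime

class CachedAssetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"static asset bytes"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        # Both headers say: reuse the local copy (RAM or disk cache)
        # instead of crossing the network again.
        self.send_header("Cache-Control", "public, max-age=%d" % ONE_YEAR)
        self.send_header("Expires",
                         formatdate(time.time() + ONE_YEAR, usegmt=True))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CachedAssetHandler).serve_forever()
```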

Page 31: Latency Trumps All

Network: Packets and Latency

Less data = fewer packets = less packet loss = lower latency

Page 32: Latency Trumps All

Network

1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components
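Gzipping a text response trades cheap CPU cycles for fewer bytes on the wire, and thus fewer packets. An illustrative sketch with the Python standard library (the sample payload is made up):

```python
import gzip

payload = b"<html>" + b"<div>hello latency</div>" * 500 + b"</html>"
compressed = gzip.compress(payload)

# Fewer bytes means fewer packets, less loss, less queuing on the path.
print(len(payload), "bytes raw")
print(len(compressed), "bytes gzipped")
```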

Page 33: Latency Trumps All

Disk: Falling off the Latency Cliff

Page 34: Latency Trumps All

Jim Gray, Microsoft 2006

Tape is Dead
Disk is Tape
Flash is Disk
RAM Locality is King

Page 35: Latency Trumps All

Strategy: Move Up: Disk to RAM

RAM gets you above the exponential latency line
- Linear cost and power consumption = $$$

Main memory (25-50) vs. hard drive (1e7)
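The simplest way to move data up is a RAM cache in front of the disk. A minimal sketch (illustrative; the loader function is hypothetical): functools.lru_cache keeps hot results in memory, so only cache misses pay the disk's latency.

```python
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=1024)  # hot entries stay in RAM (~25-100 cycles, not ~1e7)
def read_blob(path: str) -> bytes:
    # Hypothetical loader: only the first read of each path touches disk.
    return Path(path).read_bytes()
```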

Page 36: Latency Trumps All

Strategy: Avoidance: Bloom Filters

- Probabilistic answer to whether a member is in a set
- Constant time via multiple hashes
- Constant space bit string
- Used in BigTable, Cassandra, Squid
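A minimal Bloom filter sketch (illustrative; not any particular system's implementation), deriving its k hash functions by salting one hash. A negative answer is definitive and costs only an in-memory check, so the expensive disk or network lookup is avoided; a positive answer may be a false positive:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # k hash functions simulated by salting a single hash.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False: definitely absent, skip the slow lookup entirely.
        # True: probably present, go do the real read.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
bf.add(b"row:123")
assert bf.might_contain(b"row:123")
```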

Page 37: Latency Trumps All

In Memory Indexes

Haystack keeps file system indexes in RAM
- Cut disk accesses per image from 3 to 1

Search index compression
GFS master node prefix compression of names
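The Haystack idea in one illustrative sketch (the layout and names here are made up, not Facebook's actual format): an in-RAM dict maps each photo id to its offset and size in one big store file, so a read is exactly one seek plus one contiguous read.

```python
import io

index = {}  # in RAM: photo id -> (offset, size) in one large store file

def append_photo(store, photo_id, data: bytes):
    store.seek(0, io.SEEK_END)
    index[photo_id] = (store.tell(), len(data))
    store.write(data)

def read_photo(store, photo_id):
    offset, size = index[photo_id]  # RAM lookup: no disk touched
    store.seek(offset)              # one seek...
    return store.read(size)         # ...one read: a single disk access

store = io.BytesIO()  # stand-in for the big on-disk store file
append_photo(store, "photo-1", b"jpeg bytes")
assert read_photo(store, "photo-1") == b"jpeg bytes"
```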

Page 38: Latency Trumps All

Managing Gigabytes - Witten, Moffat, and Bell

Page 39: Latency Trumps All

SSDs

                Disk                             SSD
I/O ops/sec     ~70-100 (~180-200 at 15K RPM)    ~10K-100K
Seek time       ~3.2-7 ms                        ~0.05-0.085 ms

SSDs < 1/5th power consumption of spinning disk

Page 40: Latency Trumps All

Sequential vs. Random Disk Access

- James Hamilton


Page 41: Latency Trumps All

1TB Sequential Read


Page 42: Latency Trumps All

1TB Random Read

[Calendar graphic: the random read of the same 1TB finishes on day 15... Done!]


Page 44: Latency Trumps All

Strategy: Batching and Streaming

Fewer reads/writes of large contiguous chunks of data
- GFS 64MB chunks

Requires data locality
- BigTable app-specified data layout and compression
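The same instinct at file-I/O scale, as an illustrative sketch: stream a file in large contiguous chunks rather than many small scattered reads. The 64MB constant mirrors the GFS chunk size; the filename is hypothetical.

```python
CHUNK = 64 * 1024 * 1024  # 64MB, mirroring GFS's chunk size

def stream_chunks(path):
    """Yield large contiguous chunks: few seeks, mostly sequential transfer."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            yield chunk

# for chunk in stream_chunks("big.log"):
#     process(chunk)
```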

Page 45: Latency Trumps All

The CPU


Page 46: Latency Trumps All

“CPU Bound”

Data in RAM, and CPU access to that data

Page 47: Latency Trumps All

The Memory Wall


Page 48: Latency Trumps All

Latency Lags Bandwidth
- Dave Patterson

Page 49: Latency Trumps All

Multicore Makes It Worse!

More cores accelerate the rate of divergence
- CPU performance doubled 3x over the past 5 years
- Memory performance doubled once

Page 50: Latency Trumps All

Evolving CPU Memory Access Designs

Intel Nehalem integrated memory controller and new high-speed interconnect

40 percent shorter latency and increased bandwidth, 4-6x faster system


Page 51: Latency Trumps All

More CPU evolution

Intel Nehalem-EX
- 8 cores, 24MB of cache, 2 integrated memory controllers
- Ring interconnect: an on-die network designed to speed the movement of data among the caches used by each of the cores

IBM Power 7
- 32MB Level 3 cache

AMD Magny-Cours
- 12 cores, 12MB of Level 3 cache

Page 52: Latency Trumps All

Cache Hit Ratio



Page 57: Latency Trumps All

Cache Line Awareness

Linked list
- Each node as a separate allocation is bad

Hash table
- Reprobe on collision with stride of 1

Stack allocation
- Top of stack is usually in cache; top of the heap usually is not

Pipeline processing
- For stages of operations on a piece of data, do them all at once vs. each stage separately

Optimize for size
- Might execute faster than code optimized for speed
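Cache behavior is visible even from Python. A rough illustrative timing sketch (exact numbers vary by machine): summing the same list in memory order touches consecutive cache lines, while summing it in shuffled order misses constantly.

```python
import random
import time

N = 10_000_000
data = list(range(N))
in_order = list(range(N))
shuffled = in_order[:]
random.shuffle(shuffled)

def time_sum(order):
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return time.perf_counter() - start

print("sequential: %.2fs" % time_sum(in_order))
print("shuffled:   %.2fs" % time_sum(shuffled))  # slower: cache misses
```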

Page 58: Latency Trumps All

Cycles to Burn

1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components
- Use excess compute for compression

Page 59: Latency Trumps All

Datacenter


Page 60: Latency Trumps All

Datacenter Storage Hierarchy: a different view

A bumpy ride that has been getting bumpier over time
- Jeff Dean, Google

Page 61: Latency Trumps All

Intra-Datacenter Round Trip

x 500,000

~500 miles, ~NYC to Columbus, OH

Page 62: Latency Trumps All

Datacenter Level Systems

Facebook Cassandra
Google BigTable
memcached
Redis
Project Voldemort
Yahoo Sherpa
Sawzall / Pig
Google File System
RethinkDB
MonetDB
HBase
Facebook Haystack


Page 67: Latency Trumps All

Memcached Facebook Optimizations

- UDP to reduce network traffic - fewer packets
- One core saturated with network interrupt handling
  - Opportunistic polling of the network interfaces, and setting interrupt coalescing thresholds aggressively - Batching
- Contention on the network device transmit queue lock; packets added/removed from the queue one at a time
  - Changed the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets
- More lock contention fixes
- Result: 200,000 UDP requests/second with average latency of 173 microseconds
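The same batching instinct applies on the client side: one multi-get replaces N single gets, and N round trips collapse into one. An illustrative sketch using the python-memcached client, assuming a memcached server on localhost:

```python
import memcache  # python-memcached client library

mc = memcache.Client(["127.0.0.1:11211"])

keys = ["user:1", "user:2", "user:3"]

# N separate gets would cost N network round trips:
#   values = {k: mc.get(k) for k in keys}

# One batched request costs a single round trip for all keys:
values = mc.get_multi(keys)
```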


Page 73: Latency Trumps All

Google BigTable

Table contains a sequence of blocks
- Block index loaded into memory - Move Up

Table can be completely mapped into memory - Move Up

Bloom filters hint for data - Move Up

Locality groups loaded in memory - Move Up, Batching
- Clients can control compression of locality groups

2 levels of caching - Move Up
- Scan cache of key/value pairs, and block cache

Clients cache tablet server locations
- 3 to 6 network trips if cache is invalid - Move Up


Page 78: Latency Trumps All

Facebook Cassandra

Bloom filters used for keys in files on disk - Move Up

Sequential disk access only - Batching
- Append without read-ahead

Log to memory and write to commit log on dedicated disk - Batching

Programmer-controlled data layout for locality - Batching

Result: 2 orders of magnitude better performance than MySQL

Page 79: Latency Trumps All

Move the Compute to the Data: YQL Execute



Page 83: Latency Trumps All

From the Browser Perspective

Performance bounded by 3 things:
- Fetch time
  - Unless you're bundling everything, it is a cascade of interdependent requests, at least 2 phases' worth
- Parse time
  - HTML
  - CSS
  - JavaScript
- Execution time
  - JavaScript execution
  - DOM construction and layout
  - Style application

Page 84: Latency Trumps All

Recap

Move Data Up
- Caching
- Compression
- If you can't move all the data up:
  - Indexes
  - Bloom filters

Batching and Streaming
- Maximize locality

Page 85: Latency Trumps All

Take 2 And Call Me In The Morning

An Engineer's Guide to Bandwidth
- http://developer.yahoo.net/blog/archives/2009/10/a_engineers_gui.html

High Performance Web Sites
- Steve Souders

Even Faster Web Sites
- Steve Souders

Managing Gigabytes: Compressing and Indexing Documents and Images
- Witten, Moffat, Bell

Yahoo Query Language (YQL)
- http://developer.yahoo.com/yql/

