Developing the fastest HTTP/2 server


DeNA Co., Ltd.
Kazuho Oku

Who am I?

■  Kazuho Oku
■  Major works:
  ⁃  Palmscape / Xiino (web browser for Palm OS)
    •  awarded M.I.T. TR 100/2002
  ⁃  Mitoh project 2004 super creator
  ⁃  Q4M (message queue plugin for MySQL)
    •  MySQL Conference Community Awards 2011
  ⁃  H2O (HTTP/2 server)
    •  Japan OSS Contribution Award 2015


Background

Responsiveness is important

source: http://radar.oreilly.com/2009/06/bing-and-google-agree-slow-pag.html

■  500 ms increase in response time → -1.2% revenue

Increasing size and # of requests

source: http://httparchive.org/trends.php?s=All&minlabel=Aug+1+2011&maxlabel=Aug+1+2015#bytesTotal&reqTotal

Bandwidth is also increasing

■  end-users' bandwidth increases 50% every year (Nielsen's Law)

source: http://www.nngroup.com/articles/law-of-bandwidth/

More bandwidth doesnʼt matter

source: More Bandwidth Doesn't Matter - 2011 Mike Belshe (Google)

* effective B/W reaches ceiling at around 1.6Mbps

Latency is the new bottleneck

source: More Bandwidth Doesn't Matter - 2011 Mike Belshe (Google)

Latency cannot be optimized

■  latency = speed of light
  ⁃  round-trip between Japan and the US: 80ms
■  mobile carriers have huge latency
  ⁃  LTE ~ 50ms
■  the Web is becoming more and more complex

The Web is becoming slower ... unless we do something.

Solution: new protocol

HTTP/2!

The reasons HTTP/1.1 is slow

■  concurrency is too small
  ⁃  multiple round-trips required when issuing many requests
■  no prioritization between requests
  ⁃  cannot suspend HTML / image streams in favor of CSS / JS
■  big request / response headers
  ⁃  typically hundreds of octets
  ⁃  becomes an overhead when issuing many requests

HTTP/2

■  RFC 7540 (2015/5)
  ⁃  based on SPDY by Google
■  key features:
  ⁃  binary protocol
  ⁃  header compression
  ⁃  multiplexing
  ⁃  prioritization

Benchmark

■  red bar: time spent until first-paint
■  big difference between server implementations
■  reason: quality of prioritization logic
■  H2O shows the true potential of HTTP/2

Have we reached the limit?

Letʼs consider what would be the ideal HTTP flow.

TCP slow start

■  Initial Congestion Window (IW)=10
  ⁃  only 10 packets can be sent in the first RTT
  ⁃  used to be IW=3
■  window increase: 1.5x/RTT

[Chart: TCP slow start (IW 10, MSS 1460); y-axis: bytes transmitted, 100,000 to 800,000; x-axis ticks: 1 to 8]
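
The growth described on this slide can be tried out with a few lines of C. This is only a rough sketch: the function name and the simple model (window multiplied by a fixed factor every RTT, no loss, no receiver limit) are assumptions made for illustration, while IW=10, 1.5x/RTT and MSS 1460 are the figures from the slide.

```c
/* Cumulative bytes a sender can have transmitted after `rtts` round trips,
 * assuming the window starts at `iw` packets and is multiplied by `growth`
 * every RTT. Packet loss and receiver-side limits are ignored. */
double slow_start_bytes(unsigned rtts, double iw, double growth, double mss)
{
    double cwnd = iw;   /* window, in packets, for the current RTT */
    double total = 0.0; /* packets sent so far */

    for (unsigned i = 0; i < rtts; ++i) {
        total += cwnd;
        cwnd *= growth;
    }
    return total * mss;
}
```

With these inputs, slow_start_bytes(1, 10, 1.5, 1460) gives 14,600 bytes and slow_start_bytes(8, 10, 1.5, 1460) gives roughly 719,000 bytes, so even after eight round trips well under 1 MB has been sent.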

Flow of the ideal HTTP

■  fastest within the limits of TCP/IP
■  receive a request 0-RTT, and:
  ⁃  first send CSS/JS*
  ⁃  then send the HTML
  ⁃  then send the images*

*: but only the ones not cached by the browser

[Diagram: client and server; one request from the client, followed by the response from the server]

The reality in HTTP/2

■  TCP establishment: +1 RTT
■  TLS handshake: +2 RTT*
■  HTML fetch: +1 RTT
■  JS, CSS fetch: +2 RTT**
■  Total: 6 RTT

*: 1 RTT on reconnection
**: servers often cannot switch to sending JS, CSS instantly, due to the output buffered in the TCP send buffer

[Sequence diagram, client and server: TCP SYN, TCP SYN ACK, TLS Handshake, ..., GET css,js, css,js]

Ongoing optimizations

■  TCP Fast Open
  ⁃  connection establishment in 0 RTT
■  TLS 1.3
  ⁃  initial handshake complete in 1 RTT
  ⁃  resumption in 0 RTT
■  what can be done in the HTTP/2 layer?

Further optimizations in HTTP/2 layer

■  optimize TCP for responsiveness
■  Cache-aware server push

Optimizing TCP for responsiveness

Typical sequence of HTTP/2

[Sequence diagram, client and server]
server: HTTP/2 200 OK
server: <!DOCTYPE HTML>…<SCRIPT SRC="jquery.js">…
client: GET /jquery.js
server: need to switch sending from HTML to JS at this very moment (means that the amount of data sent in * must be smaller than IW)

Buffering in TCP and TLS layer

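[Diagram: HTTP/2 frames → TLS records → BIO buf. → TCP send buffer; the TCP send buffer is marked with unacked, CWND and the poll threshold, and is split into data that is sent immediately and data that is not immediately sent]

// ordinary code (non-blocking)
// SSL_write() returns <= 0 once the buffers are full, at which point
// SSL_get_error() reports SSL_ERROR_WANT_WRITE
while (SSL_write(…) > 0)
    ;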

Why do we have buffers?

■  TCP send buffer:
  ⁃  reduce ping-pong between kernel and application
■  BIO buffer:
  ⁃  for data that couldn't be stored in the TCP send buffer

Improvement: poll-then-write

// only call SSL_write when poll notifies the app.
while (poll_for_write(fd) == SOCKET_IS_READY)
    SSL_write(…);

[Diagram: HTTP/2 frames → TLS records → TCP send buffer]
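
The slide's poll_for_write() is pseudocode. One way it might look on a POSIX system is sketched below; the function name, the return values and the timeout parameter are assumptions for illustration, not code from H2O.

```c
#include <poll.h>

/* Returns 1 when `fd` is ready for writing, 0 on timeout, -1 on error.
 * Hypothetical helper standing in for the slide's poll_for_write(). */
int wait_until_writable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1; /* poll itself failed; errno is set */
    if (n == 0)
        return 0;  /* timed out, nothing is writable yet */
    if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
        return -1; /* the descriptor is in an error state */
    return (pfd.revents & POLLOUT) ? 1 : 0;
}
```

The application would call SSL_write only after this returns 1, so data is generated as late as possible instead of piling up in a user-space buffer.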

Adjust poll threshold

■  set poll threshold to the end of CWND?
  ⁃  setsockopt(TCP_NOTSENT_LOWAT)
  ⁃  in Linux, the minimum is CWND + 1 octet
    •  becomes unstable when set to CWND + 0
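
A minimal sketch of the setsockopt call named above, assuming a platform that defines TCP_NOTSENT_LOWAT in <netinet/tcp.h>. The wrapper name and the fallback branch are illustrative additions.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Asks the kernel to report the socket as writable only while fewer than
 * `bytes` of not-yet-sent data sit in the TCP send buffer.
 * Returns 0 on success, -1 on failure (errno is set). */
int set_notsent_lowat(int fd, unsigned int bytes)
{
#ifdef TCP_NOTSENT_LOWAT
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &bytes, sizeof(bytes));
#else
    (void)fd;
    (void)bytes;
    errno = ENOPROTOOPT; /* option not available on this platform */
    return -1;
#endif
}
```

Calling set_notsent_lowat(fd, 1) would presumably correspond to the "CWND + 1 octet" setting mentioned on the slide.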

Adjust poll threshold

// only call SSL_write when poll notifies the app.
while (poll_for_write(fd) == SOCKET_IS_READY)
    SSL_write(…);

[Diagram: HTTP/2 frames → TLS records → TCP send buffer]

Further improvement: read TCP states

// calc size of data to send by calling getsockopt(TCP_INFO)
if (poll_for_write(fd) == SOCKET_IS_READY) {
    capacity = CWND - unacked + ONE_MSS - TLS_overhead;
    SSL_write(prepare_http2_frames(capacity));
}

[Diagram: HTTP/2 frames → TLS records → TCP send buffer]

Issues in the proposed approach

■  increased delay between ACK recv. → data send
  ⁃  leads to slower peak speed
  ⁃  reason:
    •  traditional approach: completes within kernel
    •  this approach: application needs to be notified to generate new data
■  solution:
  ⁃  use the approach only when necessary
    •  i.e. when RTT is big and CWND is small
    •  increased delay can be ignored if: delay << RTT

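Code for calculating size of data to send

size_t get_suggested_write_size() {
    socklen_t tcp_info_len = sizeof(tcp_info);
    /* read the kernel's view of the connection (RTT, CWND, unacked, MSS) */
    getsockopt(fd, IPPROTO_TCP, TCP_INFO, &tcp_info, &tcp_info_len);
    /* only use the approach when RTT is big and CWND is small */
    if (tcp_info.tcpi_rtt < min_rtt || tcp_info.tcpi_snd_cwnd > max_cwnd)
        return UNKNOWN;

    switch (SSL_get_current_cipher(ssl)->id) {
    case TLS1_CK_RSA_WITH_AES_128_GCM_SHA256:
    case …:
        /* record header (5) + explicit nonce (8) + GCM tag (16) */
        tls_overhead = 5 + 8 + 16;
        break;
    default:
        return UNKNOWN;
    }

    /* packets that still fit in CWND, plus the one extra packet */
    packets_sendable = tcp_info.tcpi_snd_cwnd > tcp_info.tcpi_unacked ?
        tcp_info.tcpi_snd_cwnd - tcp_info.tcpi_unacked : 0;
    return (packets_sendable + 1) * (tcp_info.tcpi_snd_mss - tls_overhead);
}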
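
To see what the calculation yields, the arithmetic can be isolated from the socket and TLS calls. The function below is a sketch that takes the TCP_INFO fields as plain numbers; its name and signature are assumptions for illustration, and it expects the MSS to be larger than the TLS overhead.

```c
#include <stddef.h>
#include <stdint.h>

/* Same arithmetic as get_suggested_write_size(), with the TCP_INFO fields
 * passed in as plain numbers so it can be tried without a socket. */
size_t suggested_write_size(uint32_t snd_cwnd, uint32_t unacked,
                            uint32_t snd_mss, uint32_t tls_overhead)
{
    /* packets that can still be sent within the congestion window */
    uint32_t packets_sendable = snd_cwnd > unacked ? snd_cwnd - unacked : 0;

    /* one extra packet, each carrying MSS minus the per-record TLS cost */
    return (size_t)(packets_sendable + 1) * (snd_mss - tls_overhead);
}
```

For example, with CWND=10, 4 packets unacked, MSS 1460 and the 29-octet AES-GCM overhead, suggested_write_size(10, 4, 1460, 29) is 7 * 1431 = 10017 octets. When the window is already full, the result is a single packet's worth, 1431 octets.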

Benchmark

■  conditions:
  ⁃  server in Ireland, client in Japan (RTT 250ms)
  ⁃  load tiny js at the top of a large HTML
■  result: delay decreased from 511ms to 250ms
  ⁃  i.e. JS fetch latency was 2 RTT, became 1 RTT
    •  similar results in other environments

Conclusion

■  near-optimal result can be achieved
  ⁃  by adjusting poll threshold and reading TCP states
  ⁃  1-packet overhead due to restriction in Linux kernel
■  1-RTT improvement in H2O
  ⁃  estimated 1-RTT improvement per the depth of the load graph

Same problem exists with load balancers

■  L4 L/B or TLS terminator also act as buffers
  ⁃  impact bigger than that of TCP send buffer of httpd
■  solution:
  ⁃  best: don't use L/B
  ⁃  next to best: implement mitigations in L/B
  ⁃  long-term: TCP migration + L3 NAT or DSR
    •  i.e. accept in L/B, then transfer the connection to HTTP/2 server

Cache-aware Server Push

What is server-push?

■  start the delivery of CSS / JS when receiving a request for HTML
■  effect:
  ⁃  1 RTT reduction, or more

Use-case: conceal request process time

■  ex. RTT=50ms, process time=200ms

[Diagram: timelines of req., process request, HTML and asset (or push-asset) for both cases]
  ⁃  without push: 450ms (5 RTT + processing time)
  ⁃  with push: 250ms (1 RTT + processing time)

Use-case: conceal network distance

■  CDNs' use-case
  ⁃  utilize the connection while waiting for the app. response
  ⁃  side-effect: reduce the number of app DCs

[Diagram: client, edge server (CDN), app. server; the edge server sends push-asset to the client while the req. / HTML exchange with the app. server is in flight]

Issues of server-push

■  how to determine if a resource is already cached
  ⁃  shouldn't push a resource already in cache
    •  waste of bandwidth (and time)
  ⁃  can't issue a request to identify the cache state
    •  since it would waste the 1 RTT we are trying to reduce!

Cache-aware server push

■  experimental feature since H2O 1.5
■  create a digest of URLs found in browser cache
  ⁃  uses Golomb coded sets
    •  space-efficient variant of bloom filter
■  server uses the digest to determine whether or not to push

Memo: fresh vs. stale

■  two states of a cached resource
■  fresh:
  ⁃  resource that can be used
  ⁃  example: Expires: Jan 1 2030
■  stale:
  ⁃  needs revalidation before use
    •  i.e. issue GET with if-modified-since

Generating a digest

1.  calc hashcode of the URL of every fresh cache entry
  ⁃  range: 0 .. #-of-URLs / false-positive-rate
2.  sort the hashcodes, remove duplicates
3.  emit the first element using the following encoding:
  1.  "value * FPR" (rounded down) using unary coding
  2.  "value mod (1/false-positive-rate)" using binary coding
4.  for every other element, emit the delta from the preceding element, minus one, using the same encoding
5.  pad with 1s up to the byte boundary

Generating a digest

■  scenario:
  ⁃  FPR: 1/256
  ⁃  URLs of fresh resources in cache:
    •  https://example.com/ecma.js
    •  https://example.com/style.css
■  calc hash modulo 512: 0x3d, 0x16b
■  sort, remove dupes, and emit the delta:
  ⁃  0x3d → 0 00111101
  ⁃  0x16b - 0x3d - 1 → 0x12d → 10 00101101
  ⁃  padding → 11111 (19 bits so far, padded to 24)
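
The encoding steps can be sketched in C as follows. This is an illustration, not the code used in H2O: the function and struct names are made up, the FPR is restricted to a power of two (1 / 2^fpr_bits, with fpr_bits below 32), and the hash values are taken as input because the slides do not say which hash function is used.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Bit writer that appends bits MSB-first into a caller-supplied buffer. */
struct bitbuf {
    uint8_t *bytes;
    size_t capacity; /* in bytes */
    size_t nbits;    /* bits written so far */
};

static int put_bit(struct bitbuf *b, int bit)
{
    if (b->nbits / 8 >= b->capacity)
        return -1; /* out of space */
    if (bit)
        b->bytes[b->nbits / 8] |= (uint8_t)(0x80 >> (b->nbits % 8));
    b->nbits++;
    return 0;
}

static int cmp_u32(const void *x, const void *y)
{
    uint32_t a = *(const uint32_t *)x, b = *(const uint32_t *)y;
    return (a > b) - (a < b);
}

/* Encodes `n` hash values with FPR = 1 / 2^fpr_bits. Sorts `hashes` in
 * place. Returns the digest length in bytes, or 0 if `out` is too small
 * (an empty input also yields 0). */
size_t gcs_encode(uint32_t *hashes, size_t n, unsigned fpr_bits,
                  uint8_t *out, size_t out_cap)
{
    struct bitbuf b = {out, out_cap, 0};
    uint32_t prev = 0;
    int have_prev = 0;

    memset(out, 0, out_cap);
    qsort(hashes, n, sizeof(hashes[0]), cmp_u32);

    for (size_t i = 0; i < n; ++i) {
        if (have_prev && hashes[i] == prev)
            continue; /* step 2: remove duplicates */
        /* steps 3 and 4: first value as-is, later ones as delta minus one */
        uint32_t v = have_prev ? hashes[i] - prev - 1 : hashes[i];
        uint32_t q = v >> fpr_bits; /* "value * FPR", rounded down */

        for (uint32_t k = 0; k < q; ++k) /* unary: q ones, then a zero */
            if (put_bit(&b, 1) != 0)
                return 0;
        if (put_bit(&b, 0) != 0)
            return 0;
        for (unsigned k = fpr_bits; k-- > 0;) /* remainder, MSB first */
            if (put_bit(&b, (int)((v >> k) & 1)) != 0)
                return 0;

        prev = hashes[i];
        have_prev = 1;
    }
    while (b.nbits % 8 != 0) /* step 5: pad with 1s */
        if (put_bit(&b, 1) != 0)
            return 0;
    return b.nbits / 8;
}
```

Feeding the scenario's two hash values (0x3d and 0x16b, fpr_bits = 8) into gcs_encode produces the three octets 0x1e 0xc5 0xbf, which is the bit sequence of the slide packed MSB-first.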

Overhead of sending the digest

■  size: #-of-URLs * (log2(1/FPR) + 1.x) bits
■  1,400 URLs can be stored in 1 packet
  ⁃  when false-positive-rate set to 1/128
■  can raise FPR to cram more URLs
  ⁃  false-positive means the resource is not pushed, browser can just pull it
  ⁃  pushing some of the required resources is better than none
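
The size estimate is easy to check numerically. In the sketch below the function name is hypothetical, the FPR is again assumed to be 1 / 2^fpr_bits, and the per-entry overhead (the slide's "1.x") is left as a parameter because its exact value is not given.

```c
/* Approximate digest size in bytes for `n_urls` entries when
 * FPR = 1 / 2^fpr_bits and each entry costs `extra_bits` on top of the
 * binary-coded remainder (the slide's "1.x"). */
double gcs_digest_bytes(unsigned n_urls, unsigned fpr_bits, double extra_bits)
{
    return n_urls * (fpr_bits + extra_bits) / 8.0;
}
```

With FPR = 1/128 (fpr_bits = 7) and an overhead of exactly 1 bit, gcs_digest_bytes(1400, 7, 1.0) is 1,400 bytes; any overhead up to about 1.3 bits keeps 1,400 URLs within a 1,460-byte MSS.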

Where to store the digest?

■  cookie
  ⁃  pros: runs on any browser, anytime
  ⁃  cons: digest becomes inaccurate
    •  only the browser knows what's in the browser cache
■  ServiceWorker (+ServiceWorker Cache)
  ⁃  pros: runs on Chrome, Firefox
  ⁃  cons: doesn't start until leaving the landing page
■  HTTP/2 frame
  ⁃  pros: minimal octets transferred
    •  thanks to the knowledge of HTTP/2 connection
  ⁃  cons: needs to be implemented by browser developers

Discussion at IETF

■  IETF 95 (April)
  ⁃  initial submission of the internet draft
    •  co-author: Mark Nottingham (HTTP WG Chair)
  ⁃  defines the HTTP/2 frame
    •  since it's the best way in the long-term
    •  store the frame in headers / cookies for the short-term
■  IETF 96, HTTP Workshop (July)
  ⁃  to define digest calculation of stale resources

Handling stale resources

■  hash key changed to URL + ETag
  ⁃  anyone needs support for last-modified?
■  server uses URL + ETag of the resource to check the digest
  ⁃  push the resource in case a match is not found
  ⁃  push 304 Not Modified in case a match is found

Difficulties in pushing 304

■  ETag cannot always be obtained immediately
  ⁃  cannot build If-None-Match request header without ETag
  ⁃  the "request*" of a pushed resource SHOULD be sent before the main response
■  proposed solution:
  ⁃  allow 304 against a non-conditional GET

*: in case of server-push, the server generates both the request and the response, and sends them to the client.

Using server-push from Ruby

■  Link: rel=preload header
  ⁃  web server pushes the specified URL

HTTP/1.1 200 OK

Content-Type: text/html

Link: </style.css>; rel=preload # this header!!!

  ⁃  supported by:
    •  H2O, nghttpx (nghttp2), mod_h2 (Apache)
  ⁃  patch for nginx exists
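
On the server side, acting on this header comes down to pulling the target out of the header value. The following is a deliberately small sketch, not how any of the servers listed above implement it: the function name is hypothetical, and it handles only a single unquoted link of the form shown on the slide.

```c
#include <stddef.h>
#include <string.h>

/* Copies the target of a `<...>; rel=preload` Link header value into
 * `url`. Returns 0 on success, -1 if the value has no <...> target, is
 * not rel=preload, or does not fit. Quoted rel values and comma-separated
 * lists of links are out of scope for this sketch. */
int preload_target(const char *value, char *url, size_t url_cap)
{
    const char *lt = strchr(value, '<');
    if (lt == NULL)
        return -1;
    const char *gt = strchr(lt + 1, '>');
    if (gt == NULL)
        return -1;
    if (strstr(gt + 1, "rel=preload") == NULL)
        return -1; /* some other link relation; nothing to push */

    size_t len = (size_t)(gt - lt - 1);
    if (len + 1 > url_cap)
        return -1; /* target plus terminator does not fit */
    memcpy(url, lt + 1, len);
    url[len] = '\0';
    return 0;
}
```

For the header above, preload_target("</style.css>; rel=preload", buf, sizeof(buf)) leaves "/style.css" in buf, which is the path the server would then push.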

The issue with Link: rel=preload

■  cannot initiate push while processing the request

[Sequence diagram: client, HTTP/2 server, web app.; the HTTP/2 server can't push while the web app. processes the request, because Link: … only arrives with the 200 OK]

1xx Early Metadata

■  send Link: rel=preload as interim response
  ⁃  application sends 1xx then processes the request
■  supported in H2O 2.1
■  might propose for standardization in IETF

GET / HTTP/1.1
Host: example.com

HTTP/1.1 1xx Early Metadata
Link: </style.css>; rel=preload

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!DOCTYPE HTML>...

Sending 1xx from Rack

■  in case of Unicorn:

Proc.new do |env|
  # write the interim response directly to the client socket
  env["unicorn.socket"].write(
    "HTTP/1.1 1xx Early Metadata\r\n" +
    "Link: </style.js>; rel=preload\r\n" +
    "\r\n")
  # time-consuming operation ...
  [ 200, [ ... ], [ ... ] ]
end

...we need to define the formal API

Conclusion

■  the Web has become faster with HTTP/2
■  HTTP/2 becomes as fast as the limits of TCP/IP allow with:
  ⁃  optimizing TCP for responsiveness
  ⁃  Cache Digest
  ⁃  1xx Early Metadata
■  Q. Can it be made faster than the limits of TCP/IP?
■  A. Yes!
  ⁃  shorten the RTT!
    •  CDNs' approach
  ⁃  make DNS query part of TLS handshake
    •  was part of TLS 1.3 draft (removed as too premature)
  ⁃  fairness isn't an issue for a private network!
    •  TCP optimizer for mobile carriers
