Empirical Quantification of Opportunities for Content Adaptation
in Web Servers
Michael Gopshtein and Dror Feitelson
School of Engineering and Computer Science
The Hebrew University of Jerusalem
Supported by a grant from the Israel Internet Association
Capacity Planning
• The problem:
  – Required capacity for flash crowds cannot be anticipated in advance
  – Even capacity for daily fluctuations is highly wasteful
• Academic solution: use admission control
• Business practice: unacceptable to reject any clients
  – Especially during a surge in traffic
Content Adaptation
• Trade off quality for throughput
  – Installed capacity matches normal load
  – Handle abnormal load by reducing quality
  – But still manage to provide meaningful service to all clients
• Assumes normal optimizations have been made already
  – Compress or combine images, promote caching, …
  – Empirically this usually is not the case
Content Adaptation
• Maintain the invariant:
  $\text{rate of requests} \times \text{cost per request} \le \text{capacity}$
• Need to change quality (and cost!) of content
  – Prepare multiple versions in advance
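A minimal sketch of how prepared versions could be selected so that rate of requests × cost per request stays within capacity. The version names, cost figures, and the `pick_version` helper are hypothetical and chosen only for illustration; the slides do not specify a selection policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    name: str                # hypothetical label for a prepared version
    cost_per_request: float  # server cost of one request, arbitrary units


def pick_version(versions, request_rate, capacity):
    """Return the costliest (assumed highest-quality) version for which
    request_rate * cost_per_request <= capacity holds.
    Falls back to the cheapest version if none fits."""
    by_cost = sorted(versions, key=lambda v: v.cost_per_request, reverse=True)
    for v in by_cost:
        if request_rate * v.cost_per_request <= capacity:
            return v
    return by_cost[-1]


# Illustrative numbers only, not measurements.
VERSIONS = [
    Version("full", 10.0),
    Version("no-decorations", 6.0),
    Version("text-only", 2.0),
]

for rate in (50, 150, 400, 1000):
    print(rate, pick_version(VERSIONS, rate, capacity=1000.0).name)
```

The fallback reflects the premise that degraded service is preferred over rejecting clients, even when the invariant can no longer be met.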
The Questions
• What are the main costs in web service?
  – Bottleneck is CPU / network / disk?
  – What do we gain by eliminating HTTP requests?
  – What do we gain by reducing file sizes?
• What can realistically be done?
  – What is the structure of a “random” site?
  – How much can we reduce quality?
Assumption: static web pages only
Measuring Random Web Sites
• http://en.wikipedia.org/wiki/Special:Random
• Use title of page as input to Google search
• Extract domain of first link to get home page
• Retrieve it using IE
• Collect statistical data by intercepting system calls to send and receive
Retrieved Component Sizes
This is only 0.02% of the components
A ¼ of the total data comes from components larger than 200 KB
Network Bandwidth
• Typical Ethernet packets are 1526 bytes
  – Ethernet and TCP/IP headers require 54 bytes
  – HTTP response headers require 280-325 bytes
• Most components fit into few packets
  – 43% fit into a single packet
  – 24% more fit into 2 packets
Save bandwidth by reducing the number of small components or the size of large components
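The packet arithmetic can be sketched as follows, using the figures quoted on this slide (1526-byte packets, 54 bytes of Ethernet and TCP/IP headers). The default of 300 bytes for the HTTP response header is an assumed value inside the quoted 280-325 range, and the model ignores connection setup and acknowledgement packets.

```python
import math

PACKET_BYTES = 1526       # typical Ethernet packet size (from the slide)
LOWER_HEADER_BYTES = 54   # Ethernet + TCP/IP headers (from the slide)
PAYLOAD_BYTES = PACKET_BYTES - LOWER_HEADER_BYTES  # 1472 bytes per packet


def packets_needed(component_bytes, http_header_bytes=300):
    """Data packets needed to send one component plus its HTTP response
    header. http_header_bytes=300 is an assumption within 280-325."""
    total = component_bytes + http_header_bytes
    return max(1, math.ceil(total / PAYLOAD_BYTES))


for size in (1000, 1173, 2500, 200 * 1024):
    print(size, packets_needed(size))
```

Under these assumptions a component of up to 1172 bytes fits into a single packet, so shaving bytes off an already small component saves nothing; only removing the component or shrinking a large one reduces traffic.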
Locality and Caching
• Flash crowds typically involve a very small number of pages (possibly the home page)
• Servers allocate GB of memory for cache
• This is enough for thousands of files
Disk is not expected to be a bottleneck
CPU Overhead
• CPU usage reflects several activities
  – Opening TCP connection
  – Processing request
  – Sending data
• Measure using combinatorial microbenchmarks
  – Open connection only
  – One extremely large file
  – Many small files
  – Many requests for non-existent file
CPU Overhead
Example: single 10 KB file
  – Establishing connection: 25%
  – Processing request: 72%
  – Data transfer: 3%
• Equal processing and transfer at 240 KB
  – Only 0.3% of files are so big
If CPU is the bottleneck, need to reduce the number of requests
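The 240 KB figure can be checked against the 10 KB breakdown with a simple cost model. The sketch assumes that connection and request processing costs are fixed per request and that only the transfer cost grows linearly with file size; that linearity is an assumption, not something stated on the slide.

```python
# CPU cost shares for a single 10 KB file (from the slide), in percent.
CONNECT = 25
PROCESS = 72
TRANSFER_10KB = 3
FILE_KB = 10


def cpu_cost(size_kb):
    """Relative CPU cost of serving one file of size_kb kilobytes,
    assuming only the transfer part scales (linearly) with size."""
    return CONNECT + PROCESS + TRANSFER_10KB * size_kb / FILE_KB


def break_even_kb():
    """File size at which transfer cost equals request processing cost."""
    return PROCESS * FILE_KB / TRANSFER_10KB


print(break_even_kb())                  # 240.0
print(10 * cpu_cost(10), cpu_cost(100)) # 1000.0 127.0
```

Under this model, ten separate 10 KB files cost about eight times as much CPU as one 100 KB file, which is why the number of requests matters more than the number of bytes.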
Guidelines
• Either the CPU or the network is the bottleneck
• Network bandwidth saved by reducing large components
• CPU saved by eliminating small components
• Maintaining “acceptable” quality is subjective
Eliminating Images
• Images have many functions
  – Story (main illustrative item)
  – Preview (for other page)
  – Commercial
  – Logo
  – Decoration (bullets, background)
  – Navigation (buttons, menus)
  – Text (special formatting)
• Some can be eliminated or replaced
Distribution of Types
• Manually classified 959 images from 30 random sites
• 50% decoration
• 18% preview
• 11% commercial
• 6% logo
• 6% text
Automatic Identification
• Decorations are candidates for elimination
• Identified by combination of attributes:
  – Use GIF format
  – Appear in HTML tags other than <IMG>
  – Appear multiple times in same page
  – Small original size
  – Displayed size much bigger than original
  – Large change in aspect ratio when displayed
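The attribute list above could be combined into a simple scoring heuristic, sketched below. All thresholds (1 KB, 4x area, 2x aspect change, 3 signals) and the `ImageInfo` fields are hypothetical; the slides do not give concrete values or say how the attributes are weighted.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
SMALL_BYTES = 1024      # "small original size"
AREA_FACTOR = 4.0       # "displayed size much bigger than original"
ASPECT_FACTOR = 2.0     # "large change in aspect ratio"
MIN_SIGNALS = 3         # signals required to call an image a decoration


@dataclass
class ImageInfo:
    fmt: str            # file format, e.g. "gif" or "jpeg"
    in_img_tag: bool    # referenced from an <IMG> tag?
    occurrences: int    # times the image appears in the page
    size_bytes: int     # original file size
    orig_w: int         # original width and height in pixels (> 0)
    orig_h: int
    shown_w: int        # displayed width and height in pixels (> 0)
    shown_h: int


def decoration_signals(img):
    """Count how many of the six listed attributes the image exhibits."""
    aspect_change = (img.shown_w / img.shown_h) / (img.orig_w / img.orig_h)
    signals = [
        img.fmt.lower() == "gif",
        not img.in_img_tag,
        img.occurrences > 1,
        img.size_bytes < SMALL_BYTES,
        img.shown_w * img.shown_h >= AREA_FACTOR * img.orig_w * img.orig_h,
        aspect_change >= ASPECT_FACTOR or aspect_change <= 1 / ASPECT_FACTOR,
    ]
    return sum(signals)


def is_decoration(img):
    return decoration_signals(img) >= MIN_SIGNALS


# A 1x1 GIF stretched into a 200x1 separator line vs. a story photo.
spacer = ImageInfo("gif", False, 12, 300, 1, 1, 200, 1)
photo = ImageInfo("jpeg", True, 1, 80000, 640, 480, 640, 480)
print(is_decoration(spacer), is_decoration(photo))
```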
Auxiliary Files
• JavaScript
  – May be crucial for page function
  – Impossible to understand automatically
• CSS (style sheets)
  – May be crucial for page structure
  – May be possible to identify those parts that are used
Auxiliary Files
• Cannot be eliminated
• Common wisdom: use separate files
  – Allow caching at client
  – Save retransmission with each page
• Alternative: embed in HTML
  – Reduce number of requests
  – May be better for flash crowds that do not request multiple pages
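Embedding auxiliary files in the HTML could look roughly like the sketch below. The `inline_assets` helper and the `files` mapping are hypothetical, and the regular expressions only match the simplest tag forms (`rel` before `href`, double-quoted attributes); a production version would need a real HTML parser.

```python
import re

LINK_RE = re.compile(r'<link\s+rel="stylesheet"\s+href="([^"]+)"\s*/?>')
SCRIPT_RE = re.compile(r'<script\s+src="([^"]+)"\s*>\s*</script>')


def inline_assets(html, files):
    """Replace references to external CSS and JavaScript with their
    contents. `files` maps a URL as written in the page to file text;
    references to unknown files are left unchanged."""

    def inline_css(match):
        body = files.get(match.group(1))
        return match.group(0) if body is None else f"<style>{body}</style>"

    def inline_js(match):
        body = files.get(match.group(1))
        return match.group(0) if body is None else f"<script>{body}</script>"

    html = LINK_RE.sub(inline_css, html)
    return SCRIPT_RE.sub(inline_js, html)


page = ('<head><link rel="stylesheet" href="a.css">'
        '<script src="b.js"></script></head>')
files = {"a.css": "p{margin:0}", "b.js": "var x=1;"}
print(inline_assets(page, files))
```

This turns three requests (page, style sheet, script) into one, at the price of resending the auxiliary content with every page.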
Text and HTML
• Some areas may be eliminated under extreme conditions
  – Commercials
  – Some previews and navigation options
• Often encapsulated in <DIV> tags
• Sometimes identified by ID or class names, e.g. “sidebanner”
  – Especially when using modular design
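A minimal sketch of removing such blocks, using Python's standard `html.parser`. The `BlockStripper` class and `strip_blocks` helper are hypothetical names; the sketch drops every <DIV> whose ID or class matches a given name, together with everything nested inside it, and for brevity does not preserve comments or doctype declarations.

```python
from html.parser import HTMLParser


class BlockStripper(HTMLParser):
    """Re-emit HTML while skipping <div> blocks with a listed id/class."""

    def __init__(self, drop_names):
        super().__init__(convert_charrefs=False)
        self.drop_names = set(drop_names)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped <div>

    def _is_dropped(self, attrs):
        attr = dict(attrs)
        names = set((attr.get("class") or "").split())
        if attr.get("id"):
            names.add(attr["id"])
        return bool(names & self.drop_names)

    def handle_starttag(self, tag, attrs):
        if tag == "div" and (self.skip_depth or self._is_dropped(attrs)):
            self.skip_depth += 1  # track nesting so the right </div> ends it
        elif not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_blocks(html, drop_names):
    parser = BlockStripper(drop_names)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = ('<div id="main"><p>News &amp; more</p>'
        '<div class="sidebanner ad"><div>Buy</div><img src="x.gif"></div>'
        '</div>')
print(strip_blocks(page, {"sidebanner"}))
```

Dropping the block also removes the image requests it would have triggered, so hiding one <DIV> can save several requests.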
Content Adaptation
• Degraded content usually better than exclusion
• Only way to handle flash crowds that overwhelm installed capacity
• Empirical results identify main options
  – Identify and eliminate decorations
  – Compress large images (story, commercial)
  – Embed JavaScript and CSS
  – Hide unnecessary blocks