Web Measurement: Chapter 7, Section 7.3
Hessam Mirsadeghi
ECE Department, University of Tehran
Fall 2009
Outline
Web measurement motivation
Properties of interest
Challenges of web measurement
Web measurement tools
State of the art: web properties, web traffic data gathering and analysis, web performance, web applications
2Web MeasurementUniversity of Tehran
Motivation
The single most popular Internet application. Measurement can be very useful.
The single largest application studied in Internet measurement
About 75% of Internet traffic during the web's first decade of existence
Around a billion web users
Web Measurement Properties
The web is the level most visible to users
Some properties can be decomposed into components at other layers of the protocol stack
Web latency: DNS, TCP, HTTP, web server delay, client-side rendering
High-Level Characterization
Measuring the fraction of traffic that is web traffic
Measuring the use of the HTTP protocol
A considerable amount of traffic runs over HTTP even when the clients and servers are P2P nodes
Here we consider only web traffic in which web clients communicate with a web server
High-Level Characterization (cont’d)
Knowledge of the entities involved in web transactions: clients, proxies, servers
Measuring the count and growth of web entities
Provides insight into how the web has evolved and how it is being used
e.g. the number of clients behind a proxy gives insight into the extent of caching
Location
Identifying where clients and proxies are present can help content providers move resources closer to them
Location data can help businesses tailor content and the manner of delivery, and consider architectural alternatives for the placement of services.
Network and physical location
Configuration
Different server configurations impact performance
Client and proxy configurations: protocol variants supported, compliance with the protocol specification, client connectivity
User Workload Models
How resources are accessed within a web site: informs reconfiguring the web site, modifying the resources, and choosing alternatives for delivery of popular resources
Constructing models for “think-time” of users
Helps in dealing with new classes of users
Modeling novel phenomena such as flash crowds and attacks
Traffic Properties
Reducing redundant transfers and sudden surges by caching resources
Cacheability of resources, deployment and use of caches, performance of caches
Handling circumstances like flash crowds
Application Demands
Better understanding of the interaction between application-level and transport-level protocols
Improvements in the protocols
Reducing time-to-glass
The actual flow of a web transaction, from the user's click to the display of data
Web Performance
Dominates much of the web measurement work
The popularity of a web site is highly dependent on its performance
Finding ways to reduce delays
Sources of slowdowns
Challenges to Measurement
Application-level nature
Dependence on multiple protocols: DNS, TCP, HTTP
Large sets of entities with varying configurations
Equally diverse user population
Challenges to Measurement (cont’d)
Hidden data
Hidden layers
Hidden entities
Hidden Data
Much of the traffic is intranet traffic and is inaccessible.
Access to remote server data, even old logs, is often unavailable.
From the server end, information about the clients (e.g. connection bandwidth) is obscured.
New pages are constantly added, and old ones are removed or modified.
Hidden Data (cont’d)
Access information for web pages is not accessible.
TCP configuration parameters significantly impact performance and cannot be remotely ascertained.
Tools like TBIT help test the impact of TCP variants such as Reno, Tahoe, or Vegas.
Hidden Layers
Protocol and network layers are harder to measure.
This requires both deep knowledge of each network protocol and an understanding of the precise interactions between the different protocols.
The number of end clients is unknown because of proxies.
Requests may be redirected to different servers at different layers of the protocol stack. Redirection can happen at the DNS, TCP, or HTTP level.
Hidden Layers (cont’d)
[Figure: a single page, Index.html, whose text comes from the server while its embedded images (Foo1.jpg, Foo2.jpg, Foo3.jpg) and ads (ad1, ad2, ad3) are fetched from CDN Server 1, CDN Server 2, and Ad Servers 1 to 3.]
Hidden Entities
Proxies, HTTP and TCP redirectors
Transparent interception proxies return results from a cache.
Switches behave differently for web-related and non-web-related traffic.
Lack of predictability due to multiple hidden entities at various layers of the protocol stack.
Tools: Estimation of Web Traffic
Since the start of the 21st century, peer-to-peer traffic has taken the lead in terms of number of bytes
The web remains the number one application in terms of active users
Almost 1 billion Internet users, a vast majority of whom use the web
Tools: Sampling & DNS
Netflow: traffic to the HTTP port (80)
DNS traces show which IP addresses are looked up
Well-known web servers are likely to rank high
Tools: Server Logs
The number of requests and clients is logged in web server logs
Web log analyzers generate statistics
Presence of obscured data:
Proxies: inter-arrival time of requests, range and diversity of resources requested
Crawlers and spiders: a disproportionate number of requests from one or a few IP addresses
Anonymizers
Caches
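The crawler and spider signature mentioned on this slide (a disproportionate share of requests from one or a few IP addresses) can be sketched as a simple count over the client field of a log. This is only an illustration: the function name heavy_hitters, the 20% share threshold, and the sample addresses are assumptions, not values from the slides.

```python
from collections import Counter

def heavy_hitters(client_ips, share_threshold=0.2):
    """Return {ip: share} for IPs issuing at least `share_threshold` of all requests."""
    counts = Counter(client_ips)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ip: n / total for ip, n in counts.items()
            if n / total >= share_threshold}

# Hypothetical sample: one client IP per logged request.
sample = ["10.0.0.1"] * 8 + ["10.0.0.2", "10.0.0.3"]
suspects = heavy_hitters(sample)
```

An address flagged this way may equally be a proxy or a cache fronting many users, which is why the slide lists those entities separately.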
Tools: Surveys
Estimating the number of web servers (Netcraft)
Important metric: number and identity of popular web servers
Business, technical, and social implications
Tools: Locating Entities
An increasingly difficult problem
Server resources are distributed geographically
Large number of resources
Increase availability
Being closer to clients
Several businesses can use the same server farm to increase utilization.
Locating clients: simple ‘traceroute’, or techniques such as network-aware clustering
Tools: Structural View
The linkage structure of web pages
HITS algorithm for identifying hubs and authorities
Hub: a page having multiple high-value links about a topic
Authority: the page having high-quality content on a given topic
Web pages as nodes and links as edges in a graph model
Used for page ranking and for improving web search
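The hub and authority idea can be sketched as a mutual-reinforcement iteration over the link graph: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The following is a minimal sketch of that iteration, not a reference implementation of HITS; the graph, the page names, and the fixed iteration count are assumptions for illustration.

```python
def hits(links, iterations=50):
    """links: dict page -> list of pages it links to. Returns (hub, authority) dicts."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of q: sum of hub scores of the pages that link to q.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub of p: sum of authority scores of the pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical four-page graph.
graph = {"portal": ["a", "b"], "blog": ["a"], "a": [], "b": []}
hub, auth = hits(graph)
```

In this toy graph "portal" ends up as the strongest hub and "a" as the strongest authority, since "a" is linked from both hub-like pages.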
Tools: Web Searching & Crawling
One of the most important WWW applications
Components:
Crawler: traverses the accessible part of the web to fetch web pages
Indexer: indexes the crawled pages
Search tool: accepts queries and returns pointers to the matching pages
Tools: Web Performance
Measuring a particular web site’s latency and availability from diverse client perspectives.
Examining different latency components such as DNS, TCP or HTTP differences, and CDNs.
Global measurements of the web to examine protocol compliance and ensure reduction of outages.
Tools: Web Performance (cont’d)
A variety of companies offer such services:
Keynote, Akamai, eValid Test Suite, etc.
A common technique: a distributed set of monitors around the world sends periodic requests to web sites.
Tools: Network Aware Clustering
An effective technique to group IP addresses into clusters quickly and automatically
Non-overlapping clusters
Members are topologically close
Members are under common administrative control
Clustering uses BGP routing table snapshots and longest prefix matching.
Same prefix → same cluster
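The rule "same prefix → same cluster" can be sketched with Python's standard ipaddress module: each client address is assigned to the longest routing prefix that contains it. The three prefixes below are made up and merely stand in for a real BGP routing table snapshot; the function names are likewise assumptions.

```python
import ipaddress

def cluster_of(ip, prefixes):
    """Return the longest prefix in `prefixes` that contains `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net in prefixes:
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best

def cluster_clients(ips, prefix_strings):
    """Group client IPs by their longest matching prefix."""
    prefixes = [ipaddress.ip_network(p) for p in prefix_strings]
    clusters = {}
    for ip in ips:
        key = cluster_of(ip, prefixes)
        name = str(key) if key is not None else "unclustered"
        clusters.setdefault(name, []).append(ip)
    return clusters

# Hypothetical prefixes standing in for a BGP routing table snapshot.
table = ["128.4.0.0/16", "128.4.30.0/24", "145.155.0.0/16"]
clients = ["128.4.30.15", "128.4.4.12", "145.155.10.15", "8.8.8.8"]
clusters = cluster_clients(clients, table)
```

A linear scan over the prefixes is enough for a sketch; with a full routing table one would normally use a trie so that each lookup does not touch every prefix.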
Tools: Network Aware Clustering (cont’d)
[Figure: BGP routing table snapshot]
Tools: Network Aware Clustering (cont’d)
Applications:
Used to group client IP addresses in web server logs
Recognizing proxies and spiders
Better content access prediction
etc.
Tools: Network Aware Clustering (cont’d)
[Figure: clusters in the total server log, highlighting a cluster containing a spider and a cluster containing a proxy]
Tools: Handling Mobile Clients (cont’d)
Figure 3. Document Browsing with Summarizer on WAP
Christopher C. Yang and Fu Lee Wang. Fractal Summarization for Mobile Devices to Access Large Documents on the Web. In Proceedings of the World Wide Web Conference, May 2003.
Tools: Handling Mobile Clients (cont’d)
Continuous growth of the mobile web
Wireless network delays
Tailored content
Similar methods:
Server logs of mobile content providers
Lab experiments (e.g. emulating mobile devices, inducing packet loss)
Wide-area experiments
State of the Art
Web Properties: High Level
Reduction in estimated web traffic
Unreachable data: firewalls and other barriers due to attacks, use of internal web sites
The shift from web to P2P
Around a million new sites a month (Netcraft)
Web Properties: High Level (cont’d)
60 million web sites in fall 2004
A vast fraction have little or no traffic compared to the top few hundred.
Apache and Microsoft server implementations together have 90% of the market (68% for Apache)
Web Properties: High Level (cont’d)
[Figures: two charts from the Netcraft survey (news.netcraft.com)]
Web Properties: Location
A steadily growing number of users are in Asian countries such as China and India.
The fraction of web content from the US and Europe is falling.
This has implications for where servers are mirrored and which languages are supported.
Web Properties: Configuration
Popular sites use a variety of techniques to improve server performance:
Distributing servers geographically (e.g. 3 World Cup servers in the U.S., 1 in France)
Redirecting requests to the least loaded server in a farm
Caching frequently requested resources
Web Properties: User Workload Models
We measure user workload by looking at:
the duration of HTTP connections
request and response sizes
the number of unique IP addresses contacting a given web site
the number and frequency of accesses to individual resources at a given web site
etc.
Web Properties: Access Dynamics
Web page access has been experimentally verified to follow a Zipf-like distribution.
Zipf’s law: the probability of a request to the i-th most popular page is proportional to 1/i
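To get a feel for how skewed a Zipf-like popularity distribution is, the proportionality p(i) ∝ 1/i can be normalized over a finite set of pages. The sketch below does that; the exponent parameter alpha (1.0 gives plain Zipf) and the choice of 1000 pages are assumptions made for illustration.

```python
def zipf_probabilities(n, alpha=1.0):
    """Request probability for ranks 1..n under p(i) proportional to 1 / i**alpha."""
    weights = [1.0 / (i ** alpha) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical site with 1000 pages.
probs = zipf_probabilities(1000)
top_10_share = sum(probs[:10])
```

With these numbers the ten most popular pages out of 1000 attract roughly 39% of all requests, which is why caching a small set of popular resources can pay off.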
State of the Art
Web Traffic: Critical Path Analysis
Constructing the critical path to understand where delays are introduced in web requests:
Packet propagation
Network variation (e.g. queuing at routers)
Packet loss
Delay at server and client
Web Traffic: Critical Path Analysis (cont’d)
Only some of the components are responsible for the overall response time
Importance of activities on the critical path
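One way to see why only some components determine the response time is to model a page load as activities with dependencies and find the longest dependency chain. The sketch below does this for an acyclic dependency graph; the activity names and millisecond timings are invented for illustration and do not come from any measurement.

```python
def critical_path(durations, depends_on):
    """Return (total_time, path) for the longest dependency chain.

    durations: dict activity -> time
    depends_on: dict activity -> list of prerequisite activities (must be acyclic)
    """
    finish = {}   # earliest finish time of each activity
    prev = {}     # predecessor of each activity on its longest chain

    def visit(act):
        if act in finish:
            return finish[act]
        start, best = 0.0, None
        for dep in depends_on.get(act, ()):
            t = visit(dep)
            if t > start:
                start, best = t, dep
        finish[act] = start + durations[act]
        prev[act] = best
        return finish[act]

    for act in durations:
        visit(act)
    end = max(finish, key=finish.get)
    path = [end]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return finish[end], path[::-1]

# Hypothetical timings in milliseconds for one page load.
durations = {"dns": 40, "tcp": 60, "html": 120, "image": 80, "ad": 30, "render": 50}
depends_on = {"tcp": ["dns"], "html": ["tcp"], "image": ["html"],
              "ad": ["html"], "render": ["image", "ad"]}
total, path = critical_path(durations, depends_on)
```

Here the ad fetch is off the critical path: speeding it up would not change the total, whereas any saving on DNS, TCP setup, the HTML transfer, the image, or rendering would.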
Web Traffic: Software Aid
httperf: sends HTTP requests and processes responses, simulates workload, gathers statistics, supports HTTP/1.1, freely available as source code
Web Traffic: Software Aid (cont’d)
wget: fetches a large number of pages rooted at a particular node; can fetch all pages up to a certain “level” of link depth
Mercator (a personalized crawler): starts from a seed page and then does a breadth-first search on the links to find pages; gives higher weight to pages with more incoming links
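The two ideas on this slide, a breadth-first search from a seed page and a limit on link "level", can be sketched over an in-memory link graph. This is not how wget or Mercator are implemented; the dictionary below stands in for real page fetches, and the weighting by incoming links is left out.

```python
from collections import deque

def crawl(seed, links, max_level):
    """Breadth-first traversal from `seed`, following links up to `max_level` hops."""
    seen = {seed: 0}            # page -> level at which it was first reached
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if seen[page] == max_level:
            continue            # do not follow links past the level limit
        for target in links.get(page, ()):
            if target not in seen:
                seen[target] = seen[page] + 1
                queue.append(target)
    return seen

# Hypothetical link graph standing in for fetched pages.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": ["/d"]}
reached = crawl("/", site, max_level=2)
```

The `seen` map doubles as the visited set, so a page linked from several places (such as "/c") is fetched only once, at the shallowest level that reaches it.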
State of the Art
Web Performance: Intro
User-perceived latency is a key factor because it affects the popularity of a site.
Beyond a certain delay, user cancellations of page loads increase sharply.
Web Performance: CDNs
Busy servers outsource delivery of some of their pages
CDNs combine the workload of several sites into a single provider.
CDN servers are mirrored so that they are located near clients.
DNS-based redirection
DNS overhead is a serious bottleneck in some CDNs
Web Performance: CDNs (cont’d)
• Motivation:
• More hops between client and web server => more congestion!
• The same data flows repeatedly over the links between clients and web server
[Figure: server S and clients C1 to C4 connected through IP routers]
Web Performance: CDNs (cont’d): Caches
[Figure: the user (merlot.cis.udel.edu) and 1,000,000 other hosts send requests for www.cnn.com through caching proxies at the ISP; the web server holds new content (WTC news) while old content is also shown; congestion/bottleneck points are marked.]
Web Performance: CDNs (cont’d)
• Caching problems:
• Caching proxies serve only their own clients, not all users on the Internet
• Content providers (say, web servers) cannot rely on the existence and correct implementation of caching proxies
• Accounting issues with caching proxies: for instance, www.cnn.com needs to know the number of hits to a web page for the advertisements displayed on it
Web Performance: CDNs (cont’d)
[Figure: the user (merlot.cis.udel.edu) and 1,000,000 other users request new content (WTC news) from www.cnn.com; the content reaches mirrors through a distribution infrastructure; mirror locations shown: FL, IL, DE, NY, MA, MI, CA, WA.]
Web Performance: CDNs (cont’d)
• An overlay network distributes content from origin servers to users
• Avoids large amounts of the same data repeatedly traversing potentially congested links on the Internet
• Reduces web server load
• Reduces user-perceived latency
DNS-based Request Routing
[Figure: the client merlot.cis.udel.edu (128.4.30.15) asks its local DNS server louie.udel.edu (128.4.4.12) for www.cnn.com; the local server sends the DNS query to the Akamai DNS, and the DNS response is A 145.155.10.15, one of two Akamai CDN surrogates (145.155.10.15 and 58.15.100.152; names shown: delaware.cnn.akamai.com and california.cnn.akamai.com). The client then opens its session with that surrogate.]
Q: How does the Akamai DNS know which surrogate is closest?
DNS-Based Request Routing (cont’d)
[Figure: the client merlot.cis.udel.edu (128.4.30.15), its local DNS server louie.udel.edu (128.4.4.12), the Akamai DNS, and two Akamai CDN surrogates for www.cnn.com. Each surrogate measures to the client's DNS server and sends its measurement results to the Akamai DNS, which answers the DNS query; the client then opens its session with a surrogate.]
DNS-Based Redirection
Problem: the choice of content server is optimized for the local name server, not the actual client
Client may be far from name server
In a study, only 16% of the clients were in the same network-aware cluster as the local DNS server
Total & Selective Redirection
1. Total redirection
Any request for the origin server is redirected to the CDN
Basically, the CDN takes control of the content provider’s DNS zone
Benefit: all requests are automatically redirected
Disadvantage: may send lots of traffic to the CDN, hence expensive for the content provider
2. Selective redirection
The content provider marks which objects are to be served from the CDN
Typically, larger objects like images are selected
Refer to images as: <img src=http://cdn.com/foo/bar/img.gif>
Pro: fine-grained control over what gets delivered
Con: content has to be marked (manually) for the CDN
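The manual marking step in selective redirection could in principle be scripted. The sketch below rewrites relative image references so that they point at the CDN prefix from the slide's example URL. It is only an illustration: the function name mark_for_cdn and the list of extensions are assumptions, and a regular expression is a crude stand-in for a real HTML parser (it handles only double-quoted src attributes).

```python
import re

CDN_PREFIX = "http://cdn.com/foo/bar/"   # prefix taken from the slide's example URL

def mark_for_cdn(html, extensions=(".gif", ".jpg", ".png")):
    """Rewrite relative <img> sources with the given extensions to point at the CDN."""
    def rewrite(match):
        src = match.group(2)
        if "://" in src or not src.lower().endswith(extensions):
            return match.group(0)          # leave absolute or unselected URLs alone
        return f'{match.group(1)}{CDN_PREFIX}{src.lstrip("/")}{match.group(3)}'
    return re.sub(r'(<img\s[^>]*?src=")([^"]+)(")', rewrite, html,
                  flags=re.IGNORECASE)

# Hypothetical page fragment.
page = ('<p>News</p><img src="img.gif">'
        '<img src="http://other.example/x.gif"><a href="story.html">more</a>')
rewritten = mark_for_cdn(page)
```

Only the first image is rewritten; the absolute URL and the ordinary link are left alone, which mirrors the fine-grained control the slide lists as the advantage of this approach.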
Total Redirection
[Figure: the client sends GET index.html and GET image1.gif, image2.gif to the CDN surrogate server, which returns index.html, image1.gif, and image2.gif; the origin server holds index.html with embedded image1.gif and image2.gif.]
Partial Redirection
[Figure: the client sends GET index.html to the origin server and GET image1.gif, image2.gif to the CDN surrogate server, which returns image1.gif and image2.gif, the images embedded in index.html.]
Total vs. Selective Redirection
Total redirection has clearly superior performance
Selective redirection is typically slower than downloading everything from the origin server
But the origin server might be loaded…
Which redirection is used more?
Initially, selective redirection was used
These days, mainly total redirection
Web Performance: Client Connectivity
Finding clients’ connection quality
Delivering the most suitable version of content: sending just the base document, using compression
Tailoring the server’s policy: keeping persistent connections open longer
Measure the inter-arrival time of requests to classify clients.
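As a rough illustration of classifying clients from request inter-arrival times, the sketch below labels a client by the median gap between its consecutive requests. It assumes that longer gaps indicate poorer connectivity; the one-second cutoff, the labels, and the function name are invented for the example and are not values from the study cited in these slides.

```python
from statistics import median

def classify_client(request_times, poor_threshold=1.0):
    """Label a client from the timestamps (in seconds) of its requests.

    `poor_threshold` is an illustrative cutoff, not a measured value.
    """
    if len(request_times) < 2:
        return "unknown"
    times = sorted(request_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return "poor" if median(gaps) >= poor_threshold else "good"

# Hypothetical request timestamps for two clients.
fast = classify_client([0.00, 0.05, 0.11, 0.18])
slow = classify_client([0.0, 1.4, 3.1, 4.5])
```

Using the median rather than the mean keeps one long pause (for example, user think-time) from dominating the label.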
Web Performance: Client Connectivity (cont’d)
Stability of client classification
Classifying new clients using network-aware clustering: same cluster → same class
Classification works best for sites with a variety of clients.
Web Performance: Client Connectivity (cont’d)
Balachander Krishnamurthy, Craig E. Wills, Yin Zhang, and Kashi Vishwanath. Design, Implementation, and Evaluation of a Client Characterization Driven Web Server. In Proceedings of the World Wide Web Conference, May 2003.
Server action conclusions:
- Compression gave consistently good results for poorly connected clients, but not for well-connected ones.
- Reducing the quality of objects yielded benefits only for a modem client.
- Bundling was effective when there was good connectivity, or poor connectivity with large latency.
- Persistent connections with serialized requests did not show significant improvement.
- Pipelining was significant only for clients with high throughput or high RTT.
Web Performance: Protocol Compliance
A 16-month study used the httperf tool to test for HTTP protocol compliance.
Absence of required headers (such as date)
Nearly half the servers did not implement range requests.
Inability to handle long URIs in a graceful manner.
The popular Apache server was most compliant, then Microsoft’s IIS.
State of the Art
Web Applications: Searching
In 1999, 200 million pages and 1.5 billion links were examined.
The probability of a node having in-degree i is proportional to 1/i^x (x > 1).
Nodes with a large in-degree are considered “high rank”
Used frequently in search engines
Sites may use fake linkages to trick crawlers.
Web Applications: Searching (cont’d)
The web structure separates into four parts:
A central core
Two parts connected to the core
One part with no connection to the core
All the components have a roughly equal number of pages!
Web Applications: Searching (cont’d)
Over 90% of web pages are reachable from each other when links are treated as undirected.
Following links in their actual direction, the probability of reaching a random page from another is only 0.25.
The well-connected component will remain connected even if we remove nodes with large degrees (hubs).
Web Applications: Searching (cont’d)
Image resources change infrequently.
Many text documents change periodically.
Some studies have tried to model the rate of change of pages as a Poisson process.
Some studies have examined the rate of change in different domains (e.g. .com vs .org).
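If page changes are modeled as a Poisson process with rate λ, the probability that a page has changed at least once within time t is 1 - exp(-λt). The sketch below computes that probability and inverts it to get a revisit interval for a crawler; the function names, the rate of one change per week, and the staleness target are illustrative assumptions.

```python
import math

def prob_changed(rate_per_day, days):
    """P(at least one change within `days`) when changes follow a Poisson process."""
    return 1.0 - math.exp(-rate_per_day * days)

def revisit_interval(rate_per_day, max_stale_prob):
    """Longest interval (days) keeping the chance of a stale copy at `max_stale_prob`."""
    return -math.log(1.0 - max_stale_prob) / rate_per_day

# Hypothetical page that changes on average once a week.
weekly_page = prob_changed(rate_per_day=1 / 7, days=7)
```

For a page that changes on average once a week, the chance that a one-week-old copy is stale is 1 - 1/e, about 63%, so a crawler aiming for fresher copies has to revisit well within the mean change interval.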
Web Applications: Searching (cont’d)
150 web sites were studied over a 7-month period.
The incoming links of the pages were computed.
Rich getting richer!
Pages in the bottom 60% of the ranking received no additional links.
Search engines need to change the way they rank pages.
Web Applications: Searching (cont’d)
A study examined several subsets of pages.
A significant fraction of links were dead, which affects crawling and page ranking.
Over 50% dead links in some cases.
Faster crawling and more useful ranking by avoiding dead links.
Web Applications: Flash Crowds
A large number of legitimate and wanted requests (unlike DoS attacks, in which the requests are not wanted)
During flash crowds:
The average number of requests per client stays the same
There is no increase in the number of client clusters
Between 60% and 82% of the resources are accessed only at this time
Less than 10% of the resources are responsible for 90% of the requests
DoS attackers have no way of knowing the typical distribution of client clusters. In a DoS attack, many new clusters emerge.
Flash Crowd vs DoS Attack
Flash crowd: increase in the number of clients, fixed number of clusters
DoS attack: increase in the number of both clients and clusters
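The distinction above can be turned into a simple heuristic that compares client and client-cluster counts during a surge against a baseline period. This is a sketch of the idea only: the function name, the labels, and the growth factor of 2 used to decide that a count has "increased" are assumptions, and a real detector would need more evidence than two counts.

```python
def classify_surge(base_clients, base_clusters, surge_clients, surge_clusters,
                   growth=2.0):
    """Label a traffic surge from client and client-cluster counts.

    `growth` is an illustrative factor for "increased", not a measured threshold.
    """
    more_clients = surge_clients >= growth * base_clients
    more_clusters = surge_clusters >= growth * base_clusters
    if more_clients and more_clusters:
        return "possible DoS attack"
    if more_clients:
        return "possible flash crowd"
    return "normal"

# Hypothetical counts: baseline of 1000 clients in 200 clusters.
crowd = classify_surge(1000, 200, 9000, 230)
attack = classify_surge(1000, 200, 9000, 2500)
```

The cluster counts would come from network-aware clustering of the client IP addresses seen in the server log.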
Web Applications: Blogs
Providing early warning of flash crowds
A different rate of change compared to traditional web pages
Heavily referenced, much like popular web sites
A significant fraction of links go to other blogs
Significantly more self-references