Web Measurement Chapter 7, Section 7.3 Hessam Mirsadeghi ECE Department, University of Tehran Fall 2009

Web Measurement: Chapter 7, Section 7.3

Hessam Mirsadeghi

ECE Department, University of Tehran

Fall 2009


Outline

Web measurement motivation

Properties of interest

Challenges of web measurement

Web measurement tools

State of the art

- Web properties

- Web traffic data gathering and analysis

- Web performance

- Web applications


Motivation

The single most popular Internet application. Measurement can be very useful.

The single largest application studied in Internet measurement

The web accounted for 75% of Internet traffic in the first decade of its existence

Around a billion web users


Web Measurement Properties

The web is at the level most visible to users

Some of the properties are decomposable into components at other layers of the protocol stack

Web latency:

- DNS, TCP, HTTP

- Web server delay

- Client-side rendering



High-Level Characterization

Measuring the fraction of traffic that is web traffic

Measuring the use of the HTTP protocol

A considerable amount of traffic runs over HTTP even though the clients and servers are P2P nodes

Here we consider web traffic that involves web clients communicating with a web server


High-Level Characterization (cont’d)

Knowledge of entities involved in web transactions

- Clients, proxies, servers

Measuring the count and growth of web entities

- Providing insight on how the web has evolved and is being used

- e.g. the number of clients behind a proxy provides insights on the extent of caching


Location

Identifying where clients and proxies are present can help content providers move resources closer to them

Location data can help businesses tailor content and manner of delivery, and consider alternate architectural improvements in the placement of services.

Network and physical location


Configuration

Different server configurations affect performance

Client and proxy configurations

- Protocol variants supported

- Compliance with protocol specification

- Client connectivity


User Workload Models

How resources are accessed within a web site

- Reconfiguring the web site

- Modifying the resources

- Alternatives for delivery of popular resources

Constructing models for “think-time” of users

Help in dealing with the new classes of users

Modeling novel phenomena such as flash crowds and attacks


Traffic Properties

Reduction of redundant transfers and sudden surges

Caching the resources

- Cacheability of resources, deployment and use of caches, performance of caches

Handling circumstances like flash crowds


Application Demands

Better understanding of the interaction between the application and transport-level protocols

- Improvements in the protocols

- Reducing time-to-glass

The actual flow of a web transaction, from the user click to displaying data


Web Performance

Dominating much of the web measurement work

The popularity of a web site is highly dependent on its performance

Finding ways to reduce delays

Sources of slowdowns


Challenges to Measurement

Application-level nature

Dependence on multiple protocols: DNS, TCP, HTTP

Large sets of entities with varying configurations

Equally diverse user population


Challenges to Measurement (cont’d)

Hidden data

Hidden layers

Hidden entities


Hidden Data

Much of the traffic is intranet traffic and inaccessible.

Access to remote server data, even old logs, is often unavailable.

From the server end, information about the clients (e.g. connection bandwidth) is obscured.

New pages are constantly added, and old ones removed or modified.


Hidden Data (cont’d)

Access information for web pages is not available.

TCP configuration parameters significantly impact performance and cannot be remotely ascertained.

Tools like TBIT test the impact of TCP variants such as Reno, Tahoe, or Vegas.


Hidden Layers

Protocol and network layers are harder to measure.

- Measuring them requires both deep knowledge of each network protocol and an understanding of the precise interactions between the different protocols

The number of end clients is unknown because of proxies.

Requests may be redirected at different layers of the protocol stack to different servers.

- Redirection can happen at the DNS, TCP, or HTTP level.


Hidden Layers (cont’d)

[Figure: a client loads a page made of Index.html (text), the images Foo1.jpg, Foo2.jpg, and Foo3.jpg, and the ads ad1, ad2, and ad3. The labels show the pieces arriving from different machines: the server, CDN Server 1 and CDN Server 2, and Ad Server 1, Ad Server 2, and Ad Server 3.]


Hidden Entities

Proxies, HTTP and TCP redirectors

Transparent interception proxies return results from a cache.

Switches behave differently for web-related and non-web-related traffic.

Lack of predictability due to multiple hidden entities at various layers of the protocol stack.


Tools: Estimation of Web Traffic

Since the start of the 21st century, peer-to-peer traffic has taken the lead in terms of number of bytes

Web still remains the number one application in terms of active users

Almost 1 billion Internet users, a vast majority of whom use the web


Tools: Sampling & DNS

NetFlow: traffic to the HTTP port (80)

DNS traces show which IP addresses are looked up

- Well-known web servers are likely to rank high


Tools: Server Logs

The number of requests and clients is recorded in web server logs

Web log analyzers generate statistics from the logs

Presence of obscured data:

- Proxies: inter-arrival time of requests, range and diversity of resources requested

- Crawlers and spiders: a disproportionate number of requests from one or a few IP addresses

- Anonymizers

- Caches
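The crawler heuristic above can be sketched in a few lines of Python. This is only an illustration: the function name `heavy_hitters`, the 20% share threshold, and the sample addresses are assumptions, not values from the slides.

```python
from collections import Counter

def heavy_hitters(client_ips, share_threshold=0.2):
    """Return {ip: share} for IPs whose share of all requests exceeds the threshold.

    client_ips holds one entry per logged request. The threshold is a
    hypothetical cut-off chosen for illustration only.
    """
    counts = Counter(client_ips)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ip: n / total for ip, n in counts.items() if n / total > share_threshold}

# Hypothetical log: one address issues 8 of the 10 requests.
log_ips = ["10.0.0.1"] * 8 + ["10.0.0.2", "10.0.0.3"]
suspects = heavy_hitters(log_ips)
```

An address flagged this way could be a crawler, but also a proxy in front of many users, so the result is a candidate list rather than a verdict.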


Tools: Surveys

Estimating the number of web servers (Netcraft)

Important metric: number and identity of popular web servers

- Business, technical, and social implications


Tools: Locating Entities

An increasingly difficult problem

Server resources are distributed geographically

Large number of resources

Increase availability

Being closer to clients

Several businesses can use the same server farm to increase utilization.

Locating clients: simple ‘traceroute’, or techniques such as network-aware clustering


Tools: Structural View

The linkage structure on web pages

HITS algorithm for identifying hubs and authorities

Hub: a page having multiple high-value links about a topic

Authority: the page having high-quality content on a given topic

Web pages as nodes and links as edges in a graph model

Page ranking and improvement of web searching
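A minimal sketch of the hub and authority computation, assuming the usual iterative formulation of HITS on a small in-memory link graph. The page names and the fixed iteration count are made up for illustration.

```python
def hits(links, iterations=50):
    """Iteratively score hubs and authorities.

    links maps each page to the list of pages it links to.
    Returns (hub_scores, authority_scores) as dictionaries.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: "portal" links to two pages, "blog" to one.
graph = {"portal": ["a", "b"], "blog": ["a"], "a": [], "b": []}
hub, auth = hits(graph)
```

In this toy graph "portal" ends up as the strongest hub and "a" as the strongest authority, matching the definitions on the slide.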


Tools: Web Searching & Crawling

One of the most important WWW applications

Components:

Crawler: traverses the accessible part of the web to fetch web pages

Indexer: indexes the crawled pages

Search tool: accepts queries and returns pointers to the matching pages


Tools: Web Performance

Measuring a particular web site’s latency and availability from diverse client perspectives.

Examining different latency components such as DNS, TCP or HTTP differences, and CDNs

Global measurements of the web to examine protocol compliance and ensure reduction of outages.


Tools: Web Performance (cont’d)

A variety of companies offer such services:

Keynote, Akamai, eValid Test Suite, etc.

A common technique: a distributed set of monitors around the world sending periodic requests to web sites.


Tools: Network Aware Clustering

An effective technique to group IP addresses into clusters quickly and automatically

Clusters are non-overlapping

Members of a cluster are close topologically

Members of a cluster are under common administrative control

Clustering uses BGP routing table snapshots and longest prefix matching.

Same prefix → same cluster
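The longest-prefix-matching step can be sketched with Python's standard `ipaddress` module. The prefix list below is a hypothetical stand-in for a BGP routing table snapshot, and `build_clusters` is an illustrative name rather than part of any published tool.

```python
import ipaddress

def build_clusters(client_ips, bgp_prefixes):
    """Group client IPs by the longest BGP prefix that contains them.

    Clients matching no prefix are collected under the key None.
    """
    networks = [ipaddress.ip_network(p) for p in bgp_prefixes]
    clusters = {}
    for ip in client_ips:
        addr = ipaddress.ip_address(ip)
        matches = [n for n in networks if addr in n]
        # Longest prefix wins: the most specific route identifies the cluster.
        key = str(max(matches, key=lambda n: n.prefixlen)) if matches else None
        clusters.setdefault(key, []).append(ip)
    return clusters

# Hypothetical prefixes; a real run would load them from BGP table dumps.
prefixes = ["128.4.0.0/16", "128.4.30.0/24", "145.155.0.0/16"]
clients = ["128.4.30.15", "128.4.4.12", "128.4.30.77", "192.0.2.1"]
clusters = build_clusters(clients, prefixes)
```

A linear scan over all prefixes is fine for a sketch; a full routing table would call for a trie or a similar prefix index.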


Tools: Network Aware Clustering (cont’d)

[Figure: BGP routing table snapshot]


Tools: Network Aware Clustering (cont’d)

Applications

Used to group client IP addresses in web server logs

Recognizing proxies and spiders

Better content access prediction

etc.


Tools: Network Aware Clustering (cont’d)

[Figure: panels labeled "Total server log", "Client containing spider", and "Cluster containing proxy"]


Tools: Handling Mobile Clients (cont’d)

Figure 3. Document Browsing with Summarizer on WAP

Christopher C. Yang and Fu Lee Wang. Fractal Summarization for Mobile Devices to Access Large Documents on the Web. In Proceedings of the World Wide Web Conference, May 2003.


Tools: Handling Mobile Clients (cont’d)

Continued growth in the mobile web

Wireless network delays

Tailored content

Similar methods:

Server logs of mobile content providers

Lab experiments (e.g. emulating mobile devices, inducing packet loss)

Wide-area experiments


State of the Art


Web Properties: High Level

Reduction in web traffic estimation

Unreachable data

Firewalls and other barriers due to attacks

Use of internal web sites

The shift from Web to P2P

Around a million new sites a month (Netcraft)


Web Properties: High Level (cont’d)

60 million web sites in fall 2004

- A vast fraction have little or no traffic compared to the top few hundred.

Apache and Microsoft server implementations together have 90% of the market (68% for Apache)


Web Properties: High Level (cont’d)

[Figure: Netcraft survey (news.netcraft.com)]


Web Properties: High Level (cont’d)

[Figure: Netcraft survey (news.netcraft.com)]


Web Properties: Location

A steadily growing number of users are in Asian countries such as China and India.

The fraction of web content from the US and Europe is falling.

This has implications for where servers will be mirrored and which languages are supported.


Web Properties: Configuration

Popular sites use a variety of techniques to improve server performance:

Distributing servers geographically (e.g. 3 World Cup servers in the U.S., 1 in France)

Redirecting requests to the least loaded server in a farm

Caching frequently requested resources


Web Properties: User Workload Models

We measure user workload by looking at:

the duration of HTTP connections

request and response sizes

the number of unique IP addresses contacting a given Web site

the number and frequency of accesses of individual resources at a given Web site

etc.


Web Properties: Access Dynamics

Web page access has been experimentally verified to follow a Zipf-like distribution.

Zipf’s law: the probability of a request to the i-th most popular page is proportional to 1/i
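To make the proportionality concrete, the sketch below normalizes the weights 1/i into probabilities. The exponent `alpha` is an added parameter (an assumption, set to 1 here) to cover "Zipf-like" distributions, and the page count of 1000 is arbitrary.

```python
def zipf_probabilities(n_pages, alpha=1.0):
    """Request probability for pages ranked 1..n_pages under a Zipf-like law.

    The weight of rank i is 1 / i**alpha; alpha = 1 gives classic Zipf.
    """
    weights = [1.0 / (i ** alpha) for i in range(1, n_pages + 1)]
    total = sum(weights)
    return [w / total for w in weights]

probs = zipf_probabilities(1000)
# Share of all requests that go to the ten most popular pages.
top_10_share = sum(probs[:10])
```

The most popular page is requested twice as often as the second one, and a small set of top pages receives a large share of all requests, which is why caching popular resources pays off.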


State of the Art


Web Traffic: Critical Path Analysis

Constructing the critical path to understand where delays are introduced in web requests

Packet propagation

Network variation (e.g. queuing at routers)

Packet loss

Delay at server and client


Web Traffic: Critical Path Analysis (cont’d)

Only some of the components are responsible for the overall response time

Importance of activities on the critical path
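A simplified illustration of the idea: treat a page load as activities with dependencies and find the longest dependency chain. Published critical path analyses work on packet-level traces; the activity names and durations (in milliseconds) below are invented for the example.

```python
def critical_path(durations, depends_on):
    """Return (path, total_time) for the longest dependency chain.

    durations maps activity -> time taken; depends_on maps activity ->
    list of activities that must finish first. The graph must be acyclic.
    """
    finish = {}  # activity -> earliest possible finish time
    prev = {}    # activity -> prerequisite that finishes last (or None)

    def resolve(activity):
        if activity not in finish:
            latest = max(depends_on.get(activity, []), key=resolve, default=None)
            prev[activity] = latest
            start = finish[latest] if latest is not None else 0
            finish[activity] = start + durations[activity]
        return finish[activity]

    end = max(durations, key=resolve)
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    return path, finish[end]

# Hypothetical page load, times in milliseconds.
durations = {"dns": 40, "tcp_connect": 80, "html": 200,
             "image": 120, "ad": 60, "render": 30}
depends_on = {"tcp_connect": ["dns"], "html": ["tcp_connect"],
              "image": ["html"], "ad": ["html"], "render": ["image", "ad"]}
path, total = critical_path(durations, depends_on)
```

Here the ad fetch is off the critical path, so speeding it up would not shorten the response time, while any saving on the image fetch would.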


Web Traffic: Software Aid

httperf:

- Sends HTTP requests and processes responses

- Simulates workload

- Gathers statistics

- Supports HTTP/1.1

- Freely available in source code


Web Traffic: Software Aid (cont’d)

wget

- Fetches a large number of pages rooted at a particular node.

- Can fetch all the pages up to a certain “level” according to links

Mercator (a personalized crawler)

- Uses a seed page and then does breadth-first search on the links to find pages.

- Higher weight for pages having more incoming links.
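Both ideas above, a seed page with breadth-first traversal and a maximum link "level", can be sketched on an in-memory link graph. This is not how wget or Mercator are implemented; `crawl`, `get_links`, and the toy site are hypothetical, and a real crawler would fetch and parse pages over HTTP.

```python
from collections import deque

def crawl(seed, get_links, max_level):
    """Breadth-first traversal from seed, following links up to max_level hops.

    get_links(page) returns the pages that page links to.
    Returns the pages in the order they were visited.
    """
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        page, level = queue.popleft()
        order.append(page)
        if level == max_level:
            continue  # do not follow links beyond the level limit
        for target in get_links(page):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order

# Hypothetical site structure.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": ["/d"], "/d": []}
pages = crawl("/", lambda p: site.get(p, []), max_level=2)
```

The `seen` set keeps the crawler from revisiting pages when links form cycles, such as the link from "/b" back to "/".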


State of the Art


Web Performance: Intro

User-perceived latency is a key factor because it affects the popularity of a site.

Beyond a certain delay, user cancellations of page loads increase sharply.


Web Performance: CDNs

Busy servers outsource delivery of some of their pages

CDNs combine the workload of several sites into a single provider.

CDN servers are mirrored so that they are located near clients.

DNS-based redirection

DNS overhead is a serious bottleneck in some CDNs


Web Performance: CDNs (cont’d)

• Motivation:

• More hops between client and Web server => more congestion!

• Same data flowing repeatedly over links between clients and Web server

[Figure: a Web server S and clients C1, C2, C3, and C4 connected through IP routers]


Web Performance: CDNs (cont’d)

Caches

[Figure: the user merlot.cis.udel.edu and 1,000,000 other hosts send requests through ISP caching proxies toward the Web server www.cnn.com. Labels: new content ("WTC News!"), old content, request, caching proxy, congestion / bottleneck.]


Web Performance: CDNs (cont’d)

• Caching problems:

• Caching proxies serve only their clients, not all users on the Internet

• Content providers (say, Web servers) cannot rely on the existence and correct implementation of caching proxies

• Accounting issues with caching proxies. For instance, www.cnn.com needs to know the number of hits to a webpage for the advertisements displayed on it


Web Performance: CDNs (cont’d)

[Figure: a distribution infrastructure connects the Web server www.cnn.com, which holds new content ("WTC News!"), to mirrors labeled FL, IL, DE, NY, MA, MI, CA, and WA. The user merlot.cis.udel.edu and 1,000,000 other users request the new content.]


Web Performance: CDNs (cont’d)

• Overlay network to distribute content from origin servers to users

• Avoids large amounts of the same data repeatedly traversing potentially congested links on the Internet

• Reduces Web server load

• Reduces user-perceived latency


DNS-based Request Routing

[Figure: the client merlot.cis.udel.edu (128.4.30.15) sends a DNS query for www.cnn.com to its local DNS server louie.udel.edu (128.4.4.12), and the query also reaches the Akamai DNS. The DNS response is the A record 145.155.10.15, and the client then opens its session with a surrogate. The Akamai CDN surrogates shown are 145.155.10.15 and 58.15.100.152; the names delaware.cnn.akamai.com and california.cnn.akamai.com also appear.]

Q: How does the Akamai DNS know which surrogate is closest?


DNS-Based Request Routing (cont’d)

[Figure: DNS query and response between the client merlot.cis.udel.edu (128.4.30.15), the local DNS server louie.udel.edu (128.4.4.12), and the Akamai DNS, followed by the session with a surrogate. The surrogates in the Akamai CDN measure their paths to the client's local DNS server ("Measure to Client DNS") and report the measurement results to the Akamai DNS.]


DNS-Based Redirection

Problem: the choice of content server is optimized for the local name server, not the actual client

Client may be far from name server

In a study, only 16% of the clients were in the same network-aware cluster as the local DNS server


Total & Selective Redirection

1. Total redirection

- Any request for the origin server is redirected to the CDN

- Basically, the CDN takes control of the content provider’s DNS zone

- Benefit: all requests are automatically redirected

- Disadvantage: may send lots of traffic to the CDN, hence expensive for the content provider

2. Selective redirection

- The content provider marks which objects are to be served from the CDN

- Typically, larger objects like images are selected

- Refer to images as: <img src=http://cdn.com/foo/bar/img.gif>

- Pro: fine-grained control over what gets delivered

- Con: have to (manually) mark content for the CDN


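Total Redirection

[Figure: a client, a CDN surrogate server, and the origin server. The client sends GET index.html and GET image1.gif, image2.gif. The objects index.html, image1.gif, and image2.gif pass between the origin server and the surrogate. index.html embeds image1.gif and image2.gif.]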


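Partial Redirection

[Figure: a client, the origin server, and a CDN surrogate server. The client sends GET index.html and GET image1.gif, image2.gif. Only image1.gif and image2.gif pass between the origin server and the surrogate. index.html embeds image1.gif and image2.gif.]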


Total vs. Selective Redirection

Total redirection has clearly superior performance

Selective redirection is typically slower than downloading everything from the origin server

- But the origin server might be loaded…

Which redirection is used more?

- Initially, selective redirection was used

- These days, mainly total redirection


Web Performance: Client Connectivity

Finding clients’ connection quality

Delivering the most suitable version of content

- Sending just the base document

- Using compression

Tailoring the server’s policy

- Keeping persistent connections open longer

Measure the inter-arrival time of requests to classify clients.
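One way to picture the classification step is a threshold rule on the typical gap between a client's consecutive requests. The class labels, the use of the median, and both thresholds (in seconds) are assumptions made for this sketch and are not taken from the slides.

```python
from statistics import median

def classify_client(request_times, fast_gap=0.1, slow_gap=1.0):
    """Classify a client's connectivity from its request timestamps (seconds).

    Short gaps between consecutive requests suggest good connectivity.
    The thresholds fast_gap and slow_gap are hypothetical.
    """
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]
    if not gaps:
        return "unknown"  # a single request carries no timing information
    typical = median(gaps)
    if typical <= fast_gap:
        return "rich"
    if typical >= slow_gap:
        return "poor"
    return "normal"

# Hypothetical timestamps of requests from three clients.
fast_client = classify_client([0.0, 0.05, 0.10, 0.15])
slow_client = classify_client([0.0, 2.0, 4.0])
new_client = classify_client([5.0])
```

A server could use the resulting class to decide, for example, whether to compress responses or keep a persistent connection open longer.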


Web Performance: Client Connectivity (cont’d)

Stability of client classification

Classifying new clients using network-aware clustering

- same cluster → same class

Classification works best for sites having a variety of clients.


Web Performance: Client Connectivity (cont’d)

Balachander Krishnamurthy, Craig E. Wills, Yin Zhang, and Kashi Vishwanath. Design, Implementation, and Evaluation of a Client Characterization Driven Web Server. In Proceedings of the World Wide Web Conference, May 2003.

Server Action conclusions:

- Compression - consistently good results for poorer but not well-connected clients.

- Reducing the quality of objects only yielded benefits for a modem client.

- Bundling was effective when there was good connectivity or poor connectivity with large latency.

- Persistent connections with serialized requests did not show significant improvement

- Pipelining was only significant for clients with high throughput or RTT.


Web Performance: Protocol Compliance

A 16-month study used the httperf tool to test for HTTP protocol compliance.

Absence of required headers (such as date)

Nearly half the servers did not implement range requests.

Inability to handle long URIs in a graceful manner.

The popular Apache server was most compliant, then Microsoft’s IIS.


State of the Art


Web Applications: Searching

In 1999, 200 million pages and 1.5 billion links were examined.

The probability of a node having in-degree i is proportional to 1/i^x (x > 1).

Nodes with a large in-degree are considered “high rank”

Used frequently in search engines

Sites may use fake linkages to trick crawlers.


Web Applications: Searching (cont’d)

A four-part separation in web structure.

A central core

Two parts connected to the core

One part with no connection to the core

All the components have roughly equal numbers of pages!


Web Applications: Searching (cont’d)

Over 90% of web pages are reachable from each other.

The probability of reaching a random page from another is only 0.25.

The well-connected component will remain connected even if we remove nodes with large degrees (hubs).


Web Applications: Searching (cont’d)

Image resources change infrequently.

Many text documents change periodically.

Some studies have tried to model the rate of change of pages as a Poisson process.

Some studies have examined the rate of change in different domains (e.g. .com vs .org).
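As a brief illustration of what the Poisson model implies (standard notation, not taken from the slides): if a page changes at an average rate of $\lambda$ changes per unit time, the number of changes $N(t)$ in an interval of length $t$ and the probability that the page has changed at least once in that interval are

```latex
P\bigl(N(t) = k\bigr) = \frac{(\lambda t)^{k} e^{-\lambda t}}{k!},
\qquad
P\bigl(N(t) \ge 1\bigr) = 1 - e^{-\lambda t}
```

A crawler that revisits a page every $t$ time units can use the second expression to estimate how likely its stored copy is to be stale.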


Web Applications: Searching (cont’d)

150 web sites were studied over a 7-month period.

Incoming links of the pages were computed

Rich getting richer!

Pages in the bottom 60% of the ranking received no additional links.

Search engines need to change the way they rank pages.


Web Applications: Searching (cont’d)

A study examined several subsets of pages.

A significant fraction of links were dead, with impact on crawling and page ranking.

Over 50% dead links in some cases.

Faster crawling and more useful ranking by avoiding dead links.


Web Applications: Flash Crowds

Large number of legitimate and wanted requests (unlike DoS attacks, in which the requests are not wanted)

During flash crowds:

- Same average number of requests per client

- No increase in the number of client clusters

- Between 60% and 82% of the resources are accessed only at this time.

- Less than 10% of the resources are responsible for 90% of the requests.

DoS attackers have no way of knowing the typical distribution of client clusters.

- Many new clusters emerge.


Flash Crowd vs DoS Attack

Flash crowd

- Increase in number of clients

- Fixed number of clusters

DoS attack

- Increase in number of both clients and clusters
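The contrast above suggests a simple decision rule, sketched here under stated assumptions: compare client and cluster counts during a surge with a baseline period. The growth factor of 2 and the dictionary layout are hypothetical choices, and the clusters are assumed to come from network-aware clustering.

```python
def classify_surge(baseline, surge, growth=2.0):
    """Label a traffic surge from client and cluster counts.

    baseline and surge are dicts with 'clients' and 'clusters' counts.
    growth is a hypothetical factor that counts as a clear increase.
    """
    client_growth = surge["clients"] / baseline["clients"]
    cluster_growth = surge["clusters"] / baseline["clusters"]
    if client_growth < growth:
        return "no surge"
    # Many new clusters point to an attack; a stable set points to a flash crowd.
    if cluster_growth >= growth:
        return "possible DoS attack"
    return "possible flash crowd"

normal_day = {"clients": 1000, "clusters": 200}
crowd = classify_surge(normal_day, {"clients": 9000, "clusters": 230})
attack = classify_surge(normal_day, {"clients": 9000, "clusters": 1500})
```

The labels are deliberately tentative, since real traffic would need more evidence than two counters.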


Web Applications: Blogs

Providing early warning of flash crowds

Different rate of change compared to traditional web pages

Having many references, the same as popular web sites

Significant fraction of links going to other blogs

Having significantly more self-references
