
Ed O’Neill
Brian Lavoie

OCLC Online Computer Library Center, Inc.

Web Measurement, Metrics, and Mathematical Models Workshop

WWW9 Conference

Nonprofit, membership, library computer service and research organization …

• 9,000 member libraries world-wide

• W3C Member

• Cataloging, reference, resource sharing, and preservation services

• Maintain and distribute the Dewey Decimal Classification

Roadmap

• Web Characterization Project

• Sampling the Web

• Data Collection and Storage

• Data Analysis

Web Characterization Project

• Ongoing project: 1997 to present

• Answer basic questions about the Web:
– How big is it?
– What’s out there?
– How is it evolving?
– Focus on content, not network infrastructure

• Help libraries cope with integrating Web content into their collections

Definitions: Web Objects

• Sampling the Web requires clear and unambiguous definition of units

• The organization of Web-accessible information suggests three object types: Web resource, Web page, Web site

• Based on W3C Working Draft: http://www.w3.org/1999/05/WCA-terms/

Web Resource

• An information object that:
– is accessible from the Web (via HTTP)
– is irreducible (finest level of meaningful granularity)
– has an unambiguous identity (URI)

• In practice, a Web resource is a file accessible from the Internet via HTTP

http://www.oclc.org/info.htm

Web Page

http://www.oclc.org/images/logo.gif

http://www.oclc.org/applet.class

An aggregate object, consisting of one or more Web resources that are:

• Collectively identified by a single URI

• Rendered simultaneously as a single object

Web Site

A collection of Web pages that …

– reside at a single network location (IP address)

– are interlinked: any of the site’s Web pages can be accessed by:
• following a sequence of hyperlink references
• beginning at the site’s home page
• spanning only Web pages residing at the same network location

Sampling the Web

• Objective: Collect representative Web sample

• Methodology: Identify and collect random sample of Web sites; every Web site should have an equal probability of being included in the sample

• Result: Random sample of Web sites; cluster sample of Web pages

Sampling Approach

[Figure: IP Address Space (4,294,967,296); Allocated addresses; HTTP hosts; Sampled addresses]
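The draw from the address space can be sketched as a uniform sample without replacement over the 4,294,967,296 possible IPv4 addresses. This is a minimal sketch, not the project's tooling: the helper name `sample_ipv4` and its `seed` parameter are assumptions made for illustration.

```python
import ipaddress
import random

ADDRESS_SPACE = 2 ** 32  # 4,294,967,296 possible IPv4 addresses


def sample_ipv4(k, seed=None):
    """Draw k distinct IPv4 addresses uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    # Sampling from a range object avoids materializing all 2**32 integers.
    picks = rng.sample(range(ADDRESS_SPACE), k)
    return [ipaddress.IPv4Address(n) for n in picks]


if __name__ == "__main__":
    for addr in sample_ipv4(5, seed=1):
        print(addr)
```

Because every address is equally likely to be drawn, a site reachable at exactly one address has the same inclusion probability as any other such site.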

Data Collection

[Figure: the Harvester contacts random IPs (IP #1, IP #2, IP #3) with “Hello … Do you speak HTTP?” and receives one of three replies:]

• No response

• “Yes … Welcome” (HTTP Code = 200)

• “Yes … Go away” (HTTP Code = 403)
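The exchange in the figure could look roughly like the following: one HTTP request to port 80, then a three-way classification of the result. This is a sketch using Python's standard `http.client`, not the harvester's actual code; the function names, the timeout value, and the outcome labels are assumptions.

```python
import http.client


def probe(ip, timeout=5.0):
    """Send one HTTP request to port 80; return the status code, or None if no reply."""
    conn = http.client.HTTPConnection(ip, 80, timeout=timeout)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None  # refused, timed out, or not speaking HTTP
    finally:
        conn.close()


def classify(status):
    """Map a probe result to the outcomes in the figure (labels are hypothetical)."""
    if status is None:
        return "no response"
    if status == 200:
        return "welcome"
    if status == 403:
        return "go away"
    return "other"
```

A call such as `classify(probe("132.174.1.5"))` performs a live network request, so it is best tried only against hosts you are permitted to contact.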

Polychrest Harvester

• Java-based Web harvesting agent

• Analyzes URI references in HTML markup to determine object type and extent

• Currently analyzes following elements: <A>, <FRAME>, <INPUT>, <AREA>, <HEAD>, <LINK>, <BASE>, <IFRAME>, <SCRIPT>, <BODY>, <IMG>
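To illustrate what pulling URI references out of these elements involves, here is a small Python sketch built on the standard `html.parser` module. Polychrest itself is Java-based and its internals are not described here. The attribute chosen for each element follows HTML 4 conventions and is an assumption, as are the class and function names.

```python
from html.parser import HTMLParser

# Assumed URI-bearing attribute per element (HTML 4 conventions);
# this table is illustrative and not taken from Polychrest.
URI_ATTRS = {
    "a": "href", "area": "href", "link": "href", "base": "href",
    "frame": "src", "iframe": "src", "img": "src", "script": "src",
    "input": "src", "body": "background", "head": "profile",
}


class URIReferenceParser(HTMLParser):
    """Collect (element, URI) pairs from the elements listed above."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        wanted = URI_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.references.append((tag, value))


def extract_references(html):
    parser = URIReferenceParser()
    parser.feed(html)
    parser.close()
    return parser.references
```

For example, `extract_references('<IMG SRC="oclc.gif">')` yields a single `("img", "oclc.gif")` pair, which a later stage can classify.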

URI Analysis

• Two stages:

(1) determine object type

(2) filter on network location (if applicable)

• Examples: Sample IP: 132.174.1.5

NO

YES

YES

YES

<A HREF=“http://www.oclc.org/page.htm”>

<A HREF=“http://www.microsoft.com”>

<IMG SRC=“oclc.gif”>

<IMG SRC=“http://www.w3.org/w3.gif”>
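One way to read the two stages is sketched below. It assumes that hyperlink elements point at Web pages, which are then filtered on network location, while embedded elements such as images are resources of the current page and are kept wherever they reside. That reading, the tag groupings, the function name, and the host-to-IP table are all assumptions; the table stands in for a real DNS lookup.

```python
from urllib.parse import urljoin, urlparse

PAGE_TAGS = {"a", "area"}          # assumed: references to other Web pages
RESOURCE_TAGS = {"img", "script"}  # assumed: resources embedded in the page


def analyze(tag, uri, base_url, sample_ip, host_to_ip):
    """Return (object_type, keep) for one URI reference."""
    # Stage 1: determine object type from the element that carries the URI.
    if tag in RESOURCE_TAGS:
        return "resource", True  # network-location filter not applied
    if tag not in PAGE_TAGS:
        return "unknown", False
    # Stage 2: keep a page only if its host maps to the sampled IP.
    host = urlparse(urljoin(base_url, uri)).hostname
    return "page", host_to_ip.get(host) == sample_ip


if __name__ == "__main__":
    hosts = {"www.oclc.org": "132.174.1.5"}  # hypothetical lookup table
    base = "http://www.oclc.org/"
    for tag, uri in [
        ("a", "http://www.oclc.org/page.htm"),
        ("a", "http://www.microsoft.com"),
        ("img", "oclc.gif"),
        ("img", "http://www.w3.org/w3.gif"),
    ]:
        print(tag, uri, analyze(tag, uri, base, "132.174.1.5", hosts))
```

Under these assumptions the four example references give three keeps and one rejection, which agrees with the count of YES and NO labels on the slide.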

Harvesting

• Harvesting of a Web site is initiated immediately after it is identified

• Polychrest understands Web object definitions for resources, pages, and sites

• Web site extent determined by:
– breadth-first search, using home page as root
– follow internal Web page links only
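The extent rule above amounts to a standard breadth-first traversal. The sketch below runs it over an in-memory link table instead of live pages; the `links` mapping and the `is_internal` predicate are hypothetical stand-ins for fetching a page and applying the network-location filter.

```python
from collections import deque


def site_extent(home_page, links, is_internal):
    """Breadth-first search from the home page, following internal links only."""
    seen = {home_page}
    queue = deque([home_page])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            # External pages are never enqueued, so the walk stays on the site.
            if is_internal(target) and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    links = {
        "/": ["/a", "/b", "http://elsewhere.example/"],
        "/a": ["/c", "/"],
        "/b": ["/c"],
    }
    print(site_extent("/", links, lambda u: u.startswith("/")))
```

Pages that exist on the server but cannot be reached from the home page through internal links are not discovered, which matches the "interlinked" condition in the Web site definition.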

Unique Web Sites

• Not uncommon for a single Web site to be accessible from multiple IP addresses

• Sites at different IPs, but with identical content, are considered to be one logical site (often identified with a single domain name)

• Creates bias in sample: greater probability of these sites being selected than sites associated with a single address

Filtering Rule

A harvested IP is only considered a “hit” if the sample IP is the “lowest” among all IPs associated with a given collection of Web pages.

Example: 132.174.1.6, 132.174.1.5, 132.174.1.4

How can we identify sites with multiple IPs?

De-Duping Tests

Domain-name-to-IP-address mapping:
– for sites with domain names
– resolve domain name to IP address; if sampled IP is lowest among returned IP(s), OK

Example:
Sample IP: 207.46.130.149

Resolves to www.microsoft.com

www.microsoft.com resolves to: 207.46.131.137, 207.46.131.30, 207.46.130.149, 207.46.130.45, 207.46.130.14
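The filtering rule and this first test can be sketched together as follows. Addresses are compared numerically, since plain string comparison can misorder octets. The function names are assumptions, and the resolver is passed in as a parameter so the check can be exercised without a live DNS query.

```python
import ipaddress
import socket


def lowest_ip(ips):
    """Return the numerically lowest address (string order can be wrong)."""
    return str(min(ipaddress.IPv4Address(ip) for ip in ips))


def resolve(name):
    """Look up all IPv4 addresses for a domain name (performs a live DNS query)."""
    return socket.gethostbyname_ex(name)[2]


def passes_dns_test(sample_ip, name, resolver=resolve):
    """Filtering rule: the sampled IP must be the lowest IP the name resolves to."""
    return sample_ip == lowest_ip(resolver(name))


if __name__ == "__main__":
    # Addresses from the example above, supplied through a stub resolver.
    example = ["207.46.131.137", "207.46.131.30", "207.46.130.149",
               "207.46.130.45", "207.46.130.14"]
    print(passes_dns_test("207.46.130.149", "www.microsoft.com", lambda _: example))
```

With the example addresses, the sampled IP 207.46.130.149 is not the lowest (207.46.130.14 is), so under this sketch it would not count as a hit.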

De-Duping Tests … Continued

“Same-Octet” Test:
– Harvest home page from IP addresses with same first three octets as sampled IP, but lower 4th octet

Example: 132.174.1.5: 132.174.1.4, 132.174.1.3, 132.174.1.2, 132.174.1.1, 132.174.1.0

– If any home page harvested from a lower 4th octet matches home page from sampled IP, filtering rule is failed
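A sketch of the same-octet test follows. It enumerates the lower neighbors of the sampled IP and compares home pages. The page-fetching function is injected and hypothetical, and "matches" is taken here to mean exact equality, which is an assumption about how the comparison was done.

```python
def lower_neighbors(sample_ip):
    """IPs sharing the first three octets with sample_ip but with a lower 4th octet."""
    a, b, c, d = sample_ip.split(".")
    return [f"{a}.{b}.{c}.{n}" for n in range(int(d) - 1, -1, -1)]


def passes_same_octet_test(sample_ip, fetch_home_page):
    """Fail if any lower neighbor serves the same home page as the sampled IP."""
    home = fetch_home_page(sample_ip)
    for ip in lower_neighbors(sample_ip):
        page = fetch_home_page(ip)  # None means no page was harvested there
        if page is not None and page == home:
            return False
    return True


if __name__ == "__main__":
    pages = {"132.174.1.5": "<html>A</html>", "132.174.1.2": "<html>A</html>"}
    print(lower_neighbors("132.174.1.5"))
    print(passes_same_octet_test("132.174.1.5", pages.get))
```

The test costs at most 255 extra home-page requests per sampled site, and it catches duplicates that have no domain name to resolve.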

De-Duping Tests … Continued

• Intra-Sample Duplicate Detection:
– Identify sites within sample with identical content
– Retain only site with lowest IP address

• Unique Web Site:
– Defined as any site identified in the sample that passes all three of the duplicate detection tests
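Intra-sample duplicate detection can be sketched by grouping sites on a digest of their content and keeping the lowest IP in each group. Using SHA-256 over the raw bytes as the notion of "identical content" is an assumption, as is the function name.

```python
import hashlib
import ipaddress


def dedupe_sample(sites):
    """sites maps IP -> harvested content (bytes); keep lowest IP per distinct content."""
    lowest_by_digest = {}
    for ip, content in sites.items():
        digest = hashlib.sha256(content).hexdigest()
        addr = ipaddress.IPv4Address(ip)
        # Replace the stored address only when this one is numerically lower.
        if digest not in lowest_by_digest or addr < lowest_by_digest[digest]:
            lowest_by_digest[digest] = addr
    return [str(a) for a in sorted(lowest_by_digest.values())]


if __name__ == "__main__":
    sample = {"10.0.0.9": b"same", "10.0.0.10": b"same", "10.0.0.3": b"other"}
    print(dedupe_sample(sample))
```

Hashing keeps the comparison linear in the number of sites rather than requiring every pair to be compared.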

Synopsis: 1999 Sample

IP Addresses: 4,294,967,296

Sampled IPs (0.1%): 4,294,967

Connect to Port 80 for each sampled IP address
– Web site identified if HTTP response code = 200

Sampled Web Sites: 4,882
– hit rate of about 1 out of a thousand

Apply De-Duping Tests

Sampled Unique Sites: 3,649
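The arithmetic behind these figures can be checked in a few lines. Scaling the unique-site count by the inverse of the sampling fraction is presumably how a Web-wide estimate would be derived from the sample; that step is an assumption here, not something the synopsis states.

```python
ADDRESS_SPACE = 2 ** 32          # 4,294,967,296
SAMPLED_IPS = 4_294_967          # 0.1% sample
SAMPLED_SITES = 4_882
SAMPLED_UNIQUE_SITES = 3_649

sampling_fraction = SAMPLED_IPS / ADDRESS_SPACE
hit_rate = SAMPLED_SITES / SAMPLED_IPS
# Assumed estimator: scale the sample count by the inverse sampling fraction.
estimated_unique_sites = SAMPLED_UNIQUE_SITES / sampling_fraction

print(f"sampling fraction: {sampling_fraction:.4%}")
print(f"hit rate: 1 in {1 / hit_rate:.0f}")
print(f"estimated unique sites: {estimated_unique_sites:,.0f}")
```

The hit rate works out to roughly one site per 880 sampled addresses, in line with "about 1 out of a thousand", and the scaled estimate is about 3.6 million unique sites.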

Network Security

• Attempts to connect to random IP addresses have been viewed suspiciously by network administrators
– like calling unlisted telephone numbers

• Inquiries have been made about our activity (mostly cordial)

• For June 2000 Web sample:
– assign separate IP and domain name to machine running harvester
– run Web server with page explaining our project and supplying contact information

Data Storage

• Polychrest stores data collected from a single Web site in one SGML-format archive file

• Software splits archive file into separate files for manual viewing; links are localized
– Harvested Site Example: 192.48.117.67.dmp

• For long-term storage, converting SGML into Internet Archive format

[Figure: Site Growth (1,000), by year. 1997: 1.2 million; 1998: 2 million; 1999: 3.6 million]

Web Site Types

• Provisional site: serves only temporary or transitional pages (server templates, “under construction” pages, “site has moved” pages)

• Private site: prohibits access explicitly (password, IP filter, firewall) or implicitly (site intended to be used by specific users)

• Public site: provides unrestricted access to some portion of the site containing meaningful content

[Figure: Types of Sites (1,000): 1999. Provisional: 1 million; Private: 400,000; Public: 2.2 million]

Accomplishments

• Well-tested sampling methodology

• Data collection and analysis tools

• Innovative data analysis

• Only consistent time-series (1998 - present)

• Data available on request for scholarly use

Further Information...

OCLC Online Computer Library Center, Inc.: http://www.oclc.org/

Web Characterization Project: http://www.oclc.org/oclc/research/projects/webstats/

E-mail: [email protected]

Web Publishing Patterns

• Self-publishing: Web publishing patterns do not follow the print model. The vast majority of Web sites exist to promote and disseminate information about the site’s publisher. Unlike traditional print publishers, only a minority of Web publishers ‘sell’ information

• Volatility: The Web is very volatile: less than half of the Web sites in the 1998 sample still existed when the 1999 sample was collected. Pages are even more volatile

• Inaccessible: Less than half of the Web sites have been indexed by the major search engines; an even lower proportion of the pages has been indexed

Emergence of Dark Matter

• Dynamically generated information, usually in response to a query

• Inaccessible to harvesters

• Cannot be indexed

• Dark information appears to be more common in the latest sample

Dark Matter Example

194.66.97.202, 194.66.99.88, 194.66.102.59, 194.66.110.112, 194.66.122.251, 194.66.123.63

These six IP addresses from the sample produced:

Site with Multiple IP Addresses

Example Responses (Edited)

For the past two weeks or so, a host registered to you, has been sending network-scanning-like activity to port 80 of seemingly random IP addresses in our address space. I’m not sure the purpose of this activity but it appears to be in error. It appears innocuous enough; figured it would’ve stopped on its own by now.

We have noticed that an oclc server has been regularly checking a machine in our domain. Can you tell us why this server is interested in our little purple SGI?

[Our] server has no restrictions on access, but as far as I know, there are no links to it on any other web sites or search engines, and we have told no one but our development partners about it. Therefore, I was surprised when I found [oclc] in the server's log files many times over the last several months. So...can you identify this user, how they found out about our server, and what their intentions are? If it is a user, we'd appreciate knowing who it is.

