COM1721: Freshman Honors Seminar
A Random Walk Through Computing
Lecture 2: Structure of the Web
October 1, 2002
Why is Web Structure Interesting?
Design of search engines:
  Improved crawl strategies
  Make use of link information to give better ranking, e.g., Google
Generate good representative structures for simulations
Relationship to other Internet structures:
  Traffic patterns
  User access patterns
Why is Web Structure Interesting?
Understanding the sociology of content creation on the web:
  Six degrees of separation and the small-world phenomenon [Milgram 67]
  Is every web page just six clicks away from every other web page?
Simply because it is out there!
Background for the Study
Conducted by researchers at AltaVista, Compaq, and IBM
Analyzed the connectivity of more than 200M web pages and 1.5B links
AltaVista web crawl, May 1999:
  Start from a large number of sources
  Follow links in a breadth-first search manner and add pages to the database
Structure determined by the set of all web pages crawled, together with their in-links and out-links
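The crawl procedure described above can be sketched on a toy link table. This is only an illustration, not the AltaVista crawler: the `LINKS` table and its page names are made up, and a real crawler would fetch pages over the network instead of reading a dictionary.

```python
from collections import deque

# Hypothetical toy "web": page -> pages it links to (assumption, for illustration).
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
    "e": ["a"],  # no page links to "e", so a crawl from "a" never finds it
}

def crawl(sources, links):
    """Breadth-first crawl from `sources`.

    Returns two dictionaries covering every page reached:
    out_links[page] and in_links[page].
    """
    out_links = {}
    in_links = {}
    seen = set(sources)
    queue = deque(sources)
    while queue:
        page = queue.popleft()
        out_links[page] = list(links.get(page, []))
        in_links.setdefault(page, [])
        for target in out_links[page]:
            in_links.setdefault(target, []).append(page)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return out_links, in_links

out_links, in_links = crawl(["a"], LINKS)
```

Page "e" shows why the study started from a large number of sources: a page with no in-links is only found if the crawl starts there.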
Bowtie Components
SCC (Core)
  Largest strongly connected component
  Every page in core can reach every other page in core
  56 million
IN (Origination)
  All pages outside the core that can reach the core
  44 million
Bowtie Components
OUT (Termination)
  All pages outside the core that are reachable from the core
  44 million
Other pages: neither reachable from the SCC nor able to reach the SCC
  Reachable from IN or can reach OUT (Tendrils)
  Completely disconnected from the rest (Disconnected)
  Total of 60 million
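The bowtie definitions can be turned into a small program. The sketch below is one possible way to do it, not the method used in the study: it assumes we already know one page (`core_node`, a hypothetical parameter) that lies in the core, and the node names and edges are invented for illustration.

```python
def reachable(start, adj):
    """Set of nodes reachable from any node in `start` by following `adj`."""
    seen = set(start)
    stack = list(start)
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bowtie(nodes, edges, core_node):
    """Split `nodes` into SCC, IN, OUT, Tendrils and Disconnected.

    Assumes `core_node` is a page known to be in the core.
    """
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)  # follow links forward
        bwd.setdefault(v, []).append(u)  # follow links backward
    forward = reachable({core_node}, fwd)   # pages the core can reach
    backward = reachable({core_node}, bwd)  # pages that can reach the core
    scc = forward & backward
    out = forward - scc
    in_ = backward - scc
    rest = set(nodes) - forward - backward
    # Tendrils: reachable from IN, or able to reach OUT.
    tendrils = rest & (reachable(in_, fwd) | reachable(out, bwd))
    return {
        "SCC": scc,
        "IN": in_,
        "OUT": out,
        "Tendrils": tendrils,
        "Disconnected": rest - tendrils,
    }

# Hypothetical toy web.
NODES = ["i", "c1", "c2", "o", "t1", "t2", "d", "d2"]
EDGES = [("i", "c1"), ("c1", "c2"), ("c2", "c1"), ("c2", "o"),
         ("i", "t1"), ("t2", "o"), ("d", "d2")]
parts = bowtie(NODES, EDGES, "c1")
```

Here "t1" hangs off IN and "t2" points into OUT, so both are tendrils, while "d" and "d2" only link to each other and end up disconnected.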
Example Pages: SCC
CCS! http://www.ccs.neu.edu
  Links to many communities and other authoritative sites outside CCS
Authoritative sites such as:
  http://www.ccs.neu.edu/home/rraj/Courses/172x/F02/
  http://www.northeastern.edu
  http://www.boston.com
  http://www.yahoo.com
Example Pages: IN
Individual home pages on web hosting services:
  Do not have links from authoritative sources and core pages
  Have connections to core pages through a series of links
New or obscure web pages that have not attracted attention
Example Pages: OUT
Commercial sites
  Pages point to pages within the site
  Rarely point to pages outside the site
  http://www.ibm.com
Pages that can be reached from a core site but do not link back to the core
  http://www.ccs.neu.edu/home/rraj/papers.html
Example Pages: Tendrils
Pages not in OUT or CORE with paths to OUT
Pages not in IN or CORE with paths from IN
A private web page in IN points to a page with links to corporate sites
Example Pages: Disconnected Pages
Temporary set of pages for working on a project
http://www.ccs.neu.edu/home/chenj/rsch/discussions.htm
Pages that were linked to the core, OUT, or IN earlier, with the links now removed
How was the Study Done?
Crawlers started from many initial locations:
  Covered over 200 million web pages
  With 1.5 billion links among these pages
  9.6 GB storage after compression
A web page is characterized only by its URL and its links to other URLs
  Page content is not relevant to the study
A view that extracts essential information relevant to the purpose and ignores inessential details:
Abstraction!
Finding the Structure
Got a list of 200 million web pages and 1.5 billion links
How do we find out:
  The distance between two pages?
  Which pages can be reached from a given page?
  Which is the most popular web page?
Represent the web as a graph!
CCS Web as a Graph
[Figure: part of the CCS website (http://www.ccs.neu.edu) drawn as a directed graph. Node labels: CCS, NU, NU ACM, Chapters, Directory, US, IS, Contact Us, People, Help, Research, Orgns., Alumni]
Directed Graphs
A directed graph is a pair G = (V, E)
  V: Set of vertices (nodes)
  E: Set of directed edges (links), each going from one vertex to another
[Figure: directed graph on the nodes NU ACM, Chapters, Directory, US]
V = {NUACM, Chapters, Directory, US}
E = {(NUACM, Chapters), (Chapters, Directory), (Directory, US), (US, NUACM)}
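One common way to store such a graph in a program is an adjacency list: for each vertex, the list of vertices its edges point to. The sketch below uses the V and E from this slide; the function name `adjacency` is just an illustrative choice.

```python
# The example graph from the slide.
V = {"NUACM", "Chapters", "Directory", "US"}
E = {("NUACM", "Chapters"), ("Chapters", "Directory"),
     ("Directory", "US"), ("US", "NUACM")}

def adjacency(vertices, edges):
    """Build an adjacency list: vertex -> list of vertices it links to."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
    return adj

adj = adjacency(V, E)
```

With this representation, "which pages does NUACM link to?" is a single lookup, `adj["NUACM"]`.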
Graph Terms
In-degree: Number of edges into a node
Out-degree: Number of edges out of a node
Suppose a directed graph has n nodes and m edges:
  Average in-degree?
  Average out-degree?
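A quick way to explore the question is to count degrees on a small graph. The sketch below uses a made-up graph with n = 4 nodes and m = 5 edges. Every edge leaves exactly one node and enters exactly one node, so the in-degrees and the out-degrees both add up to m. Treating the page with the largest in-degree as the "most popular" is only one simple notion of popularity, assumed here for illustration.

```python
def degrees(nodes, edges):
    """Return (in_degree, out_degree) dictionaries for a directed graph."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for u, v in edges:
        outdeg[u] += 1  # edge leaves u
        indeg[v] += 1   # edge enters v
    return indeg, outdeg

# Hypothetical example graph.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]

indeg, outdeg = degrees(nodes, edges)
avg_in = sum(indeg.values()) / len(nodes)
avg_out = sum(outdeg.values()) / len(nodes)
most_popular = max(indeg, key=indeg.get)  # page with the most in-links
```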
More Graph Terms
Strongly connected graph: There is a directed path from every node to every other node
Distance from node u to v: Number of links on the shortest path from u to v
Diameter: Maximum distance between any two nodes
  Finite for strongly connected graphs only
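Distances in an unweighted graph can be computed with breadth-first search, and the diameter by repeating the search from every node. The sketch below is a minimal version that assumes every node appears as a key of the adjacency dictionary; it returns `None` for the diameter when the graph is not strongly connected. Running a search from every node is fine for small examples but far too slow for 200 million pages.

```python
from collections import deque

def distances_from(source, adj):
    """Breadth-first search: distance from `source` to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Maximum distance over all ordered pairs, or None if some pair has no path."""
    nodes = list(adj)
    best = 0
    for u in nodes:
        dist = distances_from(u, adj)
        if len(dist) < len(nodes):  # some node is unreachable from u
            return None
        best = max(best, max(dist.values()))
    return best

# The four-node cycle NUACM -> Chapters -> Directory -> US -> NUACM.
cycle = {"NUACM": ["Chapters"], "Chapters": ["Directory"],
         "Directory": ["US"], "US": ["NUACM"]}
# A chain that is not strongly connected (hypothetical example).
chain = {"a": ["b"], "b": []}
```

In the cycle, going from NUACM to US takes three links even though US links directly to NUACM: distance in a directed graph depends on direction.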
Undirected Graphs
Edges are undirected: (u,v) is equivalent to (v,u)
Degree of a node: Number of edges incident to it
Connected: There is a path between any two nodes
[Figure: undirected graph on nodes 1, 2, 3, 4]
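The undirected definitions can be checked on a four-node example. The edges below are an assumption made for illustration, since the edges of the slide's figure are not reproduced in the text.

```python
def undirected_adjacency(nodes, edges):
    """Store each undirected edge in both directions."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def degree(adj, v):
    """Number of edges touching node v."""
    return len(adj[v])

def is_connected(adj):
    """True if every node can be reached from an arbitrary starting node."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)

# Hypothetical edges on the nodes 1, 2, 3, 4.
graph = undirected_adjacency([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (3, 4)])
# Same nodes without the edge (3, 4): node 4 is cut off.
split = undirected_adjacency([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1)])
```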
Graphs: Useful Representation Tools
Social networks
Transportation networks
Control flow of a program
Flowchart of a manufacturing process
Computer networks
Bibliography citations
…
Structural Properties of the Web
Diameter of the SCC is at least 28
Pick a random source page u and a random destination page v: How many links is v away from u?
  75% of the time, there is no path!
  The other 25% of the time, average distance is 16
Interesting distribution of degrees and sizes of connected components: power laws
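The random-pair experiment can be imitated on a small graph. The sketch below is an illustration under stated assumptions, not the procedure used in the study: the function names, the fixed `seed`, and the two toy graphs are all made up, and the numbers it produces say nothing about the real web.

```python
import random
from collections import deque

def bfs_distance(adj, u, v):
    """Number of links on a shortest path from u to v, or None if there is no path."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in adj.get(x, []):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return None

def sample_pairs(adj, trials, seed=0):
    """Pick random (source, destination) pairs of distinct nodes.

    Returns (fraction of pairs with a path, average distance over those pairs).
    The average is None when no sampled pair has a path.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    found = []
    for _ in range(trials):
        u, v = rng.sample(nodes, 2)
        d = bfs_distance(adj, u, v)
        if d is not None:
            found.append(d)
    fraction = len(found) / trials
    average = sum(found) / len(found) if found else None
    return fraction, average

# Two hypothetical extremes: a directed cycle and a graph with no links at all.
cycle = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
no_links = {"a": [], "b": [], "c": [], "d": []}
cycle_result = sample_pairs(cycle, trials=50)
no_links_result = sample_pairs(no_links, trials=50)
```

In the cycle every pair is connected, and with no links no pair is; the web's 25% falls between these two extremes.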