Detecting Sequences and Cycles of Web Pages
Narayan L. Bhamidipati and Sankar K. Pal
Indian Statistical Institute, Kolkata
Contents
• Introduction
• Objective
• Significance
• Procedure
• Experiments
• Future directions
The Web: A Directed Graph
• (V, A)
• Vertices: Web pages
  • V = {v1, v2, …, vN}
• Arcs: Hyperlinks
  • A = {eij : vj → vi}
• Path: p1.p2. … .pn with arcs from pi to pi+1
• Cycle: A Path with pn = p1
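The definitions above can be illustrated with a minimal sketch. It assumes the Web graph is stored as a dictionary mapping each page to the set of pages it links to; the names `Graph`, `is_path`, `is_cycle` and the toy data are illustrative and not from the original slides.

```python
from typing import Dict, List, Set

# Hypothetical representation: each page maps to the set of pages it links to.
Graph = Dict[str, Set[str]]


def is_path(graph: Graph, pages: List[str]) -> bool:
    """True if there is an arc from every page to its successor in `pages`."""
    return len(pages) >= 1 and all(
        nxt in graph.get(cur, set()) for cur, nxt in zip(pages, pages[1:])
    )


def is_cycle(graph: Graph, pages: List[str]) -> bool:
    """True if `pages` is a path whose last page equals its first page."""
    return len(pages) >= 2 and pages[0] == pages[-1] and is_path(graph, pages)


# Toy graph (made-up page names): A -> B -> C -> A, and C -> D.
toy_web: Graph = {
    "A": {"B"},
    "B": {"C"},
    "C": {"A", "D"},
    "D": set(),
}
```

Here `["A", "B", "C", "D"]` is a path, while `["A", "B", "C", "A"]` is a cycle because the path returns to its first page.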
Sequences of Web Pages
• Paths consisting of adjacent web pages
• Order sensitive
• A surfer may follow one such sequence when browsing pages
Cycles of Web Pages
• http://www.stanford.edu/
• http://www.stanford.edu/home/atoz/letterw.html
• http://www.stanford.edu/group/wellspring/
• http://www.stanford.edu/group/wellspring/yahoo_spotlight.html
• http://www.yahoo.com/
• http://dir.yahoo.com/Education/
• http://dir.yahoo.com/Education/Higher_Education/
• http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/
• http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/United_States/
• http://www.stanford.edu
What are we looking for?
• Particular kinds of sequences and cycles
  • Regular
  • Consisting of similar units
  • Units having similar relationships
  • Reasonably sized
Why are these Sequences and Cycles Interesting?
• Individual units form a single object
• These were intended to be together
• They collectively include the complete information
• Despite being part of a collection, individuality is maintained
Significance of Detecting Such Sequences and Cycles
• Compression
  • Merge groups of pages
  • Fewer pages, hence fewer links
• Pre-fetching
  • Know where the surfer wants to be next
  • Fetch the page(s) before they are requested
  • Saves time
  • Errors: pre-fetching wrong pages
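As a rough illustration of the pre-fetching idea, the sketch below assumes a sequence has already been detected and is available as an ordered list of page names. The function name and the page names are hypothetical.

```python
from typing import List, Optional


def next_in_sequence(sequence: List[str], current: str) -> Optional[str]:
    """Return the page that follows `current` in a detected sequence, if any."""
    try:
        i = sequence.index(current)
    except ValueError:
        return None  # the surfer is not on a page of this sequence
    return sequence[i + 1] if i + 1 < len(sequence) else None


# Hypothetical detected sequence of tutorial chapters.
chapters = ["ch1.html", "ch2.html", "ch3.html"]
candidate = next_in_sequence(chapters, "ch2.html")  # page worth fetching early
```

A pre-fetcher would request `candidate` while the surfer is still reading the current page; if the surfer leaves the sequence, the fetch is wasted, which is the error case noted above.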
Significance of Detecting Such Sequences and Cycles (Contd.)
• Fair comparison
  • Comparison independent of how content is presented
  • Content split into multiple pages should be treated as equivalent to the same content in a single page
• Better retrieval
  • Retrieval independent of the presentation
  • Output a set of pages instead of a single one as a match
Fair Comparison
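One way to picture a fair comparison is to compare a whole sequence, rather than a single page, against another document. The sketch below uses bag-of-words cosine similarity purely as an assumed stand-in for whatever similarity measure is actually in use; the function names and sample texts are illustrative.

```python
import math
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Word counts of a text, split on whitespace."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def sequence_vector(pages: List[str]) -> Counter:
    """Word counts of a whole sequence, treated as one document."""
    total: Counter = Counter()
    for text in pages:
        total.update(bag_of_words(text))
    return total


single_page = "web pages form sequences and cycles"
split_pages = ["web pages form", "sequences and cycles"]

whole = cosine(bag_of_words(single_page), sequence_vector(split_pages))
part = cosine(bag_of_words(single_page), bag_of_words(split_pages[0]))
```

Under this measure the concatenated sequence matches the single page exactly, whereas its first page alone scores lower.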
Improved Retrieval
• Retrieve only portions of interest
  • Instead of whole (huge) documents
• Avoid rewarding more content
How to Detect Sequences and Cycles of Web Pages?
• Find navigational links
  • Find consecutive pages
• Define the condition that the elements of a sequence must satisfy
• Identify subsequences (or units)
  • Concatenate
• Check for cycles
Finding Navigational Links: Background
• The purpose of a link may be
  • Navigation
  • Reference
  • Advertisement
• Links between pages on the same server are treated as navigational
• Such links have also been treated as noise
Finding Navigational Links: Our Method
• Avoid assuming that links on the same server are navigational
• Navigational links appear mostly at the top or at the bottom of a page
• Navigational links are generally huddled together
• There is less text and fewer images around such links
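The "huddled together, with little text around them" heuristic can be sketched with Python's standard `html.parser`. This is only an illustration under assumed thresholds (`max_gap`, `min_links`), not the method used in the experiments, and it ignores both the top/bottom position cue and images.

```python
from html.parser import HTMLParser
from typing import List


class LinkGroupParser(HTMLParser):
    """Groups consecutive <a href> links separated by at most `max_gap`
    characters of non-link text (hypothetical threshold)."""

    def __init__(self, max_gap: int = 5) -> None:
        super().__init__()
        self.max_gap = max_gap
        self.groups: List[List[str]] = []
        self._current: List[str] = []
        self._gap = 0  # characters of non-link text since the last link
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        if self._gap > self.max_gap:
            self._flush()  # too much text in between: start a new group
        self._current.append(href)
        self._gap = 0
        self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if not self._in_link:
            self._gap += len(data.strip())

    def _flush(self) -> None:
        if self._current:
            self.groups.append(self._current)
        self._current = []

    def close(self) -> None:
        super().close()
        self._flush()


def navigational_candidates(html: str, min_links: int = 2, max_gap: int = 5) -> List[str]:
    """Return hrefs that occur in a group of at least `min_links` huddled links."""
    parser = LinkGroupParser(max_gap)
    parser.feed(html)
    parser.close()
    return [h for group in parser.groups if len(group) >= min_links for h in group]
```

Links embedded in running text end up in groups of size one and are dropped, while a "Previous | Next" bar survives as a group of two.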
Advantages and Limitations
• Simple and fast
• Navigational links across servers are also identified
• Heuristics need not always work – fall back on sophisticated methods
Units of the Sequences
• ABC is a unit if C is “related” to B in the same way as B is “related” to A
• “related” is defined in terms of how they are linked
• Relation is stored as the “position” of the link
  • Several ways of defining “position”
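A minimal sketch of unit detection follows. It assumes one particular definition of "position", namely the index of a link among a page's out-links in document order, and it requires the three pages of a unit to be distinct. Both choices, and all names and data, are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical input: each page maps to its out-links in document order.
OutLinks = Dict[str, List[str]]


def find_units(outlinks: OutLinks) -> List[Tuple[str, str, str]]:
    """Return triples (A, B, C) where B is at the same link position in A
    as C is in B."""
    units = []
    for a, links_a in outlinks.items():
        for pos, b in enumerate(links_a):
            links_b = outlinks.get(b, [])
            if b != a and pos < len(links_b):
                c = links_b[pos]
                if c not in (a, b):  # assumption: three distinct pages
                    units.append((a, b, c))
    return units


# Toy pages whose links are [home, next] in that order.
pages: OutLinks = {
    "A": ["home", "B"],
    "B": ["home", "C"],
    "C": ["home", "D"],
    "D": ["home"],
    "home": [],
}
units = find_units(pages)  # [("A", "B", "C"), ("B", "C", "D")]
```

Other definitions of "position" (for example, counting from the end of the page) would only change how `pos` is computed.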
Combining the units into sequences
• DEF
• BCD
• ABC
• CDE
• ABCDEF
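The combination step in this example can be sketched as chaining units that overlap in two pages. The code assumes single-letter page names as on the slide and at most one continuation per page pair; it is an illustration, not necessarily the procedure used in the experiments.

```python
from typing import List, Sequence


def combine_units(units: Sequence[str]) -> List[str]:
    """Chain three-page units that overlap in two pages, e.g. ABC + BCD -> ABCD."""
    follows = {(u[0], u[1]): u[2] for u in units}  # (x, y) -> z for each unit xyz
    tails = {(u[1], u[2]) for u in units}          # page pairs that end some unit
    sequences = []
    for u in units:
        if (u[0], u[1]) in tails:
            continue  # some other unit leads into this one: not a starting unit
        seq = list(u)
        seen = {(u[0], u[1])}  # guards against looping forever on a cycle
        while (seq[-2], seq[-1]) in follows and (seq[-2], seq[-1]) not in seen:
            seen.add((seq[-2], seq[-1]))
            seq.append(follows[(seq[-2], seq[-1])])
        sequences.append("".join(seq))
    return sequences


combined = combine_units(["DEF", "BCD", "ABC", "CDE"])  # ["ABCDEF"]
```

A pure cycle has no starting unit and therefore yields no sequence here; such cases are left to the separate cycle check.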
Cycle detection
• Existing cycle detection algorithms
• Cycle detection in number theory
• Special case of cycle detection in graph theory
• Stack based algorithm
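A stack based check can be sketched as follows, assuming each page has at most one "next" page so that the links form a simple successor map. The slides do not spell out the algorithm, so this is only one plausible reading; the names and data are hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(next_page: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow `next_page` from `start`; return the first cycle met, or None."""
    stack: List[str] = []          # pages visited so far, in order
    position: Dict[str, int] = {}  # page -> its index on the stack
    page: Optional[str] = start
    while page is not None:
        if page in position:
            # Revisited a page on the stack: the cycle runs from there to here.
            return stack[position[page]:] + [page]
        position[page] = len(stack)
        stack.append(page)
        page = next_page.get(page)
    return None


links = {"A": "B", "B": "C", "C": "D", "D": "B"}
cycle = find_cycle(links, "A")  # ["B", "C", "D", "B"]
```

The returned list repeats its first page at the end, matching the definition of a cycle as a path with pn = p1.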
Improvements and Speedups
• Trust the “rel” information provided by the (author of the) pages
• Use keywords like “next” and “previous” to infer the relationships
• Exploit the naming convention of the pages
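The first two speedups can be sketched with the standard `html.parser`: read `rel` attributes on `<link>` and `<a>` elements, and fall back on anchor text such as "next" or "previous". The keyword table, the preference for the first match found, and all names are assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import Dict, Optional

# Hypothetical keyword table mapping rel values / anchor text to a relation.
KEYWORDS = {"next": "next", "previous": "prev", "prev": "prev"}


class RelationParser(HTMLParser):
    """Collects 'next'/'prev' targets from rel attributes and anchor text."""

    def __init__(self) -> None:
        super().__init__()
        self.relations: Dict[str, str] = {}
        self._href: Optional[str] = None
        self._text = ""

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        for rel in (attr.get("rel") or "").lower().split():
            if rel in KEYWORDS:
                self.relations.setdefault(KEYWORDS[rel], href)
        if tag == "a":
            self._href, self._text = href, ""

    def handle_data(self, data):
        if self._href is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            word = self._text.strip().lower()
            if word in KEYWORDS:
                # setdefault keeps the first relation found for each keyword
                self.relations.setdefault(KEYWORDS[word], self._href)
            self._href = None


def page_relations(html: str) -> Dict[str, str]:
    """Return a mapping such as {'next': ..., 'prev': ...} for one page."""
    parser = RelationParser()
    parser.feed(html)
    parser.close()
    return parser.relations
```

When such information is present, the successor of a page can be read off directly instead of being inferred from link positions.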
Experimental Results
• Data
  • Toy data: Python tutorial in HTML
  • Tutorial split into several chapters and sections
  • Several cycles
• Mutilated data
  • Certain pages deleted (missing links)
• 100% detection in all cases
Other experiments planned
• Real test: unorganized web pages
• Difficulties:
  • Finding navigational links
  • Noise (advertisements, etc.)
  • Dynamically generated
• Will the relationships hold?
Leads us to …
• Concatenate detected sequences for analysis
• Modify retrieval mechanism
• Return sets of pages as results
• Improve mirror/duplicate detection
Future Work
• Consider other relations
• Unifying framework?
• Improve identification of navigational links