Part III - Web Mining © Prentice Hall 1
Chapter 7 Web Mining Outline
Goal: Examine the use of data mining on the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining
Web Mining Issues
Size
  >350 million pages (1999)
  Grows at about 1 million pages a day
  Google indexes 3 billion documents
Diverse types of data
Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
  Profiles
  Registration information
  Cookies
Web Content Mining
Extends work of basic search engines
Search Engines
  IR application
  Keyword based
  Similarity between query and document
Crawlers
Indexing
Profiles
Link analysis
Crawlers
Robot (spider) traverses the hypertext structure in the Web.
Collects information from visited pages
Used to construct indexes for search engines
Traditional Crawler: visits entire Web (?) and replaces index
Periodic Crawler: visits portions of the Web and updates subset of index
Incremental Crawler: selectively searches the Web and incrementally modifies index
Focused Crawler: visits pages related to a particular subject
Focused Crawler
Only visit links from a page if that page is determined to be relevant.
Classifier is static after learning phase.
Components:
  Classifier, which assigns a relevance score to each page based on the crawl topic.
  Distiller to identify hub pages.
  Crawler, which visits pages based on classifier and distiller scores.
Focused Crawler
Classifier relates documents to topics.
Classifier also determines how useful outgoing links are.
Hub pages contain links to many relevant pages. They must be visited even if their own relevance score is not high.
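The crawl policy on these two slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `web` stands in for real page fetching, `relevance` is a hypothetical keyword-overlap score rather than a trained classifier, the `threshold` value is arbitrary, and the distiller (hub detection) is left out entirely.

```python
import heapq

def relevance(text, topic_words):
    """Hypothetical classifier: fraction of words that belong to the topic."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_words for w in words) / len(words)

def focused_crawl(web, seed, topic_words, threshold=0.2):
    """web maps URL -> (page text, outgoing URLs); it replaces real fetching."""
    frontier = [(-1.0, seed)]      # max-heap via negated priority
    seen = {seed}
    visited = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, out_links = web[url]
        score = relevance(text, topic_words)
        visited.append((url, score))
        if score < threshold:
            continue               # do not follow links from an irrelevant page
        for link in out_links:
            if link in web and link not in seen:
                seen.add(link)
                # links inherit the relevance of the page that cites them
                heapq.heappush(frontier, (-score, link))
    return visited

web = {
    "seed": ("data mining web mining", ["a", "b"]),
    "a": ("cooking recipes pasta", ["c"]),
    "b": ("web usage mining logs", ["d"]),
    "c": ("more pasta recipes", []),
    "d": ("mining web structure", []),
}
for url, score in focused_crawl(web, "seed", {"web", "mining", "data"}):
    print(url, round(score, 2))
```

Page "c" is never fetched because the only page linking to it ("a") was judged irrelevant. A fuller version would add a distiller so that hub pages are still expanded despite a low relevance score.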
Context Focused Crawler
Context Graph:
  A context graph is created for each seed document.
  The root is the seed document.
  Nodes at each level show documents with links to documents at the next higher level.
  Updated during the crawl itself.
Approach:
  1. Construct context graph and classifiers using seed documents as training data.
  2. Perform crawling using the classifiers and context graph created.
Virtual Web View
Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed with SQL-type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in the first layer of the MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.
Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user’s preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content-based filtering retrieves pages based on similarity between pages and user profiles.
Web Structure Mining
Mine structure (links, graph) of the Web
Techniques
  PageRank
  CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.
PageRank
Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of a page is calculated based on the number of pages which point to it (backlinks).
Weighting is used to give more importance to backlinks coming from important pages.
PageRank (cont’d)
PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n)
  PR(i): PageRank for a page i which points to target page p.
  N_i: number of links coming out of page i.
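The formula can be applied repeatedly until the values settle. The sketch below assumes a small hypothetical link graph stored in a dictionary and treats c as a normalization factor that keeps the total rank equal to 1; the function name `pagerank` and the fixed iteration count are illustrative choices, and the random-jump term of fuller PageRank formulations is deliberately left out.

```python
def pagerank(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        raw = {p: 0.0 for p in pages}
        for i, targets in links.items():
            if not targets:
                continue
            share = pr[i] / len(targets)   # PR(i) / N_i
            for p in targets:
                raw[p] += share            # page i is a backlink of p
        total = sum(raw.values())
        if total == 0:
            break                          # no links at all: nothing to propagate
        c = 1.0 / total                    # assumed role of c: normalization
        pr = {p: c * raw[p] for p in pages}
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, rank in sorted(pagerank(links).items()):
    print(page, round(rank, 3))
```

In this toy graph, pages A and C end up with equal rank (about 0.4 each) and B with about 0.2, because B receives only half of A's rank while C receives the other half plus all of B's.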
CLEVER
Identify authoritative and hub pages.
Authoritative Pages:
  Highly important pages.
  Best source for requested information.
Hub Pages:
  Contain links to highly important pages.
HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find a set of relevant pages, R.
Identify hub and authority pages for these.
  Expand R to a base set, B, of pages linked to or from R.
  Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.
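The weight calculation step can be sketched as a mutual-reinforcement loop: a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of the pages it links to. The code assumes the base set B is already available as a list of (source, target) links; the function name `hits`, the iteration count, and the sample edges are hypothetical.

```python
import math

def hits(edges, iterations=50):
    """edges is a list of (source, target) links within the base set B."""
    pages = {p for edge in edges for p in edge}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority weight: sum of hub weights of pages that link in
        auth = {p: 0.0 for p in pages}
        for src, dst in edges:
            auth[dst] += hub[src]
        # hub weight: sum of authority weights of pages linked to
        hub = {p: 0.0 for p in pages}
        for src, dst in edges:
            hub[src] += auth[dst]
        # normalize both vectors so the weights stay bounded
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

Here "a1" comes out as the strongest authority because three pages point to it, and "h1" as the strongest hub because it points to both authorities.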
Web Usage Mining
Web Usage Mining Applications
Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)
Web Usage Mining Activities
Preprocessing Web log
  Cleanse
    Remove extraneous information
  Sessionize
    Session: sequence of pages referenced by one user at a sitting.
Pattern Discovery
  Count patterns that occur in sessions
  A pattern is a sequence of page references in a session.
  Similar to association rules
    Transaction: session
    Itemset: pattern (or subset)
    Order is important
Pattern Analysis
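The pattern discovery step can be illustrated by counting, for each candidate pattern, how many sessions contain it, much as support is counted for itemsets over transactions. The sketch below makes several assumptions that the slides do not prescribe: patterns are contiguous page sequences of one fixed length, support is the number of sessions containing the pattern, and the name `frequent_patterns` is hypothetical.

```python
from collections import Counter

def frequent_patterns(sessions, length, min_support):
    """Count contiguous page sequences of a given length across sessions."""
    counts = Counter()
    for session in sessions:
        # a set, so each pattern is counted at most once per session
        seen = {tuple(session[i:i + length])
                for i in range(len(session) - length + 1)}
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_support}

sessions = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "C", "A", "B"],
]
print(frequent_patterns(sessions, length=2, min_support=2))
```

With these three sessions, the pair (A, B) is supported by all three and (B, C) by two, while (B, D) and (C, A) fall below the threshold. Unlike an itemset, (A, B) and (B, A) are different patterns because order matters.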
ARs in Web Mining
Web Mining:
  Content
  Structure
  Usage
Frequent patterns of sequential page references in Web searching.
Uses:
  Caching
  Clustering users
  Develop user profiles
  Identify important pages
Web Usage Mining Issues
Identification of exact user not possible.
Exact sequence of pages referenced by a user cannot be determined due to caching.
Session not well defined
Security, privacy, and legal issues
Web Log Cleansing
Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code).
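A minimal sketch of these three cleansing steps follows. The record layout (IP address, URL, HTTP status), the list of non-page file suffixes, the rule that a status of 400 or above marks an error, and the ID scheme (U1, P1, ...) are all assumptions made for illustration; a real log format would need its own parsing.

```python
# Assumed examples of non-page resources (figures and code)
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")

def cleanse(records):
    """records: iterable of (ip, url, status) tuples from a Web log."""
    user_ids, page_ids = {}, {}
    cleaned = []
    for ip, url, status in records:
        if status >= 400:
            continue                      # delete error records
        if url.lower().endswith(NON_PAGE_SUFFIXES):
            continue                      # delete non-page data
        # replace IP and URL with unique but non-identifying IDs
        uid = user_ids.setdefault(ip, f"U{len(user_ids) + 1}")
        pid = page_ids.setdefault(url, f"P{len(page_ids) + 1}")
        cleaned.append((uid, pid))
    return cleaned

log = [
    ("10.0.0.1", "/index.html", 200),
    ("10.0.0.1", "/logo.gif", 200),
    ("10.0.0.2", "/index.html", 200),
    ("10.0.0.1", "/missing.html", 404),
    ("10.0.0.2", "/news.html", 200),
]
print(cleanse(log))
# [('U1', 'P1'), ('U2', 'P1'), ('U2', 'P2')]
```

The same IP address or URL always maps to the same ID, so later steps can still group references by user and by page without seeing the original values.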
Sessionizing
Divide Web log into sessions.
Two common techniques:
  Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
  All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
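The second technique (interclick threshold) might look like the sketch below. It assumes log entries of the form (IP address, timestamp in seconds, page); the function name `sessionize` and the 30-minute default threshold are hypothetical choices, not values taken from the slides.

```python
from collections import defaultdict

def sessionize(log, max_gap=30 * 60):
    """Split each IP's references wherever the interclick time reaches max_gap."""
    by_ip = defaultdict(list)
    for ip, ts, page in sorted(log, key=lambda r: r[1]):
        by_ip[ip].append((ts, page))
    sessions = []
    for ip, hits in by_ip.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts < max_gap:
                current.append(page)          # same sitting
            else:
                sessions.append((ip, current))
                current = [page]              # gap too long: new session
        sessions.append((ip, current))
    return sessions

log = [
    ("a", 0, "A"),
    ("a", 100, "B"),
    ("b", 50, "A"),
    ("a", 5000, "C"),
]
print(sessionize(log))
# [('a', ['A', 'B']), ('a', ['C']), ('b', ['A'])]
```

The first technique would differ only in the split rule: instead of comparing each reference with the previous one, it would compare it with the first reference of the current session and start a new session once the fixed interval has elapsed.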
Data Structures
Keep track of patterns identified during Web usage mining process
Common techniques:
  Trie
  Suffix Tree
  Generalized Suffix Tree
  WAP Tree
Trie vs. Suffix Tree
Trie:
  Rooted tree
  Edges labeled with a character (page) from the pattern
  Path from root to leaf represents a pattern.
Suffix Tree:
  A single child is collapsed with its parent. The edge contains the labels of both prior edges.
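Both structures can be illustrated with nested dictionaries. The sketch below builds a trie from a few patterns and then applies the collapsing step described for the suffix tree. It assumes each page is a single character and appends a "$" end marker to every pattern (an added convention, so that a pattern which is a prefix of another is not lost). Note that it only shows the collapsing of single-child chains; it does not insert every suffix of a session.

```python
def build_trie(patterns, end="$"):
    """Each edge is labeled with one page; every pattern ends at a leaf."""
    root = {}
    for pattern in patterns:
        node = root
        for page in pattern + end:
            node = node.setdefault(page, {})
    return root

def collapse(node):
    """Merge every single-child node with its parent, joining the edge labels."""
    result = {}
    for label, child in node.items():
        while len(child) == 1:
            next_label, grandchild = next(iter(child.items()))
            label += next_label
            child = grandchild
        result[label] = collapse(child)
    return result

trie = build_trie(["ABC", "ABD", "AE"])
print(trie)
print(collapse(trie))
# {'A': {'B': {'C$': {}, 'D$': {}}, 'E$': {}}}
```

The branching nodes after "A" and "AB" survive, while each chain that ends in a single leaf is folded into one edge such as "C$".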
Generalized Suffix Tree
Suffix tree for multiple sessions.
Contains patterns from all sessions.
Maintains count of frequency of occurrence of a pattern in the node.
WAP Tree:
  Compressed version of generalized suffix tree
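The idea of keeping a frequency count in each node can be shown with a simplified stand-in: an uncompressed trie that holds every suffix of every session. This is not a true generalized suffix tree (no edges are collapsed) and not a WAP tree. The node layout and the names `build_suffix_trie` and `pattern_count` are assumptions, and the count stored here is the number of occurrences of a consecutive pattern, not the number of distinct sessions containing it.

```python
def build_suffix_trie(sessions):
    """Insert every suffix of every session; each node counts its pattern."""
    root = {"count": 0, "children": {}}
    for session in sessions:
        for start in range(len(session)):
            node = root
            for page in session[start:]:
                node = node["children"].setdefault(
                    page, {"count": 0, "children": {}})
                node["count"] += 1
    return root

def pattern_count(root, pattern):
    """Number of times the consecutive pattern occurs across all sessions."""
    node = root
    for page in pattern:
        node = node["children"].get(page)
        if node is None:
            return 0
    return node["count"]

root = build_suffix_trie(["ABC", "ABD", "BC"])
print(pattern_count(root, "AB"))  # 2
print(pattern_count(root, "BC"))  # 2
print(pattern_count(root, "CA"))  # 0
```

Every occurrence of a pattern is the start of exactly one suffix, so walking the pattern from the root and reading the node's count gives its frequency directly.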
Types of Patterns
Algorithms have been developed to discover different types of patterns.
Properties:
  Ordered: characters (pages) must occur in the exact order in the original session.
  Duplicates: duplicate characters are allowed in the pattern.
  Consecutive: all characters in the pattern must occur consecutively in the given session.
  Maximal: not a subsequence of another pattern.
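Three of these properties translate directly into small checks. The sketch below is illustrative only: sessions and patterns are strings with one character per page, and the function names are hypothetical.

```python
def occurs_in_order(pattern, session):
    """Ordered: pages appear in the session in this order, gaps allowed."""
    remaining = iter(session)
    # each membership test consumes the iterator up to the matching page
    return all(page in remaining for page in pattern)

def occurs_consecutively(pattern, session):
    """Consecutive: pages appear in the session with no gaps between them."""
    n = len(pattern)
    return any(session[i:i + n] == pattern
               for i in range(len(session) - n + 1))

def maximal_patterns(patterns):
    """Maximal: keep patterns that are not a subsequence of another pattern."""
    return [p for p in patterns
            if not any(p != q and occurs_in_order(p, q) for q in patterns)]

print(occurs_in_order("ACD", "ABCD"))          # True
print(occurs_consecutively("ACD", "ABCD"))     # False
print(occurs_consecutively("BC", "ABCD"))      # True
print(maximal_patterns(["AB", "ABC", "BD"]))   # ['ABC', 'BD']
```

The pattern "ACD" is ordered but not consecutive in the session "ABCD", which is exactly the distinction between the first and third properties.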
Pattern Types
Association Rules
  None of the properties hold
Episodes
  Only ordering holds
Sequential Patterns
  Ordered and maximal
Forward Sequences
  Ordered, consecutive, and maximal
Maximal Frequent Sequences
  All properties hold
Episodes
Partially ordered set of pages
Serial episode: totally ordered with time constraint
Parallel episode: partially ordered with time constraint
General episode: partially ordered with no time constraint