Chapter 7 Web Mining Outline

Date post: 11-Feb-2022

Part III - Web Mining © Prentice Hall 1

Chapter 7 Web Mining Outline

Goal: Examine the use of data mining on the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining

Web Mining Issues

Size
  >350 million pages (1999)
  Grows at about 1 million pages a day
  Google indexes 3 billion documents
Diverse types of data

Web Data

Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
  Profiles
  Registration information
  Cookies

Web Mining Taxonomy

Modified from [zai01]

Web Content Mining

Extends work of basic search engines
Search Engines
  IR application
  Keyword based
  Similarity between query and document
  Crawlers
  Indexing
  Profiles
  Link analysis

Crawlers

Robot (spider) traverses the hypertext structure in the Web.
Collect information from visited pages
Used to construct indexes for search engines
Traditional Crawler: visits entire Web (?) and replaces index
Periodic Crawler: visits portions of the Web and updates subset of index
Incremental Crawler: selectively searches the Web and incrementally modifies index
Focused Crawler: visits pages related to a particular subject
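
The traversal a crawler performs can be sketched as a breadth-first walk over a link graph that fills an index as it goes. This is only a minimal illustration: the in-memory `WEB` dictionary stands in for real page fetches, and the names `WEB` and `crawl` are assumptions made for the example, not part of the source.

```python
from collections import deque

# Hypothetical toy Web: page -> (text, outgoing links). Stands in for HTTP fetches.
WEB = {
    "a": ("data mining intro", ["b", "c"]),
    "b": ("web mining", ["c"]),
    "c": ("crawlers and indexing", ["a", "d"]),
    "d": ("usage mining", []),
}

def crawl(seeds, max_pages=100):
    """Breadth-first traversal that builds an inverted index: word -> set of pages."""
    index = {}
    visited = set()
    frontier = deque(seeds)
    while frontier and len(visited) < max_pages:
        page = frontier.popleft()
        if page in visited or page not in WEB:
            continue
        visited.add(page)
        text, links = WEB[page]
        # Collect information from the visited page into the index.
        for word in text.split():
            index.setdefault(word, set()).add(page)
        # Follow the hypertext structure.
        frontier.extend(link for link in links if link not in visited)
    return visited, index
```

A periodic or incremental crawler would call something like `crawl` on a subset of seeds and merge the result into an existing index instead of replacing it.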

Focused Crawler

Only visit links from a page if that page is determined to be relevant.
Classifier is static after learning phase.
Components:
  Classifier, which assigns a relevance score to each page based on the crawl topic.
  Distiller, which identifies hub pages.
  Crawler, which visits pages based on classifier and distiller scores.

Focused Crawler

Classifier relates documents to topics.
Classifier also determines how useful outgoing links are.
Hub pages contain links to many relevant pages. They must be visited even if their relevance score is not high.
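
The interplay of classifier, distiller, and crawler can be sketched with a priority queue. Everything here is an assumption made for illustration: the toy `WEB` graph, the keyword-overlap `relevance` function standing in for a trained classifier, and a `hub_score` that peeks at linked pages directly, which a real distiller could only estimate from pages already crawled.

```python
import heapq

# Hypothetical toy Web: page -> (text, outgoing links).
WEB = {
    "seed": ("web mining survey", ["hub", "sports"]),
    "hub": ("links", ["m1", "m2"]),
    "m1": ("web usage mining", []),
    "m2": ("web structure mining", []),
    "sports": ("football scores", ["more_sports"]),
    "more_sports": ("football news", []),
}
TOPIC = {"web", "mining"}

def relevance(page):
    """Stand-in classifier: fraction of topic words present in the page text."""
    words = set(WEB[page][0].split())
    return len(words & TOPIC) / len(TOPIC)

def hub_score(page):
    """Stand-in distiller: fraction of outgoing links that lead to relevant pages."""
    links = WEB[page][1]
    if not links:
        return 0.0
    return sum(relevance(link) >= 0.5 for link in links) / len(links)

def focused_crawl(seed, threshold=0.5):
    """Visit pages in priority order; follow links only from relevant pages or hubs."""
    visited, order = set(), []
    frontier = [(-1.0, seed)]  # max-heap via negated priority
    while frontier:
        _, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        score = max(relevance(page), hub_score(page))
        if score < threshold:
            continue  # irrelevant non-hub page: do not follow its links
        for link in WEB[page][1]:
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return order
```

In this toy graph the page `hub` has no topic words of its own, yet its links are followed because the distiller score is high, while the links of `sports` are never expanded.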

Focused Crawler [figure not reproduced]

Context Focused Crawler

Context Graph:
  Context graph created for each seed document.
  Root is the seed document.
  Nodes at each level show documents with links to documents at next higher level.
  Updated during crawl itself.
Approach:
  1. Construct context graph and classifiers using seed documents as training data.
  2. Perform crawling using classifiers and context graph created.
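
Step 1 can be sketched as a layer-by-layer walk over backlinks starting at the seed. This is a hedged illustration only: the `backlinks` mapping (page to the pages that link to it) and the function name `build_context_graph` are assumptions, and training the per-layer classifiers is left out.

```python
def build_context_graph(seed, backlinks, depth):
    """Return layers: layers[0] = [seed]; layers[i] = pages that link to layer i-1."""
    layers = [[seed]]
    seen = {seed}
    for _ in range(depth):
        next_layer = []
        for page in layers[-1]:
            for parent in backlinks.get(page, []):
                if parent not in seen:  # keep each page at its closest layer only
                    seen.add(parent)
                    next_layer.append(parent)
        if not next_layer:
            break
        layers.append(next_layer)
    return layers
```

During the crawl, a page classified into a layer close to the root would be given higher priority, since its links are expected to reach a relevant document in fewer steps.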

Context Graph [figure not reproduced]

Virtual Web View

Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed with SQL-type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in first layer of MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.

Personalization

Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user's preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content based filtering retrieves pages based on similarity between pages and user profiles.
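
Content based filtering can be sketched as ranking pages by cosine similarity between term-count vectors of each page and of the user profile. The choice of cosine similarity, the bag-of-words representation, and the names `cosine` and `recommend` are assumptions for illustration; the source does not name a specific similarity measure.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(profile_text, pages, k=2):
    """Return up to k page names most similar to the profile, best first."""
    profile = Counter(profile_text.split())
    scored = [(cosine(profile, Counter(text.split())), name) for name, text in pages.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for score, name in scored[:k] if score > 0]
```

Collaborative filtering would use the same similarity idea, but between users' rating vectors rather than between a profile and page text.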

Web Structure Mining

Mine structure (links, graph) of the Web
Techniques
  PageRank
  CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.

PageRank

Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it (backlinks).
Weighting is used to provide more importance to backlinks coming from important pages.

PageRank (cont’d)

PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n)
  PR(i): PageRank for a page i which points to target page p.
  N_i: number of links coming out of page i
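
The formula can be evaluated iteratively: start from equal ranks, apply the sum repeatedly, and rescale. The sketch below treats c as a normalization constant chosen each round so that the ranks sum to 1, which is an assumption made here; it also assumes every page has at least one outgoing link and omits refinements such as damping or random jumps.

```python
def pagerank(links, iterations=50):
    """links: page -> list of pages it points to. Returns page -> rank (sums to 1)."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for i, outs in links.items():
            for p in outs:
                new[p] += pr[i] / len(outs)  # contribution PR(i)/N_i to target p
        total = sum(new.values())
        pr = {p: v / total for p, v in new.items()}  # c = 1/total
    return pr
```

For the three-page graph `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}` the ranks settle near 0.4, 0.2, and 0.4: page `c` matches `a` because it collects backlinks from both `a` and `b`.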

CLEVER

Identify authoritative and hub pages.
Authoritative Pages:
  Highly important pages.
  Best source for requested information.
Hub Pages:
  Contain links to highly important pages.

HITS

Hyperlink-Induced Topic Search
Based on a set of keywords, find set of relevant pages, R.
Identify hub and authority pages for these.
  Expand R to a base set, B, of pages linked to or from R.
  Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.
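
The weight calculation for authorities and hubs can be sketched as a mutual-reinforcement loop over the base set. This is a minimal sketch under assumptions: the `links` dictionary represents the link graph of B, the keyword search and expansion steps are skipped, and the fixed iteration count and Euclidean normalization are choices made here.

```python
import math

def hits(links, iterations=50):
    """links: page -> list of pages it points to. Returns (hub, authority) weights."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of pages pointing to the page.
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]
        # Hub weight: sum of authority weights of the pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the weights stay bounded.
        for weights in (auth, hub):
            norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth
```

Pages would then be returned in decreasing order of authority (or hub) weight.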

HITS Algorithm [figure not reproduced]

Web Usage Mining

Web Usage Mining Applications

Personalization
Improve structure of a site's Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)

Web Usage Mining Activities

Preprocessing Web log
  Cleanse
  Remove extraneous information
  Sessionize
    Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
  Count patterns that occur in sessions
  Pattern is sequence of page references in session.
  Similar to association rules
    Transaction: session
    Itemset: pattern (or subset)
    Order is important
Pattern Analysis
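
Pattern discovery can be sketched as counting, for each candidate sequence of pages, the number of sessions that contain it. The sketch assumes patterns are runs of consecutive page references of a fixed length and that support is counted once per session; both choices, and the name `frequent_patterns`, are illustrative rather than taken from the source.

```python
from collections import Counter

def frequent_patterns(sessions, length, min_support):
    """Count consecutive page sequences of a given length; keep those in >= min_support sessions."""
    counts = Counter()
    for session in sessions:
        # A set, so each pattern counts at most once per session (the transaction).
        seen = {tuple(session[i:i + length]) for i in range(len(session) - length + 1)}
        counts.update(seen)
    return {pattern: c for pattern, c in counts.items() if c >= min_support}
```

Because order matters, the pattern (A, B) is distinct from (B, A), unlike an itemset in ordinary association rule mining.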

ARs in Web Mining

Web Mining:
  Content
  Structure
  Usage
Frequent patterns of sequential page references in Web searching.
Uses:
  Caching
  Clustering users
  Develop user profiles
  Identify important pages

Web Usage Mining Issues

Identification of exact user not possible.
Exact sequence of pages referenced by a user not possible due to caching.
Session not well defined
Security, privacy, and legal issues

Web Log Cleansing

Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code).
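
The three cleansing steps can be sketched over a list of log records. The record layout `(ip, timestamp, url, status)`, the extension list used to detect non-page data, the rule that status codes of 400 and above are errors, and the `U1`/`P1` ID format are all assumptions made for this example.

```python
def cleanse(records, non_page_ext=(".gif", ".jpg", ".png", ".css", ".js")):
    """Drop error and non-page records; replace IPs and URLs with opaque IDs."""
    user_ids, page_ids, cleaned = {}, {}, []
    for ip, timestamp, url, status in records:
        if status >= 400:
            continue  # error record
        if url.lower().endswith(non_page_ext):
            continue  # figure, stylesheet, or script rather than a page
        uid = user_ids.setdefault(ip, f"U{len(user_ids) + 1}")
        pid = page_ids.setdefault(url, f"P{len(page_ids) + 1}")
        cleaned.append((uid, timestamp, pid))
    return cleaned
```

The same IP or URL always maps to the same ID within one run, so sessions and patterns can still be reconstructed from the cleansed log.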

Sessionizing

Divide Web log into sessions.
Two common techniques:
  Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
  All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
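
The second technique (interclick threshold) can be sketched as follows. The record layout `(user, timestamp_seconds, page)` and the function name `sessionize` are assumptions, and the default threshold simply borrows the 25 minute figure that the text gives for the first technique.

```python
def sessionize(records, max_gap=25 * 60):
    """Split (user, timestamp, page) records into sessions by interclick time.

    A reference joins the user's current session only if it follows the
    previous reference by less than max_gap seconds.
    """
    sessions = []
    open_session = {}  # user -> (last timestamp, list of pages in current session)
    for user, ts, page in sorted(records, key=lambda r: r[1]):
        last = open_session.get(user)
        if last is None or ts - last[0] >= max_gap:
            pages = []
            sessions.append((user, pages))  # start a new session
        else:
            pages = last[1]
        pages.append(page)
        open_session[user] = (ts, pages)
    return sessions
```

The first technique would instead compare each timestamp with the start of the session rather than with the previous click.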

Data Structures

Keep track of patterns identified during Web usage mining process
Common techniques:
  Trie
  Suffix Tree
  Generalized Suffix Tree
  WAP Tree

Trie vs. Suffix Tree

Trie:
  Rooted tree
  Edges labeled with a character (page) from pattern
  Path from root to leaf represents pattern.
Suffix Tree:
  Single child collapsed with parent. Edge contains labels of both prior edges.
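
The difference between the two structures can be sketched with nested dictionaries: build a trie with one character (page) per edge, then collapse every single-child chain into one edge with a concatenated label. The functions `build_trie` and `collapse` are illustrative names, pages are assumed to be single characters, and the sketch does not mark where a pattern ends, so a pattern that is a prefix of another is not distinguished.

```python
def build_trie(patterns):
    """Nested-dict trie: each key is an edge label (one page), each value a subtree."""
    root = {}
    for pattern in patterns:
        node = root
        for page in pattern:
            node = node.setdefault(page, {})
    return root

def collapse(node):
    """Merge every single-child chain into one edge whose label joins the edge labels."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:
            (next_label, grandchild), = child.items()
            label += next_label
            child = grandchild
        out[label] = collapse(child)
    return out
```

For the patterns ABC, ABD, and XYZ, the trie needs eight edges, while the collapsed form needs four: AB, C, D, and XYZ.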

Trie and Suffix Tree [figure not reproduced]

Generalized Suffix Tree

Suffix tree for multiple sessions.
Contains patterns from all sessions.
Maintains count of frequency of occurrence of a pattern in the node.
WAP Tree:
  Compressed version of generalized suffix tree
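
The counting behaviour can be sketched by inserting every suffix of every session into one shared tree and incrementing a counter at each node passed. For brevity this sketch keeps one page per edge (an uncompressed suffix trie rather than a collapsed suffix tree), and the node layout and function names are assumptions.

```python
def new_node():
    return {"count": 0, "children": {}}

def add_session(root, session):
    """Insert all suffixes of a session, counting each pattern occurrence in its node."""
    for start in range(len(session)):
        node = root
        for page in session[start:]:
            node = node["children"].setdefault(page, new_node())
            node["count"] += 1

def pattern_count(root, pattern):
    """Number of times the pattern occurs as consecutive pages across all sessions."""
    node = root
    for page in pattern:
        node = node["children"].get(page)
        if node is None:
            return 0
    return node["count"]
```

After adding the sessions ABC and ABD, `pattern_count(root, "AB")` is 2 and `pattern_count(root, "BC")` is 1.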

Types of Patterns

Algorithms have been developed to discover different types of patterns.
Properties:
  Ordered: Characters (pages) must occur in the exact order in the original session.
  Duplicates: Duplicate characters are allowed in the pattern.
  Consecutive: All characters in pattern must occur consecutively in given session.
  Maximal: Not subsequence of another pattern.
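
The ordered, consecutive, and maximal properties can each be written as a small check. This is a sketch with hypothetical function names; sessions and patterns are assumed to be strings or lists of page identifiers.

```python
def occurs_in_order(pattern, session):
    """Ordered: pages of the pattern appear in the session in the same order (gaps allowed)."""
    remaining = iter(session)
    return all(page in remaining for page in pattern)

def occurs_consecutively(pattern, session):
    """Consecutive: the pattern appears as an unbroken run of pages in the session."""
    n = len(pattern)
    return any(list(session[i:i + n]) == list(pattern)
               for i in range(len(session) - n + 1))

def maximal(patterns):
    """Maximal: keep only patterns that are not a subsequence of another pattern."""
    return [p for p in patterns
            if not any(p != q and occurs_in_order(p, q) for q in patterns)]
```

For the session ABC, the pattern AC is ordered but not consecutive, BC is both, and CA is neither.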

Pattern Types

Association Rules: None of the properties hold
Episodes: Only ordering holds
Sequential Patterns: Ordered and maximal
Forward Sequences: Ordered, consecutive, and maximal
Maximal Frequent Sequences: All properties hold

Episodes

Partially ordered set of pages
Serial episode: totally ordered with time constraint
Parallel episode: partially ordered with time constraint
General episode: partially ordered with no time constraint
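
Detecting a serial episode can be sketched as looking for the pages in order inside a time window. The event layout `(timestamp, page)`, the window measured from the first page of the episode, and the function name are assumptions. The second function treats a parallel episode as an unordered set of pages within the window, which is only the simplest special case of a partial order.

```python
def serial_episode_occurs(events, episode, window):
    """events: (timestamp, page) pairs sorted by time. True if the pages of a
    non-empty episode occur in order within `window` time units of the first page."""
    for i, (start, page) in enumerate(events):
        if page != episode[0]:
            continue
        pos = 1
        for ts, p in events[i + 1:]:
            if ts - start > window:
                break
            if pos < len(episode) and p == episode[pos]:
                pos += 1
        if pos == len(episode):
            return True
    return False

def parallel_episode_occurs(events, episode, window):
    """True if all pages of the episode occur, in any order, within one window."""
    needed = set(episode)
    for i, (start, _) in enumerate(events):
        seen = {p for ts, p in events[i:] if ts - start <= window}
        if needed <= seen:
            return True
    return False
```

A general episode, having no time constraint, would drop the window test altogether.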

DAG for Episode [figure not reproduced]

