Date post: 13-Dec-2015
Page 1: Web Mining G.Anuradha References from Dunham. Objective What is web mining? Taxonomy of web mining? Web content mining Web structure mining Web usage.

Web Mining

G. Anuradha
References from Dunham

Objective

• What is web mining?
• Taxonomy of web mining?
• Web content mining
• Web structure mining
• Web usage mining

What is web mining?

• Mining of data related to the WWW
– Data present in Web pages or data related to web activity
• Web data is classified as
– Content of web pages
– Intrapage structure, which includes code and actual linkage
– Usage data: how the pages are used by visitors
– User profiles

Taxonomy of Web Mining

Web Content Mining

• Extension of basic search engines
• Search engines are keyword-based
• Traditional search engines use
– crawlers to search the Web and gather information
– indexing techniques to store the information
– query processing to provide fast and accurate information to users

Taxonomy of Web content mining

WEB CONTENT MINING
• Agent-based approach: uses software systems to perform the content mining, e.g. search engines
• Database approach: views Web data as belonging to a database; the Web is treated as a multilevel database, and query languages are used for querying the data
• Content mining is a type of text mining

Text mining hierarchy

Text mining functions, ordered from simple to complex:
Keyword → Term Association → Similarity Search → Classification and Clustering → Natural Language Processing

Crawlers

How do crawlers work?

• A robot, spider or crawler is a program that traverses the hypertext structure of the Web
• The page at which the crawler starts is referred to as the seed URL
• All links from that page are recorded and saved in a queue
• The new pages are in turn searched and their links are saved
• The crawler collects information about each page, extracts keywords and stores indices for users
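
The queue-based traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_WEB` is a made-up in-memory stand-in for the Web (no network access), and "extracting keywords" is reduced to splitting the page text into words.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed.html": ("web mining overview", ["a.html", "b.html"]),
    "a.html": ("crawlers and spiders", ["b.html", "c.html"]),
    "b.html": ("search engine index", ["seed.html"]),
    "c.html": ("page rank and hubs", []),
}


def crawl(seed_url, web, max_pages=100):
    """Breadth-first crawl starting from the seed URL.

    Returns the visit order and an inverted index: keyword -> set of URLs.
    """
    queue = deque([seed_url])   # links waiting to be visited
    seen = {seed_url}           # avoids queueing the same URL twice
    order = []
    index = {}

    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url not in web:      # dead link in the toy web
            continue
        text, links = web[url]
        order.append(url)
        for keyword in text.split():        # crude keyword extraction
            index.setdefault(keyword, set()).add(url)
        for link in links:                  # record the links in the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order, index


order, index = crawl("seed.html", TOY_WEB)
print(order)
print(sorted(index["and"]))
```

A real crawler would also fetch pages over HTTP, honour robots.txt and limit its request rate; none of that is modelled here.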

Types of crawlers

• Periodic crawler: activated periodically; each time it runs, it replaces the existing index
• Incremental crawler: updates the index incrementally instead of replacing it
• Focused crawler: visits pages related to topics of interest

Focused crawling

Architecture of focused crawler

• Has 3 components:
– Crawler: performs the actual crawling on the Web. It visits pages according to a priority-based structure that the classifier and distiller associate with pages
– Classifier: associates a relevance score with each document with respect to the crawl topic. Determines the resource rating
– Distiller: determines which pages contain links to many relevant pages. These are called hub pages
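
A minimal sketch of how the three components might fit together, assuming a toy in-memory web. The classifier here is a simple keyword-overlap score and the distiller just counts links to relevant pages; both are hypothetical stand-ins, not the classifiers used in real focused crawlers.

```python
import heapq

# Hypothetical toy web: URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("data mining course", ["hub", "sports"]),
    "hub": ("links about mining", ["m1", "m2"]),
    "sports": ("football scores", ["s1"]),
    "m1": ("web mining data", []),
    "m2": ("text mining data", []),
    "s1": ("football league table", []),
}
TOPIC = {"mining", "data"}


def classify(text):
    """Classifier stand-in: fraction of topic words present in the page."""
    return len(set(text.split()) & TOPIC) / len(TOPIC)


def focused_crawl(seed, web, threshold=0.5, max_pages=10):
    # Crawler: max-heap via negated priority; a link inherits the
    # relevance score of the page on which it was found.
    frontier = [(-1.0, seed)]
    seen = {seed}
    relevance = {}
    while frontier and len(relevance) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        relevance[url] = classify(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance[url], link))
    # Distiller stand-in: count links to pages judged relevant.
    hub_score = {
        url: sum(relevance.get(link, 0.0) >= threshold for link in web[url][1])
        for url in relevance
    }
    return relevance, hub_score


relevance, hub_score = focused_crawl("seed", TOY_WEB)
print(list(relevance))      # visit order
print(hub_score["hub"])     # number of relevant pages "hub" links to
```

In this sketch links found on relevant pages are expanded before links found on irrelevant ones, which is the priority idea the slide describes.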

Harvest Rate

• Harvest rate is the performance objective for a focused crawler
• The seed documents are used to begin the focused crawling
• The relevant documents are found using
– Hard focus: follows links if there is an ancestor of that node which is marked as good
– Soft focus: identifies the relevant page with a probability
• c is a page, and good(c) is an indication that the page is a relevant page
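
The formula that accompanied this slide did not survive in the transcript. In the usual formulation of focused crawling (an assumption here, not taken from the slide), the soft-focus relevance of a page d sums the classifier's probabilities over the nodes c that are marked good, and the harvest rate is the fraction of crawled pages that turn out to be relevant:

```latex
R(d) = \sum_{good(c)} P(c \mid d)
\qquad
\text{harvest rate} = \frac{\text{number of relevant pages crawled}}{\text{number of pages crawled}}
```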

Context focused crawler

• Crawling takes place in two phases
– Training phase: context graphs and classifiers are constructed using a set of seed documents as the training set
– Crawling phase: the classifiers are used for crawling and the context graphs are updated
• The context focused crawler overcomes the problems of the focused crawler
– Follows links from pages that point to relevant pages even though they themselves are not relevant
– Helps in backward crawling

Context graph

• Rooted graph in which the root represents a seed document and the nodes at each level represent pages that have links to a node at the next higher level
• The context graphs created for all seed documents are merged to create a merged context graph
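
A small sketch of building and merging context graphs, assuming the link structure is already known as a dictionary. `LINKS` and the page names are invented for illustration; a real crawler would have to obtain the in-links of a page from some backlink source.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed": [],
    "p1": ["seed"],
    "p2": ["seed"],
    "p3": ["p1"],
    "p4": ["p1", "p2"],
    "p5": ["p4"],
}


def context_graph(seed, links, max_level=2):
    """Return {level: set of pages}; a level-i page reaches the seed in i links."""
    # Invert the link graph so that we can walk backwards from the seed.
    inlinks = {}
    for src, targets in links.items():
        for dst in targets:
            inlinks.setdefault(dst, set()).add(src)

    levels = {0: {seed}}
    seen = {seed}
    frontier = [seed]
    for level in range(1, max_level + 1):
        nxt = []
        for page in frontier:
            for parent in sorted(inlinks.get(page, ())):
                if parent not in seen:
                    seen.add(parent)
                    nxt.append(parent)
        if not nxt:
            break
        levels[level] = set(nxt)
        frontier = nxt
    return levels


def merge(graphs):
    """Merged context graph: union of the pages found at each level."""
    merged = {}
    for graph in graphs:
        for level, pages in graph.items():
            merged.setdefault(level, set()).update(pages)
    return merged


graph = context_graph("seed", LINKS)
print(graph)
```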

Harvest system

• Based on the use of caching, indexing and crawling
• Harvest is centered around the use of
– Gatherers: obtain information for indexing from an Internet Service Provider
– Brokers: provide the index and query interface
– Brokers may directly or indirectly interface with gatherers

Virtual Web View

• A large amount of unstructured data can be handled using a multiple layered database (MLDB) on top of the web data
• Every layer of this database is more generalized than the preceding layer
• The upper layers are structured and can be accessed using SQL
• A view of the MLDB is called a Virtual Web View (VWV)

WebML

• Query language which supports data mining operations on the MLDB
• Four primitive operations in WebML are
– COVERS
– COVERED BY
– LIKE
– CLOSE TO
• Example query:
SELECT *
FROM document in “www.engr.smu.edu”
WHERE ONE OF keywords COVERS “cat”
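
To give a feel for what COVERS might test, here is a rough Python sketch. It assumes that one concept covers another when it is the same concept or an ancestor of it in a concept hierarchy; the hierarchy, the document names and the keyword lists below are all invented for illustration, and this is not WebML itself.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
PARENT = {
    "cat": "feline", "feline": "mammal",
    "dog": "canine", "canine": "mammal",
    "mammal": "animal",
}


def covers(general, specific, parent=PARENT):
    """True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = parent.get(node)
    return False


# Hypothetical document table: document name -> keywords.
DOCUMENTS = {
    "doc_pets": ["feline", "food"],
    "doc_kennel": ["canine"],
    "doc_zoo": ["animal"],
}

# Rough analogue of: WHERE ONE OF keywords COVERS "cat"
hits = [name for name, keywords in DOCUMENTS.items()
        if any(covers(k, "cat") for k in keywords)]
print(hits)
```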

Personalization

• Contents of a web page are modified to fit the desires of the user
• Advertisements are sent to a potential customer based on specific knowledge about that customer
• Personalization is performed on the target web page
• Targeting is different from personalization
– In targeting, businesses display advertisements at other sites visited by their users
– In personalization, when a person visits a Web site, the advertising can be designed specifically for that person

Personalization Contd….

• Personalization is a combination of clustering, classification and prediction

• Types of personalization are
– Manual techniques: user registration details
– Collaborative filtering
– Content-based filtering
• E.g. My Yahoo

Web Structure Mining

• Creating a model of the web organization
• Used to classify Web pages or to create similarity measures between documents

Page Rank

• Designed to increase the effectiveness of search engines and improve their efficiency
• Used to
– Measure the importance of a page
– Prioritize the pages returned from a traditional search engine using keyword searching
• The PageRank of a page is calculated based on the number of pages that point to it

Page Rank Contd…

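$$PR(p) = c \sum_{q \in B_p} \frac{PR(q)}{N_q}$$

where
• $c$ is a constant between 0 and 1, used for normalization
• $B_p$ is the set of pages that point to $p$
• $F_p$ is the set of links out of $p$
• $N_q = |F_q|$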

Rank Sink

• When there is a cyclic reference, a rank sink problem occurs
• It is eliminated by adding a term cE(v) to the PageRank formula
• E(v) is a vector that adds an artificial link
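
The iteration can be sketched as follows: each page receives PR(q)/N_q from every page q that points to it, plus an artificial-link term E. The sketch makes two assumptions that are not on the slides: E is spread uniformly over all pages (parameter `e`), and the constant c is realised by rescaling the ranks to sum to 1 after every pass.

```python
def pagerank(links, e=0.15, iterations=50):
    """links: page -> list of pages it points to. Returns page -> rank."""
    pages = sorted(set(links) | {p for out in links.values() for p in out})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # E term: a uniform artificial link into every page (assumption).
        new = {p: e / len(pages) for p in pages}
        for q in pages:
            out = links.get(q, [])
            for p in out:                        # q belongs to B_p
                new[p] += rank[q] / len(out)     # PR(q) / N_q
        total = sum(new.values())                # rescaling plays the role of c
        rank = {p: value / total for p, value in new.items()}
    return rank


# B and C point only at each other: a rank sink fed by A.
GRAPH = {"A": ["B"], "B": ["C"], "C": ["B"]}

without_e = pagerank(GRAPH, e=0.0)
with_e = pagerank(GRAPH)
print(without_e["A"], with_e["A"])
```

With `e=0.0` the cycle B, C absorbs all of the rank and A ends at zero; with the artificial link every page keeps a positive share.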

Hyperlink-induced topic search(HITS)

• Finds hubs and authoritative pages
• HITS has two components
– Based on a given set of keywords, relevant pages are found
– Hub and authority measures are associated with these pages. Pages with the highest values are returned

Selime Işık-Büşra İpek 26

Authorities and hubs

• The algorithm produces two types of pages:
– Authority: pages that provide important, trustworthy information on a given topic
– Hub: pages that contain links to authorities
• Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs

Authorities and hubs (2)

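a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)

[Figure: pages 2, 3 and 4 point to page 1 (authority score of page 1); page 1 points to pages 5, 6 and 7 (hub score of page 1)]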

Definitions

• Authority: pages that provide important, trustworthy information on a given topic
• Hub: pages that contain links to authorities
• Indegree: number of incoming links to a given node, used to measure authoritativeness
• Outdegree: number of outgoing links from a given node, used here to measure hubness

HITS Algorithm

• Hubs point to lots of authorities.
• Authorities are pointed to by lots of hubs.
• Together they form a bipartite graph: Hubs → Authorities

Step By Step HITS-1

• Determines a base set S
• Let the set of documents returned by a standard search engine be called the root set R
• Initialize S to R

Step By Step HITS - 2

• Add to S all pages pointed to by any page in R
• Add to S all pages that point to any page in R
• Maintain for each page p in S:
– Authority score: a_p (vector a)
– Hub score: h_p (vector h)

Step By Step HITS - 3

• For each node, initialize a_p and h_p to 1/n
• In each iteration, calculate the authority weight for each node in S as the sum of the hub weights of the pages that point to it: $a_p = \sum_{q \to p} h_q$

Step By Step HITS - 4

• In each iteration, calculate the hub weight for each node in S as the sum of the authority weights of the pages it points to: $h_p = \sum_{p \to q} a_q$
• Note: the hub weights are computed from the current authority weights, which were computed from the previous hub weights.

Step By Step HITS - 5

• After new weights are computed for all nodes, the weights are normalized:
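
Steps 1 to 5 can be put together in a short Python sketch. The normalization formula is not preserved in this transcript, so the sketch assumes scaling each score vector to Euclidean length 1; the example graph reuses the page numbers from the "Authorities and hubs (2)" slide, and the root set {1} is chosen only for illustration.

```python
import math


def hits(links, root, iterations=20):
    """links: page -> list of pages it points to; root: the root set R."""
    # Steps 1-2: base set S = R + pages R points to + pages pointing into R.
    base = set(root)
    for p in root:
        base.update(links.get(p, []))
    for q, out in links.items():
        if any(p in root for p in out):
            base.add(q)

    # Keep only the links with both endpoints in S.
    edges = [(q, p) for q in base for p in links.get(q, []) if p in base]

    # Step 3: initialise every score to 1/n.
    n = len(base)
    auth = {p: 1.0 / n for p in base}
    hub = {p: 1.0 / n for p in base}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to p.
        new_auth = {p: 0.0 for p in base}
        for q, p in edges:
            new_auth[p] += hub[q]
        # Step 4: hub weight from the *current* authority weights.
        new_hub = {p: 0.0 for p in base}
        for q, p in edges:
            new_hub[q] += new_auth[p]
        # Step 5: normalise (assumption: Euclidean length 1).
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub


# Pages 2, 3, 4 point to page 1; page 1 points to pages 5, 6, 7.
LINKS = {2: [1], 3: [1], 4: [1], 1: [5, 6, 7]}
auth, hub = hits(LINKS, root={1})
print(max(auth, key=auth.get))   # page with the highest authority score
```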

The Pseudocode of HITS

HITS Example

• Root set R = {1, 2, 3, 4}
• Extend it to form the base set S

HITS Example Results

• Authority and hubness weights

[Figure: chart of authority and hubness weights for pages 1 to 15]

HITS vs PageRank

• HITS emphasizes mutual reinforcement between authority and hub webpages, while PageRank does not attempt to capture the distinction between hubs and authorities. It ranks pages just by authority.

• HITS is applied to the local neighborhood of pages surrounding the results of a query whereas PageRank is applied to the entire web

• HITS is query dependent but PageRank is query-independent

HITS vs PageRank (2)

• Both HITS and PageRank correspond to matrix computations.

• Both can be unstable: changing a few links can lead to quite different rankings.

• PageRank doesn't handle pages with no outedges very well, because they decrease the PageRank overall

Conclusion

• HITS is a general algorithm for calculating authority and hub scores in order to rank the retrieved data
• The basic aim of the algorithm is to induce a subgraph of the Web by finding a set of pages through a search on a given topic (query)

INPUT:
W // WWW viewed as a directed graph
q // Query
s // Support
OUTPUT:
A // Set of authority pages
H // Set of hub pages
HITS Algorithm:
R = SE(W, q); // Search engine SE is used to find a small root set R
B = R ∪ {pages linked to from R} ∪ {pages that link to pages in R};
G(B, L) = subgraph of W induced by B; // B: vertices (pages) in G; L: links
G(B, L1) = delete links in G within the same site;
x_p = Σ_{q : <q,p> ∈ L1} y_q; // Authority weights
y_p = Σ_{q : <p,q> ∈ L1} x_q; // Hub weights
A = {p | p has one of the highest x_p};
H = {p | p has one of the highest y_p};

Web usage mining

• Mining on web usage data, or web logs
• A web log is a listing of page reference data (clickstream data)
• Logs are examined from either a client or a server perspective
– Server perspective: mining uncovers information about the sites where the server resides
– Client perspective: information about a user is detected
• Aids in personalization

Web usage mining applications

• Personalization for a user
• From the frequent access behavior of users, overall performance can be improved
• Caching of frequently accessed pages
• Modification of the linkage structure based on common access behavior
• Gather business intelligence to improve sales and advertisements

Issues related with web log

• Identification of the exact user is not possible from the log
• With web client caching, the sequence of pages a user visits is difficult to uncover from the server side
• Legal, privacy and security issues remain to be resolved

Preprocessing

• The preprocessing phase includes
– Cleansing
– User identification
– Session identification
– Path completion
– Formatting

What is log?

• Log = {(u1, p1, t1), …, (un, pn, tn)}
• P is the set of pages and U is the set of users; ui ∈ U, pi ∈ P, and ti is a timestamp

What is session?

• Ordered list of pages accessed by a user: {<p1,t1>, <p2,t2>, …, <pn,tn>}
• Each session has a unique identifier called the session ID
• The length of a session is the number of pages in it, denoted by len(S)
• Let D be a database of all sessions; the length of D is the sum of len(S) over all sessions S in D
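
The log and session definitions can be illustrated with a short sketch that turns a log of (user, page, time) triples into sessions. How a session boundary is detected is not stated on the slides; the sketch assumes the common heuristic of starting a new session after a gap longer than `timeout` seconds, and the session ID format is invented.

```python
from collections import defaultdict

# Log entries are (user, page, time) triples; times are in seconds here.
LOG = [
    ("u1", "A", 0), ("u2", "A", 5), ("u1", "B", 60),
    ("u1", "C", 4000), ("u2", "D", 30), ("u1", "D", 4100),
]


def sessionize(log, timeout=1800):
    """Group clicks per user and split them where the gap exceeds `timeout`."""
    by_user = defaultdict(list)
    for user, page, t in log:
        by_user[user].append((t, page))

    sessions = {}   # session ID -> ordered list of (page, time)
    for user in sorted(by_user):
        count = 0
        last_t = None
        for t, page in sorted(by_user[user]):
            if last_t is None or t - last_t > timeout:
                count += 1                      # start a new session
                current = sessions.setdefault(f"{user}-{count}", [])
            current.append((page, t))
            last_t = t
    return sessions


D = sessionize(LOG)
len_D = sum(len(session) for session in D.values())   # length of D
print(D)
print(len_D)
```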

Recap of networking

• What is an ISP?
– Internet Service Provider
• What are cookies?
– Cookies are used to identify a single user regardless of the machine used to access the Web

Trie

• Data structure that is used to keep track of patterns during web usage mining
• A path from the root to a leaf represents a sequence
• Tries are used to store strings for pattern-matching applications
• Each character in the string is stored on the edge to the node, and common prefixes of strings are shared
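
A minimal trie sketch in Python. The class and method names are hypothetical, and the per-node counter is an addition that is handy for usage mining (it tells how many stored sequences share a prefix); the example words are only illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # edge label (one symbol) -> child node
        self.count = 0        # number of inserted sequences passing through
        self.is_end = False   # True if some sequence ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, sequence):
        node = self.root
        for symbol in sequence:
            # Shared prefixes reuse the existing child node.
            node = node.children.setdefault(symbol, TrieNode())
            node.count += 1
        node.is_end = True

    def _walk(self, sequence):
        node = self.root
        for symbol in sequence:
            node = node.children.get(symbol)
            if node is None:
                return None
        return node

    def contains(self, sequence):
        node = self._walk(sequence)
        return node is not None and node.is_end

    def prefix_count(self, prefix):
        node = self._walk(prefix)
        return node.count if node is not None else 0


trie = Trie()
for word in ["CAR", "CART", "ANY"]:
    trie.insert(word)
print(trie.contains("CAR"), trie.contains("CA"), trie.prefix_count("CA"))
```

The same structure works for sessions if each symbol is a page identifier instead of a character.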

Sample tries

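[Figure: two sample tries side by side, labeled TRIE and SUFFIX TRIE. The surviving edge labels are A, N, Y, C, A, R, T, $ in the first and CAR, C, ART, ANY, $ in the second]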

Characteristics of suffix trie

• Each internal node except the root has at least two children
• Each edge represents a nonempty subsequence
• The subsequences on sibling edges begin with different symbols
• A suffix tree built for multiple sessions is called a generalized suffix tree (GST)

Pattern Discovery

• For clickstream data, the common data mining technique is uncovering traversal patterns
• A traversal pattern is a set of pages visited by a user in a session
• Traversal patterns differ in the following features
– Duplicate page references may or may not be allowed
– A pattern may consist of contiguous page references only, or of any pages referenced in the same session
– A pattern may or may not be required to be maximal
– A frequent pattern is maximal if it is not a subpattern of any other frequent pattern

Association rules

• Can be used to find what pages are accessed together

• In this case a page is regarded as an item and a session is regarded as a transaction with duplicates and ordering ignored

• $\text{Support} = \dfrac{\text{No. of occurrences of the itemset}}{\text{No. of transactions or sessions}}$
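
As a small illustration of the support formula, the sketch below treats each session as a set of pages (duplicates and ordering ignored) and counts the sessions containing a given itemset. The sessions and the helper names are made up for the example.

```python
from itertools import combinations

# Hypothetical sessions: ordered page references, possibly with repeats.
SESSIONS = [
    ["A", "B", "A", "C"],
    ["A", "C"],
    ["B", "D"],
    ["A", "B", "C", "D"],
]


def support(itemset, sessions):
    """Fraction of sessions that contain every page in `itemset`."""
    itemset = set(itemset)
    # set(session) drops duplicates and ordering.
    hits = sum(1 for session in sessions if itemset <= set(session))
    return hits / len(sessions)


def frequent_pairs(sessions, min_support):
    """All page pairs whose support reaches `min_support`."""
    pages = sorted({page for session in sessions for page in session})
    result = {}
    for pair in combinations(pages, 2):
        value = support(pair, sessions)
        if value >= min_support:
            result[pair] = value
    return result


print(support({"A", "C"}, SESSIONS))
print(frequent_pairs(SESSIONS, 0.5))
```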

Sequential Patterns

• Sequential pattern is an ordered set of pages that satisfies a given support and is maximal

• Support is the percentage of customers who have the pattern

• Users can span many sessions, hence sequential patterns can also span many sessions

Algorithm to find sequential patterns

INPUT:
D = {S1, S2, …, Sk} // Database of sessions
s // Support
OUTPUT:
Sequential patterns
Sequential pattern algorithm:
D = sort D on user-id and time of first page reference in each session;
Find L1 in D;
L = AprioriAll(D, s, L1);
Find maximal reference sequences from L;

The Apriori Property of Sequential Patterns

• A basic property: Apriori (Agrawal & Srikant ’94)
– If a sequence S is not frequent, then none of the super-sequences of S is frequent
– E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent as well

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Given support threshold min_sup = 2
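
The support counts for this database can be checked with a short sketch. It assumes a sequence is stored as a list of item sets (one set per element), with `parse` reading the `<(bd)cb(ac)>` notation of the slide; the function names are hypothetical.

```python
def parse(text):
    """Parse '<(bd)cb(ac)>' into [{'b','d'}, {'c'}, {'b'}, {'a','c'}]."""
    elements, group = [], None
    for ch in text.strip("<>"):
        if ch == "(":
            group = set()
        elif ch == ")":
            elements.append(group)
            group = None
        elif group is not None:
            group.add(ch)           # item inside a parenthesised element
        else:
            elements.append({ch})   # single-item element
    return elements


def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence`.

    Each pattern element must be a subset of a distinct sequence element,
    in order; matching greedily at the earliest position is sufficient.
    """
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)


DB = {
    10: parse("<(bd)cb(ac)>"),
    20: parse("<(bf)(ce)b(fg)>"),
    30: parse("<(ah)(bf)abf>"),
    40: parse("<(be)(ce)d>"),
    50: parse("<a(bd)bcb(ade)>"),
}


def support(pattern_text, db=DB):
    """Number of sequences in the database that contain the pattern."""
    pattern = parse(pattern_text)
    return sum(contains(sequence, pattern) for sequence in db.values())


for p in ["<hb>", "<hab>", "<(ah)b>"]:
    print(p, support(p))
```

Only sequence 30 contains <hb>, so its support is below min_sup = 2, and neither super-sequence can do better.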

GSP—Generalized Sequential Pattern Mining

• GSP (Generalized Sequential Pattern) mining algorithm
• Outline of the method
– Initially, every item in the DB is a candidate of length 1
– For each level (i.e., sequences of length k) do
  • scan the database to collect the support count for each candidate sequence
  • generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
– Repeat until no frequent sequence or no candidate can be found
• Major strength: candidate pruning by Apriori

Finding Length-1 Sequential Patterns

• Initial candidates:
– <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
• Scan the database once, count support for candidates

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

min_sup = 2

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1

Generating Length-2 Candidates

Candidates of the form <xy> (x followed by y):

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

Candidates of the form <(xy)> (x and y in the same element):

       <a>    <b>      <c>      <d>      <e>      <f>
<a>           <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                    <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                             <(cd)>   <(ce)>   <(cf)>
<d>                                      <(de)>   <(df)>
<e>                                               <(ef)>
<f>

51 length-2 candidates

Without the Apriori property: 8*8 + 8*7/2 = 92 candidates

Apriori prunes 44.57% of the candidates
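
The candidate counts can be reproduced with a few lines of Python. The function name is hypothetical; it simply enumerates the two tables for a given list of items.

```python
from itertools import combinations, product


def length2_candidates(items):
    """Length-2 candidates built from a set of length-1 items."""
    # <xy>: x followed later by y; order matters and x may equal y.
    sequential = [f"<{x}{y}>" for x, y in product(items, repeat=2)]
    # <(xy)>: x and y in the same element; unordered, x different from y.
    same_element = [f"<({x}{y})>" for x, y in combinations(items, 2)]
    return sequential + same_element


with_pruning = length2_candidates("abcdef")         # the 6 frequent items
without_pruning = length2_candidates("abcdefgh")    # all 8 items
pruned = 1 - len(with_pruning) / len(without_pruning)

print(len(with_pruning), len(without_pruning), f"{pruned:.2%}")
```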

Finding Length-2 Sequential Patterns

• Scan the database one more time, collect the support count for each length-2 candidate
• There are 19 length-2 candidates which pass the minimum support threshold
– They are the length-2 sequential patterns

The GSP Mining Process

• 1st scan: 8 candidates, 6 length-1 sequential patterns
– <a> <b> <c> <d> <e> <f> <g> <h>
• 2nd scan: 51 candidates, 19 length-2 sequential patterns, 10 candidates not in DB at all
– <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
• 3rd scan: 46 candidates, 19 length-3 sequential patterns, 20 candidates not in DB at all
– <abb> <aab> <aba> <baa> <bab> …
• 4th scan: 8 candidates, 6 length-4 sequential patterns
– <abba> <(bd)bc> …
• 5th scan: 1 candidate, 1 length-5 sequential pattern
– <(bd)cba>

[Figure legend: candidates that cannot pass the support threshold; candidates not in DB at all]

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

min_sup = 2

The GSP Algorithm

• Take sequences of the form <x> as length-1 candidates
• Scan the database once, find F1, the set of length-1 sequential patterns
• Let k = 1; while Fk is not empty do
– Form Ck+1, the set of length-(k+1) candidates, from Fk
– If Ck+1 is not empty, scan the database once, find Fk+1, the set of length-(k+1) sequential patterns
– Let k = k + 1
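
The loop above can be sketched in Python for a simplified setting. This is not full GSP: it assumes every element of a sequence is a single page (as in a clickstream session), so the `<(xy)>` style candidates are left out, and the sessions are invented for the example.

```python
def is_subsequence(pattern, session):
    """True if the pages of `pattern` occur in `session` in the same order."""
    it = iter(session)
    return all(page in it for page in pattern)


def gsp(sessions, min_sup):
    """Level-wise mining; returns {pattern tuple: support count}."""
    def count(candidates):
        # One pass over the database per level.
        result = {}
        for cand in candidates:
            sup = sum(is_subsequence(cand, s) for s in sessions)
            if sup >= min_sup:
                result[cand] = sup
        return result

    items = sorted({page for s in sessions for page in s})
    frequent = count([(x,) for x in items])     # F1
    patterns = dict(frequent)
    while frequent:
        # Join step: s1 without its first page equals s2 without its last
        # page, giving a candidate of length k+1 (Apriori-style pruning).
        candidates = sorted({s1 + s2[-1:]
                             for s1 in frequent for s2 in frequent
                             if s1[1:] == s2[:-1]})
        frequent = count(candidates)            # F(k+1)
        patterns.update(frequent)
    return patterns


SESSIONS = [["A", "B", "C"], ["A", "C"], ["A", "B", "C", "D"], ["B", "D"]]
patterns = gsp(SESSIONS, min_sup=2)
print(patterns)
```

Each pass through the `while` loop corresponds to one database scan, which is exactly the cost the next slide lists as a bottleneck.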

The GSP Algorithm

• Benefits from the Apriori pruning
– Reduces the search space
• Bottlenecks
– Scans the database multiple times
– Generates a huge set of candidate sequences
• There is a need for more efficient mining methods

