Page 1:

Web Mining –

An introduction to server-log-based Web usage mining

Bettina Berendt

Universidad Politécnica de Madrid, Department of Computer Science

http://vasarely.wiwi.hu-berlin.de/WebMining09/

Last update: 21 May 2009

Page 2:

(Recall the 2002 slide:)

Web Usage Mining: Basics and data sources

Definition of Web usage mining:

discovery of meaningful patterns from data generated by client-server transactions on one or more Web servers

Typical Sources of Data

automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies

e-commerce and product-oriented user events (e.g., shopping cart changes, ad or product click-throughs, purchases)

user profiles and/or user ratings

meta-data, page attributes, page content, site structure

Page 3:

Agenda

Data Acquisition, Understanding, and Preparation

Forms of analysis; mining techniques

Case study: A multi-channel retailer

Method: Association-rule discovery

Free tools for logfile analysis

Page 4:

Web Usage Mining

Discovery of meaningful patterns from data generated by client-server transactions on one or more Web servers

Typical Sources of Data

automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies

e-commerce and product-oriented user events (e.g., shopping cart changes, ad or product click-throughs, etc.)

user profiles and/or user ratings

meta-data, page attributes, page content, site structure

Page 5:

Data collection

Web server

Proxy

Client (Browser)

Page 6:

What’s in a typical Web server log …

<ip_addr> - - <date> <method> <file> <protocol> <code> <bytes> <referrer> <user_agent>

203.30.5.145 - - [01/Jun/1999:03:09:21 -0600] "GET /Calls/OWOM.html HTTP/1.0" 200 3942 "http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology-&maxhits=20&cat=dir" "Mozilla/4.5 [en] (Win98; I)"

203.30.5.145 - - [01/Jun/1999:03:09:23 -0600] "GET /Calls/Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"

203.30.5.145 - - [01/Jun/1999:03:09:24 -0600] "GET /Calls/Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"

203.252.234.33 - - [01/Jun/1999:03:12:31 -0600] "GET / HTTP/1.0" 200 4980 "" "Mozilla/4.06 [en] (Win95; I)"

203.252.234.33 - - [01/Jun/1999:03:12:35 -0600] "GET /Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"

203.252.234.33 - - [01/Jun/1999:03:12:35 -0600] "GET /Images/red.gif HTTP/1.0" 200 104 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"

203.252.234.33 - - [01/Jun/1999:03:12:35 -0600] "GET /Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"

203.252.234.33 - - [01/Jun/1999:03:13:11 -0600] "GET /CP.html HTTP/1.0" 200 3218 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"

203.30.5.145 - - [01/Jun/1999:03:13:25 -0600] "GET /Calls/AWAC.html HTTP/1.0" 200 104 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"

(Requests to www.acr-news.org)

Page 7:

… and what does it mean?
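To make the field layout concrete, here is a minimal Python sketch that splits one entry of the format shown above into its named fields. The regular expression and the function name `parse_log_line` are illustrative assumptions rather than part of any particular log-analysis tool; the field names follow the placeholders of the format line, and the two hyphens are the (here unused) identity and user fields.

```python
import re
from datetime import datetime

# Assumed pattern for the log format shown on the slide:
# host ident user [date] "method file protocol" code bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip_addr>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<file>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["code"] = int(entry["code"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# First entry from the slide.
line = (
    '203.30.5.145 - - [01/Jun/1999:03:09:21 -0600] '
    '"GET /Calls/OWOM.html HTTP/1.0" 200 3942 '
    '"http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology-&maxhits=20&cat=dir" '
    '"Mozilla/4.5 [en] (Win98; I)"'
)
entry = parse_log_line(line)
print(entry["ip_addr"], entry["file"], entry["code"], entry["bytes"])
```

Read this way, the entry says: the client at 203.30.5.145 requested /Calls/OWOM.html, the server answered with status 200 and 3942 bytes, and the visitor arrived from a Lycos search result page using a Netscape-era browser on Windows 98.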


Page 8:

Sources and destinations

Logs may extend beyond visits to the site and show where a visitor was before (referrer) ...

203.30.5.145 - - [01/Jun/1999:03:09:21 -0600] "GET /Calls/OWOM.html HTTP/1.0" 200 3942 "http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology-&maxhits=20&cat=dir" "Mozilla/4.5 [en] (Win98; I)"

... and where s/he went next (URL rewriting):
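For the referrer part, the information can be pulled out with the standard library alone. The sketch below is only an illustration: the helper name `search_terms_from_referrer` is hypothetical, and the assumption that the search engine passes the terms in a parameter called `query` is taken from the Lycos URL in the example entry; other search engines use other parameter names.

```python
from urllib.parse import urlsplit, parse_qs


def search_terms_from_referrer(referrer, query_param="query"):
    """Return the search terms carried in a referrer URL, or [] if there are none."""
    if not referrer:
        return []
    params = parse_qs(urlsplit(referrer).query)
    values = params.get(query_param, [])
    return values[0].split() if values else []


referrer = (
    "http://www.lycos.com/cgi-bin/pursuit"
    "?query=advertising+psychology-&maxhits=20&cat=dir"
)
host = urlsplit(referrer).netloc
terms = search_terms_from_referrer(referrer)
print(host, terms)
```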

Page 9:

Preprocessing of Web Usage Data

[Figure: preprocessing pipeline. Inputs: raw usage data, site structure and content. Steps: data cleaning, user/session identification, page view identification, path completion, episode identification. Outputs: usage statistics, server session file, episode file.]

Page 10:

Preprocessing of Web Usage Data

[Figure: the same preprocessing pipeline (raw usage data, data cleaning, user/session identification, page view identification, path completion, server session file, episode identification, episode file, site structure and content, usage statistics), with the annotation "not always necessary and/or done".]

Page 11:

Data Preprocessing (1)

Data cleaning

remove irrelevant references and fields in server logs

remove references due to spider navigation

remove erroneous references

add missing references due to caching (done after sessionization)

Data integration

synchronize data from multiple server logs

integrate semantics, e.g., meta-data (such as content labels)

integrate e-commerce and application server data

integrate demographic / registration data
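The cleaning steps above can be sketched as a simple filter over parsed log entries. Everything specific here is an assumption made for illustration: the list of "irrelevant" file suffixes, the user-agent substrings used to spot spiders, and the rule that status codes of 400 and above count as erroneous references. The last two sample entries are hypothetical.

```python
# Assumed lists; a real site would tune both to its own content and traffic.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_relevant(entry):
    """Keep only successful page requests that do not look like spider traffic."""
    path = entry["file"].split("?", 1)[0].lower()
    if path.endswith(IRRELEVANT_SUFFIXES) or path == "/robots.txt":
        return False  # embedded objects and robot housekeeping
    if entry["code"] >= 400:
        return False  # erroneous references
    agent = entry["user_agent"].lower()
    return not any(marker in agent for marker in ROBOT_MARKERS)


def clean(entries):
    return [entry for entry in entries if is_relevant(entry)]


browser = "Mozilla/4.5 [en] (Win98; I)"
entries = [
    {"file": "/Calls/OWOM.html", "code": 200, "user_agent": browser},
    {"file": "/Calls/Images/earthani.gif", "code": 200, "user_agent": browser},
    {"file": "/missing.html", "code": 404, "user_agent": browser},  # hypothetical
    {"file": "/CP.html", "code": 200, "user_agent": "ExampleBot/1.0 (crawler)"},  # hypothetical
]
cleaned = clean(entries)
print([entry["file"] for entry in cleaned])
```

User-agent matching only catches robots that identify themselves; it is a starting point, not a complete spider filter.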

Page 12:

Data Preprocessing (2)

Data Transformation

user identification

sessionization / episode identification

pageview identification

a pageview is a set of page files and associated objects that contribute to a single display in a Web Browser

Data Reduction

sampling and dimensionality reduction (ignoring certain pageviews / items)

Identifying User Transactions (i.e., sets or sequences of pageviews possibly with associated weights)

Page 13:

Why sessionize?

Quality of the patterns discovered in KDD depends on the quality of the data on which mining is applied.

In Web usage analysis, these data are the sessions of the site visitors: the activities performed by a user from the moment she enters the site until the moment she leaves it.

Difficult to obtain reliable usage data due to proxy servers and anonymizers, dynamic IP addresses, missing references due to caching, and the inability of servers to distinguish among different visits.

Cookies and embedded session IDs produce the most faithful approximation of users and their visits, but are not used in every site, and not accepted by every user.

Therefore, heuristics are needed that can sessionize the available access data.

Page 14:

Mechanisms for User Identification

Examples: page tags (using JavaScript), some browser plugins

Page 15:

Examples of “software agents” – or: Alternatives to Web-server-log-based data collection

Page tagging with Javascript: see also http://www.bruceclay.com/analytics/disadvantages.htm

Page 16:

Sessionization strategies: Sessionization heuristics

These heuristics are quite accurate! (see Spiliopoulou et al., 2003)
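As an illustration of what such a heuristic can look like, here is a minimal sketch of one common time-oriented rule: requests from the same visitor belong to one session until the gap between two consecutive requests exceeds a threshold. The 30-minute threshold and the use of (IP address, user agent) as the visitor key are assumptions for this sketch, not values taken from the slide. The timestamps come from the example log, except for the last entry, which is a hypothetical later visit.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(entries, max_gap=timedelta(minutes=30)):
    """Group requests per (ip_addr, user_agent) and split them at long pauses."""
    by_visitor = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e["date"]):
        by_visitor[(entry["ip_addr"], entry["user_agent"])].append(entry)

    sessions = []
    for requests in by_visitor.values():
        current = [requests[0]]
        for previous, following in zip(requests, requests[1:]):
            if following["date"] - previous["date"] > max_gap:
                sessions.append(current)  # pause too long: close the session
                current = []
            current.append(following)
        sessions.append(current)
    return sessions


def at(hms):
    return datetime.strptime("01/Jun/1999 " + hms, "%d/%b/%Y %H:%M:%S")


win98 = "Mozilla/4.5 [en] (Win98; I)"
win95 = "Mozilla/4.06 [en] (Win95; I)"
entries = [
    {"ip_addr": "203.30.5.145", "user_agent": win98, "date": at("03:09:21")},
    {"ip_addr": "203.252.234.33", "user_agent": win95, "date": at("03:12:31")},
    {"ip_addr": "203.252.234.33", "user_agent": win95, "date": at("03:13:11")},
    {"ip_addr": "203.30.5.145", "user_agent": win98, "date": at("03:13:25")},
    {"ip_addr": "203.30.5.145", "user_agent": win98, "date": at("04:30:00")},  # hypothetical
]
sessions = sessionize(entries)
print([len(session) for session in sessions])
```

Proxies and shared machines can still merge several people into one visitor key, which is why cookies or session IDs are preferred where available.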

Page 17:

Path Completion

Refers to the problem of inferring missing user references due to caching.

Effective path completion requires extensive knowledge of the link structure within the site

Referrer information in server logs can also be used in disambiguating the inferred paths.

Problem gets much more complicated in frame-based sites.
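A much simplified sketch of the idea, using only the referrer: if a request's referrer is not the page seen immediately before but does occur earlier in the session, the sketch assumes the visitor used the back button and re-inserts the pages that were presumably served from the browser cache. The function name and page names are hypothetical, and the sketch deliberately ignores the site's link structure, which a serious implementation would need as stated above.

```python
def complete_path(requests):
    """requests: list of (page, referrer) in time order; returns the completed page list."""
    path = []
    for page, referrer in requests:
        if path and referrer and referrer != path[-1] and referrer in path:
            # Position of the most recent earlier visit to the referrer.
            last_seen = len(path) - 1 - path[::-1].index(referrer)
            # Pages the visitor presumably stepped back through (cached views).
            backtrack = path[last_seen:len(path) - 1][::-1]
            path.extend(backtrack)
        path.append(page)
    return path


# Logged: A, B, C, D. The referrer of D is A, so the visitor must have
# gone back from C via B to A before following the link to D.
logged = [
    ("A.html", None),
    ("B.html", "A.html"),
    ("C.html", "B.html"),
    ("D.html", "A.html"),
]
completed = complete_path(logged)
print(completed)
```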

Page 18:

Why integrate semantics?

Basic idea: associate each requested page with one or more domain concepts, to better understand the process of navigation / Web usage

Example: a shopping site

From ...

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?l=ostsee%20strand&syn=023785&ord=asc HTTP/1.0" 200 1759

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc HTTP/1.0" 200 8450

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /mlesen.html?Item=3456&syn=023785 HTTP/1.0" 200 3478

To ...

Search by category

Search by category + title

Refine search

Choose item

Look at individual product

Page 19:

From URLs to topics / concepts: Basics of semantic session modelling

1 request → 1 concept or n concepts

Concepts can concern content or service

Concepts can be part of an ontology (simple case: concept hierarchy)

Session = set / sequence / tree / graph of requests

also possible: n requests → 1 concept
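A mapping from requests to concepts can be sketched as a small rule table. The rules below are hypothetical and only loosely modelled on the shopping-site example: the concept names, and in particular the reading of the `p` parameter as a sign of a refined search, are assumptions for illustration. One request may yield one or several concepts.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical rule table: (pattern on the URL path, concept name).
CONCEPT_RULES = [
    (re.compile(r"^/search\.html$"), "search"),
    (re.compile(r"^/mlesen\.html$"), "product view"),
]


def concepts_for(url):
    """Map one requested URL to a list of concepts (1 request -> n concepts)."""
    parts = urlsplit(url)
    concepts = [name for pattern, name in CONCEPT_RULES if pattern.match(parts.path)]
    params = parse_qs(parts.query)
    if "search" in concepts and "p" in params:
        concepts.append("refined search")  # assumed meaning of the extra parameter
    return concepts or ["other"]


requests = [
    "/search.html?l=ostsee%20strand&syn=023785&ord=asc",
    "/search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc",
    "/mlesen.html?Item=3456&syn=023785",
]
session_concepts = [concepts_for(url) for url in requests]
print(session_concepts)
```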

Page 20:

Ontology-based behaviour modelling – basic ideas (1)

The request for a Web page signals interest in the concept(s) and relations dealt with in this page – interest in the obtained content as well as in the requested service.

Formally: a request as a (multi)set, or as a vector, of concepts/relations.

Page 21:

Resulting format: if the request is the instance

Usually flat file (format like Web server log) or database

Page 22:

Resulting format: If a session is the instance

What features can a session have?

Refer again to the example:

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?l=ostsee%20strand&syn=023785&ord=asc HTTP/1.0" 200 1759

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc HTTP/1.0" 200 8450

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /mlesen.html?Item=3456&syn=023785 HTTP/1.0" 200 3478

Search by category

Search by category + title

Refine search

Choose item

Look at individual product

Page 23:

Web Usage and E-Business Analytics

Basic Framework for E-Commerce Data Analysis

[Figure: architecture diagram. Components: Web/application server logs; data cleaning / sessionization module; site content; content analysis module; site map; site dictionary; operational database (customers, orders, products); data integration module; integrated sessionized data; e-commerce data mart; data cube; OLAP tools; data mining engine. Forms of analysis: session analysis / static aggregation, OLAP analysis, pattern analysis.]

Page 24:

Agenda

Data Acquisition, Understanding, and Preparation

Forms of analysis; mining techniques

Case study: A multi-channel retailer

Method: Association-rule discovery

Free tools for logfile analysis

Page 25:

Web Usage and E-Business Analytics

Session Analysis

Static Aggregation and Statistics

OLAP

Data Mining

Different Levels of Analysis

Page 26:

Session Analysis

Simplest form of analysis: examine individual or groups of server sessions and e-commerce data.

Advantages:

Gain insight into typical customer behaviors.

Trace specific problems with the site.

Drawbacks:

LOTS of data.

Difficult to generalize.

Page 27:

Static Aggregation (Reports)

Most common form of analysis.

Data aggregated by predetermined units such as days or sessions.

Generally gives most “bang for the buck.”

Advantages:

Gives quick overview of how a site is being used.

Minimal disk space or processing power required.

Drawbacks:

No ability to “dig deeper” into the data.

Page View            Number of Sessions   Average View Count per Session
Home Page            50,000               1.5
Catalog Ordering     500                  1.1
Shopping Cart        9,000                2.3
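A report of this kind can be computed in one pass over the sessions. The sketch below assumes that "average view count per session" means the average number of views of a page among the sessions that contain it; that reading, the function name `page_report`, and the three toy sessions are assumptions for illustration.

```python
from collections import Counter


def page_report(sessions):
    """Return {page: (number of sessions, average view count per session)}."""
    sessions_with_page = Counter()
    total_views = Counter()
    for session in sessions:
        for page, views in Counter(session).items():
            sessions_with_page[page] += 1
            total_views[page] += views
    return {
        page: (count, total_views[page] / count)
        for page, count in sessions_with_page.items()
    }


# Hypothetical sessions, each a list of viewed pages.
sessions = [
    ["Home Page", "Shopping Cart", "Shopping Cart"],
    ["Home Page", "Home Page"],
    ["Home Page", "Shopping Cart"],
]
report = page_report(sessions)
for page, (n_sessions, average) in report.items():
    print(f"{page:<16}{n_sessions:>4}{average:>8.2f}")
```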

Page 28:

Online Analytical Processing (OLAP)

Allows changes to aggregation level for multiple dimensions.

Generally associated with a Data Warehouse.

Advantages & Drawbacks

Very flexible

Requires significantly more resources than static reporting.

Page View              Number of Sessions   Average View Count per Session
Kid's Stuff Products   2,000                5.9

Page View              Number of Sessions   Average View Count per Session
Kid's Stuff Products
  Electronics
    Educational        63                   2.3
    Radio-Controlled   93                   2.5

Page 29:

Data Mining: Going deeper

Prediction of next event: sequence mining, Markov chains

Discovery of associated events or application objects: association rules

Discovery of visitor groups with common properties and interests: clustering

Discovery of visitor groups with common behaviour: session clustering

Characterization of visitors with respect to a set of predefined classes: classification

Card fraud detection

Page 30:

KDD Techniques for Web Applications: Examples (1)

Calibration of a Web server:

Prediction of the next page invocation over a group of concurrent Web users under certain constraints

Sequence mining, Markov chains

Cross-selling of products:

Mapping of Web pages/objects to products

Discovery of associated products

Association rules, Sequence Mining

Placement of associated products on the same page
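For the first example, prediction of the next page invocation, a first-order Markov chain is about the simplest model one can sketch: estimate transition probabilities between pages from the observed sessions and predict the most likely follower. The page names and sessions below are hypothetical, and real applications would typically add smoothing or higher-order models.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from consecutive requests."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    return {
        page: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for page, followers in counts.items()
    }


def predict_next(page, probabilities):
    """Most probable next page, or None if the page was never followed by another."""
    followers = probabilities.get(page)
    if not followers:
        return None
    return max(followers, key=followers.get)


sessions = [
    ["home", "catalog", "product"],
    ["home", "catalog", "cart"],
    ["home", "search", "product"],
    ["catalog", "product"],
]
probabilities = transition_probabilities(sessions)
print(predict_next("home", probabilities), predict_next("catalog", probabilities))
```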

Page 31:

KDD Techniques for Web Applications: Examples (2)

Sophisticated cross-selling and up-selling of products:

Mapping of pages/objects to products of different price groups

Identification of Customer Groups

Clustering, Classification

Discovery of associated products of the same/different price categories

Association rules, Sequence Mining

Formulation of recommendations to the end-user

Suggestions on associated products

Suggestions based on the preferences of similar users

Page 32:

Agenda

Data Acquisition, Understanding, and Preparation

Forms of analysis; mining techniques

Case study: A multi-channel retailer

Method: Association-rule discovery

Free tools for logfile analysis

Page 33:

CRM questions example: Why go to a shop ...

... if everything is available on the Internet?

Page 34:

A multi-channel retailer, its business goals, and analysis questions

General goals: “Standard e-tailer goals” – attract users/shoppers and convert them into customers

Specific goals: assess the success of the Web site – in relation to other distribution channels

Questions of the evaluation:

• What business metrics can be calculated from Web usage data, transaction and demographic data for determining online success?

• Are there cross-channel effects between a company’s e-shop and its physical stores?

Background: Internet market shares [BCG 2002]

[Figure: stacked bar chart, 0% to 100%, for the years 1999, 2000, 2001, and 2002 (proj.), split into pure Internet companies and multi-channel businesses. Data labels by year: 52, 54, 67, 69 and 48, 46, 33, 31.]

Page 35:

The site

Page 36:

Outline of the KDD process

Business understanding: customer buying process

Data understanding (main step): modelling the semantics of the site in terms of a hierarchy of service concepts

Data: Web server sessions, transaction information

Data preparation: session IDs; usual data cleaning steps; linking of sessions and transaction information (anonymized)

Modelling / pattern discovery: Web metrics, cluster analysis, association rules, sequence mining; plus correlation analysis, questionnaire study, qualitative market analysis

Evaluation: interesting patterns

Page 37:

Agenda – Case Study

Business Understanding

Data understanding and preparation

Pattern discovery + evaluation: Success metrics

Pattern disc. + eval.: Behavioural patterns

Pattern disc. + eval.: User types

Pattern disc. + eval.: Behaviour & demographics

Page 39:

Description of the site and its services

The retailer operates an e-shop and more than 5000 retail shops in over 10 European countries

It sells a wide range of consumer electronics

Online customers can pay, pick-up/deliver and return both online and offline

Web pages provide for all tasks in the customer buying process

Page 40:

Purchase Phases (Page Concepts) at Large MC Retailers

1. Acquisition (home): all Web pages that are semantically related to the initial acquisition of a visitor

[Diagram of phases so far: Home (Acquisition)]

Page 41:

Purchase Phases (Page Concepts) at Large MC Retailers

2. Catalogue information: pages providing an overview of product categories

[Diagram of phases so far: Home (Acquisition), Product Impression]

Page 42:

Purchase Phases (Page Concepts) at Large MC Retailers

3. Information product (infprod): pages displaying information about a specific product

[Diagram of phases so far: Home (Acquisition), Product Impression, Product Click-Through]

Page 43:

Purchase Phases (Page Concepts) at Large MC Retailers

4. Offline information (offinfo): all pages related to any offline information: store locator (pages for finding physical stores in one’s neighbourhood), information about offline services, offline referrers, etc.

[Diagram of phases so far: Home (Acquisition), Product Impression, Product Click-Through, Offline info]

Page 44:

Purchase Phases (Page Concepts) at Large MC Retailers

5. Transaction (transact): steps before an actual purchase, starting with a customer entering the order process: check-out, input of customer data, payment and delivery preferences (online or offline), etc.

[Diagram of phases so far: Home (Acquisition), Product Impression, Product Click-Through, Offline info, Transaction]

Page 45:

Purchase Phases (Page Concepts) at Large MC Retailers

6. Purchase: indicates whether a visitor completed the transaction process and bought a product, e.g. invocation of an order confirmation page.

[Diagram of all phases: Home (Acquisition), Product Impression, Product Click-Through, Offline info, Transaction, Purchase]

Page 46:

Agenda – Case Study

Business Understanding

Data understanding and preparation

Pattern disc. + eval.: Behavioural patterns

Page 47:

Data and data preparation

Data sources and sample:

92,467 sessions from the company’s Web logs from 21 days in 2002

anonymized transaction information of 13,653 customers who bought online over a period of 8 months in 2001/02.

621 transaction records (21 days) were linked to Web-usage records

Data preparation:

Sessions were determined by session IDs

Robot visits eliminated, usual data cleaning steps

Each URL request mapped to a service concept from {c1,...,cn}

Session representation: s = [w1, ..., wn], with wi = weight of ci, indicating whether or not the concept was visited (1/0), or how often it was visited

Customer record: feature vector incl. session and transaction data

Page 48:

Site semantics: A service concept hierarchy

760,535 page requests were mapped onto the concepts from this hierarchy:

[Figure: concept hierarchy rooted at "Any". Node labels: Information, Transaction, Services, Information Product, Fulfillment/Service, Customer Data, Shopping Cart, Payment, Company Infos, Registration, Other, Acquisition, Offline Referrer, Advertiser, Other, Store Locator, Information Catalog, Home, Game, Offline Service and Support. A legend symbol marks the multi-channel concepts.]

Page 49:

Types of patterns

Conversion rates (~ confidence of content-specified sequential association rules) for assessing business success

Association rule and sequence analysis for understanding online/offline preferences and their temporal development

Cluster analysis for customer segmentation

Correlation analysis for investigating the relationship between demographic indicators and online/offline preferences

Page 50:

>> Session representation

Each session represented as a feature vector on the multi-channel concepts

Two methods used for definition of new conversion metrics:

weighted-concept method (number of visits to a concept)

dichotomized concept method (whether or not concept was visited)

Weighted-concept representation:

Session   home   infcat   infprod   service   transact   purch.   offinfo
A         0      3        7         4         2          1        0
B         1      3        5         0         0          0        2
...

Dichotomized representation:

Session   home   infcat   infprod   service   transact   purch.   offinfo
A         0      1        1         1         1          1        0
B         1      1        1         0         0          0        1
...
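Both representations follow directly from the list of concepts visited in a session. The sketch below rebuilds the rows for sessions A and B; the order in which the concepts occur inside each session is not given on the slide, so the concept lists here are hypothetical reconstructions that only match the counts.

```python
from collections import Counter

CONCEPTS = ["home", "infcat", "infprod", "service", "transact", "purch", "offinfo"]


def weighted_vector(session_concepts):
    """Number of visits to each concept (weighted-concept method)."""
    counts = Counter(session_concepts)
    return [counts[concept] for concept in CONCEPTS]


def dichotomized_vector(session_concepts):
    """1 if the concept was visited at all, else 0 (dichotomized method)."""
    return [1 if weight > 0 else 0 for weight in weighted_vector(session_concepts)]


# Hypothetical concept sequences whose counts match rows A and B.
session_a = ["infcat"] * 3 + ["infprod"] * 7 + ["service"] * 4 + ["transact"] * 2 + ["purch"]
session_b = ["home"] + ["infcat"] * 3 + ["infprod"] * 5 + ["offinfo"] * 2

print(weighted_vector(session_a), dichotomized_vector(session_a))
print(weighted_vector(session_b), dichotomized_vector(session_b))
```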

Page 51:

Agenda – Case Study

Business Understanding

Data understanding and preparation

Pattern disc. + eval.: Behavioural patterns

Page 52:

“Internal consistency” of preferences – payment and delivery preferences

Online payment → Direct delivery (s=0.27, c=0.97): fewer than 1/3 are traditional online users!

Online payment → In-store pickup (s=0.02, c=0.03)

Cash on delivery → Direct delivery (s=0.02, c=0.03)

In-store payment → In-store pickup (s=0.69, c=0.94)

Site is primarily used to collect information.

s: support, c: confidence of the sequence

Page 53:

“Internal consistency” of preferences – return preferences

Return → In-store (s=0.06, c=0.87)

Return → Mail-in (s=0.04, c=0.13)

Customers may wish personal assistance.

(a result supported by the service mix analysis of different multi-channel retailers and by questionnaire results)

s: support, c: confidence of the association rule

Page 54:

Development of preferences over time

Direct delivery → In-store pickup in 1 following transaction (s=0.001, c=0.15)

Direct delivery → Direct delivery in all following transactions (s=0.003, c=0.85)

In-store pickup → Direct delivery in 1 following transaction (s=0.001, c=0.10) (*)

In-store pickup → In-store pickup in all following transactions (s=0.004, c=0.90)

Results for payment migration are similar.

90% of repeat customers did not change transaction preferences at all.

Rule (*) as an indicator of the development of trust?!

s: support, c: confidence of the sequence

Page 55:

Agenda

Data Acquisition, Understanding, and Preparation

Forms of analysis; mining techniques

Case study: A multi-channel retailer

Method: Association-rule discovery

Free tools for logfile analysis

Page 56:

Association-rule mining

A great tutorial is available here:

S. Parthasarathy (2006). Association rules.

http://www.cse.ohio-state.edu/~srini/674/assoc1.ppt

pp. 1 – 17, covering

What is an association rule?

What are interestingness measures for association rules?

support, confidence (there are many further measures)

How is association-rule mining performed?

the basic apriori algorithm
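To make the three questions above concrete, here is a small self-contained sketch of support, confidence and a basic level-wise Apriori search. It is an illustration under stated assumptions, not the tutorial's code: the session data and the names `apriori` and `rules` are hypothetical, and support is taken as the fraction of sessions containing an itemset, confidence as support(X ∪ Y) / support(X).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while level:
        current = {c: s for c in level if (s := support(c)) >= min_support}
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    out = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = s / frequent[ante]  # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    out.append((set(ante), set(itemset - ante), s, conf))
    return out

# Hypothetical sessions, each a set of page concepts.
sessions = [
    {"home", "catalog", "cart"},
    {"home", "catalog"},
    {"home", "storefinder"},
    {"catalog", "cart"},
]
frequent = apriori(sessions, min_support=0.5)
found = rules(frequent, min_confidence=0.9)
```

With these toy sessions, the only rule that reaches a confidence of 0.9 is cart → catalog: every session containing the cart also contains the catalog.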




In the preparation of a log file (recommendations for open-source tools are shown in green)

1. Use qualitative methods for application understanding (read!)
2. Inspect the site and the URLs for data understanding
   1. Generate Analog reports for getting base statistics of usage
   2. Build concept system / hierarchy and mapping: URLs → concepts (notation: WUMprep regex)
3. Use WUMprep for data preparation
   1. Remove unwanted entries (pictures etc.)
   2. Sessionize
   3. Remove robots
   4. Replace URLs by concepts
   5. (Build a database)
4. Use WEKA for modelling
   1. [ Transform log file into ARFF (WUMprep4WEKA) ]
   2. Cluster, classify, find association rules, ...
5. Use WUM for modelling
6. Select patterns based on objective interestingness measures (support, confidence, lift, ...) and on subjective interestingness measures (unexpected? application-relevant?)
7. Present results in tabular, textual and graphical form (use Excel, ...)
8. Interpret the results
9. Make recommendations for site improvement etc.
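The data-preparation sub-steps (remove unwanted entries, sessionize, remove robots, replace URLs by concepts) can be sketched in plain Python. This is not WUMprep and does not use its configuration or regex notation; it is a minimal stand-in that assumes combined-format log lines, a 30-minute inactivity timeout, robot detection by user-agent keyword or a request for /robots.txt, and a hypothetical URL-to-concept mapping for an imaginary shop site.

```python
import re
from datetime import datetime, timedelta

LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)
UNWANTED = re.compile(r'\.(gif|jpe?g|png|css|js|ico)(\?.*)?$', re.I)
ROBOT_AGENT = re.compile(r'bot|crawler|spider', re.I)
# Hypothetical concept mapping; a real site needs its own hierarchy.
CONCEPTS = [
    (re.compile(r'^/$|^/index'), 'HOME'),
    (re.compile(r'^/products/'), 'PRODUCT'),
    (re.compile(r'^/cart'), 'CART'),
]
TIMEOUT = timedelta(minutes=30)  # assumed session timeout

def parse(line):
    m = LOG_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec['time'] = datetime.strptime(rec['time'], '%d/%b/%Y:%H:%M:%S %z')
    rec['agent'] = rec['agent'] or ''
    return rec

def to_concept(url):
    for pattern, concept in CONCEPTS:
        if pattern.search(url):
            return concept
    return 'OTHER'

def is_robot(session):
    return any(ROBOT_AGENT.search(r['agent']) or r['url'] == '/robots.txt'
               for r in session)

def prepare(lines):
    records = [r for r in map(parse, lines) if r]
    # 1. Remove unwanted entries (pictures, style sheets, scripts).
    records = [r for r in records if not UNWANTED.search(r['url'])]
    # 2. Sessionize: same host and agent, gaps of at most TIMEOUT.
    records.sort(key=lambda r: (r['host'], r['agent'], r['time']))
    sessions, last = [], None
    for r in records:
        same_visitor = (last is not None and
                        (last['host'], last['agent']) == (r['host'], r['agent']))
        if not same_visitor or r['time'] - last['time'] > TIMEOUT:
            sessions.append([])
        sessions[-1].append(r)
        last = r
    # 3. Remove robot sessions.
    sessions = [s for s in sessions if not is_robot(s)]
    # 4. Replace URLs by concepts.
    return [[to_concept(r['url']) for r in s] for s in sessions]

# Hypothetical log excerpt.
LOG = [
    '10.0.0.1 - - [21/May/2009:10:00:00 +0200] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [21/May/2009:10:00:01 +0200] "GET /img/logo.gif HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [21/May/2009:10:01:00 +0200] "GET /products/42 HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [21/May/2009:11:30:00 +0200] "GET /cart HTTP/1.1" 200 300 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [21/May/2009:10:00:05 +0200] "GET /robots.txt HTTP/1.1" 200 50 "-" "ExampleBot/1.0"',
    '10.0.0.2 - - [21/May/2009:10:00:06 +0200] "GET /products/42 HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
]
sessions = prepare(LOG)
```

The resulting lists of concepts per session are the kind of input that the modelling steps (clustering, classification, association rules) would then work on.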


URLs of the tools

Analog: http://www.analog.cx/

WUMprep: http://www.hypknowsys.de/

WEKA: http://www.cs.waikato.ac.nz/ml/weka/

WUM: http://www.hypknowsys.de/


Short introductions to WUMprep

Lüderitz, S. (2006). Pre-processing of webserver logs for data mining (pp. 30-32). http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/Lecture/OtherSlides/luederitz-presentation1-slides_2006_07_10.pdf

Dettmar, G. (2003). Logfile-Preprocessing using WUMprep. http://warhol.wiwi.hu-berlin.de/~berendt/lehre/2003w/wmi/Student_Presentations/Gebhard_WUMprep.pdf


References / background reading (1)

Data preparation

Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems, 1, 5–32. http://citeseer.ist.psu.edu/cooley99data.html

Spiliopoulou, M., Mobasher, B., Berendt, B., & Nakagawa, M. (2003). A framework for the evaluation of session reconstruction heuristics in Web-usage analysis. INFORMS Journal on Computing, 15, 171-190. http://warhol.wiwi.hu-berlin.de/~berendt/Papers/spiliopoulou_etal_2003.pdf

Web mining

Baldi, P., Frasconi, P., & Smyth, P. (2003). Modeling the Internet and the Web. Probabilistic Methods and Algorithms. Chichester, UK: John Wiley & Sons. http://ibook.ics.uci.edu/

Liu, B. (2006). Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer. http://www.cs.uic.edu/%7Eliub/WebMiningBook.html

A general overview of Web usage mining

Srivastava, J., Desikan, P., & Kumar, V. (2004). Web Mining - Concepts, Applications and Research Directions. In H. Kargupta, A. Joshi, K. Sivakumar, & Y. Yesha (Eds.), Data Mining: Next Generation Challenges and Future Directions (pp. 405-423). Menlo Park, CA: AAAI/MIT Press. (earlier, longer version: http://www.ieee.org.ar/downloads/Srivastava-tut-paper.pdf)


References / background reading (2)

Case study

Teltzrow, M., & Berendt, B. (2003). Web-Usage-Based Success Metrics for Multi-Channel Businesses. In Proceedings of the WebKDD 2003 Workshop - Webmining as a Premise to Effective and Intelligent Web Applications. August 27th, 2003, Washington DC, USA. Held in conjunction with The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. http://warhol.wiwi.hu-berlin.de/~teltzrow/teltzrow_berendt_webkdd03.pdf

Teltzrow, M., Berendt, B., & Günther, O. (2003). Consumer behaviour at multi-channel retailers. In Proceedings of the 4th IBM eBusiness Conference, School of Management, University of Surrey, 9th December 2003. http://warhol.wiwi.hu-berlin.de/~berendt/Papers/teltzrow_berendt_guenther_2003.pdf

