Web Mining and IA_MS..

There are only 10 types of people in the world:

Those who understand binary, and those who don't.

Web Mining and Information Assurance

Dr. Xueping Li

Dept. of Industrial & Information Engineering
University of Tennessee

Outline

Introduction to Web Mining
Web content mining
Web usage mining
Web structure mining
Complex Networks

What Is Data Mining?

Data mining (knowledge discovery in databases):

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Alternative names and their “inside stories”:
  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Source: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques

Why Data Mining? — Potential Applications

Database analysis and decision support
  Market analysis and management
    Target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and management

Other applications
  Text mining (news groups, email, documents) and Web analysis
  Intelligent query answering

Source: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques

Motivation: “Necessity is the Mother of Invention”

What is Web Mining?

Discovering useful information from the World-Wide Web and its usage patterns

Applications
  Web search, e.g., Google, Yahoo, …
  Vertical search, e.g., FatLens, Become, …
  Recommendations, e.g., Amazon.com
  Advertising, e.g., Google, Yahoo
  Web site design, e.g., landing page optimization

Structured vs. Web data mining

Traditional data mining
  Data is structured and relational
  Well-defined tables, columns, rows, keys, and constraints

Web data
  Readily available data, rich in features and patterns
  Text, image, audio, video
  Spontaneous formation and evolution of
    topic-induced graph clusters
    hyperlink-induced communities

Challenges
  Content includes truth, lies, obsolete information, contradictions, …
  Uncontrolled quality, widely distributed, rapidly changing, heterogeneous/complex data types, no consistent semantics or structure within or across objects, etc. (XHTML & XML?)

Size of the Web

Number of pages
  Technically, infinite, because of dynamically generated content
  Lots of duplication (30-40%)
  Best estimate of “unique” static HTML pages comes from search engine claims
    Google = 8 billion, Yahoo = 20 billion
    Lots of marketing hype

Number of unique web sites
  Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)

Growth of the Internet

* Fig. source: Douglas E. Comer, Computer Networks and Internets with Internet Applications, 4e, Pearson Prentice Hall, 2004

Web Mining Taxonomy

Web content mining (WCM)
Web usage mining (WUM)
Web structure mining (WSM)

[Figure: tree with root "Web Mining" and three branches: Web Content Mining, Web Usage Mining, Web Structure Mining]

WCM & WUM

Main source of the data: Log files

The main source of data about the activity of our web server is its log files.

Typical line of a log file:

2005-05-29 04:13:40 128.2.215.4 - W3SVC1 WM 160.36.231.167 80 GET /Kdd/wm/wm.zip - 206 64 1507568 551 1816312 HTTP/1.1 www.utk.edu Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+NT+5.0) - http://li.utk.edu/kdd/wm

E.g., log files on WinNT/2000 reside in the \winnt\system32\logfiles\ system directory.
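As a rough illustration of how such a log line can be turned into a record for analysis, the sketch below splits the sample line on whitespace and pairs each value with a field name. The field names are an assumption based on a common W3C extended log layout; the slide does not show the "#Fields:" header that would define them, so the mapping may differ on a real server.

```python
# Field names are ASSUMED (a common W3C extended log layout). The slide does
# not show the "#Fields:" header, so treat this mapping as hypothetical.
ASSUMED_FIELDS = [
    "date", "time", "c-ip", "cs-username", "s-sitename", "s-computername",
    "s-ip", "s-port", "cs-method", "cs-uri-stem", "cs-uri-query",
    "sc-status", "sc-win32-status", "sc-bytes", "cs-bytes", "time-taken",
    "cs-version", "cs-host", "cs(User-Agent)", "cs(Cookie)", "cs(Referer)",
]

# The sample line from the slide, joined back into a single line.
LOG_LINE = (
    "2005-05-29 04:13:40 128.2.215.4 - W3SVC1 WM 160.36.231.167 80 GET "
    "/Kdd/wm/wm.zip - 206 64 1507568 551 1816312 HTTP/1.1 www.utk.edu "
    "Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+NT+5.0) - "
    "http://li.utk.edu/kdd/wm"
)


def parse_log_line(line, fields=ASSUMED_FIELDS):
    """Split a space-separated log line and label each value."""
    values = line.split()
    if len(values) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, found {len(values)}"
        )
    return dict(zip(fields, values))


record = parse_log_line(LOG_LINE)
print(record["c-ip"], record["cs-method"], record["cs-uri-stem"],
      record["sc-status"])
```

Splitting on whitespace works here because spaces inside values (such as the user agent) are encoded as "+" in this log style.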

What kind of problems do we solve?

Personalization of web services:
  Preparing offers (discounts, products, contents) customized for each particular user

Understanding what is going on at the web server:
  Customer group identification, behavioral patterns
  …the goal is to better organize web services
  …optimization of site navigation

Better selection of banner ads to increase the probability that the user clicks on them
  …it is not hard to increase the probability

Building psychological profiles based on the texts read by the user
  …to get more information about the user than he has about himself

Etc.

Data analysis methods

Log files include sequences of events (click-streams):

…methods for analyzing event sequences are usually modified classical methods from the area of Data-Mining for analysis of very large databases

Basic methods are modified methods for induction of association rules, clustering, decision trees

Other analytic methods are from the areas of Text-Mining, Statistics and Machine-Learning

Fig. A General Architecture for Web Usage Mining

WUM: Web Usage Miner

Main goal: navigation pattern discovery
  Sequences of pages through the website
  Typical patterns
  Optimization of site navigation

Three steps
  Log file cleaning
  Pattern analysis
  Visualization
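One simple way to look for navigation patterns is to count how often visitors move from one page directly to another. The sketch below is not the WUM tool itself. It assumes the log has already been cleaned and grouped into sessions, and the page names are invented for illustration.

```python
from collections import Counter

# Hypothetical, already-cleaned sessions: each is the ordered list of pages
# one visitor requested. The page names are invented for illustration.
sessions = [
    ["/", "/products", "/products/p1", "/order"],
    ["/", "/products", "/products/p2"],
    ["/", "/about"],
    ["/products", "/products/p1", "/order"],
]


def transition_counts(sessions):
    """Count direct page-to-page moves across all sessions."""
    counts = Counter()
    for pages in sessions:
        # zip(pages, pages[1:]) yields each consecutive (from, to) pair.
        counts.update(zip(pages, pages[1:]))
    return counts


counts = transition_counts(sessions)
for (src, dst), n in counts.most_common(3):
    print(f"{src} -> {dst}: {n}")
```

Frequent transitions hint at typical paths through the site, while rarely followed links may be candidates for navigation redesign.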

Association rules example

Items = {milk, coke, pepsi, beer, juice}. Support threshold = 3 baskets.

B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}

Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
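The frequent itemsets for this example can be checked with a small brute-force sketch: count, for every candidate itemset, the number of baskets that contain it, and keep those that reach the support threshold of 3. This is a minimal illustration for eight baskets, not an efficient algorithm for large databases.

```python
from itertools import combinations

# The eight baskets from the slide (m=milk, c=coke, p=pepsi, b=beer, j=juice).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
MIN_SUPPORT = 3


def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            support = sum(1 for b in baskets if set(candidate) <= b)
            if support >= min_support:
                result[frozenset(candidate)] = support
                found_any = True
        if not found_any:
            # If no itemset of this size is frequent, no larger one can be.
            break
    return result


frequent = frequent_itemsets(baskets, MIN_SUPPORT)
for itemset, support in sorted(frequent.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```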

Association Rules

If-then rules about the contents of baskets.

{i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it is likely to contain j.”

Confidence of this association rule is the probability of j given i1,…,ik.

Association rules example (cont.)

B1 = {m, c, b} B2 = {m, p, j}

B3 = {m, b} B4 = {c, j}

B5 = {m, p, b} B6 = {m, c, b, j}

B7 = {c, b, j} B8 = {b, c}

An association rule: {m, b} → c. Confidence = 2/4 = 50%.

Four baskets contain both m and b (B1, B3, B5, B6); two of them (B1, B6) also contain c.
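The confidence of the rule {m, b} → c can be computed directly from its definition as a conditional probability. The sketch below is a minimal illustration over the eight example baskets; the function name is chosen here for illustration.

```python
# The eight baskets from the example (m=milk, c=coke, p=pepsi, b=beer, j=juice).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also hold `consequent`."""
    antecedent = set(antecedent)
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        raise ValueError("no basket contains the antecedent")
    with_both = [b for b in with_antecedent if consequent in b]
    return len(with_both) / len(with_antecedent)


conf_mb_c = confidence(baskets, {"m", "b"}, "c")
print(f"confidence({{m, b}} -> c) = {conf_mb_c:.0%}")
```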

Association rules in Web-logs

Searching for rules that connect two or more events, e.g.

60% of the users that visited URL/company/product, also visited company/product/product1.html

30% of the users that visited URL/company/special-offer/ also visited company/product2.html

Profiling using time dimension

Searching for rules that connect two or more events taking into account time dimension:

30% of the users that visited URL/company/product/product1.html also searched in the last week words W1 and W2 on Yahoo

60% of the users that ordered product1 also ordered product2 within the next 15 days

Classification rules

Identification of behavior for groups of users - additional information can be obtained from cookies, registration, etc.:

Users that frequently visit page /company/products/product3.html are from educational institutions

50% of the users that visited /company/products/product4.html are in age group of 20-25 and live at the sea coast

Real-Time Data-Analysis

At some web servers there are too many hits to be saved and analyzed off-line:

…we have a data stream – no time or space for off-line data analysis (e.g. search engines, shops, banks, news, …)

…we would like to understand what is going on to detect e.g. anomalies or changes in trends

The solution is to use a special type of method for online event analysis:

Methods are able to analyze non-stationary data
At each moment the results (models) are in human-readable form (e.g. decision trees, rules, …)
…no need to save log files
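To make the idea of analysis without stored logs concrete, the sketch below keeps only an exponentially weighted mean and variance of a stream of hit counts and flags values that deviate strongly from them. It is a generic illustration, not a method named in these slides. The class name, the parameters (alpha, threshold, warmup, min_std) and the sample counts are all assumptions.

```python
class OnlineRateMonitor:
    """Tracks an exponentially weighted mean/variance of a data stream.

    Only two numbers are stored, so the raw log never has to be saved.
    All parameter defaults are illustrative assumptions.
    """

    def __init__(self, alpha=0.1, threshold=3.0, warmup=5, min_std=1.0):
        self.alpha = alpha          # weight given to the newest observation
        self.threshold = threshold  # how many std deviations count as unusual
        self.warmup = warmup        # observations to see before flagging
        self.min_std = min_std      # floor that avoids a zero-width band
        self.mean = None
        self.var = 0.0
        self.seen = 0

    def update(self, x):
        """Feed one observation; return True if it looks anomalous."""
        self.seen += 1
        if self.mean is None:
            self.mean = float(x)
            return False
        deviation = x - self.mean
        std = max(self.var ** 0.5, self.min_std)
        is_anomaly = (self.seen > self.warmup
                      and abs(deviation) > self.threshold * std)
        # Old data fades away, so the model can follow non-stationary traffic.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly


# Hypothetical hits-per-minute counts with one burst.
hits_per_minute = [100, 102, 98, 101, 99, 100, 102, 99, 100, 500, 101]
monitor = OnlineRateMonitor()
flags = [monitor.update(h) for h in hits_per_minute]
print(flags)
```

Because old observations are down-weighted rather than stored, the same monitor keeps adapting when the normal traffic level drifts over time.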

Document visualization

Application: From Web log to Web Loyalty

A study done by the Harvard Business School indicates that an increase of 5% in customer loyalty can increase profitability by anywhere from 25% to as much as 80%. (Multimedia Live, 2001)

Web search

[Figure: search engine architecture. A web crawler collects pages from the Web, an indexer builds the indexes (alongside separate ad indexes), and the search component answers user queries. The example screenshot shows a results page for the query "miele" (results 1 - 10 of about 7,310,000, 0.12 seconds) with organic results and sponsored links.]

Search engine components

Spider (a.k.a. crawler/robot): builds the corpus
  Collects web pages recursively
    For each known URL, fetch the page, parse it, and extract new URLs
    Repeat
  Additional pages from direct submissions & other sources

Indexer: creates inverted indexes
  Various policies with respect to which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.

Query processor: serves query results
  Front end: query reformulation, word stemming, capitalization, optimization of Booleans, etc.
  Back end: finds matching documents and ranks them
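The indexer and the query back end can be illustrated with a toy inverted index: a map from each word to the set of documents that contain it. The sketch below assumes a tiny invented corpus and the simplest possible policies (lowercasing, whitespace tokenization, no stemming, no ranking), so it shows the data structure rather than a real engine.

```python
from collections import defaultdict

# Hypothetical mini-corpus; document ids and texts are invented.
docs = {
    1: "Miele vacuum cleaners",
    2: "Discount appliances and vacuum cleaners",
    3: "Miele dishwashers",
}


def tokenize(text):
    """Simplest indexing policy: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


index = build_inverted_index(docs)
print(search(index, "miele vacuum"))
print(search(index, "vacuum cleaners"))
```

A real back end would also rank the matching documents, for example using link structure.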

Typical anatomy of a large-scale crawler.

PageRank

Used by Google
Prioritizes pages returned from a search by looking at Web structure.
Importance of a page is calculated based on the number of pages that point to it (backlinks).
Weighting is used to give more importance to backlinks coming from important pages.

PageRank (cont’d)

PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)

PR(i): PageRank of a page i that points to the target page p.
Ni: number of links coming out of page i.
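The formula can be evaluated iteratively: start with equal ranks, repeatedly apply the sum over backlinks, and rescale so the ranks add up to 1. The sketch below treats c as that rescaling factor, which is an assumption about its role, and uses an invented three-page graph in which every page has at least one outgoing link. It illustrates the simplified formula on this slide, not Google's production algorithm.

```python
def simplified_pagerank(links, iterations=100):
    """Iterate PR(p) = c * sum(PR(i) / N_i) over pages i linking to p.

    `links` maps each page to the list of pages it links to. This sketch
    assumes every page has at least one out-link and every link target is
    itself a key of `links`.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for i, targets in links.items():
            share = rank[i] / len(targets)   # PR(i) / N_i
            for p in targets:
                new_rank[p] += share
        total = sum(new_rank.values())
        # c is taken to be the factor that keeps the ranks summing to 1.
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = simplified_pagerank(links)
for page in sorted(rank):
    print(page, round(rank[page], 3))
```

Page B ends up with the lowest rank because its only backlink comes from A, which splits its importance between two outgoing links.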

WSM

Self-Similarity of Internet Traffic
Internet Invariant
Scale Free Network

Self-Similarity of Internet Traffic (Measured) and Not in Poisson or Ordinary Telephone Traffic

Internet Invariant

FTP transfers: Pareto tail
Interarrival time of packets: heavy-tailed
Connection duration: lognormal
TCP connections/Web session: heavy-tailed
Session duration: Pareto
…

Martin J. Fischer et al., “Analyzing the Waiting Time Process in Internet Queueing Systems With the Transform Approximation Method”

Random Networks (Erdos/Renyi, 1960)

Average path length L ~ ln N: small

Clustering coefficient C ~ 0. C is the probability that any two nodes are connected to each other, given that they are both connected to a common node (the probability that friends of friends are friends).

Regular Networks

High degree of clustering: C ~ 1
Average path length L: large

Small-World Networks

High degree of clustering: C ~ 1
Average path length L: small (due to shortcuts)

D.J.Watts and S.H. Strogatz, Nature 393, pp. 440-442 (1998)

Random, Small-World, and Regular Networks

Network        C       L
Regular        High    Large
Small-World    High    Small
Random         Low     Small

Examples of small-world networks: power grid, Internet, social networks, scientific citation networks, movie-actor networks, etc.
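The two quantities in the table can be computed on a small example. The sketch below builds a regular ring lattice and then adds a few random shortcuts. This is a shortcut-adding variant chosen for simplicity (an assumption), not the rewiring procedure of Watts and Strogatz. The sizes (20 nodes, 4 neighbours, 3 shortcuts) and the random seed are arbitrary.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Regular network: each node links to its k nearest ring neighbours."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k // 2 + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def add_shortcuts(adj, count, seed=0):
    """Return a copy of the network with `count` new random links added."""
    rng = random.Random(seed)
    new = {v: set(nbrs) for v, nbrs in adj.items()}
    nodes = list(new)
    added = 0
    while added < count:
        u, v = rng.sample(nodes, 2)
        if v not in new[u]:          # only count links that are really new
            new[u].add(v)
            new[v].add(u)
            added += 1
    return new


def clustering_coefficient(adj):
    """Average fraction of a node's neighbour pairs that are linked."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        linked = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += linked / (k * (k - 1) / 2)
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over reachable pairs (breadth-first search)."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


regular = ring_lattice(20, 4)
shortcut = add_shortcuts(regular, 3)
for name, net in [("regular", regular), ("with shortcuts", shortcut)]:
    print(name, round(clustering_coefficient(net), 3),
          round(average_path_length(net), 3))
```

Every added shortcut joins two nodes that were at least two steps apart, so the average path length drops while most of the local clustering remains.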

Complex Networks: How are they formed?

Growth
  Starting with a small number of nodes, at every time step a new node with a number (m) of links is added

Preferential attachment
  Barabasi-Albert (BA) model: the probability for node i to acquire a new link is

  $$\Pi_i(k_i) = \frac{k_i}{\sum_j k_j}$$

  where k_i is the number of links (degree) of node i. This results in an algebraic degree distribution

  $$P(k) \sim k^{-r}$$

A. L. Barabasi and R. Albert, Science 286, 509 (1999)

Consequence of Algebraic Degree Distribution

Statistical moments

$$\langle k^n \rangle \sim \int_0^\infty k^n \, k^{-r} \, dk$$

do not exist for n = [r]-1, [r], … where [r] is the smallest integer greater than r: networks have no characteristic scales (Scale-Free Networks)

Examples of SFN: (1) WWW, r(in)~2.1, r(out)~2.4; (2) Internet (r~2.5); (3) network of movie actors (r~2.3); (4) electrical power grid of western US (r~4); (5) scientific citation network (r~3.0)

Alternative Models

For scale-free networks, the preferential attachment probability $\Pi_i(k_i) \sim k_i$ leads to an algebraic degree distribution;

For random networks, the attachment probability does not depend on $k_i$: $\Pi_i(k_i) = \text{constant}$, which leads to an exponential degree distribution: $P(k) \sim e^{-ak}$;

Many realistic networks exhibit scale-free features only to a certain extent. Often, algebraic and exponential distributions are observed in different ranges of k.

How robust is the Internet?

SFNs are robust against random attacks but vulnerable to malicious intentional attacks

Yuhai Tu, How robust is the Internet? Nature, Vol 406, July 2000

More topics

Privacy issues in Web mining
Crawling the web
Web graph analysis
Structured data extraction
Classification and vertical search
Collaborative filtering
Web advertising and optimization
Mining web logs
Systems issues

Hmm, conclusion?

Web-Mining should be used by everybody who offers services on the web and is not satisfied with simple access statistics!

The idea is to make something more out of the data already collected by your computer.

It is expected that Web-Mining will soon become a standard part of a typical web solution.

Marko Grobelnik http://www-ai.ijs.si/MarkoGrobelnik/ Institut Jožef Stefan

Acknowledgements & References

Fowler, T. B., “A Short Tutorial on Fractals and Internet Traffic,” The Telecommunication Review, Volume 10, Mitretek Systems, McLean, VA, pp. 1-14, 1999.

Bastian Germershaus, “Integration of association rules into WUM”

Gao Kun, “Analysis Techniques of Discovered Patterns”

Mengdan Yu, “Mining E-Business Gold”

Stanford CS345, “Data Mining”

Thanks~~~
