Web Mining

Date post: 12-Aug-2015
Upload: mudit-dholakia
Web Mining By:-Mudit Dholakia Guide:-Dr. Amit Ganatra Sir
Transcript
Page 1: Web Mining

Web Mining

By: Mudit Dholakia

Guide: Dr. Amit Ganatra Sir

Page 2: Web Mining

What is web mining?

• Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services.
• Discovering knowledge from and about the WWW is one of the basic abilities of an intelligent agent.

Page 3: Web Mining

[Figure: diagram with the labels "WWW" and "Knowledge"]

Page 4: Web Mining

Web Mining vs. Data Mining

• Structure (or lack of it)
  • Textual information and linkage structure
• Scale
  • Data generated per day is comparable to the largest conventional data warehouses
• Speed
  • Often need to react to evolving usage patterns in real time (e.g., merchandising)

Page 5: Web Mining

Web Mining topics

• Web graph analysis
• Power laws and the Long Tail
• Structured data extraction
• Web advertising
• Systems issues

Page 6: Web Mining

Size of the Web

• Number of pages
  • Technically, infinite
  • Much duplication (30-40%)
  • Best estimate of “unique” static HTML pages comes from search engine claims
    • Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
    • Google recently announced that their index contains 1 trillion pages
• How to explain the discrepancy?

Page 7: Web Mining

The web as a graph

• Pages = nodes, hyperlinks = edges
  • Ignore content
  • Directed graph
• High linkage
  • 10-20 links/page on average
  • Power-law degree distribution
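The graph view above can be sketched in a few lines. This is only an illustration: the page names and the `links` list are made-up toy data, and `degree_counts` and `degree_histogram` are hypothetical helper names, not part of the original slides.

```python
from collections import Counter

# Hypothetical toy link data: each pair is (source page, target page).
links = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("c.html", "a.html"),
    ("d.html", "c.html"),
]

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1   # one more outgoing hyperlink
        in_degree[target] += 1    # one more incoming hyperlink
    return in_degree, out_degree

def degree_histogram(degree):
    """Map each degree value to the number of pages that have it.

    Pages with degree zero never appear in the Counter, so they are
    not included in the histogram.
    """
    return Counter(degree.values())

in_degree, out_degree = degree_counts(links)
in_histogram = degree_histogram(in_degree)
```

On real crawl data, plotting a histogram like `in_histogram` on log-log axes is one way to check for the power-law shape mentioned above.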

Page 8: Web Mining

Structure of Web graph

Page 9: Web Mining

Power-law degree distribution
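
A power-law degree distribution is commonly written as below. The symbols $P(k)$ (fraction of pages with degree $k$), $C$ and $\gamma$ are notation introduced here for illustration; the slides give no formula or exponent value.

```latex
P(k) = C\,k^{-\gamma}, \qquad \gamma > 0
\quad\Longrightarrow\quad
\log P(k) = \log C - \gamma \log k
```

Taking logarithms shows why such a distribution appears as a straight line with slope $-\gamma$ on a log-log plot.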

Page 10: Web Mining

Measures

• Structure
  • In-degrees
  • Out-degrees
  • Number of pages per site
• Usage patterns
  • Number of visitors
  • Popularity, e.g., products, movies, music

Page 11: Web Mining

The Long Tail

Page 12: Web Mining

The Long Tail

• Shelf space is a scarce commodity for traditional retailers
  • Also: TV networks, movie theaters, …
• The web enables near-zero-cost dissemination of information about products
• More choice necessitates better filters
  • Recommendation engines (e.g., Amazon)
  • How Into Thin Air made Touching the Void a bestseller

Page 13: Web Mining

Searching the Web

[Figure: diagram with the labels "Content aggregators", "The Web", and "Content consumers"]

Page 14: Web Mining

Two approaches for analyzing data

• Machine Learning approach
  • Emphasizes sophisticated algorithms, e.g., Support Vector Machines
  • Data sets tend to be small, fit in memory
• Data Mining approach
  • Emphasizes big data sets (e.g., in the terabytes)
  • Data cannot even fit on a single disk!
  • Necessarily leads to simpler algorithms

Page 15: Web Mining

View of mining system

[Figure: several machines side by side, each with its own CPU, memory (Mem), and disk]

Page 16: Web Mining

Issues

• Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server!
  • Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
  • Without breaking the bank!

Page 17: Web Mining

What should it do?

• Finding relevant information
  • Low precision and unindexed information
• Creating new knowledge out of available information on the web
  • A data-triggered process
• Personalizing the information
  • Personal preference in content and presentation of the information
• Learning about the consumers
  • What does the customer want to do?

Page 18: Web Mining

Direct vs Indirect web mining

• Web mining techniques can be used to solve the information overload problem:
  • Directly
    • Address the problem with web mining techniques
    • E.g., a newsgroup agent classifies whether a news item is relevant
  • Indirectly
    • Used as part of a bigger application that addresses the problem
    • E.g., used to create index terms for a web search service

Page 19: Web Mining

Web Mining Categories

• Web Content Mining
  • Discovering useful information from web page contents/data/documents.
• Web Structure Mining
  • Discovering the model underlying link structures (topology) on the Web, e.g., discovering authorities and hubs.
• Web Usage Mining
  • Extraction of interesting knowledge from logging information produced by web servers.
  • Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

Page 20: Web Mining

Types

• Web Mining
  • Web Content Mining
  • Web Structure Mining
  • Web Usage Mining

Page 21: Web Mining

[Figure: IR system, with the labels "Query", "Documents source", and "Ranked Documents"]

[Figure: Clustering system, with the labels "Similarity measure" and "Documents source", and groups of documents as output]

Page 22: Web Mining

Web Content Data Structure

• Web content consists of several types of data
  • Text, image, audio, video, hyperlinks.
• Unstructured – free text
• Semi-structured – HTML
• More structured – data in tables or database-generated HTML pages

Note: much of the Web content data is unstructured text data.

Page 23: Web Mining

Web Content Mining

• Unstructured Documents
  • Bag of words to represent unstructured documents
    • Takes a single word as a feature
    • Ignores the sequence in which words occur
  • Features could be
    • Boolean: a word either occurs or does not occur in a document
    • Frequency based: frequency of the word in a document
  • Variations of the feature selection include
    • Removing case, punctuation, infrequent words, and stop words
  • Features can be reduced using different feature selection techniques:
    • Information gain, mutual information, cross entropy.
    • Stemming, which reduces words to their morphological roots.
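
The bag-of-words representation above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the stop-word list is an assumed, deliberately tiny one, `tokenize` and `bag_of_words` are hypothetical helper names, and stemming and removal of infrequent words are left out.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]

def bag_of_words(text, vocabulary, boolean=False):
    """Return a feature vector aligned with `vocabulary`.

    Word order is ignored. With boolean=True each feature is 1 or 0
    (word present or absent); otherwise it is the word's frequency.
    """
    counts = Counter(tokenize(text))
    if boolean:
        return [1 if counts[term] > 0 else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]

docs = [
    "Web mining is the mining of the Web.",
    "Data mining and text mining.",
]
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})
frequency_vectors = [bag_of_words(doc, vocabulary) for doc in docs]
boolean_vectors = [bag_of_words(doc, vocabulary, boolean=True) for doc in docs]
```

Each document becomes a vector over the shared vocabulary, which is the form that feature selection and classification steps would then work on.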

Page 24: Web Mining

Web Content Mining

• Semi-Structured Documents
  • Uses richer representations for features, due to the additional structural information in the hypertext document (typically HTML and hyperlinks)
  • Uses common data mining methods (whereas unstructured documents might use more text mining methods)
  • Applications: hypertext classification or categorization and clustering, learning relations between web documents, learning extraction patterns or rules, and finding patterns in semi-structured data.

Page 25: Web Mining

Web Content Mining: DB View

• The database techniques on the Web are related to the problems of managing and querying the information on the Web.
• The DB view tries to infer the structure of a Web site or transform a Web site to become a database
  • Better information management
  • Better querying on the Web
• Can be achieved by:
  • Finding the schema of Web documents
  • Building a Web warehouse
  • Building a Web knowledge base
  • Building a virtual database

Page 26: Web Mining

Web Content Mining: DB View

• The DB view mainly uses the Object Exchange Model (OEM)
  • Represents semi-structured data by a labeled graph
  • The data in the OEM is viewed as a graph, with objects as the vertices and labels on the edges
  • Each object is identified by an object identifier (oid), and its value is either atomic or complex
• Process typically starts with manual selection of Web sites for doing Web content mining
• Main applications:
  • The task of finding frequent substructures in semi-structured data
  • The task of creating a multi-layered database
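
One way to picture the OEM description above is as a small labeled graph in code. This is a sketch under assumptions: the class name `OEMObject`, the helper `children`, the "&1"-style oids, and the example.org URLs are all made up for illustration and do not come from any OEM implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Atomic = Union[str, int, float, bool]
Edge = Tuple[str, str]  # (edge label, oid of the target object)

@dataclass
class OEMObject:
    oid: str
    # Atomic objects hold a plain value; complex objects hold a list
    # of labeled edges pointing at other objects.
    value: Union[Atomic, List[Edge]]

    def is_complex(self) -> bool:
        return isinstance(self.value, list)

def children(graph: Dict[str, OEMObject], oid: str, label: str) -> List[OEMObject]:
    """Follow all edges with the given label out of the object `oid`."""
    obj = graph[oid]
    if not obj.is_complex():
        return []
    return [graph[target] for edge_label, target in obj.value if edge_label == label]

# Hypothetical example: one web page object with a title and two links.
objects = [
    OEMObject("&1", [("title", "&2"), ("link", "&3"), ("link", "&4")]),
    OEMObject("&2", "Web Mining"),
    OEMObject("&3", "http://example.org/a.html"),
    OEMObject("&4", "http://example.org/b.html"),
]
graph = {obj.oid: obj for obj in objects}
link_values = [child.value for child in children(graph, "&1", "link")]
```

Because there is no fixed schema, two page objects may carry different sets of edge labels, which is what makes the data semi-structured.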

Page 27: Web Mining
Page 28: Web Mining

Taxonomies

• Ranking
• Graph Search
• Communities
• Hyperlink Induced Topic Search
• SEO
• Hubs & Authorities

Page 29: Web Mining

Web Structure Mining

• Interested in the structure of the hyperlinks within the Web
• Inspired by the study of social networks and citation analysis
• Can discover specific types of pages (such as hubs, authorities, etc.) based on the incoming and outgoing links.
• Applications:
  • Discovering micro-communities in the Web
  • Measuring the “completeness” of a Web site
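
Finding hubs and authorities from incoming and outgoing links can be sketched with the iterative scheme usually associated with Hyperlink Induced Topic Search (HITS). This is a simplified illustration on a made-up edge list; the function names, the fixed iteration count, and the page names are assumptions, and a real system would first select a query-specific subgraph.

```python
import math

def normalize(scores):
    """Scale scores so their squares sum to 1 (no-op if all are zero)."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: value / norm for page, value in scores.items()}

def hits(edges, iterations=50):
    """Return (hub, authority) score dictionaries for a directed edge list."""
    pages = {page for edge in edges for page in edge}
    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        authority = {page: 0.0 for page in pages}
        for source, target in edges:
            authority[target] += hub[source]
        authority = normalize(authority)
        # A page's hub score is the sum of the authority scores it links to.
        hub = {page: 0.0 for page in pages}
        for source, target in edges:
            hub[source] += authority[target]
        hub = normalize(hub)
    return hub, authority

# Hypothetical toy graph: h1..h3 link out, a1 and a2 are linked to.
edges = [
    ("h1", "a1"), ("h1", "a2"),
    ("h2", "a1"), ("h2", "a2"),
    ("h3", "a1"),
]
hub, authority = hits(edges)
```

In this toy graph `a1` ends up with the highest authority score, and `h1` and `h2` score higher as hubs than `h3` because they point at both authorities.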

Page 30: Web Mining

Web Usage Mining

• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  • Web client data
  • Proxy server data
  • Web server data
• Two common approaches
  • Map the usage data of the Web server into relational tables before an adapted data mining technique is applied
  • Use the log data directly by utilizing special pre-processing techniques
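
The first approach, mapping server usage data into relational tables, might look like the sketch below. It assumes log lines in the Common Log Format and uses an in-memory SQLite database as a stand-in for the relational store; the table name `hits`, its columns, and the sample lines are invented for illustration.

```python
import re
import sqlite3

# Regular expression for the Common Log Format (assumed input format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Invented sample lines; the last one is deliberately malformed.
sample_log = [
    '10.0.0.1 - - [12/Aug/2015:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [12/Aug/2015:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 2048',
    '10.0.0.2 - - [12/Aug/2015:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    'malformed line',
]

def load_log(lines):
    """Parse log lines into a relational table, skipping unparseable lines."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # ignore lines that do not fit the assumed format
        conn.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
            (match["host"], match["time"], match["method"],
             match["path"], int(match["status"])),
        )
    return conn

conn = load_log(sample_log)
# Once the data is relational, ordinary SQL can summarize it.
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
```

After this step, standard queries or mining tools that expect tabular input can be applied to the `hits` table.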

Page 31: Web Mining

Web Usage Mining

[Figure: three stages, Pre-Processing → Pattern Discovery → Pattern Analysis, with the outputs "User session file", "Rules and patterns", and "Interesting knowledge"]

Page 32: Web Mining

XML View

[Figure: stacked layers Layer 0, Layer 1, ..., Layer n, labeled "Generalized Descriptions" and "More Generalized Descriptions"]

Page 33: Web Mining

Use of Multi-Layer Meta Web

• Benefits of Multi-Layer Meta-Web:
  • Multi-dimensional Web info summary analysis
  • Approximate and intelligent query answering
  • Web high-level query answering (WebSQL, WebML)
  • Web content and structure mining
  • Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
  • Benefits even if it is partially constructed
  • Benefits may justify the cost of tool development, standardization and partial restructuring

Page 34: Web Mining

Web Search Products and Services

• Alta Vista
• DB2 text extender
• Excite
• Fulcrum
• Glimpse (Academic)
• Google!
• Infoseek Internet
• Infoseek Intranet
• Inktomi (HotBot)
• Lycos
• PLS
• Smart (Academic)
• Oracle text extender
• Verity
• Yahoo!

Page 35: Web Mining

Web Usage Mining

• Typical problems:
  • Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
• Often Usage Mining uses some background or domain knowledge
  • E.g., site topology, Web content, etc.
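
Splitting requests into server sessions is often approximated with a time-gap heuristic, sketched below. The 30-minute timeout, the use of the client host as the user key, and the sample records are all assumptions made for illustration. As the slide notes, caching and proxy servers make host-based identification unreliable, so this is a rough first cut rather than a solution.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed threshold, not from the slides

# Invented sample records: (host, timestamp, requested path).
requests = [
    ("10.0.0.1", datetime(2015, 8, 12, 10, 0), "/index.html"),
    ("10.0.0.1", datetime(2015, 8, 12, 10, 5), "/products.html"),
    ("10.0.0.2", datetime(2015, 8, 12, 10, 6), "/index.html"),
    ("10.0.0.1", datetime(2015, 8, 12, 11, 0), "/index.html"),
]

def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group requests into sessions per host, split on long idle gaps.

    Returns a list of (host, [paths]) tuples in order of session start.
    """
    sessions = []
    current = {}    # host -> index of that host's latest session
    last_seen = {}  # host -> timestamp of that host's latest request
    for host, timestamp, path in sorted(records, key=lambda record: record[1]):
        starts_new = host not in last_seen or timestamp - last_seen[host] > timeout
        if starts_new:
            sessions.append((host, [path]))
            current[host] = len(sessions) - 1
        else:
            sessions[current[host]][1].append(path)
        last_seen[host] = timestamp
    return sessions

sessions = sessionize(requests)
```

Background knowledge such as site topology could refine this further, for example by starting a new session when a request cannot be reached by a link from the previous page.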

Page 36: Web Mining

Web Usage Mining

• Applications:
  • Two main categories:
    • Learning a user profile (personalized)
      • Web users would be interested in techniques that learn their needs and preferences automatically
    • Learning user navigation patterns (impersonalized)
      • Information providers would be interested in techniques that improve the effectiveness of their Web site

Page 38: Web Mining

Thank You

