Page 1: 5DM_kdse.ppt

Knowledge discovery & data mining

Towards KD Support Environments

Fosca Giannotti and Dino Pedreschi

Pisa KDD Lab, CNUCE-CNR & Univ. Pisa

http://www-kdd.di.unipi.it/

A tutorial @ EDBT2000

Page 2: 5DM_kdse.ppt

Konstanz, 27-28.3.2000 EDBT2000 tutorial - KDSE 2

Module outline

Data analysis and KD Support Environments

Data mining technology trends
  from tools … to suites … to solutions

Towards data mining query languages

DATASIFT: a logic-based KDSE

Future research challenges

Page 3: 5DM_kdse.ppt

Vertical applications

We outlined three classes of vertical data analysis applications that can be tackled using KDD & DM techniques:

  Fraud detection
  Market basket analysis
  Customer segmentation

Page 4: 5DM_kdse.ppt

Why are these applications challenging?

Require manipulation and reasoning over knowledge and data at different abstraction levels:

conceptual
  semantic integration of domain knowledge, expert (business) rules and extracted knowledge
  semantic integration of different analysis paradigms

logical/physical
  interoperability with external components: DBMSs, data mining tools, desktop tools
  querying/mining optimization: loose vs. tight coupling between query language and specialized mining tools

Page 5: 5DM_kdse.ppt

Why are these applications challenging?

The associated KDD process needs to be carefully specified, tuned and controlled

[Figure: the KDD process. Data Sources -> Data Consolidation -> Consolidated Data (Warehouse) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (e.g. p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]

Page 6: 5DM_kdse.ppt

Why are these applications challenging?

Still not properly supported by available KDD technology

  what is offered: horizontal, customizable toolkits/suites of data mining primitives
  what is needed: KD support environments for vertical applications

Page 7: 5DM_kdse.ppt

Data mining vs. traditional software development process

Traditional
  Focus on knowledge: transfer, design and coding
  30% - analysis and design
  70% - program design, coding and testing
  Prototyping - expensive
  Development process has few loops
  Maintenance requires human analysis

Data mining
  Focus on data: selection, representation and search
  70% - data preparation
  30% - model generation and testing
  Prototyping - cheap
  Development process is inherently iterative
  Maintenance requires re-learning the model

Page 8: 5DM_kdse.ppt

From R. Agrawal’s invited lecture @ KDD’99

The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists.

[Figure: the Chasm separating the Early Market from the Mainstream Market]

Page 9: 5DM_kdse.ppt

Is data mining in the chasm?

  Perceived to be sophisticated technology, usable only by specialists
  Long, expensive projects
  Stand-alone, loosely-coupled with data infrastructures
  Difficult to infuse into existing mission-critical applications

Page 10: 5DM_kdse.ppt

Module outline

Data analysis and KD Support Environments

Data mining technology trends
  from tools … to suites … to solutions

Towards data mining query languages

DATASIFT: a logic-based KDSE

Future research challenges

Page 11: 5DM_kdse.ppt

Generation 1: data mining tools

~1980: first generation of DM systems
  research-driven tools for single tasks, e.g.
    build a decision tree - say C4.5
    find clusters - say Autoclass (Cheeseman 88)
    …

Difficult to use more than one tool on the same data – lots of data/metadata transformation

Intended user: a specialist, technically sophisticated.

Page 12: 5DM_kdse.ppt

Generation 2: data mining suites

~1995: second generation of DM systems
  toolkits for multiple tasks with support for data preparation and interoperability with DBMS, e.g.
    SPSS Clementine
    IBM Intelligent Miner
    SAS Enterprise Miner
    SFU DBMiner

Intended user: data analyst – suites require significant knowledge of statistics and databases

Page 13: 5DM_kdse.ppt

Growth of DM tools (source: kdnuggets.com)

From G. Piatetsky-Shapiro. The data-mining industry coming of age. IEEE Intelligent Systems, Dec. 1999.

Page 14: 5DM_kdse.ppt

Generation 3: data mining solutions

Beginning at the end of the 1990s
  vertical data mining-based applications and solutions oriented to solving one specific business problem, e.g.
    detecting credit card fraud
    customer retention
    …

Address entire KDD process, and push result into a front-end application

Intended user: business user – the interfaces hide the data mining complexity

Page 15: 5DM_kdse.ppt

Emerging short-term technology trends

Tighter interoperability by means of standards which facilitate the integration of data mining with other applications:
  KDD process: e.g. the Cross-Industry Standard Process for Data Mining model (www.crisp-dm.org)
  representation of mining models: e.g., the PMML - predictive modeling markup language (www.dmg.org)
  DB interoperability: the Microsoft OLE DB for data mining interface

Page 16: 5DM_kdse.ppt

Approaches in data mining suites

  Database-oriented approach: IBM Intelligent Miner
  OLAP-based mining: DBMiner - Jiawei Han’s group @ SFU
  Machine learning: CART, ID3/C4.5/C5.0, Angoss Knowledge Studio
  Statistical approaches: The SAS Institute Enterprise Miner
  Visualization approach: SGI MineSet, VisDB (Keim et al. 94)

Page 17: 5DM_kdse.ppt

Other approaches in data mining suites

  Neural network approach: Cognos 4thoughts, NeuroRule (Lu et al.’95)
  Deductive DB integration: KnowledgeMiner (Shen et al.’96), Datasift (Pisa KDD Lab - see refs)
  Rough sets, fuzzy sets: Datalogic/R, 49er
  Multi-strategy mining: INLEN, KDW+, Explora

Page 18: 5DM_kdse.ppt

SFU DBMiner: OLAP-centric mining

[Screenshot: DBMiner user interface, with areas labelled Warehouse, Workplace, Active Object Elements and Active Object]

Page 19: 5DM_kdse.ppt

IBM Intelligent Miner – DB-centric mining

[Screenshot: Intelligent Miner user interface, with areas labelled Mining Base Container, Contents Container and Work Area]

Page 20: 5DM_kdse.ppt

IBM – IM architecture

Page 21: 5DM_kdse.ppt

Angoss Knowledge Studio: ML-centric mining

[Screenshot: Knowledge Studio user interface, with areas labelled Project Outline, Work Area and Additional Visualizations]

Page 22: 5DM_kdse.ppt

KS project outline tool

(Limited) support to the KDD process

Page 23: 5DM_kdse.ppt

Support for data consolidation step

DBMiner
  ODBC databases – SQL + SmartDrives
  Single database – multiple tables
  Consolidation of heterogeneous sources unsupported

Intelligent Miner
  DB2 and text – SQL without SmartDrives
  Multiple databases
  Consolidation of heterogeneous sources supported

Knowledge Studio
  ODBC databases and text
  Single table
  Consolidation of heterogeneous sources unsupported

Page 24: 5DM_kdse.ppt

Support for selection and preprocessing

DBMiner
  SQL only

Intelligent Miner
  SQL + standard and advanced statistical functionalities

Knowledge Studio
  descriptive statistics

Page 25: 5DM_kdse.ppt

Support for data mining step

DBMiner
  Association rules
  Decision trees
  Prediction

Intelligent Miner
  Association rules
  Sequential patterns
  Clustering
  Classification
  Prediction
  Similar time series

Knowledge Studio
  Decision trees
  Clustering
  Prediction

Page 26: 5DM_kdse.ppt

Support for interpretation and evaluation

  Predefined interestingness measures
  Emphasis on visualization
  Limited export capability of analysis results
  Gain charts for comparison of predictive models (KS and IM)
  Limited model combination capabilities (KS)

Page 27: 5DM_kdse.ppt

Module outline

Data analysis and KD Support Environments

Data mining technology trends
  from tools … to suites … to solutions

Towards data mining query languages

DATASIFT: a logic-based KDSE

Future research challenges

Page 28: 5DM_kdse.ppt

Data Mining Query Languages

A DMQL can provide the ability to support ad-hoc and interactive data mining

Hope: achieve the same effect that SQL had on relational databases.

Various proposals:
  DMQL (Han et al 96)
  mine operator (Meo et al 96)
  M-SQL (Imielinski et al 99)
  query flocks (Tsur et al 98)

Page 29: 5DM_kdse.ppt

MINE operator of (Meo et al 96)

Page 30: 5DM_kdse.ppt

References - DMQL

J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A Data Mining Query Language for Relational Databases. In Proc. 1996 SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pp. 27-33, Montreal, Canada, June 1996.

R. Meo, G. Psaila, S. Ceri. A New SQL-like Operator for Mining Association Rules. In Proc. VLDB96, 1996 Int. Conf. Very Large Data Bases, Bombay, India, pp. 122-133, Sept. 1996.

T. Imielinski and A. Virmani. MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.

S. Tsur, J. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov. Query flocks: a generalization of association rule mining. In Proc. 1998 ACM-SIGMOD, p. 1-12, 1998.

Page 31: 5DM_kdse.ppt

Module outline

Data analysis and KD Support Environments

Data mining technology trends
  from tools … to suites … to solutions

Towards data mining query languages

DATASIFT: a logic-based KDSE

Future research challenges

Page 32: 5DM_kdse.ppt

DATASIFT - towards a logic-based KDSE

DATASIFT is LDL++ (Logic Data Language, MCC & UCLA) extended with mining primitives (decision trees & association rules)

LDL++ syntax: Prolog-like deductive rules

LDL++ semantics: SQL extended with recursion (and more)

Integration of deduction and induction

Employed to systematically develop the methodology for MBA (market basket analysis) and audit planning

See Pisa KDD Lab references

Page 33: 5DM_kdse.ppt

Our position

A suitable integration of
  deductive reasoning (logic database languages)
  inductive reasoning (association rules & decision trees)
provides a viable solution to high-level problems in knowledge-intensive data analysis applications

Page 34: 5DM_kdse.ppt

Our goal

Demonstrate how we support
  design and control of the overall KDD process and the incorporation of background knowledge
    data preparation
    knowledge extraction
    post-processing and knowledge evaluation
  business rules
  autofocus datamining

Page 35: 5DM_kdse.ppt

With respect to other DMQL’s

extending logic query languages yields extra expressiveness, needed to bridge the gap between
  data mining (e.g., association rule mining) and
  vertical applications (e.g., market basket analysis)

Page 36: 5DM_kdse.ppt

Architecture - client agent

User interface

Access to business rules and visualization of results through
  web browser to control interaction
  MS Excel objects (sheets and charts) to represent output of analysis (association rules)

Page 37: 5DM_kdse.ppt

Architecture - server agent

A query engine (mediator)
  record previous analyses
  Metadata/meta knowledge
  interaction with other components

LDL++ server extended with external calls to DBMSs and to …

Inductive modules
  Apriori
  classifiers (decision trees)

Coupling with DBMS using the Cache-mine approach
  Performance comparable with SQL-based approaches on same mining queries (Giannotti et al 2000)

Page 38: 5DM_kdse.ppt

Deductive rules in LDL++

A small database of cash register transactions

  basket(1,fish).    basket(2,bread).    basket(3,bread).
  basket(1,bread).   basket(2,milk).     basket(3,orange).
                     basket(2,onions).   basket(3,milk).
                     basket(2,fish).

E.g.: select transactions involving milk

  milk_basket(T,I) ← basket(T,I), basket(T,milk).

Querying ?- milk_basket(T,I)

  milk_basket(2,bread).    milk_basket(3,bread).
  milk_basket(2,milk).     milk_basket(3,orange).
  milk_basket(2,onions).   milk_basket(3,milk).
  milk_basket(2,fish).
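
The milk_basket rule is a self-join of basket on the transaction identifier T. As a rough illustration outside LDL++, the Python sketch below evaluates the same rule over the example database. The names BASKET and milk_basket are chosen here for illustration and are not part of DATASIFT.

```python
# Facts basket(T, I) from the example database of cash register transactions.
BASKET = {
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
}


def milk_basket(basket):
    """Evaluate milk_basket(T,I) <- basket(T,I), basket(T,milk)."""
    # Transactions that contain milk (second body literal).
    milk_transactions = {tid for (tid, item) in basket if item == "milk"}
    # All items of those transactions (first body literal, joined on T).
    return {(tid, item) for (tid, item) in basket if tid in milk_transactions}


result = milk_basket(BASKET)
```

Transaction 1 contains no milk, so none of its items appear in the result; every item of transactions 2 and 3 does.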

Page 39: 5DM_kdse.ppt

Aggregates in LDL++

A small database of cash register transactions

  basket(1,fish).    basket(2,bread).    basket(3,bread).
  basket(1,bread).   basket(2,milk).     basket(3,orange).
                     basket(2,onions).   basket(3,milk).
                     basket(2,fish).

E.g.: count occurrences of pairs of distinct items in all transactions

  pair(I1,I2,count<T>) ← basket(T,I1), basket(T,I2), I1 ≠ I2.

  (count<T> is an aggregate)

Querying ?- pair(fish,bread,N)

  pair(fish,bread,2)   (i.e., N=2)

Aggregates are the logical interface between deductive and inductive environment.
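
To make the count aggregate concrete, here is a minimal Python sketch of the same pair-counting computation. It is an illustration under the assumption that count<T> counts the distinct transactions in which both items occur; pair_counts and BASKET are names invented for this example.

```python
from collections import Counter
from itertools import permutations

# Facts basket(T, I) from the example database.
BASKET = {
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
}


def pair_counts(basket):
    """For each ordered pair of distinct items, count the transactions containing both."""
    items_by_tid = {}
    for tid, item in basket:
        items_by_tid.setdefault(tid, set()).add(item)
    counts = Counter()
    for items in items_by_tid.values():
        # Ordered pairs: the rule derives both pair(a,b,N) and pair(b,a,N).
        counts.update(permutations(sorted(items), 2))
    return counts


counts = pair_counts(BASKET)
n_fish_bread = counts[("fish", "bread")]
```

Fish and bread occur together in transactions 1 and 2, which matches the answer N=2 on the slide.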

Page 40: 5DM_kdse.ppt

Association rules in LDL++

E.g., compute one-to-one association rules with at least 40% support

  rules(patterns<0.4,0,{I1,I2}>) ← basket(T,I1), basket(T,I2).

  basket(1,fish).    basket(2,bread).    basket(3,bread).
  basket(1,bread).   basket(2,milk).     basket(3,orange).
                     basket(2,onions).   basket(3,milk).
                     basket(2,fish).

patterns is the aggregate interfacing the computation of association rules

  patterns<min_supp, min_conf, trans_set>

Page 41: 5DM_kdse.ppt

Association rules in LDL++

Result of the query ?- rules(X,Y,S,C)

  rules({milk},{bread},0.66,1)       i.e. milk ⇒ bread [0.66,1]
  rules({bread},{milk},0.66,0.66)
  rules({fish},{bread},0.66,1)
  rules({bread},{fish},0.66,0.66)

Same status for data and induced rules

  basket(1,fish).    basket(2,bread).    basket(3,bread).
  basket(1,bread).   basket(2,milk).     basket(3,orange).
                     basket(2,onions).   basket(3,milk).
                     basket(2,fish).
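
The four rules can be checked by hand: support is the fraction of transactions containing both items, and confidence is that count divided by the number of transactions containing the left-hand item. The Python sketch below recomputes them by brute force. It is not the Apriori-based module of DATASIFT; one_to_one_rules and its parameters are hypothetical names for this illustration.

```python
from itertools import permutations

# Facts basket(T, I) from the example database.
BASKET = {
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
}


def one_to_one_rules(basket, min_supp, min_conf):
    """Return {(left, right): (support, confidence)} for single-item rules."""
    items_by_tid = {}
    for tid, item in basket:
        items_by_tid.setdefault(tid, set()).add(item)
    n_transactions = len(items_by_tid)

    item_count = {}
    pair_count = {}
    for items in items_by_tid.values():
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in permutations(sorted(items), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = {}
    for (left, right), count in pair_count.items():
        support = count / n_transactions
        confidence = count / item_count[left]
        if support >= min_supp and confidence >= min_conf:
            rules[(left, right)] = (support, confidence)
    return rules


rules = one_to_one_rules(BASKET, min_supp=0.4, min_conf=0.0)
```

With three transactions, only pairs occurring in two of them reach the 40% threshold; their support is 2/3, which the slide shows truncated to 0.66.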

Page 42: 5DM_kdse.ppt

Reasoning on item hierarchies

Item hierarchy levels: Product (item), Family, Sector, Department

Which rules survive/decay up/down the item hierarchy?

  rules_at_level(I,pattern<S,C,Itemset>) ← itemset_abstraction(I,Tid,Itemset).

  preserved_rules(Left,Right) ←
      rules_at_level(I,Left,Right,_,_),
      rules_at_level(I+1,Left,Right,_,_).

Page 43: 5DM_kdse.ppt

Business rules: reasoning on promotions

Which rules are established by a promotion?

  interval(before, -, 3/7/1998).
  interval(promotion, 3/8/1998, 3/30/1998).
  interval(after, 3/31/1998, +).

  established_rules(Left, Right) ←
      not rules_partition(before, Left, Right, _, _),
      rules_partition(promotion, Left, Right, _, _),
      rules_partition(after, Left, Right, _, _).
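
Read as set operations, a rule is established by the promotion if it holds during and after the promotion but did not hold before it. A minimal Python sketch of that reading follows; the rule sets per interval are invented for this example and established_rules is an illustrative name, not the DATASIFT predicate itself.

```python
def established_rules(rules_partition):
    """rules_partition maps an interval name to the set of (left, right) rules mined in it."""
    before = rules_partition["before"]
    during = rules_partition["promotion"]
    after = rules_partition["after"]
    # Holds during and after the promotion, and did not hold before it.
    return (during & after) - before


# Hypothetical mined rules per interval (not taken from the tutorial).
example = {
    "before": {("pasta", "cheese")},
    "promotion": {("pasta", "cheese"), ("bread", "milk"), ("fish", "wine")},
    "after": {("pasta", "cheese"), ("bread", "milk")},
}

established = established_rules(example)
```

In this toy data only bread ⇒ milk is established: pasta ⇒ cheese already held before the promotion, and fish ⇒ wine did not survive it.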

Page 44: 5DM_kdse.ppt

Business rules: temporal reasoning

How does rule support change along time?

[Figure: chart of rule support over time. x-axis: days from 25/11/97 to 05/12/97; y-axis: Support, 0 to 35. Legend: Pasta => Fresh Cheese 14; Bread Subsidiaries => Fresh Cheese 28; Biscuits => Fresh Cheese 14; Fresh Fruit => Fresh Cheese 14; Frozen Food => Fresh Cheese 14]

Page 45: 5DM_kdse.ppt

Decision tree construction in DATASIFT

construct training and test set using rules

  training_set(P,Case_list) ← ...
  test_tuple(ID,F1,...,F20,Rec,Act_rec,CAR) ← ...

construct classifier using external call to C5.0

  tree_rules(Tree_name,P,PF,MC,BO,Rule_list) ←
      training_set(P,Case_list),
      tree_induction(Case_list,PF,MC,BO,Rule_list).

  parameters: pruning factor PF, misclassification costs MC, boosting BO
  tree_induction is the external call
  Rule_list is the induced classifier

Page 46: 5DM_kdse.ppt

Putting decision trees at work

prediction of target variable

  prediction(Tree_name,ID,CAR,Predicted_CAR) ←
      tree_rules(Tree_name,_,_,_,_,Rule_list),
      test_subject(ID,F1,…,F20,_,_,CAR),
      classify(Rule_list,[F1,…,F20],Predicted_CAR).

Model evaluation: actual recovery of a classifier (= sum of the recovery of tuples classified as positive)

  actual_recovery(Tree_name,sum<Actual_Recovery>) ←
      prediction(Tree_name,ID,_,pos),
      test_subject(ID,F1,…,F20,_,Actual_Recovery,_).

  (sum<Actual_Recovery> is an aggregate)
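
The actual recovery measure is a filtered sum: only test subjects that the classifier labels positive contribute their recovery. The Python sketch below illustrates this with made-up identifiers and amounts; the dictionaries stand in for the prediction and test_subject relations and are assumptions of this example.

```python
def actual_recovery(predictions, recovery_by_id):
    """Sum the actual recovery of the test subjects classified as positive.

    predictions:    subject ID -> predicted class ("pos" or "neg")
    recovery_by_id: subject ID -> actual recovery amount
    """
    return sum(
        recovery_by_id[subject_id]
        for subject_id, label in predictions.items()
        if label == "pos"
    )


# Hypothetical test set: three subjects with their actual recovery.
recovery_by_id = {"s1": 1000, "s2": 0, "s3": 250}
predictions = {"s1": "pos", "s2": "pos", "s3": "neg"}

total = actual_recovery(predictions, recovery_by_id)
```

Here the classifier selects s1 and s2, so the recovered amount is 1000; the 250 of s3 is missed because s3 was classified as negative.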

Page 47: 5DM_kdse.ppt

Combining decision trees

Model conjunction:

  tree_conjunction(T1,T2,ID,CAR,pos) ←
      prediction(T1,ID,CAR,pos),
      prediction(T2,ID,CAR,pos).

  tree_conjunction(T1,T2,ID,CAR,neg) ←
      test_subject(ID,F1,…,F20,_,_,CAR),
      ~tree_conjunction(T1,T2,ID,CAR,pos).

More interesting combinations readily expressible: e.g. meta learning (Chan and Stolfo 93)
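
Model conjunction labels a subject positive only when both trees agree on a positive prediction, and negative otherwise. A minimal Python sketch of this combination, with hypothetical per-tree predictions, could look as follows.

```python
def tree_conjunction(pred_t1, pred_t2):
    """Combine two classifiers: 'pos' only if both predict 'pos', else 'neg'.

    pred_t1, pred_t2: subject ID -> predicted class ("pos" or "neg")
    """
    return {
        subject_id: "pos"
        if pred_t1[subject_id] == "pos" and pred_t2.get(subject_id) == "pos"
        else "neg"
        for subject_id in pred_t1
    }


# Hypothetical predictions of two trees on the same test subjects.
pred_t1 = {"s1": "pos", "s2": "pos", "s3": "neg"}
pred_t2 = {"s1": "pos", "s2": "neg", "s3": "neg"}

combined = tree_conjunction(pred_t1, pred_t2)
```

The conjunction is more conservative than either tree: it selects fewer subjects, trading coverage for agreement between the models.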

Page 48: 5DM_kdse.ppt

We proposed ...

a KDD methodology for audit planning:
  define an audit cost model
  monitor training- and test-set construction
  assess the quality of a classifier
  tune classifier construction to specific policies

and its formalization in a prototype logic-based KDSE, supporting:
  integration of deduction and induction
  integration of domain and induced knowledge
  separation of conceptual and implementation level

Page 49: 5DM_kdse.ppt

Module outline

Data analysis and KD Support Environments

Data mining technology trends
  from tools … to suites … to solutions

Towards data mining query languages

DATASIFT: a logic-based KDSE

Future research challenges

Page 50: 5DM_kdse.ppt

A data mining research agenda

1. Integration with data warehouse and relational DB
2. Scalable, parallel/distributed and incremental mining
3. Data mining query language optimization
4. Multiple, integrated data mining methods
5. KDSE and methodological support for vertical applications
6. Interactive, exploratory data mining environments
7. Mining on other forms of data:
     spatio-temporal databases
     text
     multimedia
     web

Page 51: 5DM_kdse.ppt

Scale up!

Scaling up existing algorithms (AI, ML, IR)
  Association rules
  Correlation rules
  Causal relationship
  Classification
  Clustering
  Bayesian networks

Page 52: 5DM_kdse.ppt

Background knowledge & constraints

Incorporating background knowledge and constraints into existing data mining techniques

Double benefit for DMQL: semantics and optimization!

traditional algorithms
  Disproportionate computational cost for selective users
  Overwhelming volume of potentially useless results

need user-controlled focus in mining process
  Association rules containing certain items
  Sequential patterns containing certain patterns
  Classification?

Page 53: 5DM_kdse.ppt

Vertical applications of data mining

More success stories needed!

Current data mining systems lack a thick semantic layer (similarly to the early relational database systems)

Verticalized data mining systems, e.g.
  Market analysis systems
  Fraud detection systems

Automated mining and interactive mining: how far are they?

Page 54: 5DM_kdse.ppt

Autofocus data mining

  policy options, business rules
  selection of data mining function
  fine parameter tuning of mining function

Page 55: 5DM_kdse.ppt

DBMS coupling

Tight-coupling with DBMS

Most data mining algorithms are based on flat file data (i.e. loose-coupling with DBMS)

A set of standard data mining operators (e.g. sampling operator)
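
As a small illustration of tight coupling, the counting step of association rule mining can be pushed into the DBMS as a self-join with GROUP BY, instead of exporting the table to a flat file first. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a real DBMS; the table layout, the data and the threshold are assumptions of this example, and it does not reproduce any specific system discussed in the tutorial.

```python
import sqlite3

# In-memory SQLite database standing in for the DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE basket (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO basket VALUES (?, ?)",
    [
        (1, "fish"), (1, "bread"),
        (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
        (3, "bread"), (3, "orange"), (3, "milk"),
    ],
)

# Frequent item pairs computed inside the DBMS: self-join on the transaction
# id, one row per unordered pair, filtered by a minimum absolute support.
min_count = 2
frequent_pairs = conn.execute(
    """
    SELECT b1.item, b2.item, COUNT(*) AS cnt
    FROM basket AS b1
    JOIN basket AS b2
      ON b1.tid = b2.tid AND b1.item < b2.item
    GROUP BY b1.item, b2.item
    HAVING COUNT(*) >= ?
    ORDER BY b1.item, b2.item
    """,
    (min_count,),
).fetchall()
conn.close()
```

Only the frequent pairs leave the database, which is the point of tight coupling: the data stays where the query optimizer and indices are.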

Page 56: 5DM_kdse.ppt

Web mining – why?

No standards on the web, enormous blob of unstructured and heterogeneous info

Very dynamic
  One new WWW server every 2 hours
  5 million documents in 1995
  320 million documents in 1998

Indices get obsolete very quickly

Better means needed for discovering resources and extracting knowledge

Page 57: 5DM_kdse.ppt

Web mining: challenges

Today's search engines are plagued by problems
  – the abundance problem: 99% of info of no interest to 99% of people!
  – limited coverage of the Web
  – limited query interface based on keyword-oriented search
  – limited customization to individual users

Page 58: 5DM_kdse.ppt

Web mining

Web content mining
  mining what Web search engines find
  Web document classification (Chakrabarti et al 99)
  warehousing a Meta-Web (Zaïane and Han 98)
  intelligent query answering in Web search

Web usage mining
  Web log mining: find access patterns and trends (Zaïane et al 98)
  customized user tracking and adaptive sites (Perkowitz et al 97)

Web structure mining
  discover authoritative pages: a page is important if important pages point to it (Chakrabarti et al 99, Kleinberg 98)
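
The circular definition of authority (a page is important if important pages point to it) is usually resolved by iteration. The Python sketch below is a simplified hubs-and-authorities iteration in the spirit of (Kleinberg 98); the tiny link graph, the fixed number of iterations and the function name hits are assumptions of this example, not details from the tutorial.

```python
def hits(links, iterations=50):
    """Iteratively score pages as hubs and authorities.

    links: page -> list of pages it points to.
    Returns (hub, authority) dictionaries with L2-normalized scores.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score of a page: sum of the authority scores of the pages it links to.
        hub = {page: sum(auth[t] for t in links.get(page, ())) for page in pages}
        # Normalize so the scores stay bounded.
        auth_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        hub_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {page: v / auth_norm for page, v in auth.items()}
        hub = {page: v / hub_norm for page, v in hub.items()}
    return hub, auth


# Hypothetical link graph: pages a and b both point to c, and c points to d.
LINKS = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub, auth = hits(LINKS)
top_authority = max(auth, key=auth.get)
```

Page c ends up as the top authority because two hubs point to it, while a and b, which nobody links to, get authority zero.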

Page 59: 5DM_kdse.ppt

Warehousing a Meta-Web (Zaïane & Han 98)

Meta-Web: summarizes the contents and structure of the Web, which evolves with the Web
  Layer0: the Web itself
  Layer1: the lowest layer of the Meta-Web
    an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc.
  Layer2 and up: summary/classification/clustering

Meta-Web is warehoused and incrementally updated

Querying and mining is performed on or assisted by meta-Web

Is it feasible/sustainable? Is XML of any help?

Page 60: 5DM_kdse.ppt

Meta-Web from Jiawei Han’s panel talk @ SIGMOD99

[Figure: Meta-Web layers Layer0, Layer1, ..., Layern, annotated with Generalized Descriptions and More Generalized Descriptions]

Page 61: 5DM_kdse.ppt

Weblog mining

Web servers register a log entry for every single access.

A huge number of accesses (hits) are registered and collected in an ever-growing web log.

Why warehousing/mining web logs?
  Enhance server performance by learning access patterns of general or particular users (guess what user will ask next and pre-cache!)
  Improve system design of web applications
  Identify potential prime advertisement locations

Greatest peril: the privacy pitfall
  See e.g. (Markoff 99), The Rise of Little Brother.
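
The idea of guessing what a user will ask next can be illustrated with a very simple first-order model: count, for every page, which page followed it most often in past sessions. The Python sketch below assumes the web log has already been split into per-user sessions; the session data and the function names are invented for this example and do not come from any of the cited systems.

```python
from collections import Counter, defaultdict


def next_page_model(sessions):
    """Count page-to-page transitions observed in the sessions."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(model, page):
    """Return the most frequently observed successor of page, or None if unseen."""
    if page not in model:
        return None
    return model[page].most_common(1)[0][0]


# Hypothetical sessions extracted from a web log.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/faq"],
    ["/home", "/products", "/cart"],
]

model = next_page_model(sessions)
candidate = predict_next(model, "/products")
```

A server could pre-cache the predicted page; in this toy log "/cart" follows "/products" in two sessions out of three.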

Page 62: 5DM_kdse.ppt

Some web mining references

M. Perkowitz and O. Etzioni. Adaptive sites: Automatically learning from user access patterns. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.

J. Pitkow. In search of reliable usage data on the WWW. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.

T. Sullivan. Reading reader reaction: A proposal for inferential analysis of web server log files. In Proc. 3rd Conf. Human Factors & the Web, Denver, Colorado, June 1997.

O. R. Zaiane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In Proc. Advances in Digital Libraries Conf. (ADL'98), pages 19-29, Santa Barbara, CA, April 1998.

O. R. Zaiane and J. Han. Resource and knowledge discovery in global information systems: a preliminary design and experiment. In Proc. KDD’95, p. 331-336, 1995.

O. R. Zaiane and J. Han. WebML: querying the world-wide web for resources and knowledge. In Proc. Int. Workshop on Web Information and Data Management (WIDM98), p. 9-12, 1998.

S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, et al. Mining the web’s link structure. COMPUTER, 32:60-67, 1999.

S. Chakrabarti, B. E. Dom, P. Indyk. Enhanced hypertext classification using hyperlinks. In Proc. 1998 ACM-SIGMOD, p. 307-318, 1998.

J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. ACM-SIAM Symp. on Discrete Algorithms, 1998.

J. Markoff. The Rise of Little Brother. Upside, Apr. 1999; http://www.upside.com/texis/mvm/story?id=36d4613c0

Page 63: 5DM_kdse.ppt

Pisa KDD Lab references

F. Giannotti and G. Manco. Making Knowledge Extraction and Reasoning Closer. In Proc. PAKDD'00, The Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kyoto, 2000.

F. Giannotti and G. Manco. Querying Inductive Databases via Logic-Based User Defined Aggregates. In Proc. PKDD'99, The Third Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases. Prague, Sept. 1999.

F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. Using Data Mining Techniques in Fiscal Fraud Detection. In Proc. DaWak'99, First Int. Conf. on Data Warehousing and Knowledge Discovery. Florence, Italy, Sept. 1999.

F. Bonchi , F. Giannotti, G. Mainetto, D. Pedreschi. A Classification-based Methodology for Planning Audit Strategies in Fraud Detection. In Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, San Diego (CA), August 1999.

F. Giannotti, G. Manco, D. Pedreschi and F. Turini. Experiences with a logic-based knowledge discovery support environment. In Proc. 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (SIGMOD'99 DMKD). Philadelphia, May 1999.

F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Integration of Deduction and Induction for Mining Supermarket Sales Data. In Proc. PADD'99, Practical Application of Data Discovery, Int. Conference, London, April 1999.

F. Giannotti, G. Manco, M. Nanni, D. Pedreschi. Nondeterministic, Nonmonotonic Logic Databases. IEEE Trans. on Knowledge and Data Engineering. 2000.

F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Using deduction for intelligent data analysis. Submitted, 2000. http://www-kdd.di.unipi.it/

P. Becuzzi, M. Coppola, S. Ruggieri and M. Vanneschi. Parallelisation of C4.5 as a particular divide and conquer computation. Proc.3rd Workshop on High Performance Data Mining, Springer-Verlag LNCS, 2000.
