5DM_kdse.ppt

transcript

Knowledge discovery & data mining

Towards KD Support Environments

Fosca Giannotti and Dino PedreschiPisa KDD Lab

CNUCE-CNR & Univ. Pisahttp://www-kdd.di.unipi.it/

A tutorial @ EDBT2000

Konstanz, 27-28.3.2000 EDBT2000 tutorial - KDSE 2

Module outline

Data analysis and KD Support Environments

Data mining technology trends from tools … … to suites … to solutions

Towards data mining query languagesDATASIFT: a logic-based KDSEFuture research challenges

Vertical applications

We outlined three classes of vertical data analysis applications that can be tackled using KDD & DM techniques

Fraud detection Market basket analysis Customer segmentation

Why are these applications challenging?

Require manipulation and reasoning over knowledge and data at different abstraction levels conceptual

semantic integration of domain knowledge, expert (business) rules and extracted knowledge

semantic integration of different analysis paradigms logical/physical

interoperability with external components: DBMS’s, data mining tools, desktop tools

querying/mining optimization: loose vs. tight coupling between query language and specialized mining tools

The associated KDD processneeds to be carefully specified, tuned and controlled

Selection and Preprocessing

Data Mining

Interpretation and Evaluation

Data Consolidation

Knowledge

p(x)=0.02

Warehouse

Data Sources

Patterns & Models

Prepared Data

ConsolidatedData

Still not properly supported by available KDD technology

what is offered: horizontal, customizable

toolkits/suites of data mining primitives

what is needed: KD support environments for

vertical applications

Traditional Focus on knowledge

transfer, design and coding

30% - analysis and design

70% - program design, coding and testing

Prototyping - expensive Development process

has few loops Maintenance requires

human analysis

Data mining Focus on data

selection, representation and search

70% - data preparation 30% - model generation

and testing Prototyping - cheap Development process is

inherently iterative Maintenance requires

re-learning model

Datamining vs. traditional Sw development process

From R. Agrawal’s invited lecture @ KDD’99

The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists.

Early MarketMainstream Market

Is data mining in the chasm?

Perceived to be sophisticated technology, usable only by specialists

Long, expensive projectsStand-alone, loosely-coupled with

data infrastructuresDifficult to infuse into existing

mission-critical applications

Module outline

Data mining technology trends from tools … … to suites … … to solutions

Generation 1: data mining tools

~1980: first generation of DM systemsresearch-driven tools for single tasks,

e.g. build a decision tree - say C4.5 find clusters - say Autoclass (Cheeseman 88) …

Difficult to use more than one tool on the same data – lots of data/metadata transformation

Intended user: a specialist, technically sophisticated.

Generation 2: data mining suites

~1995: second generation of DM systems

toolkits for multiple tasks with support for data preparation and interoperability with DBMS, e.g. SPSS Clementine IBM Intelligent Miner SAS Enterprise Miner SFU DBMiner

Intended user: data analyst – suites require significant knowledge of statistics and databases

Growth of DM tools (source: kdnuggets.com)

From G. Piatetsky-Shapiro. The data-mining industry coming of age. IEEE Intelligent Systems, Dec. 1999.

Generation 3: data mining solutions

Beginning end of 1990svertical data mining-based applications

and solutions oriented to solving one specific business problem, e.g. detecting credit card fraud customer retention …

Address entire KDD process, and push result into a front-end application

Intended user: business user – the interfaces hid the data mining complexity

Emerging short-term technology trends

Tighter interoperability by means of standards which facilitate the integration of data mining with other applications: KDD process, e.g. the Cross-Industry

Standard Process for Data Mining model (www.crisp-dm.org)

representation of mining models: e.g., the PMML - predictive modeling markup language (www.dmg.org)

DB interoperability: the Microsoft OLE DB for data mining interface

Approaches in data mining suites

Database-oriented approach IBM Intelligent Miner

OLAP-based mining DBMiner - Jiawei Han’s group @ SFU

Machine learning CART, ID3/C4.5/C5.0, Angoss Knowledge

StudioStatistical approaches

The SAS Institute Enterprise Miner.

Visualization approach: SGI MineSet, VisDB (Keim et al. 94).

Other approaches in data mining suites

Neural network approach: Cognos 4thoughts, NeuroRule (Lu et al.’95).

Deductive DB integration: KnowlegeMiner (Shen et al.’96)

Datasift (Pisa KDD Lab - see refs).

Rough sets, fuzzy sets: Datalogic/R, 49er

Multi-strategy mining: INLEN, KDW+, Explora

SFU DBMiner: OLAP-centric mining

Warehouse

Workplace

Active Object

Elements

Active Object

IBM Intelligent Miner – DB-centric mining

Mining Base Container

Contents Container

Work Area

IBM – IM architecture

Angoss Knowledge Studio: ML-centric mining

Project Outline

Work Area

Additional Visualizatio

KS project outline tool

(Limited) support to the KDD process

Support for data consolidation stepDBMiner

ODBC databases – SQL + SmartDrives Single database – multiple tables Consolidation of heterogeneous sources

unsupported Intelligent Miner

DB2 and text – SQL without SmartDrives Multiple databases Consolidation of heterogeneous sources

supportedKnowledge Studio

ODBC databases and text Single table Consolidation of heterogeneous sources

unsupported

Support for sselection and election and ppreprocessingreprocessing

DBMiner SQL only

Intelligent Miner SQL + standard and advanced

statistical functionalities

Knowledge Studio descriptive statistics

Support for data mining step

DBMiner Association rules Decision trees Prediction

Intelligent Miner Associations rules Sequential

patterns Clustering Classification Prediction Similar time series

Knowledge StudioDecision treesClusteringPrediction

Support for interpretation and evaluation

Predefined interestingness measuresEmphasis on visualizationLimited export capability of analysis

resultsGain charts for comparison of

predictive models (KS and IM)Limited model combination capabilities

Module outline

Data Mining Query Languages

A DMQL can provide the ability to support ad-hoc and interactive data mining

Hope: achieve the same effect that SQL had on relational databases.

Various proposals: DMQL (Han et al 96) mine operator (Meo et el 96) M-SQL (Imielinski et al 99) query flocks (Tsur et al 98)

MINE operator of (Meo et al 96)

References - DMQL

J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A Data Mining Query Language for Relational Databases. In Proc. 1996 SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pp. 27-33, Montreal, Canada, June 1996.

R. Meo, G. Psaila, S. Ceri. A New SQL-like Operator for Mining Association Rules. In Proc. VLDB96, 1996 Int. Conf. Very Large Data Bases, Bombay, India, pp. 122-133, Sept. 1996.

T. Imielinski and A. Virmani. MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.

S. Tsur, J. Ulman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov. Query flocks: a generalization of association rule mining. In Proc. 1998 ACM-SIGMOD, p. 1-12, 1998.

Module outline

DATASIFT - towards a logic-based KDSE

DATASIFT is LDL++ (Logic Data Language, MCC & UCLA) extended with mining primitives (decision trees & association rules)

LDL++ syntax: Prolog-like deductive rules

LDL++ semantics: SQL extended with recursion (and more)

Integration of deduction and inductionEmployed to systematically develop the

methodology for MBA and audit planningSee Pisa KDD Lab references

Our position

A suitable integration of deductive reasoning (logic database

languages) inductive reasoning (association rules &

decision trees)

provides a viable solution to high-level problems in knowledge-intensive data analysis applications

Our goal

Demonstrate how we support design and control of the overall KDD process and the incorporation of background knowledge data preparation knowledge extraction post-processing and knowledge

evaluation business rules autofocus datamining

With respect to other DMQL’s

extending logic query languages yields extra expressiveness, needed to bridge the gap between data mining (e.g., association rule mining) vertical applications (e.g., market basket

analysis)

Architecture - client agent

User interface

Access to business rules and visualization of results through

web browser to control interaction MS Excel objects (sheets and charts) to

represent output of analysis (association rules)

Architecture - server agent

A query engine (mediator) record previous analyses Metadata/meta knowledge interaction with other components

LDL++ server extended with external calls to DBMSs and to …

Inductive modules Apriori classifiers (decision trees)

Coupling with DBMS using the Cache-mine approach Performance comparable with SQL-based

approaches on same mining queries (Giannotti at el 2000)

Deductive rules in LDL++

E.g.: select transactions involving milkmilk_basket(T,I) basket(T,I),basket(T,milk).

Querying ?- milk_basket(T,I)milk_basket(2,bread). milk_basket(3,bread).milk_ basket(2,milk). milk_basket(3,orange).milk_ basket(2,onions). milk_basket(3,milk). milk_ basket(2,fish).

A small database of cash register transactions

basket(1,fish). basket(2,bread). basket(3,bread).basket(1,bread). basket(2,milk). basket(3,orange).

basket(2,onions). basket(3,milk).basket(2,fish).

Aggregates in LDL++

E.g.: count occurrences of pairs of distinct items in all transactionspair(I1,I2,count<T>) basket(T,I1),basket(T,I2),I1 I2.

A small database of cash register transactions

aggregate

Querying ?- pair(fish,bread,N)pair(fish,bread,2) (i.e., N=2)

Aggregates are the logical interface between deductive and inductive environment.

Association rules in LDL++

E.g., compute one-to-one association rules with at least 40% supportrules(patterns<0.4,0,{I1,I2}>)basket(T,I1),basket(T,I2).

patterns is the aggregate interfacing the computation of association rules

patterns<min_supp, min_conf, trans_set>

Association rules in LDL++

Result of the query ?- rules(X,Y,S,C)rules({milk},{bread},0.66,1)

i.e. milk bread [0.66,1]

rules({bread},{milk},0.66,0.66)

rules({fish},{bread},0.66,1)

rules({bread},{fish},0.66,0.66)

Same status for data and induced rules

Reasoning on item hierarchies

P ro d u ct (ite m )

F a m ily

S e c to r

D e pa rtm e nt Which rules survive/decay up/down the item hierarchy?

rules_at_level(I,pattern<S,C,Itemset>) itemset_abstraction(I,Tid,Itemset).

preserved_rules(Left,Right)

rules_at_level(I,Left,Right,_,_),rules_at_level(I+1,Left,Right,_,_).

Business rules: reasoning on promotions

Which rules are established by a promotion?

interval(before, -, 3/7/1998).interval(promotion, 3/8/1998, 3/30/1998).

interval(after, 3/31/1998, +).

established_rules(Left, Right) not rules_partition(before, Left, Right, _, _),

rules_partition(promotion, Left, Right, _, _),

rules_partition(after, Left, Right, _, _).

Business rules: temporal reasoning

How does rule support change along time?

Support Pasta => Fresh Cheese 14

Bread Subsidiaries => Fresh Cheese 28

Biscuits => Fresh Cheese 14

Fresh Fruit => Fresh Cheese 14

Frozen Food => Fresh Cheese 14

Decision tree construction in DATASIFT

construct training and test set using rulestraining_set(P,Case_list) ...test_tuple(ID,F1,...,F20,Rec,Act_rec,CAR)

...construct classifier using external call to C5.0tree_rules(Tree_name,P,PF,MC,BO,Rule_list) training_set(P,Case_list),tree_induction(Case_list,PF,MC,BO,Rule_list).

parameters pruning factor PF misclassification costs MC boosting BO external call

induced classifier

Putting decision trees at work

prediction of target variableprediction(Tree_name,ID,CAR,Predicted_CAR)

tree_rules(Tree_name, _ ,_ , _ , Rule_list),test_subject(ID, F1, …, F20, _, _, CAR),classify(Rule_list ,[F1, …, F20], Predicted_CAR).

Model evaluation: actual recovery of a classifier (=sum recovery of tuples classified as positive)

actual_recovery(Tree_name,sum<Actual_Recovery>) prediction(Tree_name, ID, _ , pos),test_subject(ID, F1, …, F20, _,Actual_Recovery, _).

aggregate

Combining decision trees

Model conjunction: tree_conjunction(T1,T2,ID,CAR,pos)

prediction(T1, ID, CAR, pos),prediction(T2, ID, CAR, pos).

tree_conjunction (T1, T2, ID, CAR, neg) test_subject(ID, F1, …, F20, _, _, CAR),~ tree_conjunction(T1, T2, ID, CAR, pos).

More interesting combinations readily expressible: e.g. meta learning (Chan and Stolfo 93)

We proposed ...

a KDD methodology for audit planning: define an audit cost model monitor training- and test-set construction assess the quality of a classifier tune classifier construction to specific policies

and its formalization in a prototype logic-based KDSE, supporting: integration of deduction and induction integration of domain and induced knowledge separation of conceptual and implementation

Module outline

Konstanz, 27-28.3.2000 EDBT2000 tutorial - KDSE

1. Integration with data warehouse and relational DB

2. Scalable, parallel/distributed and incremental mining

3. Data mining query language optimization

4. Multiple, integrated data mining methods

5. KDSE and methodological support for vertical appl.

6. Interactive, exploratory data mining environments

7. Mining on other forms of data: spatio-temporal databases text multimedia web

A data mining research agenda

Scale up!

Scaling up existing algorithms (AI, ML, IR) Association rules Correlation rules Causal relationship Classification Clustering Bayesian networks

Background knowledge & constraints

Incorporating background knowledge and constraints into existing data mining techniques

Double benefit for DMQL: semantics and optimization! traditional algorithms

Disproportionate computational cost for selective users

Overwhelming volume of potentially useless results need user-controlled focus in mining process

Association rules containing certain itemsSequential patterns containing certain patternsClassification?

Vertical applications of data mining

More success stories needed! Current data mining systems lack a

thick semantic layer (similarly to the early relational database systems)

Verticalized data mining systems, e.g. Market analysis systems Fraud detection systems

Automated mining and interactive mining: how far are they?

Autofocus data mining

policy options, business rules

selection of data mining functionfine parameter tuning of mining

function

DBMS coupling

Tight-coupling with DBMS

Most data mining algorithms are based on flat file data (i.e. loose-coupling with DBMS)

A set of standard data mining operators (e.g. sampling operator)

Web mining – why?

No standards on the web, enormous blob of unstructured and heterogeneous info

Very dynamic One new WWW server every 2 hours

5 million documents in 1995

320 million documents in 1998

Indices get obsolete very quicklyBetter means needed for discovering

resources and extracting knowledge

Web mining: challenges

Today`s search engines are plagued by problems– the abundance problem:

99% of info of no interest to 99% of people!

– limited coverage of the Web– limited query interface based on

keyword-oriented search– limited customization to individual users

Web mining Web content mining

mining what Web search engines find Web document classification (Chakrabarti et al 99) warehousing a Meta-Web (Zaïane and Han 98) intelligent query answering in Web search

Web usage mining Web log mining: find access patterns and trends

(Zaiane et al 98) customized user tracking and adaptive sites

(Perkowitz et al 97)

Web structure mining discover authoritative pages: a page is important if

important pages point to it (Chakrabarti et al 99, Kleinberg 98)

Warehousing a Meta-Web (Zaïane & Han 98)

Meta-Web: summarizes the contents and structure of the Web, which evolves with the Web

Layer0: the Web itself

Layer1: the lowest layer of the Meta-Web an entry: a Web page summary, including class,

time, URL, contents, keywords, popularity, weight, links, etc.

Layer2 and up: summary/classification/clustering Meta-Web is warehoused and incrementally

updated Querying and mining is performed on or assisted

by meta-Web Is it feasible/sustainable? Is XML of any help?

Meta-Web from Jiawei Han’s panel talk @ SIGMOD99

Generalized Descriptions

More Generalized Descriptions

Layer0

Layer1

Layern

Weblog mining

Web servers register a log entry for every single access.

A huge number of accesses (hits) are registered and collected in an ever-growing web log.

Why warehousing/mining web logs?

Enhance server performance by learning access patterns of general or particular users (guess what user will ask next and pre-cache!)

Improve system design of web applications Identify potential prime advertisement locations

Greatest peril: the privacy pitfall See e.g. (Markoff 99) the rise of the Little Brother.

Some web mining references M. Perkowitz and O. Etzioni. Adaptive sites: Automatically learning from user access

patterns. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997. J. Pitkow. In search of reliable usage data on the www. In Proc. 6th Int. World Wide Web

Conf., Santa Clara, California, April 1997. T. Sullivan. Reading reader reaction : A proposal for inferential analysis of web server

log files. In Proc. 3rd Conf. Human Factors & the Web, Denver, Colorado, June 1997. O. R. Zaiane, M. Xin, and J. Han. Discovering Web access patterns and trends by

applying OLAP and data mining technology on Web logs. In Proc. Advances in Digital Libraries Conf. (ADL'98), pages 19-29, Santa Barbara, CA, April 1998.

O. R. Zaiane, and J. Han. Resource and knowledge discovery in global information systems: a preliminary design and experiment. In Proc. KDD’95, p.331-336, 1995.

O. R. Zaiane, and J. Han. WebML: querying the world-wide web for resources and knowledge. In Proc. Int. Workshop on Web informtion and Data management (WIDM98), p. 9-12, 1998.

S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, et al. Mining the web’s link structure. COMPUTER, 32:60-67, 1999.

S. Chakrabarti, B. E. Dom, P. Indik. Enhanced hypertext classification using hyperlinks. In Proc. 1998 ACM-SIGMOD, p. 307-318, 1999.

J. Kleinberg. Autohoritative sources in a hyperlinked environment. In Proc. ACM-SIAM Symp. on Discrete Algorithms, 1998.

J. Markoff. The Rise of Little Brother. Upside, Apr. 1999; http://www.upside.com/texis/mvm/story?id=36d4613c0

Pisa KDD Lab references

F. Giannotti and G. Manco. Making Knowledge Extraction and Reasoning Closer. In Proc. PAKDD'99, The Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kyoto, 2000.

F. Giannotti and G. Manco. Querying Inductive Databases via Logic-Based User Defined Aggregates. In Proc. PKDD'99, The Third Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases. Prague, Sept. 1999.

F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. Using Data Mining Techniques in Fiscal Fraud Detection. In Proc. DaWak'99, First Int. Conf. on Data Warehousing and Knowledge Discovery. Florence, Italy, Sept. 1999.

F. Bonchi , F. Giannotti, G. Mainetto, D. Pedreschi. A Classification-based Methodology for Planning Audit Strategies in Fraud Detection. In Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, San Diego (CA), August 1999.

F. Giannotti, G. Manco, D. Pedreschi and F. Turini. Experiences with a logic-based knowledge discovery support environment. In Proc. 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (SIGMOD'99 DMKD). Philadelphia, May 1999.

F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Integration of Deduction and Induction for Mining Supermarket Sales Data. In Proc. PADD'99, Practical Application of Data Discovery, Int. Conference, London, April 1999.

F. Giannotti, G. Manco, M. Nanni, D. Pedreschi. Nondeterministic, Nonmonotonic Logic Databases. IEEE Trans. on Knowledge and Data Engineering. 2000.

F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Using deduction for intelligent data analysis. Submitted, 2000. http://www-kdd.di.unipi.it/

P. Becuzzi, M. Coppola, S. Ruggieri and M. Vanneschi. Parallelisation of C4.5 as a particular divide and conquer computation. Proc.3rd Workshop on High Performance Data Mining, Springer-Verlag LNCS, 2000.

5DM_kdse.ppt

Documents