1 Data Mining Systems and Languages CS240A Notes.

1

Data Mining Systems and Languages

CS240A Notes

2

Knowledge Discovery (KDD) Process

Data mining: the core of the knowledge discovery process

Figure (KDD pipeline): Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation

3

DM Experience for DBMS: Dreams vs. Reality

Decision support and business intelligence: OLAP & data warehouses were a resounding success for DBMS vendors, achieved via simple extensions of SQL (aggregates & analytics).

Relational DBMS extensions for DM queries: a flop. OR-DBMSs do not fare much better [Sarawagi '98].

A 'high-road' approach was suggested by Imielinski & Mannila [CACM '96], who called for a quantum leap in functionality based on:

Simple declarative extensions of SQL for Data Mining (DM)

Efficiency through DM query optimization techniques (yet to be invented)

The research area of Inductive DBMS was thus born, producing interesting language work: DMQL, Mine Rule, MSQL, …

But the implementation technology lacks generality and suffers from performance limitations. Real questions remain as to whether optimizers will ever take us there.

4

DBMS Limitations

DBMSs were easily and very successfully extended for data warehouses with the help of OLAP functions.

Extending DBMSs for mining has proven much harder:

Limited expressive power and flexibility of the languages

Apriori in DB2 [Sarawagi '98]: because of the lack of suitable primitives, the task proved extremely difficult and not as efficient as the cache-mining approach

Cache-mining: move data from the database to a cache, then use PL algorithms to mine the cache.
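The cache-mining idea can be illustrated with a minimal sketch, assuming a hypothetical `baskets(tid, item)` table held in SQLite. The table name, the sample rows, and the helper names are all made up for illustration. The database is used only to ship the rows out; the counting happens in ordinary Python. Note that this counts item pairs only and is not a full Apriori implementation.

```python
import sqlite3
from collections import Counter
from itertools import combinations


def load_cache(conn):
    """Pull every (tid, item) row out of the DBMS into an in-memory dict."""
    baskets = {}
    for tid, item in conn.execute("SELECT tid, item FROM baskets"):
        baskets.setdefault(tid, set()).add(item)
    return baskets


def frequent_pairs(baskets, min_support):
    """Count co-occurring item pairs in the cache; keep those meeting min_support."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def demo():
    # Hypothetical toy data standing in for a real transaction table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE baskets (tid INTEGER, item TEXT)")
    conn.executemany(
        "INSERT INTO baskets VALUES (?, ?)",
        [
            (1, "bread"), (1, "milk"),
            (2, "bread"), (2, "milk"), (2, "eggs"),
            (3, "milk"), (3, "eggs"),
        ],
    )
    cache = load_cache(conn)  # data transfer step: database -> cache
    conn.close()              # from here on the DBMS is no longer involved
    return frequent_pairs(cache, min_support=2)
```

The `load_cache` step is where the data transfer cost is paid; everything after it runs outside the DBMS, which is why this approach bypasses the query language entirely.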

5

Mining Systems Desiderata

Problem: How to efficiently support the vast variety of online mining algorithms in an integrated framework?

Generality over a wide spectrum of mining tasks

Ease of use for naïve users, and flexibility and customizability for experts

Efficiency, scalability

Databases: where the data is. But DBMSs do not support KDD tasks well. Three approaches:

1. Inductive DBMS

2. Commercial DBMS extensions

3. Dedicated KDD systems with DBMS connections

6

Inductive DBMSs vs. Vendor Extensions

Imielinski & Mannila introduced the notion of:

A high-level Data Mining Query Language for DBMS

Optimization techniques for DM queries

Inductive DBMS: a new research field

MSQL, DMQL, Mine Rule: DM query languages

Performance and generality: an open problem

DBMS vendors: ad-hoc approaches based on mining libraries

7

DBMS extensions: DB2 Intelligent Miner

Model creation

Training

CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS', 'TASK', 'ID', 'HeartClasTask', 'IDMMX.CLASSIFMODELS', 'MODEL', 'MODELNAME', 'HeartClasModel' );

Prediction

Stored procedures and virtual mining views

Most of the implementation is outside the DBMS (Cache Mining)

Data transfer delays

http://www-306.ibm.com/software/data/iminer/

21-Mar-08 8

Oracle Data Miner

Algorithms:

Adaptive Naïve Bayes

SVM regression

K-means clustering

Association rules, text mining, etc.

PL/SQL with extensions for mining

Models as first class objects

Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.

http://www.oracle.com/technology/products/bi/odm/index.html

9

OLE DB for DM by Microsoft

Model creation

Descriptive phase

Prediction joins

Other features: nested cases

PMML: a descriptive XML language for exchanging information between systems

http://research.microsoft.com/dmx/DataMining/

10

OLE DB for DM (DMX) (cont.)

Mining objects as first class objects

Schema rowsets: Mining_Models, Mining_Model_Content, Mining_Functions

Other features: column value distribution, nested cases

http://research.microsoft.com/dmx/DataMining/

11

OLE DB for DM (DMX): 3 steps

Model creation

CREATE MINING MODEL MemCard_Pred (
    CustomerId LONG KEY,
    Age LONG CONTINUOUS,
    Profession TEXT DISCRETE,
    Income LONG CONTINUOUS,
    Risk TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees;

Training

INSERT INTO MemCard_Pred
OPENROWSET("'sqloledb', 'sa', 'mypass'",
    'SELECT CustomerId, Age, Profession, Income, Risk FROM Customers')

Prediction Join

SELECT C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
FROM MemCard_Pred AS MP PREDICTION JOIN Customers AS C
ON MP.Profession = C.Profession AND MP.Income = C.Income AND MP.Age = C.Age;

12

Defining a Mining Model:

E.g., a model to predict students’ plan to attend college

The format of “training cases” (top-level entity)

Attributes, input/output type, distribution

Algorithms and parameters

Example

CREATE MINING MODEL CollegePlanModel (
    StudentID LONG KEY,
    Gender TEXT DISCRETE,
    ParentIncome LONG NORMAL CONTINUOUS,
    Encouragement TEXT DISCRETE,
    CollegePlans TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees

13

Training

INSERT INTO CollegePlanModel (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
OPENROWSET('<provider>', '<connection>',
    'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
     FROM CollegePlansTrainData')

14

Prediction Join

SELECT t.ID, CPModel.Plan
FROM CPModel PREDICTION JOIN
    OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ

Figure: CPModel (ID, Gender, IQ, Plan) prediction-joined with NewStudents (ID, Gender, IQ)
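The semantics of a prediction join can be sketched in Python: each new case is matched to the model on the input columns, and the model supplies the predicted column. This is only an illustration under stated assumptions. `cp_model` is a hypothetical stand-in for a trained decision tree (its threshold is invented), and `prediction_join` is not part of any real DMX API.

```python
def cp_model(gender, iq):
    """Hypothetical stand-in for a trained model; the IQ threshold is made up."""
    return "yes" if iq >= 110 else "no"


def prediction_join(model, rows, on):
    """For each input row, feed the `on` columns to the model and attach its prediction."""
    out = []
    for t in rows:
        inputs = [t[col] for col in on]
        out.append({"ID": t["ID"], "Plan": model(*inputs)})
    return out


# Hypothetical rows playing the role of the NewStudents table.
new_students = [
    {"ID": 1, "Gender": "F", "IQ": 120},
    {"ID": 2, "Gender": "M", "IQ": 95},
]

predictions = prediction_join(cp_model, new_students, on=("Gender", "IQ"))
```

Unlike an ordinary equi-join, no stored row has to match the input exactly: the model generalizes from the training cases, which is what the ON clause of a prediction join expresses.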

15

Summary of Vendors’ Approaches

Built-in library of mining methods

Script language or GUI tools

Limitations:

Closed systems (internals hidden from users)

Adding new algorithms or customizing old ones is difficult

Poor integration with SQL

Limited interoperability across DBMSs

Predictive Model Markup Language (PMML) as a palliative

16

PMML

Predictive Model Markup Language

XML-based language for vendor-independent definition of statistical and data mining models

Share models among PMML compliant products

A descriptive language

Supported by all major vendors
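To make "descriptive language" concrete, here is a minimal sketch that reads a simplified PMML-style document with Python's standard XML parser. The fragment is an assumption made for illustration: it omits the XML namespace and most required content, is not validated against the PMML schema, and the model and field names are invented. The point is only that a consumer can recover the model's fields and target without knowing which product wrote the file.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical PMML-style fragment (no namespace, no tree nodes).
PMML_SKETCH = """
<PMML version="4.4">
  <Header description="Hypothetical college-plan tree"/>
  <DataDictionary numberOfFields="3">
    <DataField name="Gender" optype="categorical" dataType="string"/>
    <DataField name="ParentIncome" optype="continuous" dataType="double"/>
    <DataField name="CollegePlans" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="CollegePlanModel" functionName="classification">
    <MiningSchema>
      <MiningField name="Gender"/>
      <MiningField name="ParentIncome"/>
      <MiningField name="CollegePlans" usageType="target"/>
    </MiningSchema>
  </TreeModel>
</PMML>
"""


def describe_model(pmml_text):
    """Extract model name, field types, and target fields from the sketch above."""
    root = ET.fromstring(pmml_text.strip())
    fields = {f.get("name"): f.get("optype") for f in root.iter("DataField")}
    model = root.find("TreeModel")
    targets = [
        m.get("name")
        for m in model.iter("MiningField")
        if m.get("usageType") == "target"
    ]
    return {"model": model.get("modelName"), "fields": fields, "targets": targets}


info = describe_model(PMML_SKETCH)
```

The document describes what the model is, not how to train it, which is why PMML helps with exchanging models but does not by itself integrate mining with SQL.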

17

PMML Example

Much Competition

Platforms: IBM, Oracle, SAS

Tools: SPSS, Angoss, KXEN, Megaputer, Fair Isaac, Insightful

Vendors

SAS Institute (Enterprise Miner)

IBM (DB2 Intelligent Miner for Data)

Oracle (ODM option to Oracle 10g)

SPSS (Clementine)

Unica Technologies, Inc. (Pattern Recognition Workbench)

Insightful (Insightful Miner)

KXEN (Analytic Framework)

Prudsys (Discoverer and its family)

Microsoft (SQL Server 2005)

Angoss (KnowledgeServer and its family)

DBMiner (DB2)

19

Stand Alone Systems

WEKA is open-source Java code created by researchers at the University of Waikato in New Zealand.

It provides many different machine learning algorithms

Applicable to generic data described in Attribute-Relation File Format (ARFF)
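As a rough illustration of what an ARFF file looks like, the sketch below renders a small table as ARFF text: a `@relation` line, one `@attribute` line per column (numeric, or a brace-enclosed list of nominal values), then `@data` followed by comma-separated rows. The helper `to_arff`, the relation name, and the sample rows are hypothetical. The sketch does no quoting or escaping, so it only handles simple values without spaces or commas.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string "numeric"
    or a list of allowed nominal values.
    """
    lines = ["@relation %s" % relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute %s %s" % (name, kind))
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical sample data, loosely modeled on the college-plans example.
arff_text = to_arff(
    "college_plans",
    [
        ("gender", ["F", "M"]),
        ("parent_income", "numeric"),
        ("plans", ["yes", "no"]),
    ],
    [("F", 52000, "yes"), ("M", 31000, "no")],
)
```

Because the header declares each attribute's type, the same algorithm code can run over tables with any number of columns.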

20

Weka

A comprehensive set of DM algorithms and tools.

Generic algorithms over arbitrary data sets, independent of the number of columns in tables.

Open and extensible system based on Java.

* Also free …
