Date post: | 18-Jan-2016 |
Upload: | marshall-perkins |
1
Data Mining Systems and Languages
CS240A Notes
2
Knowledge Discovery (KDD) Process
Data mining: the core of the knowledge discovery process
[Figure: KDD pipeline] Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
3
DM Experience for DBMS: Dreams vs. Reality
Decision support and business intelligence: OLAP and data warehouses were a resounding success for DBMS vendors, via simple extensions of SQL (aggregates and analytics).
Relational DBMS extensions for DM queries: a flop. OR-DBMSs do not fare much better [Sarawagi '98].
Imielinski & Mannila [CACM '96] proposed a 'high-road' approach, calling for a quantum leap in functionality based on:
  Simple declarative extensions of SQL for Data Mining (DM)
  Efficiency through DM query optimization techniques (yet to be invented)
The research area of Inductive DBMS was thus born, producing interesting language work: DMQL, Mine Rule, MSQL, …
But the implementation technology lacks generality and has performance limitations; there are real questions whether optimizers will ever take us there.
4
DBMS Limitations
DBMSs were easily and very successfully extended for data warehouses with the help of OLAP functions.
Extending DBMSs for mining has proven much harder:
  Limited expressive power and flexibility of the languages
  Apriori in DB2 [Sarawagi '98]: because of the lack of suitable primitives, the task proved extremely difficult, and the result was not as efficient as cache mining.
Cache mining: move data from the database to a cache, and then use PL algorithms to mine the cache.
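The cache-mining idea can be sketched in a few lines. This is only an illustration under stated assumptions: Python stands in for the PL, SQLite stands in for the DBMS, and the `baskets` table, its columns, and the toy rows are hypothetical. The mining step is a plain pair count with a support threshold, not the full Apriori algorithm.

```python
import sqlite3
from collections import Counter
from itertools import combinations


def load_cache(conn):
    """Copy the transactions out of the database into an in-memory dict."""
    cache = {}
    for tid, item in conn.execute("SELECT tid, item FROM baskets"):
        cache.setdefault(tid, set()).add(item)
    return cache


def frequent_pairs(cache, min_support):
    """Count item pairs over the cached transactions (no SQL involved)."""
    counts = Counter()
    for items in cache.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical toy data; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [(1, "bread"), (1, "milk"), (2, "bread"), (2, "milk"),
     (2, "beer"), (3, "milk"), (3, "beer")],
)

cache = load_cache(conn)                      # step 1: database -> cache
pairs = frequent_pairs(cache, min_support=2)  # step 2: mine the cache
print(pairs)
```

The DBMS is used only as a data source here; all mining logic runs outside it, which is why data transfer becomes the main cost on large tables.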
5
Mining Systems Desiderata
Problem: how to efficiently support the vast variety of online mining algorithms in an integrated framework?
  Generality over a wide spectrum of mining tasks
  Ease of use for naïve users, and flexibility and customizability for experts
  Efficiency, scalability
Databases: where the data is. But DBMSs do not support the KDD tasks well.
Three approaches:
  1. Inductive DBMS
  2. Commercial DBMS extensions
  3. Dedicated KDD systems with DBMS connections
6
Inductive DBMSs vs. Vendor Extensions
Imielinski & Mannila introduced the notion of:
  A high-level Data Mining Query Language for DBMS
  Optimization techniques for DM queries
Inductive DBMS: a new research field
  MSQL, DMQL, Mine Rule: DM query languages
  Performance and generality: an open problem
DBMS vendors: ad hoc approaches based on mining libraries
7
DBMS extensions: DB2 Intelligent Miner
Model creation
Training
CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS', 'TASK', 'ID', 'HeartClasTask', 'IDMMX.CLASSIFMODELS', 'MODEL', 'MODELNAME', 'HeartClasModel' );
Prediction
Stored procedures and virtual mining views
Most of the implementation is outside the DBMS (cache mining)
Data transfer delays
http://www-306.ibm.com/software/data/iminer/
21-Mar-08 8
Oracle Data Miner
Algorithms: Adaptive Naïve Bayes, SVM regression, K-means clustering, association rules, text mining, etc.
PL/SQL with extensions for mining
Models as first-class objects
Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
http://www.oracle.com/technology/products/bi/odm/index.html
9
OLE DB for DM by Microsoft
Model creation
Descriptive phase
Prediction joins
Other features
  Nested cases
http://research.microsoft.com/dmx/DataMining/
PMML: a descriptive XML language for exchanging information between systems
10
OLE DB for DM (DMX) (cont.)
Mining objects as first-class objects
Schema rowsets
  Mining_Models
  Mining_Model_Content
  Mining_Functions
Other features
  Column value distribution
  Nested cases
http://research.microsoft.com/dmx/DataMining/
11
OLE DB for DM (DMX): 3 steps
Model creation
CREATE MINING MODEL MemCard_Pred (
  CustomerId LONG KEY,
  Age LONG CONTINUOUS,
  Profession TEXT DISCRETE,
  Income LONG CONTINUOUS,
  Risk TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees;

Training
INSERT INTO MemCard_Pred OPENROWSET(
  "'sqloledb', 'sa', 'mypass'",
  'SELECT CustomerId, Age, Profession, Income, Risk FROM Customers')

Prediction Join
SELECT C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
FROM MemCard_Pred AS MP PREDICTION JOIN Customers AS C
WHERE MP.Profession = C.Profession AND MP.Income = C.Income AND MP.Age = C.Age;
12
Defining a Mining Model:
E.g., a model to predict students’ plan to attend college
The format of "training cases" (top-level entity)
Attributes, input/output type, distribution
Algorithms and parameters
Example
CREATE MINING MODEL CollegePlanModel (
  StudentID LONG KEY,
  Gender TEXT DISCRETE,
  ParentIncome LONG NORMAL CONTINUOUS,
  Encouragement TEXT DISCRETE,
  CollegePlans TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees
13
Training
INSERT INTO CollegePlanModel
  (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
OPENROWSET('<provider>', '<connection>',
  'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
   FROM CollegePlansTrainData')
14
Prediction Join
SELECT t.ID, CPModel.Plan
FROM CPModel PREDICTION JOIN
  OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ
[Figure: tables CPModel (ID, Gender, IQ, Plan) and NewStudents (ID, Gender, IQ)]
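What a prediction join computes can be sketched outside the DBMS. The following is a minimal illustration, not Microsoft's implementation: the "model" is a simple majority-vote lookup table rather than a decision tree, and the training rows and new-student rows are hypothetical. Only the column names (ID, Gender, IQ, Plan) come from the example above.

```python
from collections import Counter, defaultdict


def train(cases, inputs, target):
    """Build a lookup 'model': most common target value per input combination.

    This stands in for a real mining algorithm; it is not a decision tree.
    """
    votes = defaultdict(Counter)
    for case in cases:
        key = tuple(case[a] for a in inputs)
        votes[key][case[target]] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


def prediction_join(model, new_cases, inputs):
    """Pair each new case with the model's prediction for matching inputs."""
    for case in new_cases:
        key = tuple(case[a] for a in inputs)
        yield case["ID"], model.get(key)  # None when the model has no match


# Hypothetical training and scoring data.
training = [
    {"ID": 1, "Gender": "F", "IQ": 110, "Plan": "Yes"},
    {"ID": 2, "Gender": "F", "IQ": 110, "Plan": "Yes"},
    {"ID": 3, "Gender": "M", "IQ": 95, "Plan": "No"},
]
new_students = [
    {"ID": 10, "Gender": "F", "IQ": 110},
    {"ID": 11, "Gender": "M", "IQ": 95},
    {"ID": 12, "Gender": "M", "IQ": 120},
]

model = train(training, ["Gender", "IQ"], "Plan")
results = list(prediction_join(model, new_students, ["Gender", "IQ"]))
print(results)
```

Unlike this lookup table, a trained model such as a decision tree would generalize to input combinations it has not seen, so the third student would normally also receive a prediction.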
15
Summary of Vendors' Approaches
Built-in library of mining methods
Script language or GUI tools
Limitations
  Closed systems (internals hidden from users)
  Adding new algorithms or customizing old ones is difficult
  Poor integration with SQL
  Limited interoperability across DBMSs
Predictive Model Markup Language (PMML) as a palliative
16
PMML
Predictive Model Markup Language
XML-based language for vendor-independent definition of statistical and data mining models
Share models among PMML compliant products
A descriptive language
Supported by all major vendors
17
PMML Example
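As a rough illustration of what a PMML document contains, the sketch below builds a very small tree model with Python's standard XML library and reads its field list back, as a consuming product might. The element and attribute names (DataDictionary, DataField, TreeModel, MiningSchema, Node, SimplePredicate) follow the PMML vocabulary, but the model itself (fields IQ and Plan, the threshold of 100) is hypothetical, the XML namespace declaration is omitted, and the output has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET


def build_pmml():
    """Build a tiny PMML-style tree model: predict Plan from IQ."""
    pmml = ET.Element("PMML", version="4.4")
    ET.SubElement(pmml, "Header", description="Illustrative sketch only")

    # Data dictionary: the fields the model talks about.
    dictionary = ET.SubElement(pmml, "DataDictionary", numberOfFields="2")
    ET.SubElement(dictionary, "DataField", name="IQ",
                  optype="continuous", dataType="integer")
    ET.SubElement(dictionary, "DataField", name="Plan",
                  optype="categorical", dataType="string")

    # The model: which fields it uses, and which one it predicts.
    tree = ET.SubElement(pmml, "TreeModel", functionName="classification")
    schema = ET.SubElement(tree, "MiningSchema")
    ET.SubElement(schema, "MiningField", name="IQ")
    ET.SubElement(schema, "MiningField", name="Plan", usageType="target")

    # Root node predicts "No"; its child predicts "Yes" when IQ > 100.
    root = ET.SubElement(tree, "Node", score="No")
    ET.SubElement(root, "True")
    child = ET.SubElement(root, "Node", score="Yes")
    ET.SubElement(child, "SimplePredicate", field="IQ",
                  operator="greaterThan", value="100")
    return pmml


def read_fields(pmml):
    """List the names declared in the data dictionary."""
    return [field.get("name") for field in pmml.iter("DataField")]


document = ET.tostring(build_pmml(), encoding="unicode")
fields = read_fields(ET.fromstring(document))
print(document)
print(fields)
```

The point is that the document only describes the model; it carries no training procedure, which is what makes it exchangeable between products.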
Much Competition
Platforms: IBM, Oracle, SAS
Tools: SPSS, Angoss, KXEN, Megaputer, Fair Isaac, Insightful
Vendors
  SAS Institute (Enterprise Miner)
  IBM (DB2 Intelligent Miner for Data)
  Oracle (ODM option to Oracle 10g)
  SPSS (Clementine)
  Unica Technologies, Inc. (Pattern Recognition Workbench)
  Insightful (Insightful Miner)
  KXEN (Analytic Framework)
  Prudsys (Discoverer and its family)
  Microsoft (SQL Server 2005)
  Angoss (KnowledgeServer and its family)
  DBMiner (DB2)
19
Stand Alone Systems
WEKA is open-source Java code created by researchers at the University of Waikato in New Zealand.
It provides many different machine learning algorithms.
Applicable to generic data described in Attribute-Relation File Format (ARFF)
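To make the ARFF format concrete, here is a minimal sketch of a reader for simple ARFF files: a header with `@relation` and `@attribute` declarations, then comma-separated rows after `@data`, with `%` starting a comment. The sample data set is hypothetical, and the parser is deliberately incomplete (no quoted values, sparse rows, or date attributes); it is not WEKA's own loader.

```python
ARFF_TEXT = """\
% Hypothetical data set, for illustration only
@relation students
@attribute gender {F,M}
@attribute iq numeric
@attribute plan {yes,no}
@data
F,110,yes
M,95,no
"""


def parse_arff(text):
    """Parse a small, simple ARFF file into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append(line.split(","))
        elif line.lower().startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif line.lower().startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif line.lower().startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, attributes, rows)
```

Because the attribute list is part of the file, an algorithm written against this representation does not need to know the number or names of the columns in advance.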
20
Weka
A comprehensive set of DM algorithms and tools.
Generic algorithms over arbitrary data sets, independent of the number of columns in tables.
Open and extensible system based on Java.
* Also free …