8/20/2019 Java Datamining Spec.
1/156
Maintenance Release JavaTM Data Mining (JDM) Version 1.1
June 22, 2005 1
JavaTM Specification Request 73:
JavaTM Data Mining (JDM)
JSR-73 Expert Group
Specification Lead: Mark Hornick, Oracle Corporation
Technical comments:
Version 1.1
Maintenance Release Specification
June 22, 2005
Copyright
Copyright (c) 2005 Oracle Corporation. All rights reserved.
Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or documentation may be reproduced in any form by any means without prior written authorization of the copyright holders, or any of the licensors, if any. Any unauthorized use may be a violation of domestic or international law. RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the U.S. Government and its agents is subject to the restrictions of FAR 52.227-14(g)(2)(6/87) and FAR 52.227-19(6/87), or DFAR 252.227-7015(b)(6/95) and DFAR 227.7202-3(a).
Disclaimer
This document and its contents are furnished “as is” for informational purposes only, and are subject to change without notice. Oracle Corporation (Oracle) does not represent or warrant that any product or business plans expressed or implied will be fulfilled in any way. Any actions taken by the user of this document in response to the document or its contents shall be solely at the risk of the user.
ORACLE MAKES NO WARRANTIES, EXPRESSED OR IMPLIED, WITH RESPECT TO THIS DOCUMENT OR ITS CONTENTS, AND HEREBY EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR USE OR NON-INFRINGEMENT. IN NO EVENT SHALL ORACLE BE HELD LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES IN CONNECTION WITH OR ARISING FROM THE USE OF ANY PORTION OF THE INFORMATION.
Trademarks
Sun, Sun Microsystems, Java, JavaBeans, and Enterprise JavaBeans are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the U.S. and other countries.
OMG, Object Management Group, CORBA, Unified Modeling Language, and UML are registered trademarks or trademarks of the Object Management Group, Inc.
All other product or company names mentioned are for identification purposes only, and may be trademarks of their respective owners.
1. Overview..................................................................................................................1
1.1 Introduction..........................................................................................................................1
1.1.1 Benefits..................................................................................................................1
1.1.2 Target audience......................................................................................................2
1.1.3 Data analytics JSRs ...............................................................................................2
1.1.4 Exclusions .............................................................................................................2
1.2 Architectural components ....................................................................................................3
1.3 Dependencies and relationships...........................................................................................4
1.4 Organization.........................................................................................................................4
1.5 Expert group members.........................................................................................................5
1.6 Acknowledgements..............................................................................................................5
2. Use cases..................................................................................................................6
2.1 Application use cases...........................................................................................................6
2.1.1 Mining GUI ...........................................................................................................6
2.1.2 Web specialty retailer ............................................................................................7
2.1.3 Campaign management .........................................................................................7
2.1.4 Minimal top level specification.............................................................................7
2.1.5 Selecting the “best” model ....................................................................................8
2.1.6 Comparing vendor implementations .....................................................................8
2.1.7 Incremental learning..............................................................................................8
2.1.8 Deferred task execution.........................................................................................9
2.1.9 Explaining model behavior....................................................................................9
2.1.10 Manually enhancing a model.................................................................................9
2.1.11 OLAP schema refinement ...................................................................................10
2.1.12 Web services........................................................................................................10
2.2 Vendor use cases ................................................................................................................11
2.2.1 Broad support of JDM.........................................................................................11
2.2.2 Narrow support of JDM.......................................................................................12
3. Concepts.................................................................................................................13
3.1 Data mining functions........................................................................................................13
3.1.1 Classification.......................................................................................................13
3.1.2 Regression ...........................................................................................................13
3.1.3 Attribute Importance ...........................................................................................14
3.1.4 Clustering ............................................................................................................14
3.1.5 Association ..........................................................................................................14
3.2 Data mining tasks...............................................................................................................15
3.2.1 Building a model .................................................................................................15
3.2.2 Testing a model....................................................................................................16
3.2.3 Applying a model ................................................................................................17
3.2.4 Object import and export.....................................................................................18
3.2.5 Computing statistics on data................................................................................19
3.2.6 Verifying task correctness....................................................................................19
3.3 Principal objects.................................................................................................................20
3.3.1 Connection...........................................................................................................20
3.3.2 Task......................................................................................................................20
3.3.3 Execution handle and status ................................................................................20
3.3.4 Physical data set ..................................................................................................21
3.3.5 Physical data record.............................................................................................21
3.3.6 Build settings.......................................................................................................21
3.3.7 Algorithm ............................................................................................................22
3.3.8 Algorithm settings ...............................................................................................22
3.3.9 Model...................................................................................................................22
3.3.10 Model signature...................................................................................................22
3.3.11 Model detail.........................................................................................................23
3.3.12 Logical attribute...................................................................................................23
3.3.13 Logical data .........................................................................................................23
3.3.14 Attribute statistics set ..........................................................................................23
3.3.15 Apply settings......................................................................................................24
3.3.16 Confusion matrix.................................................................................................24
3.3.17 Lift.......................................................................................................................24
3.3.18 Cost matrix ..........................................................................................................25
3.3.19 Prior probabilities ................................................................................................25
3.3.20 Category sets .......................................................................................................26
3.3.21 Taxonomy............................................................................................................26
3.3.22 Rules....................................................................................................................27
3.3.23 Verification report................................................................................................27
3.4 Physical data representations.............................................................................................27
3.4.1 Individual record .................................................................................................27
3.4.2 Single record case table.......................................................................................28
3.4.3 Multi-record case table ........................................................................................28
3.4.4 Data preparation ..................................................................................................29
3.5 Attribute mapping ..............................................................................................................29
3.5.1 Direct mapping ....................................................................................................29
3.5.2 Pivot mapping......................................................................................................30
3.6 Creating physical data objects ...........................................................................................30
3.7 Persistence .........................................................................................................................30
3.8 Object references ...............................................................................................................31
3.9 Reflection / introspection...................................................................................................32
4. Packages.................................................................................................................34
4.1 Design overview ................................................................................................................34
4.2 Notation .............................................................................................................................34
4.3 Package structure...............................................................................................................36
4.4 Package javax.datamining .................................................................................................38
4.5 Package javax.datamining.base .........................................................................................40
4.6 Package javax.datamining.resource ...................................................................................43
4.7 Package javax.datamining.data..........................................................................................44
4.8 Package javax.datamining.task..........................................................................................47
4.8.1 Package task.apply ..............................................................................................49
4.9 Package javax.datamining.supervised ...............................................................................50
4.9.1 Package supervised.classification........................................................................51
4.9.2 Package supervised.regression ............................................................................54
4.9.3 Package attributeimportance ...............................................................................55
4.10 Package javax.datamining.association...............................................................................56
4.11 Package javax.datamining.clustering.................................................................................58
4.12 Package javax.datamining.rule ..........................................................................................61
4.13 Package javax.datamining.statistics...................................................................................62
4.14 Package javax.datamining.algorithm.................................................................................63
4.14.1 Package algorithm.tree ........................................................................................63
4.14.2 Package algorithm.naivebayes ............................................................................64
4.14.3 Package algorithm.feedforwardneuralnet............................................................65
4.14.4 Package algorithm.svm........................................................................................66
4.14.5 Package algorithm.kmeans ..................................................................................67
4.15 Package javax.datamining.modeldetail..............................................................................68
4.15.1 Package modeldetail.tree.....................................................................................68
4.15.2 Package modeldetail.feedforwardneuralnet ........................................................69
4.15.3 Package modeldetail.naivebayes .........................................................................69
4.15.4 Package modeldetail.svm ....................................................................................70
5. Code examples .......................................................................................................71
5.1 Building a clustering model...............................................................................................71
5.2 Applying a clustering model to data..................................................................................73
5.3 Applying a clustering model to a record............................................................................74
5.4 Building a classification model..........................................................................................75
5.5 Testing a classification model............................................................................................76
5.6 Building and extracting rules from a tree model ...............................................................77
5.7 Extracting rules from an association model.......................................................................79
5.7.1 Get rules with minimum support.........................................................................79
5.7.2 Get rules with minimum support and confidence................................................79
5.7.3 Get rules containing certain items .......................................................................80
5.7.4 Get rules that do not contain certain items..........................................................81
5.8 Importing and exporting a model.......................................................................................81
5.8.1 Import an object using a URI ..............................................................................82
5.8.2 Export a model ....................................................................................................83
5.8.3 Export an object to a destination .........................................................................83
5.9 Using reflection..................................................................................................................84
5.10 Establishing a connection ..................................................................................................85
5.11 Uniform resource identifiers ..............................................................................................85
6. Conformance statement .........................................................................................87
6.1 Required and optional features ..........................................................................................87
6.2 Vendor extensions ..............................................................................................................88
6.3 Compliance points .............................................................................................................88
6.4 Determining conformance .................................................................................................89
6.4.1 Function level conformance ................................................................................89
6.4.2 Algorithm level conformance..............................................................................90
6.4.3 Model apply engine conformance .......................................................................91
6.5 Claiming conformance.......................................................................................................91
7. Summary................................................................................................................93
Appendix A. Glossary.........................................................................................................94
Appendix B. Requirements...............................................................................................102
B.1. Domain requirements.......................................................................................................102
B.2. Foundation technologies..................................................................................................103
B.3. Data mining standards .....................................................................................................103
B.4. System behavior...............................................................................................................103
B.5. Exclusions for version 1 ..................................................................................................104
B.5.1. Domain exclusions ............................................................................................104
B.5.2. System exclusions .............................................................................................104
Appendix C. Optional Methods ........................................................................................106
Appendix D. Exceptions ...................................................................................................107
Appendix E. Web services ................................................................................................110
E.1. Introduction......................................................................................................................110
E.2. Methods ...........................................................................................................................111
E.2.1. WSDL Document Structure ..............................................................................111
E.2.2. Listing DME Contents.......................................................................................112
E.2.3. Introspection / Reflection ..................................................................................114
E.2.4. Saving objects....................................................................................................115
E.2.5. Retrieving objects..............................................................................................116
E.2.6. Removing objects ..............................................................................................117
E.2.7. Renaming objects ..............................................................................................118
E.2.8. Retrieving Object Components .........................................................................119
E.2.9. Verify Object .....................................................................................................120
E.2.10. Executing tasks..................................................................................................121
E.2.11. Getting execution status ....................................................................................123
E.2.12. Terminating Tasks..............................................................................................123
E.3. Java methods supporting XML........................................................................................124
E.4. XML Schema Definition .................................................................................................125
E.4.1. JDM Document .................................................................................................125
E.4.2. Task....................................................................................................................125
E.4.3. Task.Apply.........................................................................................................128
E.4.4. Data....................................................................................................................129
E.4.5. Supervised .........................................................................................................132
E.4.6. Supervised.Classification ..................................................................................133
E.4.7. Supervised.Regression.......................................................................................135
E.4.8. Clustering ..........................................................................................................136
E.4.9. Association........................................................................................................138
E.4.10. AttributeImportance ..........................................................................................138
E.4.11. Statistics.............................................................................................................139
E.4.12. Algorithm ..........................................................................................................140
E.4.13. Base ...................................................................................................................143
E.4.14. Root ...................................................................................................................145
E.4.15. Enumeration extension......................................................................................146
Appendix F. References ....................................................................................................148
TABLE 1. An example of a single record case table ..........................................................................................28
TABLE 2. An example of a multi-record case table...........................................................................................28
TABLE 3. Named and composite object referencing summary..........................................................................31
TABLE 4. Function-level model behavior........................................................................................................102
TABLE 5. JDM optional methods for models and model details .....................................................................106
TABLE 6. JDMException codes and messages ................................................................................................108
TABLE 7. JDM runtime exceptions, codes, and messages...............................................................................109
FIGURE 1.1 Architecture configuration options ......................................................................................................3
FIGURE 1.2 Example of attribute mapping for apply............................................................................................29
FIGURE 4.2 Top level package structure ...............................................................................................................37
FIGURE 4.3 Common top level interfaces .............................................................................................................38
FIGURE 4.4 Exception classes...............................................................................................................................38
FIGURE 4.5 Top level enumerations......................................................................................................................39
FIGURE 4.6 Execution Handle...............................................................................................................................39
FIGURE 4.7 Package javax.datamining.base - Named Objects .............................................................40
FIGURE 4.8 Package javax.datamining.base - Build Settings, Model, and Task ..................................40
FIGURE 4.9 Package javax.datamining.base - BuildSettings ................................................................................41
FIGURE 4.10 Package javax.datamining.base - Model............................................................................................42
FIGURE 4.11 Package javax.datamining.resource...................................................................................................43
FIGURE 4.12 Package javax.datamining.data - PhysicalData .................................................................................44
FIGURE 4.13 Package javax.datamining.data - LogicalData...................................................................................45
FIGURE 4.14 Package javax.datamining.data - ModelSignature.............................................................................45
FIGURE 4.15 Package javax.datamining.data - Taxonomy .....................................................................................46
FIGURE 4.16 Package javax.datamining.data - CategoryMatrix.............................................................................46
FIGURE 4.17 Package javax.datamining.data - CategorySet and Interval ..............................................................46
FIGURE 4.18 Package javax.datamining.task - Build..............................................................................................47
FIGURE 4.19 Package javax.datamining.task - Import and Export .........................................................................48
FIGURE 4.20 Package javax.datamining.task - ComputeStatistics..........................................................................48
FIGURE 4.21 Package task.apply - ApplyTask and ApplySettings .........................................................49
FIGURE 4.22 Package javax.datamining.supervised - Settings and Model.............................................50
FIGURE 4.23 Package javax.datamining.supervised - TestTask and TestMetrics ...................................................50
FIGURE 4.24 Package supervised.classification - Settings and Model ...................................................................51
FIGURE 4.25 Package supervised.classification - TestTask and TestMetrics..........................................................52
FIGURE 4.26 Package supervised.classification ClassificationTestMetricsTask.....................................................52
FIGURE 4.27 Package supervised.classification - ApplySettings............................................................................53
FIGURE 4.28 Package supervised.classification - Confusion Matrix and Cost Matrix ...........................................53
FIGURE 4.29 Package supervised.regression - Settings and Model ........................................................................54
FIGURE 4.30 Package supervised.regression - TestTask, and ApplySettings .........................................................54
FIGURE 4.31 Package supervised.regression - RegressionTestMetricsTask...........................................55
FIGURE 4.32 Package javax.datamining.attributeimportance - Settings and Model...............................................55
FIGURE 4.33 Package javax.datamining.associationrules - Settings and Model ....................................................56
FIGURE 4.34 Package javax.datamining.associationrules - Rule Selection ............................................................57
FIGURE 4.35 Package javax.datamining.clustering - Model...................................................................58
FIGURE 4.36 Package javax.datamining.clustering - Settings ................................................................59
FIGURE 4.37 Package javax.datamining.clustering - ApplySettings ......................................................................60
FIGURE 4.38 Package javax.datamining.clustering - Similarity Matrix .................................................................60
FIGURE 4.39 Package javax.datamining.rule - Rule and Predicate.........................................................................61
FIGURE 4.40 Package javax.datamining.statistics - AttributeStatistics ..................................................................62
FIGURE 4.41 Package algorithm.tree - TreeSettings...............................................................................................63
FIGURE 4.42 Package algorithm.naivebayes - NaiveBayesSettings .......................................................................64
FIGURE 4.43 Package algorithm.feedforwardneuralnet - FeedForwardNeuralNetSettings....................................65
FIGURE 4.44 Package algorithm.svm.classification - SVMClassificationSettings.................................................66
FIGURE 4.45 Package algorithm.svm.regression - SVMRegressionSettings..........................................................66
FIGURE 4.46 Package algorithm.kmeans - KMeansSettings...................................................................................67
FIGURE 4.47 Package modeldetail.tree - TreeModelDetail ....................................................................................68
FIGURE 4.48 Package modeldetail.feedforwardneuralnet - NeuralNetworkModelDetail ......................................69
FIGURE 4.49 Package modeldetail.naivebayes - NaiveBayesModelDetail.............................................................69
FIGURE 4.50 Package modeldetail.svm - SVMModelDetail ..................................................................................70
1. Overview
1.1 Introduction
The Java Data Mining (JDM) specification addresses the need for a pure Java API to facilitate development of data mining-enabled applications. JDM supports common data mining operations, as well as the creation, persistence, access, and maintenance of metadata supporting mining activities.
Currently, no existing Java platform specification provides a standard API for data mining systems. Existing APIs are vendor-proprietary. By using JDM, implementers of data mining applications can expose a single, standard API that will be understood by a wide variety of developers writing client applications and components running on the Java™ 2 Platform. Similarly, data mining clients can be coded against a single API that is independent of the underlying data mining system. JDM is targeted for the Java™ 2 Platform, Enterprise Edition (J2EE™) and Standard Edition (J2SE™).
In JDM, data mining [Mitchell1997, BL1997] includes the functional areas of classification, regression, attribute importance1, clustering, and association. These are supported by such supervised and unsupervised learning algorithms as decision trees, neural networks, Naive Bayes, Support Vector Machine, K-Means, and Apriori, on structured data. Common operations include model build, test, and apply (score). A particular implementation of this specification may not necessarily support all interfaces and services defined by JDM. However, JDM provides a mechanism for client discovery of supported interfaces and capabilities.
JDM is based on a generalized, object-oriented, data mining conceptual model leveraging emerging data mining standards such as the Object Management Group's Common Warehouse Metadata (CWM), ISO's SQL/MM for Data Mining, and the Data Mining Group's Predictive Model Markup Language (PMML), as appropriate.
Implementation details of JDM are delegated to each vendor. A vendor may decide to implement JDM as a native API of its data mining product. Others may opt to develop a driver/adapter that mediates between a core JDM layer and multiple vendor products. The JDM specification does not prescribe a particular implementation strategy, nor does it prescribe performance or accuracy of a given capability or algorithm.
To ensure J2EE™ compatibility and eliminate duplication of effort, JDM leverages existing specifications. In particular, JDM leverages the Java Connection Architecture [JSR16] to provide communication and resource management between applications and the services that implement the JDM API. JDM also reflects aspects of the Java Metadata Interface [JSR40] for the interface specification.
1.1.1 Benefits
The availability of a J2EE™-compliant data mining API provides benefit to both vendors and users of tools and applications in the areas of business intelligence, business analytics, data mining systems, data warehousing, and life sciences / bioinformatics.
Historically, application developers coded homegrown data mining algorithms into applications, or used sophisticated end-user GUIs. These GUIs packaged a suite of algorithms complete with support for data transformation, model building, testing, and scoring. However, it was difficult, if not impossible, to embed data mining end-to-end in applications using commercial data mining products due to inadequate APIs. If a vendor had an API, it was proprietary, making the development of a product using that API risky. If a different
vendor's solution was required, rewriting that product was also potentially costly.
1. Attribute importance is also referred to as feature selection or key fields analysis.
The ability to leverage data mining functionality via a standard API greatly reduces risk and potential cost. With a standard API, customers can use multiple products for solving business problems by applying the most appropriate algorithm implementation without investing resources to learn each vendor's proprietary API. Moreover, a standard API makes data mining more accessible to developers while making developer skills more transferable. Vendors can now differentiate themselves on price, performance, accuracy, and features. Java Data Mining (JDM) addresses this need for Java.
1.1.2 Target audience
The target audiences for the JDM specification can be categorized into the following groups:
• data mining vendors – companies that intend to implement this API for their respective products, thereby providing the API to end users
• application developers – Java programmers who wish to use a data mining API for building GUIs or other applications that benefit from data mining technology
• data mining experts – individuals with advanced degrees in statistics, machine learning, or data mining; or with significant practical data mining experience
• data mining novices – Java-knowledgeable developers who have a basic understanding of the problems that data mining can solve, who can minimally leverage the function level of data mining tasks
1.1.3 Data analytics JSRs
The complement to data mining in data analytics is online analytical processing (OLAP). To distinguish between OLAP and data mining, consider that OLAP follows a deductive (query-oriented) strategy of analyzing data. Users formulate hypotheses and execute queries to gain understanding of the underlying data. Data mining follows an inductive strategy of analyzing data, where users apply machine learning algorithms to extract non-obvious knowledge from the data.
JOLAP (JSR-69) specifies a Java API for OLAP and shares a common basis in the OMG CWM meta-model. The JDM expert group is working with the JOLAP expert group to minimize overlap and leverage common modeling techniques and infrastructure where applicable.
1.1.4 Exclusions
The domain of “data mining” is quite large. The JDM expert group made decisions early on to exclude certain features from JDM to make it more manageable. As such, functionality such as data transformations, visualization, mining unstructured data (e.g., text), wrappers and ensembles, and sensitivity analysis has been omitted from this first version of the API. Note that with respect to visualization, JDM does provide many of the key data objects necessary to support visualization, e.g., the confusion matrix, lift results, decision tree representation, and neural network architecture.
From a systems perspective, JDM does not specify behavior for transactions, scheduling,
or security. These are left to vendors to determine what best suits their respective products
and customer base.
1.2 Architectural components
JDM has three logical components that may be implemented as one executable or in a distributed environment:
• application programming interface (API) - The API is the end-user-visible component of a JDM implementation that allows access to services provided by the data mining engine (DME). An application developer using JDM requires knowledge only of the API library, not of other supporting components.
• data mining engine (DME) - A DME provides the infrastructure that offers a set of
data mining services to its API clients. When implemented as a server of a client-
server architecture, it is referred to as a data mining server (DMS), which is a specific
instantiation of the more general Enterprise Information System (EIS) as specified in
the Connector Architecture (JSR-16).
• mining object repository (MOR) - The DME uses a mining object repository which serves to persist data mining objects. This repository can be based on, e.g., the CWM framework, specifically leveraging the CWM Data Mining metamodel, or implemented using a vendor-proprietary representation. The MOR may exist in a file-based environment, or in a relational / object database. Section 3.7 discusses JDM persistence options.
Figure 1.1 depicts three possible architectures for a JDM implementation. In (a), each
component resides in a separate physical location or separate executable. We view this as
a three-tier architecture with the data stored in a separate repository, such as a database. In
(b), the DME contains the MOR, resulting in a classic client-server architecture. This scenario is possible, e.g., where the database contains both the DME and MOR, or the DME uses the local file system for persistent storage. In (c), the system is monolithic, i.e., API, DME, and MOR reside in, or are managed by, a single executable.
FIGURE 1.1 Architecture configuration options
A vendor may choose to provide additional utilities and management interfaces to the DME and MOR; however, these are not defined as part of JDM and may be proprietary. The JDM specification does not place any requirements on the DME and MOR design or implementation except to support functionality as required by the JDM interface.
Vendors may implement a subset of the complete JDM specification as noted in the section on conformance. This a la carte approach provides a common framework for all data mining functionality, while allowing vendors to support only vendor-relevant portions of it.
1.3 Dependencies and relationships
JDM leverages aspects of the CWM Data Mining metamodel and the Java Metadata Interface (JSR-40). CWM Data Mining facilitates the construction and deployment of data warehousing and business intelligence applications, tools, and platforms based on OMG open standards for metadata and system specification (i.e., MOF, UML, XMI, CWM). The Java Metadata Interface provides a common naming convention for methods.
The following specifications serve as design references for JDM:
• DMG PMML 2.0 [PMML] provides an XML-based representation for mining models and facilitates the interchange of model results among vendors.
• ISO SQL/MM Part 6: Data Mining [SQL/MM-DM] provides a standard interface to RDBMSs for performing data mining. Concepts from this approach are leveraged in the overall JDM design.
• Common Warehouse Metamodel [CWM] and CWM Specification, Volume 1, Chapter 15, Data Mining [CWM-DM] provide a sense of the overall structure of the metadata JDM supports.
1.4 Organization
This document focuses on JDM requirements, concepts, use cases, code examples, packages supporting the API, and vendor conformance.
In Section 2, we present use cases to help the reader appreciate how this API can be used
under various circumstances, both by end users and vendors conforming to the standard.
In Section 3, we present the synthesis of data mining concepts that form the basis of the JDM model. These concepts result from analyzing the requirements of many different data mining functions and algorithms. These concepts are key to providing a unified data mining framework.
In Section 4, we present the JDM packages and class diagrams to illustrate the relationships between the various interfaces and classes. Details of each class are provided in the companion Javadoc-generated documentation.
In Section 5, we provide and explain code examples using the JDM API. These examples represent working with the API as a non-data mining expert, relying on convenience routines to automate much of the specification, as well as exposing detailed specification for data mining experts.
In Section 6, we present the requirements for vendor conformance to the API.
In Section 7, we summarize our JDM experience and where the standard is likely to go in subsequent versions.
In appendix A, we provide a glossary of terms used in this document.
In appendix B, we review the data mining domain requirements and foundation technologies driving the API. We explore related data mining standards and common system behavior.
In appendix C, we list optional methods for models and model detail a vendor may choose
to implement.
In appendix D, we provide JDM error codes for JDMException.
In appendix E, we define Web services based on the JDM model. There has been significant interest expressed within the expert group and from external comments for defining a JDM Web services interface.
In appendix F, we provide a list of references.
1.5 Expert group members
Sarabjot Anand – Corporate Intellect
Robert Brunner – California Institute of Technology
Robert Chu – SAS Institute
Werner Dubitzky* – University of Ulster, N. Ireland
Kim Horn – Sun Microsystems, Inc.
Mark Hornick – Oracle Corporation
Bill Hosken* – SPSS, Inc.
Ronny Kohavi* – Blue Martini Software
Achim Kraiss – SAP AG
Marwane Jay Lamimi – KXEN
Christoph Lingenfelder – IBM Germany
Erik Marcade – KXEN
Somesh Marepalli – Computer Associates International, Inc.
Waddys Martinez* – Magnify
Cindy McMullen – BEA Systems
Chuck Mosher – Sun Microsystems, Inc.
John Poole – Hyperion Solutions
Michal Prussak – Fair Isaac
Alex Russakovskii – Hyperion Solutions
Mike Smith – Strategic Analytics
Qian (Cherry) Yang – Computer Associates International, Inc.
Sunil Venkayala – Oracle Corporation
Andrew Walaszek – SPSS, Inc.
Hankil Yoon – Oracle Corporation
* former member
1.6 Acknowledgements
The expert group recognizes and thanks Dipankar Roy and Shiby Thomas for reviewing
previous drafts. We also recognize and thank Marcos Campos, Gary Drescher, Boriana
Milenova, Joe Yarmus, and Yan Zhuang for their contributions to the JDM effort.
2. Use cases
The use cases presented in this section provide a context in which to understand the possible uses of JDM. We have divided use cases into two categories: those relevant to applications and those relevant to vendors implementing JDM-conforming products. Readers already familiar with data mining may want only to browse this section.
Several JDM concepts are introduced briefly below to assist in understanding the use cases. These are described in more detail in Section 3. The reader is expected to be familiar with common data mining terminology.
Mining Function - A major subdomain of data mining that shares common high level
characteristics. Functions include classification, regression, attribute importance, associa-
tion, and clustering.
Task - A container within which to specify arguments to data mining operations to be performed by the data mining engine. Tasks include model building, testing, applying (scoring), computing statistics, and object import and export. Tasks may execute synchronously or asynchronously.
Settings - A collection of parameters specifying the input for building a data mining model or applying a model to data (i.e., scoring). Build settings may be high level, specified for mining functions, or detailed, specified for mining algorithms. Apply settings specify the content of the scoring result and, in some cases, affect the type of content provided. For example, a cost matrix may be specified for classification at apply time.
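To make the cost matrix point concrete: a classifier that honors a cost matrix at apply time picks the class with the lowest expected cost rather than the highest probability. The sketch below is plain, self-contained Java for illustration only; it is not JDM API code, and the class and method names are invented.

```java
// Illustrative only: shows how a cost matrix supplied at apply time can
// change a classification decision. Not part of the JDM API.
public class CostSensitiveApply {

    // costs[actual][predicted]: penalty for predicting class 'predicted'
    // when the true class is 'actual'.
    public static int predictWithCosts(double[] probabilities, double[][] costs) {
        int best = 0;
        double bestCost = Double.MAX_VALUE;
        for (int predicted = 0; predicted < probabilities.length; predicted++) {
            double expected = 0.0;
            for (int actual = 0; actual < probabilities.length; actual++) {
                expected += probabilities[actual] * costs[actual][predicted];
            }
            if (expected < bestCost) {
                bestCost = expected;
                best = predicted;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Class 0 is more probable, but missing class 1 is five times as costly.
        double[] probs = {0.7, 0.3};
        double[][] costs = {{0, 1},   // actual 0: cheap to mispredict
                            {5, 0}};  // actual 1: expensive to miss
        System.out.println(predictWithCosts(probs, costs)); // prints 1
    }
}
```

With uniform costs the most probable class (0) would win; the asymmetric cost matrix flips the decision to class 1.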
Model - An algorithm often produces a compressed representation of input data called a model. This model contains the essential knowledge extracted from the data as determined by the algorithm. A model can be descriptive or predictive. A descriptive model helps in understanding the underlying data or model behavior. For example, an association rules model on market basket data can be used to describe consumer behavior. A predictive model can be an equation or set of rules that makes it possible to predict an unseen or unknown value (the dependent variable or target) from other, known values (independent variables or predictors).
2.1 Application use cases
In this section, we present several end-user use cases involving application developers that
explore a wide variety of situations in which JDM can be used.
2.1.1 Mining GUI
A team of developers is tasked with producing a GUI for visualizing data mining objects. They use JDM to develop a tool for exposing objects for building models, such as build settings, and for viewing model representations or contents. The objects include decision trees, neural networks, and mining results such as confusion matrices and lift. Decision trees can be traversed and graphically displayed in a tree representation; neural networks can be traversed and graphically displayed to show hidden layers and weights on connections. The GUI also supports scoring data, testing models, computing lift, and graphically displaying lift charts.
In this use case, a JDM implementation provides the enabling data mining functionality. If
only standard JDM features are leveraged, this GUI could be portable across vendor JDM
implementations.
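As a concrete illustration of one mining result such a GUI would render, the following self-contained Java sketch tallies a confusion matrix from actual and predicted class labels. It is illustrative only and does not use JDM interfaces.

```java
import java.util.Arrays;

// Illustrative only: building a confusion matrix from actual and predicted
// class labels, as a visualization GUI might do before rendering it.
public class ConfusionMatrixExample {

    // matrix[actual][predicted] counts each (actual, predicted) pair.
    public static int[][] confusionMatrix(int[] actual, int[] predicted, int numClasses) {
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[] actual    = {0, 0, 1, 1, 1, 0};
        int[] predicted = {0, 1, 1, 1, 0, 0};
        System.out.println(Arrays.deepToString(confusionMatrix(actual, predicted, 2)));
        // prints [[2, 1], [1, 2]]
    }
}
```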
2.1.2 Web specialty retailer
A specialty retailer sells from a website, catalogs, and stores. The website has a recommendation feature that is supported by data mining. Customer data are collected from the company's points of sale into its data warehouse. Sales data are combined with demographic data such as age, gender, and income. Demographic data together with product categories are regularly mined for customer 'clusters' using a clustering algorithm. Product sales data are then partitioned by customer cluster, and each cluster is mined for product associations using association rules algorithms. The website uses the resulting association rules to make online product recommendations with each addition to the customer's virtual shopping cart.
In this use case, multiple JDM mining functions are leveraged: clustering, association, and
the ability to score individual records to support online product recommendations.
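The recommendations above rest on rule quality measures such as support and confidence. The following self-contained Java sketch computes both for a candidate rule over a handful of market baskets; it is illustrative only, not JDM API code, and the basket contents are invented.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: support and confidence for an association rule A -> B
// over a set of market baskets, the measures a recommendation feature consults.
public class RuleMetrics {

    // Fraction of baskets containing every item in 'items'.
    public static double support(List<Set<String>> baskets, Set<String> items) {
        long hits = baskets.stream().filter(b -> b.containsAll(items)).count();
        return (double) hits / baskets.size();
    }

    // confidence(A -> B) = support(A union B) / support(A)
    public static double confidence(List<Set<String>> baskets,
                                    Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        return support(baskets, both) / support(baskets, antecedent);
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = List.of(
            Set.of("bread", "milk"),
            Set.of("bread", "butter"),
            Set.of("bread", "milk", "butter"),
            Set.of("milk"));
        // If a cart contains bread, how often does it also contain milk?
        // confidence = support({bread, milk}) / support({bread}) = 0.5 / 0.75
        System.out.println(confidence(baskets, Set.of("bread"), Set.of("milk")));
    }
}
```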
2.1.3 Campaign management
A campaign management application provides automated support for identifying customers to receive a marketing campaign. The application has access to data collected on customer demographics and responsiveness to such mailing campaigns. This application leverages database vendor-specific transformations to prepare data for mining.
Using the mining function attribute importance (also referred to as feature selection), the
application determines which attributes are most relevant for model building. By using a
smaller set of attributes, model build time can be reduced, model predictive accuracy can
increase, and the attributes most valuable to collect from customers can be highlighted.
The application uses a decision tree algorithm to produce rules that can be understood by
the marketing manager, possibly for developing more targeted mailings to customers of a
given set of demographics. Once the model is built, the application tests the model and
sends the test and lift results to the campaign manager, who can assess model quality and
expected results.
Unless directed otherwise, the application uses this model to score new customers eligible for a new mailing campaign. Those customers who have a probability greater than 75% of responding to the mailing will be selected for the mailing.
In this use case, data preprocessing may occur outside JDM using proprietary or ad hoc
techniques. Multiple JDM mining functions and operations are leveraged through task
specification. To communicate models and results to other users, these objects can be
exported, perhaps using an XML representation. JDM’s flexible apply settings allow the
application to specify the score, probability, customer id, and possibly other input data to
be part of the apply result table. Finally, JDM's rule representation, and the ability of certain algorithms to produce rules, is leveraged to explain model behavior. Note that JDM defines predicate-based rules from the decision tree algorithm for either the classification or regression mining function, and from the K-Means algorithm for the clustering mining function.
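The selection step in this use case, keeping only customers whose scored response probability exceeds the 75% threshold, can be sketched in a few lines of self-contained Java. The sketch is illustrative only; the customer ids and the map-shaped apply output are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative only: filtering scored customers by a probability threshold,
// as the campaign application does after the apply (scoring) task completes.
public class CampaignSelection {

    public static List<String> select(Map<String, Double> scores, double threshold) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > threshold) {
                selected.add(e.getKey());
            }
        }
        Collections.sort(selected); // deterministic order for display
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of(
            "cust-01", 0.82, "cust-02", 0.40, "cust-03", 0.91);
        System.out.println(select(scores, 0.75)); // prints [cust-01, cust-03]
    }
}
```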
2.1.4 Minimal top level specification
A college student learned about the potential of data mining to solve many problems. For
her senior biology thesis, she wants to cluster the data she’s collected over the past year on
wild grasses of the African plains to help her categorize those grasses.
Although an avid Java programmer, she is unfamiliar with the details of data mining. Having read about JDM and having access to a commercial implementation through her school, she leverages all the automated aspects of JDM, specifying only the data and accepting all default settings for the Clustering build settings. In this way, no algorithm selection is necessary, nor are any algorithm-specific settings.
She uses the clustering model's API to inspect the identified clusters.
In this use case, JDM allows novice users to extract benefit from data mining technology by eliding algorithm details. Vendor implementations may vary in the degree of automation and the quality of models that automation produces.
2.1.5 Selecting the “best” model
An e-tailer builds models of projected customer revenue on which to base customer discounts. The data analyst for the e-tailer builds multiple regression models drawing on several algorithms: neural networks, decision trees, and Naive Bayes. After building several models of each type, the models are tested against held-aside test data and lift is computed. An initial criterion selects as “best” the model with the least r-squared error.
In this use case, the data analyst leverages a JDM implementation's ability to reuse a single regression build settings object, supplying different algorithm settings. In addition, each model can be tested by defining test tasks, and coding an outer loop to iterate over the test results to identify the “best” model.
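The outer loop described here reduces to computing an error measure per candidate model and keeping the minimum. The self-contained Java sketch below uses plain squared error on held-aside test data; it is illustrative only, with trivial stand-in models in place of real algorithm output, and is not JDM API code.

```java
// Illustrative only: ranking candidate regression models by squared error
// on held-aside test data and selecting the best one.
public class ModelSelection {

    // Stand-in for a built regression model; real models come from the DME.
    public interface Regressor {
        double predict(double x);
    }

    public static double sumSquaredError(Regressor model, double[] x, double[] y) {
        double sse = 0.0;
        for (int i = 0; i < x.length; i++) {
            double residual = y[i] - model.predict(x[i]);
            sse += residual * residual;
        }
        return sse;
    }

    // Returns the index of the model with the least squared error.
    public static int best(Regressor[] models, double[] x, double[] y) {
        int bestIndex = 0;
        for (int i = 1; i < models.length; i++) {
            if (sumSquaredError(models[i], x, y) < sumSquaredError(models[bestIndex], x, y)) {
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        double[] y = {2, 4, 6};       // held-aside data; true relationship y = 2x
        Regressor[] candidates = {
            v -> v + 1,               // stand-in for one algorithm's model
            v -> 2 * v,               // stand-in for another
            v -> 3 * v                // stand-in for a third
        };
        System.out.println(best(candidates, x, y)); // prints 1
    }
}
```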
2.1.6 Comparing vendor implementations
Data Mining Laboratories (DML) performs independent analysis on data mining software
to measure performance, ease of use, and model portability. DML compares the effectiveness of several vendors' regression decision tree implementations in building models for
economic forecasting. Economic forecasts are used in corporate planning to align corporate strategy with the expected economic climate. Using JDM, the DML developers code a test application that builds one neural network model per vendor implementation. After
testing each model, the investigators rank-order the models according to forecast accuracy, learning time, and the ratio of these two. To ensure fairness in assessing model performance and conformance for model portability, a separate scoring engine is used that accepts PMML-standard XML models and generates scores for the test data.
In this use case, the developers are able to code a single program and execute it on multiple vendor implementations, modifying only login information. By exporting models in
PMML format, models can be objectively assessed in a common scoring engine.
2.1.7 Incremental learning
A machine tool manufacturer collects data on the machine settings, materials, and defect
rates for the tools manufactured. These data are provided to a neural network algorithm to
predict the probability of defective components in a given batch of product. Because data
are collected over time, and the chosen neural network architecture and learning algorithm are compute-intensive, the manufacturer needs to apply incremental learning to the neural network as new data become available from each production run.
In this use case, JDM provides an interface that enables incremental learning, i.e., the ability to continue building a model with the original build data or new data. To support this, a
user specifies an existing JDM model as input to a build task, along with other required inputs. On execution of the task, the DME uses this model as a seed from which to continue building the model. This optional specification can be used for any type of algorithm that can leverage a seed model.
2.1.8 Deferred task execution
A cancer researcher, who has limited access to hardware for building and testing models, needs to define and verify a series of mining tasks and store them in the mining object repository. The researcher may even build trial models on very small datasets as part of verifying the tasks. Using an external scheduling mechanism, such as UNIX cron jobs, the researcher schedules execution of these tasks overnight, when computing resources are more available.
In this use case, the researcher uses JDM's task specification and the ability to store objects in the mining object repository. These can later be retrieved for execution. The verify method gives the researcher greater confidence that the tasks will execute to completion; it typically checks whether the logical and physical data map properly and whether the combination of settings specified is compatible.
2.1.9 Explaining model behavior
A bank leverages data mining to predict credit risk for customers seeking home equity loans. To comply with government regulations prohibiting discrimination based on gender or race, the bank must be able to prove that the rules it applies to determine credit risk exclude such criteria.
The bank's data analyst is required to produce a set of human-understandable rules, ideally in an English-like format, that can be presented to government auditors as needed. Bank management also reviews these rules to target certain customer segments for special promotions.
In this use case, the analyst uses the JDM tree settings to request a decision tree representation for a classification model, predicting credit worthiness as low, medium, or high. The analyst then uses JDM's interface to generate rule objects from the decision tree model and translate these rules to a particular format. A given vendor may have an English format implemented for rules.
2.1.10 Manually enhancing a model
A private security agency builds decision tree models to profile suspicious individuals and identify individuals at airports for further screening. However, in their experience they have found that manually enhancing a model can improve its performance and accuracy. Their data mining analyst builds a model, generates an English-like representation of the rules, removes certain irrelevant rules, and possibly even adjusts some of the rule predicates.
Importing this modified model into the data mining system, the analyst sets up an application to enable profiling by leveraging single-record scoring of individuals, accessing information stored in government databases and information obtained from travelers at the airport.
In this use case, the analyst also uses the JDM tree settings to build a classification model.
The rules are generated from the decision tree model and analyzed. However, since JDM
does not enable direct model modification via the API, the analyst can export the model,
perhaps in PMML, to ensure model integrity. The analyst modifies the model and attempts
to import the model. Validation of the manually modified model occurs at import. JDM's support for single-record scoring enables the analyst to produce an application that joins information stored in a database about individuals with that dynamically acquired by airport personnel, perhaps at the ticket counter.
2.1.11 OLAP schema refinement
An OLAP vendor creates cube schemas from fact tables stored in a relational database. A particular fact table contains millions of records representing sales and customer information of a beverage retail company. The OLAP vendor needs to create a schema for the OLAP cube to enable analyzing and reporting on the retailer's sales data.
A cube schema is a set of dimensions, each having a particular hierarchy of attributes. Dimensions usually correspond to several columns in the fact table; however, not all columns should necessarily produce a dimension. A dimension normally represents an attribute that is orthogonal to the other dimensions in the fact table. In addition, some of the columns, identified in advance, represent measures in the model.
Choosing the right set of dimensions is key for OLAP providers. If the number of dimensions is too large, efficient processing of the cube becomes practically impossible. On the other hand, dropping important attributes makes data analysis deficient. Poor cube design is one of the factors that inhibit OLAP productivity. Therefore it is important to choose the right schema.
The optimization process of a cube structure can be seen from two different perspectives. Starting from a fact table with hundreds of columns, OLAP vendors are interested in either:
• identifying truly independent columns, or
• identifying the important columns to be kept in the optimized cube structure.
Attribute importance can be used to select the most important independent columns to better 'see' a given measure. For example, an internal mechanism can build an analytic dataset with columns describing both customer and product characteristics, with the sales amount as a target. The system then trains an attribute importance model on this dataset. It returns the columns (describing either some aspect of the customer or of the product) that best explain the spread of the average sales figure. Some advanced systems can return not only the important columns but also the drilling hierarchies that can be associated with these columns (segments for continuous variables and groups of categories for discrete variables). These important columns can be used to create a (possibly ad hoc) optimized cube structure that the end user will use to better understand the average sales figure and to build 'segments' combining the customer or product characteristics that are most explanatory.
Such schema refinements are intractable in large cubes without data mining.
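A minimal flavor of the attribute importance computation described in this use case can be given in self-contained Java: score each candidate column by its absolute correlation with the measure and keep the strongest. This is a deliberately naive sketch (real attribute importance algorithms use richer criteria, and would also handle discrete columns) and is not JDM API code.

```java
// Illustrative only: a naive attribute-importance ranking that scores each
// candidate column by its absolute Pearson correlation with the measure
// (e.g., sales amount).
public class AttributeRanking {

    public static double correlation(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n; meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov  += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        return cov / Math.sqrt(varA * varB);
    }

    // Returns the index of the column most correlated with the target measure.
    public static int mostImportant(double[][] columns, double[] target) {
        int best = 0;
        for (int i = 1; i < columns.length; i++) {
            if (Math.abs(correlation(columns[i], target))
                    > Math.abs(correlation(columns[best], target))) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] sales    = {10, 20, 30, 40};   // the measure
        double[] income   = {1, 2, 3, 4};       // tracks sales closely
        double[] shoeSize = {7, 7, 8, 7};       // unrelated to sales
        System.out.println(mostImportant(new double[][]{shoeSize, income}, sales));
        // prints 1
    }
}
```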
2.1.12 Web services
List Inc. offers a comprehensive list management service that includes data warehousing, grooming, merging, and predictive modeling. All of its services are available as Web services, allowing customers to integrate List Inc.’s software seamlessly into their own enterprise systems over the Internet.
The Data Web service allows customers to connect to a managed warehouse and store
their transaction, customer and sales data using a secure Web service interface. List Inc.
manages the customer data in its data warehouse, cleans and grooms the data, and pro-
vides a range of preprocessing and transformation facilities. They maintain a comprehen-
sive repository of high quality background data including income, census, and
demographic and geographic data. List Inc. has relationships with many data vendors and
can call upon their services when required. This background data is merged with the cus-
tomer data using their proprietary merge technology.
List Inc. offers a complete model training and testing facility that guarantees optimal results. The customer data is used to build predictive models that determine the best responders, support cross-sell and up-sell campaigns, and investigate return on investment (ROI). List Inc. has a comprehensive testing facility that can choose the best algorithm and product combination to deliver the optimal ROI. The customer does not have to worry about data mining tool integration, training, or testing.
The customer decides only on the schedule for updating models and the ROI they require.
List Inc. owns two super computers to provide the fastest modeling facilities available
today.
JDM is critical to List Inc.’s services. The Prediction Web service wraps JDM to allow the customer to apply models. The Training Web service wraps JDM to allow the customer to build models and set parameters. JDM is used internally to connect to different vendor data mining tools and algorithms in List Inc.’s building and testing processes.
The Training Web service can be used by both novice customers and experienced data analysts. Mining-savvy data analysts can tailor the training process and choose particular algorithms and their settings. In addition, they can choose the attributes from their data they wish to include in models.
The Prediction Web service provides access to the resultant models across the network. The Prediction Web service interface is called with new prospect data, and the score outcome is returned. The service allows customers to enhance their software systems and their own web sites with predicted outcomes as if they owned the data mining tools themselves.
2.2 Vendor use cases
In this section, we present several use cases that explore how vendors can leverage JDM in
commercial JDM implementations.
2.2.1 Broad support of JDM
A data mining vendor has a wide range of algorithms that address each of the JDM mining functions. The vendor’s objective is to simplify mining for unsophisticated users. As such, the vendor provides automated selection of algorithms without requiring (or allowing) the user to select specific algorithms or to exercise algorithm-specific control, e.g., maximum tree depth in a decision tree.
In this use case, the vendor must implement all packages of the API except Algorithm sub-
classes and model detail subclasses. Users of the vendor’s data mining product will spec-
ify build settings only, obtain models, and be able to view and use those models as
appropriate. Note that the end-user can see only the function-specific model representa-
tions, not their underlying algorithm-specific model representations.
2.2.2 Narrow support of JDM
A data mining vendor, Neural Networks, Inc. (NNI), supports various neural network algorithms, both published and proprietary, in its data mining tool. NNI supports both classification and regression. The vendor chooses to be JDM compliant to gain acceptance in the marketplace.
In this use case, JDM, as an a la carte standard, allows a vendor to implement a narrow portion of the standard reflecting its specific domain or the subset of mining functions supported. The JDM packages required to support this include the core foundation packages and a select few specific to neural networks, including algorithm settings and model detail.
For the vendor’s proprietary algorithms, an additional Java package
nni.feedforwardneuralnetwork is provided which includes the specific proprietary algo-
rithm settings and model representations.
3. Concepts
In this section, we introduce JDM concepts: mining function, task, principal objects, phys-
ical data representations, attribute mapping, physical data storage, object references, and
reflection and introspection.
3.1 Data mining functions
In general, data mining functions can be classified into two categories: supervised and unsupervised. Supervised functions are typically used to predict a value and require the specification of a known outcome, or target, for each case used during model building. Examples of targets include binary attributes indicating buy/no-buy, churn/no-churn, or success/failure, and multi-class attributes indicating a preferred color choice from among the primary colors, or a likely salary range binned in $20,000 increments. The target allows the algorithm to determine how well it is predicting target values. An example of a supervised learning algorithm is Naive Bayes for classification.
Unsupervised functions do not use a target and are typically used to find the intrinsic structure, relations, or affinities in a body of data. Examples of unsupervised learning algorithms include k-means clustering and Apriori association. Clustering may be used to identify naturally occurring groups of proteins among hundreds of cases, or to segment retail customers. The itemset rules returned from an association model can be used to identify products to cross-sell to retail customers.
Another view of mining considers whether data mining is descriptive or predictive. Descriptive data mining describes a dataset in a concise, summary manner and presents interesting general properties of the data. Algorithms supporting descriptive data mining include k-means clustering, Apriori association, and even decision tree classification. Predictive data mining constructs one or more models, performs inference on the available dataset, and attempts to predict outcomes for new datasets. Algorithms supporting predictive data mining include neural networks, SVM, and decision tree classification/regression, and even k-means clustering when used to assign new records to clusters.
Different algorithms serve different purposes, each algorithm offering its own advantages
and disadvantages. JDM specifies the following mining functions: classification, regres-
sion, attribute importance, clustering, and association. Some algorithms can be used
across multiple data mining functions.
3.1.1 Classification
Classification has been used in customer segmentation, business modeling, and credit analysis. As a type of supervised learning, an algorithm supporting classification builds a model from a set of predictors that are used to predict a target. A set of predictors may include demographic data such as age, income, number of children, and zip code, used to predict the binary target buy/no-buy for a minivan. The input, or build data, for a supervised learning algorithm requires the presence of attributes for both predictors and the target in each case. Given a pre-determined set of classes in the target attribute, classification analyzes the build data to create a model that can predict to which class a given case belongs.
3.1.2 Regression
Regression has been used in financial forecasting, time series prediction, biomedical and
drug response modelling, and environmental modelling. Also a type of supervised learning, regression involves predicting a continuous, numerically valued target attribute given a
set of predictors. A regression problem may use the same predictors as a classification
problem, but specifies a target such as the predicted lifetime value of a customer.
3.1.3 Attribute Importance
Attribute importance is used to determine which attributes are most relevant for building a model. Attribute importance can be used for both supervised and unsupervised learning. It enables users to reduce model build time and, for some algorithms, data scoring time, by including only the most important attributes from the build data. Eliminating “noise” attributes from data can also improve accuracy or model quality.
Attribute importance serves a purpose similar to feature selection. It produces a model that ranks attributes according to how each contributes to the quality of the model built. From the ranking of attributes, users may select the attributes to be used in building models. The user can specify a number or percentage of attributes to use; alternatively, a user can specify a cutoff point. Note that the ranking of attributes is usually interpretable only in a relative sense: JDM specifies no precise interpretation of attribute rank values other than that attributes with a greater numeric value are relatively more important.
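As a concrete illustration of selecting attributes from such a ranking, the following self-contained sketch (not part of the JDM API; the class, attribute names, and importance values are hypothetical) applies both a top-N selection and a cutoff point to a map of rank values:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical illustration of consuming attribute-importance ranks.
// Per the spec, only the relative order of the values is meaningful.
public class AttributeSelection {

    // Returns the names of the top-N attributes, highest importance first.
    static List<String> topN(Map<String, Double> importance, int n) {
        return importance.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Returns all attributes whose importance meets a cutoff point.
    static List<String> aboveCutoff(Map<String, Double> importance, double cutoff) {
        return importance.entrySet().stream()
                .filter(e -> e.getValue() >= cutoff)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> importance = new HashMap<>();
        importance.put("age", 0.83);
        importance.put("income", 0.91);
        importance.put("zipCode", 0.12);
        importance.put("children", 0.47);

        System.out.println(topN(importance, 2));          // [income, age]
        System.out.println(aboveCutoff(importance, 0.4)); // [income, age, children]
    }
}
```

Either selection strategy yields the attribute subset to pass on to a subsequent model build.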
3.1.4 Clustering
Clustering has been used in customer segmentation, gene and protein analysis, product
grouping, finding numerical taxonomies, and text mining. Clustering analysis identifies
clusters embedded in the data, where a cluster is a collection of data objects that are simi-
lar to one another. A good clustering method produces high quality clusters to ensure that
the inter-cluster similarity is low and the intra-cluster similarity is high. The similarity of two values of an attribute can be expressed through distance functions. For numeric data, this can be as simple as the Euclidean distance between points. For categorical data, similarity can be defined so that, for example, married and cohabiting are closer to one another, as are separated and divorced.
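To make these distance functions concrete, here is a small self-contained sketch (illustrative only, not part of the JDM API): a Euclidean distance over numeric attribute vectors, and a hand-built distance table for a hypothetical marital-status attribute in which related categories are closer:

```java
import java.util.*;

// Illustrative distance functions for clustering (not JDM API).
public class Distances {

    // Euclidean distance between two numeric attribute vectors.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Hypothetical distance table for a categorical attribute: related
    // categories (married/cohabiting, separated/divorced) are closer.
    static double maritalDistance(String x, String y) {
        if (x.equals(y)) return 0.0;
        Set<String> pair = new HashSet<>(Arrays.asList(x, y));
        if (pair.equals(new HashSet<>(Arrays.asList("married", "cohabiting")))) return 0.2;
        if (pair.equals(new HashSet<>(Arrays.asList("separated", "divorced")))) return 0.2;
        return 1.0; // unrelated categories are maximally distant
    }

    public static void main(String[] args) {
        System.out.println(euclidean(new double[]{0, 0}, new double[]{3, 4})); // 5.0
        System.out.println(maritalDistance("married", "cohabiting"));          // 0.2
        System.out.println(maritalDistance("married", "divorced"));            // 1.0
    }
}
```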
3.1.5 Association
Association has been used in market basket analysis and the analysis of consumer behavior for the discovery of relationships or correlations among a set of items, e.g., the presence of one pattern implies the presence of another pattern. Association models help to identify the attribute value conditions that frequently occur together in a given set of data. Association analysis is widely used in transaction data analysis for directed marketing, catalog design, and other business decision-making processes. Traditionally, association is used for market basket data analysis, such as “90% of the people who buy milk also buy bread.”
Support and confidence metrics are used as quality measures of a rule within an association model. These are available in JDM as part of the association model for each rule produced. Note that the rules returned from an association model are different from the predicate-based rules produced from clustering models or decision tree models. Here, the rules consist of a set of items. These items typically occur together in a single transaction, such as the items purchased at an online retail checkout.
The support of a rule is used to ensure that the items associated in the rule occur together frequently enough to be considered significant. Using probability notation:
support(A → B) = P(A, B)
The confidence of a rule is the conditional probability of B given A:
confidence(A → B) = P(B | A) = P(A, B) / P(A)
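These two formulas can be checked with a small self-contained sketch (illustrative, not part of the JDM API) that counts itemsets over a list of transactions:

```java
import java.util.*;

// Illustrative computation of rule support and confidence (not JDM API).
public class RuleMetrics {

    // support: fraction of transactions containing every item in 'items'.
    static double support(List<Set<String>> transactions, Set<String> items) {
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    // confidence(A -> B) = P(A, B) / P(A)
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        return support(transactions, both) / support(transactions, antecedent);
    }

    public static void main(String[] args) {
        List<Set<String>> txns = Arrays.asList(
            new HashSet<>(Arrays.asList("milk", "bread")),
            new HashSet<>(Arrays.asList("milk", "bread", "eggs")),
            new HashSet<>(Arrays.asList("milk", "eggs")),
            new HashSet<>(Arrays.asList("bread")));

        // support(milk -> bread) = P(milk, bread) = 2/4 = 0.5
        System.out.println(support(txns, new HashSet<>(Arrays.asList("milk", "bread"))));
        // confidence(milk -> bread) = P(milk, bread) / P(milk) = 0.5 / 0.75
        System.out.println(confidence(txns, Collections.singleton("milk"),
                                      Collections.singleton("bread")));
    }
}
```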
3.2 Data mining tasks
Data mining revolves around a few common tasks: building a model, testing a model, applying a model to data, computing statistics, and importing and exporting mining objects. Each of these is discussed below.
3.2.1 Building a model
JDM enables users to build models in the following functional areas: classification, regression, attribute importance, clustering, and association. The model serves as a typically concise or compact representation of the information contained in the data, relative to the algorithm that produced it. To build models, users define tasks, which minimally require as input parameters a model name, mining data, and mining settings. Settings contain parameters that describe the type of model to be built, as well as directions to the specific algorithm used to build the model.
There are two levels of settings: function and algorithm. Recall that the mining function
addresses the type of problem to be solved, e.g., classification or clustering, and the min-
ing algorithm addresses the specific technique to be applied to solve that problem, e.g.,
decision tree or k-means. When a user does not specify algorithm settings in a build settings object, the Data Mining Engine (DME) may choose an appropriate algorithm for the task, either dynamically or statically, providing defaults for the relevant parameters. Model
building at the function level eliminates much of the technical details of data mining for
the user. The quality of models will be determined by the sophistication of the vendor’s
implementation and the quality of the data.
Build data, i.e., the data used as input to build a model, can take different forms. The attributes of the build data to be used in model building may be specified in the logical data associated with the build settings. JDM supports flexible assignment of build data to the logical data. If logical attributes do not map directly to physical attributes by name-based equivalence, an explicit mapping may be provided using the task object.
A typical scenario for model building is as follows:
1. Create a physical data object (by identifying existing data in a database table or file)
2. Create a build settings object
3. Create a logical data instance based on the physical data and associate it with the build
settings (optional)
4. Create an algorithm settings object and associate it with the build settings (optional)
5. Create a build task and set the physical data and build settings
6. Map the physical attributes to logical attributes (if necessary)
7. Invoke the execute method using the task
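The steps above can be sketched against the JSR-73 interfaces as follows. This is a hedged sketch, not a definitive implementation: the table name, attribute name, saved-object names, and the assumption of an already open DME connection are all hypothetical, and the optional steps (3, 4, and 6) are omitted.

```java
import javax.datamining.ExecutionHandle;
import javax.datamining.data.PhysicalDataSet;
import javax.datamining.data.PhysicalDataSetFactory;
import javax.datamining.resource.Connection;
import javax.datamining.supervised.classification.ClassificationSettings;
import javax.datamining.supervised.classification.ClassificationSettingsFactory;
import javax.datamining.task.BuildTask;
import javax.datamining.task.BuildTaskFactory;

// Sketch of the model-build scenario (steps 1, 2, 5, and 7).
// Assumes 'dmeConn' is an open DME connection; names are hypothetical.
public class BuildExample {
    public static void buildModel(Connection dmeConn) throws Exception {
        // 1. Create a physical data object over an existing table.
        PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
            dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
        PhysicalDataSet buildData = pdsFactory.create("CUSTOMERS", true);
        dmeConn.saveObject("custBuildData", buildData, false);

        // 2. Create a build settings object (function level only; the DME
        //    may choose the algorithm and default its parameters).
        ClassificationSettingsFactory csFactory = (ClassificationSettingsFactory)
            dmeConn.getFactory(
                "javax.datamining.supervised.classification.ClassificationSettings");
        ClassificationSettings settings = csFactory.create();
        settings.setTargetAttributeName("buyMinivan");
        dmeConn.saveObject("custBuildSettings", settings, false);

        // 5. Create a build task naming the data, settings, and model.
        BuildTaskFactory btFactory = (BuildTaskFactory)
            dmeConn.getFactory("javax.datamining.task.BuildTask");
        BuildTask task = btFactory.create("custBuildData", "custBuildSettings", "custModel");
        dmeConn.saveObject("custBuildTask", task, false);

        // 7. Invoke the execute method and wait for completion.
        ExecutionHandle handle = dmeConn.execute("custBuildTask");
        handle.waitForCompletion(Integer.MAX_VALUE);
    }
}
```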
After a model is built by the DME, it can be persisted in the mining object repository (MOR). See Section 3.7 for details on JDM persistence options.
The result of a build is a model. Especially for predictive models, the logical attributes used by the model may be a subset of those provided in the logical data. As such, the model has a signature specifying the possible input attributes to the model for
apply. These attributes are not all required, since a subset may be supplied where NULL values can be handled. Some algorithms perform automatic attribute selection; e.g., with a decision tree model, 100 attributes may have been used to train the model, but only 25 were used in the final rule set and are necessary for scoring. These 25 constitute the model signature.
3.2.1.1 Incremental learning
Some applications have a nearly continuous stream of data available for model building. A typical approach is to collect a certain amount of data, build a model from it, use the model to score new data for some period, and then build a new model from scratch, possibly using all the data accumulated to date, or using a fixed amount of the most recent data.
This approach can be unnecessarily costly, especially for algorithms such as Naive Bayes or association rules, where summary frequency counts are maintained. The frequency counts over the existing data do not change; only the newly added data needs to be counted, and the results merged with the previous counts. This produces refreshed models in much less time.
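The frequency-count merge that makes such refreshes cheap can be illustrated with a small self-contained sketch (not part of the JDM API; the attribute-value keys are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative incremental refresh of summary frequency counts (not JDM API):
// counts from the new data batch are merged into the existing counts,
// without rescanning the previously counted data.
public class CountMerge {

    static Map<String, Long> merge(Map<String, Long> existing, Map<String, Long> fresh) {
        Map<String, Long> merged = new HashMap<>(existing);
        fresh.forEach((key, count) -> merged.merge(key, count, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> existing = new HashMap<>();
        existing.put("churn=yes", 120L);
        existing.put("churn=no", 880L);

        Map<String, Long> newBatch = new HashMap<>();
        newBatch.put("churn=yes", 15L);
        newBatch.put("churn=no", 85L);

        Map<String, Long> refreshed = merge(existing, newBatch);
        System.out.println(refreshed.get("churn=yes")); // 135
        System.out.println(refreshed.get("churn=no"));  // 965
    }
}
```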
Algorithms such as neural networks can also leverage incremental learning. Here, a previ-
ously trained neural network can be provided as a seed model. The model is further trained
using new data, but starting from an already good model.
JDM provides support for incremental learning by allowing a seed model to be specified
as input to the build task. Not all functions or algorithms are expected to handle the specification of a seed model for incremental learning. The function and algorithm capabilities list indicates whether this feature is supported; such support is vendor-specific.
3.2.1.2 Model evaluation
Some algorithms, such as neural networks or decision trees, use a portion of the build data to iteratively determine how well the model is learning patterns from the data. These algorithms split the build data into a training and an evaluation dataset according to some internal percentage, e.g., 50%-50% or 70%-30%. Some users, however, wish to control more carefully which data is used for training versus evaluation.
JDM provides support for specifying the evaluation data explicitly in the build task to be
used during model building. Although some vendors may provide proprietary algorithm
settings to allow specifying the percentage of data to be used for evaluation, JDM provides
the more explicit option of providing the actual data.
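Such a user-controlled split can be sketched with a small self-contained example (illustrative only; JDM itself takes the evaluation data as an explicit input to the build task rather than splitting internally):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative 70%-30% split of build data into training and evaluation
// partitions; JDM instead lets the user supply evaluation data explicitly.
public class TrainEvalSplit {

    // Returns [training, evaluation] partitions of the cases.
    static <T> List<List<T>> split(List<T> cases, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(cases);
        Collections.shuffle(shuffled, new Random(seed)); // reproducible shuffle
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));               // training
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size()))); // evaluation
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> cases = new ArrayList<>();
        for (int i = 0; i < 10; i++) cases.add(i);
        List<List<Integer>> parts = split(cases, 0.7, 42L);
        System.out.println(parts.get(0).size()); // 7
        System.out.println(parts.get(1).size()); // 3
    }
}
```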
3.2.2 Testing a model
Model testing gives an estimate of the accuracy of a supervised model in predicting the target. Testing follows model building and computes the accuracy of a model’s predictions when the model is applied to a previously unseen dataset, separate from the build dataset. This provides an honest estimate of the model’s accuracy.
The test task accepts a model and data for testing the model. Test results are stored in a
TestMetrics object as specified in the task. Physical attribute to logical attribute mapping
may be specified if the names of physical and logical attributes do not match. The test data
must be compatible with the model signature.
Test data must be preprocessed in the same way as the build data. The user is responsible
for ensuring this compatibility. However, some DMEs may choose to use information
present in the LogicalData stored with the model to flag incompatibilities.
Test metrics content depends on the type of model. For example, classification models
produce a confusion matrix, whereas regression models provide error estimates. In addi-
tion to obtaining a confusion matrix, model testing includes the option to compute lift and receiver operating characteristics (R