8/20/2019 Java Datamining Spec.
1/156
Maintenance Release JavaTM Data Mining (JDM) Version 1.1
June 22, 2005 1
JavaTM Specification Request 73:
JavaTM Data Mining (JDM)
JSR-73 Expert Group
Specification Lead: Mark Hornick, Oracle Corporation
Technical comments:
Version 1.1
Maintenance Release Specification
June 22, 2005
Copyright
Copyright (c) 2005 Oracle Corporation. All rights reserved.
Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or documentation may be reproduced in any form by any means without prior written authorization of the copyright holders, or any of the licensors, if any. Any unauthorized use may be a violation of domestic or international law. RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the U.S. Government and its agents is subject to the restrictions of FAR 52.227-14(g)(2)(6/87) and FAR 52.227-19(6/87), or DFAR 252.227-7015(b)(6/95) and DFAR 227.7202-3(a).
Disclaimer
This document and its contents are furnished “as is” for informational purposes only, and are subject to change without notice. Oracle Corporation (Oracle) does not represent or warrant that any product or business plans expressed or implied will be fulfilled in any way. Any actions taken by the user of this document in response to the document or its contents shall be solely at the risk of the user.
ORACLE MAKES NO WARRANTIES, EXPRESSED OR IMPLIED, WITH RESPECT TO THIS DOCUMENT OR ITS CONTENTS, AND HEREBY EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR USE OR NON-INFRINGEMENT. IN NO EVENT SHALL ORACLE BE HELD LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES IN CONNECTION WITH OR ARISING FROM THE USE OF ANY PORTION OF THE INFORMATION.
Trademarks
Sun, Sun Microsystems, Java, JavaBeans, and Enterprise JavaBeans are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the U.S. and other countries.
OMG, Object Management Group, CORBA, Unified Modeling Language, and UML are registered trademarks or trademarks of the Object Management Group, Inc.
All other product or company names mentioned are for identification purposes only, and may be trademarks of their respective owners.
1. Overview..................................................................................................................1
1.1 Introduction..........................................................................................................................1
1.1.1 Benefits..................................................................................................................1
1.1.2 Target audience......................................................................................................2
1.1.3 Data analytics JSRs ...............................................................................................2
1.1.4 Exclusions .............................................................................................................2
1.2 Architectural components ....................................................................................................3
1.3 Dependencies and relationships...........................................................................................4
1.4 Organization.........................................................................................................................4
1.5 Expert group members.........................................................................................................5
1.6 Acknowledgements..............................................................................................................5
2. Use cases..................................................................................................................6
2.1 Application use cases...........................................................................................................6
2.1.1 Mining GUI ...........................................................................................................6
2.1.2 Web specialty retailer ............................................................................................7
2.1.3 Campaign management .........................................................................................7
2.1.4 Minimal top level specification.............................................................................7
2.1.5 Selecting the “best” model ....................................................................................8
2.1.6 Comparing vendor implementations .....................................................................8
2.1.7 Incremental learning..............................................................................................8
2.1.8 Deferred task execution.........................................................................................9
2.1.9 Explaining model behavior....................................................................................9
2.1.10 Manually enhancing a model.................................................................................9
2.1.11 OLAP schema refinement ...................................................................................10
2.1.12 Web services........................................................................................................10
2.2 Vendor use cases ................................................................................................................11
2.2.1 Broad support of JDM.........................................................................................11
2.2.2 Narrow support of JDM.......................................................................................12
3. Concepts.................................................................................................................13
3.1 Data mining functions........................................................................................................13
3.1.1 Classification.......................................................................................................13
3.1.2 Regression ...........................................................................................................13
3.1.3 Attribute Importance ...........................................................................................14
3.1.4 Clustering ............................................................................................................14
3.1.5 Association ..........................................................................................................14
3.2 Data mining tasks...............................................................................................................15
3.2.1 Building a model .................................................................................................15
3.2.2 Testing a model....................................................................................................16
3.2.3 Applying a model ................................................................................................17
3.2.4 Object import and export.....................................................................................18
3.2.5 Computing statistics on data................................................................................19
3.2.6 Verifying task correctness....................................................................................19
3.3 Principal objects.................................................................................................................20
3.3.1 Connection...........................................................................................................20
3.3.2 Task......................................................................................................................20
3.3.3 Execution handle and status ................................................................................20
3.3.4 Physical data set ..................................................................................................21
3.3.5 Physical data record.............................................................................................21
3.3.6 Build settings.......................................................................................................21
3.3.7 Algorithm ............................................................................................................22
3.3.8 Algorithm settings ...............................................................................................22
3.3.9 Model...................................................................................................................22
3.3.10 Model signature...................................................................................................22
3.3.11 Model detail.........................................................................................................23
3.3.12 Logical attribute...................................................................................................23
3.3.13 Logical data .........................................................................................................23
3.3.14 Attribute statistics set ..........................................................................................23
3.3.15 Apply settings......................................................................................................24
3.3.16 Confusion matrix.................................................................................................24
3.3.17 Lift.......................................................................................................................24
3.3.18 Cost matrix ..........................................................................................................25
3.3.19 Prior probabilities ................................................................................................25
3.3.20 Category sets .......................................................................................................26
3.3.21 Taxonomy............................................................................................................26
3.3.22 Rules....................................................................................................................27
3.3.23 Verification report................................................................................................27
3.4 Physical data representations.............................................................................................27
3.4.1 Individual record .................................................................................................27
3.4.2 Single record case table.......................................................................................28
3.4.3 Multi-record case table ........................................................................................28
3.4.4 Data preparation ..................................................................................................29
3.5 Attribute mapping ..............................................................................................................29
3.5.1 Direct mapping ....................................................................................................29
3.5.2 Pivot mapping......................................................................................................30
3.6 Creating physical data objects ...........................................................................................30
3.7 Persistence .........................................................................................................................30
3.8 Object references ...............................................................................................................31
3.9 Reflection / introspection...................................................................................................32
4. Packages.................................................................................................................34
4.1 Design overview ................................................................................................................34
4.2 Notation .............................................................................................................................34
4.3 Package structure...............................................................................................................36
4.4 Package javax.datamining .................................................................................................38
4.5 Package javax.datamining.base .........................................................................................40
4.6 Package javax.datamining.resource ...................................................................................43
4.7 Package javax.datamining.data..........................................................................................44
4.8 Package javax.datamining.task..........................................................................................47
4.8.1 Package task.apply ..............................................................................................49
4.9 Package javax.datamining.supervised ...............................................................................50
4.9.1 Package supervised.classification........................................................................51
4.9.2 Package supervised.regression ............................................................................54
4.9.3 Package attributeimportance ...............................................................................55
4.10 Package javax.datamining.association...............................................................................56
4.11 Package javax.datamining.clustering.................................................................................58
4.12 Package javax.datamining.rule ..........................................................................................61
4.13 Package javax.datamining.statistics...................................................................................62
4.14 Package javax.datamining.algorithm.................................................................................63
4.14.1 Package algorithm.tree ........................................................................................63
4.14.2 Package algorithm.naivebayes ............................................................................64
4.14.3 Package algorithm.feedforwardneuralnet............................................................65
4.14.4 Package algorithm.svm........................................................................................66
4.14.5 Package algorithm.kmeans ..................................................................................67
4.15 Package javax.datamining.modeldetail..............................................................................68
4.15.1 Package modeldetail.tree.....................................................................................68
4.15.2 Package modeldetail.feedforwardneuralnet ........................................................69
4.15.3 Package modeldetail.naivebayes .........................................................................69
4.15.4 Package modeldetail.svm ....................................................................................70
5. Code examples .......................................................................................................71
5.1 Building a clustering model...............................................................................................71
5.2 Applying a clustering model to data..................................................................................73
5.3 Applying a clustering model to a record............................................................................74
5.4 Building a classification model..........................................................................................75
5.5 Testing a classification model............................................................................................76
5.6 Building and extracting rules from a tree model ...............................................................77
5.7 Extracting rules from an association model.......................................................................79
5.7.1 Get rules with minimum support.........................................................................79
5.7.2 Get rules with minimum support and confidence................................................79
5.7.3 Get rules containing certain items .......................................................................80
5.7.4 Get rules that do not contain certain items..........................................................81
5.8 Importing and exporting a model.......................................................................................81
5.8.1 Import an object using a URI ..............................................................................82
5.8.2 Export a model ....................................................................................................83
5.8.3 Export an object to a destination .........................................................................83
5.9 Using reflection..................................................................................................................84
5.10 Establishing a connection ..................................................................................................85
5.11 Uniform resource identifiers ..............................................................................................85
6. Conformance statement .........................................................................................87
6.1 Required and optional features ..........................................................................................87
6.2 Vendor extensions ..............................................................................................................88
6.3 Compliance points .............................................................................................................88
6.4 Determining conformance .................................................................................................89
6.4.1 Function level conformance ................................................................................89
6.4.2 Algorithm level conformance..............................................................................90
6.4.3 Model apply engine conformance .......................................................................91
6.5 Claiming conformance.......................................................................................................91
7. Summary................................................................................................................93
Appendix A. Glossary.........................................................................................................94
Appendix B. Requirements...............................................................................................102
B.1. Domain requirements.......................................................................................................102
B.2. Foundation technologies..................................................................................................103
B.3. Data mining standards .....................................................................................................103
B.4. System behavior...............................................................................................................103
B.5. Exclusions for version 1 ..................................................................................................104
B.5.1. Domain exclusions ............................................................................................104
B.5.2. System exclusions .............................................................................................104
Appendix C. Optional Methods ........................................................................................106
Appendix D. Exceptions ...................................................................................................107
Appendix E. Web services ................................................................................................110
E.1. Introduction......................................................................................................................110
E.2. Methods ...........................................................................................................................111
E.2.1. WSDL Document Structure ..............................................................................111
E.2.2. Listing DME Contents.......................................................................................112
E.2.3. Introspection / Reflection ..................................................................................114
E.2.4. Saving objects....................................................................................................115
E.2.5. Retrieving objects..............................................................................................116
E.2.6. Removing objects ..............................................................................................117
E.2.7. Renaming objects ..............................................................................................118
E.2.8. Retrieving Object Components .........................................................................119
E.2.9. Verify Object .....................................................................................................120
E.2.10. Executing tasks..................................................................................................121
E.2.11. Getting execution status ....................................................................................123
E.2.12. Terminating Tasks..............................................................................................123
E.3. Java methods supporting XML........................................................................................124
E.4. XML Schema Definition .................................................................................................125
E.4.1. JDM Document .................................................................................................125
E.4.2. Task....................................................................................................................125
E.4.3. Task.Apply.........................................................................................................128
E.4.4. Data....................................................................................................................129
E.4.5. Supervised .........................................................................................................132
E.4.6. Supervised.Classification ..................................................................................133
E.4.7. Supervised.Regression.......................................................................................135
E.4.8. Clustering ..........................................................................................................136
E.4.9. Association........................................................................................................138
E.4.10. AttributeImportance ..........................................................................................138
E.4.11. Statistics.............................................................................................................139
E.4.12. Algorithm ..........................................................................................................140
E.4.13. Base ...................................................................................................................143
E.4.14. Root ...................................................................................................................145
E.4.15. Enumeration extension......................................................................................146
Appendix F. References ....................................................................................................148
TABLE 1. An example of a single record case table ..........................................................................................28
TABLE 2. An example of a multi-record case table...........................................................................................28
TABLE 3. Named and composite object referencing summary..........................................................................31
TABLE 4. Function-level model behavior........................................................................................................102
TABLE 5. JDM optional methods for models and model details .....................................................................106
TABLE 6. JDMException codes and messages ................................................................................................108
TABLE 7. JDM runtime exceptions, codes, and messages...............................................................................109
FIGURE 1.1 Architecture configuration options ......................................................................................................3
FIGURE 1.2 Example of attribute mapping for apply............................................................................................29
FIGURE 4.2 Top level package structure ...............................................................................................................37
FIGURE 4.3 Common top level interfaces .............................................................................................................38
FIGURE 4.4 Exception classes...............................................................................................................................38
FIGURE 4.5 Top level enumerations......................................................................................................................39
FIGURE 4.6 Execution Handle...............................................................................................................................39
FIGURE 4.7 Package javax.datamining.base - Named Objects .............................................................40
FIGURE 4.8 Package javax.datamining.base - Build Settings, Model, and Task ..................................40
FIGURE 4.9 Package javax.datamining.base - BuildSettings ................................................................................41
FIGURE 4.10 Package javax.datamining.base - Model............................................................................................42
FIGURE 4.11 Package javax.datamining.resource...................................................................................................43
FIGURE 4.12 Package javax.datamining.data - PhysicalData .................................................................................44
FIGURE 4.13 Package javax.datamining.data - LogicalData...................................................................................45
FIGURE 4.14 Package javax.datamining.data - ModelSignature.............................................................................45
FIGURE 4.15 Package javax.datamining.data - Taxonomy .....................................................................................46
FIGURE 4.16 Package javax.datamining.data - CategoryMatrix.............................................................................46
FIGURE 4.17 Package javax.datamining.data - CategorySet and Interval ..............................................................46
FIGURE 4.18 Package javax.datamining.task - Build..............................................................................................47
FIGURE 4.19 Package javax.datamining.task - Import and Export .........................................................................48
FIGURE 4.20 Package javax.datamining.task - ComputeStatistics..........................................................................48
FIGURE 4.21 Package task.apply - ApplyTask and ApplySettings .........................................................49
FIGURE 4.22 Package javax.datamining.supervised - Settings and Model.............................................50
FIGURE 4.23 Package javax.datamining.supervised - TestTask and TestMetrics ...................................................50
FIGURE 4.24 Package supervised.classification - Settings and Model ...................................................................51
FIGURE 4.25 Package supervised.classification - TestTask and TestMetrics..........................................................52
FIGURE 4.26 Package supervised.classification ClassificationTestMetricsTask.....................................................52
FIGURE 4.27 Package supervised.classification - ApplySettings............................................................................53
FIGURE 4.28 Package supervised.classification - Confusion Matrix and Cost Matrix ...........................................53
FIGURE 4.29 Package supervised.regression - Settings and Model ........................................................................54
FIGURE 4.30 Package supervised.regression - TestTask, and ApplySettings .........................................................54
FIGURE 4.31 Package supervised.regression - RegressionTestMetricsTask...........................................55
FIGURE 4.32 Package javax.datamining.attributeimportance - Settings and Model...............................................55
FIGURE 4.33 Package javax.datamining.associationrules - Settings and Model ....................................................56
FIGURE 4.34 Package javax.datamining.associationrules - Rule Selection ............................................................57
FIGURE 4.35 Package javax.datamining.clustering - Model...................................................................58
FIGURE 4.36 Package javax.datamining.clustering - Settings ................................................................59
FIGURE 4.37 Package javax.datamining.clustering - ApplySettings ......................................................................60
FIGURE 4.38 Package javax.datamining.clustering - Similarity Matrix .................................................................60
FIGURE 4.39 Package javax.datamining.rule - Rule and Predicate.........................................................................61
FIGURE 4.40 Package javax.datamining.statistics - AttributeStatistics ..................................................................62
FIGURE 4.41 Package algorithm.tree - TreeSettings...............................................................................................63
FIGURE 4.42 Package algorithm.naivebayes - NaiveBayesSettings .......................................................................64
FIGURE 4.43 Package algorithm.feedforwardneuralnet - FeedForwardNeuralNetSettings....................................65
FIGURE 4.44 Package algorithm.svm.classification - SVMClassificationSettings.................................................66
FIGURE 4.45 Package algorithm.svm.regression - SVMRegressionSettings..........................................................66
FIGURE 4.46 Package algorithm.kmeans - KMeansSettings...................................................................................67
FIGURE 4.47 Package modeldetail.tree - TreeModelDetail ....................................................................................68
FIGURE 4.48 Package modeldetail.feedforwardneuralnet - NeuralNetworkModelDetail ......................................69
FIGURE 4.49 Package modeldetail.naivebayes - NaiveBayesModelDetail.............................................................69
FIGURE 4.50 Package modeldetail.svm - SVMModelDetail ..................................................................................70
1. Overview
1.1 Introduction
The Java Data Mining (JDM) specification addresses the need for a pure Java API to facilitate development of data mining-enabled applications. JDM supports common data mining operations, as well as the creation, persistence, access, and maintenance of metadata supporting mining activities.
Currently, no existing Java platform specification provides a standard API for data mining systems. Existing APIs are vendor-proprietary. By using JDM, implementers of data mining applications can expose a single, standard API that will be understood by a wide variety of developers writing client applications and components running on the Java™ 2 Platform. Similarly, data mining clients can be coded against a single API that is independent of the underlying data mining system. JDM is targeted for the Java™ 2 Platform, Enterprise Edition (J2EE™) and Standard Edition (J2SE™).
In JDM, data mining [Mitchell1997, BL1997] includes the functional areas of classification, regression, attribute importance1, clustering, and association. These are supported by such supervised and unsupervised learning algorithms as decision trees, neural networks, Naive Bayes, Support Vector Machine, K-Means, and Apriori, on structured data. Common operations include model build, test, and apply (score). A particular implementation of this specification may not necessarily support all interfaces and services defined by JDM. However, JDM provides a mechanism for client discovery of supported interfaces and capabilities.
JDM is based on a generalized, object-oriented, data mining conceptual model leveraging emerging data mining standards such as the Object Management Group's Common Warehouse Metadata (CWM), ISO's SQL/MM for Data Mining, and the Data Mining Group's Predictive Model Markup Language (PMML), as appropriate.
Implementation details of JDM are delegated to each vendor. A vendor may decide to implement JDM as a native API of its data mining product. Others may opt to develop a driver/adapter that mediates between a core JDM layer and multiple vendor products. The JDM specification does not prescribe a particular implementation strategy, nor does it prescribe performance or accuracy of a given capability or algorithm.
To ensure J2EE™ compatibility and eliminate duplication of effort, JDM leverages existing specifications. In particular, JDM leverages the Java Connection Architecture [JSR16] to provide communication and resource management between applications and the services that implement the JDM API. JDM also reflects aspects of the Java Metadata Interface [JSR40] for the interface specification.
1.1.1 Benefits
The availability of a J2EE™-compliant data mining API provides benefit to both vendors and users of tools and applications in the areas of business intelligence, business analytics, data mining systems, data warehousing, and life sciences / bioinformatics.
Historically, application developers coded homegrown data mining algorithms into applications, or used sophisticated end-user GUIs. These GUIs packaged a suite of algorithms complete with support for data transformation, model building, testing, and scoring. However, it was difficult, if not impossible, to embed data mining end-to-end in applications using commercial data mining products due to inadequate APIs. If a vendor had an API, it was proprietary, making the development of a product using that API risky. If a different
vendor's solution was required, rewriting that product was also potentially costly.
1. Attribute importance is also referred to as feature selection or key fields analysis.
The ability to leverage data mining functionality via a standard API greatly reduces risk and potential cost. With a standard API, customers can use multiple products for solving business problems by applying the most appropriate algorithm implementation without investing resources to learn each vendor's proprietary API. Moreover, a standard API makes data mining more accessible to developers while making developer skills more transferable. Vendors can now differentiate themselves on price, performance, accuracy, and features. Java Data Mining (JDM) addresses this need for Java.
1.1.2 Target audience
The target audiences for the JDM specification can be categorized into the following groups:
• data mining vendors – companies that intend to implement this API for their respective products, thereby providing the API to end users
• application developers – Java programmers who wish to use a data mining API for building GUIs or other applications that benefit from data mining technology
• data mining experts – individuals with advanced degrees in statistics, machine learning, or data mining; or with significant practical data mining experience
• data mining novices – Java-knowledgeable developers who have a basic understanding of the problems that data mining can solve, who can minimally leverage the function level of data mining tasks
1.1.3 Data analytics JSRs
The complement to data mining in data analytics is online analytical processing (OLAP). To distinguish between OLAP and data mining, consider that OLAP follows a deductive (query-oriented) strategy of analyzing data. Users formulate hypotheses and execute queries to gain understanding of the underlying data. Data mining follows an inductive strategy of analyzing data, where users apply machine learning algorithms to extract non-obvious knowledge from the data.
JOLAP (JSR-69) specifies a Java API for OLAP and shares a common basis in the OMG CWM meta-model. The JDM expert group is working with the JOLAP expert group to minimize overlap and leverage common modeling techniques and infrastructure where applicable.
1.1.4 Exclusions
The domain of “data mining” is quite large. The JDM expert group made decisions early on to exclude certain features from JDM to make it more manageable. As such, functionality such as data transformations, visualization, mining unstructured data (e.g., text), wrappers and ensembles, and sensitivity analysis has been omitted from this first version of the API. Note that with respect to visualization, JDM does provide many of the key data objects necessary to support visualization, e.g., the confusion matrix, lift results, decision tree representation, and neural network architecture.
From a systems perspective, JDM does not specify behavior for transactions, scheduling,
or security. These are left to vendors to determine what best suits their respective products
and customer base.
1.2 Architectural components
JDM has three logical components that may be implemented as one executable or in a distributed environment:
• application programming interface (API) - The API is the end-user-visible component of a JDM implementation that allows access to services provided by the data mining engine (DME). An application developer using JDM requires knowledge only of the API library, not of other supporting components.
• data mining engine (DME) - A DME provides the infrastructure that offers a set of
data mining services to its API clients. When implemented as a server of a client-
server architecture, it is referred to as a data mining server (DMS), which is a specific
instantiation of the more general Enterprise Information System (EIS) as specified in
the Connector Architecture (JSR-16).
• mining object repository (MOR) - The DME uses a mining object repository which serves to persist data mining objects. This repository can be based on, e.g., the CWM framework, specifically leveraging the CWM Data Mining metamodel, or implemented using a vendor-proprietary representation. The MOR may exist in a file-based environment, or in a relational / object database. Section 3.7 discusses JDM persistence options.
Figure 1.1 depicts three possible architectures for a JDM implementation. In (a), each
component resides in a separate physical location or separate executable. We view this as
a three-tier architecture with the data stored in a separate repository, such as a database. In
(b), the DME contains the MOR, resulting in a classic client-server architecture. This scenario is possible, e.g., where the database contains both the DME and MOR, or the DME uses the local file system for persistent storage. In (c), the system is monolithic, i.e., API, DME, and MOR reside in, or are managed by, a single executable.
FIGURE 1.1 Architecture configuration options
A vendor may choose to provide additional utilities and management interfaces to the DME and MOR; however, these are not defined as part of JDM and may be proprietary. The JDM specification does not place any requirements on the DME and MOR design or implementation except to support functionality as required by the JDM interface.
Vendors may implement a subset of the complete JDM specification as noted in the section on conformance. This a la carte approach provides a common framework for all data mining functionality, while allowing vendors to support only vendor-relevant portions of it.
1.3 Dependencies and relationships
JDM leverages aspects of the CWM Data Mining metamodel and the Java Metadata Interface (JSR-40). CWM Data Mining facilitates the construction and deployment of data warehousing and business intelligence applications, tools, and platforms based on OMG open standards for metadata and system specification (i.e., MOF, UML, XMI, CWM). The Java Metadata Interface provides a common naming convention for methods.
The following specifications serve as design references for JDM:
• DMG PMML 2.0 [PMML] provides an XML-based representation for mining models and facilitates the interchange of model results among vendors.
• ISO SQL/MM Part 6: Data Mining [SQL/MM-DM] provides a standard interface to RDBMSs for performing data mining. Concepts from this approach are leveraged in the overall JDM design.
• Common Warehouse Metamodel [CWM] and CWM Specification, Volume 1, Chapter 15, Data Mining [CWM-DM] provide a sense of the overall structure of the metadata JDM supports.
1.4 Organization
This document focuses on JDM requirements, concepts, use cases, code examples, packages supporting the API, and vendor conformance.
In Section 2, we present use cases to help the reader appreciate how this API can be used
under various circumstances, both by end users and vendors conforming to the standard.
In Section 3, we present the synthesis of data mining concepts that form the basis of the JDM model. These concepts result from analyzing the requirements of many different data mining functions and algorithms. These concepts are key to providing a unified data mining framework.
In Section 4, we present the JDM packages and class diagrams to illustrate the relationships between the various interfaces and classes. Details of each class are provided in the companion Javadoc-generated documentation.
In Section 5, we provide and explain code examples using the JDM API. These examples represent working with the API as a non-data mining expert, relying on convenience routines to automate much of the specification, as well as exposing detailed specification for data mining experts.
In Section 6, we present the requirements for vendor conformance to the API.
In Section 7, we summarize our JDM experience and where the standard is likely to go in subsequent versions.
In appendix A, we provide a glossary of terms used in this document.
In appendix B, we review the data mining domain requirements and foundation technologies driving the API. We explore related data mining standards and common system behavior.
In appendix C, we list optional methods for models and model detail a vendor may choose
to implement.
In appendix D, we provide JDM error codes for JDMException.
In appendix E, we define Web services based on the JDM model. There has been significant interest expressed within the expert group and from external comments for defining a JDM Web services interface.
In appendix F, we provide a list of references.
1.5 Expert group members
Sarabjot Anand – Corporate Intellect
Robert Brunner – California Institute of Technology
Robert Chu – SAS Institute
Werner Dubitzky* – University of Ulster, N. Ireland
Kim Horn – Sun Microsystems, Inc.
Mark Hornick – Oracle Corporation
Bill Hosken* – SPSS, Inc.
Ronny Kohavi* – Blue Martini Software
Achim Kraiss – SAP AG
Marwane Jay Lamimi – KXEN
Christoph Lingenfelder – IBM Germany
Erik Marcade – KXEN
Somesh Marepalli – Computer Associates International, Inc.
Waddys Martinez* – Magnify
Cindy McMullen – BEA Systems
Chuck Mosher – Sun Microsystems, Inc.
John Poole – Hyperion Solutions
Michal Prussak – Fair Isaac
Alex Russakovskii – Hyperion Solutions
Mike Smith – Strategic Analytics
Qian (Cherry) Yang – Computer Associates International, Inc.
Sunil Venkayala – Oracle Corporation
Andrew Walaszek – SPSS, Inc.
Hankil Yoon – Oracle Corporation
* former member
1.6 Acknowledgements
The expert group recognizes and thanks Dipankar Roy and Shiby Thomas for reviewing
previous drafts. We also recognize and thank Marcos Campos, Gary Drescher, Boriana
Milenova, Joe Yarmus, and Yan Zhuang for their contributions to the JDM effort.
2. Use cases
The use cases presented in this section provide a context in which to understand the possible uses of JDM. We have divided use cases into two categories: those relevant to applications and those relevant to vendors implementing JDM-conforming products. Readers already familiar with data mining may want only to browse this section.
Several JDM concepts are introduced briefly below to assist in understanding the use cases. These are described in more detail in Section 3. The reader is expected to be familiar with common data mining terminology.
Mining Function - A major subdomain of data mining that shares common high level
characteristics. Functions include classification, regression, attribute importance, associa-
tion, and clustering.
Task - A container within which to specify arguments to data mining operations to be performed by the data mining engine. Tasks include model building, testing, applying (scoring), computing statistics, and object import and export. Tasks may execute synchronously or asynchronously.
Settings - A collection of parameters specifying the input for building a data mining model or applying a model to data (i.e., scoring). Build settings may be high level, specified for mining functions, or detailed, specified for mining algorithms. Apply settings specify the content of the scoring result and, in some cases, affect the type of content provided. For example, a cost matrix may be specified for classification at apply time.
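To make the cost matrix point concrete: a classifier that honors a cost matrix at apply time picks the class with the lowest expected cost rather than the highest probability. The sketch below is plain, self-contained Java for illustration only; it is not JDM API code, and the class and method names are invented.

```java
// Illustrative only: shows how a cost matrix supplied at apply time can
// change a classification decision. Not part of the JDM API.
public class CostSensitiveApply {

    // costs[actual][predicted]: penalty for predicting class 'predicted'
    // when the true class is 'actual'.
    public static int predictWithCosts(double[] probabilities, double[][] costs) {
        int best = 0;
        double bestCost = Double.MAX_VALUE;
        for (int predicted = 0; predicted < probabilities.length; predicted++) {
            double expected = 0.0;
            for (int actual = 0; actual < probabilities.length; actual++) {
                expected += probabilities[actual] * costs[actual][predicted];
            }
            if (expected < bestCost) {
                bestCost = expected;
                best = predicted;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Class 0 is more probable, but missing class 1 is five times as costly.
        double[] probs = {0.7, 0.3};
        double[][] costs = {{0, 1},   // actual 0: cheap to mispredict
                            {5, 0}};  // actual 1: expensive to miss
        System.out.println(predictWithCosts(probs, costs)); // prints 1
    }
}
```

With uniform costs the most probable class (0) would win; the asymmetric cost matrix flips the decision to class 1.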
Model - An algorithm often produces a compressed representation of input data called a model. This model contains the essential knowledge extracted from the data as determined by the algorithm. A model can be descriptive or predictive. A descriptive model helps in understanding the underlying data or model behavior. For example, an association rules model on market basket data can be used to describe consumer behavior. A predictive model can be an equation or set of rules that makes it possible to predict an unseen or unknown value (the dependent variable or target) from other, known values (independent variables or predictors).
2.1 Application use cases
In this section, we present several end-user use cases involving application developers that
explore a wide variety of situations in which JDM can be used.
2.1.1 Mining GUI
A team of developers is tasked with producing a GUI for visualizing data mining objects. They use JDM to develop a tool for exposing objects for building models, such as build settings, and for viewing model representations or contents. The objects include decision trees, neural networks, and mining results such as confusion matrices and lift. Decision trees can be traversed and graphically displayed in a tree representation; neural networks can be traversed and graphically displayed to show hidden layers and weights on connections. The GUI also supports scoring data, testing models, computing lift, and graphically displaying lift charts.
In this use case, a JDM implementation provides the enabling data mining functionality. If
only standard JDM features are leveraged, this GUI could be portable across vendor JDM
implementations.
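As a concrete illustration of one mining result such a GUI would render, the following self-contained Java sketch tallies a confusion matrix from actual and predicted class labels. It is illustrative only and does not use JDM interfaces.

```java
import java.util.Arrays;

// Illustrative only: building a confusion matrix from actual and predicted
// class labels, as a visualization GUI might do before rendering it.
public class ConfusionMatrixExample {

    // matrix[actual][predicted] counts each (actual, predicted) pair.
    public static int[][] confusionMatrix(int[] actual, int[] predicted, int numClasses) {
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[] actual    = {0, 0, 1, 1, 1, 0};
        int[] predicted = {0, 1, 1, 1, 0, 0};
        System.out.println(Arrays.deepToString(confusionMatrix(actual, predicted, 2)));
        // prints [[2, 1], [1, 2]]
    }
}
```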
2.1.2 Web specialty retailer
A specialty retailer sells from a website, catalogs, and stores. The website has a recommendation feature that is supported by data mining. Customer data are collected from the company's points of sale into its data warehouse. Sales data are combined with demographic data such as age, gender, and income. Demographic data together with product categories are regularly mined for customer 'clusters' using a clustering algorithm. Product sales data are then partitioned by customer cluster, and each cluster is mined for product associations using association rules algorithms. The website uses the resulting association rules to make online product recommendations with each addition to the customer's virtual shopping cart.
In this use case, multiple JDM mining functions are leveraged: clustering, association, and
the ability to score individual records to support online product recommendations.
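The recommendations above rest on rule quality measures such as support and confidence. The following self-contained Java sketch computes both for a candidate rule over a handful of market baskets; it is illustrative only, not JDM API code, and the basket contents are invented.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: support and confidence for an association rule A -> B
// over a set of market baskets, the measures a recommendation feature consults.
public class RuleMetrics {

    // Fraction of baskets containing every item in 'items'.
    public static double support(List<Set<String>> baskets, Set<String> items) {
        long hits = baskets.stream().filter(b -> b.containsAll(items)).count();
        return (double) hits / baskets.size();
    }

    // confidence(A -> B) = support(A union B) / support(A)
    public static double confidence(List<Set<String>> baskets,
                                    Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        return support(baskets, both) / support(baskets, antecedent);
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = List.of(
            Set.of("bread", "milk"),
            Set.of("bread", "butter"),
            Set.of("bread", "milk", "butter"),
            Set.of("milk"));
        // If a cart contains bread, how often does it also contain milk?
        // confidence = support({bread, milk}) / support({bread}) = 0.5 / 0.75
        System.out.println(confidence(baskets, Set.of("bread"), Set.of("milk")));
    }
}
```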
2.1.3 Campaign management
A campaign management application provides automated support for identifying customers to receive a marketing campaign. The application has access to data collected on customer demographics and responsiveness to such mailing campaigns. This application leverages database vendor-specific transformations to prepare data for mining.
Using the mining function attribute importance (also referred to as feature selection), the
application determines which attributes are most relevant for model building. By using a
smaller set of attributes, model build time can be reduced, model predictive accuracy can
increase, and the attributes most valuable to collect from customers can be highlighted.
The application uses a decision tree algorithm to produce rules that can be understood by
the marketing manager, possibly for developing more targeted mailings to customers of a
given set of demographics. Once the model is built, the application tests the model and
sends the test and lift results to the campaign manager, who can assess model quality and
expected results.
Unless directed otherwise, the application uses this model to score new customers eligible for a new mailing campaign. Those customers who have a probability greater than 75% of responding to the mailing will be selected for the mailing.
In this use case, data preprocessing may occur outside JDM using proprietary or ad hoc
techniques. Multiple JDM mining functions and operations are leveraged through task
specification. To communicate models and results to other users, these objects can be
exported, perhaps using an XML representation. JDM’s flexible apply settings allow the
application to specify the score, probability, customer id, and possibly other input data to
be part of the apply result table. Finally, JDM's rule representation, and the ability of certain algorithms to produce rules, is leveraged to explain model behavior. Note that JDM defines predicate-based rules from the decision tree algorithm for either the classification or regression mining function, and from the K-Means algorithm for the clustering mining function.
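The selection step in this use case, keeping only customers whose scored response probability exceeds the 75% threshold, can be sketched in a few lines of self-contained Java. The sketch is illustrative only; the customer ids and the map-shaped apply output are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative only: filtering scored customers by a probability threshold,
// as the campaign application does after the apply (scoring) task completes.
public class CampaignSelection {

    public static List<String> select(Map<String, Double> scores, double threshold) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > threshold) {
                selected.add(e.getKey());
            }
        }
        Collections.sort(selected); // deterministic order for display
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of(
            "cust-01", 0.82, "cust-02", 0.40, "cust-03", 0.91);
        System.out.println(select(scores, 0.75)); // prints [cust-01, cust-03]
    }
}
```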
2.1.4 Minimal top level specification
A college student learned about the potential of data mining to solve many problems. For
her senior biology thesis, she wants to cluster the data she’s collected over the past year on
wild grasses of the African plains to help her categorize those grasses.
Although an avid Java programmer, she is unfamiliar with the details of data mining. Having read about JDM and having access to a commercial implementation through her school, she leverages all the automated aspects of JDM, specifying only the data and accepting all default settings for the Clustering build settings. In this way, no algorithm selection is necessary, nor are any algorithm-specific settings.
She uses the clustering model's API to inspect the identified clusters.
In this use case, JDM allows novice users to extract benefit from data mining technology by eliding algorithm details. Vendor implementations may vary in the degree of automation and the quality of models that automation produces.
2.1.5 Selecting the “best” model
An e-tailer builds models of projected customer revenue on which to base customer discounts. The data analyst for the e-tailer builds multiple regression models drawing on several algorithms: neural networks, decision trees, and Naive Bayes. After building several models of each type, the models are tested against held-aside test data and lift is computed. An initial criterion selects as “best” the model with the least r-squared error.
In this use case, the data analyst leverages a JDM implementation's ability to reuse a single regression build settings object, supplying different algorithm settings. In addition, each model can be tested by defining test tasks, and coding an outer loop to iterate over the test results to identify the “best” model.
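The outer loop described here reduces to computing an error measure per candidate model and keeping the minimum. The self-contained Java sketch below uses plain squared error on held-aside test data; it is illustrative only, with trivial stand-in models in place of real algorithm output, and is not JDM API code.

```java
// Illustrative only: ranking candidate regression models by squared error
// on held-aside test data and selecting the best one.
public class ModelSelection {

    // Stand-in for a built regression model; real models come from the DME.
    public interface Regressor {
        double predict(double x);
    }

    public static double sumSquaredError(Regressor model, double[] x, double[] y) {
        double sse = 0.0;
        for (int i = 0; i < x.length; i++) {
            double residual = y[i] - model.predict(x[i]);
            sse += residual * residual;
        }
        return sse;
    }

    // Returns the index of the model with the least squared error.
    public static int best(Regressor[] models, double[] x, double[] y) {
        int bestIndex = 0;
        for (int i = 1; i < models.length; i++) {
            if (sumSquaredError(models[i], x, y) < sumSquaredError(models[bestIndex], x, y)) {
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        double[] y = {2, 4, 6};       // held-aside data; true relationship y = 2x
        Regressor[] candidates = {
            v -> v + 1,               // stand-in for one algorithm's model
            v -> 2 * v,               // stand-in for another
            v -> 3 * v                // stand-in for a third
        };
        System.out.println(best(candidates, x, y)); // prints 1
    }
}
```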
2.1.6 Comparing vendor implementations
Data Mining Laboratories (DML) performs independent analysis on data mining software
to measure performance, ease of use, and model portability. DML compares the effectiveness of several vendors' regression decision tree implementations in building models for
economic forecasting. Economic forecasts are used in corporate planning to align corporate strategy with the expected economic climate. Using JDM, the DML developers code a test application that builds one neural network model per vendor implementation. After
testing each model, the investigators rank-order the models according to forecast accuracy, learning time, and the ratio of these two. To ensure fairness in assessing model performance and conformance for model portability, a separate scoring engine is used that accepts PMML-standard XML models and generates scores for the test data.
In this use case, the developers are able to code a single program and execute it on multiple vendor implementations, modifying only login information. By exporting models in
PMML format, models can be objectively assessed in a common scoring engine.
2.1.7 Incremental learning
A machine tool manufacturer collects data on the machine settings, materials, and defect
rates for the tools manufactured. These data are provided to a neural network algorithm to
predict the probability of defective components in a given batch of product. Because data
are collected over time, and the chosen neural network architecture and learning algorithm are compute-intensive, the manufacturer needs to apply incremental learning to the neural network as new data become available from each production run.
In this use case, JDM provides an interface that enables incremental learning, i.e., the ability to continue building a model with the original build data or new data. To support this, a
user specifies an existing JDM model as input to a build task, along with other required inputs. On execution of the task, the DME uses this model as a seed from which to continue building the model. This optional specification can be used for any type of algorithm that can leverage a seed model.
2.1.8 Deferred task execution
A cancer researcher, who has limited access to hardware for building and testing models, needs to define and verify a series of mining tasks and store them in the mining object repository. The researcher may even build trial models on very small datasets as part of verifying the tasks. Using an external scheduling mechanism, such as UNIX cron jobs, the researcher schedules execution of these tasks overnight, when computing resources are more available.
In this use case, the researcher uses JDM's task specification and the ability to store objects in the mining object repository. These can later be retrieved for execution. The verify method gives the researcher greater confidence that the tasks will execute to completion; it typically checks whether the logical and physical data map properly and whether the combination of settings specified is compatible.
2.1.9 Explaining model behavior
A bank leverages data mining to predict credit risk for customers seeking home equity loans. To comply with government regulations prohibiting discrimination based on gender or race, the bank must be able to prove that the rules it applies to determine credit risk exclude such criteria.
The bank's data analyst is required to produce a set of human-understandable rules, ideally in an English-like format, that can be presented to government auditors as needed. Bank management also reviews these rules to target certain customer segments for special promotions.
In this use case, the analyst uses the JDM tree settings to request a decision tree representation for a classification model, predicting credit worthiness as low, medium, or high. The analyst then uses JDM's interface to generate rule objects from the decision tree model and translate these rules to a particular format. A given vendor may have an English format implemented for rules.
2.1.10 Manually enhancing a model
A private security agency builds decision tree models to profile suspicious individuals and identify individuals at airports for further screening. However, in their experience they have found that manually enhancing a model can improve its performance and accuracy. Their data mining analyst builds a model, generates an English-like representation of the rules, removes certain irrelevant rules, and possibly even adjusts some of the rule predicates.
Importing this modified model into the data mining system, the analyst sets up an application to enable profiling by leveraging single-record scoring of individuals, accessing information stored in government databases and information obtained from travelers at the airport.
In this use case, the analyst also uses the JDM tree settings to build a classification model.
The rules are generated from the decision tree model and analyzed. However, since JDM
does not enable direct model modification via the API, the analyst can export the model,
perhaps in PMML, to ensure model integrity. The analyst modifies the model and attempts
to import the model. Validation of the manually modified model occurs at import. JDM's support for single-record scoring enables the analyst to produce an application that joins information stored in a database about individuals with that dynamically acquired by airport personnel, perhaps at the ticket counter.
2.1.11 OLAP schema refinement
An OLAP vendor creates cube schemas from fact tables stored in a relational database. A particular fact table contains millions of records representing sales and customer information of a beverage retail company. The OLAP vendor needs to create a schema for the OLAP cube to enable analyzing and reporting on the retailer's sales data.
A cube schema is a set of dimensions, each having a particular hierarchy of attributes. Dimensions usually correspond to several columns in the fact table; however, not all columns should necessarily produce a dimension. A dimension normally represents an attribute that is orthogonal to the other dimensions in the fact table. In addition, some of the columns, identified in advance, represent measures in the model.
Choosing the right set of dimensions is key for OLAP providers. If the number of dimensions is too large, efficient processing of the cube becomes practically impossible. On the other hand, dropping important attributes makes data analysis deficient. Poor cube design is one of the factors that inhibit OLAP productivity. Therefore it is important to choose the right schema.
The optimization process of a cube structure can be seen from two different perspectives. Starting from a fact table with hundreds of columns, OLAP vendors are interested in either:
• identifying truly independent columns, or
• identifying the important columns to be kept in the optimized cube structure.
Attribute importance can be used to select the most important independent columns to better 'see' a given measure. For example, an internal mechanism can build an analytic dataset with columns describing both customer and product characteristics, with the sales amount as a target. The system then trains an attribute importance model on this dataset. It returns the columns (describing either some aspect of the customer or of the product) that best explain the spread of the average sales figure. Some advanced systems can return not only the important columns but also the drilling hierarchies that can be associated with these columns (segments for continuous variables and groups of categories for discrete variables). These important columns can be used to create a (possibly ad hoc) optimized cube structure that the end user will use to better understand the average sales figure and to build 'segments' combining the customer or product characteristics that are most explanatory.
Such schema refinements are intractable in large cubes without data mining.
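A minimal flavor of the attribute importance computation described in this use case can be given in self-contained Java: score each candidate column by its absolute correlation with the measure and keep the strongest. This is a deliberately naive sketch (real attribute importance algorithms use richer criteria, and would also handle discrete columns) and is not JDM API code.

```java
// Illustrative only: a naive attribute-importance ranking that scores each
// candidate column by its absolute Pearson correlation with the measure
// (e.g., sales amount).
public class AttributeRanking {

    public static double correlation(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n; meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov  += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        return cov / Math.sqrt(varA * varB);
    }

    // Returns the index of the column most correlated with the target measure.
    public static int mostImportant(double[][] columns, double[] target) {
        int best = 0;
        for (int i = 1; i < columns.length; i++) {
            if (Math.abs(correlation(columns[i], target))
                    > Math.abs(correlation(columns[best], target))) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] sales    = {10, 20, 30, 40};   // the measure
        double[] income   = {1, 2, 3, 4};       // tracks sales closely
        double[] shoeSize = {7, 7, 8, 7};       // unrelated to sales
        System.out.println(mostImportant(new double[][]{shoeSize, income}, sales));
        // prints 1
    }
}
```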
2.1.12 Web services
List Inc. offers a comprehensive list management service that includes data warehousing, grooming, merging, and predictive modeling. All of its services are available as Web services, allowing customers to integrate List Inc.’s software seamlessly into their own enterprise systems over the Internet.
The Data Web service allows customers to connect to a managed warehouse and store
their transaction, customer and sales data using a secure Web service interface. List Inc.
manages the customer data in its data warehouse, cleans and grooms the data, and pro-
vides a range of preprocessing and transformation facilities. They maintain a comprehen-
sive repository of high quality background data including income, census, and
demographic and geographic data. List Inc. has relationships with many data vendors and
can call upon their services when required. This background data is merged with the cus-
tomer data using their proprietary merge technology.
List Inc. offers a complete model training and testing facility that guarantees optimal results. The customer data is used to build predictive models that determine the best responders, support cross-sell and up-sell campaigns, and investigate return on investment (ROI). List Inc. has a comprehensive testing facility that can choose the best algorithm and product combination to deliver the optimal ROI. The customer does not have to worry about data mining tool integration, training, or testing.
The customer decides only on the schedule for updating models and the ROI they require.
List Inc. owns two super computers to provide the fastest modeling facilities available
today.
JDM is critical to List Inc.’s services. The Prediction Web service wraps JDM to allow the customer to apply models. The Training Web service wraps JDM to allow the customer to build models and set parameters. JDM is used internally to connect to different vendor data mining tools and algorithms in List Inc.’s building and testing processes.
The Training Web service can be used by both novice customers and experienced data analysts. Mining-savvy data analysts can tailor the training process and choose particular algorithms and their settings. In addition, they can choose the attributes from their data they wish to include in models.
The Prediction Web service provides access to the resultant models across the network. The Prediction Web service interface is called with new prospect data, and the score outcome is returned. The service allows customers to enhance their software systems and their own web sites with predicted outcomes as if they owned the data mining tools themselves.
2.2 Vendor use cases
In this section, we present several use cases that explore how vendors can leverage JDM in
commercial JDM implementations.
2.2.1 Broad support of JDM
A data mining vendor has a wide range of algorithms that address each of the JDM mining functions. The vendor’s objective is to simplify mining for unsophisticated users. As such, the vendor provides automated selection of algorithms without requiring (or allowing) the user to select specific algorithms or to exercise algorithm-specific control, e.g., maximum tree depth in a decision tree.
In this use case, the vendor must implement all packages of the API except Algorithm sub-
classes and model detail subclasses. Users of the vendor’s data mining product will spec-
ify build settings only, obtain models, and be able to view and use those models as
appropriate. Note that the end-user can see only the function-specific model representa-
tions, not their underlying algorithm-specific model representations.
2.2.2 Narrow support of JDM
A data mining vendor, Neural Networks, Inc. (NNI), supports various neural network algorithms, both published and proprietary, in its data mining tool. NNI supports both classification and regression. The vendor chooses to be JDM compliant to gain acceptance in the marketplace.
In this use case, JDM, as an a la carte standard, allows a vendor to implement a narrow portion of the standard reflecting its specific domain or the subset of mining functions supported. The JDM packages required to support this include the core foundation packages and a select few specific to neural networks, including algorithm settings and model detail.
For the vendor’s proprietary algorithms, an additional Java package
nni.feedforwardneuralnetwork is provided which includes the specific proprietary algo-
rithm settings and model representations.
3. Concepts
In this section, we introduce JDM concepts: mining function, task, principal objects, phys-
ical data representations, attribute mapping, physical data storage, object references, and
reflection and introspection.
3.1 Data mining functions
In general, data mining functions can be classified into two categories: supervised and unsupervised. Supervised functions are typically used to predict a value and require the specification of a known outcome, or target, for each case used during model building. Examples of targets include binary attributes indicating buy/no-buy, churn/no-churn, or success/failure, and multi-class attributes indicating a preferred color choice from among the primary colors, or a likely salary range binned in $20,000 increments. The target allows the algorithm to determine how well it is predicting target values. An example of a supervised learning algorithm is Naive Bayes for classification.
Unsupervised functions do not use a target and are typically used to find the intrinsic structure, relations, or affinities in a body of data. Examples of unsupervised learning algorithms include k-means clustering and Apriori association. Clustering may be used to identify naturally occurring groups of proteins among hundreds of cases, or to segment retail customers. The itemset rules returned from an association model can be used to identify products to cross-sell to retail customers.
Another view of mining considers whether data mining is descriptive or predictive. Descriptive data mining describes a dataset in a concise, summary manner and presents interesting general properties of the data. Algorithms supporting descriptive data mining include k-means clustering, Apriori association, and even decision tree classification. Predictive data mining constructs one or more models, performs inference on the available dataset, and attempts to predict outcomes for new datasets. Algorithms supporting predictive data mining include neural networks, SVM, and decision tree classification/regression, and even k-means clustering when used to assign new records to clusters.
Different algorithms serve different purposes, each algorithm offering its own advantages
and disadvantages. JDM specifies the following mining functions: classification, regres-
sion, attribute importance, clustering, and association. Some algorithms can be used
across multiple data mining functions.
3.1.1 Classification
Classification has been used in customer segmentation, business modeling, and credit analysis. As a type of supervised learning, an algorithm supporting classification builds a model from a set of predictors that are used to predict a target. A set of predictors may include demographic data such as age, income, number of children, and zip code, used to predict the binary target buy/no-buy for a minivan. The input, or build data, for a supervised learning algorithm requires the presence of attributes for both predictors and the target in each case. Given a pre-determined set of classes in the target attribute, classification analyzes the build data to create a model that can predict to which class a given case belongs.
3.1.2 Regression
Regression has been used in financial forecasting, time series prediction, biomedical and
drug response modelling, and environmental modelling. Also a type of supervised learning, regression involves predicting a continuous, numerically valued target attribute given a
set of predictors. A regression problem may use the same predictors as a classification
problem, but specifies a target such as the predicted lifetime value of a customer.
3.1.3 Attribute Importance
Attribute importance is used to determine which attributes are most relevant for building a model. Attribute importance can be used for both supervised and unsupervised learning. It enables users to reduce model build time and, for some algorithms, data scoring time, by including only the most important attributes from the build data. Eliminating “noise” attributes from data can also improve accuracy or model quality.
Attribute importance serves a purpose similar to feature selection. It produces a model that ranks attributes according to how each contributes to the quality of the model built. From the ranking of attributes, users may select the attributes to be used in building models. The user can specify a number or percentage of attributes to use; alternatively, a user can specify a cutoff point. Note that the ranking of attributes is usually interpretable only in a relative sense: JDM specifies no precise interpretation of attribute rank values other than that attributes with a greater numeric value are relatively more important.
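As a concrete illustration of selecting attributes from such a ranking, the following self-contained sketch (not part of the JDM API; the class, attribute names, and importance values are hypothetical) applies both a top-N selection and a cutoff point to a map of rank values:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical illustration of consuming attribute-importance ranks.
// Per the spec, only the relative order of the values is meaningful.
public class AttributeSelection {

    // Returns the names of the top-N attributes, highest importance first.
    static List<String> topN(Map<String, Double> importance, int n) {
        return importance.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Returns all attributes whose importance meets a cutoff point.
    static List<String> aboveCutoff(Map<String, Double> importance, double cutoff) {
        return importance.entrySet().stream()
                .filter(e -> e.getValue() >= cutoff)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> importance = new HashMap<>();
        importance.put("age", 0.83);
        importance.put("income", 0.91);
        importance.put("zipCode", 0.12);
        importance.put("children", 0.47);

        System.out.println(topN(importance, 2));          // [income, age]
        System.out.println(aboveCutoff(importance, 0.4)); // [income, age, children]
    }
}
```

Either selection strategy yields the attribute subset to pass on to a subsequent model build.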
3.1.4 Clustering
Clustering has been used in customer segmentation, gene and protein analysis, product
grouping, finding numerical taxonomies, and text mining. Clustering analysis identifies
clusters embedded in the data, where a cluster is a collection of data objects that are simi-
lar to one another. A good clustering method produces high quality clusters to ensure that
the inter-cluster similarity is low and the intra-cluster similarity is high. The similarity of two values of an attribute can be expressed through distance functions. For numeric data, this can be as simple as the Euclidean distance between points. For categorical data, similarity can be defined so that, for example, married and cohabiting are closer to one another, as are separated and divorced.
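To make these distance functions concrete, here is a small self-contained sketch (illustrative only, not part of the JDM API): a Euclidean distance over numeric attribute vectors, and a hand-built distance table for a hypothetical marital-status attribute in which related categories are closer:

```java
import java.util.*;

// Illustrative distance functions for clustering (not JDM API).
public class Distances {

    // Euclidean distance between two numeric attribute vectors.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Hypothetical distance table for a categorical attribute: related
    // categories (married/cohabiting, separated/divorced) are closer.
    static double maritalDistance(String x, String y) {
        if (x.equals(y)) return 0.0;
        Set<String> pair = new HashSet<>(Arrays.asList(x, y));
        if (pair.equals(new HashSet<>(Arrays.asList("married", "cohabiting")))) return 0.2;
        if (pair.equals(new HashSet<>(Arrays.asList("separated", "divorced")))) return 0.2;
        return 1.0; // unrelated categories are maximally distant
    }

    public static void main(String[] args) {
        System.out.println(euclidean(new double[]{0, 0}, new double[]{3, 4})); // 5.0
        System.out.println(maritalDistance("married", "cohabiting"));          // 0.2
        System.out.println(maritalDistance("married", "divorced"));            // 1.0
    }
}
```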
3.1.5 Association
Association has been used in market basket analysis and the analysis of consumer behavior for the discovery of relationships or correlations among a set of items, e.g., the presence of one pattern implies the presence of another pattern. Association models help to identify the attribute value conditions that frequently occur together in a given set of data. Association analysis is widely used in transaction data analysis for directed marketing, catalog design, and other business decision-making processes. Traditionally, association is used for market basket data analysis, such as “90% of the people who buy milk also buy bread.”
Support and confidence metrics are used as quality measures of a rule within an association model. These are available in JDM as part of the association model for each rule produced. Note that the rules returned from an association model are different from the predicate-based rules produced from clustering models or decision tree models. Here, the rules consist of a set of items. These items typically occur together in a single transaction, such as the items purchased at an online retail checkout.
The support of a rule is used to ensure that the items associated in the rule occur together frequently enough to be considered significant. Using probability notation:
support(A → B) = P(A, B)
The confidence of a rule is the conditional probability of B given A:
confidence(A → B) = P(B | A) = P(A, B) / P(A)
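These two formulas can be checked with a small self-contained sketch (illustrative, not part of the JDM API) that counts itemsets over a list of transactions:

```java
import java.util.*;

// Illustrative computation of rule support and confidence (not JDM API).
public class RuleMetrics {

    // support: fraction of transactions containing every item in 'items'.
    static double support(List<Set<String>> transactions, Set<String> items) {
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    // confidence(A -> B) = P(A, B) / P(A)
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        return support(transactions, both) / support(transactions, antecedent);
    }

    public static void main(String[] args) {
        List<Set<String>> txns = Arrays.asList(
            new HashSet<>(Arrays.asList("milk", "bread")),
            new HashSet<>(Arrays.asList("milk", "bread", "eggs")),
            new HashSet<>(Arrays.asList("milk", "eggs")),
            new HashSet<>(Arrays.asList("bread")));

        // support(milk -> bread) = P(milk, bread) = 2/4 = 0.5
        System.out.println(support(txns, new HashSet<>(Arrays.asList("milk", "bread"))));
        // confidence(milk -> bread) = P(milk, bread) / P(milk) = 0.5 / 0.75
        System.out.println(confidence(txns, Collections.singleton("milk"),
                                      Collections.singleton("bread")));
    }
}
```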
3.2 Data mining tasks
Data mining revolves around a few common tasks: building a model, testing a model, applying a model to data, computing statistics, and importing and exporting mining objects. Each of these is discussed below.
3.2.1 Building a model
JDM enables users to build models in the following functional areas: classification, regression, attribute importance, clustering, and association. The model serves as a typically concise or compact representation of the information contained in the data, relative to the algorithm that produced it. To build models, users define tasks, which minimally require as input parameters a model name, mining data, and mining settings. Settings contain parameters that describe the type of model to be built, as well as directions to the specific algorithm used to build the model.
There are two levels of settings: function and algorithm. Recall that the mining function
addresses the type of problem to be solved, e.g., classification or clustering, and the min-
ing algorithm addresses the specific technique to be applied to solve that problem, e.g.,
decision tree or k-means. When a user does not specify algorithm settings in a build settings object, the Data Mining Engine (DME) may choose an appropriate algorithm for the task, either dynamically or statically, providing defaults for the relevant parameters. Model
building at the function level eliminates much of the technical details of data mining for
the user. The quality of models will be determined by the sophistication of the vendor’s
implementation and the quality of the data.
Build data, i.e., the data used as input to build a model, can take different forms. The attributes of the build data to be used in model building may be specified in the logical data associated with the build settings. JDM supports flexible assignment of build data to the logical data. If logical attributes do not map directly to physical attributes by name-based equivalence, an explicit mapping may be provided using the task object.
A typical scenario for model building is as follows:
1. Create a physical data object (by identifying existing data in a database table or file)
2. Create a build settings object
3. Create a logical data instance based on the physical data and associate it with the build
settings (optional)
4. Create an algorithm settings object and associate it with the build settings (optional)
5. Create a build task and set the physical data and build settings
6. Map the physical attributes to logical attributes (if necessary)
7. Invoke the execute method using the task
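The steps above can be sketched against the JSR-73 interfaces as follows. This is a hedged sketch, not a definitive implementation: the table name, attribute name, saved-object names, and the assumption of an already open DME connection are all hypothetical, and the optional steps (3, 4, and 6) are omitted.

```java
import javax.datamining.ExecutionHandle;
import javax.datamining.data.PhysicalDataSet;
import javax.datamining.data.PhysicalDataSetFactory;
import javax.datamining.resource.Connection;
import javax.datamining.supervised.classification.ClassificationSettings;
import javax.datamining.supervised.classification.ClassificationSettingsFactory;
import javax.datamining.task.BuildTask;
import javax.datamining.task.BuildTaskFactory;

// Sketch of the model-build scenario (steps 1, 2, 5, and 7).
// Assumes 'dmeConn' is an open DME connection; names are hypothetical.
public class BuildExample {
    public static void buildModel(Connection dmeConn) throws Exception {
        // 1. Create a physical data object over an existing table.
        PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
            dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
        PhysicalDataSet buildData = pdsFactory.create("CUSTOMERS", true);
        dmeConn.saveObject("custBuildData", buildData, false);

        // 2. Create a build settings object (function level only; the DME
        //    may choose the algorithm and default its parameters).
        ClassificationSettingsFactory csFactory = (ClassificationSettingsFactory)
            dmeConn.getFactory(
                "javax.datamining.supervised.classification.ClassificationSettings");
        ClassificationSettings settings = csFactory.create();
        settings.setTargetAttributeName("buyMinivan");
        dmeConn.saveObject("custBuildSettings", settings, false);

        // 5. Create a build task naming the data, settings, and model.
        BuildTaskFactory btFactory = (BuildTaskFactory)
            dmeConn.getFactory("javax.datamining.task.BuildTask");
        BuildTask task = btFactory.create("custBuildData", "custBuildSettings", "custModel");
        dmeConn.saveObject("custBuildTask", task, false);

        // 7. Invoke the execute method and wait for completion.
        ExecutionHandle handle = dmeConn.execute("custBuildTask");
        handle.waitForCompletion(Integer.MAX_VALUE);
    }
}
```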
After a model is built by the DME, it can be persisted in the mining object repository (MOR). See Section 3.7 for details on JDM persistence options.
The result of a build is a model. Especially for predictive models, the logical attributes used by the model may be a subset of those provided in the logical data. As such, the model has a signature specifying the possible input attributes to the model for
apply. These attributes are not all required, since a subset may be supplied where NULL values can be handled. Some algorithms perform automatic attribute selection; e.g., with a decision tree model, 100 attributes may have been used to train the model, but only 25 were used in the final rule set and are necessary for scoring. These 25 constitute the model signature.
3.2.1.1 Incremental learning
Some applications have a nearly continuous stream of data available for model building. A typical approach is to collect a certain amount of data, build a model from it, use the model to score new data for some period, and then build a new model from scratch, possibly using all the data accumulated to date, or using a fixed amount of the most recent data.
This approach can be unnecessarily costly, especially for algorithms such as Naive Bayes or association rules, where summary frequency counts are maintained. The frequency counts over the existing data do not change; only the newly added data needs to be counted, and the results merged with the previous counts. This produces refreshed models in much less time.
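The frequency-count merge that makes such refreshes cheap can be illustrated with a small self-contained sketch (not part of the JDM API; the attribute-value keys are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative incremental refresh of summary frequency counts (not JDM API):
// counts from the new data batch are merged into the existing counts,
// without rescanning the previously counted data.
public class CountMerge {

    static Map<String, Long> merge(Map<String, Long> existing, Map<String, Long> fresh) {
        Map<String, Long> merged = new HashMap<>(existing);
        fresh.forEach((key, count) -> merged.merge(key, count, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> existing = new HashMap<>();
        existing.put("churn=yes", 120L);
        existing.put("churn=no", 880L);

        Map<String, Long> newBatch = new HashMap<>();
        newBatch.put("churn=yes", 15L);
        newBatch.put("churn=no", 85L);

        Map<String, Long> refreshed = merge(existing, newBatch);
        System.out.println(refreshed.get("churn=yes")); // 135
        System.out.println(refreshed.get("churn=no"));  // 965
    }
}
```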
Algorithms such as neural networks can also leverage incremental learning. Here, a previ-
ously trained neural network can be provided as a seed model. The model is further trained
using new data, but starting from an already good model.
JDM provides support for incremental learning by allowing a seed model to be specified
as input to the build task. Not all functions or algorithms are expected to handle the specification of a seed model for incremental learning. The function and algorithm capabilities list indicates whether this feature is supported; such support is vendor-specific.
3.2.1.2 Model evaluation
Some algorithms, such as neural networks or decision trees, use a portion of the build data to iteratively determine how well the model is learning patterns from the data. These algorithms split the build data into a training and an evaluation dataset according to some internal percentage, e.g., 50%-50% or 70%-30%. Some users, however, wish to control more carefully which data is used for training versus evaluation.
JDM provides support for specifying the evaluation data explicitly in the build task to be
used during model building. Although some vendors may provide proprietary algorithm
settings to allow specifying the percentage of data to be used for evaluation, JDM provides
the more explicit option of providing the actual data.
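Such a user-controlled split can be sketched with a small self-contained example (illustrative only; JDM itself takes the evaluation data as an explicit input to the build task rather than splitting internally):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative 70%-30% split of build data into training and evaluation
// partitions; JDM instead lets the user supply evaluation data explicitly.
public class TrainEvalSplit {

    // Returns [training, evaluation] partitions of the cases.
    static <T> List<List<T>> split(List<T> cases, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(cases);
        Collections.shuffle(shuffled, new Random(seed)); // reproducible shuffle
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));               // training
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size()))); // evaluation
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> cases = new ArrayList<>();
        for (int i = 0; i < 10; i++) cases.add(i);
        List<List<Integer>> parts = split(cases, 0.7, 42L);
        System.out.println(parts.get(0).size()); // 7
        System.out.println(parts.get(1).size()); // 3
    }
}
```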
3.2.2 Testing a model
Model testing gives an estimate of the accuracy of a supervised model in predicting the target. Testing follows model building and computes the accuracy of a model’s predictions when the model is applied to a previously unseen dataset, separate from the build dataset. This provides an honest estimate of the model’s accuracy.
The test task accepts a model and data for testing the model. Test results are stored in a
TestMetrics object as specified in the task. Physical attribute to logical attribute mapping
may be specified if the names of physical and logical attributes do not match. The test data
must be compatible with the model signature.
Test data must be preprocessed in the same way as the build data. The user is responsible
for ensuring this compatibility. However, some DMEs may choose to use information
present in the LogicalData stored with the model to flag incompatibilities.
Test metrics content depends on the type of model. For example, classification models
produce a confusion matrix, whereas regression models provide error estimates. In addi-
tion to obtaining a confusion matrix, model testing includes the option to compute lift and receiver operating characteristics (R