
    JavaTM Specification Request 73:

    JavaTM Data Mining (JDM)

    JSR-73 Expert Group

    Specification Lead: Mark Hornick, Oracle Corporation

    [email protected]

    Technical comments:

     [email protected]

    Version 1.1

    Maintenance Release Specification

    June 22, 2005


    Copyright

    Copyright (c) 2005 Oracle Corporation. All rights reserved.

    Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.

    This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or documentation may be reproduced in any form by any means without prior written authorization of the copyright holders, or any of the licensors, if any. Any unauthorized use may be a violation of domestic or international law. RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the U.S. Government and its agents is subject to the restrictions of FAR 52.227-14(g)(2)(6/87) and FAR 52.227-19(6/87), or DFAR 252.227-7015(b)(6/95) and DFAR 227.7202-3(a).

    Disclaimer

    This document and its contents are furnished “as is” for informational purposes only, and are subject to change without notice. Oracle Corporation (Oracle) does not represent or warrant that any product or business plans expressed or implied will be fulfilled in any way. Any actions taken by the user of this document in response to the document or its contents shall be solely at the risk of the user.

    ORACLE MAKES NO WARRANTIES, EXPRESSED OR IMPLIED, WITH RESPECT TO THIS DOCUMENT OR ITS CONTENTS, AND HEREBY EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR USE OR NON-INFRINGEMENT. IN NO EVENT SHALL ORACLE BE HELD LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES IN CONNECTION WITH OR ARISING FROM THE USE OF ANY PORTION OF THE INFORMATION.

    Trademarks

    Sun, Sun Microsystems, Java, JavaBeans, and Enterprise JavaBeans are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the U.S. and other countries.

    OMG, Object Management Group, CORBA, Unified Modeling Language, and UML are registered trademarks or trademarks of the Object Management Group, Inc.

    All other product or company names mentioned are for identification purposes only, and may be trademarks of their respective owners.


    1. Overview
        1.1 Introduction
            1.1.1 Benefits
            1.1.2 Target audience
            1.1.3 Data analytics JSRs
            1.1.4 Exclusions
        1.2 Architectural components
        1.3 Dependencies and relationships
        1.4 Organization
        1.5 Expert group members
        1.6 Acknowledgements
    2. Use cases
        2.1 Application use cases
            2.1.1 Mining GUI
            2.1.2 Web specialty retailer
            2.1.3 Campaign management
            2.1.4 Minimal top level specification
            2.1.5 Selecting the “best” model
            2.1.6 Comparing vendor implementations
            2.1.7 Incremental learning
            2.1.8 Deferred task execution
            2.1.9 Explaining model behavior
            2.1.10 Manually enhancing a model
            2.1.11 OLAP schema refinement
            2.1.12 Web services
        2.2 Vendor use cases
            2.2.1 Broad support of JDM
            2.2.2 Narrow support of JDM
    3. Concepts
        3.1 Data mining functions
            3.1.1 Classification
            3.1.2 Regression
            3.1.3 Attribute Importance
            3.1.4 Clustering
            3.1.5 Association
        3.2 Data mining tasks
            3.2.1 Building a model
            3.2.2 Testing a model
            3.2.3 Applying a model
            3.2.4 Object import and export
            3.2.5 Computing statistics on data
            3.2.6 Verifying task correctness
        3.3 Principal objects
            3.3.1 Connection
            3.3.2 Task
            3.3.3 Execution handle and status
            3.3.4 Physical data set
            3.3.5 Physical data record
            3.3.6 Build settings
            3.3.7 Algorithm
            3.3.8 Algorithm settings
            3.3.9 Model
            3.3.10 Model signature
            3.3.11 Model detail
            3.3.12 Logical attribute
            3.3.13 Logical data
            3.3.14 Attribute statistics set
            3.3.15 Apply settings
            3.3.16 Confusion matrix
            3.3.17 Lift
            3.3.18 Cost matrix
            3.3.19 Prior probabilities
            3.3.20 Category sets
            3.3.21 Taxonomy
            3.3.22 Rules
            3.3.23 Verification report
        3.4 Physical data representations
            3.4.1 Individual record
            3.4.2 Single record case table
            3.4.3 Multi-record case table
            3.4.4 Data preparation
        3.5 Attribute mapping
            3.5.1 Direct mapping
            3.5.2 Pivot mapping
        3.6 Creating physical data objects
        3.7 Persistence
        3.8 Object references
        3.9 Reflection / introspection
    4. Packages
        4.1 Design overview
        4.2 Notation
        4.3 Package structure
        4.4 Package javax.datamining
        4.5 Package javax.datamining.base
        4.6 Package javax.datamining.resource
        4.7 Package javax.datamining.data
        4.8 Package javax.datamining.task
            4.8.1 Package task.apply
        4.9 Package javax.datamining.supervised
            4.9.1 Package supervised.classification
            4.9.2 Package supervised.regression
            4.9.3 Package attributeimportance
        4.10 Package javax.datamining.association
        4.11 Package javax.datamining.clustering
        4.12 Package javax.datamining.rule
        4.13 Package javax.datamining.statistics
        4.14 Package javax.datamining.algorithm
            4.14.1 Package algorithm.tree
            4.14.2 Package algorithm.naivebayes
            4.14.3 Package algorithm.feedforwardneuralnet
            4.14.4 Package algorithm.svm
            4.14.5 Package algorithm.kmeans
        4.15 Package javax.datamining.modeldetail
            4.15.1 Package modeldetail.tree
            4.15.2 Package modeldetail.feedforwardneuralnet
            4.15.3 Package modeldetail.naivebayes
            4.15.4 Package modeldetail.svm
    5. Code examples
        5.1 Building a clustering model
        5.2 Applying a clustering model to data
        5.3 Applying a clustering model to a record
        5.4 Building a classification model
        5.5 Testing a classification model
        5.6 Building and extracting rules from a tree model
        5.7 Extracting rules from an association model
            5.7.1 Get rules with minimum support
            5.7.2 Get rules with minimum support and confidence
            5.7.3 Get rules containing certain items
            5.7.4 Get rules that do not contain certain items
        5.8 Importing and exporting a model
            5.8.1 Import an object using a URI
            5.8.2 Export a model
            5.8.3 Export an object to a destination
        5.9 Using reflection
        5.10 Establishing a connection
        5.11 Uniform resource identifiers
    6. Conformance statement
        6.1 Required and optional features
        6.2 Vendor extensions
        6.3 Compliance points
        6.4 Determining conformance
            6.4.1 Function level conformance
            6.4.2 Algorithm level conformance
            6.4.3 Model apply engine conformance
        6.5 Claiming conformance
    7. Summary
    Appendix A. Glossary
    Appendix B. Requirements
        B.1. Domain requirements
        B.2. Foundation technologies
        B.3. Data mining standards
        B.4. System behavior
        B.5. Exclusions for version 1
            B.5.1. Domain exclusions
            B.5.2. System exclusions
    Appendix C. Optional Methods
    Appendix D. Exceptions


    Appendix E. Web services
        E.1. Introduction
        E.2. Methods
            E.2.1. WSDL Document Structure
            E.2.2. Listing DME Contents
            E.2.3. Introspection / Reflection
            E.2.4. Saving objects
            E.2.5. Retrieving objects
            E.2.6. Removing objects
            E.2.7. Renaming objects
            E.2.8. Retrieving Object Components
            E.2.9. Verify Object
            E.2.10. Executing tasks
            E.2.11. Getting execution status
            E.2.12. Terminating Tasks
        E.3. Java methods supporting XML
        E.4. XML Schema Definition
            E.4.1. JDM Document
            E.4.2. Task
            E.4.3. Task.Apply
            E.4.4. Data
            E.4.5. Supervised
            E.4.6. Supervised.Classification
            E.4.7. Supervised.Regression
            E.4.8. Clustering
            E.4.9. Association
            E.4.10. AttributeImportance
            E.4.11. Statistics
            E.4.12. Algorithm
            E.4.13. Base
            E.4.14. Root
            E.4.15. Enumeration extension
    Appendix F. References


    TABLE 1. An example of a single record case table
    TABLE 2. An example of a multi-record case table
    TABLE 3. Named and composite object referencing summary
    TABLE 4. Function-level model behavior
    TABLE 5. JDM optional methods for models and model details
    TABLE 6. JDMException codes and messages
    TABLE 7. JDM runtime exceptions, codes, and messages


    FIGURE 1.1 Architecture configuration options
    FIGURE 1.2 Example of attribute mapping for apply
    FIGURE 4.2 Top level package structure
    FIGURE 4.3 Common top level interfaces
    FIGURE 4.4 Exception classes
    FIGURE 4.5 Top level enumerations
    FIGURE 4.6 Execution Handle
    FIGURE 4.7 Package javax.datamining.base - Named Objects
    FIGURE 4.8 Package javax.datamining.base - Build Settings, Model, and Task
    FIGURE 4.9 Package javax.datamining.base - BuildSettings
    FIGURE 4.10 Package javax.datamining.base - Model
    FIGURE 4.11 Package javax.datamining.resource
    FIGURE 4.12 Package javax.datamining.data - PhysicalData
    FIGURE 4.13 Package javax.datamining.data - LogicalData
    FIGURE 4.14 Package javax.datamining.data - ModelSignature
    FIGURE 4.15 Package javax.datamining.data - Taxonomy
    FIGURE 4.16 Package javax.datamining.data - CategoryMatrix
    FIGURE 4.17 Package javax.datamining.data - CategorySet and Interval
    FIGURE 4.18 Package javax.datamining.task - Build
    FIGURE 4.19 Package javax.datamining.task - Import and Export
    FIGURE 4.20 Package javax.datamining.task - ComputeStatistics
    FIGURE 4.21 Package task.apply - ApplyTask and ApplySettings
    FIGURE 4.22 Package javax.datamining.supervised - Settings and Model
    FIGURE 4.23 Package javax.datamining.supervised - TestTask and TestMetrics
    FIGURE 4.24 Package supervised.classification - Settings and Model
    FIGURE 4.25 Package supervised.classification - TestTask and TestMetrics
    FIGURE 4.26 Package supervised.classification - ClassificationTestMetricsTask
    FIGURE 4.27 Package supervised.classification - ApplySettings
    FIGURE 4.28 Package supervised.classification - Confusion Matrix and Cost Matrix
    FIGURE 4.29 Package supervised.regression - Settings and Model
    FIGURE 4.30 Package supervised.regression - TestTask and ApplySettings
    FIGURE 4.31 Package supervised.regression - RegressionTestMetricsTask
    FIGURE 4.32 Package javax.datamining.attributeimportance - Settings and Model
    FIGURE 4.33 Package javax.datamining.association - Settings and Model
    FIGURE 4.34 Package javax.datamining.association - Rule Selection
    FIGURE 4.35 Package javax.datamining.clustering - Model
    FIGURE 4.36 Package javax.datamining.clustering - Settings
    FIGURE 4.37 Package javax.datamining.clustering - ApplySettings
    FIGURE 4.38 Package javax.datamining.clustering - Similarity Matrix
    FIGURE 4.39 Package javax.datamining.rule - Rule and Predicate
    FIGURE 4.40 Package javax.datamining.statistics - AttributeStatistics
    FIGURE 4.41 Package algorithm.tree - TreeSettings
    FIGURE 4.42 Package algorithm.naivebayes - NaiveBayesSettings
    FIGURE 4.43 Package algorithm.feedforwardneuralnet - FeedForwardNeuralNetSettings
    FIGURE 4.44 Package algorithm.svm.classification - SVMClassificationSettings
    FIGURE 4.45 Package algorithm.svm.regression - SVMRegressionSettings
    FIGURE 4.46 Package algorithm.kmeans - KMeansSettings
    FIGURE 4.47 Package modeldetail.tree - TreeModelDetail
    FIGURE 4.48 Package modeldetail.feedforwardneuralnet - NeuralNetworkModelDetail
    FIGURE 4.49 Package modeldetail.naivebayes - NaiveBayesModelDetail
    FIGURE 4.50 Package modeldetail.svm - SVMModelDetail


    1. Overview

    1.1 Introduction

    The Java Data Mining (JDM) specification addresses the need for a pure Java API to facilitate development of data mining-enabled applications. JDM supports common data mining operations, as well as the creation, persistence, access, and maintenance of metadata supporting mining activities.

    Currently, no Java platform specification provides a standard API for data mining systems; existing APIs are vendor-proprietary. By using JDM, implementers of data mining applications can expose a single, standard API that will be understood by a wide variety of developers writing client applications and components running on the Java™ 2 Platform. Similarly, data mining clients can be coded against a single API that is independent of the underlying data mining system. JDM is targeted for the Java™ 2 Platform, Enterprise Edition (J2EE™) and Standard Edition (J2SE™).

    In JDM, data mining [Mitchell1997, BL1997] includes the functional areas of classification, regression, attribute importance (also referred to as feature selection or key fields analysis), clustering, and association. These are supported by such supervised and unsupervised learning algorithms as decision trees, neural networks, Naive Bayes, Support Vector Machine, K-Means, and Apriori, operating on structured data. Common operations include model build, test, and apply (score). A particular implementation of this specification need not support all interfaces and services defined by JDM; however, JDM provides a mechanism for client discovery of supported interfaces and capabilities.
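    For example, a client might probe the DME before constructing tasks. The following is a minimal sketch only; the supportsCapability method and the enumeration literals shown are assumptions drawn from the JDM model rather than confirmed signatures.

        import javax.datamining.JDMException;
        import javax.datamining.MiningAlgorithm;
        import javax.datamining.MiningFunction;
        import javax.datamining.MiningTask;
        import javax.datamining.resource.Connection;

        public class CapabilityCheck {
            // Returns true only if this DME claims support for building
            // classification models with a decision tree algorithm.
            // Method and literal names here are assumptions.
            public static boolean canBuildTreeClassifier(Connection dmeConn)
                    throws JDMException {
                return dmeConn.supportsCapability(MiningFunction.classification,
                                                  MiningAlgorithm.decisionTree,
                                                  MiningTask.build);
            }
        }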

    JDM is based on a generalized, object-oriented data mining conceptual model that leverages emerging data mining standards such as the Object Management Group’s Common Warehouse Metamodel (CWM), ISO’s SQL/MM for Data Mining, and the Data Mining Group’s Predictive Model Markup Language (PMML), as appropriate.

    Implementation details of JDM are delegated to each vendor. A vendor may decide to implement JDM as a native API of its data mining product. Others may opt to develop a driver/adapter that mediates between a core JDM layer and multiple vendor products. The JDM specification does not prescribe a particular implementation strategy, nor does it prescribe the performance or accuracy of a given capability or algorithm.

    To ensure J2EE™ compatibility and eliminate duplication of effort, JDM leverages existing specifications. In particular, JDM leverages the J2EE Connector Architecture [JSR16] to provide communication and resource management between applications and the services that implement the JDM API. JDM also reflects aspects of the Java Metadata Interface [JSR40] in the interface specification.

    1.1.1 Benefits

    The availability of a J2EE™-compliant data mining API benefits both vendors and users of tools and applications in the areas of business intelligence, business analytics, data mining systems, data warehousing, and life sciences / bioinformatics.

    Historically, application developers coded homegrown data mining algorithms into applications, or used sophisticated end-user GUIs. These GUIs packaged a suite of algorithms complete with support for data transformation, model building, testing, and scoring. However, it was difficult, if not impossible, to embed data mining end-to-end in applications using commercial data mining products due to inadequate APIs. If a vendor had an API, it was proprietary, making the development of a product using that API risky. If a different vendor’s solution was required, rewriting that product was also potentially costly.

    The ability to leverage data mining functionality via a standard API greatly reduces risk and potential cost. With a standard API, customers can use multiple products for solving business problems by applying the most appropriate algorithm implementation without investing resources to learn each vendor’s proprietary API. Moreover, a standard API makes data mining more accessible to developers while making developer skills more transferable. Vendors can now differentiate themselves on price, performance, accuracy, and features. Java Data Mining (JDM) addresses this need for Java.

    1.1.2 Target audience

    The target audiences for the JDM specification can be categorized into the following groups:

    • data mining vendors – companies that intend to implement this API for their respective products, thereby providing the API to end users

    • application developers – Java programmers who wish to use a data mining API for building GUIs or other applications that benefit from data mining technology

    • data mining experts – individuals with advanced degrees in statistics, machine learning, or data mining, or with significant practical data mining experience

    • data mining novices – Java-knowledgeable developers who have a basic understanding of the problems that data mining can solve, and who can minimally leverage the function level of data mining tasks

    1.1.3 Data analytics JSRs

    The complement to data mining in data analytics is online analytical processing (OLAP). To distinguish between OLAP and data mining, consider that OLAP follows a deductive (query-oriented) strategy of analyzing data: users formulate hypotheses and execute queries to gain understanding of the underlying data. Data mining follows an inductive strategy of analyzing data, in which users apply machine learning algorithms to extract non-obvious knowledge from the data.

    JOLAP (JSR-69) specifies a Java API for OLAP and shares a common basis in the OMG CWM metamodel. The JDM expert group is working with the JOLAP expert group to minimize overlap and leverage common modeling techniques and infrastructure where applicable.

    1.1.4 Exclusions

    The domain of “data mining” is quite large. The JDM expert group made decisions early to exclude certain features from JDM to make it more manageable. As such, functionality such as data transformations, visualization, mining unstructured data (e.g., text), wrappers and ensembles, and sensitivity analysis has been omitted from this first version of the API. Note that with respect to visualization, JDM does provide many of the key data objects necessary to support visualization, e.g., the confusion matrix, lift results, decision tree representation, and neural network architecture.

    From a systems perspective, JDM does not specify behavior for transactions, scheduling, or security. These are left to vendors to determine what best suits their respective products and customer bases.


    1.2 Architectural components

    JDM has three logical components that may be implemented as one executable or in a distributed environment:

    • application programming interface (API) – The API is the end-user-visible component of a JDM implementation that allows access to services provided by the data mining engine (DME). An application developer using JDM requires knowledge only of the API library, not of other supporting components.

    • data mining engine (DME) – A DME provides the infrastructure that offers a set of data mining services to its API clients. When implemented as the server of a client-server architecture, it is referred to as a data mining server (DMS), which is a specific instantiation of the more general Enterprise Information System (EIS) as specified in the Connector Architecture (JSR-16).

    • mining object repository (MOR) – The DME uses a mining object repository to persist data mining objects. This repository can be based on, e.g., the CWM framework, specifically leveraging the CWM Data Mining metamodel, or implemented using a vendor-proprietary representation. The MOR may exist in a file-based environment or in a relational / object database. Section 3.7 discusses JDM persistence options.

    Figure 1.1 depicts three possible architectures for a JDM implementation. In (a), each component resides in a separate physical location or separate executable. We view this as a three-tier architecture with the data stored in a separate repository, such as a database. In (b), the DME contains the MOR, resulting in a classic client-server architecture. This scenario is possible, e.g., where the database contains both the DME and MOR, or the DME uses the local file system for persistent storage. In (c), the system is monolithic, i.e., the API, DME, and MOR reside in, or are managed by, a single executable.

    FIGURE 1.1 Architecture configuration options

    A vendor may choose to provide additional utilities and management interfaces to the DME and MOR; however, these are not defined as part of JDM and may be proprietary. The JDM specification does not place any requirements on the DME and MOR design or implementation except to support functionality as required by the JDM interface.

    Vendors may implement a subset of the complete JDM specification as noted in the section on conformance. This à la carte approach provides a common framework for all data mining functionality, while allowing vendors to support only vendor-relevant portions of it.

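    Whatever the configuration, the application's entry point is a Connection to the DME. The sketch below obtains one through a ConnectionFactory in the style of Section 5.10; the JNDI name, URI, and credentials are placeholders, and in a J2SE deployment the factory may instead be instantiated directly from a vendor class.

        import javax.datamining.resource.Connection;
        import javax.datamining.resource.ConnectionFactory;
        import javax.datamining.resource.ConnectionSpec;
        import javax.naming.InitialContext;

        public class ConnectToDme {
            public static Connection connect() throws Exception {
                // Look up the vendor's factory; the JNDI name is a placeholder.
                InitialContext ctx = new InitialContext();
                ConnectionFactory factory =
                    (ConnectionFactory) ctx.lookup("java:comp/env/jdm/MyFactory");

                // Identify the DME and user; all values are illustrative only.
                ConnectionSpec spec = factory.getConnectionSpec();
                spec.setURI("http://myDmeHost/dme");   // DME location (placeholder)
                spec.setName("user");                  // account name
                spec.setPassword("password");          // credentials

                return factory.getConnection(spec);
            }
        }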


    1.3 Dependencies and relationships

    JDM leverages aspects of the CWM Data Mining metamodel and the Java Metadata Interface (JSR-40). CWM Data Mining facilitates the construction and deployment of data warehousing and business intelligence applications, tools, and platforms based on OMG open standards for metadata and system specification (i.e., MOF, UML, XMI, CWM). The Java Metadata Interface provides a common naming convention for methods.

    The following specifications serve as design references for JDM:

    • DMG PMML 2.0 [PMML] provides an XML-based representation for mining models and facilitates interchange of model results among vendors.

    • ISO SQL/MM Part 6: Data Mining [SQL/MM-DM] provides a standard interface to RDBMSs for performing data mining. Concepts from this approach are leveraged in the overall JDM design.

    • The Common Warehouse Metamodel [CWM] and CWM Specification, Volume 1, Chapter 15, Data Mining [CWM-DM] provide a sense of the overall structure of the metadata JDM supports.

    1.4 Organization

    This document focuses on JDM requirements, concepts, use cases, code examples, packages supporting the API, and vendor conformance.

    In Section 2, we present use cases to help the reader appreciate how this API can be used under various circumstances, both by end users and by vendors conforming to the standard.

    In Section 3, we present the synthesis of data mining concepts that form the basis of the JDM model. These concepts result from analyzing the requirements of many different data mining functions and algorithms, and are key to providing a unified data mining framework.

    In Section 4, we present the JDM packages and class diagrams to illustrate the relationships between the various interfaces and classes. Details of each class are provided in the companion Javadoc-generated documentation.

    In Section 5, we provide and explain code examples using the JDM API. These examples show both how a non-expert can work with the API, relying on convenience routines to automate much of the specification, and the detailed specification exposed for data mining experts.

    In Section 6, we present the requirements for vendor conformance to the API.

    In Section 7, we summarize our JDM experience and where the standard is likely to go in subsequent versions.

    In Appendix A, we provide a glossary of terms used in this document.

    In Appendix B, we review the data mining domain requirements and foundation technologies driving the API. We explore related data mining standards and common system behavior.

    In Appendix C, we list optional methods for models and model details that a vendor may choose to implement.

    In Appendix D, we provide JDM error codes for JDMException.


    In Appendix E, we define Web services based on the JDM model. Significant interest in a JDM Web services interface has been expressed within the expert group and in external comments.

    In Appendix F, we provide a list of references.

    1.5 Expert group members

    Sarabjot Anand – Corporate Intellect

    Robert Brunner – California Institute of Technology

    Robert Chu – SAS Institute

    Werner Dubitzky* – University of Ulster, N. Ireland

    Kim Horn – Sun Microsystems, Inc.

    Mark Hornick – Oracle Corporation

    Bill Hosken* – SPSS, Inc.

    Ronny Kohavi* – Blue Martini Software

    Achim Kraiss – SAP AG

    Marwane Jay Lamimi – KXEN

    Christoph Lingenfelder – IBM Germany

    Erik Marcade – KXEN

    Somesh Marepalli – Computer Associates International, Inc.

    Waddys Martinez* – Magnify

    Cindy McMullen – BEA Systems

    Chuck Mosher – Sun Microsystems, Inc.

    John Poole – Hyperion Solutions

    Michal Prussak – Fair Isaac

    Alex Russakovskii – Hyperion Solutions

    Mike Smith – Strategic Analytics

    Qian (Cherry) Yang – Computer Associates International, Inc.

    Sunil Venkayala – Oracle Corporation

    Andrew Walaszek – SPSS, Inc.

    Hankil Yoon – Oracle Corporation

    * former member

    1.6 Acknowledgements

    The expert group recognizes and thanks Dipankar Roy and Shiby Thomas for reviewing previous drafts. We also recognize and thank Marcos Campos, Gary Drescher, Boriana Milenova, Joe Yarmus, and Yan Zhuang for their contributions to the JDM effort.


    2. Use cases

    The use cases presented in this section provide a context in which to understand the possible uses of JDM. We have divided the use cases into two categories: those relevant to applications and those relevant to vendors implementing JDM-conforming products. Readers already familiar with data mining may want only to browse this section.

    Several JDM concepts are introduced briefly below to assist in understanding the use cases; these are described in more detail in Section 3. The reader is expected to be familiar with common data mining terminology.

    Mining Function - A major subdomain of data mining that shares common high-level characteristics. Functions include classification, regression, attribute importance, association, and clustering.

    Task - A container within which to specify arguments to data mining operations to be performed by the data mining engine. Tasks include model building, testing, applying (scoring), computing statistics, and object import and export. Tasks may execute synchronously or asynchronously.
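    As a sketch of the asynchronous style (assuming the Connection.execute and ExecutionHandle methods shown; timeout semantics and state names may differ in a given implementation):

        import javax.datamining.ExecutionHandle;
        import javax.datamining.ExecutionState;
        import javax.datamining.ExecutionStatus;
        import javax.datamining.JDMException;
        import javax.datamining.resource.Connection;

        public class RunTask {
            // Executes a previously saved task by name, then blocks until it
            // finishes or the timeout elapses, returning true on success.
            public static boolean runAndWait(Connection dmeConn, String taskName)
                    throws JDMException {
                ExecutionHandle handle = dmeConn.execute(taskName);
                // Wait up to 5 minutes; the timeout unit (seconds) is an
                // assumption for this sketch.
                ExecutionStatus status = handle.waitForCompletion(300);
                return ExecutionState.success.equals(status.getState());
            }
        }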

    Settings - A collection of parameters specifying the input for building a data mining model or applying a model to data (i.e., scoring). Build settings may be high level, specified for mining functions, or detailed, specified for mining algorithms. Apply settings specify the content of the scoring result and, in some cases, affect the type of content provided. For example, a cost matrix may be specified for classification at apply time.

    Model - An algorithm often produces a compressed representation of the input data called a model. This model contains the essential knowledge extracted from the data as determined by the algorithm. A model can be descriptive or predictive. A descriptive model helps in understanding the underlying data or model behavior; for example, an association rules model on market basket data can be used to describe consumer behavior. A predictive model can be an equation or set of rules that makes it possible to predict an unseen or unknown value (the dependent variable or target) from other, known values (independent variables or predictors).

    2.1 Application use cases

    In this section, we present several end-user use cases involving application developers that explore a wide variety of situations in which JDM can be used.

    2.1.1 Mining GUI

    A team of developers is tasked with producing a GUI for visualizing data mining objects. They use JDM to develop a tool for exposing objects used in building models, such as build settings, and for viewing model representations or contents. The models themselves include decision trees, neural networks, and mining results such as confusion matrices and lift. Decision trees can be traversed and graphically displayed in a tree representation; neural networks can be traversed and graphically displayed to show hidden layers and weights on connections. The GUI also supports scoring data, testing models, computing lift, and graphically displaying lift charts.

    In this use case, a JDM implementation provides the enabling data mining functionality. If only standard JDM features are leveraged, this GUI could be portable across vendor JDM implementations.
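    For instance, the tree display described above reduces to a recursive walk over the model detail. This is a hedged sketch assuming TreeModelDetail and TreeNode accessors of roughly this shape (getRootNode, getChildren, getPredicate are assumptions, not confirmed signatures):

        import javax.datamining.modeldetail.tree.TreeModelDetail;
        import javax.datamining.modeldetail.tree.TreeNode;

        public class TreePrinter {
            // Depth-first traversal that indents each node's predicate for
            // display; accessor names are assumptions for this sketch.
            public static void print(TreeModelDetail detail) {
                printNode(detail.getRootNode(), 0);
            }

            private static void printNode(TreeNode node, int depth) {
                StringBuffer indent = new StringBuffer();
                for (int i = 0; i < depth; i++) indent.append("  ");
                System.out.println(indent + String.valueOf(node.getPredicate()));
                TreeNode[] children = node.getChildren();
                for (int i = 0; i < children.length; i++)
                    printNode(children[i], depth + 1);
            }
        }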


    2.1.2 Web specialty retailer

    A specialty retailer sells from a website, catalogs, and stores. The website has a recommendation feature that is supported by data mining. Customer data are collected from each of the company’s points of sale into its data warehouse. Sales data are combined with demographic data such as age, gender, and income. Demographic data together with product categories are regularly mined for customer 'clusters' using a clustering algorithm. Product sales data are then partitioned by customer cluster, and each cluster is mined for product associations using association rules algorithms. The website uses the resulting association rules to make online product recommendations with each addition to the customer’s virtual shopping cart.

    In this use case, multiple JDM mining functions are leveraged: clustering, association, and the ability to score individual records to support online product recommendations.
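    A hedged sketch of the recommendation step: after the per-cluster association model is built, the application retrieves its strong rules. The retrieveObject call follows the JDM named-object pattern, while the getRules(support, confidence) helper is a hypothetical illustration; the normative rule-retrieval examples appear in Section 5.7.

        import java.util.Collection;
        import javax.datamining.JDMException;
        import javax.datamining.NamedObject;
        import javax.datamining.association.AssociationModel;
        import javax.datamining.resource.Connection;

        public class Recommend {
            // Retrieve rules with at least 10% support and 80% confidence
            // from a named association model. The model name is a
            // placeholder and getRules(support, confidence) is a
            // hypothetical helper used only for illustration.
            public static Collection getStrongRules(Connection dmeConn)
                    throws JDMException {
                AssociationModel model = (AssociationModel)
                    dmeConn.retrieveObject("cluster3_assocModel",
                                           NamedObject.model);
                return model.getRules(0.10, 0.80); // hypothetical helper
            }
        }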

    2.1.3 Campaign management

    A campaign management application provides automated support for identifying customers to receive a marketing campaign. The application has access to data collected on customer demographics and responsiveness to such mailing campaigns. This application leverages database vendor-specific transformations to prepare data for mining.

    Using the mining function attribute importance (also referred to as feature selection), the application determines which attributes are most relevant for model building. By using a smaller set of attributes, model build time can be reduced, model predictive accuracy can increase, and the attributes most valuable to collect from customers can be highlighted.

    The application uses a decision tree algorithm to produce rules that can be understood by the marketing manager, possibly for developing more targeted mailings to customers of a given set of demographics. Once the model is built, the application tests the model and sends the test and lift results to the campaign manager, who can assess model quality and expected results.

    Unless directed otherwise, the application uses this model to score new customers eligible for a new mailing campaign. Those customers who have a probability greater than 75% of responding to the mailing will be selected for it.

    In this use case, data preprocessing may occur outside JDM using proprietary or ad hoc techniques. Multiple JDM mining functions and operations are leveraged through task specification. To communicate models and results to other users, these objects can be exported, perhaps using an XML representation. JDM’s flexible apply settings allow the application to specify the score, probability, customer id, and possibly other input data to be part of the apply result table, as sketched below. Finally, JDM’s rule representation and the ability of certain algorithms to produce rules are leveraged to explain model behavior. Note that JDM defines predicate-based rules for the decision tree algorithm (for either the classification or regression mining function) and for the K-Means algorithm in the clustering mining function.

    2.1.4 Minimal top level specification

    A college student learned about the potential of data mining to solve many problems. For

    her senior biology thesis, she wants to cluster the data she’s collected over the past year on

    wild grasses of the African plains to help her categorize those grasses.


Although an avid Java programmer, she is unfamiliar with the details of data mining. Having read about JDM and having access to a commercial implementation through her school, she leverages all the automated aspects of JDM, specifying only the data and accepting all default settings for the Clustering build settings. In this way, no algorithm selection is necessary, nor any algorithm-specific settings.

She uses the API for the clustering model to inspect the identified clusters.

In this use case, JDM allows novice users to extract benefit from data mining technology by eliding algorithm details. Vendor implementations may vary in the degree of automation and the quality of models that automation produces.
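
The following is a minimal code sketch of such a function-level clustering build, assuming a JDM 1.1 implementation and an established DME connection; the table name GRASS_SAMPLES and the saved-object names are illustrative, and error handling is omitted:

    import javax.datamining.ExecutionHandle;
    import javax.datamining.clustering.ClusteringSettings;
    import javax.datamining.clustering.ClusteringSettingsFactory;
    import javax.datamining.data.PhysicalDataSet;
    import javax.datamining.data.PhysicalDataSetFactory;
    import javax.datamining.resource.Connection;
    import javax.datamining.task.BuildTask;
    import javax.datamining.task.BuildTaskFactory;

    public class MinimalClusteringBuild {
        public static void build(Connection dmeConn) throws Exception {
            // Identify the physical data (an existing table of grass observations).
            PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
                dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
            PhysicalDataSet data = pdsFactory.create("GRASS_SAMPLES", false);
            dmeConn.saveObject("grassData", data, false);

            // Function-level settings only: no algorithm settings are supplied,
            // so the DME selects an algorithm and default parameters.
            ClusteringSettingsFactory csFactory = (ClusteringSettingsFactory)
                dmeConn.getFactory("javax.datamining.clustering.ClusteringSettings");
            ClusteringSettings settings = csFactory.create();
            dmeConn.saveObject("grassSettings", settings, false);

            // The build task ties together data, settings, and the output model name.
            BuildTaskFactory btFactory = (BuildTaskFactory)
                dmeConn.getFactory("javax.datamining.task.BuildTask");
            BuildTask task = btFactory.create("grassData", "grassSettings", "grassModel");
            dmeConn.saveObject("grassBuildTask", task, false);

            ExecutionHandle handle = dmeConn.execute("grassBuildTask");
            handle.waitForCompletion(Integer.MAX_VALUE);  // block until the DME finishes
        }
    }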

    2.1.5 Selecting the “best” model

An e-tailer builds models on projected customer revenue on which to base customer discounts. The data analyst for the e-tailer builds multiple regression models drawing on several algorithms: neural networks, decision trees, and Naive Bayes. After building several models of each type, the models are tested against held-aside test data and lift is computed. An initial criterion for selecting the “best” model is the least r-squared error.

In this use case, the data analyst leverages a JDM implementation’s ability to reuse a single regression build settings object, supplying different algorithm settings. In addition, each model can be tested by defining test tasks, and coding an outer loop to iterate over the test results to identify the “best” model.
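
A sketch of such a selection loop follows, assuming each candidate model’s test results were saved as named RegressionTestMetrics objects; the getRMSError accessor stands in for whichever error measure the analyst chooses and is an assumption to verify against the implementation’s documentation:

    import javax.datamining.NamedObject;
    import javax.datamining.resource.Connection;
    import javax.datamining.supervised.regression.RegressionTestMetrics;

    public class BestModelSelector {
        // Return the name of the saved test metrics object with the smallest error.
        public static String selectBest(Connection dmeConn, String[] metricsNames)
                throws Exception {
            String best = null;
            double bestError = Double.MAX_VALUE;
            for (int i = 0; i < metricsNames.length; i++) {
                RegressionTestMetrics metrics = (RegressionTestMetrics)
                    dmeConn.retrieveObject(metricsNames[i], NamedObject.testMetrics);
                double error = metrics.getRMSError().doubleValue();  // assumed accessor
                if (error < bestError) {
                    bestError = error;
                    best = metricsNames[i];
                }
            }
            return best;
        }
    }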

    2.1.6 Comparing vendor implementations

Data Mining Laboratories (DML) performs independent analysis on data mining software to measure performance, ease of use, and model portability. DML compares the effectiveness of several vendors’ regression decision tree implementations in building models for economic forecasting. Economic forecasts are used in corporate planning to align corporate strategy with the expected economic climate. Using JDM, the DML developers code a test application that builds one regression decision tree model per vendor implementation. After testing each model, the investigators rank order models according to forecast accuracy, learning time, and the ratio of these two. To ensure fairness in assessing model performance and conformance for model portability, a separate scoring engine is used that accepts PMML standard XML models and generates scores for the test data.

In this use case, the developers are able to code a single program and execute it on multiple vendor implementations, modifying only login information. By exporting models in PMML format, models can be objectively assessed in a common scoring engine.

    2.1.7 Incremental learning

A machine tool manufacturer collects data on the machine settings, materials, and defect rates for the tools manufactured. These data are provided to a neural network algorithm to predict the probability of defective components in a given batch of product. Because data are collected over time, and the chosen neural network architecture and learning algorithm are compute intensive, the manufacturer needs to apply incremental learning on the neural network as new data become available from each production run.

In this use case, JDM provides an interface that enables incremental learning, i.e., the ability to continue building a model with the original build data or new data. To support this, a user specifies an existing JDM model as input to a build task, along with other required inputs. On execution of the task, the DME uses this model as a seed from which to continue building the model. This optional specification can be used for any type of algorithm that can leverage a seed model.

    2.1.8 Deferred task execution

A cancer researcher, who has limited access to hardware for building and testing models, needs to define and verify a series of mining tasks and store them in the mining object repository. The researcher may even build trial models on very small datasets as part of verifying the task. Using an external scheduling mechanism, such as UNIX cron jobs, the researcher schedules execution of these tasks overnight, when computing resources are more available.

In this use case, the researcher uses JDM’s task specification and ability to store objects in the mining object repository. These can later be retrieved for execution. The verify method gives the researcher greater confidence that his tasks will execute to completion. The verify method typically checks whether the logical and physical data map properly and whether the combination of settings specified is compatible.
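
A brief sketch of this pattern, assuming a JDM 1.1 implementation in which the task’s verify method returns null when no problems are found (an assumption to confirm against the vendor’s documentation); the task name is illustrative:

    import javax.datamining.NamedObject;
    import javax.datamining.resource.Connection;
    import javax.datamining.task.Task;

    public class DeferredExecution {
        // Called interactively: verify a stored task without executing it.
        public static boolean verifyStoredTask(Connection dmeConn, String taskName)
                throws Exception {
            Task task = (Task) dmeConn.retrieveObject(taskName, NamedObject.task);
            return task.verify() == null;  // assumed: a null report means no problems
        }

        // Called overnight, e.g., from a JVM launched by cron.
        public static void runStoredTask(Connection dmeConn, String taskName)
                throws Exception {
            dmeConn.execute(taskName).waitForCompletion(Integer.MAX_VALUE);
        }
    }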

    2.1.9 Explaining model behavior

A bank leverages data mining to predict credit risk for customers seeking home equity loans. To comply with government regulations to not discriminate based on gender or race, the bank must be able to prove that the rules it applies to determine credit risk exclude such criteria.

The bank’s data analyst is required to produce a set of human-understandable rules, ideally in an English-like format, that can be presented to government auditors as needed. Bank management also reviews these rules to target certain customer segments for special promotions.

In this use case, the analyst uses the JDM tree settings to request a decision tree representation for a classification model, predicting credit worthiness as low, medium, or high. The analyst then uses JDM’s interface to generate rule objects from the decision tree model and translate these rules to a particular format. A given vendor may have an English format implemented for rules.

    2.1.10 Manually enhancing a model

A private security agency builds decision tree models to profile suspicious individuals and identify individuals at airports for further screening. However, in their experience they have found that manually enhancing a model can improve its performance and accuracy. Their data mining analyst builds a model, generates an English-like representation of the rules, removes certain irrelevant rules, and possibly even adjusts some of the rule predicates. Importing this modified model to the data mining system, the analyst sets up an application to enable profiling by leveraging single record scoring of individuals, accessing information stored in government databases and information obtained from travelers at the airport.

In this use case, the analyst also uses the JDM tree settings to build a classification model. The rules are generated from the decision tree model and analyzed. However, since JDM does not enable direct model modification via the API, the analyst can export the model, perhaps in PMML, to ensure model integrity. The analyst modifies the model and attempts to import the model. Validation of the manually modified model occurs at import. JDM’s support for single record scoring enables the analyst to produce an application that joins information stored in a database about individuals with that dynamically acquired by airport personnel, perhaps at the ticket counter.

    2.1.11 OLAP schema refinement

An OLAP vendor creates cube schemas from fact tables stored in a relational database. A particular fact table contains millions of records representing sales and customer information of a beverage retail company. The OLAP vendor needs to create a schema for the OLAP cube to enable analyzing and reporting the retailer's sales data.

A cube schema is a set of dimensions, each having a particular hierarchy of attributes. Dimensions usually correspond to several columns in the fact table; however, not all columns should necessarily produce a dimension. A dimension normally represents an attribute that is orthogonal to other dimensions in the fact table. In addition, some of the columns, identified in advance, represent measures in the model.

Choosing the right set of dimensions is key to OLAP providers. If the number of dimensions is too large, efficient processing of the cube becomes practically impossible. On the other hand, dropping important attributes makes data analysis deficient. Poor cube design is one of the factors that inhibit OLAP productivity. Therefore it is important to choose the right schema.

The optimization process of a cube structure can be seen from two different perspectives. Starting from a fact table with hundreds of columns, OLAP vendors are interested in either:

• identifying truly independent columns, or

• identifying the important columns to keep in the optimized cube structure.

Attribute importance can be used to select the most important independent columns to better ’see’ a given measure. For example, an internal mechanism can build an analytic data set with columns describing both customer characteristics and product characteristics, with the sales amount as a target. The system then trains an attribute importance model on this data set. It returns the columns (describing some aspect of either the customer or the product) that best explain the spread of the average sales figure. Some advanced systems can return not only the important columns but also the drilling hierarchies that can be associated with these columns (segments for continuous variables and groups of categories for discrete variables). These important columns are then used to create a (possibly ad hoc) optimized cube structure that the end user can use to better understand the average sales figure and to build ’segments’ combining the most explanatory customer or product characteristics.

    Such schema refinements are intractable in large cubes without data mining.

    2.1.12 Web services

List Inc. offers a comprehensive list management service that includes data warehousing, grooming, merging, and predictive modeling. All their services are available as Web services, allowing customers to integrate List Inc.’s software seamlessly into their own enterprise systems using the Internet.


The Data Web service allows customers to connect to a managed warehouse and store their transaction, customer, and sales data using a secure Web service interface. List Inc. manages the customer data in its data warehouse, cleans and grooms the data, and provides a range of preprocessing and transformation facilities. They maintain a comprehensive repository of high quality background data including income, census, demographic, and geographic data. List Inc. has relationships with many data vendors and can call upon their services when required. This background data is merged with the customer data using their proprietary merge technology.

List Inc. offers a complete model training and testing facility that guarantees optimal results. The customer data is used to build predictive models to determine the best responders, to build cross-sell and up-sell models, and to investigate return on investment (ROI). List Inc. has a comprehensive testing facility that can choose the best algorithm and product combination that delivers the optimal ROI. The customer does not have to worry about data mining tool integration, training, and testing.

The customer decides only on the schedule for updating models and the ROI they require. List Inc. owns two supercomputers to provide the fastest modeling facilities available today.

JDM is critical to List Inc.’s services. The Predictive Web service wraps JDM to allow the customer to apply models. The Training Web service wraps JDM to allow the customer to build models and set parameters. JDM is used internally to connect to different vendor data mining tools and algorithms in their building and testing processes.

The Training Web service can be used by both novice customers and experienced data analysts. Mining-savvy data analysts can tailor the training process and choose particular algorithms and their settings. In addition, they can choose the attributes from their data they wish to include in models.

The Prediction Web service provides access to the resultant models across the net. The Prediction Web service interface is called with new prospect data, and the score outcome is returned. The service allows customers to enhance their software systems and their own web sites with predicted outcomes as if they owned the data mining tools themselves.

    2.2 Vendor use cases

    In this section, we present several use cases that explore how vendors can leverage JDM in

    commercial JDM implementations.

    2.2.1 Broad support of JDM

A data mining vendor has a wide range of algorithms that address each of the JDM mining functions. The vendor’s objective is to simplify mining for unsophisticated users. As such, the vendor provides automated selection of algorithms without requiring (or allowing) the user to select specific algorithms or provide algorithm-specific control of algorithms, e.g., maximum tree depth in a decision tree.

In this use case, the vendor must implement all packages of the API except Algorithm subclasses and model detail subclasses. Users of the vendor’s data mining product will specify build settings only, obtain models, and be able to view and use those models as appropriate. Note that the end-user can see only the function-specific model representations, not their underlying algorithm-specific model representations.


    2.2.2 Narrow support of JDM

A data mining vendor, Neural Networks, Inc. (NNI), supports various neural network algorithms, both published and proprietary, in its data mining tool. NNI supports both classification and regression. The vendor chooses to be JDM compliant to gain acceptance in the marketplace.

In this use case, JDM, as an a la carte standard, allows a vendor to implement a narrow portion of the standard to reflect its specific domain, or the subset of mining functions supported. The JDM packages to support this include the core foundation packages and a select few specific to neural networks, including algorithm settings and model detail.

For the vendor’s proprietary algorithms, an additional Java package, nni.feedforwardneuralnetwork, is provided, which includes the specific proprietary algorithm settings and model representations.


    3. Concepts

In this section, we introduce JDM concepts: mining function, task, principal objects, physical data representations, attribute mapping, physical data storage, object references, and reflection and introspection.

    3.1 Data mining functions

In general, data mining functions can be classified into two categories: supervised and unsupervised. Supervised functions are typically used to predict a value and require the specification of a known outcome or target for each case to be used during model building. Examples of targets include binary attributes indicating buy/no-buy, churn/no-churn, or success/failure, and multi-class attributes indicating a preferred color choice from among the primary colors, or a likely salary range binned in $20,000 increments. The target allows the algorithm to determine how well it is predicting target values. An example of a supervised learning algorithm is Naive Bayes for classification.

Unsupervised functions do not use a target, and are typically used to find the intrinsic structure, relations, or affinities in a body of data. Examples of unsupervised learning algorithms include k-means clustering and Apriori association. Clustering may be used to identify naturally occurring groups of proteins among hundreds of cases, or to segment retail customers. The itemset rules returned from an association model can be used to identify products to cross-sell to retail customers.

Another view of mining involves whether data mining is descriptive or predictive. Descriptive data mining describes a dataset in a concise and summary manner, and presents interesting general properties of the data. Algorithms supporting descriptive data mining include k-means clustering, Apriori association, and even decision tree classification. Predictive data mining constructs one or a set of models, performs inference on the available dataset, and attempts to predict outcomes for new data sets. Algorithms supporting predictive data mining include neural networks, SVM, and decision tree classification/regression, and even k-means clustering when used to assign new records to clusters.

Different algorithms serve different purposes, each algorithm offering its own advantages and disadvantages. JDM specifies the following mining functions: classification, regression, attribute importance, clustering, and association. Some algorithms can be used across multiple data mining functions.

    3.1.1 Classification

Classification has been used in customer segmentation, business modeling, and credit analysis. As a type of supervised learning, an algorithm supporting classification builds a model from a set of predictors that are used to predict a target. A set of predictors may include demographic data such as age, income, number of children, and zip code, to predict the binary target buy/no-buy for a minivan. The input or build data for a supervised learning algorithm requires the presence of attributes for both the predictors and the target in each case. Given a pre-determined set of classes in the target attribute, classification analyzes the build data to create a model that can predict to which class a given case belongs.

    3.1.2 Regression

Regression has been used in financial forecasting, time series prediction, biomedical and drug response modeling, and environmental modeling. Also a type of supervised learning, regression involves predicting a continuous, numerical valued target attribute given a set of predictors. A regression problem may use the same predictors as a classification problem, but specifies a target such as the predicted lifetime value of a customer.

    3.1.3 Attribute Importance

Attribute importance is used to determine which attributes are most relevant for building a model. Attribute importance can be used for both supervised and unsupervised learning. Attribute importance enables users to reduce model build time, and in some algorithms, reduce data scoring time by including only the most important attributes from the build data. Eliminating “noise” attributes from data can also improve accuracy or model quality.

Attribute importance serves a purpose similar to feature selection. It produces a model that ranks attributes according to how each attribute contributes to the quality of the model built. From the ranking of attributes, users may select the attributes to be used in building models. The user can specify a number or percentage of attributes to use; alternatively, a user can specify a cutoff point. Note that the ranking of attributes is usually interpretable in a relative sense. JDM specifies no precise interpretation of attribute rank values other than that attributes with greater numeric values are relatively more important.

    3.1.4 Clustering

Clustering has been used in customer segmentation, gene and protein analysis, product grouping, finding numerical taxonomies, and text mining. Clustering analysis identifies clusters embedded in the data, where a cluster is a collection of data objects that are similar to one another. A good clustering method produces high quality clusters to ensure that the inter-cluster similarity is low and the intra-cluster similarity is high. The similarity of two values of an attribute can be expressed as a distance function. For numeric data, this can be as simple as the Euclidean distance between points. For categorical data, similarity can be expressed to make married and cohabiting closer to one another, as well as separated and divorced.
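
As an illustration of the numeric case, the following small utility (not part of the JDM API) computes the Euclidean distance between two cases whose attributes are all numeric:

    public final class Distance {
        // Euclidean distance between two cases represented as equal-length
        // arrays of numeric attribute values; a plain utility, not a JDM interface.
        public static double euclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }
    }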

    3.1.5 Association

Association has been used in market basket analysis and the analysis of consumer behavior for the discovery of relationships or correlations among a set of items, e.g., the presence of one pattern implies the presence of another pattern. Association rules help to identify the attribute value conditions that occur frequently together in a given set of data. Association analysis is widely used in transaction data analysis for directed marketing, catalog design, and other business decision-making processes. Traditionally, association is used for market basket data analysis, such as “90% of the people who buy milk also buy bread.”

Support and confidence metrics are used as a quality measure of a rule within an association model. These are available in JDM as part of the association model for each rule produced. Note that the rules returned from an association model are different from the predicate-based rules produced from clustering models or decision tree models. Here, the rules consist of a set of items. These items typically occur together in a single transaction, such as the items purchased at an online retail checkout.

The support of a rule is used to ensure that the items associated in the rule occur together frequently enough to be considered significant. Using probability notation:

support(A → B) = P(A, B)

The confidence of a rule is the conditional probability of B given A:

confidence(A → B) = P(B | A) = P(A, B) / P(A)
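
As an illustrative calculation with made-up numbers: in a set of 10,000 transactions, if 600 transactions contain both milk and bread, then support(milk → bread) = 600/10,000 = 0.06; if 800 transactions contain milk, then confidence(milk → bread) = 600/800 = 0.75, i.e., 75% of the transactions that contain milk also contain bread.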

    3.2 Data mining tasks

Data mining revolves around a few common tasks: building a model, testing a model, applying a model to data, computing statistics, and importing and exporting mining objects. Each of these is discussed below.

    3.2.1 Building a model

JDM enables users to build models in the following functional areas: classification, regression, attribute importance, clustering, and association. The model serves as a typically concise or compact representation of the information contained in the data, relative to the algorithm that produced it. To build models, users define tasks, which minimally require the input parameters: model name, mining data, and mining settings. Settings contain parameters that describe the type of model to be built, as well as directions to the specific algorithm used to build the model.

There are two levels of settings: function and algorithm. Recall that the mining function addresses the type of problem to be solved, e.g., classification or clustering, and the mining algorithm addresses the specific technique to be applied to solve that problem, e.g., decision tree or k-means. When a user does not specify algorithm settings in a build settings object, the Data Mining Engine (DME) may choose an appropriate algorithm for the task, either dynamically or statically, providing defaults for the relevant parameters. Model building at the function level eliminates much of the technical detail of data mining for the user. The quality of models will be determined by the sophistication of the vendor’s implementation and the quality of the data.

Build data, i.e., the data used as input to build a model, can be in different forms. The attributes of the build data to be used in model building may be specified in the logical data associated with the build settings. JDM supports flexible assignment of build data to the logical data. If logical attributes do not map directly to physical attributes with name-based equivalence, an explicit mapping may be provided using the task object.

A typical scenario for model building is as follows (a code sketch follows the list):

1. Create a physical data object (by identifying existing data in a database table or file)

2. Create a build settings object

3. Create a logical data instance based on the physical data and associate it with the build settings (optional)

4. Create an algorithm settings object and associate it with the build settings (optional)

5. Create a build task and set the physical data and build settings

6. Map the physical attributes to logical attributes (if necessary)

7. Invoke the execute method using the task
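
The following sketch illustrates this scenario for a classification build, assuming a JDM 1.1 implementation and an established connection; the optional steps 3, 4, and 6 are skipped so the DME supplies defaults, and the table name CUSTOMERS_BUILD, target attribute RESPONSE, and saved-object names are illustrative:

    import javax.datamining.ExecutionHandle;
    import javax.datamining.data.PhysicalDataSet;
    import javax.datamining.data.PhysicalDataSetFactory;
    import javax.datamining.resource.Connection;
    import javax.datamining.supervised.classification.ClassificationSettings;
    import javax.datamining.supervised.classification.ClassificationSettingsFactory;
    import javax.datamining.task.BuildTask;
    import javax.datamining.task.BuildTaskFactory;

    public class BuildScenario {
        public static void build(Connection dmeConn) throws Exception {
            // Step 1: physical data object referencing an existing table.
            PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
                dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
            PhysicalDataSet buildData = pdsFactory.create("CUSTOMERS_BUILD", false);
            dmeConn.saveObject("buildData", buildData, false);

            // Step 2: function-level build settings with a target attribute.
            ClassificationSettingsFactory settingsFactory = (ClassificationSettingsFactory)
                dmeConn.getFactory(
                    "javax.datamining.supervised.classification.ClassificationSettings");
            ClassificationSettings settings = settingsFactory.create();
            settings.setTargetAttributeName("RESPONSE");
            dmeConn.saveObject("buildSettings", settings, false);

            // Step 5: build task naming the data, settings, and output model.
            BuildTaskFactory taskFactory = (BuildTaskFactory)
                dmeConn.getFactory("javax.datamining.task.BuildTask");
            BuildTask task = taskFactory.create("buildData", "buildSettings", "responseModel");
            dmeConn.saveObject("buildTask", task, false);

            // Step 7: execute the task and wait for completion.
            ExecutionHandle handle = dmeConn.execute("buildTask");
            handle.waitForCompletion(Integer.MAX_VALUE);
        }
    }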

    After a model is built by the DME, it can be persisted in the MOR. See section 3.7 for

    details on JDM persistence options.

The result of a build is a model. Especially for predictive models, the number of logical attributes used by the model may be a subset of those provided in the logical data. As such, the model has a signature specifying the possible input attributes to the model for apply. Not all of these attributes are required; a subset may be supplied where NULL values can be handled. Some algorithms perform automatic attribute selection, e.g., with a decision tree model, 100 attributes may have been used to train the model, but only 25 were used in the final rule set and are necessary for scoring. These 25 constitute the model signature.

    3.2.1.1 Incremental learning

Some applications have a nearly continuous stream of data available for model building. A typical approach is to collect a certain amount of data, build a model from it, use the model to score new data for some period, and then build a new model from scratch, possibly using all the data accumulated to date, or using a fixed amount of the most recent data.

This approach can be unnecessarily costly, especially for algorithms such as Naive Bayes or Association Rules where summary frequency counts are maintained. The frequency counts of the existing data do not change; only the newly added data needs to be counted and the results merged with the previous counts. This produces refreshed models in much less time.

Algorithms such as neural networks can also leverage incremental learning. Here, a previously trained neural network can be provided as a seed model. The model is further trained using new data, but starting from an already good model.

JDM provides support for incremental learning by allowing a seed model to be specified as input to the build task. Not all functions or algorithms are expected to handle the specification of a seed model for incremental learning. The function and algorithm capabilities list indicates if this feature is supported, which is vendor-specific.
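
A sketch of a refresh cycle under this model follows; note that the accessor for attaching the seed model is left commented out because the name setSeedModelName is purely hypothetical and must be replaced with whatever mechanism the vendor documents:

    import javax.datamining.resource.Connection;
    import javax.datamining.task.BuildTask;
    import javax.datamining.task.BuildTaskFactory;

    public class IncrementalBuild {
        public static void refresh(Connection dmeConn) throws Exception {
            BuildTaskFactory factory = (BuildTaskFactory)
                dmeConn.getFactory("javax.datamining.task.BuildTask");
            // New production-run data, the original settings, a new model name.
            BuildTask task = factory.create("newRunData", "defectSettings", "defectModel_v2");
            // task.setSeedModelName("defectModel_v1");  // hypothetical seed-model setter
            dmeConn.saveObject("refreshTask", task, false);
            dmeConn.execute("refreshTask").waitForCompletion(Integer.MAX_VALUE);
        }
    }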

    3.2.1.2 Model evaluation

Some algorithms, such as neural networks or decision trees, use a portion of the build data to iteratively determine how well the model is learning patterns from the data. These algorithms split the build data into a train and an evaluation dataset according to some internal percentage, e.g., 50%-50% or 70%-30%. Some users, however, wish to control more carefully which data is used for training versus evaluation.

    JDM provides support for specifying the evaluation data explicitly in the build task to be

    used during model building. Although some vendors may provide proprietary algorithm

    settings to allow specifying the percentage of data to be used for evaluation, JDM provides

    the more explicit option of providing the actual data.

    3.2.2 Testing a model

Model testing gives an estimate of the accuracy a model has in predicting the target of a supervised model. Testing follows model building to compute the accuracy of a model’s predictions when the model is applied to a previously unseen dataset, separate from the build dataset. This provides an honest estimate of the accuracy.

The test task accepts a model and data for testing the model. Test results are stored in a TestMetrics object as specified in the task. Physical attribute to logical attribute mapping may be specified if the names of physical and logical attributes do not match. The test data must be compatible with the model signature.
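
A sketch of a classification test, assuming a JDM 1.1 implementation in which the model myModel was built previously; the test table and saved-object names are illustrative, and the metrics accessor shown is an assumption to verify against the vendor’s javadoc:

    import javax.datamining.NamedObject;
    import javax.datamining.data.PhysicalDataSet;
    import javax.datamining.data.PhysicalDataSetFactory;
    import javax.datamining.resource.Connection;
    import javax.datamining.supervised.classification.ClassificationTestMetrics;
    import javax.datamining.supervised.classification.ClassificationTestTask;
    import javax.datamining.supervised.classification.ClassificationTestTaskFactory;

    public class TestScenario {
        public static double test(Connection dmeConn) throws Exception {
            // Held-aside test data, separate from the build dataset.
            PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
                dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
            PhysicalDataSet testData = pdsFactory.create("CUSTOMERS_TEST", false);
            dmeConn.saveObject("testData", testData, false);

            // The task names the test data, the model, and the output metrics object.
            ClassificationTestTaskFactory ttFactory = (ClassificationTestTaskFactory)
                dmeConn.getFactory(
                    "javax.datamining.supervised.classification.ClassificationTestTask");
            ClassificationTestTask task = ttFactory.create("testData", "myModel", "testMetrics");
            dmeConn.saveObject("testTask", task, false);
            dmeConn.execute("testTask").waitForCompletion(Integer.MAX_VALUE);

            ClassificationTestMetrics metrics = (ClassificationTestMetrics)
                dmeConn.retrieveObject("testMetrics", NamedObject.testMetrics);
            return metrics.getAccuracy().doubleValue();  // assumed accessor
        }
    }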


    Test data must be preprocessed in the same way as the build data. The user is responsible

    for ensuring this compatibility. However, some DMEs may choose to use information

    present in the LogicalData stored with the model to flag incompatibilities.

Test metrics content depends on the type of model. For example, classification models produce a confusion matrix, whereas regression models provide error estimates. In addition to obtaining a confusion matrix, model testing includes the option to compute lift and receiver operating characteristics (ROC).

