Overview of the SPSS Modeler Integration with IBM PureData System for Analytics Session Number 2921

Gregory Walker, Ph.D., IBM

© 2013 IBM Corporation

Takeaways

•High-level understanding of Modeler and IPDA

•Integration points

•Tips/Best Practices

Agenda

Prerequisites

SPSS Modeler and Netezza Integration Points

Tips/Best Practices

Prerequisites

•Netezza Appliance

•SPSS Modeler Client

•SPSS Modeler Server

•IBM Netezza Analytics*

Netezza/SPSS Modeler Integration Highlights

•As of IBM SPSS Modeler 15:

• Tier 1 database support

• Enhanced support for SQL generation/pushback

• 11 Netezza In-Database modeling nodes

• Scoring adapter

• Database function (udfs) exposure

SQL Pushback

SQL Pushback Automatically converts Modeler nodes into corresponding SQL

Purple nodes at execution time indicate SQL Pushback is occurring for those nodes

Will attempt to include as much of the Stream as possible in SQL Pushback

Can push back none, some, or all of a Stream’s nodes

A node that cannot be represented in SQL will receive the result set of the previous node’s SQL Pushback statement
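
The behaviour described above can be sketched in a few lines of Python. This is a simplified model, not Modeler's implementation: the `Node` class, `split_stream`, and `execute` are hypothetical names, SQLite stands in for the database, and the sketch assumes that every node after the first non-SQL node runs locally.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    name: str
    # Wraps the upstream query in SQL, or None if the node has no SQL form.
    to_sql: Optional[Callable[[str], str]]
    # Local fallback applied to rows already fetched from the database.
    run_local: Callable[[list], list]


def split_stream(nodes):
    """Split the stream into the leading run of SQL-capable nodes and the rest."""
    for i, node in enumerate(nodes):
        if node.to_sql is None:
            return nodes[:i], nodes[i:]
    return nodes, []


def execute(conn, source_sql, nodes):
    """Push back as many leading nodes as possible, then finish locally."""
    pushed, local = split_stream(nodes)
    sql = source_sql
    for node in pushed:
        sql = node.to_sql(sql)          # nest each node around the previous query
    rows = conn.execute(sql).fetchall()  # one statement runs in the database
    for node in local:
        rows = node.run_local(rows)      # remaining nodes receive the result set
    return sql, rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 5), ("east", 20), ("west", 30), ("west", 7)])

stream = [
    Node("Select",
         lambda q: f"SELECT * FROM ({q}) WHERE amount > 10",
         lambda rows: [r for r in rows if r[1] > 10]),
    Node("Sort",
         lambda q: f"SELECT * FROM ({q}) ORDER BY amount DESC",
         lambda rows: sorted(rows, key=lambda r: r[1], reverse=True)),
    # Hypothetical node with no SQL representation.
    Node("LocalOnly", None,
         lambda rows: [(r[0].upper(), r[1]) for r in rows]),
]

final_sql, result = execute(conn, "SELECT region, amount FROM sales", stream)
print(final_sql)
print(result)
```

Here Select and Sort are folded into one nested statement, and the last node only sees the two rows that statement returns.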


Nodes Supporting SQL Generation

Sources

Database source only

Can specify a table as a source

Can enter a SQL statement directly

Record Operations

Field Operations

Graphs

Modeler Models

Output

Export

Database

Publisher (Published stream will contain generated SQL)

Expressions

SQL Pushback – Supported Nodes

Record Node

Select Supports generation only if SQL generation for the select expression itself is supported (see expressions below). If any fields have nulls, SQL generation does not give the same results for discard as are given in native IBM® SPSS® Modeler.

Sample

Simple sampling supports SQL generation in certain instances.

Complex sampling does not support SQL generation.

Aggregate In certain instances

RFM Aggregate Supports generation except if saving the date of the second or third most recent transactions, or if only including recent transactions. However, including recent transactions does work if the datetime_date(YEAR,MONTH,DAY) function is pushed back.

Sort

Merge

No SQL generated for merge by order.

Merge by key with full or partial outer join is only supported if the database/driver supports it.

Non-matching input fields can be renamed by means of a Filter node, or the Filter tab of a source node.

For all types of merge, SQL_SP_EXISTS is not supported if inputs originate in different databases.

Append Supports generation if inputs are unsorted.

Distinct
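
The null caveat on the Select node comes down to SQL's three-valued logic. The sketch below uses SQLite as a stand-in database and a hypothetical table `t`. The "local" rule (a comparison against a missing value counts as false) is an assumption made for illustration, not a statement of how Modeler evaluates nulls, and the SQL is not claimed to be what Modeler generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 3), (2, 8), (3, None)])

# In SQL, discarding records where x > 5 means keeping rows where NOT (x > 5).
# For id 3 the comparison is NULL, NOT NULL is still NULL, and WHERE drops the row.
sql_kept = [row[0] for row in
            conn.execute("SELECT id FROM t WHERE NOT (x > 5) ORDER BY id")]

# Assumed local rule: a missing value never satisfies the condition,
# so the record is not discarded.
rows = conn.execute("SELECT id, x FROM t ORDER BY id").fetchall()


def matches(x):
    return x is not None and x > 5


local_kept = [record_id for record_id, x in rows if not matches(x)]

print(sql_kept)    # the null record is missing here
print(local_kept)  # the null record survives here
```

Comparing results with and without pushback on a field that contains nulls is a quick way to see whether a stream is affected.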


Notes on Simple Sampling With IPDA

•First N

• Generates SQL but prevents downstream SQL generation unless the node is cached

• Will return error downstream if not cached: “A connection must be supplied as the previous nodes do not pushback”

•1-in-n

• No SQL Pushback support

•Random Percent

• Does generate SQL and does NOT inhibit downstream SQL pushback even w/o cache

SQL Generation in the Aggregate Node

Storage Sum Mean Min Max Sdev Median Count Variance Percentile

Integer Y Y Y Y Y Y Y

Real Y Y Y Y Y Y Y

Date Y Y Y

Time Y Y Y

Timestamp Y Y Y

String Y Y Y
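
As an illustration of what an Aggregate node looks like once it is expressed in SQL, here is a minimal sketch with SQLite standing in for the database. The table, column, and alias names are hypothetical, and the statement is hand-written rather than Modeler output. Sdev, Variance, Median and Percentile are left out because SQLite has no built-in equivalents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("east", 30.0), ("west", 5.0)])

# Key field: region. Aggregated field: amount. Plus a record count per key.
query = """
SELECT region,
       SUM(amount) AS amount_sum,
       AVG(amount) AS amount_mean,
       MIN(amount) AS amount_min,
       MAX(amount) AS amount_max,
       COUNT(*)    AS record_count
FROM sales
GROUP BY region
ORDER BY region
"""
aggregated = conn.execute(query).fetchall()
for row in aggregated:
    print(row)
```

Only one row per key value leaves the database, which is why aggregation is a good pushback candidate.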

SQL Pushback – Supported Nodes Field Node

Type Supports SQL generation if the Type node is instantiated and no ABORT or WARN type checking is specified.

Filter

Derive Supports SQL generation if SQL generated for the derive expression is supported (see expressions below).

Ensemble Supports SQL generation for Continuous targets. For other targets, supports generation only if the "Highest confidence wins" ensemble method is used.

Filler Supports SQL generation if the SQL generated for the derive expression is supported (see expressions below).

Anonymize Supports SQL generation for Continuous targets, and partial SQL generation for Nominal and Flag targets.

Reclassify

Binning Supports SQL generation if the "Tiles (equal count)" binning method is used and the "Read from Bin Values tab if available" option is selected.

RFM Analysis Supports SQL generation if the "Read from Bin Values tab if available" option is selected, but downstream nodes will not support it.

Partition Supports SQL generation to assign records to partitions.

SetToFlag

Restructure


Graph Node

Graphboard SQL generation is supported for the following graph types: Area, 3-D Area, Bar, 3-D Bar, Bar of Counts, Heat map, Pie, 3-D Pie, Pie of Counts. For Histograms, SQL generation is supported for categorical data only.

Distribution

Web

Evaluation

SQL Pushback – Supported Nodes

Export Node

Database

Publisher The published stream will contain generated SQL.


Output Node

Table Supports generation if SQL generation is supported for highlight expression (see expressions below).

Matrix Supports generation except if "All numerics" is selected for the Fields option.

Analysis Supports generation, depending on the options selected.

Transform

Statistics Supports generation if the Correlate option is not used.

Report

Set Globals

Model Apply Node

C&R Tree Supports SQL generation for the single tree option, but not for the boosting, bagging or large dataset options.

QUEST

CHAID

C5.0

Decision List

Linear Supports SQL generation for the standard model option, but not for the boosting, bagging or large dataset options.

Neural Net Supports SQL generation for the standard model option (Multilayer Perceptron only), but not for the boosting, bagging or large dataset options.

PCA/Factor

Logistic Supports SQL generation for Multinomial procedure but not Binomial. For Multinomial, generation is not supported when confidences are selected, unless the target type is Flag.

Generated Rulesets

SQL Pushback - Expressions


Expressions

Operators + - / * ><

Relational Operators = /= > >= < <=

Functions

Abs Islowercode Or

Allbutfirst Isnumbercode Pi

Allbutlast Isstartstring Real

And Issubstring Rem

Arccos Isuppercode Round

Arcsin Last Sign

Arctan Length Sin

Arctanh Locchar Sqrt

Cos Log String

Div Log10 Strmember

Exp Lowertoupper Subscrs

Fracof Max Substring

Hasstartstring Member Substring_between

Hassubstring Min Uppertolower

Integer Negate To_string

Intof Not

Isalphacode Number

Aggregate Functions Sum Min Count

Mean Max Sdev

Enabling SQL Pushback

•Verify Modeler Server enablement from Modeler Client:

• Help -> About -> Additional Details

• Look for “Server Enablement”

Enabling SQL Pushback, Continued

•Enable Optimization Settings

• Tools -> Stream Properties -> Options -> Optimization

How Do I Know SQL Pushback Occurs?

•Nodes will turn purple

Where SQL Pushback Can Help the Most

•Joins

• Merge by key

•Aggregation

•Selection

•Sorting

•Field Derivations

•Field Projections

•Scoring
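
Joins head the list above, so here is a minimal sketch of what a Merge by key amounts to in SQL. SQLite stands in for the database, the tables and columns are hypothetical, and the statements are hand-written rather than Modeler output. The inner join keeps only matching keys; the partial (left) outer join keeps every customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (cust_id INTEGER, name TEXT);
CREATE TABLE orders (cust_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO orders VALUES (1, 20.0), (1, 5.0), (3, 12.5);
""")

# Merge by key, matching records only.
inner = conn.execute("""
    SELECT c.cust_id, c.name, o.total
    FROM customers c
    JOIN orders o ON c.cust_id = o.cust_id
    ORDER BY c.cust_id, o.total
""").fetchall()

# Merge by key, partial outer join: customers without orders are kept.
partial_outer = conn.execute("""
    SELECT c.cust_id, c.name, o.total
    FROM customers c
    LEFT OUTER JOIN orders o ON c.cust_id = o.cust_id
    ORDER BY c.cust_id, o.total
""").fetchall()

print(inner)
print(partial_outer)
```

When both inputs live in the same database, the join runs where the data is and only the merged rows travel to Modeler.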

In-Database Scoring

Scoring with SPSS Modeler

•Out of database scoring

•SQL Pushback

•Scoring Adapter

In-Database Scoring Using SQL Pushback

•Small number of Modeler Models

Model Apply nodes that support SQL generation, with the restrictions listed under SQL Pushback – Supported Nodes: C&R Tree, QUEST, CHAID, C5.0, Decision List, Linear, Neural Net, PCA/Factor, Logistic, Generated Rulesets


•Must enable model to score via SQL Pushback within the Model Nugget

• Double-click model nugget -> Settings
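
To make the idea of scoring through SQL concrete, the sketch below turns a small decision tree into a nested CASE expression and checks it against row-by-row local scoring. Everything here is an assumption for illustration: the tree, its field names, and the `tree_to_sql` helper are hypothetical, SQLite stands in for the database, and the SQL is not what Modeler itself generates.

```python
import sqlite3

# Hypothetical tree: internal nodes test "field <= threshold"; leaves hold a prediction.
TREE = {
    "field": "tenure", "threshold": 12,
    "left": {"field": "calls", "threshold": 3,
             "left": {"predict": "stay"},
             "right": {"predict": "churn"}},
    "right": {"predict": "stay"},
}


def tree_to_sql(node):
    """Render the tree as a nested SQL CASE expression."""
    if "predict" in node:
        return f"'{node['predict']}'"
    return (f"CASE WHEN {node['field']} <= {node['threshold']} "
            f"THEN {tree_to_sql(node['left'])} "
            f"ELSE {tree_to_sql(node['right'])} END")


def score_locally(node, record):
    """Walk the same tree in Python for a single record."""
    while "predict" not in node:
        branch = "left" if record[node["field"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["predict"]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, tenure INTEGER, calls INTEGER)")
data = [(1, 6, 1), (2, 6, 5), (3, 24, 9)]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", data)

scoring_sql = f"SELECT id, {tree_to_sql(TREE)} AS prediction FROM customers ORDER BY id"
in_database = conn.execute(scoring_sql).fetchall()
local = [(i, score_locally(TREE, {"tenure": t, "calls": c})) for i, t, c in data]

print(scoring_sql)
print(in_database)
```

Both paths give the same predictions; the difference is that the SQL path never moves the input rows out of the database.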

Benchmark: Run Regression Model (customer churn prediction)

Analytic model*: 102M rows, 18 GB of data

Systems: Oracle, IPDA 1000-12, Exadata ¼ Rack

Configurations: Compressed, Compressed High; out of box performance, after tuning

Reported times: 59 minutes, 178 seconds, 9 seconds

Result: 20x faster

* Created 20 Telco Churn Models using Multinomial Logistic Regression and scored a compressed Table with 102M rows using SQL Pushback

SPSS Modeler Scoring Adapter

Extension to current In-Database Capabilities allowing more SPSS Modeler models to be scored In-Database

Improve the efficiency of scoring models by minimizing data movement and leveraging database capabilities

Supported for IPDA w/ NPS version > 6.0 P8

Modeler Scoring Adapter Overview

Implementation

The IBM SPSS Modeler Server Scoring Adapter must be installed within the database that you will be using with Modeler (it leverages database UDFs for processing)

Models are stored within tables and published when updated

You do not need individual installations for each model.

Benefit:

Once installed, Modeler will automatically use the adapter when a stream is executed and the stream is running against that database.

Usage:

Can be turned off at the Server level if needed; alternatively, the scoring method to use can be chosen at the model level

Scoring Adapter SQL Pushback * Local Scoring

C&RT, Quest, CHAID, C5.0 X X X

Decision List X X X

ALM X X

Linear Regression X X X

Logistic Regression X X X

Neural Net X X X

Discriminant X X

GenLin X X

Cox X X

SVM X X

Bayes Net X X

SLRM X X

K-Means, Kohonen, Two Step X X

Anomaly Detection X X

KNN X X

Split Models, Large Dataset , Boosting, Bagging X X

GLMM X X

PCA / Factor X X

Feature Selection X X

Time Series / Sequence X

Apriori / Carma X

Text Analytics X

Social Network Analysis X

Entity Analytics X

*Not all options supported - refer to product documentation for Limitations

Pure SQL vs Scoring Adapter for Model Scoring

Pure SQL | Scoring Adapter (UDFs)

Difficult to support some model scoring algorithms | Easily supports a large class of scoring algorithms

Requires a SQL mapping to be constructed for each model type | Reuses existing scoring component to score each model type

Resulting SQL will run on many database systems | Needs to be adapted for each database system requiring support

No database extensions required | Requires database extensions to be installed

Performance/reliability harder to predict | Performance/reliability easier to predict

Harder to generate SQL to score ensemble models | Easier to score ensemble models

Database Function Exposure


•Exposed in downstream nodes via Expression Builder

• Derive

• Select

• Balance*

• Filler

• Analysis

• Report

• Table

• Merge

• Merge by Condition

•Includes

• Regular database functions

• UDFs

* Balance node does not pushback to database


• Can be useful for replacing Modeler functions that do not push back

• E.g., various Modeler time arithmetic functions
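
The sketch below shows the general idea of using a database-side function for date arithmetic so that the whole expression runs in the database. It is an analogy only: SQLite's `create_function` registers a Python function for the connection, which is not how UDFs are installed on Netezza, and `days_between` and the `accounts` table are hypothetical names.

```python
import sqlite3
from datetime import date


def days_between(start_iso, end_iso):
    """Number of days from start to end, both given as ISO date strings."""
    return (date.fromisoformat(end_iso) - date.fromisoformat(start_iso)).days


conn = sqlite3.connect(":memory:")
# Register the function so SQL statements on this connection can call it.
conn.create_function("days_between", 2, days_between)

conn.execute("CREATE TABLE accounts (id INTEGER, opened TEXT, closed TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "2013-01-01", "2013-01-31"),
                  (2, "2013-02-01", "2013-03-01")])

# The derivation and the selection both happen inside the SQL statement.
long_lived = conn.execute("""
    SELECT id, days_between(opened, closed) AS days_open
    FROM accounts
    WHERE days_between(opened, closed) > 29
    ORDER BY id
""").fetchall()
print(long_lived)
```

In a stream, the equivalent move is to pick the database function in the Expression Builder for a Derive or Select node instead of a Modeler function that blocks pushback.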

IPDA In-Database Models

Netezza In-Database Models

INZA models supported within Modeler

Bayes Net

Decision Trees

Divisive Clustering

Generalized Linear

K-Means

KNN

Linear Regression

Naive Bayes

PCA

Regression Tree

Time Series

Enabling Netezza In-Database Modeling

•Tools -> Options -> Helper Applications -> IBM Netezza

Using Netezza In-Database Models






Model Scoring in IPDA or SPSS?

•Depends on dataset size

[Chart: Processing Time (sec), axis 0 to 2000, versus Number of Records, 10,000 to 100,000,000; series: Netezza, SPSS Modeler]

Summary Netezza + SPSS Modeler Integration

Feature | Benefit

Asymmetric massively parallel processing (AMPP) architecture | Answers to your sophisticated questions, across all of your data, returned in a fraction of the time it used to take

Analytics Workbench | Easy to build, manage, validate and deploy analytic models

SQL Pushback | In-database optimized SQL generated for common data preparation tasks including sampling

In-Database Data Mining / Model Building | Ready-to-use, parallelized, in-database via Netezza Analytics: Decision Trees, K-Means, PCA, Linear Regression, Regression Trees, Bayes Network, Naïve Bayes, K Nearest Neighbors, Divisive Clustering, GLM, Time Series

In-Database Model Scoring via SPSS Algorithms | C&RT, Quest, CHAID, C5.0, Decision List, Linear, Neural Net, PCA/Factor, Logistic, Generated Rulesets

In-Database Ensemble Scoring | Delivers higher performance for ensemble models with larger data and more dimensions / variables

In-Database Scoring with Scoring Adapter | Delivers high performance scoring for Modeler models that cannot be rendered in SQL.

Acknowledgements and Disclaimers

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.

•U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, and IBM SPSS Modeler, IBM PureData for Analytics are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.


Communities

• On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more

o Find the community that interests you …

• Information Management bit.ly/InfoMgmtCommunity

• Business Analytics bit.ly/AnalyticsCommunity

• Enterprise Content Management bit.ly/ECMCommunity

• IBM Champions

o Recognizing individuals who have made the most outstanding contributions to Information Management, Business Analytics, and Enterprise Content Management communities

• ibm.com/champion


Thank You Your feedback is important!

• Access the Conference Agenda Builder to complete your session surveys

o Any web or mobile browser at http://iod13surveys.com/surveys.html

o Any Agenda Builder kiosk onsite
