Overview of the SPSS Modeler Integration with IBM PureData System for Analytics
Session Number 2921
Gregory Walker, Ph.D., IBM
© 2013 IBM Corporation
Takeaways
•High-level understanding of Modeler and IPDA
•Integration points
•Tips/Best Practices
Agenda
Prerequisites
SPSS Modeler and Netezza Integration Points
Tips/Best Practices
Prerequisites
•Netezza Appliance
•SPSS Modeler Client
•SPSS Modeler Server
•IBM Netezza Analytics*
Netezza/SPSS Modeler Integration Highlights
•As of IBM SPSS Modeler 15:
• Tier 1 database support
• Enhanced support for SQL generation/pushback
• 11 Netezza In-Database modeling nodes
• Scoring adapter
• Database function (UDF) exposure
SQL Pushback
SQL Pushback automatically converts Modeler nodes into corresponding SQL
Purple nodes at execution time indicate SQL Pushback is occurring for those nodes
Will attempt to include as much of the Stream as possible in SQL Pushback
Can push back none, some, or all of a Stream’s nodes
A node that cannot be represented in SQL will receive the result set of the previous node’s SQL Pushback statement
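The behavior described above can be pictured with a toy model. This is a minimal sketch, not Modeler's actual algorithm: it assumes a strictly linear stream, and the Node class and its sql_capable flag are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    sql_capable: bool  # hypothetical flag: can this node be expressed in SQL?


def split_for_pushback(stream):
    """Split a linear stream into (pushed_back, local) node lists.

    Leading SQL-capable nodes are folded into the database query. The first
    node that cannot be represented in SQL, and everything after it, runs
    locally on the result set of that query.
    """
    for i, node in enumerate(stream):
        if not node.sql_capable:
            return stream[:i], stream[i:]
    return stream, []


stream = [
    Node("Database source", True),
    Node("Select", True),
    Node("Aggregate", True),
    Node("Complex Sample", False),
    Node("Table", True),
]
pushed, local = split_for_pushback(stream)
print([n.name for n in pushed])
print([n.name for n in local])
```

In this simplified picture, "none, some, or all" corresponds to the split point falling at the start, the middle, or the end of the stream.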
Nodes Supporting SQL Generation
•Sources
• Database source only
• Can specify a table as a source
• Can enter a SQL statement directly
•Record Operations
•Field Operations
•Graphs
•Modeler Models
•Output
•Export
• Database
• Publisher (Published stream will contain generated SQL)
•Expressions
SQL Pushback – Supported Nodes
Record Node
•Select: Supports generation only if SQL generation for the select expression itself is supported (see expressions below). If any fields have nulls, SQL generation does not give the same results for discard as are given in native IBM® SPSS® Modeler.
•Sample: Simple sampling supports SQL generation in certain instances. Complex sampling does not support SQL generation.
•Aggregate: In certain instances.
•RFM Aggregate: Supports generation except if saving the date of the second or third most recent transactions, or if only including recent transactions. However, including recent transactions does work if the datetime_date(YEAR,MONTH,DAY) function is pushed back.
•Sort
•Merge: No SQL generated for merge by order. Merge by key with full or partial outer join is only supported if the database/driver supports it. Non-matching input fields can be renamed by means of a Filter node, or the Filter tab of a source node. For all types of merge, SQL_SP_EXISTS is not supported if inputs originate in different databases.
•Append: Supports generation if inputs are unsorted.
•Distinct
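The note on nulls in the Select node comes down to SQL's three-valued logic. The following is a minimal sketch using SQLite as a stand-in database; the table and column names are hypothetical, and the "procedural discard" branch is an assumed model of local evaluation, not a statement of how Modeler is implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10.0), (2, 3.0), (3, None)])

# SQL-style discard: keep rows where NOT (condition). For a NULL amount the
# condition is UNKNOWN, NOT UNKNOWN is still UNKNOWN, and the row is dropped.
sql_kept = [r[0] for r in conn.execute(
    "SELECT id FROM t WHERE NOT (amount > 5) ORDER BY id")]

# Procedural discard (assumed semantics): drop a row only when the condition
# is definitely true; a comparison against null counts as "not true".
rows = conn.execute("SELECT id, amount FROM t ORDER BY id").fetchall()
local_kept = [i for i, amount in rows
              if not (amount is not None and amount > 5)]

print(sql_kept)    # [2]
print(local_kept)  # [2, 3]
```

The row with the null amount is the only one that differs, which is why results can diverge only when fields contain nulls.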
Notes on Simple Sampling With IPDA
•First N
• Generates SQL but prevents downstream SQL generation unless the node is cached
• Will return error downstream if not cached: “A connection must be supplied as the previous nodes do not pushback”
•1-in-n
• No SQL Pushback support
•Random Percent
• Does generate SQL and does NOT inhibit downstream SQL pushback even w/o cache
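For orientation, the two sampling modes that generate SQL can be approximated by hand-written queries. This is a sketch under stated assumptions: SQLite stands in for the database, the table is hypothetical, and the SQL shown is illustrative rather than the SQL that Modeler actually emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

# First N: expressed here as a row limit.
first_n = conn.execute("SELECT id FROM t ORDER BY id LIMIT 5").fetchall()

# Random percent: keep each row with a probability of roughly 25%.
# random() is SQLite's function; abs(random() % 100) yields 0..99.
sampled = conn.execute(
    "SELECT id FROM t WHERE abs(random() % 100) < 25").fetchall()

print(len(first_n), len(sampled))
```

A row filter like the random-percent query composes naturally with further WHERE clauses and joins, which is one way to think about why it does not inhibit downstream pushback.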
SQL Generation in the Aggregate Node
Storage Sum Mean Min Max Sdev Median Count Variance Percentile
Integer Y Y Y Y Y Y Y
Real Y Y Y Y Y Y Y
Date Y Y Y
Time Y Y Y
Timestamp Y Y Y
String Y Y Y
SQL Pushback – Supported Nodes
Field Node
•Type: Supports SQL generation if the Type node is instantiated and no ABORT or WARN type checking is specified.
•Filter
•Derive: Supports SQL generation if SQL generated for the derive expression is supported (see expressions below).
•Ensemble: Supports SQL generation for Continuous targets. For other targets, supports generation only if the "Highest confidence wins" ensemble method is used.
•Filler: Supports SQL generation if the SQL generated for the derive expression is supported (see expressions below).
•Anonymize: Supports SQL generation for Continuous targets, and partial SQL generation for Nominal and Flag targets.
•Reclassify
•Binning: Supports SQL generation if the "Tiles (equal count)" binning method is used and the "Read from Bin Values tab if available" option is selected.
•RFM Analysis: Supports SQL generation if the "Read from Bin Values tab if available" option is selected, but downstream nodes will not support it.
•Partition: Supports SQL generation to assign records to partitions.
•SetToFlag
•Restructure
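The "Highest confidence wins" method named for the Ensemble node is simple enough to sketch. This is an illustration of the voting rule only, with hypothetical labels and confidences; it is not taken from Modeler's implementation.

```python
def highest_confidence_wins(predictions):
    """Pick the label whose model reported the highest confidence.

    predictions: list of (label, confidence) pairs, one per base model.
    """
    label, _confidence = max(predictions, key=lambda p: p[1])
    return label


# Three hypothetical base models scoring the same record.
votes = [("churn", 0.61), ("stay", 0.83), ("churn", 0.70)]
print(highest_confidence_wins(votes))  # stay
```

Because the rule only compares values within a single record, it can be written as a row-wise expression, which plausibly explains why it is the method that pushes back for non-continuous targets.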
Graph Node
•Graphboard: SQL generation is supported for the following graph types: Area, 3-D Area, Bar, 3-D Bar, Bar of Counts, Heat map, Pie, 3-D Pie, Pie of Counts. For Histograms, SQL generation is supported for categorical data only.
•Distribution
•Web
•Evaluation
SQL Pushback – Supported Nodes
Export Node
•Database
•Publisher: The published stream will contain generated SQL.
Output Node
•Table: Supports generation if SQL generation is supported for highlight expression (see expressions below).
•Matrix: Supports generation except if "All numerics" is selected for the Fields option.
•Analysis: Supports generation, depending on the options selected.
•Transform
•Statistics: Supports generation if the Correlate option is not used.
•Report
•Set Globals
Model Apply Node
•C&R Tree: Supports SQL generation for the single tree option, but not for the boosting, bagging or large dataset options.
•QUEST
•CHAID
•C5.0
•Decision List
•Linear: Supports SQL generation for the standard model option, but not for the boosting, bagging or large dataset options.
•Neural Net: Supports SQL generation for the standard model option (Multilayer Perceptron only), but not for the boosting, bagging or large dataset options.
•PCA/Factor
•Logistic: Supports SQL generation for Multinomial procedure but not Binomial. For Multinomial, generation is not supported when confidences are selected, unless the target type is Flag.
•Generated Rulesets
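To see why a single tree lends itself to SQL generation, it helps to render one by hand. The sketch below assumes a hypothetical two-level tree and uses SQLite as a stand-in database; the nested CASE expression is one plausible rendering, not the SQL Modeler generates.

```python
import sqlite3

# Hypothetical tree: dicts are splits (left branch taken when field < threshold),
# strings are leaf predictions.
tree = {
    "field": "tenure", "threshold": 12,
    "left": {"field": "calls", "threshold": 3, "left": "stay", "right": "churn"},
    "right": "stay",
}


def tree_to_sql(node):
    """Render a tree as a nested SQL CASE expression."""
    if isinstance(node, str):
        return f"'{node}'"
    return (f"CASE WHEN {node['field']} < {node['threshold']} "
            f"THEN {tree_to_sql(node['left'])} "
            f"ELSE {tree_to_sql(node['right'])} END")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, tenure INTEGER, calls INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 6, 1), (2, 6, 5), (3, 24, 9)])

query = f"SELECT id, {tree_to_sql(tree)} AS prediction FROM customers ORDER BY id"
scores = conn.execute(query).fetchall()
print(scores)  # [(1, 'stay'), (2, 'churn'), (3, 'stay')]
```

Boosted or bagged models would need many such expressions plus a combining step, which gives some intuition for why those options are listed as unsupported.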
SQL Pushback - Expressions
Expressions
Operators + - / * ><
Relational Operators = /= > >= < <=
Functions
Abs Islowercode Or
Allbutfirst Isnumbercode Pi
Allbutlast Isstartstring Real
And Issubstring Rem
Arccos Isuppercode Round
Arcsin Last Sign
Arctan Length Sin
Arctanh Locchar Sqrt
Cos Log String
Div Log10 Strmember
Exp Lowertoupper Subscrs
Fracof Max Substring
Hasstartstring Member Substring_between
Hassubstring Min Uppertolower
Integer Negate To_string
Intof Not
Isalphacode Number
Aggregate Functions Sum Min Count
Mean Max Sdev
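One practical use of the list above is a quick pre-check of an expression before relying on pushback. The helper below is a hypothetical sketch: SUPPORTED holds only a subset of the function names from the slide, and absence from the slide's list does not prove that a function never pushes back.

```python
import re

# A subset of the function names listed on the slide, lower-cased.
SUPPORTED = {
    "abs", "round", "sqrt", "log", "log10", "exp", "min", "max",
    "length", "substring", "lowertoupper", "uppertolower", "to_string",
}


def unsupported_functions(expression):
    """Return function names used in a CLEM-style expression that are not in SUPPORTED."""
    names = re.findall(r"([A-Za-z_][A-Za-z0-9_]*)\s*\(", expression)
    return sorted({n.lower() for n in names} - SUPPORTED)


print(unsupported_functions("round(sqrt(abs(amount)))"))        # []
print(unsupported_functions("date_days_difference(start, end)"))
```

An empty result suggests the expression uses only listed functions; a non-empty result points at the calls worth checking against the product documentation.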
Enabling SQL Pushback
•Verify Modeler Server enablement from Modeler Client:
• Help -> About -> Additional Details
• Look for “Server Enablement”
Enabling SQL Pushback, Continued
•Enable Optimization Settings
• Tools -> Stream Properties -> Options -> Optimization
How Do I Know SQL Pushback Occurs?
•Nodes will turn purple
Where SQL Pushback Can Help the Most
•Joins
• Merge by key
•Aggregation
•Selection
•Sorting
•Field Derivations
•Field Projections
•Scoring
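The benefit for joins and aggregation is easiest to see side by side. This is a minimal sketch with SQLite as a stand-in database and hypothetical tables: the first query does the merge by key, aggregation, and sort in the database, while the second path fetches every row and combines them client-side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (cust_id INTEGER, region TEXT);
CREATE TABLE calls (cust_id INTEGER, minutes REAL);
INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
INSERT INTO calls VALUES (1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5);
""")

# Pushed back: one statement, only the aggregated rows leave the database.
pushed = conn.execute("""
    SELECT c.region, SUM(k.minutes) AS total_minutes, COUNT(*) AS n_calls
    FROM customers c JOIN calls k ON c.cust_id = k.cust_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()

# Not pushed back: all rows of both tables are fetched and combined locally.
regions = dict(conn.execute("SELECT cust_id, region FROM customers"))
totals = {}
for cust_id, minutes in conn.execute("SELECT cust_id, minutes FROM calls"):
    total, n = totals.get(regions[cust_id], (0.0, 0))
    totals[regions[cust_id]] = (total + minutes, n + 1)
local = [(region, t, n) for region, (t, n) in sorted(totals.items())]

print(pushed)
print(local)
```

Both paths produce the same answer; the difference is how many rows cross the network, which grows with table size only in the second path.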
In-Database Scoring
Scoring with SPSS Modeler
•Out of database scoring
•SQL Pushback
•Scoring Adapter
In-Database Scoring Using SQL Pushback
•Small number of Modeler Models
In-Database Scoring Using SQL Pushback
•Must enable model to score via SQL Pushback within the Model Nugget
• Double-click model nugget -> Settings
In-Database Scoring Using SQL Pushback
Benchmark: Run Regression Model (customer churn prediction)
•Analytic model*: 102M rows, 18 GB of data
•Systems compared: IPDA 1000-12 and Oracle Exadata ¼ Rack
•Table storage: Compressed, Compressed High
•Configurations: out of box performance, after tuning
•Reported run times: 9 seconds, 59 minutes, 178 seconds
•Headline result: 20x faster
* Created 20 Telco Churn Models using Multinomial Logistic Regression and scored a compressed Table with 102M rows using SQL Pushback
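The three reported run times can be related to the "20x faster" headline with simple arithmetic. Which time belongs to which system and configuration is not stated unambiguously in the text, so the pairing below is an assumption: it treats the headline as comparing the 178-second run against the 9-second run.

```python
# Reported run times, converted to seconds.
t_short = 9
t_mid = 178
t_long = 59 * 60  # 59 minutes

ratio_mid = t_mid / t_short    # about 19.8, consistent with "20x faster"
ratio_long = t_long / t_short  # about 393

print(round(ratio_mid, 1), round(ratio_long, 1))
```

Under that reading, the headline figure is the more conservative of the two possible comparisons.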
SPSS Modeler Scoring Adapter
Extension to current In-Database Capabilities allowing more SPSS Modeler models to be scored In-Database
Improve the efficiency of scoring models by minimizing data movement and leveraging database capabilities
Supported for IPDA w/ NPS version > 6.0 P8
Modeler Scoring Adapter Overview
Implementation
The IBM SPSS Modeler Server Scoring Adapter must be installed within the database that you will be using with Modeler (the adapter leverages database UDFs for processing)
Models are stored within tables and published when updated
You do not need individual installations for each model.
Benefit:
Once installed, Modeler will automatically use the adapter when a stream is executed and the stream is running against that database.
Usage:
Can be turned off at the Server level if needed; alternatively, the scoring method to use can be chosen at the model level
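The idea of scoring through database UDFs can be illustrated in miniature. This sketch is an analogy only: SQLite stands in for the database, score_churn is a hypothetical UDF name, and the logistic coefficients are made up. The real adapter's function names and model storage are not shown in this deck.

```python
import math
import sqlite3

# Hypothetical logistic model; coefficients are invented for illustration.
INTERCEPT, B_TENURE, B_CALLS = -1.0, -0.05, 0.4


def score_churn(tenure, calls):
    """Churn probability from the hypothetical model."""
    z = INTERCEPT + B_TENURE * tenure + B_CALLS * calls
    return 1.0 / (1.0 + math.exp(-z))


conn = sqlite3.connect(":memory:")
# Register the scoring component once; SQL can then call it for any row.
conn.create_function("score_churn", 2, score_churn)
conn.execute("CREATE TABLE customers (id INTEGER, tenure INTEGER, calls INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [(1, 6, 5), (2, 36, 0)])

scored = conn.execute(
    "SELECT id, score_churn(tenure, calls) AS p_churn FROM customers ORDER BY id"
).fetchall()
print(scored)
```

The scoring logic lives in one reusable function rather than in generated SQL, which mirrors the point that one installation serves many models.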
Scoring Adapter SQL Pushback * Local Scoring
C&RT, Quest, CHAID, C5.0 X X X
Decision List X X X
ALM X X
Linear Regression X X X
Logistic Regression X X X
Neural Net X X X
Discriminant X X
GenLin X X
Cox X X
SVM X X
Bayes Net X X
SLRM X X
K-Means, Kohonen, Two Step X X
Anomaly Detection X X
KNN X X
Split Models, Large Dataset, Boosting, Bagging X X
GLMM X X
PCA / Factor X X
Feature Selection X X
Time Series / Sequence X
Apriori / Carma X
Text Analytics X
Social Network Analysis X
Entity Analytics X
*Not all options supported - refer to product documentation for Limitations
Pure SQL vs Scoring Adapter for Model Scoring
Pure SQL | Scoring Adapter (UDFs)
Difficult to support some model scoring algorithms | Easily supports a large class of scoring algorithms
Requires a SQL mapping to be constructed for each model type | Reuses existing scoring component to score each model type
Resulting SQL will run on many database systems | Needs to be adapted for each database system requiring support
No database extensions required | Requires database extensions to be installed
Performance/reliability harder to predict | Performance/reliability easier to predict
Harder to generate SQL to score ensemble models | Easier to score ensemble models
Database Function Exposure
•Exposed in downstream nodes via Expression Builder
• Derive
• Select
• Balance*
• Filler
• Analysis
• Report
• Table
• Merge
• Merge by Condition
•Includes
• Regular database functions
• UDFs
* Balance node does not pushback to database
Database Function Exposure
• Can be useful for replacing Modeler functions that do not pushback
• E.g., various Modeler time arithmetic functions
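As an illustration of replacing a local time calculation with a database function, the sketch below computes a day difference entirely in SQL. SQLite and its julianday function are stand-ins chosen so the example runs anywhere; the equivalent function names on Netezza differ and should be taken from the database documentation. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, ordered TEXT, shipped TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2013-01-01", "2013-01-04"), (2, "2013-02-27", "2013-03-01")],
)

# The date arithmetic runs inside the database, so the derivation does not
# force rows back to the client.
days = conn.execute("""
    SELECT id,
           CAST(julianday(shipped) - julianday(ordered) AS INTEGER) AS days_to_ship
    FROM orders ORDER BY id
""").fetchall()
print(days)  # [(1, 3), (2, 2)]
```

In Modeler the analogous step would be selecting the database function in the Expression Builder of a Derive node instead of a Modeler function that does not push back.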
IPDA In-Database Models
Netezza In-Database Models
INZA models supported within Modeler
Bayes Net
Decision Trees
Divisive Clustering
Generalized Linear
K-Means
KNN
Linear Regression
Naive Bayes
PCA
Regression Tree
Time Series
Enabling Netezza In-Database Modeling
•Tools -> Options -> Helper Applications -> IBM Netezza
Using Netezza In-Database Models
Model Scoring in IPDA or SPSS?
•Depends on dataset size
[Chart: Processing Time (sec), axis 0 to 2000, versus Number of Records, 10,000 to 100,000,000, for two series: Netezza and SPSS Modeler]
Summary Netezza + SPSS Modeler Integration
Feature: Benefit
•Asymmetric massively parallel processing (AMPP) architecture: Answers to your sophisticated questions, across all of your data, returned in a fraction of the time it used to take
•Analytics Workbench: Easy to build, manage, validate and deploy analytic models
•SQL Pushback: In-database optimized SQL generated for common data preparation tasks including sampling
•In-Database Data Mining / Model Building: Ready-to-use, parallelized, in-database via Netezza Analytics: Decision Trees, K-Means, PCA, Linear Regression, Regression Trees, Bayes Network, Naïve Bayes, K Nearest Neighbors, Divisive Clustering, GLM, Time Series
•In-Database Model Scoring via SPSS Algorithms: C&RT, Quest, CHAID, C5.0, Decision List, Linear, Neural Net, PCA/Factor, Logistic, Generated Rulesets
•In-Database Ensemble Scoring: Delivers higher performance for ensemble models with larger data and more dimensions / variables
•In-Database Scoring with Scoring Adapter: Delivers high performance scoring for Modeler models that cannot be rendered in SQL.
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2013. All rights reserved.
•U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, and IBM SPSS Modeler, IBM PureData for Analytics are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.