Modeler Server Performance, Optimization, And Sizing


Technical report

PASW Modeler Server Performance, Optimization, and Sizing

SPSS is a registered trademark and the other SPSS Inc. products named are trademarks of SPSS Inc. All other names are trademarks of their respective owners. © 2009 SPSS Inc. All rights reserved. CSWP-0209

Table of contents

Introduction

High performance out-of-the-box

Scaling the data mining process with SPSS Predictive Enterprise Services

Performance optimization

Advanced performance optimization

Scoping and sizing PASW Modeler Server

Conclusion

About SPSS Inc.



Introduction

Data mining offers organizations many benefits, including a more detailed view of their customers, along with a clearer view

of current conditions and deeper insight into future events. By choosing a high-performance data mining tool, organizations

can mine their data more efficiently and gain a significant return on investment (ROI). PASW Modeler*, the leading data mining

workbench from SPSS Inc., enables organizations to easily and quickly mine many types of data, including large datasets.

The result: more business value than other solutions can offer.

PASW Modeler uses a scalable, three-tiered architecture to improve modeling productivity and deployment when working with

large datasets. The PASW Modeler Client tier passes data mining processes to the PASW Modeler Server. Then PASW Modeler

Server** analyzes these tasks to determine which ones should be executed within the database. After the database processes

those tasks, it passes only the relevant aggregate or summary data to PASW Modeler Server. Since data pre-processing

(typically 80-90 percent of the data mining effort) occurs in the database tier, users will accelerate modeling, maximize

resources, and minimize network traffic.

Data mining is an exploratory and interactive process requiring immediate feedback, so high-performance tools like PASW

Modeler Server are essential. PASW Modeler Server provides increased productivity and faster access to results. When

analytical results are deployed into operational systems, the impact of performance is even more significant because of high

data volumes and real-time constraints.

Data mining is a core process involved in predictive analytics, which combines advanced analytic techniques and decision

optimization to inform and direct decision making. The value of predictive analytics is that it gives your organization the ability

to act on the results, and PASW Modeler Server's high performance is crucial to timely action. This technical brief serves as

a guide for understanding and maximizing PASW Modeler Server's already high performance. It focuses on PASW Modeler

Server's out-of-the-box performance, scalability, and performance optimization, as well as its scoping and sizing requirements.

* PASW Modeler, formerly called Clementine, is part of SPSS Inc.'s Predictive Analytics Software portfolio.

** PASW Modeler Server, formerly called Clementine Server, is part of SPSS Inc.'s Predictive Analytics Software portfolio.



High performance out-of-the-box

PASW Modeler Server has been designed and developed to provide high performance and scalability for all data mining tasks.

SQL generation and parallel processing, for example, are performed automatically. As a result, PASW Modeler users don't need

to make any changes to the way they work to get consistently high performance.

In our benchmark tests of PASW Modeler Server performance¹, we measured the ability of PASW Modeler to carry out the

common tasks of model building, model scoring, and data preparation.

Figure 1: This stream was used in tests of model building performance.

Model building: 16 million records in under five minutes

PASW Modeler Server was able to build a logistic

regression model from approximately 16 million records²

in less than five minutes (see Figure 1).

This dataset is larger than those typically used for model

building. Against a more modest-sized dataset of 500,000

records, all of the model types were built in less than two

minutes (see Figure 2).

PASW Modeler Server transforms a time-consuming

process into an iterative one and vastly reduces the time

required to build models and to find the best model.

Figure 2: The elapsed time taken to build a model using different algorithms³.

1 Test environment: 2 x Intel Xeon 3.6GHz (hyperthreaded), 8GB RAM, 36GB RAID 1 System disk, 440GB RAID 0 Data disk, Microsoft Windows Server 2003 Enterprise x64 SP1, Microsoft SQL Server 2000 SP4, and Clementine 10.0.

2 21 fields used, mixture of data types.

3 Neural network build time is affected by randomization in the selection of records to prevent overtraining.



Figure 3: This stream was used in tests of model scoring performance.

Figure 4: The elapsed time taken to score a C&RT decision tree model.

Figure 5: This stream was used in tests of data preparation performance.

4 21 fields used, mixture of data types.

Model scoring: 32 million records in close to eight minutes

In a test scoring records against a classification model

(see Figures 3 and 4), PASW Modeler Server accessed

data from a table of 32 million records⁴, scored the data

against a decision tree model, and wrote the scores to a

new database table in less than eight minutes.

This scoring was achieved at a sustained rate of close

to 65,000 records per second, equivalent to 225 million

records per hour.
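The relationship between record count, elapsed time, and sustained rate can be checked with a few lines of arithmetic. The sketch below assumes a run of exactly eight minutes, whereas the report gives only rounded figures, so its result is of the same order as the published rate rather than identical to it. The function name is illustrative.

```python
def throughput(records: int, elapsed_seconds: float) -> tuple[float, float]:
    """Return (records per second, records per hour) for a scoring run."""
    per_second = records / elapsed_seconds
    return per_second, per_second * 3600

# Assumption: 32 million records scored in exactly eight minutes.
per_second, per_hour = throughput(32_000_000, 8 * 60)
print(f"{per_second:,.0f} records/s, {per_hour / 1e6:,.0f} million records/h")
```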

Data preparation: 16 million customer records processed against 42 million products in eight minutes

Data mining is about more than model

building and scoring. A large part of the data

mining process involves preparing the data. As

seen in Figure 5, our tests of data preparation

involved the performance of multiple, common

data preparation steps, including joining

customer data to a product dataset of nearly

three times its size.



PASW Modeler Server ran the stream against 16 million customer records in approximately eight minutes for an overall rate of over 33,000 customers per second (see Figure 6).

Figure 6: The elapsed time taken to perform data preparation steps.

Scaling the data mining process with SPSS Predictive Enterprise Services

Raw data processing speed is not the only factor affecting performance. Frequently, the volume of models, rather than the volume of data, is the bottleneck hampering data mining productivity. In many organizations, the number of data miners, analysts, and others involved in the process can also have a very significant impact on performance.

Generating real performance from data mining activities often depends more on an organization's ability to manage its analytical assets and complex, multi-part analytical processes than on raw data processing performance alone. For example, powerful servers are often underutilized when organizations are unable to put the right models in the right place and effectively schedule their execution.

However, with SPSS Predictive Enterprise Services, organizations receive a complete, enterprise solution to the problems of analytical asset and process management. SPSS Predictive Enterprise Services uses an advanced, service-oriented architecture to improve the management of predictive models and related analytical processes within your organization's business operations. It extends PASW Modeler's rapid model development and deployment capabilities to create more manageable predictive analytics solutions.

By using PASW Modeler Server with SPSS Predictive Enterprise Services, one financial services organization optimized its operational analytics, reducing the time taken to execute a key analytical process by a factor of 80. This resulted in major, quantifiable savings.

By providing an integrated way to centralize and organize predictive models, and also automate predictive analytics processes, SPSS Predictive Enterprise Services helps organizations improve analytical asset and process management.

Analytical asset management

The resources that are involved in a predictive analytics process may include:

- PASW Modeler streams, models, and outputs

- Documentation

- External scripts for data preparation or report generation

- Resources from other predictive analytics tools, such as PASW Statistics syntax and outputs, and SAS code


Figure 7: Predictive Enterprise Manager allows users to create and schedule multi-part, multi-tool, analytical processes via a visual workflow interface.

These are analytical assets, the tangible results of the efforts of data mining teams. SPSS Predictive Enterprise Services

provides a centralized repository that offers:

- Security and access control

- Version control and labeling

- Audit and tracking capabilities

- Advanced data mining-aware organization and search facilities

- Direct integration with PASW Modeler and also with PASW Statistics tools

Managing analytical assets provides a foundation for data mining processes, enabling these processes to scale to the

enterprise level.

Analytical process management

Developing robust processes for data mining activities such as model building, scoring, and validation is integral to delivering

high performance on an enterprise scale. These processes often involve the combination of multiple tools and technologies.

SPSS Predictive Enterprise Services provides a visual workflow user interface, Predictive Enterprise Manager, which allows a

full, end-to-end process to be defined using assets stored in the repository and a mix of technologies (see Figure 7).

Analytical processes are fully integrated with the repository, automatically extracting the required objects and versions, and

storing the results. A scheduling service allows these processes to be executed at regular intervals, and a notification service

provides e-mail tracking.



Performance optimization

Most of PASW Modeler Server's high performance is achieved through performance optimizations that are switched on by

default. Many PASW Modeler operations can be further improved by fine-tuning performance parameters.

Maximize performance with in-database mining

One of the key benefits of PASW Modeler Server is that it allows organizations to fully utilize their investments in

high-performance database systems. Many organizations have invested heavily in a database infrastructure and business

intelligence systems, but these systems are often under-utilized by the analytical tools that use them.

PASW Modeler Server improves performance when mining large datasets by maximizing in-database mining. For example, you

can delegate as many operations as possible to your IBM DB2 Data Warehouse database or Oracle Database 10g, taking

advantage of database optimization and reducing data movement.

With PASW Modeler Server, processing is executed in the database via SQL queries. Any operation that

cannot be represented using SQL queries is performed

by the server itself. Only relevant results are passed

back to the client; perhaps more importantly, data

transfer between the database and PASW Modeler

Server is minimized.
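The effect of pushing an operation into the database can be illustrated outside PASW Modeler. The sketch below uses Python's built-in sqlite3 module as a stand-in database, with a made-up transactions table; it is not how PASW Modeler Server generates SQL, only a small demonstration of why running an aggregation as SQL moves far fewer rows than fetching the raw data and aggregating afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

# Outside the database: every row crosses the database boundary before aggregation.
rows = conn.execute("SELECT customer_id, amount FROM transactions").fetchall()
client_totals: dict[int, float] = {}
for customer_id, amount in rows:
    client_totals[customer_id] = client_totals.get(customer_id, 0.0) + amount

# In the database: the aggregation runs as SQL and only summary rows come back.
db_totals = dict(conn.execute(
    "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id"
))

print(len(rows), "rows transferred without SQL;", len(db_totals), "rows with SQL")
```

Both approaches produce the same totals; the difference is the 10,000 rows moved in the first case against 100 in the second.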

Another advantage of PASW Modeler Server's in-database mining is that it minimizes, and can even eliminate, data transfer

costs. In a test measuring the impact of in-database mining (see Figure 8), the same PASW Modeler stream was executed

with full SQL generation, no SQL generation, and a scoring-only SQL generation (which executed the scoring in-database but

performed transfer of data to and from the database).


While SQL generation of the scoring was approximately

10 percent quicker than scoring in the application,

the biggest factor in performance is data transfer, which

accounts for more than 85 percent of the elapsed time

for scoring.
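A rough, Amdahl-style calculation shows why the transfer share dominates. The sketch below is an illustration only: it assumes transfer takes 85 percent of elapsed time and the scoring computation the remaining 15 percent, reads "10 percent quicker" as a 1.10x speedup of that computation, and uses a hypothetical halving of transfer time for comparison.

```python
def overall_speedup(fraction: float, component_speedup: float) -> float:
    """Overall speedup when a component taking `fraction` of the run gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# Assumption: scoring computation is 15% of elapsed time, data transfer is 85%.
faster_scoring = overall_speedup(0.15, 1.10)  # scoring computation 10% quicker
less_transfer = overall_speedup(0.85, 2.0)    # hypothetical: transfer time halved
print(f"faster scoring: {faster_scoring:.3f}x, less transfer: {less_transfer:.3f}x")
```

Under these assumptions, speeding up the scoring step barely changes the total, while reducing transfer changes it substantially.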

The only way to manage the data transfer bottleneck

is to ensure that less data is transferred. PASW Modeler

Server's SQL generation reduces data transfer to a minimum and leverages your investment in

high-performance databases.

Figure 8: Scoring stream executed with full SQL generation, SQL generation of scoring only, and no SQL generation

Data transfer costs are the most significant factor affecting

performance. For example, over 85 percent of the time

allotted to score a model can be attributed to data transfer

between the database and the scoring application.



SQL feedback, previewing, and viewing

There will be times when analysts will want more control over the optimization of PASW Modeler streams. PASW Modeler Server supports this by providing immediate feedback: upon execution, every PASW Modeler node that can be fully translated to SQL is highlighted (see Figure 9).

Figure 9: SQL generation and highlighting in a PASW Modeler stream

In Figure 9, the PASW Modeler stream is executed using SQL generation. Many nodes are purple, rather than the usual white, during execution. Purple nodes mean that the operations represented by those nodes have been translated into SQL and executed in-database. This feedback helps an analyst ensure that as much of the stream as possible is executed in the database. Additional options allow the user to examine the SQL that is generated.

Stream optimization relies on intelligent SQL generation and stream execution

SQL generation is a powerful capability, but it depends upon analysts to understand how PASW Modeler operations can be executed on a database. And analysts are focused on solving business problems, rather than optimizing their PASW Modeler streams for performance.

For this reason, PASW Modeler Server features advanced optimization that intelligently re-orders operations in the PASW Modeler stream to maximize performance without altering results. Data miners can organize streams in a way that makes sense to them, and PASW Modeler Server will reorganize those same operations in a way that makes sense to the database.




Figure 11: Setting a cache on a node that is likely to be re-executed will store the data in a temporary table on the database, when possible. Executing streams from that cached node will allow further in-database operations.

Figure 10: Stream optimization

In Figure 10, the derive node contains an operation that

cannot be carried out in the database. PASW Modeler

optimizes the process so that the select operation is

performed before the derive operation, thereby reducing

data transfer and improving performance.
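The reordering described here can be sketched in plain Python. The example assumes, as in the text, that the select condition depends only on an original field, so it gives the same result whether it runs before or after the derive step. The function names and data are illustrative stand-ins, not PASW Modeler nodes.

```python
records = [{"id": i, "spend": float(i % 50)} for i in range(1_000)]

derive_calls = 0

def derive(record):
    """Stand-in for an operation with no SQL equivalent, run outside the database."""
    global derive_calls
    derive_calls += 1
    return {**record, "score": record["spend"] ** 0.5}

def select(record):
    """A filter on an original field, independent of the derived field."""
    return record["spend"] >= 40

# Order as drawn by the analyst: derive every record, then select.
derive_calls = 0
as_drawn = [r for r in map(derive, records) if select(r)]
calls_as_drawn = derive_calls

# Reordered: select first, then derive only the surviving records.
derive_calls = 0
reordered = [derive(r) for r in records if select(r)]
calls_reordered = derive_calls

assert as_drawn == reordered  # the results are identical either way
print(calls_as_drawn, "derive calls as drawn;", calls_reordered, "after reordering")
```

Because the select step can run in the database, only the records that pass it need to be transferred for the derive step.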

In-database caching

One common user optimization is to set up a cache on

a node. The next time data is passed through that node,

the cache is filled with that data. From then on, the data

is read from the cache rather than from the data source.

This can be a useful way to ensure that expensive data

processing is only executed once.

Normally, the cache is stored as a temporary file on the

file system, but PASW Modeler Server also supports

the caching of this data into a temporary table in the

database. When combined with SQL optimization,

this may result in significant gains in performance.

As illustrated in Figure 11, the output from a stream

that merges multiple tables to create a data mining

view may be cached and reused as needed.

Plus, by automatically generating SQL for all downstream nodes, performance can be improved further. In Figure 11,

the select operation is highlighted, indicating that the operation is being executed in the database from the filled

database cache.
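The idea of a database-side cache can be sketched with a temporary table. The example below uses Python's sqlite3 module and invented table names as a stand-in; the SQL that PASW Modeler Server actually issues is not shown in this report and will differ by database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'north');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5);
""")

# Fill the cache once: the expensive merge is materialized as a temporary table.
conn.execute("""
    CREATE TEMP TABLE mining_view_cache AS
    SELECT c.customer_id, c.region, SUM(o.amount) AS total_spend
    FROM customers AS c JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.region
""")

# Downstream operations read from the cache and still run as SQL in the database.
north = conn.execute(
    "SELECT customer_id, total_spend FROM mining_view_cache "
    "WHERE region = 'north' ORDER BY customer_id"
).fetchall()
print(north)
```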

In-database model building

PASW Modeler Server supports integration with data mining algorithms that are available from other database vendors.

Organizations can use PASW Modeler to manage the entire data mining process while modeling with the database-native

algorithms provided by these vendors. Using in-database modeling ensures that data transfer is minimized, even during

the model building phase. It also helps organizations leverage their existing investments in IBM DB2 Intelligent Miner,

Microsoft SQL Server 2005, and Oracle Data Mining.



Advanced performance optimization

In addition to in-database mining, PASW Modeler Server provides a number of capabilities that allow the user to optimize the

performance of their streams.

Database bulk-loading

Data movement is often a bottleneck in performance, especially when writing data to a database. PASW Modeler Server

provides a number of features to optimize this process for large data volumes.


Figure 12: Database export advanced options allow bulk loading to database via ODBC or through an external loader.

Figure 13: Create indexes on database tables to improve database performance.

By default, writing data to a database is performed on

a row-by-row basis. While this prevents errors and

provides data security, it slows performance. Allowing

the PASW Modeler Server to commit multiple rows at

a time is a good way to ensure more reasonable

performance, and this option is available by default.
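The difference between row-by-row and batched commits can be seen in a small sketch. It uses sqlite3 and a hypothetical scores table; the batch size of 250 is an arbitrary choice for illustration, not a PASW Modeler default.

```python
import sqlite3

def write_rows(conn, rows, batch_size):
    """Insert rows, committing once per batch; returns the number of commits."""
    commits = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany("INSERT INTO scores VALUES (?, ?)", batch)
        conn.commit()
        commits += 1
    return commits

rows = [(i, i / 1000) for i in range(1_000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (record_id INTEGER, score REAL)")
conn.commit()

row_by_row = write_rows(conn, rows, batch_size=1)   # one commit per row
conn.execute("DELETE FROM scores")
conn.commit()
batched = write_rows(conn, rows, batch_size=250)    # one commit per 250 rows
print(row_by_row, "commits row by row;", batched, "commits when batched")
```

Each commit carries a fixed cost, so fewer commits generally means faster writes, at the price of losing a whole batch if one row fails.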

In addition to the batch committal of records, PASW

Modeler Server supports two types of bulk loading,

as shown in Figure 12.

The first is provided through ODBC bulk loading facilities.

The second type uses an external bulk loading tool to

allow a database-native solution. External bulk loading

scripts are provided for Microsoft SQL Server, Oracle Data

Mining, IBM DB2 Intelligent Miner, Netezza Performance

Server, Teradata Warehouse, and IBM Redbrick

Warehouse databases. These scripts can be customized,

and custom scripts may be written for other databases.

Database indexing

Indexing database tables maintains the performance of

in-database options. Correct indexing significantly impacts

many subsequent database operations.

As shown in Figure 13, PASW Modeler Server enables

users to create indexes on tables exported from PASW Modeler. Simple indexes can be created easily, and PASW

Modeler also allows you to customize the SQL statement

used to create the index (for instance, to create a BITMAP,

UNIQUE, or FILLFACTOR index).
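As an illustration of a simple index and a customized one, the sketch below creates both on a made-up exported table using sqlite3. SQLite supports UNIQUE but not the BITMAP or FILLFACTOR options, which are specific to other databases, so only the UNIQUE variant is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scored_customers (customer_id INTEGER, segment TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO scored_customers VALUES (?, ?, ?)",
    [(i, f"segment_{i % 5}", i / 100) for i in range(500)],
)

# A simple index, plus a customized statement (here UNIQUE) on the key column.
conn.execute("CREATE INDEX idx_segment ON scored_customers (segment)")
conn.execute("CREATE UNIQUE INDEX idx_customer ON scored_customers (customer_id)")

# Ask the database how it would run a later lookup on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT score FROM scored_customers WHERE customer_id = 42"
).fetchall()
print(plan)
```

The query plan should show the lookup using idx_customer rather than scanning the whole table.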



Optimized joins and sorts

By default, PASW Modeler has to make assumptions

about the state of data in the system. For example,

PASW Modeler cannot assume that any data has already

been sorted, so many operations ensure that a sort

is performed when required, even if such a sort is

redundant. PASW Modeler allows the user to optimize

a sort or join operation by specifying any existing sorts

on the data. This eliminates redundancy and improves

performance, as shown in Figure 14.
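Why a declared sort order saves work can be shown with two inputs that are already sorted on the same key. The sketch below is a generic illustration using Python's heapq.merge, not PASW Modeler's implementation: without knowledge of the existing order the combined data must be sorted again, whereas with it a single merge pass is enough.

```python
import heapq

# Assumption: both inputs are already sorted by customer_id (an "existing sort").
batch_a = [(1, "a"), (4, "d"), (9, "i")]
batch_b = [(2, "b"), (3, "c"), (10, "j")]

# Without that knowledge, the safe choice is to sort the combined data again.
resorted = sorted(batch_a + batch_b, key=lambda row: row[0])

# With it, one linear merge pass produces the same order without a full sort.
merged = list(heapq.merge(batch_a, batch_b, key=lambda row: row[0]))

print(merged == resorted)
```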

Users can also optimize the performance of PASW

Modeler Server through special case algorithms for joins.

PASW Modeler's default join algorithm is designed to perform optimally when joining datasets of similar size.

In some very common operations, such as when using a

join to connect an ID in one table to a label or description

from another (e.g., joining a product code in a table of

transactions to a product name in a look-up table), the

default join is inefficient.

PASW Modeler offers an alternate join algorithm for these

situations that significantly boosts performance,

as can be seen in Figure 15.
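The report does not say how the alternate join works. One common technique for joining a large table to a small look-up table is to hold the small table in a hash map and stream the large table past it once. The sketch below shows that technique with invented data; it is an assumption about the general approach, not a description of PASW Modeler's algorithm.

```python
# Small look-up table: product code -> product name.
product_names = {"P1": "Kettle", "P2": "Toaster", "P3": "Blender"}

# Large table of transactions (shortened here).
transactions = [("T1", "P2", 29.0), ("T2", "P1", 19.5), ("T3", "P2", 31.0)]

def lookup_join(rows, lookup):
    """Stream the large table once, resolving each code with a hash lookup."""
    for transaction_id, product_code, amount in rows:
        yield transaction_id, lookup.get(product_code), amount

joined = list(lookup_join(transactions, product_names))
print(joined)
```

Neither input needs to be sorted, and memory use depends only on the size of the small table.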

Figure 14: Impact of pre-sorting optimization on sort performance

Figure 15: Impact of specialized join when joining a large table to a small table (250,000 records)

High performance through parallel data processing

Multithreading is a method by which an application's process can perform more than one task at the same time. Threads share the same memory space and must synchronize at certain points within their execution to access shared resources safely. Operating systems provide low-level mechanisms to support this synchronization. If an application uses more than one thread to execute, it is said to be multithreaded.

Symmetric multiprocessing (SMP) machines are widely used and available for all platforms supported by PASW Modeler

Server. They comprise multiple CPUs sharing access to the same memory, disk, network, and other I/O resources. When a

multithreaded application runs on an SMP box, threads may be distributed across the CPUs and execute truly in parallel.

Application processes and individual threads can usually migrate dynamically between CPUs to balance processor load.

This is generally handled transparently by the operating system.
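The need for synchronization when threads share memory can be shown in a few lines. This generic Python sketch has four threads update one shared counter, each taking a lock around every update; the names are illustrative and unrelated to PASW Modeler Server internals.

```python
import threading

counter = 0
lock = threading.Lock()

def add_records(n):
    """Update the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=add_records, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter)
```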

PASW Modeler Server employs parallel processing to improve performance in both data processing and modeling operations.



Parallel data processing

PASW Modeler Server uses a parallel data-sorting algorithm to improve the performance of a number of data processing

operations. Sorting is used by many PASW Modeler operations, including binning, model evaluation, merge and, of course,

the sort operation itself. All of these operations benefit from the parallelization of the sort operation.

The parallelized sort algorithm uses a technique called

record parallelism. This technique distributes records

across a number of separate sorting processes. Each process

sorts its own subset of records and then the results are joined.

Figure 16 shows the effect of running a parallelized sort on

multiprocessor hardware. At high data volumes, sort times

can be reduced by more than 30 percent.
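The structure of a record-parallel sort (split, sort each subset, join the results) can be sketched as follows. This is a minimal illustration that assumes a final merge step and uses a thread pool for simplicity; in CPython, threads will not speed up a pure-Python sort, so a process pool or native code would be needed for a real gain.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(records, workers=4):
    """Split records into chunks, sort each chunk in a worker, then merge."""
    chunk_size = max(1, -(-len(records) // workers))  # ceiling division
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # Join the sorted subsets with a single multi-way merge pass.
    return list(heapq.merge(*sorted_chunks))

random.seed(0)
data = [random.randint(0, 1_000_000) for _ in range(10_000)]
result = parallel_sort(data)
print(result == sorted(data))
```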


Figure 16: Impact of multiple CPUs on data sorting performance

Parallel predictive model building

Parallel processing techniques are also used by PASW

Modeler's C5.0 decision tree algorithm and can improve

performance in building decision trees and rule sets. The

benefits depend largely on dataset size (both the number

of records and the number of fields), but they can provide

a useful boost to what can be a time-consuming process.

Scoping and sizing PASW Modeler Server

Many factors must be considered when scoping hardware requirements for a PASW Modeler Server installation. The breadth

of PASW Modeler operations and differences in data volumes make it difficult to estimate performance for any specific

hardware configuration.

Impact of CPUs on performance

Obviously, the core speed of any individual CPU will impact data mining performance. Almost all data mining operations,

especially modeling, are heavily processor dependent, so an increase in CPU speed will produce a proportional increase

in performance for many PASW Modeler processes.

The main benefits of multiple CPUs (or multicore CPUs) occur when running multiple streams. This means that the number of

users will often be the deciding factor in determining the optimum number of CPUs. Multiple CPUs will also benefit parallelized

operations, but the main benefits will be from supporting multiple users.
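The user bands in this report's Tables 1 and 2 can be encoded as a small lookup, which is handy when scripting a sizing checklist. The values come from those tables; the function and constant names are illustrative.

```python
# (upper bound of the user band, CPUs, minimum RAM in GB), from Tables 1 and 2.
SIZING_BANDS = [(2, 1, 1), (4, 2, 2), (10, 4, 4), (20, 8, 8)]
TOP_BAND = (16, 16)  # 21 or more users

def recommend(users: int) -> tuple[int, int]:
    """Return (CPUs, minimum RAM in GB) for a number of concurrent users."""
    if users < 1:
        raise ValueError("users must be at least 1")
    for max_users, cpus, ram_gb in SIZING_BANDS:
        if users <= max_users:
            return cpus, ram_gb
    return TOP_BAND

print(recommend(7))
```

These are starting points for normal use; data volumes and scheduled workloads can push requirements higher.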



Table 1: Recommended number of CPUs per number of users

Number of users    Number of CPUs
1-2                1
3-4                2
5-10               4
11-20              8
21+                16

For a production server running scheduled data mining via SPSS Predictive Enterprise Services, the number of CPUs should be determined by the number of separate processes to be performed simultaneously. Maximum performance can be achieved, for instance, by splitting a model scoring process across multiple CPUs or building multiple models simultaneously.

Impact of physical memory on performance

Most PASW Modeler operations can be performed on large volumes of data with minimal memory usage. Only certain operations, such as sorting, joining, and modeling, require data to be temporarily stored in memory. If not enough memory is available, these operations will store part of the data as virtual memory on disk. This can affect performance, since disk access is significantly slower than memory access.

As with CPU usage, the number of users impacts the required memory for normal operation. Memory requirements depend on data volume. Typical minimum requirements can be found in Table 2.

Table 2: Minimum RAM for number of users in normal use

Number of users    Minimum RAM
1-2                1GB
3-4                2GB
5-10               4GB
11-20              8GB
21+                16GB

Large volume model building

Model building is one of the more memory-intensive operations in the data mining process. This is because the model-building algorithms require access to the entire modeling dataset, often making multiple passes over the data.

For this reason, model building is usually performed on subsets or samples of data. It is normally more productive to build different models on a small subset of the data and then choose the best model, rather than to build a single model on a larger dataset. This type of model building can usually be performed within minimal memory requirements.



Using more data rarely improves the predictive accuracy of a model. However, if model building on larger volumes is required,

additional memory can help performance.


Table 3: Estimated RAM required (GB) to avoid disk-caching during model building⁵

                   Columns
Rows (millions)    10     20     50     100    500    1000
0.1                0.5    0.5    0.5    0.5    2      4
0.5                0.5    0.5    0.5    1      4      8
1                  0.5    0.5    1      2      8      16
2                  0.5    0.5    2      4      16     32
4                  0.5    1      4      8      32     -
8                  1      2      8      16     -      -
16                 2      4      16     32     -      -
32                 4      8      32     -      -      -
64                 8      16     -      -      -      -

Table 3 provides guidance on the memory required to avoid disk-caching on model building operations, based on the memory usage of the neural network, K-means, and Kohonen modeling algorithms.

5 Estimates based on neural network, Kohonen, and K-means algorithm memory requirements. Maximum physical memory may also be limited by the operating system.

Memory configuration

PASW Modeler Server will, by default, limit the amount of physical memory used by any single process to ensure that other simultaneous processes aren't affected. A maximum of 25 percent of available memory will be allocated for model building, and approximately 10 percent will be available for sorting operations. The sorting figure is lower because there may be multiple sorts in a single stream. The PASW Modeler Server administrator can modify these settings.

Impact of disk space on performance

Before addressing disk space requirements, it is important to understand the volume of data that is likely to be used for the actual data mining. Most organizations store many terabytes of data, especially transactional data, but this amount will rarely be used. Normally the data is aggregated, selected, or sampled before it is used for analysis. While large data volumes are typically used in model scoring, the model scoring processes usually rely on operations that don't use a lot of system resources.

When trying to maximize performance, disk usage for data processing steps can be relatively high. The user often caches data to minimize execution times, and some operations will spill to disk when physical memory is unavailable. In addition, some operations may produce a dataset larger than the raw input data, further increasing disk requirements.



To understand disk usage, a series of tests was performed based upon the PASW Modeler Application Template for customer

relationship management (CRM). This template consists of streams that demonstrate data mining techniques used for CRM.

The source dataset was 72MB in size, representing a sample of 140,000 customers and 360,000 transactions, plus other

associated data.


The data was stored in text files and all operations were carried out by PASW Modeler Server; no SQL generation was required⁶.

6 SQL generation typically reduces the disk space requirements for PASW Modeler Server since many of the data preparation steps can be carried out on the database.

7 Estimates based on 1 million rows/10 columns requiring 100MB disk (high estimate) and a working multiplier of 5 times (high estimate for single user).

Figure 17: Percentage of original disk space required for data mining stream operations.

As shown in Figure 17, the tests measured the maximum

amount of disk space needed to execute over 100

separate execution streams. The vast majority of streams

required little disk usage, but others used over four times

the disk space of the source data.

Given that these data preparation steps are typically

executed infrequently (it's a best practice to store the

results of such processing as intermediate files or tables),

a conservative rule of thumb is to reserve between

three and five times the disk space required to store the

original data.

Table 4: Estimated disk space required (GB) for data mining (1-5 users)⁷

                   Columns
Rows (millions)    10     20     50     100    500    1000
1                  0.5    1      2.5    5      25     50
2                  1      2      5      10     50     100
4                  2      4      10     20     100    200
8                  4      8      20     40     200    400
16                 8      16     40     80     400    800
32                 16     32     80     160    800    1600
64                 32     64     160    320    1600   3200

This rule holds for small numbers of users because users will rarely perform high disk-usage operations simultaneously. In

addition, organizations can minimize overall disk usage by scheduling expensive data preparation steps during times of low

system usage.
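The figures in Table 4 follow from the two assumptions stated in its footnote: 100MB of disk per 1 million rows of 10 columns, and a working multiplier of 5. A minimal sketch of that arithmetic is shown below; the function and parameter names are illustrative, and 1GB is taken as 1,000MB to match the table.

```python
def disk_space_gb(rows_millions: float, columns: int,
                  mb_per_million_rows_10_cols: float = 100.0,
                  working_multiplier: float = 5.0) -> float:
    """Estimate working disk space in GB from the Table 4 footnote assumptions."""
    source_mb = rows_millions * (columns / 10) * mb_per_million_rows_10_cols
    return source_mb * working_multiplier / 1000

print(disk_space_gb(1, 10), disk_space_gb(16, 50), disk_space_gb(64, 1000))
```

Changing the multiplier to 3 gives the lower end of the three-to-five-times rule of thumb.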



Conclusion

The ever-growing amount of data created by organizations presents opportunities and challenges for data mining.

The PASW Modeler data mining solution makes it easy to use business knowledge to quickly develop, update, and deploy

predictive models.

Furthermore, PASW Modeler Server's combination of high performance, scalability, performance optimization options, and

flexible hardware requirements enables it to handle large and complex data mining projects. With PASW Modeler Server,

your organization can:

- Utilize your investment in high-performance databases for all data mining tasks, ensuring high performance and minimizing data transfer costs

- Maximize your use of multiple CPUs (or multicore CPUs) in your operating environment by using parallel processing during a number of data preparation and model-building operations

- Use in-database caching, database write-back with indexing, and optimized merging to join tables outside of the database

Scaling the entire data mining process with PASW Modeler Server makes it possible for your organization to analyze large

volumes of data efficiently, shortening the time needed to turn data into better business decisions that boost your ROI.

About SPSS Inc.

SPSS Inc. (NASDAQ: SPSS) is a leading global provider of predictive analytics software and solutions. The company's

predictive analytics technology improves business processes by giving organizations consistent control over decisions made

every day. By incorporating predictive analytics into their daily operations, organizations become Predictive Enterprises, able

to direct and automate decisions to meet business goals and achieve measurable competitive advantage.

More than 250,000 public sector, academic, and commercial customers rely on SPSS Inc. technology to help increase

revenue, reduce costs, and detect and prevent fraud. Founded in 1968, SPSS Inc. is headquartered in Chicago, Illinois. For

additional information, please visit www.spss.com.

For SPSS Inc. office locations and telephone numbers, go to www.spss.com/worldwide.
