Essay 3 “Visualization of Multi-Algorithm Clustering for Better Economic Decisions - The
Case of Car Pricing”
VISUALIZATION OF MULTI-ALGORITHM CLUSTERING
FOR BETTER ECONOMIC DECISIONS - THE CASE OF CAR PRICING
Abstract
Clustering decisions frequently arise in business applications such as recommendations concerning
products, markets, human resources, etc. Currently, decision makers must analyze diverse algorithms
and parameters on an individual basis in order to establish preferences on the decision-making issues
they face, because there is no supportive model or tool that enables comparing the different result-clusters
generated by these algorithm and parameter combinations.
The Multi-Algorithm-Voting (MAV) methodology enables not only visualization of the results
produced by diverse clustering algorithms, but also quantitative analysis of those results.
The current research applies the MAV methodology to the case of recommending new-car pricing.
The findings illustrate the impact and the benefits of such a decision support system.
Key words: Decision Making, Decision Support System, Cluster Analysis, Visualization techniques,
Multi-Algorithm-Voting, Pricing.
1. Introduction
Unsupervised clustering decisions, i.e. decisions involving the classification of samples without prior
knowledge of the exact number of clusters, frequently arise in business applications such as finance (pricing), computer
science (image processing), marketing (market segmentation), and medicine (diagnostics) among others
[7] [8] [12] [17] [26] [28] [29] .
Currently, researchers, decision makers and business analysts must test and analyze diverse algorithms
and parameters on an individual basis in order to establish preferences to make decisions
about the problems they face. However, supportive models or tools to help them compare different
result-clusters produced by these algorithm and parameter combinations are very limited. Commercial
products neither show the resulting clusters of multiple methods nor give the decision maker tools with
which to analyze and compare the outcomes of various analyses.
Furthermore, visualization of the dataset and its classification is virtually impossible when more than
three attributes are used, as is the case in many financial problems, since displaying the dataset in such a
case requires dropping some of the attributes, or using a method to display the dataset distribution over
four (or more) dimensions. This makes it very difficult to relate to the dataset samples. In particular it is
hard to determine which of these samples might be difficult to classify, even when they are classified
correctly, and which samples and clusters stand out clearly [5] [10] [16] [19] [27] .
We developed a methodology called Multi-Algorithm Voting (MAV), which overcomes these
shortcomings by using a “Tetris-like” visualization format, which enables a cross-algorithm presentation
[4] . The “Tetris-like” format is composed of rows, columns and colors; each column represents a
specific algorithm, each line represents a specific sample case, and each color represents a “Vote” (i.e.,
decision suggestion, formed by a specific algorithm for a specific sample case).
In this article we apply the MAV methodology and its visualization approach to a common financial and
marketing recommendation problem: the dilemma of car pricing. Pricing of consumer
products in general, and pricing of cars in particular, is an important factor in the success of the product
over its launch and lifecycle, which is why the topic of car pricing is well researched. Within this context
the following advantages are discussed:
- Visual presentation of multiple classification options, resulting from diverse algorithms, using tools developed specifically for this purpose.
- Identification of areas (with respect to the car pricing problem) where the clustering is effective and areas where it is less effective.
- Identification of irregular samples that may indicate difficult pricing and positioning of the product.
- Identification of the most effective algorithms for the tested dataset and the pricing problem at hand.
2. Theoretical Background
2.1. Cluster Analysis
In order to classify a dataset of samples according to a given set of attributes, a decision maker uses
algorithms that process the attributes of the dataset samples and associate them with suggested clusters.
These associations are obtained by calculating a likelihood measure, which indicates the likelihood of a
sample to be associated with a certain cluster.
The current research uses hierarchical clustering methods. These algorithms take the dataset attributes
that need to be clustered and start by classifying the dataset so that each sample represents its own
cluster. They then merge the clusters in steps, each step merging two clusters into a single cluster, until
only one cluster (the whole dataset) remains. The algorithms differ in the way distance is measured
between clusters, mainly by using two parameters: (1) the distance or a likelihood measure, e.g.
Euclidean, Dice, etc.; and (2) the cluster method, e.g. Between-Group Linkage, Nearest-Neighbor, etc.
[14] [17] .
Five hierarchical algorithms were used in this study to classify the datasets. In all of them the commonly
used squared Euclidean distance was used as the likelihood measure. It calculates the
distance between two samples as the sum of the squared differences between their
attributes.
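As a minimal illustration (attribute vectors hypothetical), the squared Euclidean measure simply sums the squared differences between corresponding attributes, with no square root taken:

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance: the sum of squared differences
    between corresponding attributes (no square root is taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two hypothetical samples described by three numeric attributes
print(squared_euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 3^2 + 4^2 + 0^2 = 25.0
```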
As seen above, the algorithms and the likelihood measures differ in their definition of the task, i.e. the
clusters are different and the distance of a sample from a cluster is measured differently. Thus the
resulting dataset classification differs although there is no obvious dependency among the applied
algorithms [14] . The analysis becomes even more complicated if the true classification is unknown and
the decision maker has no means of identifying the core of the correct classification or the samples that
are difficult to classify.
2.2. Cluster Analysis Visualization
Currently, datasets are analyzed according to the following method:
- The decision maker selects the best classification algorithm based on his or her experience and knowledge of the dataset and problem at hand.
- The decision maker tunes the chosen classification algorithm by determining parameters such as the likelihood measure.
- The decision maker applies the algorithm to the dataset using one of the following options:
  - Predetermination of a fixed number of clusters to divide the dataset into (supervised classification).
  - Deciding on the preferred number of clusters to classify the dataset into based on the algorithm output (unsupervised classification).
Presently, there are a limited number of visual aids to help the decision maker with the analysis of the
clustering results. These methods are discussed below.
2.2.1. Visualization - Dendrogram
Clustering results can be displayed in numerical tables, in 2D and 3D graphs, and when hierarchical
classification algorithms are applied, also in a dendrogram.
A dendrogram is a tree-like graph that presents the entire “clustering space”, i.e. the merging of clusters
from the initial case, where each sample is a separate cluster, to the total merger, where the whole
dataset is one cluster. The lines connecting clusters in a dendrogram represent clusters that are joined,
while the length of the connecting lines represents the likelihood coefficient for a merger. The shorter
the distance, the greater the likelihood that the clusters will merge. Though the dendrogram provides the
decision maker with some sort of visual representation, the information in the dendrogram relates to the
chosen algorithm and does not compare or utilize additional algorithms. The information itself serves as
a visual aid to joining clusters, however the dendrogram does not provide a clear indication of
inconsistent samples in the sense that while a certain sample was classified to belong to a certain cluster,
this classification might not be accurate and the sample may actually belong to a different cluster.
The dendrogram is a common visual aid used by decision makers, but it is not applicable to all algorithms.
Among the tools that utilize the dendrogram visual aid is the Hierarchical Clustering Explorer. This tool
attempts to deal with the multidimensional presentation of datasets with multiple variables. It produces
the dashboard in Figure 1, built around the dendrogram, which shows the classification process of
hierarchical clustering, together with a scatter plot that offers a human-readable presentation of the dataset but is
limited to two variables [24] [25] .
Figure 1: HCE Dashboard [25]
Although dendrograms are a popular tool, it is important to note that a dendrogram can only represent a
single algorithm at a time and cannot compare or utilize multiple algorithms simultaneously. Hence, a
dendrogram cannot single out unusual cases and this may result in a misleading interpretation and
inaccurate clustering.
2.2.2. Visualization - Discriminant Analysis & Factor Analysis
The problem of clustering may be perceived as finding functions applied to the variables that
discriminate between samples and decide on cluster membership. Since usually there are more than two
or even three variables it is difficult to visualize the samples in such multidimensional spaces. Some
methods use discriminating functions, which are a transformation of the original variables, and present
them on two-dimensional plots. Discriminant function analysis is analogous to multiple regression.
Two-group discriminant analysis is also called Fisher linear discriminant analysis [13] . In general, in
this approach we fit a linear equation of the type:
Group = a + b1*x1 + b2*x2 + ... + bm*xm
Where: a is a constant and b1 through bm are regression coefficients.
The variables (attributes) with significant regression coefficients are the ones that contribute most to the
prediction of group membership. However, these coefficients do not tell us between which groups the
respective functions discriminate. The means of the functions across the groups identify which groups each function discriminates between.
This can be visualized by plotting the individual scores for the discriminant functions, as illustrated in
Figure 2.
Figure 2: Discriminant Analysis of Fisher’s Iris Dataset [1]
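As a sketch of fitting an equation of this type, the single-predictor case (m = 1) can be solved by ordinary least squares; the toy attribute values and group codes below are hypothetical, and a real discriminant analysis would use all m attributes:

```python
def fit_line(xs, groups):
    """Least-squares fit of group = a + b*x, the single-predictor
    case of the discriminant-style linear equation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_g = sum(groups) / n
    b = (sum((x - mean_x) * (g - mean_g) for x, g in zip(xs, groups))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_g - b * mean_x
    return a, b

# Hypothetical attribute values for six samples in two groups
xs = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
groups = [1, 1, 1, 2, 2, 2]
a, b = fit_line(xs, groups)
# A clearly positive b indicates this attribute helps discriminate the groups
```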
Factor analysis is another way to determine which variables (attributes) define a particular discriminant
function. These correlations can be regarded as factor loadings of the variables on each
discriminant function. Figure 3 illustrates the visualization of both correlations between the variables in
the model (using adjusted Factor Analysis), and discriminant functions using a tool that combines these
two methods [1] [23] . Each ray represents one variable (attribute). The angle between any two rays
represents the correlation between these variables (possible factors).
Figure 3: Discriminant Analysis & Factor Analysis of Fisher’s Iris Dataset [23]
2.2.3. Visualization - Self-Organizing Map
Self-Organizing Map (SOM) is another method, based on neural network methods, for clustering data.
This method is an iterative one that flattens the multidimensional data into one or two dimensions, thus
identifying similar clusters by visual attributes such as distance and color [20] . This method presents its
clustering recommendation in a visual manner, but the attributes of the data are not well presented
especially when there are many of them. Research on the presentation of SOM has suggested an
extended presentation of the maps, showing not just each sample's representative color but also how that
color is constructed from the sample's attributes [19] . An example of such presentations can be seen in
Figure 4.
Figure 4: SOM presentation and extended presentation [19]
2.2.4. Cluster Analysis Visualization - Discussion
As described above, these methodologies support visualization of a specific classification, based on a
single set of parameters. For this reason, current methodologies are usually incapable of making
comparisons between different algorithms and leave the decision regarding which algorithm to
choose to the decision maker. Furthermore, most visual aids, though providing a visual interpretation of
the classification by the method of choice, lose some of the relevant information along the way, as in the
case of Discriminant Analysis, where the actual relations among the dataset variables are lost when
projected onto the two-dimensional space.
This leaves the decision maker with very limited visual assistance and makes a full view of the relations
between the samples and a comparison between the dataset classifications difficult.
2.3. Cluster Analysis using MAV
2.3.1. The “Tetris-like” format
As said earlier, although dendrograms are a popular tool, they can only represent a single method at a
time and cannot compare or utilize multiple algorithms simultaneously. Hence, a dendrogram cannot
single out unusual cases and this may result in a misleading interpretation and inaccurate clustering.
MAV overcomes these shortcomings by enabling a cross-algorithm presentation in which all clusters are
presented together in a “Tetris-like format” in which each column represents a specific algorithm, each
line represents a specific sample case, and each color represents a “Vote” (i.e., decision suggestion,
formed by a specific algorithm for a specific sample case).
Consider the following illustration. In Figure 5, there are seven algorithms, denoted A1 to A7, and five
samples with a numerical ID denoted S. The three gray scale colors represent three distinctive
categorizations. Samples 174, 175 and 178 have an identical pattern determined by six out of seven
algorithms that voted for dark-gray color categorizations, and one algorithm that voted for the mid-gray
colored categorization. Samples 176 and 177 have identical patterns composed of three colors. Four
algorithms voted for the mid-gray colored categorization, two algorithms voted for the dark-gray color
categorization, and one algorithm voted for the light-gray color categorization. Alternatively, there could
be a case where all seven algorithms concur (i.e., vote for the same color categorization). By rearranging
the line orders, case 178 would be associated with cases 174 and 175 because all three have the same
color pattern. Thus, we would obtain two clusters: one consisting of samples 174, 175, and 178 and the
other consisting of samples 176 and 177. Finally, we could say that the first cluster consists of three samples
that represent the dark-gray category. The second cluster consists of two samples (176 and 177) that
did not achieve the same level of agreement as in the other case: there is a majority for the mid-gray
color, but it could be claimed that both the dark- and light-gray categorizations have an influence.
Clearly, there are other potential situations, such as a case in which there is total agreement, or a case in which
no distinctive categorization emerges (e.g. three dark-gray votes, three mid-gray
votes, and one light-gray vote).
Here, we seek to minimize the heterogeneity meter representing the voting consensus. A decision about
which cluster a decision maker should decide to adopt should be based on the level of heterogeneity vs.
homogeneity. The sorted “Tetris block” diagram gives the decision maker a clear and explicit indication
of which cluster should be adopted. As such, a cluster with the minimal heterogeneity (maximum
homogeneity) should be adopted. This resolves the problem of arbitrary decisions concerning the
number of clusters, that is, of where to “cut” a dendrogram.
Figure 5: Tetris-like format (unsorted)
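The Figure 5 scenario can be reproduced with a short sketch (vote values and sample IDs taken from the example above; "D", "M" and "L" stand for the dark-, mid- and light-gray categorizations):

```python
from collections import Counter

# Vote matrix: rows = samples, columns = algorithms A1..A7
votes = {
    174: ["D", "D", "D", "M", "D", "D", "D"],
    175: ["D", "D", "D", "M", "D", "D", "D"],
    176: ["M", "M", "D", "L", "M", "D", "M"],
    177: ["M", "M", "D", "L", "M", "D", "M"],
    178: ["D", "D", "D", "M", "D", "D", "D"],
}

# Rearranging the line order so identical vote patterns sit together
# yields the sorted "Tetris" diagram: sample 178 joins 174 and 175.
clusters = {}
for sample in sorted(votes, key=lambda s: votes[s]):
    clusters.setdefault(tuple(votes[sample]), []).append(sample)

for pattern, members in clusters.items():
    majority = Counter(pattern).most_common(1)[0][0]
    print(members, "-> majority vote:", majority)
```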
2.3.2. Voting and the Heterogeneity Meter
In order to find the best association, the Heterogeneity Meter needs to be minimized, i.e. identify the
association that makes the votes for each sample as homogeneous as possible.
The Heterogeneity Meter is then used to sort the Voting Matrix, giving the decision maker a clear, two-
dimensional perspective of the clusters and indicating how well each sample is associated with its
designated cluster. Several meters can be used to calculate the Heterogeneity Meter, including the
following:
Squared Vote Error (SVE) is calculated as the square sum of all the algorithms [votes] that did not
vote for the chosen classification. It is calculated as follows:
H = Σ_{i=1}^{n} (N − M_i)^2    (1)
Equation 1: SVE Heterogeneity Meter
Where:
H – is the Heterogeneity Meter
N – is the number of algorithms voting for the sample
M – is the maximum number of similar votes according to a specific association obtained for a single sample
i – is the sample number
n – is the total number of samples in the dataset
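A direct reading of Equation 1 in code (the vote-matrix layout, one list of votes per sample, is our assumption):

```python
from collections import Counter

def sve(vote_matrix):
    """Squared Vote Error (Equation 1): for each sample, square the
    number of algorithms that did not join the majority vote, then sum."""
    h = 0
    for sample_votes in vote_matrix:
        n = len(sample_votes)                           # N: algorithms voting
        m = Counter(sample_votes).most_common(1)[0][1]  # M: max similar votes
        h += (n - m) ** 2
    return h

# Figure 5 example: three samples with 6-of-7 agreement contribute 1 each,
# two samples with 4-of-7 agreement contribute 9 each: H = 3*1 + 2*9 = 21
matrix = [["D"] * 6 + ["M"]] * 3 + [["M"] * 4 + ["D", "D", "L"]] * 2
print(sve(matrix))  # 21
```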
Distance From Second Best (DFSB) is calculated as the difference in the number of votes that the best
vote, i.e. the vote common to most algorithms, received and the number of votes the second best vote
received. The idea is to find out how much separates the best vote from the rest. This is actually a
Homogeneity meter as a higher score indicates less heterogeneity. It is calculated as follows:
H = Σ_{i=1}^{n} (B_i − SB_i)    (2)
Equation 2: DFSB Homogeneity Meter
Where:
H – is the Homogeneity Meter
B – is the Best cluster, i.e. the cluster voted for most times, for a given sample
SB – is the Second Best cluster for a given sample
i – is the sample number
n – is the total number of samples in the dataset
To maintain consistency in the association of the clusters, the negative of the DFSB meter is used,
turning it into a Heterogeneity meter.
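The negated DFSB meter of Equation 2 can be sketched the same way (the same hypothetical vote-matrix layout, one list of votes per sample):

```python
from collections import Counter

def dfsb(vote_matrix):
    """Negated Distance-From-Second-Best (Equation 2): subtract, for each
    sample, the gap between its best and second-best vote counts, so that
    lower (more negative) values mean less heterogeneity."""
    h = 0
    for sample_votes in vote_matrix:
        counts = [c for _, c in Counter(sample_votes).most_common()]
        best = counts[0]
        second_best = counts[1] if len(counts) > 1 else 0
        h -= best - second_best
    return h

# Figure 5 example: gaps of 6-1=5 (three samples) and 4-2=2 (two samples)
matrix = [["D"] * 6 + ["M"]] * 3 + [["M"] * 4 + ["D", "D", "L"]] * 2
print(dfsb(matrix))  # -(3*5 + 2*2) = -19
```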
We used the SVE meter to associate the algorithm clusters. This meter yields more clearly associated
clusters than the DFSB meter, which emphasizes the best-associated samples. Using the SVE meter, the
decision maker can identify which samples belong to which cluster with the highest significance. Thus
the methodology enables the classification of the dataset and the distribution of the samples within each
cluster for further analysis.
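The association step itself, in which each algorithm's arbitrary cluster labels are relabeled so that the votes line up, can be sketched as a brute-force permutation search. This is a greedy, column-by-column simplification under the SVE meter; the actual MAV association procedure may differ, and the data layout is hypothetical:

```python
from collections import Counter
from itertools import permutations

def associate(columns, labels=(1, 2, 3)):
    """Relabel each algorithm's cluster labels (columns[j][i] = label that
    algorithm j gave sample i) to minimize the SVE meter, trying every
    label permutation per column, greedily from left to right."""
    def sve(cols):
        total = 0
        for row in zip(*cols):                       # votes for one sample
            m = Counter(row).most_common(1)[0][1]
            total += (len(row) - m) ** 2
        return total

    aligned = [list(columns[0])]  # the first column fixes the reference labels
    for col in columns[1:]:
        best = None
        for perm in permutations(labels):
            mapping = dict(zip(labels, perm))
            candidate = [mapping[v] for v in col]
            score = sve(aligned + [candidate])
            if best is None or score < best[0]:
                best = (score, candidate)
        aligned.append(best[1])
    return aligned

# Two algorithms that agree perfectly except for swapped label names
print(associate([[1, 1, 2, 2], [2, 2, 1, 1]], labels=(1, 2)))
# [[1, 1, 2, 2], [1, 1, 2, 2]]
```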
2.4. The case of Car Pricing
Cars are a popular commodity, and there are numerous pricing models. This is why we chose to
demonstrate the implementation of the proposed methodology on car pricing.
Previous research has modeled car characteristics into frameworks to estimate their effect on car pricing
for new car models [2] [3] , car price comparisons between different countries [11] or the effect of
Corporate Average Fuel Economy (CAFE) Standards regulations on the automobile industry and on car
prices [15] . Other studies have applied the Mixed Multi-Nomial-Logit (MNL) model on automobiles to
estimate the penetration of alternatives to fuel vehicles [6] [21] .
The fact that this is a well-researched area, using, as in the case of cluster analysis, models based on
sample characteristics to estimate the distribution, encouraged us to demonstrate the value and
capabilities of the proposed methodology by analyzing a well-known car characteristics dataset. This
application visualizes the dataset and provides the decision maker with a display of the full dataset
showing trends and anomalies in an easily grasped format. When used in the initial stages of a study,
this approach provides the decision maker with tools to rapidly decide on what to concentrate, and a
simple means to communicate this to fellow researchers. The researcher, as an expert in the field, can
then fine-tune the findings through different algorithms, different likelihood meters or different
characteristics; in all cases, however, the procedure and the presentation remain the same.
3. Research Objectives and Environment
The objective of this study is to give the decision maker a visual aid, difficult to achieve otherwise, showing
the distribution of car market prices in 1993, as presented by the dataset at hand. This is a new
approach to modeling in the car industry, which is usually based on a specific target.
The dataset contains new car model specifications for cars sold in the US in the year 1993. This dataset
appeared in the Journal of Statistics Education [18] and contains random car model data collected from
Consumer Reports: The 1993 Cars - Annual Auto Issue (April 1993) and PACE New Car & Truck 1993
Buying Guide (1993) [9] [22] . The source eliminated Pickup trucks and Sport/Utility Vehicles (SUV)
since their information was incompatible with the rest of the cars. We also eliminated models where
information was incomplete in the original dataset.
The cars in the dataset were classified into three price classes, using a commonly used classification into
three categories:
- Economy cars, costing less than $15,000
- Middle-class cars, costing more than $15,000 but less than $30,000
- Luxury cars, costing more than $30,000
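This three-way ground-truth classification amounts to a simple binning rule; the handling of cars costing exactly $15,000 or $30,000 is our assumption, since the source gives only strict bounds:

```python
def price_class(price_usd):
    """Ground-truth price class T: 1 = Economy, 2 = Middle, 3 = Luxury."""
    if price_usd < 15_000:
        return 1  # Economy
    if price_usd < 30_000:
        return 2  # Middle (boundary handling assumed)
    return 3      # Luxury

print(price_class(12_500), price_class(21_000), price_class(45_000))  # 1 2 3
```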
We used the following parameters for each car to perform the clustering:
- The car manufacturer
- The number of air bags in the car
- The number of cylinders in the car engine
- The car's engine size
- The car's horsepower
- The car's transmission type
- The car's fuel tank capacity
- The car's passenger capacity
- The car's length
- The car's wheel base
- The car's width
- The car's rear seat capacity
- The car's luggage capacity
- The car's origin: domestic or foreign
We performed the classification using the following five algorithms (marked as M1 - M5) via SPSS
version 13.0 for Windows.
M1 - Average Linkage (between Groups)
This method calculates the distance between two clusters by applying the likelihood measure to all
the samples of one cluster compared with all the samples of the other cluster. The
two clusters with the best likelihood measure are then united.
M2 - Average Linkage (within Groups)
This method calculates the distance between two clusters by applying the likelihood measure to all
the samples in the two clusters. The clusters with the best average likelihood measure are then
united.
M3 - Single Linkage
This method, as in the average linkage (between groups) method, calculates the distance between
two clusters by applying the likelihood measure to all the samples of one cluster and then
comparing it with all the samples of the other cluster. The two clusters with the best likelihood
measure, from a pair of samples, are united.
M4 - Median Method
This method calculates the median of each cluster. The likelihood measure is applied to the
medians of the clusters, after which the clusters with the best median likelihood are then united.
M5 - Ward Method
This method calculates the centroid for each cluster and the square of the likelihood measure of
each sample in both the cluster and the centroid. The two clusters which, when united, have the
smallest (negative) effect on the sum of likelihood measures are the clusters that are
united.
These algorithms were chosen after initial implementation of multiple hierarchical algorithms on the car
pricing dataset and selecting the algorithms that were able to produce reasonable classification results.
For all classifications, we used the Squared Euclidean likelihood measure.
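The classifications themselves were produced in SPSS. Purely as an illustrative sketch, M3 (Single Linkage) with the squared Euclidean measure can be written in a few lines of Python (the sample data are hypothetical):

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def single_linkage(samples, n_clusters):
    """Agglomerative single linkage (M3): start with one cluster per
    sample and repeatedly unite the pair of clusters whose closest
    pair of samples has the smallest squared Euclidean distance."""
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(squared_euclidean(samples[i], samples[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]   # unite the closest pair of clusters
        del clusters[b]
    return clusters

# Two tight hypothetical groups of attribute vectors
samples = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(single_linkage(samples, 2))  # [[0, 1], [2, 3]]
```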
The True classification is marked as T for reference; Sample ID is marked as S and the Row number as
R for reference.
Color-coding was used to distinguish between the different classes so that the same color denotes the
same classification after algorithm association; the class numbers in each row are arbitrary numbers
resulting from the individual classification of each algorithm.
Microsoft Excel 2003 was used to perform the analysis and the visualization of the Vote Matrix.
4. Results
4.1. Classification Visual Results
After performing the classification, we ran the model to associate the different classification results
using the SVE method. The results were as follows:
4.1.1. First car price class results (Economy)
Figure 6 shows that M3, Single Linkage, was unable to match this class correctly because this
algorithm classified nearly all the samples as belonging to the same class.
In general, apart from the Single Linkage algorithm, this class was easy for all algorithms to classify,
although samples 54 and 60 were classified as belonging to the second class by some algorithms. This
classification may indicate that these models are under-priced.
Algorithm M4, the Median method, correctly classified all the cars belonging to this class, an indication
that this algorithm is a good candidate if the decision maker needs to find models belonging to this class.
However, as we will see later, using this algorithm may result in faulty classification - models belonging
to other classes may be assigned to this class.
R S T M1 M2 M3 M4 M5
1 12 1 2 1 1 1 3
2 13 1 2 1 1 1 3
3 20 1 2 1 1 1 3
4 21 1 2 1 1 1 3
5 22 1 2 1 1 1 3
6 25 1 2 1 1 1 3
7 27 1 2 1 1 1 3
8 29 1 2 1 1 1 3
9 31 1 2 1 1 1 3
10 34 1 2 1 1 1 3
11 35 1 2 1 1 1 3
12 37 1 2 1 1 1 3
13 39 1 2 1 1 1 3
14 41 1 2 1 1 1 3
15 48 1 2 1 1 1 3
16 49 1 2 1 1 1 3
17 53 1 2 1 1 1 3
18 55 1 2 1 1 1 3
19 57 1 2 1 1 1 3
20 63 1 2 1 1 1 3
21 64 1 2 1 1 1 3
22 65 1 2 1 1 1 3
23 70 1 2 1 1 1 3
24 71 1 2 1 1 1 3
25 72 1 2 1 1 1 3
26 74 1 2 1 1 1 3
27 75 1 2 1 1 1 3
28 78 1 2 1 1 1 3
29 28 1 1 1 1 1 1
30 40 1 1 1 1 1 1
31 42 1 1 1 1 1 1
32 54 1 1 2 1 1 1
33 60 1 1 2 1 1 1
Figure 6: Cars’ First Price Classification
4.1.2. Second car price class results (Middle)
Figure 7 shows that algorithm M3, Single Linkage, identified this class correctly in most cases, but this
is because it did not work well with the dataset and identified most of the samples as belonging to
middle class cars with very few exceptions (three).
This class was also classified correctly most of the time by the different algorithms. Using algorithm
M1, Average Linkage (between Groups), is the best classification choice. In addition, the cars in
rows 66-71 were classified as belonging to the first price class by most algorithms, suggesting that
they may be overpriced. On the other hand, the car in row 72 was classified as belonging to price
class three, suggesting that it is under-priced.
R S T M1 M2 M3 M4 M5
34 8 2 1 2 1 2 1
35 16 2 1 2 1 2 1
36 33 2 1 2 1 2 1
37 80 2 1 2 1 2 1
38 3 2 1 2 1 1 1
39 7 2 1 2 1 1 1
40 9 2 1 2 1 1 1
41 14 2 1 2 1 1 1
42 17 2 1 2 1 1 1
43 19 2 1 2 1 1 1
44 36 2 1 2 1 1 1
45 50 2 1 2 1 1 1
46 58 2 1 2 1 1 1
47 59 2 1 2 1 1 1
48 62 2 1 2 1 1 1
49 66 2 1 2 1 1 1
50 68 2 1 2 1 1 1
51 82 2 1 2 1 1 1
52 5 2 1 2 1 2 2
53 26 2 1 2 1 2 2
54 56 2 1 2 1 2 2
55 44 2 1 2 1 2 2
56 67 2 1 2 1 2 2
57 1 2 1 1 1 1 1
58 18 2 1 1 1 1 1
59 32 2 1 1 1 1 1
60 38 2 1 1 1 1 161
69 2 1 1 1 11
62 73 2 1 1 1 11
63 76 2 1 1 1 1 1
64 77 2 1 1 1 1 165
79 2 1 1 1 1 1
66 6 2 2 1 1 1 3
67 15 2 2 1 1 1 3
68 23 2 2 1 1 1 3
69 30 2 2 1 1 1 3
70 61 2 2 1 1 1 3
71 81 2 2 1 1 1 3
72 24 2 3 3 3 3 2
Figure 7: Cars’ Second Price Classification
Legend: overpriced cars; under-priced cars
4.1.3. Third car price class results (Luxury)
This price class proved to be the hardest one to classify. This makes sense, since prices of luxury
items in general, and cars in particular, are usually more affected by unobserved characteristics. This is consistent
with previous research [2] on car pricing.
Algorithm M5, the Ward Method, is an exception to the rule and proved to be quite effective in
classifying cars belonging to this class. However, it wrongly classified some samples from the
second price class as belonging to this class.
R S T M1 M2 M3 M4 M5
73 11 3 3 3 2 3 2
74 43 3 3 3 2 3 2
75 52 3 1 2 1 2 2
76 47 3 1 2 1 2 2
77 2 3 1 2 1 2 2
78 10 3 1 2 1 2 2
79 45 3 1 2 1 2 2
80 4 3 1 2 1 1 1
81 46 3 1 2 1 1 1
82 51 3 1 1 1 1 1
Figure 8: Cars’ Third Price Classification
5. Summary and Discussion
Visual presentation of multi-classifications allows the decision maker to identify the right models, not
just within the context of the whole dataset, but also for specific tasks. The car pricing case study
findings reveal that the visual presentation shows the following:
- Which clustering algorithms are suitable for different tasks
- Which pricing categories can be easily identified
- Which cars might be relatively overpriced
- Which cars might be relatively under-priced
- That M5 is best for the problematic third price category
Specifically in our case study we identified the following:
- Average Linkage (between Groups) is a good algorithm for identifying cars belonging to the first two price categories, but it is not the best way to identify cars belonging to the luxury price category.
- The Ward Method is a good algorithm for classifying cars in general and is the only algorithm that identified cars belonging to the luxury price category. It is, however, not the best algorithm for classifying the rest of the categories; e.g. it might identify cars from the second price category as belonging to the luxury price category more often than M1, Average Linkage (between Groups).
- Cars belonging to the luxury price category are difficult to identify using conventional parameters such as classification factors.
- We identified cars that are suspected of being overpriced or under-priced, based on conventional classification factors. Potential buyers can use this as a guide for deciding which car to buy, and it may assist car manufacturers with pricing policies.
This type of supportive model and DSS impacts the ultimate business decision in a significant
manner. Not only can it save critical time, but it also pinpoints irregular sample cases that may
require specific examination. In this way, the decision process focuses on the main issues instead of
wasting time on technical details.
6. Future Research
While this methodology provides an effective tool for DSS, there are diverse directions for further
research, such as applying the association algorithms used in the methodology to additional areas where
multiple proposed clusterings need to be matched. Further directions include optimizing the association
algorithms for better scalability over large numbers of clusters and clustering algorithms, weighting the
different proposed clusters, and finding methods to eliminate, or reduce the weight of, ineffective
clustering algorithms.
7. References
[1] H. Abdi, “Discriminant Correspondence Analysis”, In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Sage., (2007)
[2] S. Berry, J. Levinsohn, and A. Pakes, "Automobile Prices in Market Equilibrium", Econometrica, (1995), 63(4), 841-890.
[3] S. Berry, J. Levinsohn, and A. Pakes, “Differentiated Products Demand Systems from a Combination of Micro and Macro Data: The New Car Market”, Journal of Political Economy, (2004), 112(1), 68-105.
[4] R.M. Bittmann, and R. Gelbard, “Decision-making method using a visual approach for cluster analysis problems; indicative classification algorithms and grouping scope”, Expert Systems, (2007), 24(3), 171-187.
[5] L. Boudjeloud and F. Poulet, "Visual interactive evolutionary algorithm for high dimensional data clustering and outlier detection", Lecture Notes in Artificial Intelligence, (2005), 3518, 426-431.
[6] D. Brownstone, D. Bunch, T. Golob, and W. Ren, “Transactions choice model for forecasting demand for alternative-fueled vehicles”, Research in Transportation Economics, (1996), 87-129.
[7] Cadez, D. Heckerman, C. Meek, P. Smyth and S. White, “Model-Based Clustering and Visualization of Navigation Patterns on a Web Site”, Data Mining and Knowledge Discovery, (2003), 7(4), 399–424.
[8] H.T. Clifford and W. Stevenson, “An Introduction to Numerical Classification”, Academic Press, (1975)
[9] Consumer Reports: The 1993 Cars - Annual Auto Issue, Yonkers, NY: Consumers Union, (1993).
[10] M.C.F. de Oliveira and H. Levkowitz, "From visual data exploration to visual data mining: A survey", IEEE Transactions on Visualization and Computer Graphics, (2003), 9(3), 378-394.
[11] H. Degryse and F. Verboven, “Car Price Differentials in the European Union: An Economic Analysis”, Centre for Economic Policy Research, (2000).
[12] Z. Erlich, R. Gelbard and I. Spiegler, “Data Mining by Means of Binary Representation: A Model for Similarity and Clustering”, Information Systems Frontiers, (2002), 4(2), 187-197.
[13] R.A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics, (1936), 7, 179-188.
[14] R. Gelbard, O. Goldman and I. Spiegler, “Investigating Diversity of Clustering Methods: An Empirical Comparison”, Data & Knowledge Engineering, (2007), 63(1), 155-166.
[15] P.K. Goldberg, “The Effects of the Corporate Average Fuel Efficiency Standards in the US”, The Journal of Industrial Economics, (1998), 46(1), 1-33.
[16] J. Grabmeier and A. Rudolph, “Techniques of Cluster Algorithms in Data Mining”, Data Mining and Knowledge Discovery, (2002), 6(4), 303-360.
[17] A.K. Jain, M.N. Murty and P.J. Flynn, “Data Clustering: A Review”, ACM Computing Surveys, (1999), 31(3), 264-323.
[18] R.H. Lock, “1993 New Car Data”, Journal of Statistics Education, (1993), 1(1).
[19] Y. Kim, “Weighted Order-dependent Clustering and Visualization of Web Navigation Patterns”, Decision Support Systems, (2007), 43(4), 1630-1645.
[20] T. Kohonen, Self-Organizing Maps (Third ed., Vol. 30). Berlin, Heidelberg, New York: Springer, (2001).
[21] D. McFadden, and K. Train, “Mixed MNL Models for Discrete Response”, Journal of Applied Econometrics, (2000), 15(5), 447-470.
[22] PACE New Car & Truck 1993 Buying Guide, Milwaukee, WI: Pace Publications Inc., (1993).
[23] A. Raveh, "Co-plot: A Graphic Display Method for Geometrical Representations of MCDM", European Journal of Operational Research, (2000), 125(3), 670-678.
[24] J. Seo, and B. Shneiderman, “Interactively Exploring Hierarchical Clustering Results”, IEEE Computer, (2002), 35(7), 80-86.
[25] J. Seo, and B. Shneiderman, “A Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data”, Information Visualization, (2005), 4(2), 96-113.
[26] R. Sharan and R. Shamir, "Algorithmic approaches to clustering gene expression data", In Jiang T. et al. (eds): Current Topics in Computational Molecular Biology, MIT Press, (2002), 269–300.
[27] T.R. Shultz, D. Mareschal, and W.C. Schmidt, “Modeling Cognitive Development on Balance Scale Phenomena”, Machine Learning, (1994), 16(1-2), 57-86.
[28] S. Thomassey and A. Fiordaliso, “A Hybrid Sales Forecasting System Based on Clustering and Decision Trees”, Decision Support Systems, (2006), 42(1), 408-421.
[29] N. Wu and J. Zhang, “Factor-analysis Based Anomaly Detection and Clustering”, Decision Support Systems, (2006), 42(1), 375-389.
References

Abdi, H. (2007). Discriminant Correspondence Analysis. In N. J. Salkind (Ed.),
Encyclopedia of Measurement and Statistics (pp. 270-275). Thousand Oaks, CA, USA:
Sage Publications.
Berry, S., Levinsohn, J., & Pakes, A. (1995). Automobile Prices in Market Equilibrium.
Econometrica , 63 (4), 841-890.
Berry, S., Levinsohn, J., & Pakes, A. (2004). Differentiated Products Demand Systems
from a Combination of Micro and Macro Data: The New Car Market. Journal of Political
Economy , 112 (1), 68-105.
Bittmann, R. M., & Gelbard, R. M. (2007). Decision-making method using a visual
approach for cluster analysis problems; indicative classification algorithms and grouping
scope. Expert Systems , 24 (3), 171-187.
Bittmann, R. M., & Gelbard, R. M. (2008). DSS Using Visualization of Multi-Algorithms
Voting. In F. Adam, & P. Humphreys (Eds.), Encyclopedia of Decision Making and
Decision Support Technologies. IGI Global.
Boudjeloud, L., & Poulet, F. (2005). Visual Interactive Evolutionary Algorithm for High
Dimensional Data Clustering and Outlier Detection. In T. B. Ho, D. Cheung, & H. Liu
(Eds.), Advances in Knowledge Discovery and Data Mining (Vol. 3518, pp. 426-431).
Berlin / Heidelberg: Springer.
Brownstone, D., Bunch, D. S., Golob, T. F., & Ren, W. (1996). A Transactions Choice
Model For Forecasting Demand For Alternative-Fuel Vehicles. (S. B. McMullen, Ed.)
Research in Transportation Economics , 4, 87-129.
Cadez, I. V., Heckerman, D., Meek, C., Smyth, P., & White, S. (2003). Model-Based
Clustering and Visualization of Navigation Patterns on a Web Site. Data Mining and
Knowledge Discovery , 7 (4), 399-424.
Chapman, G. B., & Johnson, E. J. (1994). The Limits of Anchoring. Journal of
Behavioral Decision Making , 7 (4), 223-242.
Clifford, H. T., & Stephenson, W. (1975). An Introduction to Numerical Classification.
New York, NY, USA: Academic Press.
Consumer Reports: The 1993 Cars - Annual Auto Issue. (1993). Yonkers, NY, USA:
Consumers Union.
de Oliveira, M. C. F., & Levkowitz, H. (2003). From Visual Data Exploration to Visual
Data Mining: A Survey. IEEE Transactions on Visualization and Computer Graphics, 9
(3), 378-394.
Degryse, H., & Verboven, F. (2000). Car Price Differentials in the European Union: An
Economic Analysis. Centre For Economic Policy Research, London.
Erlich, Z., Gelbard, R. M., & Spiegler, I. (2002). Data Mining by Means of Binary
Representation: A Model for Similarity and Clustering. Information Systems Frontiers , 4
(2), 187-197.
Fisher, R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals
of Eugenics, 7, 179-188.
Gelbard, R. M., Goldman, O., & Spiegler, I. (2007). Investigating Diversity of Clustering
Methods: An Empirical Comparison. Data & Knowledge Engineering , 63 (1), 155-166.
Goldberg, P. K. (1998). The Effects of the Corporate Average Fuel Efficiency Standards
in the US. Journal of Industrial Economics , 46 (1), 1-33.
Grabmeier, J., & Rudolph, A. (2002). Techniques of Cluster Algorithms in Data Mining.
Data Mining and Knowledge Discovery , 6 (4), 303-360.
Henderson, D. (2001). Assessing the Finite-Time Performance of Local Search
Algorithms. PhD Dissertation, Virginia Polytechnic Institute and State University,
Industrial and Systems Engineering, Blacksburg, Virginia.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data Clustering: A Review. ACM
Computing Surveys (CSUR), 31 (3), 264-323.
Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu, A. Y.
(2002). A local search approximation algorithm for k-means clustering. Proceedings of
the eighteenth annual symposium on Computational geometry (pp. 10-18). Barcelona,
Spain: ACM.
Kim, Y. (2007). Weighted Order-dependent Clustering and Visualization of Web
Navigation Patterns. Decision Support Systems , 43 (4), 1630-1645.
Kohonen, T. (2001). Self-Organizing Maps (Third ed., Vol. 30). Berlin, Heidelberg, New
York: Springer.
Lock, R. H. (1993). 1993 New Car Data. Journal of Statistics Education , 1 (1).
McFadden, D., & Train, K. (2000). Mixed MNL Models for Discrete Response. Journal
of Applied Econometrics , 15 (5), 447-470.
PACE New Car & Truck 1993 Buying Guide. (1993). Milwaukee, WI, USA: Pace
Publications Inc.
Raveh, A. (2000). Co-plot: A Graphic Display Method for Geometrical Representations
of MCDM. European Journal of Operational Research , 125 (3), 670-678.
Sensen, N. (1999). Algorithms for a Job-Scheduling Problem within a Parallel Digital
Library. Proceedings of the 1999 International Conference on Parallel Processing (p.
422). Washington, DC, USA: IEEE Computer Society.
Seo, J., & Shneiderman, B. (2005). A Rank-by-Feature Framework for Interactive
Exploration of Multidimensional Data. Information Visualization , 4 (2), 96-113.
Seo, J., & Shneiderman, B. (2002). Interactively Exploring Hierarchical Clustering
Results. Computer , 35 (7), 80-86.
Shamir, R., & Sharan, R. (2002). Algorithmic Approaches to Clustering Gene Expression
Data. In T. Jiang, Y. Xu, & M. Q. Zhang (Eds.), Current Topics in Computational
Molecular Biology (pp. 269-300). Cambridge, MA, USA: MIT Press.
Shultz, T. R., Mareschal, D., & Schmidt, W. C. (1994). Modeling Cognitive
Development on Balance Scale Phenomena. Machine Learning , 16 (1-2), 57-86.
Tang, Q. C., & Cheng, H. K. (2005). Optimal Location and Pricing of Web Services
Intermediary. Decision Support Systems , 40 (1), 129-141.
Thomassey, S., & Fiordaliso, A. (2006). A Hybrid Sales Forecasting System Based on
Clustering and Decision Trees. Decision Support Systems , 42 (1), 408-421.
Wu, N., & Zhang, J. (2006). Factor-analysis Based Anomaly Detection and Clustering.
Decision Support Systems , 42 (1), 375 - 389.
Abstract

Data clustering is a widely used method for decision making in diverse areas such as sales, marketing, and human resources. Currently, decision makers must analyze algorithms and parameters separately in order to determine and prioritize the decisions they face. They have no model or tool that enables comparing the partitions obtained by applying different methods or by using different parameters.

The proposed methodology is based on the Multi-Algorithm Voting (MAV) method, which was developed in order to analyze and present the results obtained from running several algorithms, where each algorithm proposes a different decision. The display uses a Tetris-like format in which the clustering of the data (the decisions) is arranged in a matrix. Each cluster proposed by a given algorithm is represented by a column of that matrix, and each sample is represented by a row. "Local decisions" (decisions regarding a specific sample made by a specific algorithm) are represented by tags in the corresponding cells of the matrix.

The MAV method matches the arbitrarily assigned tags to one another using an optimization algorithm from the family of Local Search algorithms, developed for this research. Each match is represented visually, for example by a distinct color. The colors are kept consistent throughout the matrix, and similar colors, even in different rows, represent the same cluster (decision). Although the algorithm is used here to match partitions within cluster analysis, both the algorithm and the display method developed are general and suit any problem that requires matching and analyzing the dispersion of data.

The MAV method computes the quality of the match for each row, which represents a sample. Without loss of generality, the quality of the match can be computed using the homogeneity or the heterogeneity of the match of a single sample across all the algorithms applied in the data analysis. The best match is presented according to the quality measure used.

The MAV method enables not only the presentation of the results obtained by running a large number of algorithms, but also a quantitative characterization of the quality of the result.
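The per-row homogeneity measure described above can be sketched as follows. This is an illustrative reconstruction under our own assumptions, not the thesis implementation, and `row_homogeneity` is a hypothetical helper name:

```python
# Sketch of the MAV per-sample quality measure, assuming tags have
# already been matched across algorithms, so equal labels in a row
# mean the algorithms agree on that sample.
from collections import Counter

def row_homogeneity(row):
    """Fraction of algorithms agreeing with the row's majority label.

    `row` holds one sample's matched cluster labels, one per algorithm.
    1.0 means all algorithms place the sample in the same cluster.
    """
    counts = Counter(row)
    return max(counts.values()) / len(row)

# MAV matrix: rows = samples, columns = clustering algorithms.
matrix = [
    ["A", "A", "A"],   # full agreement
    ["A", "A", "B"],   # partial agreement
    ["A", "B", "C"],   # no agreement
]
scores = [row_homogeneity(row) for row in matrix]
```

Heterogeneity could be defined analogously, e.g. as one minus this score, and the best match presented per whichever measure is chosen.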
Table of Contents

Abstract (English) ........................................................................................................... i
Introduction .................................................................................................................... 1
1. Introduction ........................................................................................................ 1
2. Research Objectives .......................................................................................... 3
3. Theoretical Background ..................................................................................... 4
3.1 Data Clustering - Algorithms ..................................................................... 4
3.2 Data Clustering - Visualization .................................................................. 7
3.3 "Local Search" Algorithms ...................................................................... 11
4. Research Assumptions .................................................................................... 12
5. The Proposed Model ........................................................................................ 13
5.1 The Principle of the Model ...................................................................... 13
6. Research Methods ........................................................................................... 16
6.1 Research Tools ........................................................................................ 16
6.2 Research Evaluation ................................................................................ 22
7. Research Structure and Publications .............................................................. 23
8. Summary and Discussion ................................................................................ 24
Essay 1 ......................................................................................................................... 25
Essay 2 ......................................................................................................................... 42
Essay 3 ......................................................................................................................... 53
References .................................................................................................................... 75
Abstract (Hebrew) ......................................................................................................... א
Acknowledgments

To Dr. Roy Gelbard, who taught, guided, and supported. A distinguished teacher and a true friend.
To Prof. Abraham Carmeli for his great help.
To Dr. Yuri Zlotnikov, of blessed memory, for his dedication and knowledge. A great man and a great loss.
To my wife Orna and my children Adi and Guy: I am blessed with a wonderful family that gave me the strength to continue.
To my parents Mordechai and Ruth Bittmann and my sister Dr. Irit Bittmann, who have surrounded me with love and support all my life.
To my grandfather Dr. Julius Simon, of blessed memory, and my grandmother Gerda, of blessed memory, who set the bar to which I aspire.

This work was carried out under the supervision of Dr. Roy Gelbard and Prof. Abraham Carmeli of the Graduate School of Business Administration, Bar-Ilan University.