Essay 3 “Visualization of Multi-Algorithm Clustering for Better Economic Decisions - The
Case of Car Pricing”
VISUALIZATION OF MULTI-ALGORITHM CLUSTERING
FOR BETTER ECONOMIC DECISIONS - THE CASE OF CAR PRICING
Abstract
Clustering decisions frequently arise in business applications such as recommendations concerning
products, markets, human resources, etc. Currently, decision makers must analyze diverse algorithms
and parameters on an individual basis in order to establish preferences on the decision-making issues
they face, because there is no supportive model or tool that enables comparing the different result-clusters
generated by these algorithm and parameter combinations.
The Multi-Algorithm-Voting (MAV) methodology enables not only visualization of the results
produced by diverse clustering algorithms, but also quantitative analysis of those results.
The current research applies the MAV methodology to the case of recommending new-car pricing.
The findings illustrate the impact and the benefits of such a decision support system.
Key words: Decision Making, Decision Support System, Cluster Analysis, Visualization techniques,
Multi-Algorithm-Voting, Pricing.
1. Introduction
Unsupervised clustering decisions, i.e. decisions involving the classification of samples without prior
knowledge of the exact number of clusters, frequently arise in business applications such as finance (pricing), computer
science (image processing), marketing (market segmentation), and medicine (diagnostics) among others
[7] [8] [12] [17] [26] [28] [29] .
Currently, researchers, decision makers and business analysts must test and analyze diverse algorithms
and parameters on an individual basis in order to establish preferences to make decisions
about the problems they face. However, supportive models or tools to help them compare different
result-clusters produced by these algorithm and parameter combinations are very limited. Commercial
products neither show the resulting clusters of multiple methods nor give the decision maker tools with
which to analyze and compare the outcomes of various analyses.
Furthermore, visualization of the dataset and its classification is virtually impossible when more than
three attributes are used, as is the case in many financial problems, since displaying the dataset in such a
case requires dropping some of the attributes, or using a method to display the dataset distribution over
four (or more) dimensions. This makes it very difficult to relate to the dataset samples. In particular it is
hard to determine which of these samples might be difficult to classify, even when they are classified
correctly, and which samples and clusters stand out clearly [5] [10] [16] [19] [27] .
We developed a methodology called Multi-Algorithm Voting (MAV), which overcomes these
shortcomings by using a “Tetris-like” visualization format, which enables a cross-algorithm presentation
[4] . The “Tetris-like” format is composed of rows, columns and colors; each column represents a
specific algorithm, each line represents a specific sample case, and each color represents a “Vote” (i.e.,
decision suggestion, formed by a specific algorithm for a specific sample case).
In this article we apply the MAV methodology and its visualization approach to a common financial and
marketing recommendation problem: the dilemma of car pricing. Pricing of consumer
products in general, and pricing of cars in particular, is an important factor in the success of the product
over its launch and lifecycle, which is why the topic of car pricing is well researched. Within this context
the following advantages are discussed:
- Visual presentation of multiple classification options, resulting from diverse algorithms, using tools developed specifically for this purpose.
- Identification of areas (with respect to the car pricing problem) where the clustering is effective and areas where it is less effective.
- Identification of irregular samples that may indicate difficult pricing and positioning of the product.
- Identification of the most effective algorithms for the tested dataset and the pricing problem at hand.
2. Theoretical Background
2.1. Cluster Analysis
In order to classify a dataset of samples according to a given set of attributes, a decision maker uses
algorithms that process the attributes of the dataset samples and associate them with suggested clusters.
These associations are obtained by calculating a likelihood measure, which indicates the likelihood of a
sample to be associated with a certain cluster.
The current research uses hierarchical clustering methods. These algorithms take the dataset attributes
that need to be clustered and start by classifying the dataset so that each sample represents its own
cluster. They then merge the clusters in steps, each step merging two clusters into a single cluster, until
only one cluster (the whole dataset) remains. The algorithms differ in the way distance is measured
between clusters, mainly by using two parameters: (1) the distance or a likelihood measure, e.g.
Euclidean, Dice, etc.; and (2) the cluster method, e.g. Between-Group Linkage, Nearest-Neighbor, etc.
[14] [17] .
Five hierarchical algorithms were used in this study to classify the datasets. In all of them the commonly
used squared Euclidean distance was used as the likelihood measure. It calculates the
distance between two samples as the sum of the squared differences between their
attributes.
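As a minimal illustration (attribute vectors hypothetical), the squared Euclidean measure simply sums the squared differences between corresponding attributes, with no square root taken:

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance: the sum of squared differences
    between corresponding attributes (no square root is taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two hypothetical samples described by three numeric attributes
print(squared_euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 3^2 + 4^2 + 0^2 = 25.0
```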
As seen above, the algorithms and the likelihood measures differ in their definition of the task, i.e. the
clusters are different and the distance of a sample from a cluster is measured differently. Thus the
resulting dataset classification differs although there is no obvious dependency among the applied
algorithms [14] . The analysis becomes even more complicated if the true classification is unknown and
the decision maker has no means of identifying the core of the correct classification or the samples that
are difficult to classify.
2.2. Cluster Analysis Visualization
Currently, datasets are analyzed according to the following method:
- The decision maker selects the best classification algorithm based on his or her experience and knowledge of the dataset and problem at hand.
- The decision maker tunes the chosen classification algorithm by determining parameters such as the likelihood measure.
- The decision maker applies the algorithm to the dataset using one of the following options:
  - Predetermination of a fixed number of clusters to divide the dataset into (supervised classification).
  - Deciding on the preferred number of clusters to classify the dataset into based on the algorithm output (unsupervised classification).
Presently, there are a limited number of visual aids to help the decision maker with the analysis of the
clustering results. These methods are discussed below.
2.2.1. Visualization - Dendrogram
Clustering results can be displayed in numerical tables, in 2D and 3D graphs, and when hierarchical
classification algorithms are applied, also in a dendrogram.
A dendrogram is a tree-like graph that presents the entire “clustering space”, i.e. the merging of clusters
from the initial case, where each sample is a separate cluster, to the total merger, where the whole
dataset is one cluster. The lines connecting clusters in a dendrogram represent clusters that are joined,
while the length of the connecting lines represents the likelihood coefficient for a merger. The shorter
the distance, the greater the likelihood that the clusters will merge. Though the dendrogram provides the
decision maker with some sort of visual representation, the information in the dendrogram relates to the
chosen algorithm and does not compare or utilize additional algorithms. The information itself serves as
a visual aid to joining clusters, however the dendrogram does not provide a clear indication of
inconsistent samples in the sense that while a certain sample was classified to belong to a certain cluster,
this classification might not be accurate and the sample may actually belong to a different cluster.
The dendrogram is a common visual aid used by decision makers, but it is not applicable to all algorithms.
Among the tools that utilize the dendrogram visual aid is the Hierarchical Clustering Explorer. This tool
attempts to deal with the multidimensional presentation of datasets with multiple variables. It produces
the dashboard in Figure 1, built around the dendrogram, which shows the classification process of
hierarchical clustering, together with a scatter plot that offers a human-readable presentation of the dataset but is
limited to two variables [24] [25] .
Figure 1: HCE Dashboard [25]
Although dendrograms are a popular tool, it is important to note that a dendrogram can only represent a
single algorithm at a time and cannot compare or utilize multiple algorithms simultaneously. Hence, a
dendrogram cannot single out unusual cases and this may result in a misleading interpretation and
inaccurate clustering.
2.2.2. Visualization - Discriminant Analysis & Factor Analysis
The problem of clustering may be perceived as finding functions applied to the variables that
discriminate between samples and decide on cluster membership. Since usually there are more than two
or even three variables it is difficult to visualize the samples in such multidimensional spaces. Some
methods use discriminating functions, which are a transformation of the original variables, and present
them on two-dimensional plots. Discriminant function analysis is analogous to multiple regression.
Two-group discriminant analysis is also called Fisher linear discriminant analysis [13] . In general, in
this approach we fit a linear equation of the type:
Group = a + b1*x1 + b2*x2 + ... + bm*xm
Where: a is a constant and b1 through bm are regression coefficients.
The variables (attributes) with significant regression coefficients are the ones that contribute most to the
prediction of group membership. However, these coefficients do not tell us between which groups the
respective functions discriminate. The means of the functions across the groups identify which groups each function discriminates between.
This can be visualized by plotting the individual scores for the discriminant functions, as illustrated in
Figure 2.
Figure 2: Discriminant Analysis of Fisher’s Iris Dataset [1]
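As a sketch of fitting an equation of this type, the single-predictor case (m = 1) can be solved by ordinary least squares; the toy attribute values and group codes below are hypothetical, and a real discriminant analysis would use all m attributes:

```python
def fit_line(xs, groups):
    """Least-squares fit of group = a + b*x, the single-predictor
    case of the discriminant-style linear equation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_g = sum(groups) / n
    b = (sum((x - mean_x) * (g - mean_g) for x, g in zip(xs, groups))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_g - b * mean_x
    return a, b

# Hypothetical attribute values for six samples in two groups
xs = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
groups = [1, 1, 1, 2, 2, 2]
a, b = fit_line(xs, groups)
# A clearly positive b indicates this attribute helps discriminate the groups
```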
Factor analysis is another way to determine which variables (attributes) define a particular discriminant
function. These correlations can be regarded as factor loadings of the variables on each
discriminant function. Figure 3 illustrates the visualization of both correlations between the variables in
the model (using adjusted Factor Analysis), and discriminant functions using a tool that combines these
two methods [1] [23] . Each ray represents one variable (attribute). The angle between any two rays
represents the correlation between these variables (possible factors).
Figure 3: Discriminant Analysis & Factor Analysis of Fisher’s Iris Dataset [23]
2.2.3. Visualization - Self-Organizing Map
Self-Organizing Map (SOM) is another method, based on neural network methods, for clustering data.
This method is an iterative one that flattens the multidimensional data into one or two dimensions, thus
identifying similar clusters by visual attributes such as distance and color [20] . This method presents its
clustering recommendation in a visual manner, but the attributes of the data are not well presented
especially when there are many of them. Research on the presentation of SOM has suggested an
extended presentation of the maps, showing not just each sample's representative color but also how that
color is constructed from the sample's attributes [19] . An example of such presentations can be seen in
Figure 4.
Figure 4: SOM presentation and extended presentation [19]
2.2.4. Cluster Analysis Visualization - Discussion
As described above, these methodologies support visualization of a specific classification, based on a
single set of parameters. For this reason, current methodologies are usually incapable of making
comparisons between different algorithms and leave the decision regarding which algorithm to
choose to the decision maker. Furthermore, most visual aids, though providing a visual interpretation of
the classification by the method of choice, lose some of the relevant information along the way, as in the
case of Discriminant Analysis, where the actual relations among the dataset variables are lost when
projected onto the two-dimensional space.
This leaves the decision maker with very limited visual assistance and makes a full view of the relations
between the samples and a comparison between the dataset classifications difficult.
2.3. Cluster Analysis using MAV
2.3.1. The “Tetris-like” format
As said earlier, although dendrograms are a popular tool, they can only represent a single method at a
time and cannot compare or utilize multiple algorithms simultaneously. Hence, a dendrogram cannot
single out unusual cases and this may result in a misleading interpretation and inaccurate clustering.
MAV overcomes these shortcomings by enabling a cross-algorithm presentation in which all clusters are
presented together in a “Tetris-like format” in which each column represents a specific algorithm, each
line represents a specific sample case, and each color represents a “Vote” (i.e., decision suggestion,
formed by a specific algorithm for a specific sample case).
Consider the following illustration. In Figure 5, there are seven algorithms, denoted A1 to A7, and five
samples with a numerical ID denoted S. The three gray scale colors represent three distinctive
categorizations. Samples 174, 175 and 178 have an identical pattern determined by six out of seven
algorithms that voted for dark-gray color categorizations, and one algorithm that voted for the mid-gray
colored categorization. Samples 176 and 177 have identical patterns composed of three colors. Four
algorithms voted for the mid-gray colored categorization, two algorithms voted for the dark-gray color
categorization, and one algorithm voted for the light-gray color categorization. Alternatively, there could
be a case where all seven algorithms concur (i.e., vote for the same color categorization). By rearranging
the line orders, case 178 would be associated with cases 174 and 175 because all three have the same
color pattern. Thus, we would obtain two clusters: one consisting of samples 174, 175, and 178 and the
other consisting of samples 176 and 177. Finally, we could say that the first cluster consists of three samples
that represent the dark-gray category. The second cluster consists of two samples (176 and 177) that
did not achieve the same level of agreement as in the other case: there is a majority for the mid-gray
color, but it could be claimed that both the dark- and light-gray categorizations have an influence.
Clearly, there are other potential situations, such as a case in which there is total agreement, or a case in which
no distinctive categorization emerges (e.g. three dark-gray votes, three mid-gray
votes, and one light-gray vote).
Here, we seek to minimize the heterogeneity meter representing the voting consensus. A decision about
which cluster a decision maker should decide to adopt should be based on the level of heterogeneity vs.
homogeneity. The sorted “Tetris block” diagram gives the decision maker a clear and explicit indication
of which cluster should be adopted. As such, a cluster with the minimal heterogeneity (maximum
homogeneity) should be adopted. This resolves the problem of arbitrary decisions concerning the
number of clusters, that is, of where to “cut” a dendrogram.
Figure 5: Tetris-like format (unsorted)
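The Figure 5 scenario can be reproduced with a short sketch (vote values and sample IDs taken from the example above; "D", "M" and "L" stand for the dark-, mid- and light-gray categorizations):

```python
from collections import Counter

# Vote matrix: rows = samples, columns = algorithms A1..A7
votes = {
    174: ["D", "D", "D", "M", "D", "D", "D"],
    175: ["D", "D", "D", "M", "D", "D", "D"],
    176: ["M", "M", "D", "L", "M", "D", "M"],
    177: ["M", "M", "D", "L", "M", "D", "M"],
    178: ["D", "D", "D", "M", "D", "D", "D"],
}

# Rearranging the line order so identical vote patterns sit together
# yields the sorted "Tetris" diagram: sample 178 joins 174 and 175.
clusters = {}
for sample in sorted(votes, key=lambda s: votes[s]):
    clusters.setdefault(tuple(votes[sample]), []).append(sample)

for pattern, members in clusters.items():
    majority = Counter(pattern).most_common(1)[0][0]
    print(members, "-> majority vote:", majority)
```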
2.3.2. Voting and the Heterogeneity Meter
In order to find the best association, the Heterogeneity Meter needs to be minimized, i.e. identify the
association that makes the votes for each sample as homogeneous as possible.
The Heterogeneity Meter is then used to sort the Voting Matrix, giving the decision maker a clear, two-
dimensional perspective of the clusters and indicating how well each sample is associated with its
designated cluster. Several meters can be used to calculate the Heterogeneity Meter, including the
following:
Squared Vote Error (SVE) is calculated as the square sum of all the algorithms [votes] that did not
vote for the chosen classification. It is calculated as follows:
H = Σ_{i=1}^{n} (N − M_i)^2    (1)
Equation 1: SVE Heterogeneity Meter
Where:
H – is the Heterogeneity Meter
N – is the number of algorithms voting for the sample
M – is the maximum number of similar votes according to a specific association obtained for a single sample
i – is the sample number
n – is the total number of samples in the dataset
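A direct reading of Equation 1 in code (the vote-matrix layout, one list of votes per sample, is our assumption):

```python
from collections import Counter

def sve(vote_matrix):
    """Squared Vote Error (Equation 1): for each sample, square the
    number of algorithms that did not join the majority vote, then sum."""
    h = 0
    for sample_votes in vote_matrix:
        n = len(sample_votes)                           # N: algorithms voting
        m = Counter(sample_votes).most_common(1)[0][1]  # M: max similar votes
        h += (n - m) ** 2
    return h

# Figure 5 example: three samples with 6-of-7 agreement contribute 1 each,
# two samples with 4-of-7 agreement contribute 9 each: H = 3*1 + 2*9 = 21
matrix = [["D"] * 6 + ["M"]] * 3 + [["M"] * 4 + ["D", "D", "L"]] * 2
print(sve(matrix))  # 21
```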
Distance From Second Best (DFSB) is calculated as the difference in the number of votes that the best
vote, i.e. the vote common to most algorithms, received and the number of votes the second best vote
received. The idea is to find out how much separates the best vote from the rest. This is actually a
Homogeneity meter as a higher score indicates less heterogeneity. It is calculated as follows:
H = Σ_{i=1}^{n} (B_i − SB_i)    (2)
Equation 2: DFSB Homogeneity Meter
Where:
H – is the Homogeneity Meter
B – is the Best cluster, i.e. the cluster voted for most times, for a given sample
SB – is the Second Best cluster for a given sample
i – is the sample number
n – is the total number of samples in the dataset
To maintain consistency in the association of the clusters, the negative of the DFSB meter is used,
turning it into a Heterogeneity meter.
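The negated DFSB meter of Equation 2 can be sketched the same way (the same hypothetical vote-matrix layout, one list of votes per sample):

```python
from collections import Counter

def dfsb(vote_matrix):
    """Negated Distance-From-Second-Best (Equation 2): subtract, for each
    sample, the gap between its best and second-best vote counts, so that
    lower (more negative) values mean less heterogeneity."""
    h = 0
    for sample_votes in vote_matrix:
        counts = [c for _, c in Counter(sample_votes).most_common()]
        best = counts[0]
        second_best = counts[1] if len(counts) > 1 else 0
        h -= best - second_best
    return h

# Figure 5 example: gaps of 6-1=5 (three samples) and 4-2=2 (two samples)
matrix = [["D"] * 6 + ["M"]] * 3 + [["M"] * 4 + ["D", "D", "L"]] * 2
print(dfsb(matrix))  # -(3*5 + 2*2) = -19
```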
We used the SVE meter to associate the algorithm clusters. This meter yields more clearly associated
clusters than the DFSB meter, which emphasizes the best-associated samples. Using the SVE meter, the
decision maker can identify which samples belong to which cluster with the highest significance. Thus
the methodology enables the classification of the dataset and the distribution of the samples within each
cluster for further analysis.
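The association step itself, in which each algorithm's arbitrary cluster labels are relabeled so that the votes line up, can be sketched as a brute-force permutation search. This is a greedy, column-by-column simplification under the SVE meter; the actual MAV association procedure may differ, and the data layout is hypothetical:

```python
from collections import Counter
from itertools import permutations

def associate(columns, labels=(1, 2, 3)):
    """Relabel each algorithm's cluster labels (columns[j][i] = label that
    algorithm j gave sample i) to minimize the SVE meter, trying every
    label permutation per column, greedily from left to right."""
    def sve(cols):
        total = 0
        for row in zip(*cols):                       # votes for one sample
            m = Counter(row).most_common(1)[0][1]
            total += (len(row) - m) ** 2
        return total

    aligned = [list(columns[0])]  # the first column fixes the reference labels
    for col in columns[1:]:
        best = None
        for perm in permutations(labels):
            mapping = dict(zip(labels, perm))
            candidate = [mapping[v] for v in col]
            score = sve(aligned + [candidate])
            if best is None or score < best[0]:
                best = (score, candidate)
        aligned.append(best[1])
    return aligned

# Two algorithms that agree perfectly except for swapped label names
print(associate([[1, 1, 2, 2], [2, 2, 1, 1]], labels=(1, 2)))
# [[1, 1, 2, 2], [1, 1, 2, 2]]
```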
2.4. The case of Car Pricing
Cars are a popular commodity, and there are numerous pricing models. This is why we chose to
demonstrate the implementation of the proposed methodology on car pricing.
Previous research has modeled car characteristics into frameworks to estimate their effect on car pricing
for new car models [2] [3] , car price comparisons between different countries [11] or the effect of
Corporate Average Fuel Economy (CAFE) Standards regulations on the automobile industry and on car
prices [15] . Other studies have applied the Mixed Multi-Nomial-Logit (MNL) model on automobiles to
estimate the penetration of alternatives to fuel vehicles [6] [21] .
The fact that this is a well-researched area, using, as in the case of cluster analysis, models based on
sample characteristics to estimate the distribution, encouraged us to demonstrate the value and
capabilities of the proposed methodology by analyzing a well-known car characteristics dataset. This
application visualizes the dataset and provides the decision maker with a display of the full dataset
showing trends and anomalies in an easily grasped format. When used in the initial stages of a study,
this approach provides the decision maker with tools to rapidly decide on what to concentrate, and a
simple means to communicate this to fellow researchers. The researcher, as an expert in the field, can
then fine-tune the findings through different algorithms, different likelihood meters or different
characteristics; in all cases, however, the procedure and the presentation remain the same.
3. Research Objectives and Environment
The objective of this study is to give the decision maker a visual aid, difficult to achieve otherwise, showing
the distribution of car market prices in 1993, as presented by the dataset at hand. This is a new
approach to modeling in the car industry, which is usually based on a specific target.
The dataset contains new car model specifications for cars sold in the US in the year 1993. This dataset
appeared in the Journal of Statistics Education [18] and contains random car model data collected from
Consumer Reports: The 1993 Cars - Annual Auto Issue (April 1993) and PACE New Car & Truck 1993
Buying Guide (1993) [9] [22] . The source eliminated Pickup trucks and Sport/Utility Vehicles (SUV)
since their information was incompatible with the rest of the cars. We also eliminated models where
information was incomplete in the original dataset.
The cars in the dataset were classified into three price classes, using a commonly used classification into
three categories:
- Economy cars, costing less than $15,000
- Middle-class cars, costing more than $15,000 but less than $30,000
- Luxury cars, costing more than $30,000
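This three-way ground-truth classification amounts to a simple binning rule; the handling of cars costing exactly $15,000 or $30,000 is our assumption, since the source gives only strict bounds:

```python
def price_class(price_usd):
    """Ground-truth price class T: 1 = Economy, 2 = Middle, 3 = Luxury."""
    if price_usd < 15_000:
        return 1  # Economy
    if price_usd < 30_000:
        return 2  # Middle (boundary handling assumed)
    return 3      # Luxury

print(price_class(12_500), price_class(21_000), price_class(45_000))  # 1 2 3
```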
We used the following parameters for each car to perform the clustering:
- The car manufacturer
- The number of air bags in the car
- The number of cylinders in the car engine
- The car's engine size
- The car's horsepower
- The car's transmission type
- The car's fuel tank capacity
- The car's passenger capacity
- The car's length
- The car's wheel base
- The car's width
- The car's rear seat capacity
- The car's luggage capacity
- The car's origin: domestic or foreign
We performed the classification using the following five algorithms (marked as M1 - M5) via SPSS
version 13.0 for Windows.
M1 - Average Linkage (between Groups)
This method calculates the distance between two clusters by applying the likelihood measure to all
the samples of one cluster compared with all the samples of the other cluster. The
two clusters with the best likelihood measure are then united.
M2 - Average Linkage (within Groups)
This method calculates the distance between two clusters by applying the likelihood measure to all
the samples in the two clusters. The clusters with the best average likelihood measure are then
united.
M3 - Single Linkage
This method, as in the average linkage (between groups) method, calculates the distance between
two clusters by applying the likelihood measure to all the samples of one cluster and then
comparing it with all the samples of the other cluster. The two clusters with the best likelihood
measure, from a pair of samples, are united.
M4 - Median Method
This method calculates the median of each cluster. The likelihood measure is applied to the
medians of the clusters, after which the clusters with the best median likelihood are then united.
M5 - Ward Method
This method calculates the centroid for each cluster and the square of the likelihood measure of
each sample in both the cluster and the centroid. The two clusters which, when united, have the
smallest (negative) effect on the sum of likelihood measures are the clusters that are
united.
These algorithms were chosen after initial implementation of multiple hierarchical algorithms on the car
pricing dataset and selecting the algorithms that were able to produce reasonable classification results.
For all classifications, we used the Squared Euclidean likelihood measure.
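The classifications themselves were produced in SPSS. Purely as an illustrative sketch, M3 (Single Linkage) with the squared Euclidean measure can be written in a few lines of Python (the sample data are hypothetical):

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def single_linkage(samples, n_clusters):
    """Agglomerative single linkage (M3): start with one cluster per
    sample and repeatedly unite the pair of clusters whose closest
    pair of samples has the smallest squared Euclidean distance."""
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(squared_euclidean(samples[i], samples[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]   # unite the closest pair of clusters
        del clusters[b]
    return clusters

# Two tight hypothetical groups of attribute vectors
samples = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(single_linkage(samples, 2))  # [[0, 1], [2, 3]]
```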
The True classification is marked as T for reference; Sample ID is marked as S and the Row number as
R for reference.
Color-coding was used to distinguish between the different classes so that the same color denotes the
same classification after algorithm association; the class numbers in each row are arbitrary numbers
resulting from the individual classification of each algorithm.
Microsoft Excel 2003 was used to perform the analysis and the visualization of the Vote Matrix.
4. Results
4.1. Classification Visual Results
After performing the classification, we ran the model to associate the different classification results
using the SVE method. The results were as follows:
4.1.1. First car price class results (Economy)
Figure 6 shows that M3, Single Linkage, was unable to match this class correctly because this
algorithm classified nearly all the samples as belonging to the same class.
In general, apart from the Single Linkage algorithm, this class was easy for all algorithms to classify,
although samples 54 and 60 were classified as belonging to the second class by some algorithms. This
classification may indicate that these models are under-priced.
Algorithm M4, the Median method, correctly classified all the cars belonging to this class, an indication
that this algorithm is a good candidate if the decision maker needs to find models belonging to this class.
However, as we will see later, using this algorithm may result in faulty classification - models belonging
to other classes may be assigned to this class.
R S T M1 M2 M3 M4 M5
1 12 1 2 1 1 1 3
2 13 1 2 1 1 1 3
3 20 1 2 1 1 1 3
4 21 1 2 1 1 1 3
5 22 1 2 1 1 1 3
6 25 1 2 1 1 1 3
7 27 1 2 1 1 1 3
8 29 1 2 1 1 1 3
9 31 1 2 1 1 1 3
10 34 1 2 1 1 1 3
11 35 1 2 1 1 1 3
12 37 1 2 1 1 1 3
13 39 1 2 1 1 1 3
14 41 1 2 1 1 1 3
15 48 1 2 1 1 1 3
16 49 1 2 1 1 1 3
17 53 1 2 1 1 1 3
18 55 1 2 1 1 1 3
19 57 1 2 1 1 1 3
20 63 1 2 1 1 1 3
21 64 1 2 1 1 1 3
22 65 1 2 1 1 1 3
23 70 1 2 1 1 1 3
24 71 1 2 1 1 1 3
25 72 1 2 1 1 1 3
26 74 1 2 1 1 1 3
27 75 1 2 1 1 1 3
28 78 1 2 1 1 1 3
29 28 1 1 1 1 1 1
30 40 1 1 1 1 1 1
31 42 1 1 1 1 1 1
32 54 1 1 2 1 1 1
33 60 1 1 2 1 1 1
Figure 6: Cars’ First Price Classification
4.1.2. Second car price class results (Middle)
Figure 7 shows that algorithm M3, Single Linkage, identified this class correctly in most cases, but this
is because it did not work well with the dataset and identified most of the samples as belonging to
middle class cars with very few exceptions (three).
This class was also classified correctly most of the time by the different algorithms. Using algorithm
M1, Average Linkage (between Groups), is the best classification choice. In addition, the cars in
rows 66-71 were classified as belonging to the first price class by most algorithms, suggesting that
they may be overpriced. On the other hand, the car in row 72 was classified as belonging to price
class three, suggesting that it is under-priced.
R S T M1 M2 M3 M4 M5
34 8 2 1 2 1 2 1
35 16 2 1 2 1 2 1
36 33 2 1 2 1 2 1
37 80 2 1 2 1 2 1
38 3 2 1 2 1 1 1
39 7 2 1 2 1 1 1
40 9 2 1 2 1 1 1
41 14 2 1 2 1 1 1
42 17 2 1 2 1 1 1
43 19 2 1 2 1 1 1
44 36 2 1 2 1 1 1
45 50 2 1 2 1 1 1
46 58 2 1 2 1 1 1
47 59 2 1 2 1 1 1
48 62 2 1 2 1 1 1
49 66 2 1 2 1 1 1
50 68 2 1 2 1 1 1
51 82 2 1 2 1 1 1
52 5 2 1 2 1 2 2
53 26 2 1 2 1 2 2
54 56 2 1 2 1 2 2
55 44 2 1 2 1 2 2
56 67 2 1 2 1 2 2
57 1 2 1 1 1 1 1
58 18 2 1 1 1 1 1
59 32 2 1 1 1 1 1
60 38 2 1 1 1 1 161
69 2 1 1 1 11
62 73 2 1 1 1 11
63 76 2 1 1 1 1 1
64 77 2 1 1 1 1 165
79 2 1 1 1 1 1
66 6 2 2 1 1 1 3
67 15 2 2 1 1 1 3
68 23 2 2 1 1 1 3
69 30 2 2 1 1 1 3
70 61 2 2 1 1 1 3
71 81 2 2 1 1 1 3
72 24 2 3 3 3 3 2
Figure 7: Cars’ Second Price Classification
Legend: overpriced cars; under-priced cars
4.1.3. Third car price class results (Luxury)
This price class proved to be the hardest one to classify. This makes sense, since prices of luxury
items in general, and cars in particular, are usually more affected by unobserved characteristics. This is consistent
with previous research [2] on car pricing.
Algorithm M5, the Ward Method, is an exception to the rule and proved to be quite effective in
classifying cars belonging to this class. However, it wrongly classified some samples from the
second price class as belonging to this class.
R S T M1 M2 M3 M4 M5
73 11 3 3 3 2 3 2
74 43 3 3 3 2 3 2
75 52 3 1 2 1 2 2
76 47 3 1 2 1 2 2
77 2 3 1 2 1 2 2
78 10 3 1 2 1 2 2
79 45 3 1 2 1 2 2
80 4 3 1 2 1 1 1
81 46 3 1 2 1 1 1
82 51 3 1 1 1 1 1
Figure 8: Cars’ Third Price Classification
5. Summary and Discussion
Visual presentation of multi-classifications allows the decision maker to identify the right models, not
just within the context of the whole dataset, but also for specific tasks. The car pricing case study
findings reveal that the visual presentation shows the following:
- Which clustering algorithms are suitable for different tasks
- Which pricing categories can be easily identified
- Which cars might be relatively overpriced
- Which cars might be relatively under-priced
- That M5 is best for the problematic third price category
Specifically in our case study we identified the following:
- Average Linkage (between Groups) is a good algorithm for identifying cars belonging to the first two price categories, but it is not the best way to identify cars belonging to the luxury price category.
- The Ward Method is a good algorithm for classifying cars in general and is the only algorithm that identified cars belonging to the luxury price category. It is, however, not the best algorithm for classifying the rest of the categories; e.g. it might identify cars from the second price category as belonging to the luxury price category more often than M1, Average Linkage (between Groups).
- Cars belonging to the luxury price category are difficult to identify using conventional parameters such as classification factors.
- We identified cars that are suspected of being overpriced or under-priced, based on conventional classification factors. Potential buyers can use this as a guide for deciding which car to buy, and it may assist car manufacturers with pricing policies.
This type of supportive model and DSS impacts the ultimate business decision in a significant
manner. Not only can it save critical time, but it also pinpoints irregular sample cases that may
require specific examination. In this way, the decision process focuses on the main issues instead of
wasting time on technical details.
6. Future Research
While this methodology provides an effective tool for DSS, there are diverse directions for further
research, such as applying the association algorithms used in the methodology to additional areas where
multiple proposed clusterings need to be matched. Further directions include optimizing the association
algorithms for better scalability over large numbers of clusters and clustering algorithms, weighting the
different proposed clusters, and finding methods to eliminate, or reduce the weight of, ineffective
clustering algorithms.
7. References
[1] H. Abdi, “Discriminant Correspondence Analysis”, In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Sage., (2007)
[2] S. Berry, J. Levinsohn, and A. Pakes, "Automobile Prices in Market Equilibrium", Econometrica, (1995), 63(4), 841-890.
[3] S. Berry, J. Levinsohn, and A. Pakes, “Differentiated Products Demand Systems from a Combination of Micro and Macro Data: The New Car Market”, Journal of Political Economy, (2004), 112(1), 68-105.
[4] R.M. Bittmann, and R. Gelbard, “Decision-making method using a visual approach for cluster analysis problems; indicative classification algorithms and grouping scope”, Expert Systems, (2007), 24(3), 171-187.
[5] L. Boudjeloud and F. Poulet, "Visual interactive evolutionary algorithm for high dimensional data clustering and outlier detection", Lecture Notes in Artificial Intelligence, (2005), 3518, 426-431.
[6] D. Brownstone, D. Bunch, T. Golob, and W. Ren, “Transactions choice model for forecasting demand for alternative-fueled vehicles”, Research in Transportation Economics, (1996), 87-129.
[7] Cadez, D. Heckerman, C. Meek, P. Smyth and S. White, “Model-Based Clustering and Visualization of Navigation Patterns on a Web Site”, Data Mining and Knowledge Discovery, (2003), 7(4), 399–424.
[8] H.T. Clifford and W. Stevenson, “An Introduction to Numerical Classification”, Academic Press, (1975)
[9] Consumer Reports: The 1993 Cars - Annual Auto Issue, Yonkers, NY: Consumers Union, (1993).
[10] M.C.F. de Oliveira and H. Levkowitz, "From visual data exploration to visual data mining: A survey", IEEE Transactions on Visualization and Computer Graphics, (2003), 9(3), 378-394.
[11] H. Degryse and F. Verboven, “Car Price Differentials in the European Union: An Economic Analysis”, Centre for Economic Policy Research, (2000).
[12] Z. Erlich, R. Gelbard and I. Spiegler, “Data Mining by Means of Binary Representation: A Model for Similarity and Clustering”, Information Systems Frontiers, (2002), 4(2), 187-197.
[13] R.A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics, (1936), 7, 179-188.
[14] R. Gelbard, O. Goldman and I. Spiegler, “Investigating Diversity of Clustering Methods: An Empirical Comparison”, Data & Knowledge Engineering, (2007), 63(1), 155-166.
[15] P.K. Goldberg, “The Effects of the Corporate Average Fuel Efficiency Standards in the US”, The Journal of Industrial Economics, (1998), 46(1), 1-33.
[16] J. Grabmeier and A. Rudolph, “Techniques of Cluster Algorithms in Data Mining”, Data Mining and Knowledge Discovery, (2002), 6(4), 303-360.
[17] A.K. Jain, M.N. Murty and P.J. Flynn, “Data Clustering: A Review”, ACM Computing Surveys, (1999), 31(3), 264-323.
[18] R.H. Lock, “1993 New Car Data”, Journal of Statistics Education, (1993), 1(1).
[19] Y. Kim, “Weighted Order-dependent Clustering and Visualization of Web Navigation Patterns”, Decision Support Systems, (2007), 43(4), 1630-1645.
[20] T. Kohonen, Self-Organizing Maps (Third ed., Vol. 30). Berlin, Heidelberg, New York: Springer, (2001).
[21] D. McFadden, and K. Train, “Mixed MNL Models for Discrete Response”, Journal of Applied Econometrics, (2000), 15(5), 447-470.
[22] PACE New Car & Truck 1993 Buying Guide, Milwaukee, WI: Pace Publications Inc., (1993).
[23] A. Raveh, "Co-plot: A Graphic Display Method for Geometrical Representations of MCDM", European Journal of Operational Research, (2000), 125(3), 670-678.
[24] J. Seo, and B. Shneiderman, “Interactively Exploring Hierarchical Clustering Results”, IEEE Computer, (2002), 35(7), 80-86.
[25] J. Seo, and B. Shneiderman, “A Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data”, Information Visualization, (2005), 4(2), 96-113.
[26] R. Sharan and R. Shamir, "Algorithmic approaches to clustering gene expression data", In Jiang T. et al. (eds): Current Topics in Computational Molecular Biology, MIT Press, (2002), 269–300.
[27] T.R. Shultz, D. Mareschal, and W.C. Schmidt, “Modeling Cognitive Development on Balance Scale Phenomena”, Machine Learning, (1994), 16(1-2), 57-86.
[28] S. Thomassey and A. Fiordaliso, “A Hybrid Sales Forecasting System Based on Clustering and Decision Trees”, Decision Support Systems, (2006), 42(1), 408-421.
[29] N. Wu and J. Zhang, “Factor-analysis Based Anomaly Detection and Clustering”, Decision Support Systems, (2006), 42(1), 375-389.
References

Abdi, H. (2007). Discriminant Correspondence Analysis. In N. J. Salkind (Ed.),
Encyclopedia of Measurement and Statistics (pp. 270-275). Thousand Oaks, CA, USA:
Sage Publications.
Berry, S., Levinsohn, J., & Pakes, A. (1995). Automobile Prices in Market Equilibrium.
Econometrica , 63 (4), 841-890.
Berry, S., Levinsohn, J., & Pakes, A. (2004). Differentiated Products Demand Systems
from a Combination of Micro and Macro Data: The New Car Market. Journal of Political
Economy , 112 (1), 68-105.
Bittmann, R. M., & Gelbard, R. M. (2007). Decision-making method using a visual
approach for cluster analysis problems; indicative classification algorithms and grouping
scope. Expert Systems , 24 (3), 171-187.
Bittmann, R. M., & Gelbard, R. M. (2008). DSS Using Visualization of Multi-Algorithms
Voting. In F. Adam, & P. Humphreys (Eds.), Encyclopedia of Decision Making and
Decision Support Technologies. IGI Global.
Boudjeloud, L., & Poulet, F. (2005). Visual Interactive Evolutionary Algorithm for High
Dimensional Data Clustering and Outlier Detection. In T. B. Ho, D. Cheung, & H. Liu
(Eds.), Advances in Knowledge Discovery and Data Mining (Vol. 3518, pp. 426-431).
Berlin / Heidelberg: Springer.
Brownstone, D., Bunch, D. S., Golob, T. F., & Ren, W. (1996). A Transactions Choice
Model For Forecasting Demand For Alternative-Fuel Vehicles. (S. B. McMullen, Ed.)
Research in Transportation Economics , 4, 87-129.
Cadez, I. V., Heckerman, D., Meek, C., Smyth, P., & White, S. (2003). Model-Based
Clustering and Visualization of Navigation Patterns on a Web Site. Data Mining and
Knowledge Discovery , 7 (4), 399-424.
Chapman, G. B., & Johnson, E. J. (1994). The Limits of Anchoring. Journal of
Behavioral Decision Making , 7 (4), 223-242.
Clifford, H. T., & Stephenson, W. (1975). An Introduction to Numerical Classification.
New York, NY, USA: Academic Press.
Consumer Reports: The 1993 Cars - Annual Auto Issue. (1993). Yonkers, NY, USA:
Consumers Union.
de Oliveira, M. C. F., & Levkowitz, H. (2003). From Visual Data Exploration to Visual
Data Mining: A Survey. IEEE Transactions on Visualization and Computer Graphics, 9
(3), 378-394.
Degryse, H., & Verboven, F. (2000). Car Price Differentials in the European Union: An
Economic Analysis. Centre For Economic Policy Research, London.
Erlich, Z., Gelbard, R. M., & Spiegler, I. (2002). Data Mining by Means of Binary
Representation: A Model for Similarity and Clustering. Information Systems Frontiers , 4
(2), 187-197.
Fisher, R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals
of Eugenics, 7, 179-188.
Gelbard, R. M., Goldman, O., & Spiegler, I. (2007). Investigating Diversity of Clustering
Methods: An Empirical Comparison. Data & Knowledge Engineering , 63 (1), 155-166.
Goldberg, P. K. (1998). The Effects of the Corporate Average Fuel Efficiency Standards
in the US. Journal of Industrial Economics , 46 (1), 1-33.
Grabmeier, J., & Rudolph, A. (2002). Techniques of Cluster Algorithms in Data Mining.
Data Mining and Knowledge Discovery , 6 (4), 303-360.
Henderson, D. (2001). Assessing the Finite-Time Performance of Local Search
Algorithms. PhD Dissertation, Virginia Polytechnic Institute and State University,
Industrial and Systems Engineering, Blacksburg, Virginia.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data Clustering: A Review. ACM
Computing Surveys (CSUR), 31 (3), 264-323.
Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu, A. Y.
(2002). A local search approximation algorithm for k-means clustering. Proceedings of
the eighteenth annual symposium on Computational geometry (pp. 10-18). Barcelona,
Spain: ACM.
Kim, Y. (2007). Weighted Order-dependent Clustering and Visualization of Web
Navigation Patterns. Decision Support Systems , 43 (4), 1630-1645.
Kohonen, T. (2001). Self-Organizing Maps (Third ed., Vol. 30). Berlin, Heidelberg, New
York: Springer.
Lock, R. H. (1993). 1993 New Car Data. Journal of Statistics Education , 1 (1).
McFadden, D., & Train, K. (2000). Mixed MNL Models for Discrete Response. Journal
of Applied Econometrics , 15 (5), 447-470.
PACE New Car & Truck 1993 Buying Guide. (1993). Milwaukee, WI, USA: Pace
Publications Inc.
Raveh, A. (2000). Co-plot: A Graphic Display Method for Geometrical Representations
of MCDM. European Journal of Operational Research , 125 (3), 670-678.
Sensen, N. (1999). Algorithms for a Job-Scheduling Problem within a Parallel Digital
Library. Proceedings of the 1999 International Conference on Parallel Processing (p.
422). Washington, DC, USA: IEEE Computer Society.
Seo, J., & Shneiderman, B. (2005). A Rank-by-Feature Framework for Interactive
Exploration of Multidimensional Data. Information Visualization , 4 (2), 96-113.
Seo, J., & Shneiderman, B. (2002). Interactively Exploring Hierarchical Clustering
Results. Computer , 35 (7), 80-86.
Shamir, R., & Sharan, R. (2002). Algorithmic Approaches to Clustering Gene Expression
Data. In T. Jiang, Y. Xu, & M. Q. Zhang (Eds.), Current Topics in Computational
Molecular Biology (pp. 269-300). Cambridge, MA, USA: MIT Press.
Shultz, T. R., Mareschal, D., & Schmidt, W. C. (1994). Modeling Cognitive
Development on Balance Scale Phenomena. Machine Learning , 16 (1-2), 57-86.
Tang, Q. C., & Cheng, H. K. (2005). Optimal Location and Pricing of Web Services
Intermediary. Decision Support Systems , 40 (1), 129-141.
Thomassey, S., & Fiordaliso, A. (2006). A Hybrid Sales Forecasting System Based on
Clustering and Decision Trees. Decision Support Systems , 42 (1), 408-421.
Wu, N., & Zhang, J. (2006). Factor-analysis Based Anomaly Detection and Clustering.
Decision Support Systems , 42 (1), 375 - 389.
Abstract

Data clustering is a widely used method for decision making in diverse areas such as sales, marketing, and human resources. Currently, decision makers must analyze algorithms and parameters separately in order to determine and prioritize the decisions they face. They have no model or tool that enables comparing the partitions obtained by applying different methods or by using different parameters.

The proposed methodology is based on the Multi-Algorithm Voting (MAV) method, which was developed in order to analyze and present the results obtained from running several algorithms, where each algorithm proposes a different decision. The display uses a Tetris-like format in which the clustering of the data (the decisions) is arranged in a matrix. Each cluster proposed by a given algorithm is represented by a column of that matrix, and each sample is represented by a row. "Local decisions" (decisions regarding a specific sample made by a specific algorithm) are represented by tags in the corresponding cells of the matrix.

The MAV method matches the arbitrarily assigned tags to one another using an optimization algorithm from the family of Local Search algorithms, developed for this research. Each match is represented visually, for example by a distinct color. The colors are kept consistent throughout the matrix, and similar colors, even in different rows, represent the same cluster (decision). Although the algorithm is used here to match partitions within cluster analysis, both the algorithm and the display method developed are general and suit any problem that requires matching and analyzing the dispersion of data.

The MAV method computes the quality of the match for each row, which represents a sample. Without loss of generality, the quality of the match can be computed using the homogeneity or the heterogeneity of the match of a single sample across all the algorithms applied in the data analysis. The best match is presented according to the quality measure used.

The MAV method enables not only the presentation of the results obtained by running a large number of algorithms, but also a quantitative characterization of the quality of the result.
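The per-row homogeneity measure described above can be sketched as follows. This is an illustrative reconstruction under our own assumptions, not the thesis implementation, and `row_homogeneity` is a hypothetical helper name:

```python
# Sketch of the MAV per-sample quality measure, assuming tags have
# already been matched across algorithms, so equal labels in a row
# mean the algorithms agree on that sample.
from collections import Counter

def row_homogeneity(row):
    """Fraction of algorithms agreeing with the row's majority label.

    `row` holds one sample's matched cluster labels, one per algorithm.
    1.0 means all algorithms place the sample in the same cluster.
    """
    counts = Counter(row)
    return max(counts.values()) / len(row)

# MAV matrix: rows = samples, columns = clustering algorithms.
matrix = [
    ["A", "A", "A"],   # full agreement
    ["A", "A", "B"],   # partial agreement
    ["A", "B", "C"],   # no agreement
]
scores = [row_homogeneity(row) for row in matrix]
```

Heterogeneity could be defined analogously, e.g. as one minus this score, and the best match presented per whichever measure is chosen.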
Table of Contents

Abstract (English) ........................................................................................................... i
Introduction .................................................................................................................... 1
1. Introduction ........................................................................................................ 1
2. Research Objectives .......................................................................................... 3
3. Theoretical Background ..................................................................................... 4
3.1 Data Clustering - Algorithms ..................................................................... 4
3.2 Data Clustering - Visualization .................................................................. 7
3.3 "Local Search" Algorithms ...................................................................... 11
4. Research Assumptions .................................................................................... 12
5. The Proposed Model ........................................................................................ 13
5.1 The Principle of the Model ...................................................................... 13
6. Research Methods ........................................................................................... 16
6.1 Research Tools ........................................................................................ 16
6.2 Research Evaluation ................................................................................ 22
7. Research Structure and Publications .............................................................. 23
8. Summary and Discussion ................................................................................ 24
Essay 1 ......................................................................................................................... 25
Essay 2 ......................................................................................................................... 42
Essay 3 ......................................................................................................................... 53
References .................................................................................................................... 75
Abstract (Hebrew) ......................................................................................................... א
Acknowledgments

To Dr. Roy Gelbard, who taught, guided, and supported. A distinguished teacher and a true friend.
To Prof. Abraham Carmeli for his great help.
To Dr. Yuri Zlotnikov, of blessed memory, for his dedication and knowledge. A great man and a great loss.
To my wife Orna and my children Adi and Guy: I am blessed with a wonderful family that gave me the strength to continue.
To my parents Mordechai and Ruth Bittmann and my sister Dr. Irit Bittmann, who have surrounded me with love and support all my life.
To my grandfather Dr. Julius Simon, of blessed memory, and my grandmother Gerda, of blessed memory, who set the bar to which I aspire.

This work was carried out under the supervision of Dr. Roy Gelbard and Prof. Abraham Carmeli of the Graduate School of Business Administration, Bar-Ilan University.