Date post: | 27-Dec-2015 |
Upload: | sugan-pragasam |
Statistics for Marketing & Consumer Research
Copyright © 2008 - Mario Mazzocchi
Correspondence Analysis
Chapter 14
Correspondence analysis
• Multivariate statistical technique which looks into the association of two or more categorical variables and displays them jointly on a bivariate graph
• It can be used to apply multidimensional scaling to categorical variables.
Correspondence analysis and data reduction techniques
• Factor and principal component analyses are only applied to metric (interval or ratio) quantitative variables
• Traditional multidimensional scaling deals with non-metric preference and perceptual data when those are on an ordinal scale
• Correspondence analysis allows data reduction (and graphical representation of dissimilarities) on non-metric nominal (categorical) variables
• The issue with categorical (non-ordinal) variables is how to measure distances between two objects: Correspondence analysis exploits contingency tables and association measures
Example (Trust data)
• Do consumers with different jobs (q55) show preferences for some specific type of chicken (q6)?
Correspondence Table
Rows: If employed, what is your occupation?
Columns: In a typical week, what type of fresh or frozen chicken do you buy for your household's home consumption?

Occupation                      'Value'   'Standard'   'Organic'   'Luxury'   Active
                                chicken   chicken      chicken     chicken    Margin
I am not employed                  17         50          10          17         94
Non manual employee                11         74          14          28        127
Manual employee                     6         19           4           8         37
Executive                           0          7           6          14         27
Self employed professional          1         18           7           3         29
Farmer / agricultural worker        1          1           1           0          3
Employer / Entrepreneur             0          4           2           3          9
Other                              11         31           1           1         44
Active Margin                      47        204          45          74        370
Independence
• If the two characters are independent, then the numbers in the cells of the table should simply depend on the row and column totals (lecture 9)
• Measure the distance between the expected frequency in each cell and the actual (observed) frequency
• Compute a statistic (the Chi-square statistic) which allows one to test whether the difference between the expected and actual value is statistically significant
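As a rough illustration of the test described above, the following plain-Python sketch computes the expected frequencies and the Pearson Chi-square statistic for the occupation by chicken-type table of the Trust example. The variable names are illustrative and no p-value is computed. Several expected counts are small (the farmer row has a total of only 3), so the usual Chi-square approximation should be read with caution here.

```python
# Occupation (rows) by chicken type (columns: Value, Standard, Organic, Luxury).
# Counts are copied from the correspondence table of the Trust example.
observed = [
    [17, 50, 10, 17],  # I am not employed
    [11, 74, 14, 28],  # Non manual employee
    [6, 19, 4, 8],     # Manual employee
    [0, 7, 6, 14],     # Executive
    [1, 18, 7, 3],     # Self employed professional
    [1, 1, 1, 0],      # Farmer / agricultural worker
    [0, 4, 2, 3],      # Employer / Entrepreneur
    [11, 31, 1, 1],    # Other
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence, each expected count depends only on the margins.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson Chi-square: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)

print(f"n = {n}, Chi-square = {chi_square:.2f}, df = {degrees_of_freedom}")
```

The statistic would then be compared with a Chi-square distribution with the printed degrees of freedom.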
Reducing the number of dimensions
• The elements composing the Chi-square statistic are standardized metric values, one for each of the cells
• They become larger as the association between two specific characters increases
• These elements can be interpreted as a metric measure of distance
• The resulting matrix is similar to a covariance matrix
• A method similar to principal component analysis can be applied to this matrix to reduce the number of dimensions
Coordinates
• The principal component scores provide standardized values that can be used as coordinates
• One may apply the same data reduction technique
  • first by rows (synthesizing occupation as a function of types of chicken)
  • then by columns (synthesizing types of chicken as a function of occupation)
• The first two components for each application generate a bivariate plot which shows both the occupation and the type of chicken in the same space
Output from Correspondence Analysis
Executives prefer “Luxury” chicken
Unemployed are closer to “Value” chicken
Applications
• It is possible to represent on the same graph consumer preferences for different brands and characteristics of a specific product (e.g. car brands together with colour, power, size, etc.)
• This allows one to explore brand choice in relation to characteristics opening the way to product modifications and innovations to meet consumer preferences
• Correspondence analysis is particularly useful when the variables have many categories
• The application to metric (continuous) data is not ruled out but data need to be categorized first
Summary
• Correspondence analysis is a compositional technique which starts from a set of product attributes to portray the overall preference for a brand
• This technique is very similar to PCA and can be employed for data reduction purposes or to plot perceptual maps
• Because of the way it is constructed, correspondence analysis can be applied to either the rows or the columns of the data matrix
• For example, if rows represent brands and columns are different attributes:
  1. By applying the method by rows one obtains the coordinates for the brands
  2. The application by columns allows one to represent the attributes in the same graph
Steps to run correspondence analysis
• Represent the data in a contingency table
• Translate the frequencies of the contingency table into a matrix of metric (continuous) distances through a set of Chi-square association measures on the row and column profiles
• Extract the dimensions (in a similar fashion to PCA)
• Evaluate the explanatory power of the selected number of dimensions
• Plot row and column objects in the same co-ordinate space
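The steps above can be sketched with NumPy as follows. This is a minimal illustration, not the SPSS or SAS implementation: it uses the common formulation of correspondence analysis as a singular value decomposition of the standardized residuals, and the function name `correspondence_analysis` is an assumption made for this example. The data are the Trust occupation by chicken-type counts.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple CA via SVD of standardized residuals (illustrative sketch)."""
    counts = np.asarray(table, dtype=float)
    P = counts / counts.sum()            # relative frequencies f_ij
    r = P.sum(axis=1)                    # row masses f_i0
    c = P.sum(axis=0)                    # column masses f_0j
    E = np.outer(r, c)                   # expected frequencies under independence
    S = (P - E) / np.sqrt(E)             # signed square roots of cell Chi-square values
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(counts.shape) - 1            # maximum number of dimensions
    U, sv, Vt = U[:, :k], sv[:k], Vt[:k]
    # Principal coordinates: singular vectors rescaled by masses and singular values.
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return sv, row_coords, col_coords

chicken = [
    [17, 50, 10, 17], [11, 74, 14, 28], [6, 19, 4, 8], [0, 7, 6, 14],
    [1, 18, 7, 3], [1, 1, 1, 0], [0, 4, 2, 3], [11, 31, 1, 1],
]
sv, rows, cols = correspondence_analysis(chicken)

inertia = sv ** 2                        # one eigenvalue per dimension
print("share of inertia per dimension:", np.round(inertia / inertia.sum(), 3))
print("row coordinates (first two dimensions):")
print(np.round(rows[:, :2], 3))
```

Plotting the first two columns of `rows` and `cols` on the same axes would give the joint map; how the two sets of points are scaled relative to each other depends on the normalization chosen.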
The frequency table
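                        Categorical variable Y (l categories)
                        y1    y2    …    yj    …    yl    Row masses
Categorical       x1    f11   f12   …    f1j   …    f1l   f10
variable X        x2    f21   f22   …    f2j   …    f2l   f20
(k categories)    …
                  xi    fi1   fi2   …    fij   …    fil   fi0
                  …
                  xk    fk1   fk2   …    fkj   …    fkl   fk0
Column masses           f01   f02   …    f0j   …    f0l   1

Labels on the slide: a row of the table is marked as a row profile and a column as a column profile; the marginal column (fi0) holds the row masses and the marginal row (f0j) holds the column masses.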
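A small sketch of how the quantities in the frequency table can be obtained from raw counts, again using the Trust occupation by chicken-type table. It assumes the usual definition of a profile as a row (or column) of relative frequencies divided by its mass; the slide itself only labels the rows and columns, so treat that definition as an assumption.

```python
# Raw counts: occupation (rows) by chicken type (columns), Trust example.
counts = [
    [17, 50, 10, 17], [11, 74, 14, 28], [6, 19, 4, 8], [0, 7, 6, 14],
    [1, 18, 7, 3], [1, 1, 1, 0], [0, 4, 2, 3], [11, 31, 1, 1],
]
n = sum(sum(row) for row in counts)

# Relative frequencies f_ij, so that the whole table sums to 1.
f = [[x / n for x in row] for row in counts]

row_masses = [sum(row) for row in f]           # f_i0
col_masses = [sum(col) for col in zip(*f)]     # f_0j

# Profiles (assumed definition): frequencies divided by the corresponding mass.
row_profiles = [[fij / fi0 for fij in row] for row, fi0 in zip(f, row_masses)]
col_profiles = [[fij / f0j for fij in col] for col, f0j in zip(zip(*f), col_masses)]

print("row masses:", [round(m, 3) for m in row_masses])
print("profile of 'Executive':", [round(p, 3) for p in row_profiles[3]])
```

Each profile sums to one, which makes rows with very different totals comparable.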
Interpretation of coordinates
• The categories of the x variable can be seen as different coordinates for the points identified by the y variable
• The categories of the y variable can be seen as different coordinates for the points identified by the x variable
• Thus it is possible to represent the x and y categories as points in space, imposing (as in multidimensional scaling) that they respect some distance measure
Representations
• Take the row profile (the categories of x) and plot the categories in a bi-dimensional graph, using the categories of y to define the distances
• This allows one to compare nominal categories within the same variable: those categories of x which show similar levels of association with a given category of y can be considered as closer than those with very different levels of association with the same category of y
• The same procedure is carried out transposing the table which means that the categories of y can be represented using the categories of x to define the distances
Computing the distances
• When the coordinates are defined simultaneously for the categories of x and y, the Chi-square value can be computed for each cell as follows
• Obtain the expected table frequencies

$$f_{ij}^{*} = \frac{f_{i0}\, f_{0j}}{f_{00}} = f_{i0}\, f_{0j} \qquad \left(\text{in absolute frequencies: } n_{ij}^{*} = \frac{n_{i0}\, n_{0j}}{n_{00}}\right)$$

• Where nij and fij are the absolute and relative frequencies, respectively, ni0 and n0j (or fi0 and f0j) are the marginal totals for row i and column j (the row masses and column masses), respectively, and n00 is the sample size (hence the total relative frequency f00 equals one)
• The Chi-square value can now be computed for each cell (i,j)

$$\chi_{ij}^{2} = \frac{\left(f_{ij} - f_{ij}^{*}\right)^{2}}{f_{ij}^{*}}$$

These are the quadratic distances between category i of the x variable and category j of the y variable
The distance matrix
• The matrix χ² measures all of the associations between the categories of the first variable and those of the second one.
• A generalization to the multivariate case (MCA) is possible by stacking the matrix
  • Stacking: compose a large matrix by blocks, where each block is the contingency matrix for two variables (all possible associations are taken into consideration)
  • The stacked matrix is referred to as the Burt Table
• To obtain similarity values from the χ² matrix:
  • compute the square root of the elemental Chi-square values
  • use the appropriate sign (the sign of the difference fij - fij*)
  • large positive values correspond to strongly associated categories
  • large negative values identify those categories where the association is strong but negative, indicating dissimilarity
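The signed square root described above can be sketched in a few lines of plain Python. The example reuses the Trust occupation by chicken-type counts; the helper name `signed_root` is purely illustrative.

```python
import math

# Occupation (rows) by chicken type (columns: Value, Standard, Organic, Luxury).
counts = [
    [17, 50, 10, 17], [11, 74, 14, 28], [6, 19, 4, 8], [0, 7, 6, 14],
    [1, 18, 7, 3], [1, 1, 1, 0], [0, 4, 2, 3], [11, 31, 1, 1],
]
n = sum(sum(row) for row in counts)
f = [[x / n for x in row] for row in counts]   # relative frequencies f_ij
fi0 = [sum(row) for row in f]                  # row masses
f0j = [sum(col) for col in zip(*f)]            # column masses

def signed_root(i, j):
    """Square root of the cell Chi-square value, carrying the sign of f_ij - f*_ij."""
    expected = fi0[i] * f0j[j]
    diff = f[i][j] - expected
    return math.copysign(math.sqrt(diff ** 2 / expected), diff)

d = [[signed_root(i, j) for j in range(len(f0j))] for i in range(len(fi0))]

# Row 3 is 'Executive'; column 0 is 'Value' chicken, column 3 is 'Luxury' chicken.
print("Executive / Luxury:", round(d[3][3], 3))
print("Executive / Value: ", round(d[3][0], 3))
```

A positive value for the Executive and Luxury pair and a negative one for Executive and Value are in line with the reading of the Trust plot, where executives lie close to "Luxury" chicken.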
Estimation
• The resulting matrix D contains metric and continuous similarity data
• It is possible to apply PCA to translate such a matrix into coordinates for each of the categories, first those of x, then those of y
• Before PCA can be applied, some normalization is required so that the input matrix becomes similar to a correlation matrix
• The use of the square root of the row (column) masses for normalizing the values in D represents the key difference from PCA
• The rest of the estimation process follows the same steps as PCA
• As for PCA, eigenvalues are computed, one for each dimension, which can be used to evaluate the proportion of dissimilarity maintained by that dimension
Inertia
• Inertia is a measure of association between two categorical variables based on the Chi-squared statistic.
• In correspondence analysis the proportion of inertia explained by each of the dimensions can be regarded as a measure of goodness-of-fit, because the effectiveness of correspondence analysis depends on the degree of association between x and y
• Total inertia
  – is a measure of the overall association between x and y
  – is equal to the sum of the eigenvalues
  – corresponds to the Chi-square value divided by the number of observations
  – A total inertia above 0.20 is expected for adequate representations
• Inertia values can be computed for each of the dimensions and represent the contribution of that dimension to the association (Chi-square) between the two variables
SPSS example
• EFS data set:
  • economic position of the household reference person (a093)
  • type of tenure (a121)
• Their Pearson Chi-square value is 274, which indicates a significant association at the 99.9% confidence level
Analysis
Define the range, i.e. the categories for each variable that enter the analysis
Some categories can be indicated as supplementary: they appear in the graphical representation, but do not influence the actual estimation of the scores
Model options
Choose the number of dimensions to be retained
Choice of distance measure
Standardization (only for Euclidean distance)
Normalization: which variable should be privileged?
Number of dimensions
• The maximum number of dimensions for the analysis is equal to
  • the number of rows minus one, or
  • the number of columns minus one (whichever is smaller)
• In our example, the maximum number of dimensions would be five, which reduces to four due to missing values in one row category.
• As shown later in this section, one may then choose to graphically represent only a sub-set of the extracted dimensions (usually two or three) to make interpretation easier
Distance measure
• Chi-square distance (as discussed earlier)
• Euclidean distance
  • uses the square root of the sum of squared differences between pairs of rows and pairs of columns
  • this also requires one to choose a method for centering the data (see the SPSS manual for details)
• This example uses standard correspondence analysis (with the Chi-square distance), which does not require a standardization method.
Normalization method
• Defines how correspondence analysis is run: whether to give priority to comparisons between the categories for x (rows) or those for y (columns)
• This choice influences the way distances are summarized by the first dimensions
• Row principal normalization: the Euclidean distances in the final bivariate plot of x and y are as close as possible to the Chi-square distances between the rows, that is, the categories of x
• The opposite is valid for the column principal method
• Symmetrical normalization: the distances on the graph resemble as much as possible the distances for both x and y, by spreading the total inertia symmetrically
• Principal normalization: inertia is first spread over the scores for x, then y
• Weighted normalization: defines a weighting value between minus one and plus one, where minus one is the column principal, zero is symmetrical and plus one is the row principal
• EFS example: the row principal method is more appropriate, as it is more relevant to see how differences in socio-economic conditions affect the tenure type than to look at distances between tenure types.
Additional statistics
Although CA is a nonparametric method, it is possible to compute standard deviations and correlations under the assumption of a multinomial distribution of the cell frequencies (when data are obtained as a random sample from a normally distributed population)
A permuted correspondence table allows one to order the categories of x and y using scores obtained from CA
E.g. the tenure types and the socio-economic conditions might follow some ordering, but it cannot be defined with sufficient precision to consider these variables as ordinal. One can use the scores in the first dimension (or the first two) to order the categories and produce a permuted correspondence table.
Plots
Three graphs:
• Biplot (both x & y)
• x only (rows)
• y only (columns)
One usually chooses to represent only the first two or three of the extracted dimensions
Output
Summary
                                                  Proportion of Inertia       Confidence Singular Value
            Singular                                                          Standard    Correlation
Dimension   Value     Inertia   Chi Square  Sig.     Accounted for  Cumulative   Deviation   2       3       4
1           .669      .447                           .850           .850         .031        .094    -.032   -.022
2           .209      .044                           .083           .933         .055                .011    .081
3           .173      .030                           .057           .990         .055                        -.042
4           .072      .005                           .010           1.000        .053
Total                 .526      231.402     .000(a)  1.000          1.000
a. 24 degrees of freedom
The SV is the square root of inertia (the eigenvalue)
The Chi-square stat suggests strong and significant association
The first dimension explains 85%, the first two 93% of total inertia. However, note that total inertia does not correspond to total variability, but to the variability of the extracted dimensions
Usually a value of total inertia above 0.2 is regarded as acceptable
These precision measures are based on the multinomial distribution assumption
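The relations between the columns of the Summary table can be checked with a few lines of Python. The inputs are the rounded values printed in the table, so agreement is only expected up to rounding. The number of observations computed at the end is implied by the relation "total inertia = Chi-square / number of observations"; it is not reported on the slide.

```python
from itertools import accumulate

# Rounded values read from the SPSS Summary table.
singular_values = [0.669, 0.209, 0.173, 0.072]
chi_square = 231.402
reported_total_inertia = 0.526

# Inertia of each dimension is the squared singular value (the eigenvalue).
inertia = [sv ** 2 for sv in singular_values]
total_inertia = sum(inertia)

# Proportion of inertia accounted for by each dimension, and its running total.
proportion = [value / total_inertia for value in inertia]
cumulative = list(accumulate(proportion))

# Implied (not reported) number of observations: Chi-square / total inertia.
implied_n = chi_square / reported_total_inertia

print("inertia per dimension:", [round(v, 3) for v in inertia])
print("total inertia:", round(total_inertia, 3))
print("cumulative proportion:", [round(v, 3) for v in cumulative])
print("implied number of observations:", round(implied_n))
```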
Row scores
Overview Row Points (b)

Economic position of                       Score in Dimension
Household Reference Person       Mass      1        2        3        4      Inertia
Self-employed                    .080     .296     .025     .433    -.164     .024
Fulltime employee                .539     .527     .049    -.039     .026     .152
Pt employee                      .077    -.239    -.409    -.352    -.143     .028
Unemployed                       .018    -.154   -1.223     .509     .241     .033
Work related govt train prog (a) .000      .        .        .        .        .
Ret unoc over min ni age         .286    -.999     .089     .015     .019     .288
Active Total                    1.000                                         .526

Contribution of Point to Inertia of Dimension
                                   1        2        3        4
Self-employed                    .016     .001     .496     .407
Fulltime employee                .334     .030     .027     .071
Pt employee                      .010     .295     .318     .300
Unemployed                       .001     .622     .157     .202
Work related govt train prog (a) .000     .000     .000     .000
Ret unoc over min ni age         .639     .052     .002     .020
Active Total                    1.000    1.000    1.000    1.000

Contribution of Dimension to Inertia of Point
                                   1        2        3        4      Total
Self-employed                    .290     .002     .620     .089    1.000
Fulltime employee                .984     .008     .005     .002    1.000
Pt employee                      .156     .453     .336     .055    1.000
Unemployed                       .013     .814     .141     .032    1.000
Work related govt train prog (a)   .        .        .        .       .
Ret unoc over min ni age         .992     .008     .000     .000    1.000

a. Supplementary point
b. Row Principal normalization
The mass column shows the relative weight of each category in the sample
Scores are computed for each category except the supplementary one, provided there are no missing data
Scores are the coordinates for the map
The inertia column shows how total inertia has been distributed across rows (similar to communalities)
Full-time employees and retired persons have a higher relevance because they are more important categories in the original correspondence table. These two categories (especially retirement) strongly contribute to explaining the first dimension
The second dimension is characterized by unemployed and part-time employees
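The link between masses, scores and contributions in the row table can be verified numerically. Under row principal normalization, the mass-weighted sum of squared row scores in a dimension should reproduce that dimension's inertia, and each term's share of the sum should match the reported contribution of the point. The sketch below does this for the first dimension, using the rounded values from the table (the supplementary category is left out).

```python
# (mass, score in dimension 1) for the active row categories, from the table.
row_points = {
    "Self-employed": (0.080, 0.296),
    "Fulltime employee": (0.539, 0.527),
    "Pt employee": (0.077, -0.239),
    "Unemployed": (0.018, -0.154),
    "Ret unoc over min ni age": (0.286, -0.999),
}

# Mass-weighted squared scores: each category's share of the dimension's inertia.
weighted = {name: mass * score ** 2 for name, (mass, score) in row_points.items()}
dim1_inertia = sum(weighted.values())

# Contribution of each point to the inertia of dimension 1.
contribution = {name: value / dim1_inertia for name, value in weighted.items()}

print("inertia of dimension 1:", round(dim1_inertia, 3))
for name, share in contribution.items():
    print(f"{name:26s} {share:.3f}")
```

Up to rounding, the result agrees with the inertia of the first dimension in the Summary table (.447) and with the contributions of .334 for full-time employees and .639 for the retired.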
Column scores
• The same exercise is carried out on columns, however the row principal method does not normalize by column
Overview Column Points (b)

                                           Score in Dimension
Tenure - type                    Mass      1        2        3        4      Inertia
Local Authority rented unfurn.   .098    -.699   -1.993     .051    1.106     .039
Housing association              .066    -.781   -1.263    2.821   -1.273     .039
Other rented unfurnished         .050     .487   -2.023   -2.190     .891     .022
Rented furnished                 .032     .531   -1.098   -2.270   -4.585     .014
Owned with mortgage              .457     .971     .371     .233     .133     .196
Owned by rental purchase         .002    1.179    1.120   -1.287    5.002     .002
Owned outright                   .295   -1.244     .819    -.382     .018     .214
Rent free (a)                    .009    -.957   -1.039   -2.996   -3.705     .007
Active Total                    1.000                                         .526

Contribution of Point to Inertia of Dimension
                                   1        2        3        4
Local Authority rented unfurn.   .048     .388     .000     .120
Housing association              .040     .105     .524     .107
Other rented unfurnished         .012     .205     .240     .040
Rented furnished                 .009     .038     .164     .669
Owned with mortgage              .431     .063     .025     .008
Owned by rental purchase         .003     .003     .004     .057
Owned outright                   .457     .198     .043     .000
Rent free (a)                    .000     .000     .000     .000
Active Total                    1.000    1.000    1.000    1.000

Contribution of Dimension to Inertia of Point
                                   1        2        3        4      Total
Local Authority rented unfurn.   .548     .436     .000     .016    1.000
Housing association              .462     .118     .405     .014    1.000
Other rented unfurnished         .245     .413     .333     .010    1.000
Rented furnished                 .284     .119     .349     .248    1.000
Owned with mortgage              .982     .014     .004     .000    1.000
Owned by rental purchase         .725     .064     .058     .153    1.000
Owned outright                   .954     .040     .006     .000    1.000
Rent free (a)                    .512     .059     .338     .090    1.000

a. Supplementary point
b. Row Principal normalization

By column, the first dimension is especially related to the “owned with mortgage” and “owned outright” categories
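One way to see that the row principal method leaves the column scores unscaled is to compute their mass-weighted sum of squares. If the column scores are standard coordinates, as the reported numbers suggest (this reading is an assumption, not something stated on the slide), the sum should be close to one in every dimension rather than equal to the dimension's inertia. The sketch below checks the first dimension with the rounded table values, leaving out the supplementary "Rent free" category.

```python
# (mass, score in dimension 1) for the active tenure categories, from the table.
column_points = {
    "Local Authority rented unfurnished": (0.098, -0.699),
    "Housing association": (0.066, -0.781),
    "Other rented unfurnished": (0.050, 0.487),
    "Rented furnished": (0.032, 0.531),
    "Owned with mortgage": (0.457, 0.971),
    "Owned by rental purchase": (0.002, 1.179),
    "Owned outright": (0.295, -1.244),
}

weighted = {name: mass * score ** 2 for name, (mass, score) in column_points.items()}
weighted_sum = sum(weighted.values())

print("mass-weighted sum of squared column scores:", round(weighted_sum, 3))
print("share of 'Owned with mortgage':", round(weighted["Owned with mortgage"], 3))
print("share of 'Owned outright':", round(weighted["Owned outright"], 3))
```

Because the total is approximately one, each term can be read directly as the contribution of that tenure type to the first dimension, which matches the .431 and .457 reported for the two owned categories.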
Bi-plot
Employed individuals are closer to owned accommodations
Retired individuals are also close to owned accommodations
Part-time employees and unemployed individuals are closer to rented accommodations and other forms of accommodations
Multiple Correspondence Analysis (MCA)
When all variables are multiple nominal, then optimal scaling applies MCA
Plot with 3 variables
The analysis now also includes the government office region
SAS correspondence analysis
• SAS procedure: proc CORRESP
  • simple correspondence analysis
  • multiple correspondence analysis (option MCA)
  • same types of normalization as SPSS
  • option PROFILE (ROW, COLUMN or BOTH)