Statistics for Marketing & Consumer Research. Copyright © 2008 - Mario Mazzocchi

Correspondence Analysis

Chapter 14

Correspondence analysis

• Multivariate statistical technique which looks into the association of two or more categorical variables and displays them jointly on a bivariate graph

• It can be used to apply multidimensional scaling to categorical variables.

Correspondence analysis and data reduction techniques

• Factor and principal component analyses are only applied to metric (interval or ratio) quantitative variables

• Traditional multidimensional scaling deals with non-metric preference and perceptual data when those are on an ordinal scale

• Correspondence analysis allows data reduction (and graphical representation of dissimilarities) on non-metric nominal (categorical) variables

• The issue with categorical (non-ordinal) variables is how to measure distances between two objects: Correspondence analysis exploits contingency tables and association measures

Example (Trust data)

• Do consumers with different jobs (q55) show preferences for some specific type of chicken (q6)?

Correspondence Table

Rows: "If employed, what is your occupation?"
Columns: "In a typical week, what type of fresh or frozen chicken do you buy for your household's home consumption?"

Occupation                      'Value'  'Standard'  'Organic'  'Luxury'  Active Margin
I am not employed                    17          50         10        17             94
Non manual employee                  11          74         14        28            127
Manual employee                       6          19          4         8             37
Executive                             0           7          6        14             27
Self employed professional            1          18          7         3             29
Farmer / agricultural worker          1           1          1         0              3
Employer / Entrepreneur               0           4          2         3              9
Other                                11          31          1         1             44
Active Margin                        47         204         45        74            370

Independence

• If the two characters are independent then the number in the cells of the table should simply depend on the row and column totals (lecture 9)

• Measure the distance between the expected frequency in each cell and the actual (observed) frequency

• Compute a statistic (the Chi-square statistic) which allows one to test whether the difference between the expected and actual value is statistically significant
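
As a minimal sketch of the independence check, the following Python snippet computes the expected frequencies and the Pearson Chi-square statistic for the Trust correspondence table. The counts are copied from that table; the variable names and the pure-Python layout are illustrative assumptions, not part of the original slides.

```python
# Trust data: rows = occupation (q55), columns = chicken type (q6).
# Counts are copied from the correspondence table; names are illustrative.
observed = [
    [17, 50, 10, 17],  # I am not employed
    [11, 74, 14, 28],  # Non manual employee
    [6, 19, 4, 8],     # Manual employee
    [0, 7, 6, 14],     # Executive
    [1, 18, 7, 3],     # Self employed professional
    [1, 1, 1, 0],      # Farmer / agricultural worker
    [0, 4, 2, 3],      # Employer / Entrepreneur
    [11, 31, 1, 1],    # Other
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence the expected count depends only on the margins.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson Chi-square: sum over cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)

print(f"n = {n}, chi-square = {chi_square:.2f}, df = {degrees_of_freedom}")
```

The statistic is then compared with the Chi-square distribution with 21 degrees of freedom. Several expected counts in this table are small (for example in the farmer row), so the test should be read as approximate.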

Reducing the number of dimensions

• The elements composing the Chi-square statistic are standardized metric values, one for each of the cells

• They become larger as the association between two specific characters increases

• These elements can be interpreted as a metric measure of distance

• The resulting matrix is similar to a covariance matrix

• A method similar to principal component analysis can be applied to this matrix to reduce the number of dimensions

Coordinates

• The principal component scores provide standardized values that can be used as coordinates

• One may apply the same data reduction technique
  • first by rows (synthesizing occupation as a function of types of chicken)
  • then by column (synthesizing types of chicken as a function of occupation)

• The first two components for each application generate a bivariate plot which shows both the occupation and the type of chicken in the same space

Output from Correspondence Analysis

Executives prefer “Luxury” chicken

Unemployed are closer to “Value” chicken

Applications

• It is possible to represent on the same graph consumer preferences for different brands and characteristics of a specific product (e.g. car brands together with colour, power, size, etc.)

• This allows one to explore brand choice in relation to characteristics opening the way to product modifications and innovations to meet consumer preferences

• Correspondence analysis is particularly useful when the variables have many categories

• The application to metric (continuous) data is not ruled out but data need to be categorized first

Summary

• Correspondence analysis is a compositional technique which starts from a set of product attributes to portray the overall preference for a brand

• This technique is very similar to PCA and can be employed for data reduction purposes or to plot perceptual maps

• Because of the way it is constructed, correspondence analysis can be applied to either the rows or the columns of the data matrix

• For example, if rows represent brands and columns are different attributes:
  1. By applying the method by rows one obtains the coordinates for the brands
  2. The application by columns allows one to represent the attributes in the same graph

Steps to run correspondence analysis

• Represent the data in a contingency table

• Translate the frequencies of the contingency table into a matrix of metric (continuous) distances through a set of Chi-square association measures on the row and column profiles

• Extract the dimensions (in a similar fashion to PCA)

• Evaluate the explanatory power of the selected number of dimensions

• Plot row and column objects in the same co-ordinate space
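
The steps above can be sketched end to end on the Trust table. This is a minimal illustration under stated assumptions, not the SPSS or SAS implementation: it uses the usual formulation based on standardized residuals, extracts the dimensions with a small hand-written Jacobi eigen-solver (a stand-in for a library routine, chosen so the example needs only the standard library), and returns row principal coordinates with column standard coordinates. All names are hypothetical.

```python
import math

# Step 1: contingency table (Trust data: occupation x chicken type).
table = [
    [17, 50, 10, 17], [11, 74, 14, 28], [6, 19, 4, 8], [0, 7, 6, 14],
    [1, 18, 7, 3], [1, 1, 1, 0], [0, 4, 2, 3], [11, 31, 1, 1],
]

def jacobi_eigen(matrix, sweeps=50):
    """Eigenvalues and eigenvectors (columns of v) of a small symmetric matrix."""
    size = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(sweeps):
        for p in range(size - 1):
            for q in range(p + 1, size):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(size):  # rotate columns p and q
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                for k in range(size):  # rotate rows p and q
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
                for k in range(size):  # accumulate the eigenvectors
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
    return [a[i][i] for i in range(size)], v

n = sum(map(sum, table))
P = [[x / n for x in row] for row in table]   # relative frequencies
r = [sum(row) for row in P]                   # row masses
c = [sum(col) for col in zip(*P)]             # column masses
I, J = len(r), len(c)

# Step 2: Chi-square based standardized residuals, one per cell.
S = [[(P[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(J)] for i in range(I)]
total_inertia = sum(s * s for row in S for s in row)   # equals chi-square / n

# Step 3: extract dimensions from S'S, as PCA would from a covariance matrix.
StS = [[sum(S[i][a] * S[i][b] for i in range(I)) for b in range(J)] for a in range(J)]
values, vectors = jacobi_eigen(StS)
dims = sorted(range(J), key=lambda k: -values[k])[: min(I, J) - 1]
inertias = [values[k] for k in dims]

# Step 4: explanatory power of each dimension.
proportions = [lam / total_inertia for lam in inertias]

# Step 5: coordinates in a common space (rows principal, columns standard).
row_coords = [[sum(S[i][j] * vectors[j][k] for j in range(J)) / math.sqrt(r[i]) for k in dims]
              for i in range(I)]
col_coords = [[vectors[j][k] / math.sqrt(c[j]) for k in dims] for j in range(J)]

print([round(p, 3) for p in proportions])
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes gives the kind of joint map described in the slides; other normalizations rescale these coordinates differently.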

The frequency table

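Rows: categorical variable X (k categories). Columns: categorical variable Y (l categories).

                 y1    y2    …    yj    …    yl    Row masses
x1               f11   f12   …    f1j   …    f1l   f10
x2               f21   f22   …    f2j   …    f2l   f20
…                …
xi               fi1         …    fij   …    fil   fi0
…                …
xk               fk1   fk2   …    fkj   …    fkl   fk0
Column masses    f01   f02   …    f0j   …    f0l   1

Each row of the table gives a row profile and the final column holds the row masses (fi0); each column gives a column profile and the final row holds the column masses (f0j).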

Interpretation of coordinates

• The categories of the x variable can be seen as different coordinates for the points identified by the y variable

• The categories of the y variable can be seen as different coordinates for the points identified by the x variable

• Thus it is possible to represent the x and y categories as points in space, imposing (as in multidimensional scaling) that they respect some distance measure

Representations

• Take the row profile (the categories of x) and plot the categories in a bi-dimensional graph, using the categories of y to define the distances

• This allows one to compare nominal categories within the same variable: those categories of x which show similar levels of association with a given category of y can be considered as closer than those with very different levels of association with the same category of y

• The same procedure is carried out transposing the table which means that the categories of y can be represented using the categories of x to define the distances

Computing the distances

• When the coordinates are defined simultaneously for the categories of x and y, the Chi-square value can be computed for each cell as follows

• Obtain the expected table frequencies

$$f_{ij}^{*} = \frac{n_{i0}}{n_{00}} \cdot \frac{n_{0j}}{n_{00}} = \frac{f_{i0}\, f_{0j}}{f_{00}} = f_{i0}\, f_{0j}$$

• where nij and fij are the absolute and relative frequencies, respectively, ni0 and n0j (or fi0 and f0j) are the marginal totals for row i and column j (the row masses and column masses), respectively, and n00 is the sample size (hence the total relative frequency f00 equals one)

• The Chi-square value can now be computed for each cell (i,j)

$$\chi_{ij}^{2} = \frac{\left(f_{ij} - f_{ij}^{*}\right)^{2}}{f_{ij}^{*}}$$

• These are the quadratic distances between category i of the x variable and category j of the y variable

The distance matrix

• The matrix χ² measures all of the associations between the categories of the first variable and those of the second one.

• A generalization to the multivariate case (MCA) is possible by stacking the matrix
  • Stacking: compose a large matrix by blocks, where each block is the contingency matrix for two variables (all possible associations are taken into consideration)
  • The stacked matrix is referred to as the Burt Table

• To obtain similarity values from the χ² matrix:
  • compute the square root of the elemental Chi-square values
  • use the appropriate sign (the sign of the difference fij – fij*)
  • large positive values correspond to strongly associated categories
  • large negative values identify those categories where the association is strong but negative, indicating dissimilarity
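
To illustrate the signed similarity values, the sketch below applies the recipe (square root of each cell's Chi-square value, carrying the sign of fij – fij*) to the Trust table. The counts come from that table; the dictionary layout and variable names are assumptions made for the example.

```python
import math

rows = ["I am not employed", "Non manual employee", "Manual employee", "Executive",
        "Self employed professional", "Farmer / agricultural worker",
        "Employer / Entrepreneur", "Other"]
cols = ["Value", "Standard", "Organic", "Luxury"]
counts = [
    [17, 50, 10, 17], [11, 74, 14, 28], [6, 19, 4, 8], [0, 7, 6, 14],
    [1, 18, 7, 3], [1, 1, 1, 0], [0, 4, 2, 3], [11, 31, 1, 1],
]

n = sum(map(sum, counts))
f = [[x / n for x in row] for row in counts]   # relative frequencies f_ij
f_row = [sum(row) for row in f]                # row masses f_i0
f_col = [sum(col) for col in zip(*f)]          # column masses f_0j

signed = {}
for i, row_name in enumerate(rows):
    for j, col_name in enumerate(cols):
        f_star = f_row[i] * f_col[j]                   # expected under independence
        chi2_ij = (f[i][j] - f_star) ** 2 / f_star     # cell Chi-square value
        sign = 1.0 if f[i][j] >= f_star else -1.0      # sign of f_ij - f_ij*
        signed[(row_name, col_name)] = sign * math.sqrt(chi2_ij)

strongest = max(signed, key=signed.get)   # most strongly associated pair
weakest = min(signed, key=signed.get)     # strongest negative association
print(strongest, weakest)
```

On these data the largest positive value belongs to the Executive and 'Luxury' pair, in line with the reading of the plot that executives prefer 'Luxury' chicken.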

Estimation

• The resulting matrix D contains metric and continuous similarity data

• It is possible to apply PCA to translate such a matrix into coordinates for each of the categories, first those of x, then those of y

• Before PCA can be applied, some normalization is required so that the input matrix becomes similar to a correlation matrix

• The use of the square root of the row (or column) masses for normalizing the values in D represents the key difference from PCA

• The rest of the estimation process follows the results of the PCA

• As for PCA, eigenvalues are computed, one for each dimension, which can be used to evaluate the proportion of dissimilarity maintained by that dimension

Inertia

• Inertia is a measure of association between two categorical variables based on the Chi-squared statistic.

• In correspondence analysis the proportion of inertia explained by each of the dimensions can be regarded as a measure of goodness-of-fit because the effectiveness of correspondence analysis depends on the degree of association between x and y

• Total inertia
  – is a measure of the overall association between x and y
  – is equal to the sum of the eigenvalues
  – corresponds to the Chi-square value divided by the number of observations
  – A total inertia above 0.20 is expected for adequate representations

• Inertia values can be computed for each of the dimensions and represent the contribution of that dimension to the association (Chi-square) between the two variables

SPSS example

• EFS data set:
  • economic position of the household reference person (a093)
  • type of tenure (a121)

• Their Pearson Chi-square value is 274, which means significant association at the 99.9% confidence level

Analysis

Define the range, i.e. the categories for each variable that enter the analysis

Some categories can be indicated as supplementary: they appear in the graphical representation, but do not influence the actual estimation of the scores

Model options

Choose the number of dimensions to be retained

Choice of distance measure

Standardization (only for Euclidean distance)

Normalization

Which variable should be privileged?

Number of dimensions

• The maximum number of dimensions for the analysis is equal to
  • the number of rows minus one, or
  • the number of columns minus one (whichever is smaller)

• In our example, the maximum number of dimensions would be five, which reduces to four due to missing values in one row category.

• As shown later in this section one may then choose to graphically represent only a sub-set of the extracted dimensions (usually two or three) to make interpretation easier
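
The rule for the maximum number of dimensions is simple enough to write down directly. In this small sketch the function name is hypothetical, and the category counts (six economic-position rows, of which one drops out, and eight tenure-type columns) are read off the output tables later in the deck.

```python
def max_dimensions(n_rows: int, n_cols: int) -> int:
    """Maximum number of dimensions: the smaller of rows - 1 and columns - 1."""
    return min(n_rows - 1, n_cols - 1)

# EFS example: six row categories and eight column categories.
print(max_dimensions(6, 8))
# With one row category dropped because of missing values.
print(max_dimensions(5, 8))
```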

Distance measure

• Chi-square distance (as discussed earlier)

• Euclidean distance
  • uses the square root of the sum of squared differences between pairs of rows and pairs of columns
  • this also requires one to choose a method for centering the data (see the SPSS manual for details)

• For this example standard correspondence analysis (with the Chi-square distance) does not require a standardization method.

Normalization method

• Defines how correspondence analysis is run: whether to give priority to comparisons between the categories for x (rows) or those for y (columns)

• This choice influences the way distances are summarized by the first dimensions

• Row principal normalization: the Euclidean distances in the final bivariate plot of x and y are as close as possible to the Chi-square distances between the rows, that is the categories of x

• The opposite is valid for the column principal method

• Symmetrical normalization: the distances on the graph resemble as much as possible the distances for both x and y by spreading the total inertia symmetrically

• Principal normalization: inertia is first spread over the scores for x, then y

• Weighted normalization: defines a weighting value between minus one and plus one, where minus one is column principal, zero is symmetrical and plus one is row principal

• EFS example: the row principal method is more appropriate, as it is more relevant to see how differences in socio-economic conditions impact on the tenure type than to look at distances between tenure types.
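
One common way to express these normalizations is as a rescaling of standard coordinates by powers of the singular values. The sketch below assumes that convention, with the weighting value q mapped to exponents (1 + q)/2 for rows and (1 - q)/2 for columns; this mapping is an assumption for illustration and should be checked against the SPSS documentation. The function name is hypothetical, and the example values are taken from the first two dimensions of the EFS output (the row value is the "Fulltime employee" score divided by the singular value).

```python
def scale_scores(std_rows, std_cols, singular_values, q):
    """Rescale standard coordinates; q = +1 row principal, 0 symmetrical, -1 column principal."""
    if not -1 <= q <= 1:
        raise ValueError("q must lie between -1 and +1")
    row_pow = (1 + q) / 2   # share of each singular value given to the rows
    col_pow = (1 - q) / 2   # share of each singular value given to the columns
    rows = [[x * sv ** row_pow for x, sv in zip(point, singular_values)] for point in std_rows]
    cols = [[x * sv ** col_pow for x, sv in zip(point, singular_values)] for point in std_cols]
    return rows, cols

singular_values = [0.669, 0.209]                  # first two dimensions, EFS example
std_cols = [[0.971, 0.371]]                       # "Owned with mortgage" column scores
std_rows = [[0.527 / 0.669, 0.049 / 0.209]]       # "Fulltime employee", back to standard scale

row_principal = scale_scores(std_rows, std_cols, singular_values, q=1)
symmetrical = scale_scores(std_rows, std_cols, singular_values, q=0)
print(row_principal)
print(symmetrical)
```

With q = 1 the row scores absorb the whole singular value and the column scores are left in standard form, which is why row distances, not column distances, are the ones preserved on the map. Principal normalization (both sets scaled by the full singular value) is not covered by this single-parameter form.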

Additional statistics

Although CA is a nonparametric method, it is possible to compute standard deviations and correlations under the assumption of multinomial distribution of the cell frequencies (when data are obtained as a random sample from a normally distributed population)

Allows one to order the categories of x and y using scores obtained from CA

E.g. the tenure types and the socio-economic conditions might follow some ordering but cannot be defined with sufficient precision to consider these variables as ordinal. One can use the scores in the first dimension (or the first two) to order the categories and produce a permutated correspondence table.

Plots

Three graphs:

• Biplot (both x & y)

• x only (rows)

• y only (columns)

One usually chooses to represent only the first two or three of the extracted dimensions

Output

Summary

                                                        Proportion of Inertia      Confidence Singular Value
Dimension  Singular  Inertia  Chi Square  Sig.          Accounted  Cumulative      Standard   Correlation
           Value                                        for                        Deviation  2       3       4
1          .669      .447                               .850       .850            .031       .094    -.032   -.022
2          .209      .044                               .083       .933            .055               .011    .081
3          .173      .030                               .057       .990            .055                       -.042
4          .072      .005                               .010       1.000           .053
Total                .526     231.402     .000(a)       1.000      1.000

a. 24 degrees of freedom

The SV is the square root of inertia (the eigenvalue)

The Chi-square stat suggests strong and significant association

The first dimension explains 85%, the first two 93% of total inertia. However, note that total inertia does not correspond to total variability, but to the variability of the extracted dimensions

Usually a value of total inertia above 0.2 is regarded as acceptable

These precision measures are based on the multinomial distribution assumption
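
The relations between the columns of the Summary table can be checked with a few lines of arithmetic. The sketch below uses only the figures printed in the table (rounded to three decimals, so results agree up to rounding); the variable names are illustrative.

```python
# Figures copied from the Summary table (rounded to three decimals).
singular_values = [0.669, 0.209, 0.173, 0.072]
chi_square = 231.402
reported_total_inertia = 0.526

# The inertia of each dimension is the squared singular value (the eigenvalue).
inertias = [sv ** 2 for sv in singular_values]
total_inertia = sum(inertias)

# Proportion of inertia accounted for, and its running total.
proportions = [x / total_inertia for x in inertias]
cumulative = [sum(proportions[: k + 1]) for k in range(len(proportions))]

# Total inertia is the Chi-square value divided by the number of observations,
# so the table implies a sample size of roughly chi_square / total inertia.
implied_n = chi_square / reported_total_inertia

print([round(x, 3) for x in inertias], round(total_inertia, 3))
print([round(x, 3) for x in cumulative], round(implied_n))
```

The recomputed total inertia and cumulative proportions match the reported values to within rounding, and the implied number of observations is about 440.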

Row scores

Overview Row Points (b)

Mass, score in dimension and inertia

Economic position of Household    Mass    Score in Dimension                Inertia
Reference Person                          1       2        3       4
Self-employed                     .080    .296    .025     .433    -.164    .024
Fulltime employee                 .539    .527    .049     -.039   .026     .152
Pt employee                       .077    -.239   -.409    -.352   -.143    .028
Unemployed                        .018    -.154   -1.223   .509    .241     .033
Work related govt train prog (a)  .000    .       .        .       .        .
Ret unoc over min ni age          .286    -.999   .089     .015    .019     .288
Active Total                      1.000                                     .526

Contribution of point to inertia of dimension

Economic position of Household    1       2       3       4
Reference Person
Self-employed                     .016    .001    .496    .407
Fulltime employee                 .334    .030    .027    .071
Pt employee                       .010    .295    .318    .300
Unemployed                        .001    .622    .157    .202
Work related govt train prog (a)  .000    .000    .000    .000
Ret unoc over min ni age          .639    .052    .002    .020
Active Total                      1.000   1.000   1.000   1.000

Contribution of dimension to inertia of point

Economic position of Household    1       2       3       4       Total
Reference Person
Self-employed                     .290    .002    .620    .089    1.000
Fulltime employee                 .984    .008    .005    .002    1.000
Pt employee                       .156    .453    .336    .055    1.000
Unemployed                        .013    .814    .141    .032    1.000
Work related govt train prog (a)  .       .       .       .       .
Ret unoc over min ni age          .992    .008    .000    .000    1.000

a. Supplementary point

b. Row Principal normalization

The mass column shows the relative weight of each category on the sample

Scores are computed for each category but the supplemental one, provided there are no missing data

Scores are the coordinates for the map

Shows how total inertia has been distributed across rows (similar to communalities)

These categories have a higher relevance because they are more important categories in the original correspondence table. These two categories (especially retirement) strongly contribute to explaining the first dimension

The second dimension is characterized by unemployed and part-time employees
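
The link between mass, score and contribution can be verified from the printed row table. As a sketch, and assuming the usual definition of a point's contribution to a dimension (mass times squared principal score, divided by the inertia of that dimension), the first-dimension figures are recomputed below; the values are copied from the table and the names are illustrative.

```python
# (mass, score in dimension 1) for the active rows, copied from the row table.
rows = {
    "Self-employed": (0.080, 0.296),
    "Fulltime employee": (0.539, 0.527),
    "Pt employee": (0.077, -0.239),
    "Unemployed": (0.018, -0.154),
    "Ret unoc over min ni age": (0.286, -0.999),
}
inertia_dim1 = 0.447  # inertia of dimension 1 from the Summary table

# Under row principal normalization the mass-weighted sum of squared row
# scores on a dimension reproduces that dimension's inertia.
weighted_ss = sum(mass * score ** 2 for mass, score in rows.values())

# Contribution of each point to the inertia of dimension 1.
contributions = {
    name: mass * score ** 2 / inertia_dim1 for name, (mass, score) in rows.items()
}

print(round(weighted_ss, 3))
for name, value in contributions.items():
    print(f"{name}: {value:.3f}")
```

Up to rounding, the results agree with the "Contribution of point to inertia of dimension" column, where full-time employees and retired respondents dominate the first dimension.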

Column scores

• The same exercise is carried out on columns; however, the row principal method does not normalize by column

Overview Column Points (b)

Mass, score in dimension and inertia

Tenure - type                       Mass    Score in Dimension                 Inertia
                                            1        2        3        4
Local Authority rented unfurnished  .098    -.699    -1.993   .051     1.106    .039
Housing association                 .066    -.781    -1.263   2.821    -1.273   .039
Other rented unfurnished            .050    .487     -2.023   -2.190   .891     .022
Rented furnished                    .032    .531     -1.098   -2.270   -4.585   .014
Owned with mortgage                 .457    .971     .371     .233     .133     .196
Owned by rental purchase            .002    1.179    1.120    -1.287   5.002    .002
Owned outright                      .295    -1.244   .819     -.382    .018     .214
Rent free (a)                       .009    -.957    -1.039   -2.996   -3.705   .007
Active Total                        1.000                                       .526

Contribution of point to inertia of dimension

Tenure - type                       1       2       3       4
Local Authority rented unfurnished  .048    .388    .000    .120
Housing association                 .040    .105    .524    .107
Other rented unfurnished            .012    .205    .240    .040
Rented furnished                    .009    .038    .164    .669
Owned with mortgage                 .431    .063    .025    .008
Owned by rental purchase            .003    .003    .004    .057
Owned outright                      .457    .198    .043    .000
Rent free (a)                       .000    .000    .000    .000
Active Total                        1.000   1.000   1.000   1.000

Contribution of dimension to inertia of point

Tenure - type                       1       2       3       4       Total
Local Authority rented unfurnished  .548    .436    .000    .016    1.000
Housing association                 .462    .118    .405    .014    1.000
Other rented unfurnished            .245    .413    .333    .010    1.000
Rented furnished                    .284    .119    .349    .248    1.000
Owned with mortgage                 .982    .014    .004    .000    1.000
Owned by rental purchase            .725    .064    .058    .153    1.000
Owned outright                      .954    .040    .006    .000    1.000
Rent free (a)                       .512    .059    .338    .090    1.000

a. Supplementary point

b. Row Principal normalization

By column, the first dimension is especially related to the "owned with mortgage" and "owned outright" categories

Bi-plot

Employed individuals are closer to owned accommodations

Retired individuals are also close to owned accommodations

Part-time employees and unemployed individuals are closer to rented accommodations and other forms of accommodations

Multiple Correspondence Analysis (MCA)

When all variables are multiple nominal, then optimal scaling applies MCA

Plot with 3 variables

The analysis now also includes the government office region

SAS correspondence analysis

• SAS procedure: proc CORRESP
  • simple correspondence analysis
  • multiple correspondence analysis (option MCA)

• same types of normalization as SPSS

• option PROFILE (ROW, COLUMN or BOTH)

