Date posted: 15-Jan-2016
Contingency Table and Correspondence Analysis
Nishith Kumar, Department of Statistics, BSMRSTU
and
Mohammed Nasser, Department of Statistics, RU
Overview
1. Contingency table
2. Some real-world problems for contingency tables
3. Pearson chi-squared test
4. Probabilistic interpretation of matrices
5. Contingency tables: homogeneity and heterogeneity
6. Historical background of correspondence analysis
7. Correspondence analysis (CA)
8. Correspondence analysis and eigenvalues
9. Singular value decomposition
10. Calculation procedure of CA
11. Interpretation of correspondence analysis
12. R code and examples
13. Conclusion
Contingency Table
In statistics, a contingency table (also referred to as cross tabulation or cross tab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. The term contingency table was first used by Karl Pearson in 1904.
A contingency table is sometimes called an incidence matrix.
Contingency tables are often used in the social sciences (such as sociology, education and psychology). They can be considered frequency tables whose rows and columns correspond to the categories of categorical variables. If a variable is continuous, we can divide its range into bins and so convert it into a categorical variable.
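As a minimal sketch of the binning step, the following R code cuts a continuous variable into categories with cut() and cross-tabulates it with table(). The ages, the health labels and the break points are made-up illustrative values, not data from these slides.

```r
# Hypothetical illustrative data (not taken from the slides)
age    <- c(18, 23, 31, 37, 44, 52, 58, 63, 70, 81)
health <- factor(c("Good", "Good", "Good", "Regular", "Good",
                   "Regular", "Bad", "Regular", "Bad", "Bad"),
                 levels = c("Good", "Regular", "Bad"))

# Convert the continuous variable into a categorical one using bins
# (intervals are (15,35], (35,55] and (55,Inf])
age_group <- cut(age, breaks = c(15, 35, 55, Inf),
                 labels = c("16-35", "36-55", "56+"))

# Cross-tabulate the two categorical variables: this is a contingency table
tab <- table(age_group, health)
print(tab)
```

The choice of break points is a modelling decision; different bins give a different table.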
Real Problem
Cross-tabulation of age groups by perceived health status

Age      Very Good   Good   Regular   Bad   Very Bad
16-24          243    789       167    18          6
25-34          220    809       164    35          6
35-44          147    658       181    41          8
45-54           90    469       236    50         16
55-64           53    414       306   106         30
65-74           44    267       284    98         20
75+             20    136       157    66         17

1. Is there any relation between age group and perceived health status?
2. How can you visualize this type of relationship?
3. How can you find the similarity of the row categories?
4. How can you interpret the distances between the categories of the row and column variables?
Real Problem
Suppose we have the following contingency table:

                    Smoking behavior
                    none   light   medium   heavy   Total
Senior managers        4       2        3       2      11
Junior managers        4       3        7       4      18
Senior employees      25      10       12       4      51
Junior employees      18      24       33      13      88
Secretaries           10       6        7       2      25
Total                 61      45       62      25     193

1. How can we analyze contingency-table data?
2. How can we convert frequency-table data into graphical displays?
3. How can we find the similarity of the column categories?
4. How can we find the similarity of the row categories?
5. How can we find the relationship between the row and column categories simultaneously?
Real Problem
Survey of the effects of four different drug types. Patients gave a score to each drug type (excellent, very good, good, fair, poor). The total number of responses is 121.

          excellent   very good   good   fair   poor
Drug A            6           8     10      1      5
Drug B           12           8      3      3      5
Drug C            0           3     12      6     10
Drug D            1           1      8     12      7

1. Is there an association between columns and rows?
2. If there is some association, how can we find some structure in this data table?
3. Can we order columns and rows by their closeness?
4. Can we find associations between columns and rows?
Pearson chi-squared test
Suppose that we have a data matrix $X$ with $I$ rows and $J$ columns. The elements of the matrix are $x_{ij}$. Let us use the following notation:

$$n = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}, \qquad P = X/n,$$
$$r = P\mathbf{1}, \qquad c = P^{T}\mathbf{1},$$
$$D_r = \mathrm{diag}(r), \qquad D_c = \mathrm{diag}(c),$$
$$R = D_r^{-1}P, \qquad C = D_c^{-1}P^{T}, \qquad Q = P - rc^{T}.$$

$r$ and $c$ are the row and column sums of $P$, and $R$ and $C$ are the row and column profiles, respectively. $Q$ is the difference between $P$ and the product of the row and column sums.
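The notation above can be sketched in base R for the smoke data shown earlier. This is only an illustration: the object names (cm for the column sums, to avoid masking base::c, and the short row labels) are choices made here, not part of the original slides.

```r
# Smoke data from the slides (rows: staff groups, columns: smoking behavior)
X <- matrix(c( 4,  2,  3,  2,
               4,  3,  7,  4,
              25, 10, 12,  4,
              18, 24, 33, 13,
              10,  6,  7,  2),
            nrow = 5, byrow = TRUE,
            dimnames = list(c("SM", "JM", "SE", "JE", "SC"),
                            c("none", "light", "medium", "heavy")))

n  <- sum(X)                 # grand total
P  <- X / n                  # P = X / n
r  <- rowSums(P)             # r = P 1   (row sums of P)
cm <- colSums(P)             # c = P^T 1 (column sums of P)
Dr <- diag(r)                # D_r = diag(r)
Dc <- diag(cm)               # D_c = diag(c)
R  <- solve(Dr) %*% P        # row profiles,    R = D_r^{-1} P   (I x J)
C  <- solve(Dc) %*% t(P)     # column profiles, C = D_c^{-1} P^T (J x I)
Q  <- P - outer(r, cm)       # Q = P - r c^T
```

Each row of R and each row of C sums to 1, and the elements of Q sum to 0.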
Pearson chi-squared test (Cont.)
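More notation and relations:

$$\mathrm{in}(I) = \mathrm{tr}\left[D_r (R - \mathbf{1}c^{T}) D_c^{-1} (R - \mathbf{1}c^{T})^{T}\right] = \text{the total inertia of rows}$$
$$\mathrm{in}(J) = \mathrm{tr}\left[D_c (C - \mathbf{1}r^{T}) D_r^{-1} (C - \mathbf{1}r^{T})^{T}\right] = \text{the total inertia of columns}$$

For the columns:

$$\begin{aligned}
\mathrm{in}(J) &= \mathrm{tr}\left[D_c (C - \mathbf{1}r^{T}) D_r^{-1} (C - \mathbf{1}r^{T})^{T}\right] \\
&= \mathrm{tr}\left[D_c (D_c^{-1}P^{T} - \mathbf{1}r^{T}) D_r^{-1} (D_c^{-1}P^{T} - \mathbf{1}r^{T})^{T}\right] \\
&= \mathrm{tr}\left[Q^{T} D_r^{-1} Q D_c^{-1}\right] \\
&= \chi^{2}/n
\end{aligned}$$

For the rows:

$$\begin{aligned}
\mathrm{in}(I) &= \mathrm{tr}\left[D_r (R - \mathbf{1}c^{T}) D_c^{-1} (R - \mathbf{1}c^{T})^{T}\right] \\
&= \mathrm{tr}\left[D_r (D_r^{-1}P - \mathbf{1}c^{T}) D_c^{-1} (D_r^{-1}P - \mathbf{1}c^{T})^{T}\right] \\
&= \mathrm{tr}\left[Q D_c^{-1} Q^{T} D_r^{-1}\right] \\
&= \chi^{2}/n
\end{aligned}$$

Hence the relation $\mathrm{in}(I) = \mathrm{in}(J)$ is true.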
The row and column inertias are both equal to the chi-squared statistic, which has (I-1)(J-1) degrees of freedom, multiplied by 1/n. If P is a matrix of probabilities and there is no association between rows and columns, then Q is 0. This is equivalent to saying that rows and columns are independent.
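The relation between the inertias and the chi-squared statistic can be checked numerically. The sketch below uses base R only and the drug data from the earlier slide; the object names are illustrative choices, and chisq.test() is wrapped in suppressWarnings() because some expected counts are below 5.

```r
# Drug data from the slides
X <- matrix(c( 6, 8, 10,  1,  5,
              12, 8,  3,  3,  5,
               0, 3, 12,  6, 10,
               1, 1,  8, 12,  7),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste("Drug", LETTERS[1:4]),
                            c("excellent", "verygood", "good", "fair", "poor")))

n  <- sum(X)
P  <- X / n
r  <- rowSums(P)
cm <- colSums(P)              # column sums (named cm to avoid masking base::c)
Q  <- P - outer(r, cm)        # Q = P - r c^T

# in(I) = tr(Q Dc^{-1} Q^T Dr^{-1}) and in(J) = tr(Q^T Dr^{-1} Q Dc^{-1})
in_I <- sum(diag(Q %*% diag(1 / cm) %*% t(Q) %*% diag(1 / r)))
in_J <- sum(diag(t(Q) %*% diag(1 / r) %*% Q %*% diag(1 / cm)))

# Pearson chi-squared statistic from base R
chi2 <- unname(suppressWarnings(chisq.test(X))$statistic)

print(c(in_I = in_I, in_J = in_J, chi2_over_n = chi2 / n))
```

All three printed numbers should agree, which is the statement in(I) = in(J) = chi-squared / n.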
Pearson chi-squared test (Cont.)
For Smoke Data:
Chi squared = 16.4416, df = 12, p-value = 0.1718

Principal inertias:
                    1          2          3
Value        0.074759   0.010017   0.000414
Percentage     87.76%     11.76%      0.49%

Rows:           SM         JM         SE         JE         SC
Inertia   0.002673   0.011881   0.038314   0.026269   0.006053

Columns:      none      light     medium      heavy
Inertia   0.049186   0.007059   0.012610   0.016335

As we have seen, chi-squared value = total inertia * grand total, and df = (no. of rows - 1) * (no. of columns - 1).

R Code:
library(ca)
library(MASS)
data("smoke")        # smoke data set shipped with the ca package
ca(smoke)            # principal inertias, row and column inertias
chisq.test(smoke)    # Pearson chi-squared test
Pearson chi-squared test (Cont.)
Drug Data:
Principal inertias:
                    1          2          3
Value        0.304667   0.077342   0.007015
Percentage     78.32%     19.88%       1.8%

Rows:       Drug A     Drug B     Drug C     Drug D
Inertia   0.055280   0.143372   0.071340   0.119030

Columns:  excellent   verygood       good       fair       poor
Inertia    0.152430   0.060843   0.044719   0.111385   0.019646

Chi-squared value = total inertia * grand total, df = (no. of rows - 1) * (no. of columns - 1)
Chi squared = 47.0718, df = 12, p-value = 4.53e-06
I.e. there is strong evidence of a row-column association.

R Code:
library(ca)
library(MASS)
# drug: data frame of counts, read in as shown in the section "Correspondence Analysis in Drug data"
ca(drug)
chisq.test(drug)
Pearson chi-squared test (Cont.)
Health Data:
Principal inertias:
                    1          2          3          4
Value        0.136603    0.00209   0.001292   0.000474
Percentage     97.25%      1.49%      0.92%      0.34%

Rows:        16-24      25-34      35-44      45-54      55-64      65-74        75+
Inertia   0.027020   0.021316   0.006900   0.001667   0.022711   0.033288   0.027557

Columns:        VG       GOOD        REG        BAD         VB
Inertia   0.024279   0.022368   0.045823   0.037955   0.010034

Chi-squared value = total inertia * grand total, df = (no. of rows - 1) * (no. of columns - 1)
Chi squared = 894.8607, df = 24, p-value < 2.2e-16
I.e. there is strong evidence of a row-column association.

R Code:
library(ca)
library(MASS)
# health: data frame of counts, read in as shown in the section "Correspondence analysis in Health Data"
ca(health)
chisq.test(health)
Probabilistic Interpretation of Matrices
If the matrix $P = X/n$ is regarded as a probability matrix, i.e. each element $p_{ij}$ is the probability that row category $i$ and column category $j$ occur simultaneously, then the matrices involved have the following interpretation:
1) Elements of r are the marginal probabilities of rows. Elements of c are the marginal probabilities of columns.
2) Elements of Q are the differences between the joint probability and the product of the individual probabilities. In some sense this matrix represents the degree of dependence between rows and columns.
3) Elements of R are the conditional probabilities of columns when the row is known.
4) Elements of C are the conditional probabilities of rows when the column is known.
5) Total inertia is the overall indicator of the dependence between rows and columns.
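These probabilistic quantities can be sketched with prop.table() in base R. The example uses the drug data from the earlier slide; the object names are illustrative, and cm stands for the vector the slides call c.

```r
# Drug data from the slides
X <- matrix(c( 6, 8, 10,  1,  5,
              12, 8,  3,  3,  5,
               0, 3, 12,  6, 10,
               1, 1,  8, 12,  7),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste("Drug", LETTERS[1:4]),
                            c("excellent", "verygood", "good", "fair", "poor")))

P  <- prop.table(X)                    # joint probabilities p_ij
r  <- rowSums(P)                       # marginal probabilities of rows
cm <- colSums(P)                       # marginal probabilities of columns
Q  <- P - outer(r, cm)                 # joint minus product of marginals
R  <- prop.table(X, margin = 1)        # P(column | row): rows sum to 1
C  <- t(prop.table(X, margin = 2))     # P(row | column): rows sum to 1 (J x I)

print(round(R, 4))
print(round(C, 4))
```

For instance, R["Drug A", "excellent"] is the probability of an "excellent" score given Drug A, and C["excellent", "Drug B"] is the probability that an "excellent" score belongs to Drug B.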
Marginal probability of Drug Data

          excellent   very good   good   fair   poor   Total
Drug A            6           8     10      1      5      30
Drug B           12           8      3      3      5      31
Drug C            0           3     12      6     10      31
Drug D            1           1      8     12      7      29
Total            19          20     33     22     27     121

$$P = \frac{X}{n}$$

                        Excellent   Very Good      Good      Fair     Poor   Marginal probability of drug type
Drug A                  0.0495868    0.066116   0.08264   0.00826   0.0413   0.248
Drug B                  0.0991736    0.066116   0.02479   0.02479   0.0413   0.256
Drug C                  0            0.024793   0.09917   0.04959   0.0826   0.256
Drug D                  0.0082645    0.008264   0.06612   0.09917   0.0579   0.24
Marginal probability
of patient score        0.1570248    0.165289   0.27273   0.18182   0.2231   1

1) Elements of r are the marginal probabilities of rows (drug types). Elements of c are the marginal probabilities of columns (patient scores).
Degree of dependencies of rows and columns
2) Elements of Q are the differences between the joint probability and the product of the individual probabilities. In some sense this matrix represents the degree of dependence between rows and columns.

Q =
            excellent     very good           good          fair          poor
Drug A     0.01065501    0.02513490   0.0150262960  -0.036814425  -0.014001776
Drug B     0.05894406    0.02376887  -0.0450788881  -0.021788129  -0.015845912
Drug C    -0.04022949   -0.01755345   0.0293012772   0.003005259   0.025476402
Drug D    -0.02936958   -0.03135032   0.0007513148   0.055597295   0.004371286

For the R code, see the slide "Calculations of Inertia to Find Out the Homogeneity or Heterogeneity".
Conditional Probabilities and Inertias
3) Elements of R are the conditional probabilities of columns when the row is known.

R =
           excellent    very good         good         fair        poor
Drug A    0.20000000   0.26666667   0.33333333   0.03333333   0.1666667
Drug B    0.38709677   0.25806452   0.09677419   0.09677419   0.1612903
Drug C    0.00000000   0.09677419   0.38709677   0.19354839   0.3225806
Drug D    0.03448276   0.03448276   0.27586207   0.41379310   0.2413793

4) Elements of C are the conditional probabilities of rows when the column is known.

C =
                Drug A       Drug B      Drug C       Drug D
Excellent   0.31578947   0.63157895   0.0000000   0.05263158
Very good   0.40000000   0.40000000   0.1500000   0.05000000
Good        0.30303030   0.09090909   0.3636364   0.24242424
Fair        0.04545455   0.13636364   0.2727273   0.54545455
Poor        0.18518519   0.18518519   0.3703704   0.25925926

5) Total inertia is the overall indicator of the dependence between rows and columns. A small inertia indicates little or no row-column association.

Similarly, we can find the following measures for the Smoke data and the Health Status data:
i) Marginal probabilities
ii) Degree of dependence of rows and columns
iii) Conditional probabilities
iv) Inertias
Contingency Tables: Homogeneity and Heterogeneity
$t = \mathrm{in}(I) = \mathrm{in}(J) = \chi^{2}/n$ is the coefficient of association known as Pearson's mean-square contingency.
It is the total inertia. The total inertia is a measure of the homogeneity or heterogeneity of the table.
A large t indicates that the table is heterogeneous and a small t indicates that it is homogeneous.
Homogeneity means that there is no row-column association.
t can also be calculated using:

$$t = \sum_{i=1}^{I} r_i \sum_{j=1}^{J} \frac{(p_{ij}/r_i - c_j)^{2}}{c_j}$$
Contingency Tables: Homogeneity and Heterogeneity( Cont.)
$$t = \sum_{i=1}^{I} r_i \sum_{j=1}^{J} \frac{(p_{ij}/r_i - c_j)^{2}}{c_j}$$

We can interpret this formula in the following way:
The second (inner) summation is a weighted squared distance between the vector of relative frequencies of the ith row (i.e. the ith row profile, $p_{ij}/r_i$) and the average row profile, $c$. The inverses of the elements of c are the weights.
• It is known as the chi-squared distance between the ith row profile and the average row profile.
• The total inertia is a further weighted sum of the I chi-squared distances.
• The weights are the elements of r.
• If all row profiles are close to the average row profile then the table is homogeneous. Otherwise the table is heterogeneous.
We can do similar calculations for the column profiles, simply by exchanging the roles of r and c.
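The interpretation above can be illustrated with a short vectorized sketch in base R: compute the chi-squared distance of each row profile from the average row profile, then weight the distances by r. The drug data are taken from the earlier slide; the names d2, row_inertia and t_total are illustrative choices.

```r
# Drug data from the slides
X <- matrix(c( 6, 8, 10,  1,  5,
              12, 8,  3,  3,  5,
               0, 3, 12,  6, 10,
               1, 1,  8, 12,  7),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste("Drug", LETTERS[1:4]),
                            c("excellent", "verygood", "good", "fair", "poor")))

P  <- X / sum(X)
r  <- rowSums(P)     # weights of the rows
cm <- colSums(P)     # average row profile (the vector c of the slides)
R  <- P / r          # row profiles: row i of P divided by r[i]

# Chi-squared distance between each row profile and the average row profile
d2 <- apply(R, 1, function(p) sum((p - cm)^2 / cm))

row_inertia <- r * d2           # contribution of each row to the total inertia
t_total     <- sum(row_inertia) # total inertia t

print(round(row_inertia, 6))
print(t_total)
```

Rows with a large weighted distance (here Drug B and Drug D) contribute most to the heterogeneity of the table.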
Calculations of Inertia to Find Out the Homogeneity or Heterogeneity
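$$t = \sum_{i=1}^{I} r_i \sum_{j=1}^{J} \frac{(p_{ij}/r_i - c_j)^{2}}{c_j}$$

We can calculate t in R with the following code:

library(ca)
library(MASS)                      # for ginv()
###### Read Data ######
# drug: data frame of counts, read in as shown in the section "Correspondence Analysis in Drug data"
###### Probability Matrix ######
pdrag <- as.matrix(drug) / 121
c <- colSums(pdrag)                # marginal probabilities of columns
r <- rowSums(pdrag)                # marginal probabilities of rows
Dr <- diag(r)
Dc <- diag(c)
q <- pdrag - r %*% t(c)            # Q = P - r c^T
R <- ginv(Dr) %*% pdrag            # row profiles
C <- ginv(Dc) %*% t(pdrag)         # column profiles
###### Total inertia ######
sp <- 0
tsp <- 0
t <- 0
for (i in 1:4) {
  for (j in 1:5) {
    sp[j] <- (pdrag[i, j] / r[i] - c[j])^2 / c[j]
  }
  tsp[i] <- sum(sp)                # chi-squared distance of row i from the average profile
  t[i] <- r[i] * tsp[i]            # weighted by the row mass
}
ti <- sum(t)                       # total inertia

Total inertia for the Drug data is t = 0.3890234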
Historical Background of Correspondence Analysis
Correspondence analysis (CA) was first proposed by Hirschfeld (1935).
Later CA was developed by Jean-Paul Benzécri (1973).
The CA solution was shown by Greenacre (1984).
It was incorporated in R in 2009.
Correspondence Analysis
Correspondence analysis is a statistical technique used to analyze categorical data (Benzecri, 1992) and provides a graphical representation of cross tabulations or contingency tables.
Correspondence analysis (CA) can be viewed as a generalized principal component analysis tailored for the analysis of qualitative data.
Although CA was originally created to analyze cross tabulations, it is so versatile that it is used with many other types of numerical data tables. It is formally applicable to any data matrix with nonnegative entries.
Objectives of CA
The main objective of CA is to transform a data table into two sets of factor scores (one for the rows and one for the columns) that give the best representation of the similarity structure of the rows and columns of the table.
Correspondence analysis is used to reduce the dimension of a data matrix, as in principal component analysis. So using CA we can visualize the data in two or three dimensions.
Correspondence analysis and eigenvalues
For a given contingency table we calculate the row and column profiles. We want to find a vector $g$ that, when multiplied by the row profiles from the left ($Rg$), has the highest possible variance. It means that we want to maximize

$$(Rg - \mathbf{1}c^{T}g)^{T} D_r (Rg - \mathbf{1}c^{T}g) \to \max$$

To make this problem solvable we add additional constraints (similar to PCA). We want the weighted norm of the vector to be 1 and its weighted mean to be 0. The weights are the column sums.

$$g^{T}D_c g = 1, \qquad c^{T}g = 0$$

So we have to maximize

$$(Rg)^{T}D_r Rg = g^{T}P^{T}D_r^{-1}D_rD_r^{-1}Pg = g^{T}P^{T}D_r^{-1}Pg \to \max$$
Correspondence analysis and eigenvalues (cont.)
Maximize $(Rg)^{T}D_rRg = g^{T}P^{T}D_r^{-1}Pg$ subject to the condition $g^{T}D_cg = 1$.

To maximize the function we can use the Lagrange multiplier technique. The Lagrange function is

$$L = g^{T}P^{T}D_r^{-1}Pg + \lambda(1 - g^{T}D_cg)$$

Differentiating $L$ with respect to $g$ and setting the derivative equal to zero:

$$\frac{\partial L}{\partial g} = 0 \;\Rightarrow\; P^{T}D_r^{-1}Pg = \lambda D_cg \;\Rightarrow\; D_c^{-1}P^{T}D_r^{-1}Pg = \lambda g \;\Rightarrow\; CRg = \lambda g$$

Thus the problem reduces to an eigenvalue problem. As a result we obtain the principal coordinates for the columns. Similarly we can find the principal coordinates for the rows.
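As a numerical illustration of this eigenvalue problem, the sketch below solves it for the drug data in base R. To stay with a symmetric matrix it substitutes g = Dc^{-1/2} u, which turns P^T Dr^{-1} P g = lambda Dc g into A u = lambda u with A = Dc^{-1/2} P^T Dr^{-1} P Dc^{-1/2}; this substitution and all object names are choices made here, not part of the original slides.

```r
# Drug data from the slides
X <- matrix(c( 6, 8, 10,  1,  5,
              12, 8,  3,  3,  5,
               0, 3, 12,  6, 10,
               1, 1,  8, 12,  7),
            nrow = 4, byrow = TRUE)

P  <- X / sum(X)
r  <- rowSums(P)
cm <- colSums(P)                       # the vector c of the slides

# Symmetric form of the eigenproblem (g = Dc^{-1/2} u)
Dc_isqrt <- diag(1 / sqrt(cm))
A  <- Dc_isqrt %*% t(P) %*% diag(1 / r) %*% P %*% Dc_isqrt
ev <- eigen(A, symmetric = TRUE)
lambda <- ev$values

# The first eigenvalue is the trivial solution lambda = 1 (g constant),
# which violates the constraint c^T g = 0; the next ones are the principal inertias.
print(round(lambda, 6))

# Solution for the first non-trivial axis, back-transformed to g
g <- as.vector(Dc_isqrt %*% ev$vectors[, 2])
print(c(norm = sum(cm * g^2), mean = sum(cm * g)))  # constraints: 1 and 0
```

The non-trivial eigenvalues reproduce the principal inertias reported for the drug data (0.304667, 0.077342, 0.007015).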
This problem is solved easily and compactly if we use the singular value decomposition.
Singular Value Decomposition
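$$\underset{m\times n}{X} \;=\; \underset{m\times n}{U}\;\underset{n\times n}{\Lambda}\;\underset{n\times n}{V^{T}}, \qquad \text{where } n \le m$$

• X: a real matrix.
• U: column orthonormal, containing the eigenvectors of $XX^{T}$.
• Λ: diagonal matrix, containing the singular values of the matrix X.
• $V^{T}$: row orthonormal, containing the eigenvectors of $X^{T}X$.

$XV = U\Lambda$; the columns of $U\Lambda$ give the PCs.
The left singular vectors show the structure of the observations. The right singular vectors show the structure of the variables.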
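The decomposition can be checked with svd() in base R. The sketch below uses the smoke counts purely as an example of a real m × n matrix with n ≤ m; the object names are illustrative.

```r
# Any real m x n matrix with n <= m; here the 5 x 4 smoke counts
X <- matrix(c( 4,  2,  3,  2,
               4,  3,  7,  4,
              25, 10, 12,  4,
              18, 24, 33, 13,
              10,  6,  7,  2),
            nrow = 5, byrow = TRUE)

s <- svd(X)                      # s$u (m x n), s$d (singular values), s$v (n x n)
Lambda <- diag(s$d)

X_rec <- s$u %*% Lambda %*% t(s$v)   # X = U Lambda V^T
XV    <- X %*% s$v                   # XV = U Lambda
ULam  <- s$u %*% Lambda

# Squared singular values are the eigenvalues of X^T X
ev <- eigen(crossprod(X), symmetric = TRUE)$values

print(s$d)
```

The columns of s$u and s$v are orthonormal, so crossprod(s$u) and crossprod(s$v) are identity matrices up to rounding error.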
Correspondence Analysis Calculation Procedure
To obtain the coordinates using the SVD, the computational algorithm for the row and column profiles with respect to the principal axes is given below.

$$P = \frac{X}{n} \quad (n = \text{grand total of } X)$$
$$r = P\mathbf{1} \ (\text{row totals}), \qquad c = P^{T}\mathbf{1} \ (\text{column totals})$$
$$D_r = \mathrm{diag}(r), \qquad D_c = \mathrm{diag}(c) \quad (\text{diagonal matrices})$$

Calculate the matrix of standardized residuals and decompose it [using SVD]:

$$S = D_r^{-1/2}(P - rc^{T})D_c^{-1/2} = U\Lambda V^{T}$$

The principal coordinates of rows: $F = D_r^{-1/2}U\Lambda$
The principal coordinates of columns: $G = D_c^{-1/2}V\Lambda$
Standard row and column coordinates are $D_r^{-1/2}U$ and $D_c^{-1/2}V$, respectively.

U is an (m×n) column orthonormal matrix ($U^{T}U = I$) containing the eigenvectors of the symmetric matrix $SS^{T}$, and $V^{T}$ is an (n×n) row orthonormal matrix ($V^{T}V = I$) containing the eigenvectors of the symmetric matrix $S^{T}S$.

The first few (one or two) columns of F and G are usually taken and plotted simultaneously.
Interpretation of Correspondence analysis
The squares of the diagonal elements of Λ (the singular values) are called the principal inertias. The singular values are also related to the canonical correlations reported in R.
A larger value means that the corresponding axis has higher importance. It is usual to use the first one or two columns of F and G. These columns are then used for various plots.
For pictorial representation, either the columns or the rows are plotted in an ordered form, or a biplot is used to find possible associations between rows and columns as well as their order.
Correspondence analysis can be considered a dimension reduction technique and can be used together with others (for example PCA).
Comparative application of different dimension reduction techniques may give insight into the problem and the structure in the data.
Algorithm of Correspondence Analysis
1. Take a contingency table (X) and find the sum of all its elements (total sum = n).
2. Divide all elements by the total sum (call the result P).
3. Find the row and column sums of P (r and c).
4. Calculate the matrix of standardized residuals, $S = D_r^{-1/2}(P - rc^{T})D_c^{-1/2}$.
5. Find the SVD of S (equivalently, the generalized SVD of $P - rc^{T}$).
6. Find the principal row and column coordinates. Take the first few columns and plot them.
7. Analyze the results (order and closeness of columns and rows, possible associations between columns and rows).
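The steps above can be sketched in base R, without the ca package, for the drug data. This is an illustrative implementation under the definitions of these slides; the object names (Fr, Gc, k and so on) are choices made here.

```r
# Step 1: contingency table and grand total (drug data from the slides)
X <- matrix(c( 6, 8, 10,  1,  5,
              12, 8,  3,  3,  5,
               0, 3, 12,  6, 10,
               1, 1,  8, 12,  7),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste("Drug", LETTERS[1:4]),
                            c("excellent", "verygood", "good", "fair", "poor")))
n <- sum(X)

# Step 2: correspondence matrix
P <- X / n

# Step 3: row and column sums
r  <- rowSums(P)
cm <- colSums(P)                 # the vector c of the slides

# Step 4: standardized residuals S = Dr^{-1/2} (P - r c^T) Dc^{-1/2}
S <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))

# Step 5: SVD of S
dec <- svd(S)
principal_inertias <- dec$d^2

# Step 6: principal coordinates on the first k axes
k  <- 2
Fr <- diag(1 / sqrt(r))  %*% dec$u[, 1:k] %*% diag(dec$d[1:k])   # rows
Gc <- diag(1 / sqrt(cm)) %*% dec$v[, 1:k] %*% diag(dec$d[1:k])   # columns
rownames(Fr) <- rownames(X)
rownames(Gc) <- colnames(X)

# Step 7: inspect the results
print(round(principal_inertias, 6))
print(round(Fr, 4))
print(round(Gc, 4))
```

Plotting the two columns of Fr and Gc in the same scatter plot gives a map of the kind produced by plot(ca(drug)), up to the sign of each axis.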
Correspondence Analysis in Drug data
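R code:
library(ca)
# Read the drug data; the first column holds the row names
drug <- read.table(text = "
qlt excellent verygood good fair poor
DrugA 6 8 10 1 5
DrugB 12 8 3 3 5
DrugC 0 3 12 6 10
DrugD 1 1 8 12 7", row.names = 1, header = TRUE)
# Symmetric map, point size proportional to mass
plot(ca(drug), mass = c(TRUE, TRUE))
# Same map with the columns drawn as arrows
plot(ca(drug), mass = c(TRUE, TRUE), arrows = c(FALSE, TRUE))
summary(ca(drug))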
Biplot of Drug data using Correspondence Analysis
Principal inertias (eigenvalues):

dim     value         %    cum%
1       0.304667   78.3    78.3
2       0.077342   19.9    98.2
3       0.007015    1.8   100.0
        --------  -----
Total:  0.389023  100.0
Correspondence analysis in Smoke Data
Principal inertias (eigenvalues):

dim     value         %    cum%   scree plot
1       0.074759   87.8    87.8   *************************
2       0.010017   11.8    99.5   ***
3       0.000414    0.5   100.0

library(ca)
data("smoke")
plot(ca(smoke), mass = c(TRUE, TRUE))
summary(ca(smoke))
Biplot using Correspondence analysis
library(ca)
data("smoke")
plot(ca(smoke), mass = c(TRUE, TRUE), arrows = c(FALSE, TRUE))
Three Dimensional plot using Correspondence analysis
library(ca)
data("smoke")
plot3d.ca(ca(smoke, nd = 3))
Correspondence analysis in Health Data
library(ca)
health <- read.table(text = "
age VG GOOD REG BAD VB
16-24 243 789 167 18 6
25-34 220 809 164 35 6
35-44 147 658 181 41 8
45-54 90 469 236 50 16
55-64 53 414 306 106 30
65-74 44 267 284 98 20
75+ 20 136 157 66 17", row.names = 1, header = TRUE)
plot(ca(health), mass = c(TRUE, TRUE))
Biplot of Health Data using Correspondence Analysis
library(ca)
# health: the data frame of counts read in with read.table() for the Health Data
plot(ca(health), mass = c(TRUE, TRUE), arrows = c(FALSE, TRUE))
Conclusion
In conclusion we can say that correspondence analysis can
1. Convert frequency table data into graphical displays
2. Show the similarity of the row categories
3. Show the similarity of the column categories
4. Show the relationship between the row and column categories simultaneously
Although CA was originally created to analyze cross tabulations, it is so versatile that it is used with many other types of numerical data tables. It is formally applicable to any data matrix with nonnegative entries.
Future Studies
1. Study multiple correspondence analysis.
2. High-dimensional data analysis using correspondence analysis.
3. Assess the effect of outliers.
4. The first CA axis is reliable, but the second and later axes are quadratic distortions of the first, which produces the "arch effect". So a future study is how to solve this problem.
5. Application of CA to microarray data to find gene patterns and the similarity of gene structures.
6. Missing values and outliers are a general problem in microarray data. So the target is to propose a robust correspondence analysis method that can handle both outliers and missing values.
References
1. Benzécri, J.-P. (1973). L'Analyse des Données. Volume II. L'Analyse des Correspondances. Paris, France: Dunod.
2. Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press. ISBN 0-12-299050-1.
3. Greenacre, M. (2007). Correspondence Analysis in Practice, Second Edition. London: Chapman & Hall/CRC.
4. Nenadic, O. and Greenacre, M. (2007). "Correspondence Analysis in R, with Two- and Three-dimensional Graphics: The ca Package". Journal of Statistical Software, 20(3).
5. Hirschfeld, H.O. (1935). "A connection between correlation and contingency". Proc. Cambridge Philosophical Society, 31, 520–524.
Thank You so Much for Your Patience