Multivariate dependence analysis
José Antônio Cordeiro
Abstract. Introduction. In this work we propose some features for analysing cross-classification multidimensional tables, and test their effectiveness on artificially constructed and on real data. Methods. Based on the Hellinger distance, we propose measures of similarity between two levels of one variable through their profiles along all combinations of levels of the remaining variables, and coefficients that measure the strength of dependence of the levels of one variable, and of the whole variable as well. Tests based on these coefficients are proposed for reducing the complexity of the array by collapsing it along an independent variable, and for total dependence in the array as well. Results. When applied to artificial data the tests behaved as expected, since all the analysed situations were constructed and we knew what the association structures were; and when analysing the real data the whole methodology worked well, permitting new insights on already analysed data.
Key words. Multivariate dependence, cross-classification array, reducibility of multidimensional tables, Hellinger distance.
Introduction. Analysis of cross-classification tables and arrays is a much-studied theme among statisticians, and the developed techniques are widely applied in many areas of knowledge, such as medicine, biology and the social sciences. The most published and used methods are log-linear modeling (Bishop, Fienberg and Holland, 1975; Fienberg, 1980; Agresti, 2002), ordered and unordered association models (Goodman, 1954, 1981, 1996), weighted least-squares modeling (Grizzle, Starmer and Koch, 1969), simple (CA) (Lebart et al., 1984; Greenacre, 1984) and multiple correspondence analysis (MCA) (Greenacre & Blasius, 2006).
MCA is useful to analyse multidimensional contingency tables and utilises the so-called Burt matrix C, which is a cross-classification table including all crossings of all levels of all variables with each other. This work treats I1×I2×⋯×IK-cell arrays generated by K cross-classified variables gathered together in a simple random sample of size n.
Let T = [t(i1,…,iK)] be the observed array, with 1 ≤ ik ≤ Ik and k = 1, 2, …, K, and let P = [p(i1,…,iK)] be the array of the associated distribution of joint probabilities, such that the K crossed variables are distributed complete multinomial(n, P), with n = total of the T cells, and total of the P cells = 1.
To analyse the cross-classification array T I propose the Hellinger–Matusita distance (Hellinger, 1909; Matusita, 1967) to measure the disagreement between its "observed" and "expected" frequencies, from which we can derive a series of measures and coefficients that help us understand its association structure, and make inference on it as well.
Dependence and codependence. In general the concept of independence is stated for the whole array, but let us take a variable k and one of its categories ik. If this level were independent of the other variables then, for all crossings (i1,…,ik−1,ik+1,…,iK) of the remaining variables, it is true that p(i1,…,iK) = pk(ik) p−k(i1,…,ik−1,ik+1,…,iK), where pk(ik) is the marginal probability of category ik of variable k, and p−k(i1,…,ik−1,ik+1,…,iK) is the marginal probability of the cell (i1,…,ik−1,ik+1,…,iK) of the remaining variables. If variable k had all its levels independent as above, then we can say that variable k itself is independent of the other variables.
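As a minimal numerical illustration of this product condition (a sketch in Python with NumPy; the 2×2×3 array and its probabilities are invented for illustration, and the paper's own computations were done in R):

```python
import numpy as np

# Hypothetical marginal of variable 1 and joint marginal of variables 2, 3.
p1 = np.array([0.4, 0.6])
P_rest = np.array([[0.20, 0.30, 0.10],
                   [0.15, 0.05, 0.20]])

# A joint array built as the product p1(i1) * p_rest(i2, i3), so every
# level of variable 1 is independent of the remaining variables.
P = p1[:, None, None] * P_rest[None, :, :]

# The independence condition for level i1 = 0: the slice equals the
# marginal probability of the level times the marginal of the rest.
print(np.allclose(P[0], p1[0] * P_rest))
```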
Goodman (1981) introduced the concept of "reducible contingency table", based on joining two or more columns (rows) with equal probability profiles, without, however, attending to reduction by deleting independent columns, i.e. those with the same probability profile as the marginal one. In the same paper he says "Any reducible table can be transformed into an irreducible table without affecting the total nonindependence in the table" (sic). In this paper I am going to apply the reducibility principle to arrays, but by deletion of a variable, if it is nominal.
To study the condition for reducibility, and to measure degrees of association between groups of variables, it is useful to construct the notion of the dependence of a category of a variable and of the codependence between two of its categories.
The codependence of the categories ik and i*k of variable k is
∂(ik, i*k) = (1/2) √(pk(ik) pk(i*k)) Σ [√p(·|ik) − √p−k(·)] [√p(·|i*k) − √p−k(·)],
the sum running over all configurations (i1,…,ik−1,ik+1,…,iK) of the remaining variables, where p(·|ik) denotes the conditional probability of such a configuration given category ik, and p−k(·) its marginal probability. This formula shows that the codependence of two categories is a balance of the differences between the conditional probabilistic profiles of the remaining variables, given each of them, and the marginal profile of these same variables.
The dependence of category ik of a variable k is defined by ∂²(ik) = ∂(ik, ik). It is easy to show that this dependence equals the marginal probability of category ik times one minus the affinity (Matusita, 1967) between the conditional probability distribution of the remaining variables, given this level of variable k, and their marginal distribution: ∂²(ik) = pk(ik)[1 − ρ(ik)], with ρ(ik) = Σ √(p(·|ik) p−k(·)). Thus the higher the affinity between marginal and conditional profiles, the smaller the dependence of the category; and, conversely, the higher its frequency in the population, the higher its dependence. So if category ik of variable k were independent of the other variables, its dependence would be zero, as would its codependences with all the other categories of variable k.
To eliminate from the codependence of categories ik and i*k of variable k the influence of the population probabilities of the categories, as the correlation coefficient eliminates the influence of the measurement units of quantitative variables, in case ∂²(ik) > 0 and ∂²(i*k) > 0 one defines
r(ik, i*k) = ∂(ik, i*k) / √(∂²(ik) ∂²(i*k))
as the codependence coefficient of categories ik and i*k of variable k. This coefficient measures an association of the two categories, and by properties of summation and the Cauchy–Schwarz inequality it can be said that −1 ≤ r(ik, i*k) ≤ +1. Hewitt & Stromberg (1975, pp. 190-191) give two necessary and sufficient conditions for equality via Hölder's inequality, which, combined with |r(ik, i*k)| = +1, the positivity of the involved probabilities, the summation over all configurations and some calculations, lead one to conclude that r(ik, i*k) = 1 iff the affinity Σ √(p(·|ik) p(·|i*k)) = 1, and this property of the affinity is known to be equivalent (Matusita, 1967) to equality of the probability profiles themselves.
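These quantities are straightforward to compute numerically. A sketch (Python with NumPy; the 3×2×2 probability array is invented, and the 1/2 factor in the codependence is an assumption chosen so that the self-codependence reproduces the affinity form pk(ik)[1 − ρ(ik)]):

```python
import numpy as np

# Invented 3x2x2 joint probability array; variable 1 has three levels.
P = np.array([[[0.05, 0.10], [0.08, 0.07]],
              [[0.12, 0.03], [0.06, 0.09]],
              [[0.10, 0.10], [0.11, 0.09]]])

p1 = P.sum(axis=(1, 2))          # marginal distribution of variable 1
m = P.sum(axis=0)                # marginal of the remaining variables
q = P / p1[:, None, None]        # conditional profiles given each level

def codep(i, j):
    """Hellinger-based codependence of levels i, j of variable 1."""
    di = np.sqrt(q[i]) - np.sqrt(m)
    dj = np.sqrt(q[j]) - np.sqrt(m)
    return 0.5 * np.sqrt(p1[i] * p1[j]) * np.sum(di * dj)

def codep_coef(i, j):
    """Codependence coefficient, bounded in [-1, +1] by Cauchy-Schwarz."""
    return codep(i, j) / np.sqrt(codep(i, i) * codep(j, j))

# Self-codependence equals p1(i) * (1 - affinity of the two profiles):
rho0 = np.sum(np.sqrt(q[0] * m))
assert np.isclose(codep(0, 0), p1[0] * (1 - rho0))
```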
The dependence of a category ik of variable k can be extended to the dependence k∂² of variable k by summing the dependences of all its categories, k∂² = Σ ∂²(ik). One can easily see that k∂² = 0 iff p(i1,…,iK) = pk(ik) p−k(i1,…,ik−1,ik+1,…,iK) over all cells in the remaining array complementary to variable k, that is: all of its categories are independent of the remaining variables, so variable k is itself independent of the remaining variables. If a variable is independent of the others, we can generalize Goodman's principle of reducibility, and it can be deleted from the analysis to lessen the problem's complexity.
The total dependence in the array is
∂² = (1/2) Σ [√p(i1,…,iK) − √(p1(i1) p2(i2) ⋯ pK(iK))]²,
the sum running over all cells of the array, and it is true that ∂² = 0 iff k∂² = 0 for all 1 ≤ k ≤ K, as we can see by the following argument. The necessity is straightforward from the definition of total dependence and some calculations over the dependence of a variable and their consequences. For the sufficiency, let us suppose that all variables are independent of the remaining ones, so that p(i1,…,iK) = pk(ik) p−k(i1,…,ik−1,ik+1,…,iK) for every k, in all the cells of the array. Summing the case k = 2 along variable 1 one gets p−1(i2,…,iK) = p2(i2) p−1,−2(i3,…,iK) and, by substituting this into the second factor of the right side of the case k = 1, one gets p(i1,…,iK) = p1(i1) p2(i2) p−1,−2(i3,…,iK). Summing the case k = 3 along variables 1 and 2, and putting the result into the already altered expression for the cell joint probability, one gets p(i1,…,iK) = p1(i1) p2(i2) p3(i3) p−1,−2,−3(i4,…,iK). Doing this iteratively one concludes that p(i1,…,iK) = p1(i1) p2(i2) ⋯ pK(iK) in all cells of the array, and so all variables are mutually independent.
Note that the total dependence in an array is not, in general, just the sum of the dependences of all the variables, as it is in two-way cross-classification tables, for there can exist group codependences or independences in it. These aspects are not going to be studied in this paper.
Inference. The array T was presented as constructed under a complete multinomial model with probability array P, whose maximum likelihood estimate is F = [f(i1,…,iK)], the array of relative frequencies. The marginal relative-frequency representations are straightforward from the definitions, just by substituting the probabilities by their estimates: the sample codependence of the categories ik and i*k of variable k, the sample dependence of category ik and the sample codependence coefficient between these categories are denoted by D(ik, i*k), D²(ik) and r(ik, i*k), respectively; kD² denotes the sample dependence of variable k, and D² the sample total dependence.
Test for independence in the array. To test H0: ∂² = 0 against HA: ∂² > 0, where ∂² is the total dependence measure, Goldstein, Wolf and Dillon (1976) gave the test function 8nD², which, under H0, is asymptotically distributed as a chi-squared variate with R = I1 I2 ⋯ IK − Σ(Ik − 1) − 1 degrees of freedom.
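A sketch of this test (Python with NumPy; it assumes D² is half the squared Hellinger distance between the observed relative frequencies and the product of their one-way marginals, and uses the degrees-of-freedom count R = I1⋯IK − Σ(Ik − 1) − 1, which matches the df = 29 reported below for the 3×3×2×2 simulated arrays):

```python
import numpy as np

def total_dependence_stat(T):
    """8nD^2 statistic for a K-way count array T; compare with a
    chi-squared quantile with R degrees of freedom."""
    n = T.sum()
    F = T / n
    K = T.ndim
    prod = np.ones_like(F)               # product of one-way marginals
    for k in range(K):
        others = tuple(a for a in range(K) if a != k)
        shape = [1] * K
        shape[k] = -1
        prod = prod * F.sum(axis=others).reshape(shape)
    D2 = 0.5 * np.sum((np.sqrt(F) - np.sqrt(prod)) ** 2)
    R = T.size - sum(s - 1 for s in T.shape) - 1
    return 8 * n * D2, R
```

The p-value is then the chi-squared survival probability of the statistic at R degrees of freedom (e.g. via scipy.stats.chi2.sf, if SciPy is available).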
If there is evidence of dependencies, one shall continue to search for them. There are many ways to do this, but here I am going to present only one of them.
Test for collapsibility in the array. For the sake of reducing the problem's complexity, one can test H0k: k∂² = 0 against HAk: k∂² > 0 using the statistic 8n kD², which, under H0k, is asymptotically distributed as a chi-squared variate with Rk degrees of freedom, as shown in Appendix A. Independence of variable k means the codependences of any pair of categories of any other remaining variable are the same as if variable k did not exist. This can be seen in the corollary in Appendix B. If the array cannot collapse along a variable, then the association structures between the remaining variables vary according to its levels, and this is worth exploring in practice.
Searching for conditional codependencies. A non-independent variable k means that the dependence structure of the remaining variables depends also on its levels; so, its collapsibility being rejected, one may search for dependencies between the remaining variables, conditioned on its levels. One way to do this is to analyse the dependence structure of the remaining variables at each category of variable k, and this can be done through conditioning.
We then need the concept of conditional codependence. The conditional distribution of the other variables, given the level ik of variable k, is defined by
p(i1,…,ik−1,ik+1,…,iK | ik) = p(i1,…,iK) / pk(ik), (11)
so we can define the conditional codependence of two categories ik* and i*k* of variable k* (k* < k, say, to ease the notation), given the category ik of variable k, by
∂(ik*, i*k* | ik) = (1/2) √(pk*(ik* | ik) pk*(i*k* | ik)) Σ [√p(· | ik*, ik) − √p−k*(· | ik)] [√p(· | i*k*, ik) − √p−k*(· | ik)], (12)
where pk*(ik* | ik) is the marginal conditional probability of category ik* of variable k*, given the category ik of variable k, and p−k*(· | ik) is the joint marginal conditional probability of the listed categories of the remaining variables other than k*, given the category ik of variable k.
By expression (11), all the conditional probabilities in definition (12) can be rewritten as ratios of unconditional joint probabilities to pk(ik), giving an equivalent expression (13) for the conditional codependence directly in terms of the original array P.
To carry all the above definitions over to this universe of conditional distributions, measuring conditioned degrees of association, we propose the following.
Definition 7. The conditional dependence of category ik* of variable k*, given the category ik of variable k (k* < k, say), is given by ∂²(ik* | ik) = ∂(ik*, ik* | ik). (14)
Definition 8. The conditional codependence coefficient between categories ik* and i*k* of variable k*, given the category ik of variable k, in case ∂²(ik* | ik) > 0 and ∂²(i*k* | ik) > 0, is given by
r(ik*, i*k* | ik) = ∂(ik*, i*k* | ik) / √(∂²(ik* | ik) ∂²(i*k* | ik)), (15)
where ∂²(ik* | ik) and ∂²(i*k* | ik) are, respectively, the conditional dependences of categories ik* and i*k* of variable k*, given the category ik of variable k. This measure of association has the same properties as the unconditional codependence coefficient given at (3), for conditional probability distributions are themselves probability distributions.
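Computationally, conditioning on a level of variable k amounts to renormalising the corresponding slice of the array and applying the unconditional definitions inside it. A sketch (Python with NumPy; the 3×2×2 array is invented, and the dependence formula pk(i)[1 − affinity] follows the definitions above):

```python
import numpy as np

def category_dependence(P, k, i):
    """Dependence p_k(i) * (1 - affinity) of level i of axis k of a
    joint probability array P (affinity in Matusita's sense)."""
    others = tuple(a for a in range(P.ndim) if a != k)
    pk = P.sum(axis=others)
    cond = np.take(P, i, axis=k) / pk[i]   # conditional profile given level i
    marg = P.sum(axis=k)                   # marginal profile of the rest
    return pk[i] * (1.0 - np.sum(np.sqrt(cond * marg)))

# Invented 3x2x2 joint distribution.
P = np.random.default_rng(1).dirichlet(np.ones(12)).reshape(3, 2, 2)

# Condition on level 0 of variable 3: renormalise that slice; the
# conditional dependence of a level of variable 1 is then just the
# unconditional definition applied inside the renormalised slice.
P_given = P[:, :, 0] / P[:, :, 0].sum()
dep_cond = category_dependence(P_given, 0, 1)
```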
Applications. To show how the proposed coefficients and tests behave, we will use simulated data and real examples. The simulated examples were done on some chosen structures so as to show how the tools act in known situations, and the real ones were chosen from the literature and from among the large amount of data we have in our archives.
The simulations were done using the command rmultinom in R 2.10.1 (R Project, 2009) to generate structured independence, and dependence as well. Arrays with both independence and dependence structures were generated under dimensions 3×3×2×2, so as to exemplify all the tests described above.
For constructing the (3×3)×(2×2) arrays ―presented below as panels with four 3×3 tables― structures of independence were imposed on each one of the four 3×3 tables: first global independence in the array, and second independence within each one of the four tables, but each distributed differently from the others. All four tables were equally weighted.
To create the joint probability distribution underlying each simulated 3×3 table in the array, and governing the outcome of its frequencies, first the conditional distribution of the rows given column j, [p1|j, p2|j, p3|j]t, was chosen for each of the four tables, with the column marginal distribution [p+1, p+2, p+3]t = [1/3, 1/3, 1/3]t. The joint distributions in each of the four tables were obtained by the transformation pij = pi|j × p+j. Under global independence each 3×3 table had all column conditional distributions equal to (p1|j, p2|j, p3|j)t = (0.5, 0.3, 0.2)t, so that the array joint distribution was as shown in Array 1.
So, to simulate a multinomial distribution with parameters n = 360 ―chosen so as to expect frequency 6 at the lowest-probability cells― and p, first the R 2.10.1 (2009) command rmultinom for a 4-nomial(360, (0.25, 0.25, 0.25, 0.25)t) distribution was run; then the four 3×3 frequency tables were respectively constructed by running four 9-nomial(m, p) distributions, with the sample sizes m corresponding to the numbers just generated (now sample sizes) and with p equal to each of the four equal portions of the vector P in Array 1, lexicographically ordered. The generated data are in Array 2 below.
(Array 1 here)
(Array 2 here)
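The two-stage sampling just described can be sketched as follows (the paper used R's rmultinom; this is an equivalent sketch in Python with NumPy, with an arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed

# Stage 1: split n = 360 over the four 3x3 tables with uniform weights.
m = rng.multinomial(360, [0.25] * 4)

# Stage 2: within each table, global independence with column
# conditionals (0.5, 0.3, 0.2) and uniform column marginals 1/3.
p_cond = np.array([0.5, 0.3, 0.2])
p_joint = np.outer(p_cond, np.ones(3) / 3).ravel()   # 9 cell probabilities

tables = [rng.multinomial(mi, p_joint).reshape(3, 3) for mi in m]
```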
The first four sampled sizes m for the primary tables, based on the uniform distribution between them, were: table v3.1×v4.1 = 103, table v3.2×v4.1 = 86, table v3.1×v4.2 = 78 and table v3.2×v4.2 = 93. Applying the dependence test developed in this paper to the array, we get: total dependence D² = 0.0083, which gave 8nD² = 23.91 with df = 29 and p-value = 0.498, confirming the global independence structure of Array 1. In a real application we would stop the analysis at this point, for the test did not show evidence of any association in the array.
But to show the capability of the tests developed above, we go forward and analyse this case for the possibility of discarding some of the dimensions. The results were: for the dependence of v4, the fourth dimension, p-value = 0.372; of v3, p-value = 0.196; of v2, p-value = 0.564; and of v1, 0.747. Of course, if there is no evidence of global association, with high probability no dependence of any variable whatsoever should be found.
Going forward and applying the tests for simple dependence, presented in Cordeiro (1987) and Khan & Ali (1973), in each one of the four subtables, the p-values, in the same order presented above, were: 0.220, 0.462, 0.870 and 0.836, as, with high probability, it should be by construction.
Let us now study a different situation, in which there is independence within each of the four tables, i.e. conditioned on each level of the cartesian product of variables v3×v4 there is independence between v1 and v2, though with different structures. The global vector Q of probabilities is given in Array 3, and the result of running a complete multinomial sample with parameters (360, Q), in the same way as was done above, is in Array 4.
(Array 3 here)
(Array 4 here)
The first-run multinomial frequencies for the four primary tables, based on the uniform distribution between them, were: for table v3.1×v4.1 = 107, for table v3.2×v4.1 = 90, for table v3.1×v4.2 = 83 and for table v3.2×v4.2 = 80.
Applying the global dependence test to the array we got: total dependence D² = 0.0299, which produced 8nD² = 86.02 with df = 29, and p-value = 1.5×10−7, as was expected from the dependence structure imposed in Array 3.
The tests for dependence of the dimensions, seeking collapsibility in the array, gave: p-value = 6.7×10−7 for v4, 0.008 for v3, 0.880 for v2 and p-value = 4.0×10−5 for v1, which means v2 could be eliminated from the association problem, since there is no evidence that it is globally associated with the remaining dimensions.
To see why v2 could be eliminated from the problem, let us take a look at Array 5, where we find the conditional probability distributions of v2 given each case of the array v1×v3×v4, the marginal distribution of v2, and the conditional distributions of the array v1×v3×v4 given the levels of v2 as well. These distributions are exactly equal, so the three levels of v2 do not affect the corresponding distributions of the arrays they determine, nor do the levels of v1×v3×v4 alter the distribution of v2.
(Array 5 here)
So, our problem can from now on be reduced by collapsing Array 4 along v2 to the array v1×v3×v4, and what we see, when applying the tests for global and dimension dependencies, are p-values all indicating evidence of association between all variables in all directions: for global dependence, p-value = 4×10−12; for dependence of v4, p-value = 2.9×10−10; of v3, p-value = 0.004; and of v1, p-value = 1.1×10−12.
Applying simple dependence analysis (Cordeiro, 1987) to the conditioned tables out of the reduced array of generated data, we obtained the results: for table v1×v3|v4=v4.1 we got p-value = 0.001, and for v1×v3|v4=v4.2 the p-value was 0.011; for table v1×v4|v3=v3.1, p-value = 6×10−9, and for v1×v4|v3=v3.2, p-value = 0.0005. For v3×v4|v1=v1.1 we got p-value = 0.016, for v3×v4|v1=v1.2 it was 0.85, and for v3×v4|v1=v1.3 the p-value was 0.19.
The dependence significances shown in the last paragraph are explained by the fact that not only the strength but also the forms of association along the levels of dimensions v1, v3 and v4 were very different, as could be seen in the dependence diagrams (Cordeiro, 1987), not shown here.
Real data example 1. In a survey in Catanduva, State of São Paulo, Brazil, 1510 subjects aged 65 years or older were classified as having Alzheimer disease or not, as being illiterate or not, and by sex. The data are summarized in Array 6.
(Array 6)
Testing for joint independence between the three variables, 8nD² = 63.4 (p-value = 5×10−13) indicates a highly significant association in the array. To see if the array can collapse along any variable, one has: for disease, p-value = 2×10−6; for sex, p-value = 1×10−13; for illiteracy, p-value = 2×10−8; and one concludes the array cannot collapse along any dimension, i.e. there is evidence that any two variables have different association structures according to the levels of the third one. Interestingly, this can include significant association at one level and non-significant at the other (see p-values and dependence diagrams in Panel 1 below).
The codependence coefficients show that the structure of association in this array is complex, in spite of codependence coefficients in 2×2 tables being always non-negative: 0.986, 0.982 and 0.997, indicating that the levels of disease, sex and illiteracy each have concurrent combinations of the categories of the other variables.
To show just one aspect of these association structures, we can use the features of simple dependence analysis (Cordeiro, 1987), here the dependence diagram, through which we can see the structure of dependence between Alzheimer disease and illiteracy, according to sex, as presented in Panel 1.
As 98% of the elderly males in the study were not diseased, a p-value = 0.31 shows no evidence of association between disease and illiteracy. Looking at the figure for the males, we see the two illiteracy categories plotted opposite each other in relation to the horizontal line, just on the vertical line near the center of the figure, which represents a balance for illiteracy. The "No Alzheimer" point very near the figure center is due to its high predominance in the male population, which makes the marginal profile of disease almost equal to its conditional profile, given sex = male.
(here (title→Panel 1. Dependence diagram of Alzheimer disease and illiteracy in elderly, according to gender - Catanduva-SP, Brazil) figures males and females side by side)
Amongst the elderly females, notwithstanding, with a p-value = 0.050 for the dependence test, we see these categories also on opposite sides of the horizontal, but not so vertical and farther from the center than in males (notice the diagram scales are the same). In this female population the cases of not diseased were estimated as 40%, much lower than among males, and in the diagram we see Alzheimer disease and illiteracy in the same quadrant, which can be interpreted as positive association between the two categories. This sex effect could be explained, despite the significantly smaller percentage of illiterate elderly women (40 against 53 among men), by the cultural behavior of Brazilians in the country's interior, mainly among the elderly, who lived in times when women stayed at home while men went out to work and socialize. Men exercised the brain much more than women in those old times, and this could be an explanation for what we saw.
So the sex effect on the association structures of illiteracy and Alzheimer disease in the above array, captured by the p-value = 1×10−13, could be interpreted as a stronger effect of illiteracy toward raising the probability of the disease in females than in males of that Brazilian city.
Real data example 2. In a Brazilian multicenter study on multiple organ failure (MOF), 553 individuals were evaluated, among other variables, as to whether or not they were cases for urgent treatment (Not.Urgent/Urgent) and whether or not they had a slow heart rate (Not.slow.HR/Slow.HR).
(Array 7 here)
Testing for global dependence, a p-value = 2×10−13 indicates very significant dependence in the array. Tests for collapsing gave the p-values: 7×10−11 for Urgency, 1×10−13 for MOF and 0.008 for heart-rate slowness. So every dimension is important for the problem.
To see how each dimension influences the association structures between the remaining two, let us begin with the association between Urgency and MOF, given HR slowness. Testing for dependence at "Not slow HR", a p-value = 3×10−10 indicates a high degree of association between Urgency and MOF at this level of HR slowness, and a p-value = 0.009 suggests association at "Slow HR" too. Panel 2 shows the dependence diagrams for the levels of HR slowness.
Notwithstanding the association evidence at both HR slowness levels, this dimension's effect-test p-value = 0.0008 evidences some difference between the two association structures. What the simple dependence analysis (Cordeiro, 1987) diagram shows is that the form of association is more or less the same at the two levels of HR slowness but, despite knowing that the p-value is sample-size sensitive, we can say that at "Not slow HR" the association between MOF and Urgency is stronger than at "Slow HR", for the displayed "Urgent" and "MOF" points at "Not slow HR" are placed farther from the balance center (0,0) point than at "Slow HR". (Notice the same scale in both figures.)
(here (title→Panel 2. Dependence diagram for Urgency and MOF, according to HR slowness)
figures "Not slow heart rate" and "Slow heart rate" side by side)
Along Urgency, a dependence-test p-value = 7×10−11 evidences a strong effect of this dimension on the association structures between HR slowness and MOF. At the "Urgent" level a p-value = 0.12 suggests no evidence of association, while at "Not urgent" a p-value = 0.001 says we should accept association.
Panel 3 shows the dependence diagrams of "HR slowness"×MOF, according to Urgency, to illustrate these association structures. We see in the "Not urgent" level diagram the "HR slowness" and MOF points farther from the balance center of the diagram than in the "Urgent" one, which reflects the stronger degree of association at the first level.
(here (title→Panel 3. Dependence diagram for HR slowness and MOF, according to urgency)
figures "Not urgent" and "Urgent" side by side)
These results altogether can help to model a concatenation of HR slowness with Urgency so as to predict MOF, with the best ordered combinations 0_"HR slowness=No, Urgency=No", 1_"HR slowness=Yes, Urgency=No", 2_"HR slowness=No, Urgency=Yes" and 3_"HR slowness=Yes, Urgency=Yes".
After applying a binary logistic regression of MOF on this combination, having 0_"HR slowness=No, Urgency=No" as the reference state, we respectively got the adjusted odds ratios 5.4 (ci95%: 2.1-13.8), 9.7 (ci95%: 4.4-21.2) and 17.8 (ci95%: 7.2-43.9), and an odds-ratio tendency test p-value < 0.0005, indicating Urgency as more important than HR slowness for predicting MOF.
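With a single four-level categorical predictor and a reference level, the logistic-regression odds ratios against the reference coincide with the crude odds ratios computed level by level. A sketch of that computation (Python with NumPy; the counts below are hypothetical, since the paper's raw MOF data are not reproduced here):

```python
import numpy as np

# Hypothetical counts: rows = ordered combination levels 0..3,
# columns = (MOF absent, MOF present).
counts = np.array([[120, 15],
                   [60, 40],
                   [70, 58],
                   [30, 60]])

odds = counts[:, 1] / counts[:, 0]      # odds of MOF at each level
odds_ratios = odds[1:] / odds[0]        # ORs vs. the reference level 0
```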
This example is a good one to show that this analysis can help other, better-known methods to achieve good final goals.
Real data example 3. Fienberg (1980) shows an example of a sample of 192 adult male Anolis lizards of Bimini, from two species (A. sagrei and A. angusticeps), counted by structural habitat as in the following array.
(Array 8 here)
Testing for global independence in the array, a p-value = 4×10−13 indicates a highly significant association amongst the variables. To see if the array can collapse along any variable, one has for Species p-value = 2×10−12, for Diameter p-value = 4×10−8, for Height p-value = 2×10−12, and one concludes the array cannot collapse along any variable. These results show that the structures of association in this array are also complex. The codependence coefficients in the array are: 0.960, 0.968 and 0.917.
The results of simple dependence analysis (Cordeiro, 1987) on the cross tables of Species×Diameter, given perch Height, show no evidence of association at "low height" but a highly significant one at "high height", as shown in Panel 4.
In the "perch height ≤ 5 feet" figure of Panel 4, we see both diameter categories, as well as the most frequent species, A. sagrei, plotted just near the balance (non-association) center, and A. angusticeps a little farther from it, which explains the high p-value = 0.30: in spite of the very low observed frequency of A. angusticeps at this perch height, there is no evidence of association between species and perch diameter.
On the contrary, when the resting places are higher, at "perch height > 5 feet", the species seem to prefer different perches to rest on: A. sagrei thicker ones and A. angusticeps thinner ones. The significance of perch height (p-value = 2×10−12) reflects this difference in association structure.
(here (title→Panel 4. Dependence diagram of species×diameter, given perch height) figures
"≤5feet" and ">5feet" side by side)
Analysing the association Species×Height, given perch Diameter, what simple dependence analysis shows is no evidence of association when the perch is thicker (p-value = 0.54) but a highly significant one when it is thinner (p-value = 7×10−9), as shown in Panel 5 below.
Again a case of significance of perch Diameter (p-value = 4×10−8) meaning not only different association structures but also evidencing a contrast between association and non-association: in the thicker-perches figure of Panel 5 we see a cluster of points plotted just near the non-association center, with both perch heights and A. sagrei, and A. angusticeps a little farther but also near the cluster, with evidence of non-association. On the contrary, when the diameter is thinner, the significant association translates the preference of A. angusticeps for higher perches and of A. sagrei for lower ones.
(here (title→Panel 5. Dependence diagram of species×height, given diameter) figures "≤2.5inch"
and ">2.5inch" side by side)
As is written on the Bimini Biological Field Station site (2010), it is known that A. sagrei is "found more frequently on the ground and A. angusticeps more on trees and is not commonly seen as some of the other Anolis lizards due to it's camouflage color" (sic).
What the current analysis says is that there is no evidence that A. sagrei would prefer specific places on a tree (Panel 6, left dependence diagram), except lower perches when they are thin (Panel 5, left dependence diagram), while A. angusticeps, with lower frequency ―in total 27 individuals― and a p-value very near 0.05, seems to prefer thinner and higher perches less than thicker and lower ones.
(here (title→Panel 6. Dependence diagram of diameter × height, given species) figures A. sagrei
and A. angusticeps side by side)
Conclusion. We presented the foundations of a method for dealing with multidimensional contingency tables, or contingency arrays, under the simple sampling design, which, differently from multiple correspondence analysis (Greenacre & Blasius, 2006), works in a multivariate way and permits the evaluation of a kind of strength of association between levels of the same variable, with corresponding significance tests.
We also presented applications with generated data, which showed the tests working well in both situations, of global independence and of structured dependence, including the reduction of complexity by collapsing along the second dimension, as in simulated example 2.
With this method we can reduce the complexity of the array by collapsing it along independent variables, and these founding ideas enable us to further develop the codependence between two groups of variables ―for example, in medicine, between clinical and demographic variables― path analysis for categorical variables, as was done by Wright (1921), and canonical codependence, as with numerical variables.
Three real examples were presented, the first two with local data. These two local examples allowed some interpretations between the variables analysed, and in the second real example the analysis permitted the construction of a concatenation of variables to predict multiple organ failure (MOF), which proved to be good by binary logistic regression. The third one had already been presented by other authors in the specialized literature, and its presentation here permitted some new insights into the behavior of the Bimini lizards.
Acknowledgments. This research was supported by FAPESP ―Foundation for Research Support of the State of São Paulo, Brazil (Proc. 2008/55101-8). We thank Dr. Carlos Alberto Ribeiro Diniz very much for his attentive reading of the manuscript and for his important suggestions for improving it.
References.
Agresti, A., 2002. Categorical Data Analysis. Wiley, Hoboken, NJ-USA.
Bickel, P.J., Doksum, K.A., 1977. Mathematical Statistics: Basic Ideas and Selected Topics.
Holden-Day, Oakland, Ca.
Bimini Biological Field Station, in: http://www6.miami.edu/sharklab/aboutbimini_reptiles.html
Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W., 1975. Discrete Multivariate Analysis: Theory and Practice. The MIT Press, Cambridge, Mass.
Cordeiro, J.A. Analysis of Dependency, 1987. Technical Report n. 48/87, State University of
Campinas-SP, Brazil.
Fienberg, S.E., 1980. The analysis of cross-classified categorical data. The MIT Press,
Cambridge, Mass.
Goldstein, M., Wolf, E., Dillon, W., 1976. On a test for independence for contingency tables.
Commun. Statist.-Theor. Meth., A5(2), 159-169.
Goodman, L. A., 1954. Measures of association for cross-classification tables. JASA 49 (268),
732-764.
Goodman, L. A., 1981. Criteria for Determining Whether Certain Categories in a Cross-Classification Table Should Be Combined, with Special Reference to Occupational Categories in an Occupational Mobility Table. The American Journal of Sociology, 87(3), 612-650.
Goodman, L. A., 1996. A Single General Method for the Analysis of Cross-Classified Data:
Reconciliation and Synthesis of Some Methods of Pearson, Yule, and Fisher, and Also Some
Methods of Correspondence Analysis and Association Analysis. JASA 91(433), 408-428.
Greenacre, M.J., 1984. Theory and Applications of Correspondence Analysis. AC Press,
London.
Greenacre, M.J. and Blasius, J., 2006. Multiple Correspondence Analysis and Related Methods.
Chapman & Hall/CRC.
Grizzle, J.E., Starmer, C.F. and Koch, G.G., 1969. Analysis of categorical data by linear models. Biometrics 25, 489-504.
Hellinger, E., 1909. Quadratischen Formen von unendlichvielen Veränderlichen. J. reine u. ang.
Math., 136, 210-71.
Hewitt, E., Stromberg, K., 1975. Real and Abstract Analysis. Springer-Verlag, NY.
Khan, A.H., Ali, S.M., 1973. A new coefficient of association. Ann. Inst. Stat. Math. 25(1), pp.
41-50.
Lebart, L., Morineau, A. and Warwick, K.M., 1984. Multivariate Descriptive Statistical
Analysis, Correspondence Analysis and Related Techniques for Large Matrices, Wiley, NY.
Matusita, K., 1967. On the notion of affinity of several distributions and some of its applications.
Ann. Inst. Statist. Math. 19, 181-192.
R Project for Statistical Computing, The, 2009. in: http://www.r-project.org.
Wright, S.,1921. Correlation and causation. J. Agricultural Research, 20, 557-585.
Appendix A. Test for the independence of a dimension k. Following Khan and Ali (1973) and definition (4) above, with the parameters substituted by their maximum likelihood estimates, after some algebraic manipulations one obtains an expansion of kD² in terms of the relative frequencies. Under the condition that the corresponding ratios lie strictly between −1 and +1 for all (i1,…,ik−1,ik+1,…,iK), the Maclaurin approximation and the substitution of the relative frequencies by the absolute ones give an approximation whose remainder term is of order n−3/2, while the leading term is of order n−1/2. Then, under H0k: k∂² = 0, 8n kD² is asymptotically distributed as a variate with a chi-squared distribution with Rk degrees of freedom.
Appendix B. Consequences of eliminating an independent level ik of a variable k. To analyse what happens when a level ik of a variable k is independent, let us suppose the hypothesis ∂²(ik) = 0, i.e. p(i1,…,iK) = pk(ik) p−k(i1,…,ik−1,ik+1,…,iK) for all configurations (i1,…,ik−1,ik+1,…,iK), 1 ≤ ij ≤ Ij, j ≠ k.
At first, one can see what happens with the codependence of any two categories of variable k other than ik, and then with any two categories of any other variable k*. One sees that the probabilities involved with category ik are not utilized to calculate the codependence of the other categories of variable k, and consequently the nullity of the dependence of category ik does not alter their codependences.
Now, if ik* and i*k* are the indices of two categories of any other variable k*, k < k* say, then the contribution of level ik to the codependence of these categories reduces to the corresponding term of their marginal codependence when variable k is collapsed out of the array, since, under the hypothesis, the marginal probabilities and conditional profiles involving ik coincide with the collapsed ones.
Corollary. The dependence of a variable k being null, i.e. k∂² = 0, means, according to definition (4), that ∂²(ik) = 0 for all categories ik of variable k, which results, as shown above, in unchanged codependences for any two categories ik* and i*k* of any other variable k*. That is: no alteration whatsoever occurs in the codependence of two levels of any remaining variable.
Appendix C. Tests for the independence of level ik of variable k, and for the independence of variable k. The same methods, and similar arguments and conditionings on expression (2''), as in Appendix A, give the result for testing the independence of level ik of variable k: the corresponding statistic 8nD²(ik) which, under H0: ∂²(ik) = 0 and Slutsky's theorem (Bickel & Doksum, 1977, p. 461), is distributed as a variate with a chi-squared distribution with the corresponding degrees of freedom.
For testing k∂² = 0 we just use the sum of these statistics over the categories of variable k, which, under the conditions already posed above, is asymptotically distributed as a variate with a chi-squared distribution with the corresponding number of degrees of freedom.