Multivariate dependence analysis

José Antônio Cordeiro

[email protected]

Abstract. Introduction. In this work we propose some features for analysing cross-classification multidimensional tables, and test their effectiveness on artificially constructed and on real data. Methods. Measures of the similarity between two levels of one variable through their profiles along all combinations of levels of the remaining variables, and coefficients that measure the strength of dependence of the levels of one variable, and of the whole variable as well, are proposed based on the Hellinger distance. Tests based on these coefficients are proposed for reducing the complexity of the array through its collapsibility along an independent variable, and a test for total dependence in the array as well. Results. When applied to artificial data the tests behaved as expected, since all the analysed situations were constructed and the association structures were known; when analysing the real data the whole methodology worked well, permitting new insights into already analysed data.

Key words. Multivariate dependence, cross-classification array, reducibility of multidimensional tables, Hellinger distance.

Introduction. The analysis of cross-classification tables and arrays is a much-studied theme among statisticians, and the techniques developed are widely applied in many areas of knowledge, for instance medicine, biology and the social sciences. The most published and used methods are log-linear modeling (Bishop, Fienberg and Holland, 1975; Fienberg, 1980; Agresti, 2002), ordered and unordered association models (Goodman, 1954, 1981, 1996), weighted least-squares modeling (Grizzle, Starmer and Koch, 1969), simple correspondence analysis (CA) (Lebart et al., 1984; Greenacre, 1984) and multiple correspondence analysis (MCA) (Greenacre & Blasius, 2006).

MCA is useful for analysing multidimensional contingency tables and utilizes the so-called Burt matrix C, a cross-classification table including all crossings of all levels of all variables with each other. This work treats I1×I2×⋯×IK-cell arrays generated by K cross-classified variables gathered together in a simple random sample of size n.

Let T = [t(i1,…,iK)] be the observed array, with 1 ≤ ik ≤ Ik and k = 1,2,…,K, and let P = [p(i1,…,iK)] be the array of the associated joint probability distribution, such that the K crossed variables are distributed complete multinomial(n, P), with n = the total of the T cells and the total of the P cells equal to 1.

To analyse the cross-classification array T I propose the Hellinger–Matusita (Hellinger, 1909; Matusita, 1967) distance to measure the disagreement between its "observed" and "expected" frequencies, from which we can derive a series of measures and coefficients that help us to understand its association structure, and to make inferences on it as well.
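Since the whole construction rests on the affinity and the Hellinger distance between discrete distributions, a minimal sketch may help fix ideas. This is illustrative Python (not the paper's own code), with hypothetical profiles, and it uses the normalization in which the squared distance equals twice one minus the affinity.

```python
import math

def affinity(p, q):
    """Matusita affinity between two discrete distributions on the same support."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance: sum of squared differences of root-probabilities.
    Under this normalization it equals 2 * (1 - affinity(p, q))."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]   # hypothetical "observed" profile
q = [0.4, 0.4, 0.2]   # hypothetical "expected" profile
print(abs(hellinger_sq(p, q) - 2 * (1 - affinity(p, q))) < 1e-12)  # True
```

The affinity equals 1 exactly when the two profiles coincide, which is the property exploited repeatedly in what follows.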

Dependence and codependence. In general the concept of independence is applied to the whole array, but let us take a variable k and one of its categories ik. If this level were independent of the other variables then, for all crossings (i1,…,ik−1,ik+1,…,iK) of the remaining variables, it would be true that p(i1,…,iK) = pik × p(i1,…,ik−1,ik+1,…,iK), where pik is the marginal probability of category ik of variable k, and p(i1,…,ik−1,ik+1,…,iK) is the marginal probability of the cell (i1,…,ik−1,ik+1,…,iK) of the remaining variables. If variable k had all its levels independent in this way, then we can say it is itself independent of the other variables.

Goodman (1981) introduced the concept of a "reducible contingency table" based on joining two or more columns (rows) with equal probability profiles, without also attending to reduction by deleting independent columns, i.e. those with the same probability profiles as the marginal. In the same paper he says "Any reducible table can be transformed into an irreducible table without affecting the total nonindependence in the table" (sic). In this paper I am going to apply the reducibility principle to arrays, but by deletion of a variable, if it is nominal.

To study the condition for reducibility, and to measure degrees of association between groups of variables, it is useful to construct the notions of the dependence of a category of a variable and of the codependence between two of its categories.

The codependence of the categories ik and i*k of variable k is defined below. The formula shows that the codependence of two categories is a balance of the differences between the conditional probability profiles of the remaining variables, given each of the two categories, and the marginal profile of those same variables.

The dependence of category ik of variable k is defined accordingly. It is easy to show that this dependence equals the marginal probability of category ik times 1 minus the affinity (Matusita, 1967) between the conditional probability distribution of the remaining variables, given this level of variable k, and their marginal distribution. Thus the higher the affinity between the marginal and conditional profiles, the smaller the dependence of the category; conversely, the higher its frequency in the population, the higher its dependence. So if category ik of variable k were independent of the other variables its dependence would be zero, as would its codependences with all other categories of variable k.
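The characterization just given — the category's marginal probability times one minus an affinity — can be computed directly. The sketch below does so for one row-category of a two-way table of counts; it is illustrative Python with made-up counts, not the paper's code.

```python
import math

def category_dependence(table, row):
    """Dependence of one row-category of a two-way table of counts:
    p_row * (1 - affinity between the column profile given the row
    and the marginal column profile), as characterized in the text."""
    total = sum(sum(r) for r in table)
    p = [[c / total for c in r] for r in table]        # joint relative frequencies
    p_row = sum(p[row])                                # marginal of the chosen category
    cond = [c / p_row for c in p[row]]                 # column profile given the row
    marg = [sum(r[j] for r in p) for j in range(len(table[0]))]
    aff = sum(math.sqrt(a * b) for a, b in zip(cond, marg))
    return p_row * (1 - aff)

# Rows proportional to each other: every profile equals the marginal,
# so the dependence of the category is (numerically) zero.
indep = [[10, 20, 10], [20, 40, 20]]
print(abs(category_dependence(indep, 0)) < 1e-9)  # True
```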

To eliminate the influence of the population probabilities of the categories on the codependence of categories ik and i*k of variable k, just as the correlation coefficient eliminates the influence of the measurement units of quantitative variables, in the case where both dependencies are nonzero one defines the codependence coefficient of categories ik and i*k of variable k by standardizing the codependence. This coefficient measures an association between the two categories, and by properties of summation and the Cauchy–Schwarz inequality it lies between −1 and +1. Hewitt & Stromberg (1975, pp. 190-191) give two necessary and sufficient conditions for equality via Hölder's inequality which, combined with this bound, the positivity of the involved probabilities, summation over all configurations and some calculations, lead one to conclude that the coefficient equals 1 iff the affinity equals 1, and this property of the affinity is known (Matusita, 1967) to be equivalent to equality of the probability profiles themselves.

The dependence of a category ik of variable k can be extended to the dependence k∂² of variable k by summing the dependencies of all its categories. One can easily see that k∂² = 0 iff the independence factorization holds over all cells of the array complementary to variable k, that is: all of its categories are independent of the remaining variables, so variable k is itself independent of the remaining variables. If a variable is independent of the others we can generalize Goodman's principle of reducibility, and it can be deleted from the analysis to lessen the complexity of the problem.
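Summing the category dependencies gives the variable's dependence, which vanishes exactly when that variable is independent of the rest. A small numerical check of this, on a K-way distribution stored as a dict from index tuples to probabilities (an illustrative Python encoding, not the paper's):

```python
import math
from itertools import product

def variable_dependence(p, axes, k):
    """Dependence of variable k: the sum over its categories of
    p_cat * (1 - affinity(profile of the remaining variables given the
    category, marginal profile of the remaining variables))."""
    rest = [a for a in range(len(axes)) if a != k]
    marg_rest = {}
    for idx, pr in p.items():                  # marginal over the other variables
        key = tuple(idx[a] for a in rest)
        marg_rest[key] = marg_rest.get(key, 0.0) + pr
    dep = 0.0
    for cat in range(axes[k]):
        p_cat = sum(pr for idx, pr in p.items() if idx[k] == cat)
        if p_cat == 0.0:
            continue
        aff = sum(math.sqrt((pr / p_cat) * marg_rest[tuple(idx[a] for a in rest)])
                  for idx, pr in p.items() if idx[k] == cat)
        dep += p_cat * (1.0 - aff)
    return dep

# Fully independent 2x2x2 distribution: every variable's dependence is ~0,
# so by the reducibility principle each variable could be collapsed away.
m = [[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]]
p = {idx: m[0][idx[0]] * m[1][idx[1]] * m[2][idx[2]]
     for idx in product(range(2), repeat=3)}
print(all(abs(variable_dependence(p, [2, 2, 2], k)) < 1e-9 for k in range(3)))  # True
```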

The total dependence ∂² in the array is defined analogously over the whole array, and it is true that ∂² = 0 iff k∂² = 0 for all 1 ≤ k ≤ K, as the following argument shows. The necessity is straightforward from the definition of total dependence and some calculations on the dependence of a variable and its consequences. For the sufficiency, suppose all variables are independent of the remaining ones, so that in every cell of the array the joint probability factors as the marginal of each variable times the marginal of the others. Summing along variable 1, and substituting the second factor of the right-hand side of the former equations, one obtains the joint probability with the marginal of variable 1 factored out. Summing all the right-hand equalities along variable 2 factors out the marginal of variable 2 as well. Putting this into the already altered expression for the cell joint probability, and proceeding iteratively, one concludes that in all cells of the array the joint probability equals the product of the one-way marginals, and so all the variables are mutually independent.

Note that the total dependence in an array is not, in general, just the sum of the dependences of all variables, as it is in 2-fold cross-classification tables, for there can exist group codependences or independences in it. These aspects are not going to be studied in this paper.

Inference. The array T was presented as constructed under a complete multinomial model with probability array P, whose maximum likelihood estimate is F = [f(i1,…,iK)], the array of relative frequencies. The marginal relative-frequency representations follow straightforwardly from the definitions, and just by substituting the probabilities by their estimates one obtains the sample codependence of categories ik and i*k of variable k, the sample dependence of category ik, and the sample codependence coefficient between categories; similarly, kD² denotes the sample dependence of variable k and D² the sample total dependence.

Test for independence in the array. To test H0: ∂² = 0 against HA: ∂² > 0, where ∂² is the total dependence measure, Goldstein, Wolf and Dillon (1976) gave the test function 8nD² which, under H0, is asymptotically distributed as a chi-square variate with R degrees of freedom.
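As a sketch of what the statistic computes: below, the sample total dependence D² is taken to be a Hellinger-type squared distance between the observed joint relative frequencies and the product of the one-way marginal relative frequencies — a plausible reading of the elided formula, stated here only as an assumption — and 8nD² is then the quantity referred to the chi-square distribution. Illustrative Python with made-up counts:

```python
import math
from itertools import product

def total_dependence(counts, axes):
    """Assumed D^2: squared Hellinger-type distance between the joint relative
    frequencies and the product of the one-way marginal relative frequencies."""
    n = sum(counts.values())
    marg = [[0.0] * a for a in axes]
    for idx, c in counts.items():              # accumulate one-way marginals
        for k in range(len(axes)):
            marg[k][idx[k]] += c / n
    d2 = 0.0
    for idx in product(*[range(a) for a in axes]):
        f = counts.get(idx, 0) / n
        prod = 1.0
        for k in range(len(axes)):
            prod *= marg[k][idx[k]]
        d2 += (math.sqrt(f) - math.sqrt(prod)) ** 2
    return d2

# Rows and columns proportional: 8nD^2 is essentially zero, far below
# any chi-square critical value, i.e. no evidence of dependence.
counts = {(0, 0): 20, (0, 1): 30, (1, 0): 40, (1, 1): 60}
n = sum(counts.values())
print(8 * n * total_dependence(counts, [2, 2]) < 1e-6)  # True
```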

If there is evidence of dependencies, one shall continue to search for them. There are many ways to do this, but here I am going to present only one of them.

Test for collapsibility in the array. For the sake of reducing the complexity of the problem, one can test H0k: k∂² = 0 against HAk: k∂² > 0 using the statistic 8n·kD² which, under H0k, is asymptotically distributed as a chi-square variate with Rk df, as shown in Appendix A. Independence of variable k means the codependencies of any pair of categories of any other remaining variable are the same as if variable k did not exist. This can be seen in the corollary in Appendix B. If the array cannot collapse along a variable then the association structures between the remaining variables vary according to its levels, and this is worth exploring in practice.

Searching for conditional codependencies. A non-independent variable k means that the dependence structure of the remaining variables depends also on its levels; so, its collapsibility having been rejected, one may search for dependencies between the remaining variables, conditioned on its levels. One way to do this is to analyse the dependence structure of the remaining variables at each category of variable k, and this can be done through conditioning.

We then need the concept of conditional codependence. The conditional distribution of the other variables, given the level ik of variable k, is defined in the usual way (expression (11)), so we can define the conditional codependence of two categories ik* and i*k* of variable k* (k* < k, say, to ease the notation), given the category ik of variable k, by expression (12), in which one factor is the marginal conditional probability of the category of variable k*, given the category ik of variable k, and the other is the joint marginal conditional probability of the listed categories of the remaining variables other than k*, given the category ik of variable k. By expression (11), definition (12) can be rewritten as expression (13).

To carry all the above definitions over to this universe of conditional distributions, measuring conditioned degrees of association, we propose the following.

Definition 7. The conditional dependence of category ik* of variable k*, given the category ik of variable k (k* < k, say), is given by expression (14).

Definition 8. The conditional codependence coefficient between categories ik* and i*k* of variable k*, given the category ik of variable k, in the case where both conditional dependencies are nonzero, is given by expression (15), where the standardizing quantities are, respectively, the conditional dependencies of categories ik* and i*k* of variable k*, given the category ik of variable k. This measure of association has the same properties as the unconditional codependence coefficient given at (3), for conditional probability distributions are themselves probability distributions.

Applications. To show how the proposed coefficients and tests behave we will use simulated data and real examples. The simulated examples were run on some chosen structures so as to show how the tools act in known situations, and the real ones were chosen from the literature and from the large amount of data we have in our archives.

The simulations were done using the command rmultinom in R 2.10.1 (2009) to generate structured independence, and dependence as well. Both independence- and dependence-structured arrays were generated with dimensions 3×3×2×2, so as to exemplify all the tests described above.

For constructing the (3×3)×(2×2) arrays ―presented below as panels with four 3×3 tables― structures of independence were imposed on each of the four 3×3 tables: first global independence in the array, and second independence within each of the four tables but differently distributed from each other. All four tables were equally weighted.

To create the joint probability distribution underlying each simulated 3×3 table in the array, and governing the outcome of its frequencies, first the conditional distribution of column j, [p1|j, p2|j, p3|j]t, was chosen for each of the four tables, with the marginal distribution over columns equal to [1/3, 1/3, 1/3]t. The joint distributions in each of the four tables were obtained by the transformation pij = pi|j × p+j. Under global independence each 3×3 table had all column conditional distributions equal to (p1|j, p2|j, p3|j)t = (0.5, 0.3, 0.2)t, so that the array joint distribution was as shown in Array 1.

So, to simulate a multinomial distribution with parameters n = 360 ―chosen so as to expect frequency 6 in the lowest-probability cells― and p, first the R 2.10.1 (2009) command rmultinom was run for a 4-nomial(360, (0.25, 0.25, 0.25, 0.25)t) distribution; then the four 3×3 frequency tables were respectively constructed by running four 9-nomial(m, p) distributions, with the sample sizes m corresponding to the numbers just generated (now sample sizes) and p equal to each of the four equal portions of vector P in Array 1, lexicographically ordered. The generated data are in Array 2 below.

(Array 1 here)

(Array 2 here)
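The two-stage sampling scheme just described can be mirrored in Python, with random.choices standing in for R's rmultinom (an illustrative substitute; the seed and the re-use of one column profile in every sub-table are assumptions of this sketch):

```python
import random
from collections import Counter

random.seed(1)

# Stage 1: split n = 360 over the four 3x3 sub-tables with uniform
# probabilities, as in the 4-nomial(360, (0.25, 0.25, 0.25, 0.25)) draw.
n = 360
panel = Counter(random.choices(range(4), k=n))

# Stage 2: within each sub-table draw a 9-nomial sample whose cell
# probabilities are p_ij = p_{i|j} * (1/3), with every column sharing the
# profile (0.5, 0.3, 0.2), i.e. the global-independence setup.
col_profile = [0.5, 0.3, 0.2]
p = [col_profile[i] / 3 for i in range(3) for _ in range(3)]  # p[3*i + j] = cell (i, j)
tables = []
for t in range(4):
    cells = Counter(random.choices(range(9), weights=p, k=panel[t]))
    tables.append([[cells[3 * i + j] for j in range(3)] for i in range(3)])

print(sum(sum(sum(r) for r in tb) for tb in tables))  # 360
```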

The first four sample sizes m run for the primary tables, based on the uniform distribution between them, were: table v3.1×v4.1 = 103, table v3.2×v4.1 = 86, table v3.1×v4.2 = 78 and table v3.2×v4.2 = 93. Applying the dependence test developed in this paper to the array we get: total dependence = 0.0083, which gave, with df = 29, 8nD² = 23.91 and p-value = 0.498, confirming the global independence structure of Array 1. In a real application we would stop the analysis at this point, for the test showed no evidence of any association in the array.

But to show the capability of the tests developed above we go forward, analysing this case for the possibility of discarding some of the dimensions. The results were: for the dependence of v4, the fourth dimension, p-value = 0.372; of v3, p-value = 0.196; of v2, p-value = 0.564; and of v1, p-value = 0.747. Of course, if there is no evidence of global association, with high probability no dependence of any variable whatsoever should be found.

Going forward and applying the tests for simple dependence, presented in Cordeiro (1987) and Khan & Ali (1973), to each one of the four subtables, the p-values, in the same order presented above, were: 0.220, 0.462, 0.870 and 0.836, as with high probability they should be by construction.

Let us now study a different situation, in which there is independence within each of the four tables, i.e. conditioned on each level of the cartesian product of variables v3×v4 there is independence between v1 and v2, though with different structures. The global vector Q of probabilities is given in Array 3, and the result of running a complete multinomial sample with parameters (360, Q), in the same way as above, is in Array 4.

(Array 3 here)

(Array 4 here)

The first multinomial frequencies run for the four primary tables, based on the uniform distribution between them, were: for table v3.1×v4.1 = 107, for table v3.2×v4.1 = 90, for table v3.1×v4.2 = 83 and for table v3.2×v4.2 = 80.

Applying the global dependence test to the array we got: total dependence = 0.0299, which produced, with df = 29, 8nD² = 86.02 and p-value = 1.5×10−7, as was to be expected from the imposed dependence structure of Array 3.

The tests for the dependence of the dimensions, seeking collapsibility in the array, gave: p-value = 6.7×10−7 for v4, 0.008 for v3, 0.880 for v2 and p-value = 4.0×10−5 for v1, which means v2 could be eliminated from the association problem, there being no evidence that it is globally associated with the remaining dimensions.

To see why v2 could be eliminated from the problem let us take a look at Array 5, where we find the conditional probability distributions of v2 given each case of the array v1×v3×v4, the marginal distribution of v2, and the conditional distributions of the array v1×v3×v4 given the levels of v2 as well. These distributions are exactly equal, so the three levels of v2 do not affect the corresponding distributions of the arrays they determine, nor do the levels of v1×v3×v4 alter the distribution of v2.

(Array 5 here)

So, from now on our problem can be reduced by collapsing Array 4 along v2 to the array v1×v3×v4, and what we see, when applying the tests for global and dimension dependencies, are p-values all indicating evidence of association between all variables in all directions: for global dependence, p-value = 4×10−12; for the dependence of v4, p-value = 2.9×10−10; of v3, p-value = 0.004; and of v1, p-value = 1.1×10−12.

Applying simple dependence analysis (Cordeiro, 1987) to the conditioned tables of the reduced array of generated data gave the following results: for table v1×v3|v4=v4.1 we got p-value = 0.001, and for v1×v3|v4=v4.2 the p-value was 0.011; for table v1×v4|v3=v3.1, p-value = 6×10−9, and for v1×v4|v3=v3.2, p-value = 0.0005. For v3×v4|v1=v1.1 we got p-value = 0.016; for v3×v4|v1=v1.2 it was 0.85; and for v3×v4|v1=v1.3 the p-value was 0.19.

The dependence significances shown in the last paragraph are explained by the fact that not only the strength but also the forms of association along the levels of dimensions v1, v3 and v4 were very different, which could be seen in the dependence diagrams (Cordeiro, 1987), not shown here.

Real data example 1. In a survey in Catanduva, State of São Paulo, Brazil, 1510 subjects aged 65 years or older were classified by having Alzheimer's disease or not, being illiterate or not, and sex. The data are summarized in Array 6.

(Array 6 here)

Testing for joint independence between the three variables, 8nD² = 63.4 (p-value = 5×10−13) indicates a highly significant association in the array. To see if the array can collapse along any variable, one has: for disease, p-value = 2×10−6; for sex, p-value = 1×10−13; for illiteracy, p-value = 2×10−8; and concludes the array cannot collapse along any dimension, i.e. there is evidence that any two variables have different association structures according to the levels of the third one. Interestingly, this can include significant association at one level and non-significant association at the other (see p-values and dependence diagrams in Panel 1 below).

The codependence coefficients show that the structure of association in this array is complex, in spite of codependence coefficients in 2×2 tables always being non-negative: 0.986, 0.982 and 0.997, indicating that the levels of disease, sex and illiteracy each have concurrent combinations of categories of the other variables.

To show just one aspect of these association structures we can use the features of simple dependence analysis (Cordeiro, 1987), here the dependence diagram, through which we can see the structure of dependence between Alzheimer's disease and illiteracy, according to sex, as presented in Panel 1.

As 98% of the elderly males in the study were not diseased, a p-value = 0.31 shows no evidence of association between disease and illiteracy. Looking at the figure for the males we see the two illiteracy categories plotted opposite each other in relation to the horizontal line, just on the vertical line near the center of the figure, which represents a balance for illiteracy. The "No Alzheimer" point very near the center of the figure is due to its high predominance in the male population, which makes the marginal profile of disease almost equal to its conditional profile, given sex = male.

(here: Panel 1. Dependence diagram of Alzheimer's disease and illiteracy in the elderly, according to gender ― Catanduva-SP, Brazil; figures for males and females side by side)

Notwithstanding, amongst the elderly females, with a p-value = 0.050 for the dependence test, we see these categories also on opposite sides of the horizontal, but not so vertical and further from the center than in the males (notice the diagram scales are the same). In this female population the proportion not diseased was estimated as 40%, much lower than among the males, and in the diagram we see Alzheimer's disease and illiteracy in the same quadrant, which can be interpreted as a positive association between the two categories. This sex effect could be explained, despite the significantly smaller percentage of illiterate elderly women (40% against 53% among men), by the cultural behavior of Brazilians of the country's interior, mainly among the elderly, who lived in times when women stayed at home while men went out to work and amuse themselves. Men exercised the brain much more than women in those old times, and this could be an explanation for what we saw.

So the sex effect on the association structures of illiteracy and Alzheimer's disease in the above array, captured by the p-value = 1×10−13, could be interpreted as a stronger effect of illiteracy towards raising the probability of disease in females than in males of that Brazilian city.

Real data example 2. In a Brazilian multicenter study on multiple organ failure (MOF), 553 individuals were evaluated on, among other variables, being or not a case for urgent treatment (Not.Urgent/Urgent) and having or not a slow heart rate (Not.slow.HR/Slow.HR).

Testing for global dependence, a p-value = 2×10−13 indicates very significant dependence in the array. The tests for collapsing gave the p-values: 7×10−11 for Urgency, 1×10−13 for MOF and 0.008 for heart-rate slowness. So every dimension is important for the problem.

To see how each dimension influences the association structures between the remaining two, let us begin with the association between Urgency and MOF, given HR slowness. Testing for dependence at "Not slow HR", a p-value = 3×10−10 indicates a high degree of association between Urgency and MOF at this level of HR slowness, and a p-value = 0.009 suggests association at "Slow HR" too. Panel 2 shows the dependence diagrams for the levels of HR slowness.

Notwithstanding the association evidence at both HR slowness levels, this dimension's effect-test p-value = 0.0008 evidences some difference between the two association structures. What the simple dependence analysis (Cordeiro, 1987) diagrams show is that the form of association is more or less the same at the two levels of HR slowness; but, despite knowing that the p-value is sample-size sensitive, we can say that at "Not slow HR" the association between MOF and Urgency is stronger than at "Slow HR", for the displayed "Urgent" and "MOF" points at "Not slow HR" are placed further from the balance center (0,0) point than at "Slow HR". (Notice the same scale in both figures.)

(here: Panel 2. Dependence diagram for Urgency and MOF, according to HR slowness; figures "Not slow heart rate" and "Slow heart rate" side by side)

Along Urgency, a dependence-test p-value = 7×10−11 evidences a strong effect of this dimension on the association structures between "HR slowness" and MOF. At the "Urgent" level a p-value = 0.12 suggests no evidence of association, while at "Not urgent" a p-value = 0.001 says we should accept association.

Panel 3 shows the dependence diagrams of "HR slowness"×MOF, according to Urgency, to illustrate these association structures. In the "Not urgent" level diagram we see the "HR slowness" and MOF points farther from the balance center than in the "Urgent" one, which reflects the stronger degree of association at the first level.

(here: Panel 3. Dependence diagram for HR slowness and MOF, according to urgency; figures "Not urgent" and "Urgent" side by side)

Altogether these results can help to model a concatenation of HR slowness with Urgency so as to predict MOF, as the ordered combinations 0_"HR slowness=No, Urgency=No", 1_"HR slowness=Yes, Urgency=No", 2_"HR slowness=No, Urgency=Yes" and 3_"HR slowness=Yes, Urgency=Yes".

After applying a binary logistic regression of MOF on this combination, with 0_"HR slowness=No, Urgency=No" as the reference state, we respectively got the adjusted odds ratios 5.4 (95% CI: 2.1-13.8), 9.7 (95% CI: 4.4-21.2) and 17.8 (95% CI: 7.2-43.9), and an odds-ratio trend-test p-value < 0.0005, indicating Urgency as more important than HR slowness for predicting MOF. This is a good example to show that this analysis can help other, better-known ones to achieve good final goals.

Real data example 3. Fienberg (1980) shows an example of a sample of 192 adult male Anolis lizards of Bimini from two species (A. sagrei and A. angusticeps), counted by structural habitat as in the following array.

(Array 8 here)

Testing for global independence in the array, a p-value = 4×10−13 indicates a highly significant association amongst the variables. To see if the array can collapse along any variable, one has for Species p-value = 2×10−12, for Diameter p-value = 4×10−8 and for Height p-value = 2×10−12, and concludes the array cannot collapse along any variable. These results show that the structures of association in this array are also complex. The codependence coefficients in the array are: 0.960, 0.968 and 0.917.

The results of simple dependence analysis (Cordeiro, 1987) on the cross tables of Species×Diameter, given perch Height, show no evidence of association at "low height" but a highly significant one at "high height", as shown in Panel 4.

In Panel 4, in the "perch height ≤ 5 feet" figure, we see both diameter categories, and the most frequent species A. sagrei as well, plotted just near the balance (non-association) center, and A. angusticeps a little farther from it, which explains the high p-value = 0.30 ―in spite of the very low observed frequency of A. angusticeps at this perch height― giving no evidence of association between species and perch diameter.

On the contrary, when the resting places are higher, at "perch height > 5 feet", the species seem to prefer different perches to rest on: A. sagrei thicker ones and A. angusticeps thinner ones. The significance of perch height (p-value = 2×10−12) reflects this difference in association structure.

(here: Panel 4. Dependence diagram of species×diameter, given perch height; figures "≤5 feet" and ">5 feet" side by side)

Analysing the association Species×Height, given perch Diameter, what simple dependence analysis shows is no evidence of association when the perch is thicker (p-value = 0.54) but a highly significant one when it is thinner (p-value = 7×10−9), as shown in Panel 5 below.

Again a case of significance of perch Diameter (p-value = 4×10−8) meaning not only different association structures but evidencing a contrast between association and non-association: in the thicker-perch figure of Panel 5 we see a cluster of points plotted just near the non-association center, with both perch heights and A. sagrei, and A. angusticeps a little further away but also near the cluster, with evidence of non-association. On the contrary, when the diameter is thinner the significant association translates the preference of A. angusticeps for higher perches and of A. sagrei for lower ones.

(here: Panel 5. Dependence diagram of species×height, given diameter; figures "≤2.5 inch" and ">2.5 inch" side by side)

As is written on the Bimini Biological Field Station site (2010), it is known that A. sagrei is "found more frequently on the ground and A. angusticeps more on trees and is not commonly seen as some of the other Anolis lizards due to it's camouflage color" (sic).

What the current analysis says is that there is no evidence that A. sagrei would prefer specific places on a tree (Panel 6, left dependence diagram), except for lower perches when they are thin (Panel 5, left dependence diagram), while A. angusticeps ―with lower frequency, 27 individuals in total, and a p-value very near 0.05― seems to prefer thinner and higher perches less than thicker and lower ones.

(here: Panel 6. Dependence diagram of diameter×height, given species; figures A. sagrei and A. angusticeps side by side)

Conclusion. We presented the foundations of a method for dealing with multidimensional contingency tables, or contingency arrays, under the simple-sample design which, differently from multiple correspondence analysis (Greenacre & Blasius, 2006), works in a multivariate way and permits the evaluation of a kind of strength of association between levels of the same variable, with corresponding significance tests.

We also presented applications with generated data, which showed the tests working well in both situations ―global independence and structured dependence― including the reduction of complexity by collapsing along the second dimension, as in simulated example 2.

With this method we can reduce the complexity of the array by collapsing it along independent variables, and the founding ideas given here enable us to further develop the codependence between two groups of variables ―for example, in medicine, between clinical and demographic variables― path analysis for categorical variables, as was done by Wright (1921), and canonical codependence, as with numerical variables.

Three real examples were presented, the first two with local data. These two local examples allowed some interpretations of the relations between the variables analysed, and in the second real example the analysis permitted the construction of a concatenation of variables to predict multiple organ failure (MOF), which proved to be good by binary logistic regression. The third one had already been presented by other authors in the specialized literature, and its presentation here permitted some new insights into the behavior of the Bimini lizards.

Acknowledgments. This research was supported by FAPESP ―Foundation for Research Support of the State of São Paulo, Brazil (Proc. 2008/55101-8). We thank Dr. Carlos Alberto Ribeiro Diniz very much for his attentive reading of the manuscript and for his important suggestions for improving it.

References.

Agresti, A., 2002. Categorical Data Analysis. Wiley, Hoboken, NJ-USA.

Bickel, P.J., Doksum, K.A., 1977. Mathematical Statistics: Basic Ideas and Selected Topics.

Holden-Day, Oakland, Ca.

Bimini Biological Field Station, in: http://www6.miami.edu/sharklab/aboutbimini_reptiles.html

Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W., 1975. Discrete Multivariate Analysis: Theory and Practice. The MIT Press, Cambridge, Mass.

Cordeiro, J.A., 1987. Analysis of Dependency. Technical Report n. 48/87, State University of Campinas, SP, Brazil.

Fienberg, S.E., 1980. The analysis of cross-classified categorical data. The MIT Press,

Cambridge, Mass.

Goldstein, M., Wolf, E., Dillon, W., 1976. On a test for independence for contingency tables.

Commun. Statist.-Theor. Meth., A5(2), 159-169.

Goodman, L. A., 1954. Measures of association for cross-classification tables. JASA 49 (268),

732-764.

Goodman, L. A., 1981. Criteria for Determining Whether Certain Categories in a Cross-Classification Table Should Be Combined, with Special Reference to Occupational Categories in an Occupational Mobility Table. The American Journal of Sociology 87(3), 612-650.

Goodman, L. A., 1996. A Single General Method for the Analysis of Cross-Classified Data:

Reconciliation and Synthesis of Some Methods of Pearson, Yule, and Fisher, and Also Some

Methods of Correspondence Analysis and Association Analysis. JASA 91(433), 408-428.

Greenacre, M.J., 1984. Theory and Applications of Correspondence Analysis. Academic Press, London.

Greenacre, M.J. and Blasius, J., 2006. Multiple Correspondence Analysis and Related Methods.

Chapman & Hall/CRC.

Grizzle, J.E., Starmer, C.F. and Koch, G.G., 1969. Analysis of categorical data by linear models. Biometrics 25, 489-504.

Hellinger, E., 1909. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math. 136, 210-271.

Hewitt, E., Stromberg, K., 1975. Real and Abstract Analysis. Springer-Verlag, NY.

Khan, A.H., Ali, S.M., 1973. A new coefficient of association. Ann. Inst. Stat. Math. 25(1), 41-50.

Lebart, L., Morineau, A. and Warwick, K.M., 1984. Multivariate Descriptive Statistical Analysis: Correspondence Analysis and Related Techniques for Large Matrices. Wiley, NY.

Matusita, K., 1967. On the notion of affinity of several distributions and some of its applications.

Ann. Inst. Statist. Math. 19, 181-192.

The R Project for Statistical Computing, 2009. http://www.r-project.org.

Wright, S., 1921. Correlation and causation. J. Agricultural Research 20, 557-585.

Appendix A. Test for independence of a dimension k. Following Khan and Ali (1973) and definition (4) above, with the parameters replaced by their maximum likelihood estimates, one obtains after some algebraic manipulation

.

Under the condition –1 < < 1 for all (i1, …, ik−1, ik+1, …, iK), the Maclaurin approximation and the substitution of the relative frequencies by the absolute ones give

, where || is of order n^{3/2} while is of order n^{1/2}. Then, under H0k: , (8n) is asymptotically distributed as a chi-squared variate with Rk degrees of freedom.
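As a rough computational illustration of the ingredients above (the paper's exact expressions and the degrees of freedom Rk are not reproduced here, so this sketch assumes a Matusita (1967)-type statistic 8n(1 − ρ̂), with ρ̂ the estimated Hellinger affinity between the joint distribution and the product of the margin of dimension k with the margin of the remaining dimensions):

```python
import numpy as np

def affinity_statistic(table, k):
    """8n(1 - rho_hat), where rho_hat estimates the Hellinger affinity
    between the joint distribution and the product of the margin of
    dimension k with the margin over the remaining dimensions.
    Illustrative only; the form of the statistic is an assumption,
    not the paper's exact expression."""
    n = table.sum()
    p = table / n                                       # joint relative frequencies
    p_k = p.sum(axis=tuple(i for i in range(p.ndim) if i != k))  # margin of dim k
    p_rest = p.sum(axis=k)                              # margin over the other dims
    # product distribution, broadcasting dim k's margin into place
    shape = [1] * p.ndim
    shape[k] = p.shape[k]
    prod = p_k.reshape(shape) * np.expand_dims(p_rest, axis=k)
    rho = np.sqrt(p * prod).sum()                       # Hellinger affinity estimate
    return 8.0 * n * (1.0 - rho)

# a table built under exact independence of the first dimension
joint = np.outer([0.5, 0.5], [0.2, 0.3, 0.5])
table = (1000 * joint).round()
stat = affinity_statistic(table, k=0)  # close to zero, by construction
```

Under independence the statistic is near zero, while strong dependence drives it up; in the paper's test it would be referred to a chi-squared distribution with Rk degrees of freedom.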

Appendix B. Consequences of eliminating an independent level ik of a variable k. To analyse what happens when a level ik of a variable k is independent, let us suppose the hypothesis , i.e. for all configurations (i1, …, ik−1, ik+1, …, iK), 1 ≤ ij ≤ Ij, j ≠ k.

One can look first at what happens to the codependence of any two categories and of variable k other than , and then at that of any two categories of any other variable k*. The probabilities involving category ik are not used to calculate the codependence of the other categories and of variable k; consequently the nullity of the dependence of category ik does not alter their codependence.

Now, if and are indices of two categories of any other variable k*, say k < k*, then

where is the marginal codependence of categories and of variable k* when variable k is eliminated from the array, with the marginal probability , and .

Corollary. That the dependence of a variable k is null, i.e. , means, according to definition (4), that for all categories of variable k, which results in for any two categories and of any other variable k*, as shown above. That is, no alteration whatsoever occurs in the codependence of two levels of any remaining variable.

Appendix C. Test for independence of level ik of variable k, and for independence of variable k. The same methods, and similar arguments and conditionings on expression (2''), as in Appendix A give the result for testing independence of level ik of variable k as follows:

or, equivalently,

which, under and Slutsky's theorem (Bickel & Doksum, 1977, p. 461), is asymptotically distributed as a chi-squared variate with degrees of freedom.

For testing we just use the sum , which under the conditions already stated above is asymptotically distributed as a chi-squared variate with degrees of freedom.
