Date post: | 25-Dec-2015 |
Upload: | cory-bruce |
Xuhua Xia
Smoking and Lung Cancer
This chest radiograph demonstrates a large squamous cell carcinoma of the right upper lobe.
This is a larger squamous cell carcinoma in which a portion of the tumor demonstrates central cavitation, probably because the tumor outgrew its blood supply. Squamous cell carcinomas are among the more common primary malignancies of the lung and are most often seen in smokers.
Smoking and Lung Cancer
                  Smoker   Non-smoker
Lung Cancer          105            3
No Lung Cancer     99895        99996
Sub-total         100000       100000
Sub-total: the number of smokers and non-smokers sampled from the population.
Sickness and Medication
Association between being sick and taking medicine:
            Taking medicine   Not taking medicine
Sick                    990                   111
Healthy                  10                   889
Sub-total              1000                  1000
Biological and statistical questions
“Taking medicine” is strongly associated with “Sick”. Can we say that “Sick” is caused by “Taking medicine”?
Simpson’s paradox
Treatment A Treatment B
Kidney stones 78% (273/350) 83% (289/350)
Small Stones 93% (81/87) 87% (234/270)
Large Stones 73% (192/263) 69% (55/80)
C. R. Charig et al. 1986. Br Med J (Clin Res Ed) 292: 879–882
Treatment A: all open procedures
Treatment B: percutaneous nephrolithotomy
Question: which treatment is better?
Conclusion changed when a new dimension is added.
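The reversal in the kidney stone table can be checked directly from the counts. The following is a minimal sketch in Python (a language the slides themselves do not use); the dictionary layout and the helper names `rate` and `pooled` are illustrative assumptions, and only the counts come from the table above.

```python
# (successes, total) per treatment and stone size, copied from the table above.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    """Success rate as a fraction."""
    return successes / total

def pooled(treatment):
    """Add up successes and totals over both stone sizes."""
    s = sum(c[0] for c in counts[treatment].values())
    t = sum(c[1] for c in counts[treatment].values())
    return s, t

# Within each stone size, compare the two treatments.
for size in ("small", "large"):
    a = rate(*counts["A"][size])
    b = rate(*counts["B"][size])
    print(f"{size:5s} stones: A = {a:.1%}, B = {b:.1%}")

# Ignoring stone size, compare again.
pa = rate(*pooled("A"))
pb = rate(*pooled("B"))
print(f"pooled      : A = {pa:.1%}, B = {pb:.1%}")
```

Treatment A has the higher rate within each stone size but the lower rate after pooling, because A was applied mostly to large stones (263 of 350 cases), which have a lower success rate under either treatment.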
What is a Contingency Table?
• A contingency table: a table of counts cross-classified according to categorical variables.
• A contingency table has r rows and c columns, and is referred to as an r x c contingency table.
• The simplest contingency table is a 2 x 2 table.
• The most typical null hypothesis: the counts found in the rows are independent of the counts found in the columns.
Contingency Tables and χ²-Test
• The chi-square test is based on the χ² distribution.
• The chi-square test is typically used in tests for goodness of fit, i.e., how well the observed values fit the expected values.
• The SAS procedure FREQ can be used to output chi-square statistics.
• Chi-square test and Yates correction for continuity.
What is a Contingency Table?
             Response
Sex       Favour   Oppose
Male          61       34
Female        43       52

             Response
Sex       Favour   Oppose   Total
Male         n11      n12     n1.
Female       n21      n22     n2.
Total        n.1      n.2     n..
n11, n12, n21, n22: cells. n1., n2.: marginal totals (row totals). n.1, n.2: marginal totals (column totals). n..: total.
What is a Contingency Table?
             Response
Sex       Favour   Oppose
Male          61       34
Female        43       52
The null hypothesis: The response is independent of sex (i.e., the response is the same for both sexes).
Another way of stating the null hypothesis is that the sex ratio is the same for each response category.
The null hypothesis can be tested with the Chi-square test of goodness-of-fit.
X²-Test of a Contingency Table
             Response
Sex       Favour   Oppose   Total
Male          61       34      95
Female        43       52      95
Total        104       86     190
• Marginal totals
• Expected frequencies (the test should be done on counts, not on proportions).
• Degrees of freedom
• X² value: 0 if the data are perfectly consistent with the null hypothesis.
• p: the probability of obtaining an X² value at least as large as the observed one, given that the null hypothesis is true, i.e., p(X²|H0).
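X²-Test of a Contingency Table
             Response
Sex       Favour   Oppose   Total
Male          61       34      95
Female        43       52      95
Total        104       86     190

$$\hat{e}_{ij} = \frac{n_{i.}\, n_{.j}}{n_{..}}, \quad \text{e.g., } \hat{e}_{11} = \frac{n_{1.}\, n_{.1}}{n_{..}} = \frac{95 \times 104}{190} = 52$$
$$X^2 = \sum_{i}\sum_{j} \frac{(n_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}$$

Expected frequencies:
Sex       Favour   Oppose
Male          52       43
Female        52       43
• Do hand-calculation of X².
• What is the df associated with the test?
• df = (r-1)(c-1)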
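The hand calculation requested above can be cross-checked with a few lines of code. This is a sketch in Python, not part of the original slides; the function name `pearson_chi_square` is an illustrative assumption. It builds the expected frequencies from the marginal totals and sums (observed - expected)²/expected over all cells.

```python
import math

observed = [[61, 34],   # Male:   Favour, Oppose
            [43, 52]]   # Female: Favour, Oppose

def pearson_chi_square(table):
    """Return (X2, df) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # e_ij = n_i. * n_.j / n..
            x2 += (n_ij - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

x2, df = pearson_chi_square(observed)
# For df = 1 the upper-tail chi-square probability equals erfc(sqrt(x/2)).
p = math.erfc(math.sqrt(x2 / 2))
print(f"X2 = {x2:.3f}, df = {df}, p = {p:.4f}")
```

The expected frequencies come out as 52 and 43 in each row, as on the slide, and the test has one degree of freedom.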
Chi-square Distribution
[Figure: χ² probability density f(x) plotted against x (0 to 20) for ν = 2, 4 and 8.]
The χ² distribution is a special case of the gamma distribution with α = ν/2 and β = 2.
$$p(x; \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}$$
$$p(x; \nu/2, 2) = \frac{x^{\nu/2-1} e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)}$$
The p value in chi-square test:
$$p = 1 - \int_0^{x} p(t; \nu/2, 2)\,dt$$
In EXCEL, p = chidist(x,DF) = 1-gammadist(x,DF/2,2,true)
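The upper-tail probability defined above can also be evaluated without a spreadsheet. The sketch below is in Python using only the standard library; the function name `chi_square_sf` is an assumption made for illustration. Instead of numerical integration it uses the finite-sum closed forms that exist for integer degrees of freedom (an exponential series for even df, `erfc` plus a short series for odd df).

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability P(X2 >= x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus (df - 1)/2 extra terms.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)  # first extra term
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)  # next term: multiply by x / (2k + 1)
    return total

# The usual 5% critical values should give a probability close to 0.05.
for df, crit in [(1, 3.841459), (2, 5.991465), (3, 7.814728), (4, 9.487729)]:
    print(df, round(chi_square_sf(crit, df), 4))
```

The function plays the same role as `chidist(x, DF)` in the EXCEL line above, restricted to integer DF.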
Categorical Data & Associated Tests
2 by 2 contingency table
Sex      | Response
---------+--------+--------+
         |Favour  |Oppose  |
---------+--------+--------+
male     |     61 |     34 |
---------+--------+--------+
female   |     43 |     52 |
---------+--------+--------+

Data BigIssue;
  input gender $ response $ wt @@;
cards;
Male Favour 61 Female Favour 43
Male Oppose 34 Female Oppose 52
;
proc freq;
  table gender*response / chisq;
  weight wt;
run;

chisq: request the X²-test and measures of association.
SAS Output
GENDER     RESPONSE
Frequency|
Percent  |
Row Pct  |
Col Pct  |Favour  |Oppose  |  Total
---------+--------+--------+
Female   |     43 |     52 |     95
         |  22.63 |  27.37 |  50.00
         |  45.26 |  54.74 |
         |  41.35 |  60.47 |
---------+--------+--------+
Male     |     61 |     34 |     95
         |  32.11 |  17.89 |  50.00
         |  64.21 |  35.79 |
         |  58.65 |  39.53 |
---------+--------+--------+
Total         104       86      190
            54.74    45.26   100.00
SAS Output
Statistic                     DF     Value        Prob
------------------------------------------------------
Chi-Square                     1     6.883       0.009
Likelihood Ratio Chi-Square    1     6.927       0.008
Continuity Adj. Chi-Square     1     6.139       0.013
Mantel-Haenszel Chi-Square     1     6.847       0.009
Fisher's Exact Test (Left)                       0.997
                    (Right)                   6.50E-03
                    (2-Tail)                     0.013
Phi Coefficient                      0.190
Contingency Coefficient              0.187
Cramer's V                           0.190

---------+--------+--------+
         |Favour  |Oppose  |
---------+--------+--------+
male     |     61 |     34 |
---------+--------+--------+
female   |     43 |     52 |
---------+--------+--------+

$$G = 2\left[\sum_{i=1}^{N_r}\sum_{j=1}^{N_c} f_{ij}\ln(f_{ij}) - \sum_{i=1}^{N_r} R_i\ln(R_i) - \sum_{j=1}^{N_c} C_j\ln(C_j) + n\ln(n)\right]$$
$$\chi_c^2 = \frac{n\left(|f_{11}f_{22} - f_{12}f_{21}| - \frac{n}{2}\right)^2}{R_1 R_2 C_1 C_2}$$
Here f_ij are the cell counts, R_i the row totals, C_j the column totals and n the grand total.
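The likelihood ratio statistic G and the continuity-adjusted χ² in the SAS output can be reproduced from the 2 x 2 counts. The sketch below is in Python rather than SAS; the helper `sum_xlnx` and the variable names are illustrative assumptions. It applies G = 2[ΣΣ f ln f - Σ R ln R - Σ C ln C + n ln n] and χ²c = n(|f11 f22 - f12 f21| - n/2)² / (R1 R2 C1 C2).

```python
import math

f = [[61, 34],   # male:   Favour, Oppose
     [43, 52]]   # female: Favour, Oppose

R = [sum(row) for row in f]          # row totals
C = [sum(col) for col in zip(*f)]    # column totals
n = sum(R)                           # grand total

def sum_xlnx(values):
    """Sum of v * ln(v); zero counts contribute nothing."""
    return sum(v * math.log(v) for v in values if v > 0)

cells = [v for row in f for v in row]
G = 2 * (sum_xlnx(cells) - sum_xlnx(R) - sum_xlnx(C) + n * math.log(n))

# Continuity-adjusted chi-square for a 2 x 2 table.
cross = abs(f[0][0] * f[1][1] - f[0][1] * f[1][0])
chi2_c = n * (cross - n / 2) ** 2 / (R[0] * R[1] * C[0] * C[1])

print(f"G = {G:.3f}, continuity-adjusted chi-square = {chi2_c:.3f}")
```

Both values agree with the "Likelihood Ratio Chi-Square" and "Continuity Adj. Chi-Square" rows of the output to three decimals.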
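Formulas for different statistics
Statistics for significance tests:
$$X^2 = \sum_i \sum_j \frac{(n_{ij} - m_{ij})^2}{m_{ij}}, \quad \text{where } m_{ij} = \frac{n_{i.}\, n_{.j}}{n}$$
$$G^2 = 2\sum_i \sum_j n_{ij}\ln(n_{ij}/m_{ij})$$
$$Q_{MH} = (n-1)\,r^2$$
Measures of association:
$$\phi = \frac{n_{11}n_{22} - n_{12}n_{21}}{\sqrt{n_{1.}\,n_{2.}\,n_{.1}\,n_{.2}}} \ \text{ for } 2 \times 2 \text{ tables}; \qquad \phi = \sqrt{X^2/n} \ \text{ otherwise}$$
$$P = \sqrt{\frac{X^2}{X^2 + n}}$$
$$V = \sqrt{\frac{X^2/n}{\min(R-1,\; C-1)}}$$
Note that Phi can be used only with 2 x 2 contingency tables; otherwise the value may be greater than 1.
r in Q_MH is the correlation between the two categorical variables coded in binary.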
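To connect the formulas for the measures of association with the SAS output for the Sex by Response table, here is a small sketch in Python (not the language of the slides; all variable names are illustrative assumptions). It relies on the fact that for a 2 x 2 table the correlation between the two binary-coded variables equals φ, so the Mantel-Haenszel statistic is computed as (n - 1)φ².

```python
import math

table = [[61, 34],   # Male:   Favour, Oppose
         [43, 52]]   # Female: Favour, Oppose
rows = [sum(r) for r in table]
cols = [sum(c) for c in zip(*table)]
n = sum(rows)

# Pearson X2 from observed and expected counts m_ij = n_i. * n_.j / n.
x2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
         for i in range(2) for j in range(2))

# Phi for a 2 x 2 table.
phi = ((table[0][0] * table[1][1] - table[0][1] * table[1][0])
       / math.sqrt(rows[0] * rows[1] * cols[0] * cols[1]))
# Contingency coefficient P and Cramer's V.
P = math.sqrt(x2 / (x2 + n))
V = math.sqrt((x2 / n) / min(len(rows) - 1, len(cols) - 1))
# Mantel-Haenszel chi-square, using r = phi for binary coding.
Q_MH = (n - 1) * phi ** 2

print(f"phi = {phi:.3f}, P = {P:.3f}, V = {V:.3f}, Q_MH = {Q_MH:.3f}")
```

For a 2 x 2 table |φ| and V coincide, which is why the SAS output lists 0.190 for both.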
χ² and Measures of Association
             Response
Sex       Favour   Oppose   Total
Male           1        3       4
Female         3        1       4
Total          4        4       8

The same pattern as above, except that the sample size is doubled:
             Response
Sex       Favour   Oppose   Total
Male           2        6       8
Female         6        2       8
Total          8        8      16

Should the two data sets have the same measure of association? Should they yield the same X² value?
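The two questions can be explored numerically. The sketch below is in Python and uses the shortcut form of Pearson's X² for a 2 x 2 table, X² = n(ad - bc)² / (R1 R2 C1 C2), together with φ; the function name `chi_square_and_phi` is an assumption made for illustration.

```python
import math

def chi_square_and_phi(table):
    """Pearson X2 (shortcut 2 x 2 form) and phi for [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    r1, r2 = a + b, c + d          # row totals
    c1, c2 = a + c, b + d          # column totals
    n = r1 + r2
    x2 = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
    phi = (a * d - b * c) / math.sqrt(r1 * r2 * c1 * c2)
    return x2, phi

small = [[1, 3], [3, 1]]   # n = 8
large = [[2, 6], [6, 2]]   # n = 16, same pattern

x2_small, phi_small = chi_square_and_phi(small)
x2_large, phi_large = chi_square_and_phi(large)
print(f"n =  8: X2 = {x2_small:.1f}, phi = {phi_small:.2f}")
print(f"n = 16: X2 = {x2_large:.1f}, phi = {phi_large:.2f}")
```

Doubling every count doubles X² but leaves φ unchanged: the test statistic reflects the amount of evidence, while the measure of association reflects only the pattern.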
Sex and Hair Color
GENDER     COLOR
         | Black  | Blond  | Brown  | Red    |  Total
---------+--------+--------+--------+--------+
Female   |     55 |     64 |     65 |     16 |    200
---------+--------+--------+--------+--------+
Male     |     32 |     16 |     43 |      9 |    100
---------+--------+--------+--------+--------+
Total          87       80      108       25     300
Write a SAS program to test the association between Gender and Hair Color.
SAS Output
Statistic                     DF     Value      Prob
------------------------------------------------------
Chi-Square                     3     8.987     0.029
Likelihood Ratio Chi-Square    3     9.512     0.023
Mantel-Haenszel Chi-Square     1     0.459     0.498
Phi Coefficient                      0.173
Contingency Coefficient              0.171
Cramer's V                           0.173
Sample Size = 300
The Mantel-Haenszel statistic is appropriate only when the two classification variables are on an ordinal scale (e.g., poor, average, good, excellent).
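The Pearson chi-square and Cramer's V for the Gender by Hair Color table can be checked independently of SAS. The following Python sketch is not the SAS program the exercise asks for; it only recomputes two of the reported statistics, and its variable names are illustrative assumptions. The p value uses the closed form of the χ² upper tail for 3 degrees of freedom.

```python
import math

# Columns: Black, Blond, Brown, Red.
table = [[55, 64, 65, 16],   # Female
         [32, 16, 43, 9]]    # Male

rows = [sum(r) for r in table]
cols = [sum(c) for c in zip(*table)]
n = sum(rows)

x2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = rows[i] * cols[j] / n
        x2 += (observed - expected) ** 2 / expected

df = (len(rows) - 1) * (len(cols) - 1)

# Upper-tail probability for df = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
p = math.erfc(math.sqrt(x2 / 2)) + math.sqrt(2 * x2 / math.pi) * math.exp(-x2 / 2)

cramers_v = math.sqrt((x2 / n) / min(len(rows) - 1, len(cols) - 1))
print(f"X2 = {x2:.3f}, df = {df}, p = {p:.3f}, V = {cramers_v:.3f}")
```

Hair color is a nominal variable, so the Mantel-Haenszel row of the output is not recomputed here.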
Why Are There More Blondes?
• An evolutionary explanation
• A genetic explanation
• A simple chemical explanation
• The limitation of statistics
Log-linear model
• Preferred statistical tool for analyzing multi-way contingency tables
• Use the likelihood ratio test to choose the best model
• Main effects and interactions can be interpreted in a similar manner as in ANOVA
Two-way table:
$$G = 2\left[\sum_{i=1}^{N_r}\sum_{j=1}^{N_c} f_{ij}\ln(f_{ij}) - \sum_{i=1}^{N_r} R_i\ln(R_i) - \sum_{j=1}^{N_c} C_j\ln(C_j) + n\ln(n)\right]$$
Three-way table:
$$G = 2\left[\sum_{i=1}^{N_r}\sum_{j=1}^{N_c}\sum_{k=1}^{N_t} f_{ijk}\ln(f_{ijk}) - \sum_{i=1}^{N_r} R_i\ln(R_i) - \sum_{j=1}^{N_c} C_j\ln(C_j) - \sum_{k=1}^{N_t} T_k\ln(T_k) + 2n\ln(n)\right]$$
Log-linear model
          Disease present     Disease absent
          Loc1     Loc2       Loc1     Loc2
Race1       44       12         38       10
Race2       28       22         20       18

data Disease;
  do Race = 1 to 2;
    do Disease = 1 to 2;
      do Loc = 1 to 2;
        input wt @@;
        output;
      end;
    end;
  end;
datalines;
44 12 38 10
28 22 20 18
;
proc catmod;
  weight wt;
  model Race*Disease*Loc=_response_ / noparm pred=freq;
  loglin Race|Disease|Loc @ 2;
quit;
1. Do two races distribute similarly in the two locations?
2. Do races differ in their susceptibility to the disease?
3. Is the disease more prevalent in one location than the other?
4. Are there significant 3-way interactions (e.g., one race is more susceptible to the disease in one location but less susceptible in the other location)?
Run and explain
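As a warm-up before running the SAS program, the three-way G statistic for complete independence of Race, Disease and Loc can be computed by hand. The Python sketch below is an illustration under stated assumptions: it tests only the complete-independence model (expected count R_i C_j T_k / n²), which is simpler than the model with two-way interactions fitted by the loglin statement above, and the names `sum_xlnx`, `G_formula` and `G_direct` are hypothetical. It evaluates G both from the marginal-total formula and directly as 2 Σ f ln(f / expected).

```python
import math

# counts[race][disease][loc], in the order read by the SAS data step.
counts = [[[44, 12], [38, 10]],
          [[28, 22], [20, 18]]]

cells = [(i, j, k, counts[i][j][k])
         for i in range(2) for j in range(2) for k in range(2)]
n = sum(f for _, _, _, f in cells)
R = [sum(f for i, _, _, f in cells if i == a) for a in range(2)]  # Race totals
C = [sum(f for _, j, _, f in cells if j == a) for a in range(2)]  # Disease totals
T = [sum(f for _, _, k, f in cells if k == a) for a in range(2)]  # Loc totals

def sum_xlnx(values):
    """Sum of v * ln(v) over positive counts."""
    return sum(v * math.log(v) for v in values)

# Marginal-total form of G for complete independence.
G_formula = 2 * (sum_xlnx(f for _, _, _, f in cells)
                 - sum_xlnx(R) - sum_xlnx(C) - sum_xlnx(T)
                 + 2 * n * math.log(n))

# Direct form: 2 * sum f * ln(f / expected), expected = R_i * C_j * T_k / n^2.
G_direct = 2 * sum(f * math.log(f / (R[i] * C[j] * T[k] / n ** 2))
                   for i, j, k, f in cells)

df = 8 - 1 - 3   # cells - 1 - (three main effects with one parameter each)
print(f"G = {G_formula:.3f} (direct: {G_direct:.3f}), df = {df}")
```

The two forms agree, which is a useful check that the marginal totals were assigned to the right variables.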
Log-linear model
data YeastBPS;
  input S1 $ S2 $ S3 $ S4 $ S5 $ S6 $ S7 $ wt;
datalines;
U A C U A A C 212
A A C U A A C 11
A A C U A A U 5
C A C U A A C 8
G A C U A A C 8
U A C U A A U 4
U A C U G A C 2
U A U U A A C 3
U G C U A A C 3
C G C U A A C 1
;
proc catmod;
  weight wt;
  model S1*S2*S3*S5*S7=_response_ / noparm pred=freq;
  loglin S1|S2|S3|S5|S7 @ 3;
run;
Goodness of fit tests
• Deviation of sex ratio from 1:1
• Deviation from Mendelian 3:1 ratio
• Deviation from Mendelian 9:3:3:1 ratio
Spatial Statistics
The spatial distribution of animals and plants has been described as random, contagious and even. We will learn some basic statistical techniques to detect these spatial patterns.
Starfish Bay
Quadrat Sampling
Three Distribution Patterns
Random Even Contagious
Quadrat Sampling
Quadrat N
1 2
2 2
3 3
4 0
5 6
. .
. .
100 1
Mean
Variance
Three Distribution Patterns
Random: σ² = μ     Even: σ² < μ     Contagious: σ² > μ
Three Probability Distributions
• Poisson distribution (random distribution): σ² = μ
• Binomial distribution (even distribution): σ² < μ
• Negative binomial distribution (contagious distribution): σ² > μ
Random Distribution
$$P(x) = \frac{e^{-\mu}\mu^{x}}{x!}$$
x: number of individuals in a quadrat. N(x): number of quadrats. C1*C2: x times N(x).

x      N(x)   C1*C2   P(x)     Expected N(x)   X2
0      14     0       0.1395   13.9457         0.0002
1      27     27      0.2747   27.4730         0.0081
2      27     54      0.2706   27.0609         0.0001
3      18     54      0.1777   17.7700         0.0030
4      9      36      0.0875   8.7517          0.0070
5      4      20      0.0345   3.4482          0.0883
6      1      6       0.0155   1.5505          0.1955
7      0      0
8      0      0
9      0      0
Sum    100    197     1        100             0.3023
Mean   1.97                                    P 0.999486

Var = [14*(0-1.97)^2 + 27*(1-1.97)^2 + 27*(2-1.97)^2 + 18*(3-1.97)^2 + 9*(4-1.97)^2 + 4*(5-1.97)^2 + 1*(6-1.97)^2]/(100-1) = 1.91 < Mean.
Does the distribution deviate significantly from Poisson?
Conclusion: The spatial distribution of the species does not deviate significantly from random distribution.
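The Poisson fit for this species can be reproduced step by step. The sketch below is in Python and makes two assumptions that are inferred from the tabulated numbers rather than stated on the slide: the last class is treated as "6 or more" (its probability is 1 minus the sum of P(0) to P(5)), and the p value uses df = 6, the number of classes minus 1. A stricter analysis would often subtract one more degree of freedom because the mean is estimated from the data.

```python
import math

observed = [14, 27, 27, 18, 9, 4, 1]   # quadrats with 0, 1, ..., 5 and 6+ individuals
n = sum(observed)
mean = sum(x * c for x, c in enumerate(observed)) / n
var = sum(c * (x - mean) ** 2 for x, c in enumerate(observed)) / (n - 1)

# Poisson probabilities for x = 0..5; the last class collects the upper tail.
probs = [math.exp(-mean) * mean ** x / math.factorial(x) for x in range(6)]
probs.append(1 - sum(probs))
expected = [n * p for p in probs]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1   # assumption: matches the P reported on the slide
half = x2 / 2
# Closed-form chi-square upper tail for df = 6.
p = math.exp(-half) * (1 + half + half ** 2 / 2)

print(f"mean = {mean:.2f}, var = {var:.3f}")
print(f"X2 = {x2:.4f}, df = {df}, p = {p:.6f}")
```

With a variance close to the mean and a p value near 1, the counts give no evidence against a random (Poisson) pattern.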
Contagious Distribution
Compare the two columns headed N(x) and N(x)'. The first, N(x), is from the previous slide and fits a Poisson distribution closely. N(x)' is for another species. Is the distribution in this species more contagious or more even?
If you are still not sure, then look at the mean and the variance. The variance is more than twice as large as the mean. Does this indicate a contagious or even distribution? Does the distribution really deviate significantly from the Poisson?

x      N(x)   N(x)'   C1*C2   SS        P(x)    Expected N(x)'   X2
0      14     30      0       126.075   0.129   12.873           22.785
1      27     20      20      22.050    0.264   26.391           1.548
2      27     15      30      0.037     0.271   27.050           5.368
3      18     14      42      12.635    0.185   18.484           1.088
4      9      9       36      34.223    0.095   9.473            0.024
5      4      4       20      34.810    0.039   3.884            0.003
6      1      3       18      46.808    0.018   1.844            0.725
7      0      2       14      49.005
8      0      2       16      70.805
9      0      1       9       48.303
Sum           100     205     444.750   1       100              31.541
Mean   2.05                                                      P 0.0000
Var    4.492

Lump the last four categories to increase n.
Conclusion: The spatial distribution of the species is not random. Because var >> mean, the distribution is contagious.
Even Distribution
Compare again the two columns headed N(x) and N(x)'. The first, N(x), fits closely to a random distribution. Is the distribution in the second species more contagious or more even?
If you are still not sure, then look at the mean and the variance. The variance is smaller than the mean. Does this indicate a contagious or even distribution? Does the distribution really deviate significantly from the Poisson?

x      N(x)   N(x)'   C1*C2   SS       P(x)    Expected N(x)'   X2
0      14     0       0       0        0.016   1.608            1.608
1      27     0       0       0        0.066   6.642            6.642
2      27     0       0       0        0.137   13.716           13.716
3      18     0       0       0        0.189   18.883           18.883
4      9      90      360     1.521    0.195   19.496           254.959
5      4      7       35      5.298    0.161   16.104           5.147
6      1      3       18      10.491   0.111   11.085           5.897
7      0      0       0       0        0.065   6.540            6.540
8      0      0       0       0        0.034   3.376            3.376
9      0      0       0       0        0.025   2.549            2.549
Sum           100     413     17.31    1       100              319.318
Mean   4.13                                                     P 0.0000
Var    0.175

Conclusion: The spatial distribution of the species is not random. Because var << mean, the distribution is even.
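The mean and variance comparisons for the three species can be recomputed from the quadrat counts. This is a minimal sketch in Python; the function name `mean_and_variance` and the species labels are illustrative assumptions. It returns the sample mean and the sample variance (divisor n - 1), whose ratio is the quantity compared with 1 on these slides.

```python
def mean_and_variance(freq):
    """freq[x] = number of quadrats containing x individuals."""
    n = sum(freq)
    mean = sum(x * c for x, c in enumerate(freq)) / n
    var = sum(c * (x - mean) ** 2 for x, c in enumerate(freq)) / (n - 1)
    return mean, var

species = {
    "random example":     [14, 27, 27, 18, 9, 4, 1, 0, 0, 0],
    "contagious example": [30, 20, 15, 14, 9, 4, 3, 2, 2, 1],
    "even example":       [0, 0, 0, 0, 90, 7, 3, 0, 0, 0],
}

results = {name: mean_and_variance(freq) for name, freq in species.items()}
for name, (mean, var) in results.items():
    print(f"{name:18s} mean = {mean:.2f}  var = {var:.3f}  var/mean = {var / mean:.2f}")
```

The ratio alone is only a descriptive guide; whether a deviation from 1 is significant is what the X² comparison against the Poisson expectation addresses.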