Statistics in Biology and Medicine
Richard Tseng
Pharmamatrix Workshop 2010
July 14, 2010
The goal of statistics is to analyze, interpret and present data collected to study systems of interest!!
Outline
• Descriptive statistics
• Inferential statistics
  – Probability theory
  – Hypothesis test
  – Regression
• Some other tools
  – Tools for component analysis
  – Bayesian statistics
• Summary
Descriptive statistics
• Definitions
  – Set: A well-defined collection of objects; each object is called an element
  – Operations on sets: union and intersection
For example, A = {1, 2, 3, 4} and B = {3, 4, 5, 6}:
$A \cap B = \{3, 4\}$
$A \cup B = \{1, 2, 3, 4, 5, 6\}$
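The set example can be checked with a minimal Python sketch. The variable names are illustrative only; `&` and `|` are Python's built-in set intersection and union operators.

```python
# Sets A and B from the example above
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

intersection = A & B  # elements that belong to both A and B
union = A | B         # elements that belong to A or B (or both)

print(sorted(intersection))
print(sorted(union))
```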
• Data type
  – Interval scale
    • For example, body weight (g): 1, 1.5, 2, 3 …
  – Ordinal scale
    • For example, scores for patient responses to treatment:
      Response   Much worse   Bit worse   About same   Bit better   Much better
      Score      -2           -1          0            1            2
  – Nominal scale
    • Categorical data. For example, factors that influence treatments
• How large are the numbers? [1]
  – Mean
  – Median
• How variable are the numbers? [1]
  – Standard deviation (SD)
  – Coefficient of variation (CV = SD/mean)
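As a minimal sketch, these four summaries can be computed with Python's standard `statistics` module. The data are the body weights from the interval-scale example; using the sample SD (n - 1 denominator) is an assumption, since the slides do not say which SD is meant.

```python
import statistics

# Body weights (g) from the interval-scale example
weights = [1.0, 1.5, 2.0, 3.0]

mean = statistics.mean(weights)      # arithmetic mean
median = statistics.median(weights)  # middle value (average of the two middle values here)
sd = statistics.stdev(weights)       # sample standard deviation (n - 1 denominator, assumed)
cv = sd / mean                       # coefficient of variation, CV = SD/mean

print(f"mean={mean}, median={median}, SD={sd:.4f}, CV={cv:.4f}")
```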
Inferential Statistics: Probability theory
• Law of large numbers
  – The mean of the elements in a set converges to the expected value as the number of elements approaches infinity
• Law of small numbers
  – There are not enough small numbers to satisfy all the demands placed on them
• Central limit theorem
  – States conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed
http://www.stat.sc.edu/~west/javahtml/CLT.html
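Both theorems can be illustrated with a small simulation. This is only a sketch: the uniform(0, 1) distribution, the sample sizes, and the seed are arbitrary choices for illustration, not taken from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    """Mean of n independent uniform(0, 1) draws (expected value 0.5)."""
    return statistics.fmean(random.random() for _ in range(n))

# Law of large numbers: the sample mean approaches 0.5 as n grows
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n), 4))

# Central limit theorem: means of n = 30 draws are roughly normal with
# standard deviation sqrt(1/12) / sqrt(n), since Var(uniform(0, 1)) = 1/12
n = 30
means = [sample_mean(n) for _ in range(5_000)]
expected_sd = (1 / 12) ** 0.5 / n ** 0.5
observed_sd = statistics.stdev(means)

# For a normal distribution about 68% of values lie within one SD of the mean
within_one_sd = sum(abs(m - 0.5) <= expected_sd for m in means) / len(means)
print(round(expected_sd, 4), round(observed_sd, 4), round(within_one_sd, 3))
```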
• Probability
  – Meaning
    • Frequency interpretation: A number associated with the rate of occurrence of an event in a well-defined random physical system
    • Bayesian interpretation: A number assigned to any statement whatsoever, even when no random process is involved, as a way to represent the degree to which the statement is supported by the available evidence
• Probability
  – Basic rules
    • Subtraction: $P(A') = 1 - P(A)$
    • Addition: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
    • Multiplication: $P(A \cap B) = P(A)\,P(B \mid A)$
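The three basic rules can be verified on a small sample space. The sketch below assumes a fair six-sided die and two illustrative events; neither appears in the slides.

```python
from fractions import Fraction

omega = set(range(1, 7))  # sample space: faces of a fair die (assumed example)
A = {2, 4, 6}             # event A: the face is even
B = {4, 5, 6}             # event B: the face is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(event, given):
    """Conditional probability P(event | given)."""
    return Fraction(len(event & given), len(given))

# Subtraction rule: P(A') = 1 - P(A)
subtraction_holds = prob(omega - A) == 1 - prob(A)
# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
addition_holds = prob(A | B) == prob(A) + prob(B) - prob(A & B)
# Multiplication rule: P(A and B) = P(A) * P(B | A)
multiplication_holds = prob(A & B) == prob(A) * cond_prob(B, A)

print(subtraction_holds, addition_holds, multiplication_holds)
```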
• Probability
  – Bayesian rule
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
Posterior ∝ Likelihood × Prior
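A worked example may help with the Bayesian rule. The diagnostic-test setting and all three input numbers below are hypothetical, chosen only to show how prior and likelihood combine into a posterior.

```python
# Hypothetical inputs (not from the slides)
prior = 0.01            # P(disease): prior probability
sensitivity = 0.95      # P(positive | disease): likelihood
false_positive = 0.05   # P(positive | no disease)

# Total probability of a positive result, P(positive)
evidence = sensitivity * prior + false_positive * (1 - prior)

# Bayesian rule: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
posterior = sensitivity * prior / evidence

print(round(posterior, 3))
```

Even with a sensitive test, the posterior stays modest here because the assumed prior is small.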
• Probability
  – Maximum entropy principle: The most honest probability distribution to assign to a system is the one that maximizes the entropy of the system subject to all the information at hand.
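To make the principle concrete, the sketch below compares the Shannon entropy of a few candidate distributions over four outcomes. When the only available information is that the probabilities sum to one, the uniform distribution has the largest entropy. The candidate distributions are illustrative assumptions.

```python
import math

def entropy(p):
    """Shannon entropy -sum(p_i ln p_i), with 0 ln 0 treated as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # no outcome is favoured
skewed = [0.7, 0.1, 0.1, 0.1]       # one outcome is favoured
certain = [1.0, 0.0, 0.0, 0.0]      # outcome known in advance

for name, p in (("uniform", uniform), ("skewed", skewed), ("certain", certain)):
    print(name, round(entropy(p), 4))
```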
Inferential Statistics: Regression
• Goal: To correlate the study outcomes of systems of interest and possible factors.
• Model:
  – Linear model: $R = a + bx$
  – Logistic model: $R = \frac{\exp(a + bx)}{1 + \exp(a + bx)}$
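The two models can be written as plain functions. This is a minimal sketch; the function names are illustrative, and the logistic form follows the formula on the slide directly, so it is intended for moderate values of a + bx.

```python
import math

def linear_model(x, a, b):
    """Linear model: R = a + b x."""
    return a + b * x

def logistic_model(x, a, b):
    """Logistic model: R = exp(a + b x) / (1 + exp(a + b x)), always between 0 and 1."""
    z = a + b * x
    return math.exp(z) / (1 + math.exp(z))

print(linear_model(2, a=1, b=3))
print(logistic_model(0, a=0, b=1))
```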
• Optimization
Suppose there are n outcomes $d_i$ of a study
  – Least-squares method: $(\hat{a}, \hat{b}) = \arg\min_{a,b} \sum_i (R_i - d_i)^2$
  – Maximum likelihood estimate: Suppose a likelihood function is given by $L(a, b \mid d)$; then $(\hat{a}, \hat{b}) = \arg\max_{a,b} L(a, b \mid d)$
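For the linear model R = a + bx, the least-squares minimum has a well-known closed form, sketched below. The data points are hypothetical, and `fit_line` is an illustrative helper name.

```python
def fit_line(x, d):
    """Least-squares estimates (a_hat, b_hat) for the linear model R = a + b x."""
    n = len(x)
    x_bar = sum(x) / n
    d_bar = sum(d) / n
    sxy = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sxy / sxx              # slope that minimizes sum_i (R_i - d_i)^2
    a_hat = d_bar - b_hat * x_bar  # intercept: the fitted line passes through the means
    return a_hat, b_hat

# Hypothetical factor values x and study outcomes d
x = [0, 1, 2, 3, 4]
d = [1.1, 2.9, 5.2, 7.1, 8.8]

a_hat, b_hat = fit_line(x, d)
print(round(a_hat, 3), round(b_hat, 3))
```

If the outcomes are assumed to scatter around the model with independent Gaussian errors of constant variance, maximizing the likelihood gives the same estimates as this least-squares fit.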
• Regression tests
  – Residual analysis: $\text{residual}_i = R_i - d_i$
  – Standard errors of regression coefficients
  – Coefficient of determination
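Standard errors of the regression coefficients are based on the residual sum of squares $E$ and the residual variance $S^2$:
$E = \sum_{i=1}^{n} (R_i - d_i)^2$
$S^2 = \frac{1}{n-2} \sum_{i=1}^{n} (R_i - d_i)^2$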
Coefficient of determination: $R^2 = \frac{SD(R)^2}{SD(d)^2}$, where $R = \hat{a} + \hat{b}x$ are the fitted outcomes
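The three regression checks can be sketched for a fitted straight line. The data are hypothetical, and the standard-error expressions are the usual textbook ones for simple linear regression, assumed here rather than taken from the slides.

```python
import math

# Hypothetical factor values x and study outcomes d
x = [0, 1, 2, 3, 4]
d = [1.1, 2.9, 5.2, 7.1, 8.8]
n = len(x)

# Least-squares fit of R = a + b x
x_bar, d_bar = sum(x) / n, sum(d) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b_hat = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d)) / sxx
a_hat = d_bar - b_hat * x_bar

# Residual analysis: residual_i = R_i - d_i
fitted = [a_hat + b_hat * xi for xi in x]
residuals = [ri - di for ri, di in zip(fitted, d)]

# Residual sum of squares and residual variance (n - 2 degrees of freedom)
sse = sum(r ** 2 for r in residuals)
s2 = sse / (n - 2)

# Standard errors of the slope and intercept (textbook formulas, assumed)
se_b = math.sqrt(s2 / sxx)
se_a = math.sqrt(s2 * (1 / n + x_bar ** 2 / sxx))

# Coefficient of determination: share of the variance in d explained by the fit
ss_total = sum((di - d_bar) ** 2 for di in d)
r_squared = 1 - sse / ss_total

print(round(se_a, 4), round(se_b, 4), round(r_squared, 4))
```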
• Example 1: Linear regression
• Example 2: MLE solution of Emax and EC50 in the Michaelis-Menten equation
Likelihood function
MLE solution
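The slide's own likelihood function and solution are not reproduced in the text, so the following is only a sketch under stated assumptions: effects follow E = Emax C / (EC50 + C) with independent Gaussian errors of known, constant SD (`sigma`), the data are synthetic and noise-free (generated with Emax = 100 and EC50 = 2), and EC50 is found by a simple grid search. For a fixed EC50 the model is linear in Emax, so the best Emax has a closed form.

```python
import math

def emax_model(conc, emax, ec50):
    """Michaelis-Menten (Emax) model: E = Emax * C / (EC50 + C)."""
    return emax * conc / (ec50 + conc)

def log_likelihood(emax, ec50, conc, effects, sigma=1.0):
    """Gaussian log-likelihood with constant error SD sigma (an assumption)."""
    n = len(conc)
    sse = sum((emax_model(c, emax, ec50) - d) ** 2 for c, d in zip(conc, effects))
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sse / (2 * sigma ** 2)

def best_emax(ec50, conc, effects):
    """For fixed EC50 the model is linear in Emax, so its MLE is closed-form."""
    f = [c / (ec50 + c) for c in conc]
    return sum(d * fi for d, fi in zip(effects, f)) / sum(fi * fi for fi in f)

# Synthetic, noise-free data generated with Emax = 100 and EC50 = 2
conc = [0.5, 1, 2, 4, 8, 16]
effects = [emax_model(c, 100.0, 2.0) for c in conc]

# Grid search over EC50 from 0.01 to 20 in steps of 0.01
grid = [k * 0.01 for k in range(1, 2001)]
ec50_hat = max(
    grid,
    key=lambda e: log_likelihood(best_emax(e, conc, effects), e, conc, effects),
)
emax_hat = best_emax(ec50_hat, conc, effects)

print(round(emax_hat, 2), round(ec50_hat, 2))
```

A grid search is used only to keep the sketch dependency-free; a numerical optimizer would normally replace it.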
Inferential Statistics: Hypothesis test
• Goal: Test of significance
• Rationale
  – Null hypothesis: H0, outcomes of a study result purely from chance
  – Alternative hypothesis: H1, outcomes of a study are influenced by non-random sources
  – Appropriate model: Normal distribution, t-distribution…
  – Appropriate analysis method
    • P-value: The probability of observing a sample statistic at least as extreme as the test statistic, assuming the null hypothesis is true.
    • Parametric method: t-test, F-test, Chi-square test
    • Non-parametric method: Kolmogorov-Smirnov test, Mann-Whitney test
P-value for significance test:
1. What is the probability of obtaining the test value from a random population? One- or two-tailed?
2. If the p-value is less than the significance level, the null hypothesis is rejected
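The two steps can be sketched for a test statistic that follows a standard normal distribution under the null hypothesis. The normal model, the observed statistic, and the 0.05 significance level are assumptions made for illustration.

```python
from statistics import NormalDist

def p_value(z, tail="two"):
    """P-value of a standard-normal test statistic z under the null hypothesis."""
    cdf = NormalDist().cdf
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))  # extreme in either direction
    if tail == "upper":
        return 1 - cdf(z)             # extreme on the high side only
    if tail == "lower":
        return cdf(z)                 # extreme on the low side only
    raise ValueError("tail must be 'two', 'upper' or 'lower'")

alpha = 0.05   # significance level (assumed)
z = 2.5        # hypothetical observed test statistic
p = p_value(z, tail="two")
reject_null = p < alpha

print(round(p, 4), reject_null)
```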
t-distribution
http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html
Test method: test statistic; null hypothesis
• One-sample t-test: $t = \frac{R - d}{SD/\sqrt{n}}$; the mean of a normally distributed population equals a specified value
• Two-sample F-test: $F = \frac{SD_1^2}{SD_2^2}$; the means of normally distributed populations, all having the same standard deviation, are equal
• Pearson Chi-square test: $\chi^2 = \sum_{i=1}^{n} \frac{(R_i - d_i)^2}{d_i}$; the theoretical population R and the real population d do not differ
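As a minimal sketch, the t and F statistics can be computed from raw samples with the standard library. The two samples and the reference value are hypothetical; the F statistic is taken here as the ratio of the two sample variances, with the larger one on top.

```python
import math
import statistics

def one_sample_t(sample, reference):
    """t = (sample mean - reference value) / (SD / sqrt(n))."""
    n = len(sample)
    return (statistics.mean(sample) - reference) / (statistics.stdev(sample) / math.sqrt(n))

def variance_ratio_f(sample_1, sample_2):
    """F = SD_1^2 / SD_2^2, the ratio of two sample variances."""
    return statistics.variance(sample_1) / statistics.variance(sample_2)

# Hypothetical measurements
group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [4.0, 6.0, 5.0, 5.5, 4.5]

t_stat = one_sample_t(group_a, reference=5.0)
f_stat = variance_ratio_f(group_b, group_a)

print(round(t_stat, 4), round(f_stat, 2))
```

Turning these statistics into p-values requires the t and F distributions, which the standard library does not provide.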
• Parametric test
Two-sample t-test (online calculator: http://www.usablestats.com/calcs/2samplet)
            N    Mean    StDev    SE Mean
Sample 1    15   0.633   0.2162   0.056
Sample 2    15   0.931   0.2021   0.052
Observed difference (Sample 1 - Sample 2): -0.298
Standard Deviation of Difference: 0.0764
Unequal Variances
  DF: 27
  95% Confidence Interval for the Difference: (-0.4548, -0.1412)
  T-Value: -3.9005
  Population 1 ≠ Population 2: P-Value = 0.0006
  Population 1 > Population 2: P-Value = 0.9997
  Population 1 < Population 2: P-Value = 0.0003
Equal Variances
  Pooled Standard Deviation: 0.2093
  Pooled DF: 28
  95% Confidence Interval for the Difference: (-0.4545, -0.1415)
  T-Value: -3.8992
  Population 1 ≠ Population 2: P-Value = 0.0006
  Population 1 > Population 2: P-Value = 0.9997
  Population 1 < Population 2: P-Value = 0.0003
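Most of the calculator output can be reproduced from the summary statistics alone. The sketch below uses the standard Welch (unequal variances) and pooled (equal variances) formulas. The t-values come out near -3.90 and differ from the reported ones in the third decimal, presumably because the means shown are rounded. P-values and confidence intervals are left out because they need the t-distribution.

```python
import math

# Summary statistics reported for the two samples
n1, mean1, sd1 = 15, 0.633, 0.2162
n2, mean2, sd2 = 15, 0.931, 0.2021

diff = mean1 - mean2  # observed difference (Sample 1 - Sample 2)

# Unequal variances (Welch): standard error, t-value, and degrees of freedom
v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
se_welch = math.sqrt(v1 + v2)
t_welch = diff / se_welch
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Equal variances: pooled standard deviation, degrees of freedom, and t-value
pooled_df = n1 + n2 - 2
pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / pooled_df)
t_pooled = diff / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))

print(round(se_welch, 4), round(t_welch, 3), int(df_welch))
print(round(pooled_sd, 4), round(t_pooled, 3), pooled_df)
```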
Some Statistics Worth Knowing
• Tools for component analysis:
  – Principal Component Analysis (PCA): A way to identify patterns in data and express the data so as to highlight their similarities and differences
  – Independent Component Analysis (ICA): A way to separate independent components in data
  – Variable and model selection: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC)
• Bayesian statistics
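For the model-selection criteria named above, a minimal sketch using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of fitted parameters, n the number of observations, and ln L the maximized log-likelihood. The two candidate models and their log-likelihoods are hypothetical.

```python
import math

def aic(k, log_likelihood):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fits to the same n = 100 observations
n = 100
simple = {"k": 2, "log_likelihood": -10.0}
complex_ = {"k": 5, "log_likelihood": -8.5}

for name, m in (("simple", simple), ("complex", complex_)):
    print(name, round(aic(m["k"], m["log_likelihood"]), 2),
          round(bic(m["k"], n, m["log_likelihood"]), 2))
```

In this made-up case both criteria prefer the simpler model, because the gain in log-likelihood does not offset the penalty for three extra parameters.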
Summary
• What is the “right” null hypothesis?
• What is the appropriate distribution function?
• What is the appropriate test statistic?
“Know” your data before you analyze it!!
Information theory based statistics: Bayesian statistics
• Goal: Using Bayesian methods to design studies and analyze data
• Bayesian inference
  – Appropriate distribution functions
  – Appropriate sampling techniques
• Maximum entropy method based inference
  – Appropriate form of entropy
  – Appropriate constraints
Information theory based statistics: Method of maximum entropy
Reference
[1] P. Rowe, Essential Statistics for Pharmaceutical Sciences, Wiley 2007.