Post on 20-Jan-2016
Third training Module, EpiSouth: Multivariate analysis, 15th to 19th June 2009 1/29
Multivariate analysis: Introduction
Third training Module, EpiSouth
Madrid, 15th to 19th June, 2009
Dr D. Hannoun
National Institute of Public Health, Algeria
Introduction: Generality
Stratification allows us to:
• Control confounding
• Reveal effect modification
Limits of stratification:
• Only a small number of confounders can be controlled simultaneously
• The joint effect of confounders cannot be analysed correctly +++
• Classes have to be chosen for quantitative variables
Other tools: MULTIVARIATE ANALYSIS
Assess the reality of the effect of exposure on the disease
Introduction: Joint effect
Example: Hepatitis B SEP
Potential confounders: age (children/adults), immunity (good/deficient)
Joint effect: the effect of two or more factors combined together
Marginal effect: the effect of one confounder alone, without taking the other potential confounders into consideration
Crude effect: 2.0
Control on       Stratum 1   Stratum 2   Stratum 3   Stratum 4   Adjusted measure
Age (F1)         2.0 (F+)    2.0 (F-)                            2.0
Immunity (F2)    2.0 (F+)    2.0 (F-)                            2.0
Factors 1+2      1.0         1.0         1.0         1.0         1.0
                 (F1+/F2+)   (F1-/F2-)   (F1+/F2-)   (F1-/F2+)
Multivariate analysis: Definition
Definition: procedures, at the analysis phase, that:
• Simultaneously adjust for several variables
• Simultaneously control for several potential confounders
Several models:
• Multiple linear regression
• Logistic regression
• Cox regression…
Vocabulary
• Disease Y = dependent variable
• Risk factors = independent variables or predictors
Multivariate analysis: Definition
How:
• Representation of the disease Y as a function of other variables
  • Risk factors
  • Potential confounders
• By modelling the relationship studied
[Diagram: Set of variables → statistical procedures (multivariate analysis) → results]
• The best subset of variables describing the relationship between risk factors (RF) and disease
• Measure of the relationship: parameters
• Description of the disease via an equation
• The best model fitting the data
Multivariate analysis: Definition
Writing the model:
• E(Y | E, X1, X2…, Xp) = f(E, X1, X2…, Xp)
  • Y: a given disease
  • E: exposure
  • X1, X2…: other variables
Example:
• f = linear function
E(Y | E, X1, X2…, Xp) = α + βE + β1X1 + β2X2 + … + βpXp
• β, β1, β2… measure the relation between the exposure E, the other risk factors X1, X2… and the disease Y, each controlling for the other variables
• If β = 0: no relationship between the exposure and the disease
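The linear form above can be illustrated with a minimal Python sketch. The function name `expected_y` and all coefficient values are hypothetical, chosen only for illustration; they are not estimates from any data set used in this module.

```python
def expected_y(e, xs, alpha, beta, betas):
    """Linear model: E(Y | E, X1..Xp) = alpha + beta*E + sum(beta_k * X_k)."""
    return alpha + beta * e + sum(b * x for b, x in zip(betas, xs))

# Hypothetical coefficients (illustration only).
alpha, beta, betas = 10.0, 2.5, [0.5, -1.0]
xs = [30.0, 1.0]  # fixed values of the other variables X1, X2

# With X1 and X2 held fixed, one more unit of exposure shifts E(Y) by beta.
effect = expected_y(1, xs, alpha, beta, betas) - expected_y(0, xs, alpha, beta, betas)
print(effect)

# If beta = 0, the exposure no longer changes the expected value of Y.
no_effect = expected_y(1, xs, alpha, 0.0, betas) - expected_y(0, xs, alpha, 0.0, betas)
print(no_effect)
```

Because the other variables are held at the same values in both calls, the difference isolates the exposure coefficient, which is what "controlling for the other variables" means in this model.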
Multivariate analysis: Definition
For each variable in the model, we obtain the measure of the effect of this variable on the disease, controlling for the other variables
The adjusted measures of association we obtain from multivariable analysis are direct effects and not total effects
Multivariate analysis: Advantages
Advantages/techniques:
• Estimation of effects while controlling for more than one confounder simultaneously
• Study of the joint effect of several risk factors and quantification of the intensity of interaction
• Possibility of including continuous risk factors
• Study of the dose-response relationship: of interest for causality and for the specific risk at intermediate levels
• Study of the trend in the effect according to the level of the risk factor
• Prediction of the disease
Multivariate analysis: Steps
Several steps:
• Choose the appropriate model to summarize the data
• Define the variable selection strategy
• Estimate the model coefficients
  • Method of least squares (LS) estimation
  • Method of maximum likelihood (ML) estimation
• Write and interpret the model
• Assess the adequacy of the model
Multivariate analysis: Choice of the model
Depends on the form of the function f:
1. Nature of the outcome variable
• Continuous outcome → Multiple linear regression
• Categorical outcome → Logistic regression (LR)
• Time-to-event outcome → Cox regression
2. Nature of the joint effect
• Additive → Multiple linear regression
• Multiplicative → Logistic regression, Cox regression
3. Form of the distribution of the variables
• Normally distributed…
4. Assumptions
Multivariate analysis: Variables selection
The final model depends on which variables are selected:
• At the study design:
  • Decide which variables to adjust or control for
  • How the variables will be coded
  • Which interactions should be considered
• At the analytical phase:
  • Which variables must be entered in the model
  • Which variables must be forced
  • P value
• E.g.: 7 variables coded 0/1 with all interaction terms → 2^7 = 128 coefficients to estimate in the final model!
Necessity of a STRATEGY
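The count of 128 can be checked with a short Python sketch. It assumes that the total includes the intercept and one coefficient for every main effect and every interaction of every order; the variable names `X1` to `X7` are placeholders.

```python
from itertools import combinations
from math import comb

variables = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]  # 7 binary (0/1) variables

# One term per non-empty subset of variables: main effects (size 1),
# two-way interactions (size 2), ... up to the single seven-way interaction.
terms = [subset
         for size in range(1, len(variables) + 1)
         for subset in combinations(variables, size)]

n_coefficients = 1 + len(terms)  # +1 for the intercept
print(n_coefficients)

# Breakdown by order: number of main effects, two-way terms, three-way terms...
by_order = [comb(len(variables), size) for size in range(1, len(variables) + 1)]
print(by_order)
```

Each additional binary variable doubles the number of coefficients in such a saturated model, which is why a selection strategy is needed.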
Multivariate analysis: Parameters estimation
Purpose of multivariate analysis:
• To obtain a measure of effect that describes the exposure-outcome relationship adjusted for relevant extraneous factors
Parameter estimation depends on the model used:
• In MLR → regression coefficients β
• In LR → odds ratio
• In Cox → hazard ratio
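In multiple linear regression the coefficient β is read directly. In logistic and Cox regression the coefficients are estimated on a logarithmic scale, so the odds ratio or hazard ratio is obtained as exp(β). The sketch below shows this conversion; the function name and the coefficient value are hypothetical.

```python
import math

def ratio_from_coefficient(beta):
    """Exponentiate a log-scale coefficient to get a ratio measure (OR or HR)."""
    return math.exp(beta)

# Hypothetical coefficient for a binary exposure in a logistic model,
# chosen so that the corresponding odds ratio is 2.
beta_logistic = math.log(2.0)
odds_ratio = ratio_from_coefficient(beta_logistic)
print(round(odds_ratio, 6))

# A coefficient of 0 corresponds to a ratio of 1: no association.
print(ratio_from_coefficient(0.0))
```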
Multivariate analysis: Model adequacy
Verify the adequacy of the model:
• Capacity of the model to represent correctly the value of the disease given the values of a subset of risk factors
Steps:
• Adequacy of the model:
  • Graphical methods +++
  • Statistical tests
• Interpreting the tests: beware of outliers
• The best model is not necessarily the best statistical model: choose the model that gives the best understanding of the disease
The fitted model can be used for prediction
MLR: Introduction
= multivariate model used for continuous data
Principle:
• Describe one variable as a linear function of one or more other variables
• Form: E(Y) = f(E, X1, X2…), with f = linear function
• E(Y | X) = α + βX: simple linear regression model
• E(Y | X1, …, Xp) = α + β1X1 + … + βpXp: multiple linear regression model
[Figure: scatter plot of the disease (vertical axis) with the fitted line E(Y) = α + βX]
MLR: Introduction
In simple linear regression:
• Statistical model: Y = α + βX + ε
• Fitted line: Ŷ = α̂ + β̂X
[Figure: scatter plot of the incidence rate of ARI (vertical axis) against atmospheric pollution, density of PM10 (horizontal axis), with the fitted straight line]
• β = slope of the straight line
  • Estimates the change in Y for a one-unit change in X
  • E.g. when atmospheric pollution increases by 1%, the incidence rate of ARI increases by 2 cases per 100,000 persons
• α = intercept, which corresponds to the value of the disease when the exposure equals 0, or more generally describes the baseline
• ε = error term in the model
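A minimal Python sketch of fitting the simple linear regression line is given below. The data are invented for illustration and lie exactly on a line with slope 2, to mirror the "2 cases per 100,000" example; the function name `fit_simple_line` is hypothetical.

```python
def fit_simple_line(x, y):
    """Least-squares estimates (alpha_hat, beta_hat) for Y = alpha + beta*X + error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta_hat = sxy / sxx                    # slope: change in Y per unit of X
    alpha_hat = mean_y - beta_hat * mean_x  # intercept: value of Y when X = 0
    return alpha_hat, beta_hat

# Invented data: pollution level X and ARI incidence rate Y (per 100,000).
pollution = [10.0, 20.0, 30.0, 40.0]
ari_rate = [25.0, 45.0, 65.0, 85.0]  # exactly 5 + 2 * pollution, no error term

alpha_hat, beta_hat = fit_simple_line(pollution, ari_rate)
print(alpha_hat, beta_hat)
```

With real data the points would scatter around the line and the error term ε would account for the difference between each observed Y and its fitted value.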
MLR: Introduction
In multiple linear regression:
• Statistical model: Y = α + β1X1 + β2X2 + … + βpXp + ε
• E.g.:
  • Variation of the incidence rate of ARI with atmospheric pollution
  • Potential confounders: age and smoking
  • X1 = density of PM10
  • X2 = age of the person
  • X3 = smoking
ARI inc. rate = α + β1 (density of PM10) + β2 (age) + β3 (smoking) + ε
MLR: Introduction
In multiple linear regression:
• β1 = slope along the X1 dimension: variation of ARI for a change of one unit of PM10 density, controlling for the other variables
• β2 = slope along the X2 dimension: variation of ARI for a change of one unit of age, controlling for the other variables
• β3 = slope along the X3 dimension: variation of ARI for a change of one unit of smoking (person/year), controlling for the other variables
• α = intercept, value of the disease when there is no risk factor…
• ε = error term in the model
ARI inc. rate = α + β1 (density of PM10) + β2 (age) + β3 (smoking) + ε
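The sketch below fits this three-variable model by least squares in plain Python. Everything in it is an assumption made for illustration: the data are invented and generated without error from the coefficients α = 3, β1 = 2, β2 = 0.5, β3 = 4, and `fit_linear_model` is a hypothetical helper that solves the normal equations rather than a call to a statistical package.

```python
def fit_linear_model(columns, y):
    """Least-squares coefficients [alpha, beta1, ..., betap] via the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]  # 1.0 = intercept
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * w for v, w in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Invented data for 8 hypothetical persons or areas.
pm10 = [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0]
age = [20.0, 30.0, 25.0, 40.0, 35.0, 50.0, 45.0, 60.0]
smoking = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0]

# Outcome generated without error from alpha=3, beta1=2, beta2=0.5, beta3=4.
ari_rate = [3.0 + 2.0 * p + 0.5 * a + 4.0 * s for p, a, s in zip(pm10, age, smoking)]

coefs = fit_linear_model([pm10, age, smoking], ari_rate)
print([round(c, 6) for c in coefs])  # alpha, beta1 (PM10), beta2 (age), beta3 (smoking)
```

Each estimated slope is the change in the ARI rate for one unit of its variable with the two other variables held fixed.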
MLR: Parameters estimation
Method used: least squares estimation
Principle:
• Identify the best straight line, i.e. the one that minimizes the sum of squared residuals (SSR)
[Figure: least-squares line fit; for each observation (Xi, Yi), the residual is the vertical distance to the fitted point (Xi, Ŷi)]
SSR = Σ(Yi - Ŷi)² = Σ(Yi - α - βXi)²
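For the simple linear regression case, minimizing the SSR has a closed-form solution. The derivation below is the standard one, written in LaTeX; the bars denote sample means and the hats denote estimates, a notation assumed here rather than taken from the slides.

```latex
\begin{aligned}
\mathrm{SSR}(\alpha,\beta) &= \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2 \\
\frac{\partial\,\mathrm{SSR}}{\partial \alpha} &= -2\sum_{i=1}^{n} (Y_i - \alpha - \beta X_i) = 0 \\
\frac{\partial\,\mathrm{SSR}}{\partial \beta} &= -2\sum_{i=1}^{n} X_i\,(Y_i - \alpha - \beta X_i) = 0 \\
\hat{\beta} &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad \hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{X}
\end{aligned}
```

Setting both partial derivatives to zero gives two equations in two unknowns; solving them yields the slope and intercept of the least-squares line.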
MLR: Variables selection
Decide which variables to control for:
1. Prediction of the risk of the disease
• We do not have to take all confounders into consideration, only the best group of predictors
• Importance in terms of public health +++
• E.g.: incidence rate of ARI – Exposure: atmospheric pollution – Predictors: age and smoking
2. Estimation of the relation between exposure and disease
• We have to take ALL confounders into consideration to control confounding
• Importance in terms of causal association
• E.g.: incidence rate of ARI – Exposure: atmospheric pollution – Predictors: age, smoking, breastfeeding, ROR…
MLR: Variables selection
Which variables must be entered in the initial model: 2 situations
• Some are obligatory in the model because they are recognized risk factors: the exposure
• Other variables: those with a significant relationship with the disease in the bivariate analysis
→ All of these are candidate variables for modelling
MLR: Variables selection
Which interactions should be considered:
• The problem of interaction must be approached in a manner which facilitates understanding of the nature of the causal effect
• Statistical considerations should serve rather than determine our objectives
• Addition of an interaction term:
  • Adds another regression coefficient to the equation
  • Makes the model more difficult to interpret
• For a given interaction, you must ensure that the variables which make up the interaction term are themselves contained in the model
MLR: Variables selection
Example: incidence rate of ARI
1. Model WITH an interaction term:
• Interaction BETWEEN smoking and age: β2,3X2X3
ARI inc. rate = α + β1 (density of PM10) + β2 (age) + β3 (smoking) + β2,3 (age × smoking) + β4 (breastfeeding) + β5 (ROR) + ε
ARI inc. rate = α + β1 (density of PM10) + β2 (age) + β2,3 (age × smoking) + ε
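What the interaction term does can be seen in a small Python sketch. The coefficients are hypothetical and not estimated from data, and the breastfeeding and ROR terms are left out to keep the example short. The model keeps both age and smoking as main effects alongside their product.

```python
def ari_rate(pm10, age, smoking, coef):
    """ARI rate from a linear model with an age x smoking interaction term."""
    return (coef["alpha"]
            + coef["pm10"] * pm10
            + coef["age"] * age
            + coef["smoking"] * smoking
            + coef["age_x_smoking"] * age * smoking)

# Hypothetical coefficients (illustration only, not estimated from data).
coef = {"alpha": 5.0, "pm10": 2.0, "age": 0.5, "smoking": 4.0, "age_x_smoking": 0.25}

def smoking_effect(age):
    """Change in the ARI rate for one more unit of smoking at a given age."""
    return ari_rate(30, age, 1, coef) - ari_rate(30, age, 0, coef)

print(smoking_effect(20))  # equals beta3 + beta2,3 * 20
print(smoking_effect(60))  # equals beta3 + beta2,3 * 60
```

With an interaction term, the effect of smoking is no longer a single number: it is β3 + β2,3 × age, so it has to be reported at specific ages.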
MLR: Variables selection
Which variables must be entered in the initial model: 2 situations
• …
• How the variables must be entered in the initial model: a strategy must be defined
  • Start with ALL variables → Backward elimination
  • Start with NO variable → Forward selection
  • Mix of the two previous methods → Stepwise selection
MLR: Variables selection
At the study design:
• Variables considered: sex, age, pollution, ROR, smoking, breastfeeding, region, profession, age × smoking
First part of the analytical phase: bivariate analysis and stratification
• Significant variables: pollution, age, smoking, breastfeeding, ROR
• Variables that must be forced: pollution
→ Candidate variables for modelling: the largest possible model
Second part of the analytical phase: multivariate analysis
• Define how the variables are entered in the model: backward, forward, stepwise
• Rules
Final model: pollution, age, smoking
MLR: Backward strategy
Principle:
• Begin with ALL candidate variables in the model → largest POSSIBLE model
• At each step, drop one variable; the choice of this variable is based on statistical rules → the variable dropped is one that is not significant
• Continue until no more variables can be dropped, meaning all remaining variables are relevant
Advantages: evaluates the joint confounding effects of all variables
Limits: with many risk factors, some strata may provide no information (sparse cells)
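The backward procedure can be sketched in Python as follows. This is one possible implementation under stated assumptions, not the method of any particular statistical package: the "statistical rule" is a partial F statistic with a cut-off of 4, a crude stand-in for a 5% significance test; the exposure is forced into the model; the data are invented, with `region` built to have no effect.

```python
from itertools import product

def fit(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * w for v, w in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def rss(columns, y):
    """Residual sum of squares of the least-squares fit."""
    coefs = fit(columns, y)
    fitted = [coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], columns))
              for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def backward_elimination(data, y, forced=(), f_to_stay=4.0):
    """Start from all variables; repeatedly drop the weakest non-forced one."""
    kept = list(data)
    while True:
        droppable = [v for v in kept if v not in forced]
        if not droppable:
            return kept
        rss_full = rss([data[v] for v in kept], y)
        mse = rss_full / (len(y) - len(kept) - 1)
        # Partial F statistic: increase in RSS when the variable is removed.
        f_stats = {v: (rss([data[w] for w in kept if w != v], y) - rss_full) / mse
                   for v in droppable}
        weakest = min(f_stats, key=f_stats.get)
        if f_stats[weakest] >= f_to_stay:
            return kept  # every remaining variable passes the rule
        kept.remove(weakest)

# Invented balanced data set: 16 rows, three binary variables plus an error term.
design = list(product([0, 1], repeat=4))
data = {"pollution": [float(r[0]) for r in design],
        "age": [float(r[1]) for r in design],
        "region": [float(r[2]) for r in design]}
error = [0.5 if r[3] else -0.5 for r in design]
y = [1.0 + 2.0 * p + 1.0 * a + e
     for p, a, e in zip(data["pollution"], data["age"], error)]

selected = backward_elimination(data, y, forced=("pollution",))
print(selected)
```

In this invented example `region` is dropped at the first step and `age` stays, because removing `age` increases the residual sum of squares well beyond the cut-off.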
MLR: Forward strategy
Principle:
• Begin with NO variable in the model → smallest POSSIBLE model
• At each step, add one variable to the model and keep it; the choice of this variable is based on statistical rules
  • Start with the variable that has the biggest change-in-estimate impact when evaluated individually
  • Keep the variable which meaningfully changes the adjusted estimate
• Continue until no other variables can be added
Advantages: avoids the initial sparse-cell problem of the backward approach
Limits: does not evaluate the joint confounding effects of many variables
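A change-in-estimate version of the forward procedure can be sketched in Python as follows. The 10% threshold for a "meaningful" change is an assumption (a common rule of thumb, not stated on the slide), and the data are invented: `age` is built to confound the exposure, while `other` is unrelated to both the exposure and the outcome.

```python
def fit(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * w for v, w in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def exposure_estimate(exposure, adjusters, y):
    """Coefficient of the exposure, adjusted for the given variables."""
    return fit([exposure] + adjusters, y)[1]

def forward_selection(exposure, candidates, y, min_change=0.10):
    """Add, one at a time, the variable that changes the exposure estimate most."""
    selected = []
    current = exposure_estimate(exposure, [], y)  # crude estimate
    while True:
        remaining = [v for v in candidates if v not in selected]
        if not remaining:
            break
        changes = {}
        for v in remaining:
            adjusted = exposure_estimate(
                exposure, [candidates[w] for w in selected + [v]], y)
            changes[v] = (abs(adjusted - current) / abs(current), adjusted)
        best = max(changes, key=lambda v: changes[v][0])
        if changes[best][0] < min_change:
            break  # no remaining variable changes the estimate meaningfully
        selected.append(best)
        current = changes[best][1]
    return selected, current

# Invented data: 8 observations, no error term.
pollution = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
candidates = {"age": [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0],
              "other": [-1.0, 0.0, 1.0, 0.0, 0.0, -1.0, 0.0, 1.0]}
y = [1.0 + 2.0 * p + 4.0 * a for p, a in zip(pollution, candidates["age"])]

selected, estimate = forward_selection(pollution, candidates, y)
print(selected, round(estimate, 6))
```

In this invented example the crude coefficient of pollution is 4; adjusting for `age` halves it to 2, so `age` is added, while `other` leaves the estimate unchanged and is never selected.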
MLR: Conclusion
Goal of modelling: to obtain
• The smallest subset of relevant risk factors that describes the disease
• With the best understanding of the disease
As for stratification, you must:
• First, identify significant interaction terms: do not forget to verify that the variables which make up the interaction term are contained in the model
  → statistical significance + biological consideration
• Secondly, assess the confounding effect → no statistical test
• Retain the significant risk factors, confounders and interaction terms that help us to understand and explain the occurrence of the disease
Conclusion
Multivariate analysis allows us to control and adjust the effect of exposure for several extraneous factors simultaneously
The adjusted measures of association are direct effects and not total effects
Multivariate analysis is a useful tool, but it can be very dangerous if the strategy has not been defined beforehand:
• Purpose of the study
• Method of variable selection
• Assumptions
• Adequacy of the model…
Conclusion
As with the stratification method, statistical considerations should serve rather than determine our objectives
Multivariate analysis requires a computer to run the statistical programme
The choice of the model depends on many factors: outcome variable, form of the relationship between exposure and disease…