Introduction to Multilevel Models: Getting started with your own data
University of Bristol, Monday 31st March to Friday 4th April 2008
Course handouts
Resources
Centre for Multilevel Modelling http://www.mlwin.com/
Provides access to general information about multilevel modelling and MLwiN.
Includes the Multilevel newsletter (free electronic publication):
http://www.mlwin.com/publref/newsletters.html
Email discussion group: www.jiscmail.ac.uk/multilevel/
LEMMA will include a training repository: http://www.ncrm.ac.uk/nodes/lemma/about.php
1.0 Introductions
Participants introduce themselves: Who are you? Where are you from?
2.00 Multilevel Data Structures
Multilevel modelling is designed to explore and analyse data that come from populations which have a complex structure.
In any complex structure we can identify atomic units. These are the units at the lowest level of the system. The response or y variable is measured on the atomic units.
Often, but not always, these atomic units are individuals.
Individuals are then grouped into higher level units, for example, schools. By convention we then say that students are at level 1 and schools are at level 2 in our structure.
2.01 Levels, classifications and units
A level (e.g. pupils, schools, households, areas) is made up of a number of individual units (e.g. particular pupils, schools etc.).
The terms classification and level can be used somewhat interchangeably, but the term level implies a nested hierarchical relationship of units (in which each lower-level unit nests in one, and only one, higher-level unit) whereas classification does not.
2.02 Two-level hierarchical structures
Unit diagram (one node per unit), students within schools:
School    Sc1           Sc2       Sc3           Sc4
Students  St1 St2 St3   St1 St2   St1 St2 St3   St1 St2 St3 St4
Classification diagram (one node per classification): Student within School
Students within a school are more alike than a random sample of students. This is the ‘clustering’ effect of schools.
2.03 Data frame for student within school example
Classifications or levels: Student i, School j
Response: Student exam score_ij
Explanatory variables: Student previous examination score_ij, Student gender_ij, School type_j

Student i  School j  Exam score_ij  Previous exam score_ij  Gender_ij  School type_j
1          1         75             56                      M          State
2          1         71             45                      M          State
3          1         91             72                      F          State
1          2         68             49                      F          Private
2          2         37             36                      M          Private
3          2         67             56                      M          Private
1          3         82             76                      F          State
Questions we can ask of these data:
1 Do males make greater progress than females?
2* Does the gender gap vary across schools?
3* Are males more or less variable in their progress than females?
4* What is the between-school variation in students' progress?
5* Is School X (that is, a specific school) different from other schools in the sample in its effect?
6* Are schools more variable in their progress for students with low prior attainment?
7 Do students make more progress in private than in state schools?
8* Are students in state schools less variable in their progress?
* Requires a multilevel model to answer
2.04 Variables, levels, fixed and random classifications
Classification diagram: Student within School
Given that school type (state or private) classifies schools, we could redraw our classification diagram as:
Student within School within School type
Do we now have a 3-level multilevel model?
We can divide classifications into two types : fixed classifications and random classifications. The distinction has important implications for how we handle the classifying variable in a statistical analysis.
For a classification to be a level in a multilevel model it must be a random classification. It turns out that school type is not a random classification.
2.05 Random and Fixed Classifications
A classification is a random classification if its units can be regarded as a random sample from a wider population of units. For example, the students and schools in our example are a random sample from a wider population of students and schools. However, school type or, indeed, student gender has a small fixed number of categories. There is no wider population of school types or genders to sample from.
Traditional or single level statistical models have only one random classification which classifies the units on which measurements are made, typically people. Multilevel models have more than one random classification.
2.06 Other examples of two-level hierarchical structures
Repeated measures, panel data
Multivariate response models
2.07 Repeated Measures data
Classifications or levels: Student i, School j
Response: Student exam score_ij
Explanatory variables: Student previous examination score_ij, Student gender_ij, School type_j

Student i  School j  Exam score_ij  Previous exam score_ij  Gender_ij  School type_j
1          1         75             56                      M          State
2          1         71             45                      M          State
3          1         91             72                      F          State
1          2         68             49                      F          Private
2          2         37             36                      M          Private
3          2         67             56                      M          Private
1          3         82             76                      F          State

In the previous example we have measures on an individual at two occasions: a current and a prior test score. We can analyse change (that is, progress) by specifying current attainment as the response and prior attainment as a predictor variable.
However, when there are measurements on more than two occasions there are advantages in treating occasion as a level nested within individuals. Such a two-level strict hierarchical structure is known as a repeated measures or panel design.
2.08 Classification, unit diagrams and data frames for repeated measures structures
Unit diagram:
Person                P1            P2      P3 .....
Measurement occasion  O1 O2 O3 O4   O1 O2   O1 O2 O3
Classification diagram: Measurement occasion within Person

Long form, 1 row per occasion (required by MLwiN)
Classifications or levels: Occasion i, Person j. Response: Height_ij. Explanatory variables: Age_ij, Gender_j.

Occasion i  Person j  Height_ij  Age_ij  Gender_j
1           1         75         5       F
2           1         85         6       F
3           1         95         7       F
1           2         82         7       M
2           2         91         8       M
1           3         88         5       F
2           3         93         6       F
3           3         96         7       F

Wide form, 1 row per individual (* = missing)

Person  H-Occ1  H-Occ2  H-Occ3  Age-Occ1  Age-Occ2  Age-Occ3  Gender
1       75      85      95      5         6         7         F
2       82      91      *       7         8         *         M
3       88      93      96      5         6         7         F
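The wide and long layouts hold the same information, so moving between them is mechanical. The following is a minimal Python sketch of that reshaping using the three people shown in this section; the field names (`person`, `height`, `age`, `gender`) and the function `wide_to_long` are illustrative choices, not part of MLwiN.

```python
# Wide form: one row per person; None marks a missing occasion (shown as * in the table).
wide = [
    {"person": 1, "height": [75, 85, 95], "age": [5, 6, 7], "gender": "F"},
    {"person": 2, "height": [82, 91, None], "age": [7, 8, None], "gender": "M"},
    {"person": 3, "height": [88, 93, 96], "age": [5, 6, 7], "gender": "F"},
]


def wide_to_long(rows):
    """Return one record per (occasion, person), skipping missing occasions."""
    long_rows = []
    for row in rows:
        pairs = zip(row["height"], row["age"])
        for occasion, (height, age) in enumerate(pairs, start=1):
            if height is None or age is None:
                continue  # unbalanced data: this person simply has no row here
            long_rows.append({
                "occasion": occasion,
                "person": row["person"],
                "height": height,
                "age": age,
                "gender": row["gender"],  # time-invariant, repeated on every row
            })
    return long_rows


long_form = wide_to_long(wide)
for record in long_form:
    print(record)
```

Person 2 contributes only two rows, which is how the long form accommodates an unbalanced design without placeholder values.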
2.09 Repeated Measures (continued)
Atomic units are occasions, not individuals.
We model between-individual variation in growth (growth curves).
In a multilevel repeated measures model data need not be balanced or equally spaced.
Explanatory variables can be time invariant (gender) or time varying (age)
2.10 Multivariate responses within individuals
Sometimes we may wish to model not a single response (y-variable) but several. For example, we may wish to consider English and Mathematics exam scores for students jointly, as two possibly related responses. We can regard this as a multilevel model with subjects (English and Maths) nested within students.
Unit diagram:
Student  St1   St2   St3   St4…
Subject  E M   E     E M   M
Classification diagram: Subject within Student
A multilevel multivariate response model can estimate the covariance (or correlation) matrix between responses and efficiently handle missing data.
2.11 Data frames for multivariate response models
Wide form, 1 row per individual (* = missing)

Student  English Score  Maths Score  Gender
1        95             75           M
2        55             *            F
3        65             40           F
4        *              75           M

Long form, 1 row per measurement (required by MLwiN)
Classifications or levels: Exam Subject i, Student j. Response: Exam Score_ij. Explanatory variables: Eng-Indic_ij, Math-Indic_ij, Gender-Eng_j, Gender-Math_j.

Subject  Exam Subject i  Student j  Exam Score_ij  Eng-Indic_ij  Math-Indic_ij  Gender-Eng_j  Gender-Math_j
Eng      1               1          95             1             0              M             0
Math     2               1          75             0             1              0             M
Eng      1               2          55             1             0              F             0
Eng      1               3          65             1             0              F             0
Math     2               3          40             0             1              0             F
Math     2               4          75             0             1              0             M
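The stacking of two responses into one column with subject indicators can be sketched in a few lines of Python. This is only an illustration of the layout described in this section, using the four students from the wide table; the field names and the function `stack_responses` are hypothetical.

```python
# Wide form: one row per student; None marks a missing score (shown as * in the table).
wide = [
    {"student": 1, "english": 95, "maths": 75, "gender": "M"},
    {"student": 2, "english": 55, "maths": None, "gender": "F"},
    {"student": 3, "english": 65, "maths": 40, "gender": "F"},
    {"student": 4, "english": None, "maths": 75, "gender": "M"},
]


def stack_responses(rows):
    """One row per observed response, with indicator columns for each subject."""
    stacked = []
    for row in rows:
        for subject in ("english", "maths"):
            score = row[subject]
            if score is None:
                continue  # a missing response just produces no row
            is_english = 1 if subject == "english" else 0
            stacked.append({
                "subject": subject,
                "student": row["student"],
                "score": score,
                "eng_indic": is_english,
                "math_indic": 1 - is_english,
                # gender interacted with each indicator: 0 when the row is the other subject
                "gender_eng": row["gender"] if is_english else 0,
                "gender_math": row["gender"] if not is_english else 0,
            })
    return stacked


stacked = stack_responses(wide)
for record in stacked:
    print(record)
```

Students 2 and 4 each contribute one row, so no student is dropped because one of the two scores is missing.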
2.12 Three level structures
Classification diagram: Student within Class within School
Unit diagram (students:classes:schools):
School   Sc1 Sc2 Sc3
Class    C1 C2 C1 C2
Student  St1 St2 St3 | St1 St2 | St1 St2 St3 | St1 St2 St3 St4
Multilevel models allow a different number of students in each class and a different number of classes in each school.
Bennett (1976) used a single-level model to assess whether teaching styles affected test scores for reading and mathematics at age 11. The results prompted a call for a return to traditional or formal teaching methods. This analysis did not take account of the dependency structures in the data: students in a class are more similar than a random sample of students, and likewise classes in a school. Subsequent multilevel analysis found the effects of traditional methods to be non-significant.
2.13 Data Frame for 3 level model, students: classes: schools
Classifications or levels: Student i, Class j, School k
Response: Current exam score_ijk
Explanatory variables: Student previous examination score_ijk, Student gender_ijk, Class teaching style_jk, School type_k

Student i  Class j  School k  Exam score_ijk  Previous exam score_ijk  Gender_ijk  Teaching style_jk  School type_k
1          1        1         75              56                       M           Formal             State
2          1        1         71              45                       M           Formal             State
3          1        1         91              72                       F           Formal             State
1          2        1         68              49                       F           Informal           State
2          2        1         37              36                       M           Informal           State
1          1        2         67              56                       M           Formal             Private
2          1        2         82              76                       F           Formal             Private
3          1        2         85              50                       F           Formal             Private
1          1        3         54              39                       M           Informal           State
2.14 Other three level structures
• Multivariate responses on four health behaviours (drinking, smoking, exercise and diet) on individuals within communities. Such a design allows us to assess how correlated the behaviours are at the individual level and at the community level, taking account of other characteristics at both levels. We can also assess the extent to which there are unhealthy communities as well as unhealthy individuals.
• Repeated measures within students within schools. This allows us to look at how learning trajectories vary across students and schools.
• A repeated cross-sectional design with students:cohorts:schools.
2.15 Repeated cross-sectional design
Unit diagram:
School   Sc1                      Sc2                      Sc3....
Cohort   1990        1991         1990        1991         1990        1991
Student  St1 St2.... St1 St2..... St1 St2...  St1 St2...   St1 St2..... St1 St2...
Classification diagram: Student within Cohort within School
Above are unit and classification diagrams where we have Exam scores for groups of students who entered school in 1990 and a further group who entered in 1991. The model can be extended to handle an arbitrary number of cohorts. In a multilevel sense we do not have 2 cohort units but 2S cohort units where S is the number of schools.
2.16 Four level hierarchical structures
By now you should be getting a feel for how basic random classifications such as people, time, multivariate responses, institutions, families and areas can be combined within a multilevel framework to model a wide variety of nested population structures. Here are some examples of 4-level nested structures:
• students within classes within schools within LEAs
• multivariate responses within repeated measures within students within schools
• repeated measures within patients within doctors within hospitals
• people within households within postcode sectors within regions
As a final example of a strict hierarchy we will consider a doubly nested repeated measures structure.
2.17 Repeated measures within students within cohorts within schools
Unit diagram:
School     Sc1                                Sc2...
Cohort     1990             1991              1990             1991
Student    St1 St2...       St1 St2..         St1 St2..        St1 St2..
Msmnt occ  O1 O2 O1 O2      O1 O2 O1 O2       O1 O2 O1 O2      O1 O2 O1 O2
Classification diagram: Measurement occasion within Student within Cohort within School
Cohorts are now repeated measures on schools and tell us about stability of school effects over time
Measurement occasions are repeated measures on students and can tell us about students’ learning trajectories.
2.18 Non-hierarchical structures
So far all our examples have been exact nesting with lower level units nested in one and only one higher-level unit.
That is we have been dealing with strict hierarchies. But social reality can be more complicated than that.
In fact we have found that we need two non-hierarchical structures which in combination with strict hierarchies have been able to deal with all the different types of designs, realities and research questions that we have met
•Cross-classified structures
•Multiple membership structures
School S1 S2 S3 S4
Pupils P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12
Area A1 A2 A3
In this structure schools are not nested within areas. For example:
• Pupils 2 and 3 attend school 1 but come from different areas
• Pupils 6 and 10 come from the same area but attend different schools
2.19 Cross-classified Model
Schools are not nested within areas and areas are not nested within schools. School and area are cross-classified.
Classification diagram: Pupil within School, and Pupil within Area (School and Area are not linked)
2.20 Tabulation of students by school and area to reveal a cross-classified structure
Cross-classified structure:

          Area 1   Area 2   Area 3
School 1  P1,P3    P2
School 2  P5       P4
School 3           P6,P7    P8
School 4           P10      P9,P11,P12

Elements in a row span multiple columns, and elements in a column span multiple rows.

For comparison, a hierarchical structure (schools nested within areas):
Area    A1 A2 A3
School  S1 S2 S3 S4
Pupils  P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12

School 1  P1,P2,P3
School 2  P4,P5
School 3  P6,P7,P8
School 4  P9,P10,P11,P12

All elements in a row lie in a single column (each school's pupils fall under one area).
2.21 Data frame for pupils in a cross-classification of schools and areas
Classifications or levels: Student i, School j, Area k
Response: Exam score_i(jk)
Explanatory variables: Student gender_i(jk), Area IMD_k, School type_j

Student i  School j  Area k  Exam score_i(jk)  Gender_i(jk)  Area IMD_k  School type_j
1          1         1       75                M             24          State
2          1         2       71                F             46          State
3          1         1       91                F             24          State
4          2         2       68                M             46          Private
5          2         1       37                M             24          Private
6          3         2       67                F             46          Private
7          3         2       82                F             46          State
8          3         3       85                M             11          State
9          4         3       54                M             11          Private
10         4         2       91                M             46          Private
11         4         3       43                F             11          Private
12         4         3       66                M             11          Private
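Whether two classifications are nested or crossed can be checked directly from the identifier columns of a data frame like this one. The sketch below is a minimal Python check, assuming the (pupil, school, area) codes listed in this section; the helper `is_nested` is a hypothetical name for illustration.

```python
from collections import defaultdict

# (pupil, school, area) identifiers copied from the data frame in this section.
rows = [
    (1, 1, 1), (2, 1, 2), (3, 1, 1), (4, 2, 2), (5, 2, 1), (6, 3, 2),
    (7, 3, 2), (8, 3, 3), (9, 4, 3), (10, 4, 2), (11, 4, 3), (12, 4, 3),
]


def is_nested(pairs):
    """True if every lower unit appears with exactly one higher unit."""
    parents = defaultdict(set)
    for lower, higher in pairs:
        parents[lower].add(higher)
    return all(len(found) == 1 for found in parents.values())


schools_in_areas = is_nested((school, area) for _, school, area in rows)
areas_in_schools = is_nested((area, school) for _, school, area in rows)
pupils_in_schools = is_nested((pupil, school) for pupil, school, _ in rows)

print("schools nested in areas:", schools_in_areas)
print("areas nested in schools:", areas_in_schools)
print("pupils nested in schools:", pupils_in_schools)
```

Neither school nor area nests in the other, while pupils nest in both, which is the signature of a cross-classification at the higher level.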
2.22 Other examples of cross-classified structures
• Students within a cross-classification of primary school by secondary school. We may have students' exam scores at age 16 and wish to assess the relative effects of primary and secondary schools on attainment at age 16.
• Patients within a cross-classification of GP practice and hospital.
• Exam marks within a cross-classification of student and examiner, where a student's paper is marked by more than one examiner to get an indication of examiner reliability. Each student's marks (m1 to m8) are spread over two of the three examiners (examiner 1, examiner 2, examiner 3):
Student 1  m1 m2
Student 2  m3 m4
Student 3  m5 m6
Student 4  m7 m8
Note that in this case we have at most one level 1 unit (mark) per cell in the cross-classification.
2.24 Multiple membership models
Multiple membership arises where atomic units are seen as nested within more than one unit from a higher-level classification. For example:
• Health outcomes where patients are treated by a number of nurses: patients are multiple members of nurses.
• Students move schools, so some pupils are multiple members of schools.
School S1 S2 S3 S4
Pupils P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12P1 P8
Teacher
Pupil
School S1 S2 S3 S4
Pupils P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12
Area A1 A2 A3
2.23 Combining structures: crossed-classifications and multiple membership relationships
Let's take the cross-classified model of the previous slide but suppose:
• Pupil 1 moves in the course of the study from residential area 1 to 2 and from school 1 to 2
• Pupil 8 has moved schools but still lives in the same area
• Pupil 7 has moved areas but still attends the same school
Now, in addition to schools being crossed with residential areas, pupils are multiple members of both areas and schools.
2.24 Classification diagram for multiple membership model
Classification diagram: Student linked to School and to Area (School crossed with Area)
• Students nested within a cross-classification of school by area
• Students multiple members of schools
• Students multiple members of areas
Unit diagram:
School  S1 S2 S3 S4
Pupils  P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 (P1, P7 and P8 have multiple links)
Area    A1 A2 A3

          Area 1   Area 2   Area 3
School 1  P1,P3    P1,P2
School 2  P1,P5    P1,P4
School 3           P6,P7    P7,P8
School 4           P10      P8,P9,P11,P12
2.25 Combining structures: crossed, nested and multiple membership relationships
Unit diagram:
Hospital     H1 H2
Nurse        N1 N2 N3 N4
Patient      P1 P2 P3 P4 P5 P6
GP practice  GP1 GP2 GP3
• Patients can be treated by more than one nurse during their stay in hospital: patients are multiple members of nurses.
• Nurses work in only one hospital, therefore nurses are nested within hospitals.
• Patients are nested within referring GPs. GPs are crossed with nurses. GPs are crossed with hospitals.
Classification diagram: Patient linked to Nurse (multiple membership) and to GP practice; Nurse within Hospital
2.26 Distinguishing Variables and Levels
Unit diagram:
School type  state                private
School       S1        S3         S2      S4
Pupils       P1 P2 P3  P6 P7 P8   P4 P5   P9 P10 P11 P12

Classifications or levels: Pupil i, School j, School type k. Response: Pupil exam score_ijk. Explanatory variables: Previous exam score_ijk, Pupil gender_ijk.

Pupil i  School j  School type k  Exam score_ijk  Previous exam score_ijk  Gender_ijk
1        1         State          75              56                       M
2        1         State          71              45                       M
3        1         State          91              72                       F
1        2         Private        68              49                       F
2        2         Private        37              36                       M
etc.

Is school type a third level? No!
School type is not a random classification; it is a fixed classification, and is therefore treated as a variable, not as a level.
A classification is random if its units can be regarded as a random sample from a wider population of units, e.g. pupils and schools.
A fixed classification has a small fixed number of categories. For example, State and Private are not two types sampled from a large number of types; on the basis of these two we cannot generalise to a wider population of types of schools.
Similarly for gender.
3.0 Work with a partner, discussing what type of multilevel data structure corresponds to each participant's data (20 mins)
Draw free-hand a classification diagram giving labels for units at each level and linking the nodes by appropriate arrows to reflect nested, crossed or MM relationships
Complete a schematic data frame for your data set.
Either use overheads provided or whatever software you find convenient.
4.0 Discussion of Exercise 3.0
Each participant takes 2 minutes to present the multilevel structure for their research problem
5: Modelling varying relations: from graphs to equations
“There are NO general laws in social science that are constant over time and independent of the context in which they are embedded”
Rein (quoted in King, 1976)
5.1 Varying relations plot
• Simple set up
- Two-level model
- houses at level 1 nested within districts at level 2
• Single continuous response: price of a house
• Single continuous predictor: size = number of rooms; this variable has been centred around the average size of 5
[Plot: house price against $x_1$; horizontal axis shows rooms from 1 to 8, equivalently $x_1$ from -4 to 3]
5.3 General Structure for Statistical models
• Response = general trend + fluctuations
• Response = systematic component + stochastic element
• Response = fixed + random
• Specific case: the single-level simple regression model

Response     =  Systematic part                                              +  Random part
House price  =  Price of average-sized house  +  Cost of extra room          +  House residual variation
                (Intercept)                      (Slope)                        (Residual)
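5.4 Simple regression model
$y$ is the outcome, the price of a house
$x_1$ is the predictor, number of rooms, which we shall deviate around its mean, 5
[Plot: fitted line of price against $x_1$ (rooms 1 to 8, $x_1$ from -4 to 3), with intercept $\beta_0$ at $x_1 = 0$ and slope $\beta_1$]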
5.5 Simple regression model (cont)
$$y_i = \beta_0 + \beta_1 x_{1i} + (e_i)$$
$y_i$ is the price of house $i$
$x_{1i}$ is the individual predictor variable
$\beta_0$ is the intercept
$\beta_1$ is the fixed slope term
$e_i$ is the residual/random term, one for every house
Summarizing the random term: ASSUME IID
• Mean of the random term is zero
• Constant variability (homoscedasticity)
• No patterning of the residuals (i.e. they are independent)
$$e_i \sim N(0, \sigma^2_e)$$
$\sigma^2_e$ is the between-house variance, conditional on size
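5.6 Random intercepts model
Micro-model:
$$y_{ij} = \beta_{0j} + \beta_1 x_{1ij} + e_{ij}$$
Macro-model (index parameter as a response):
$$\beta_{0j} = \beta_0 + u_{0j}$$
Price of average house in district j = citywide price + differential for district j
$u_{0j}$ is the differential shift for each district $j$: index the intercept
[Plot: citywide line $\hat{Y} = \beta_0 + \beta_1 x_1$ with parallel district lines offset by $u_{0j}$; a district above the line has a premium, a district below it a discount]
Substitute macro into micro…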
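5.7 Random intercepts COMBINED model
Substituting the macro model into the micro model yields
$$y_{ij} = (\beta_0 + u_{0j}) + \beta_1 x_{1ij} + e_{ij}$$
Grouping the random parameters in brackets
$$y_{ij} = \beta_0 + \beta_1 x_{1ij} + (u_{0j} + e_{ij})$$
• Fixed part: $\beta_0$, $\beta_1$
• Random part (Level 2): $u_{0j} \sim N(0, \sigma^2_{u0})$
• Random part (Level 1): $e_{0ij} \sim N(0, \sigma^2_{e0})$, where $e_{0ij}$ is the level 1 residual $e_{ij}$ written with the subscript of the constant it is attached to
• District and house differentials are independent: $\mathrm{Cov}[u_{0j}, e_{0ij}] = 0$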
5.8 The meaning of the random terms
• Level 2: between districts
$$u_{0j} \sim N(0, \sigma^2_{u0})$$
$\sigma^2_{u0}$ is the between-district variance, conditional on size
• Level 1: within districts, between houses
$$e_{0ij} \sim N(0, \sigma^2_{e0})$$
$\sigma^2_{e0}$ is the within-district, between-house variance, conditional on size
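5.9 Variants on the same model
• Combined model
$$y_{ij} = \beta_0 + \beta_1 x_{1ij} + (u_{0j} + e_{0ij})$$
• Combined model in full
$$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + (u_{0j} x_{0ij} + e_{0ij} x_{0ij})$$
• $x_{0ij}$ is the constant: a set of 1's
• The bracketed terms are the differentials at each level
• In MLwiN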
5.10 Random intercepts and random slopes
5.11 Random intercepts and slopes model
Micro-model
$$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + e_{0ij} x_{0ij}$$
Note: index the intercept and the slope, associated with the constant and the number of rooms respectively
Macro-model (Random Intercepts)
$$\beta_{0j} = \beta_0 + u_{0j}$$
Macro-model (Random Slopes)
$$\beta_{1j} = \beta_1 + u_{1j}$$
Slope for district j = citywide slope + differential slope for district j
Substitute macro models into micro model…
5.12 Random slopes model
Substituting the macro model into the micro model yields
$$y_{ij} = (\beta_0 + u_{0j}) x_{0ij} + (\beta_1 + u_{1j}) x_{1ij} + e_{0ij} x_{0ij}$$
Multiplying the parameters with the associated variable and grouping them into fixed and random parameters yields the combined model:
$$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij})$$
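5.13 Characteristics of random intercepts & slopes model
$$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij})$$
• Fixed part: $\beta_0$, $\beta_1$
• Random part (Level 2):
$$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N\left(0, \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u0u1} & \sigma^2_{u1} \end{bmatrix}\right)$$
• Random part (Level 1):
$$[e_{0ij}] \sim N(0, \sigma^2_{e0})$$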
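To make the roles of the fixed and random parts concrete, the following Python sketch simulates house prices from the combined random intercepts and slopes model. All parameter values and the function name `simulate` are illustrative assumptions, and for simplicity the district intercept and slope differentials are drawn independently (that is, the covariance $\sigma_{u0u1}$ is set to zero, which the model itself does not require).

```python
import random


def simulate(n_districts=50, n_houses=20, beta0=80.0, beta1=10.0,
             sd_u0=8.0, sd_u1=2.0, sd_e0=5.0, seed=1):
    """Simulate (district, x1, price) rows from the combined model.

    y_ij = beta0 + beta1*x1_ij + (u0_j + u1_j*x1_ij + e0_ij), with x0 = 1.
    All defaults are made-up values for illustration only.
    """
    rng = random.Random(seed)
    rows = []
    for j in range(n_districts):
        u0 = rng.gauss(0.0, sd_u0)  # district differential in the intercept
        u1 = rng.gauss(0.0, sd_u1)  # district differential in the slope (independent of u0 here)
        for _ in range(n_houses):
            x1 = rng.randint(1, 8) - 5  # number of rooms centred on 5
            e0 = rng.gauss(0.0, sd_e0)  # house-level residual
            y = beta0 + beta1 * x1 + (u0 + u1 * x1 + e0)
            rows.append((j, x1, y))
    return rows


data = simulate()
print(len(data), data[0])
```

Setting `sd_u1=0.0` collapses the sketch to the random intercepts model of 5.7, where district lines are parallel.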
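5.14 Interpreting varying relationship plots through means and variance-covariances
Intercepts are the terms associated with the constant $x_0$; slopes are the terms associated with the predictor $x_1$; the intercept/slope covariance is associated with $x_0 x_1$.

Graph  Intercept mean ($\beta_0$)  Intercept variance ($\sigma^2_{u0}$)  Slope mean ($\beta_1$)  Slope variance ($\sigma^2_{u1}$)  Covariance ($\sigma_{u0u1}$)
A      +                           0                                     +                       0                                 undefined
B      +                           +                                     +                       0                                 undefined
C      +                           +                                     +                       +                                 +
D      +                           +                                     +                       +                                 -
E      +                           +                                     +                       +                                 0

Single-level model (graph A):
$$y_i = \beta_0 + \beta_1 x_i + e_i, \qquad e_i \sim N(0, \sigma^2_e)$$
Random intercepts model (graph B):
$$y_{ij} = \beta_{0j} + \beta_1 x_{ij} + e_{ij}, \qquad \beta_{0j} = \beta_0 + u_{0j}$$
$$u_{0j} \sim N(0, \sigma^2_{u0}), \qquad e_{ij} \sim N(0, \sigma^2_e)$$
Random intercepts and slopes model (graphs C, D and E):
$$y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}, \qquad \beta_{0j} = \beta_0 + u_{0j}, \qquad \beta_{1j} = \beta_1 + u_{1j}$$
$$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u0u1} & \sigma^2_{u1} \end{bmatrix}, \qquad e_{ij} \sim N(0, \sigma^2_e)$$
[Five plots, A to E: attain (vertical axis) against pre-test (horizontal axis)]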
5.16 Random intercepts and slopes model in MLwiN
6.1 Fitting models in MLwiN
• Work through (at your own pace) Chapter 4 of the manual; Random slopes and intercepts models
• Don’t be afraid to ask!
Summary of Sessions 5+6
S1: Type of questions tackled by multilevel modelling I
• 2-level model: current attainment given prior attainment of pupils (1) in schools (2)
• NB assuming a random sample of pupils from a random sample of schools
• Do boys make greater progress than girls? (F)
• Are boys more or less variable in their progress than girls? (R)
• What is the between-school variation in progress? (R)
• Is School X different from other schools in the sample in its effect? (F)
• continued…
S2: Type of questions tackled by multilevel modelling II
• Are schools more variable in their progress for pupils with low prior attainment? (R)
• Does the gender gap vary across schools? (R)
• Do pupils make more progress in denominational schools? (F)
• Are pupils in denominational schools less variable in their progress? (R)
• Do girls make greater progress in denominational schools? (F) (cross-level interaction)
S3 Problems with not doing a multilevel analysis
•Substantive: the between school variability and what factors reduce it are generally of fundamental interest to us. A single level model gives us no estimate of between school variability.
•Technical: If the higher level clustering is not properly accounted for in the model then inferences we make about other predictors will be incorrect. We will tend to infer a relationship where none exists.
S4: Fixed and Random classifications
Random classification
• Generalization of a level (e.g. schools)
• Random effects come from a distribution
• All schools contribute to the between-school variance
Fixed classification
• Discrete categories of a variable (e.g. gender)
• Not a sample from a population
• Specific categories only contribute to their respective means
S5 When levels become variables...
• Schools can be treated as a variable and placed in the fixed part. This is achieved by a set of dummy variables, one for each school; the target of inference is each specific school; each one is treated as an ‘island unto itself’.
• No shrinkage, but no ‘help’ from the rest of the data; hence unreliable estimates when the number of pupils in a school is small.
• Schools in the random part are treated as a level, with generalization possible to ALL schools (or the ‘population’ of schools); in addition we can predict specific school effects, given that they come from an overall distribution.
• Shrinkage towards zero for unreliably estimated schools.
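The shrinkage mentioned above can be illustrated for the random intercepts model. A minimal Python sketch follows, assuming the standard shrinkage factor $\sigma^2_{u0} / (\sigma^2_{u0} + \sigma^2_{e}/n_j)$ applied to a school's raw mean residual; the variance values and function names are made up for illustration and are not estimates from the course data.

```python
def shrinkage_factor(sigma2_u0, sigma2_e, n_j):
    """Weight given to a school's own data: near 1 for large n_j, near 0 for small n_j."""
    return sigma2_u0 / (sigma2_u0 + sigma2_e / n_j)


def shrunken_residual(raw_mean_residual, sigma2_u0, sigma2_e, n_j):
    """Raw school mean residual pulled towards zero (the overall line)."""
    return shrinkage_factor(sigma2_u0, sigma2_e, n_j) * raw_mean_residual


# Illustrative variances only (assumed, not estimated): level 2 = 0.1, level 1 = 0.5.
for n_j in (2, 10, 100):
    factor = shrinkage_factor(0.1, 0.5, n_j)
    print(n_j, round(factor, 3), round(shrunken_residual(0.4, 0.1, 0.5, n_j), 3))
```

A school with only two pupils keeps less than a third of its raw residual under these assumed variances, while a school with a hundred pupils keeps almost all of it.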
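S6 Recap on: Random intercept models (parallel lines)
$$y_{ij} = \beta_{0j} + \beta_1 x_{1ij} + e_{ij}$$
$$\beta_{0j} = \beta_0 + u_{0j}$$
$$u_{0j} \sim N(0, \sigma^2_{u0})$$
$$e_{ij} \sim N(0, \sigma^2_e)$$
[Plot: overall line $\beta_0 + \beta_1 x_{1ij}$ over $x$ from -3 to +3, with parallel lines for school 1 and school 2 offset by $u_{0,1}$ and $u_{0,2}$]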
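S7 Recap on: Random intercepts and slopes model
$$y_{ij} = \beta_{0j} + \beta_{1j} x_{1ij} + e_{ij}$$
$$\beta_{0j} = \beta_0 + u_{0j}$$
$$\beta_{1j} = \beta_1 + u_{1j}$$
$$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u01} & \sigma^2_{u1} \end{bmatrix}$$
$$e_{ij} \sim N(0, \sigma^2_e)$$
[Plot: overall line $\beta_0 + \beta_1 x_{1ij}$ over $x$ from -3 to 3, with non-parallel lines for school 1 and school 2; labels $u_{1,1}$, $u_{0,2}$ and $u_{1,2}$ mark the school differentials]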
S8 Model in Manual : p54
S9 Estimates in Manual : p54
S10 Plot of predictions for schools: p56
7: Multilevel residuals
8.0 Contextual effects
In the previous sections we found that schools vary in both their intercepts and slopes, resulting in crossing lines. The next question is: are there any school-level variables that can explain this variation?
Interest lies in how the outcome of individuals in a cluster is affected by their social contexts (measures at the cluster level). Typical questions are:
• Does school type affect students' exam results?
• Does area deprivation affect the health status of individuals in the area?
In our data set we have a contextual school ability measure, schav. The mean intake score is formed for each school, these means are ranked, and the ranks are categorised into 3 groups:
low: bottom 25%; mid: between 25% and 75%; high: top 25%
8.1 Exploring contextual effects and the tutorial data
Does school gender affect exam score by gender?
• Do boys in boys’ schools do better, worse or the same compared with boys in mixed schools?
• Do girls in girls’ schools do better, worse or the same compared with girls in mixed schools?
Does peer group ability affect individual pupil performance?
• That is, given two pupils of equal intake ability, do they progress differently depending on whether they are educated in a low, mid or high ability peer group?
8.2 School gender effects
girl  boysch  girlsch  Group                Prediction
0     0       0        boy, mixed school    -0.189
1     0       0        girl, mixed school   -0.189 + 0.168
0     1       0        boy, boys' school    -0.189 + 0.180
1     0       1        girl, girls' school  -0.189 + 0.168 + 0.175
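The four predictions are sums of the coefficients switched on by each group's dummy variables. A short Python sketch of that arithmetic follows, using the coefficients shown in this section; the dictionary keys (in particular `cons` for the reference-group value) and the function name are illustrative labels, and other predictors in the model are assumed to be held at zero.

```python
# Coefficients as shown in this section; "cons" is the boy/mixed-school reference value.
COEF = {"cons": -0.189, "girl": 0.168, "boysch": 0.180, "girlsch": 0.175}


def predict(girl, boysch, girlsch):
    """Predicted score for a given dummy-variable pattern."""
    return (COEF["cons"] + COEF["girl"] * girl
            + COEF["boysch"] * boysch + COEF["girlsch"] * girlsch)


groups = {
    "boy, mixed school": (0, 0, 0),
    "girl, mixed school": (1, 0, 0),
    "boy, boys' school": (0, 1, 0),
    "girl, girls' school": (1, 0, 1),
}
predictions = {name: round(predict(*dummies), 3) for name, dummies in groups.items()}
for name, value in predictions.items():
    print(f"{name}: {value}")
```

Read this way, the single-sex school coefficients are the differences between same-gender pupils in single-sex and mixed schools.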
8.3 Peer group ability effects
The effect of peer group ability is modelled as being constant across gender, school gender and standlrt.
For example, comparing peer group ability effects for boys in mixed schools and boys in boys’ schools:
-0.265 + 0.552*standlrt_ij : boy, mixed school, low (reference group)
+0.067 : boy, mixed school, mid
+0.174 : boy, mixed school, high
Boys’ school = 0.187, added to each of: boy, boys’ school, low; boy, boys’ school, mid; boy, boys’ school, high
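Because the peer group effect is constant across school gender, the boys' school lines are the mixed-school lines shifted by the same amount. The Python sketch below works this through with the coefficients quoted in this section; the function name `predict_boy` and its argument names are illustrative.

```python
PEER_SHIFT = {"low": 0.0, "mid": 0.067, "high": 0.174}  # relative to the low reference group
BOYS_SCHOOL_SHIFT = 0.187


def predict_boy(standlrt, peer="low", boys_school=False):
    """Predicted score for a boy, given intake score, peer group and school gender."""
    shift = PEER_SHIFT[peer] + (BOYS_SCHOOL_SHIFT if boys_school else 0.0)
    return -0.265 + 0.552 * standlrt + shift


for peer in ("low", "mid", "high"):
    mixed = predict_boy(0.0, peer, boys_school=False)
    single = predict_boy(0.0, peer, boys_school=True)
    print(peer, round(mixed, 3), round(single, 3), round(single - mixed, 3))
```

The last column is the same for every peer group, which is what "modelled as being constant" means here; an interaction term would be needed for it to differ.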
8.4 Cross level interactions
There may be interactions between school gender, peer group ability, gender and standlrt. An interesting interaction is between peer group ability and standlrt. This tests whether the effect of peer group differs across the standlrt intake spectrum. For example, being in a high ability group may have a different effect for pupils of different ability. This is a cross level interaction because it is the interaction between a pupil level variable(standlrt) and a school level variable(schav).
8.5 Cross level interactions cont’d
The interaction leads to three lines for the low, mid and high groupings:
-0.347 + 0.455 standlrt_ij : low
(-0.347 + 0.144) + (0.455 + 0.092) standlrt_ij : mid
(-0.347 + 0.290) + (0.455 + 0.180) standlrt_ij : high
Note that high ability pupils (standlrt = 2.6) score roughly three-quarters of a standard deviation higher (0.290 + 0.180 × 2.6 ≈ 0.76) if they are educated in high rather than low ability peer groups.
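With a cross-level interaction the gap between peer groups depends on the pupil's own intake score. The Python sketch below evaluates the three lines using the coefficients quoted in this section; the names `LINES` and `predict` are illustrative. With these coefficients the high versus low gap at standlrt = 2.6 works out to about 0.76.

```python
# (intercept, slope) for each peer group, built from the coefficients in this section.
LINES = {
    "low": (-0.347, 0.455),
    "mid": (-0.347 + 0.144, 0.455 + 0.092),
    "high": (-0.347 + 0.290, 0.455 + 0.180),
}


def predict(standlrt, group):
    """Predicted score on the line for the given peer group."""
    intercept, slope = LINES[group]
    return intercept + slope * standlrt


gap_average_pupil = predict(0.0, "high") - predict(0.0, "low")
gap_high_ability = predict(2.6, "high") - predict(2.6, "low")
print(round(gap_average_pupil, 3), round(gap_high_ability, 3))
```

The gap grows with standlrt because the high group has the steeper slope; below standlrt of about -1.6 the sign of the gap reverses under these coefficients.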
9.1 Repeated measures.
We may have repeated measurements on individuals, for example a series of heights or test scores. Often we want to model people's growth. We can fit this structure as a multilevel model with repeated measurements nested within people. That is:
Person    P1         P2      P3…
Occasion  O1 O2 O3   O1 O2   O1 O2 O3 O4
9.2 Advantages of fitting repeated measures models in a multilevel framework
Fitting these structures using a multilevel model has the advantages that data can be:
• Unbalanced (individuals can have different numbers of measurement occasions)
• Unequally spaced (different individuals can be measured at different ages)
This is in contrast to traditional multivariate techniques, which require data to be balanced and equally spaced.
Again, the multilevel model requires that missing response measurements are MCAR (missing completely at random) or MAR (missing at random).
9.3 An example from the MLwiN user guide
Repeated measures model for children's reading scores
This random intercepts model treats growth as a linear process, with individuals varying only in their intercepts. That is, for the 405 individuals in the data set:
The global mean is predicted by
$$\beta_0 x_0 + \beta_1 x_{1ij}$$
The $j$th child's growth curve is predicted by
$$(\beta_0 + u_{0j}) x_0 + \beta_1 \mathrm{age}_{ij}$$
9.4 Further possibilities for repeated measures model
• We can go on to fit a random slopes model, which in this case allows the model to deal with children growing at different rates.
• We can fit polynomials in age to allow for curvilinear growth.
• We can also try to explain between-individual variation in growth by introducing child-level variables.
• If appropriate we can include further levels of nesting. For example, if children are nested within schools we could fit a 3-level model [occasions:children:schools]. We could then look to see if children's patterns of growth varied across schools.
10.0 Variance functions or modelling heteroscedasticity
Tabulating normexam by gender, we see that the mean and variance are –0.140 and 1.051 for boys, and 0.093 and 0.940 for girls.
We may want to fit a model that estimates separate variances for boys and girls. The notation we have been using so far assumes a common intercept ($\beta_0$) and a single set of student residuals, $e_i$, with a common variance $\sigma^2_e$. We need to use a more flexible notation to build this model.
10.1 Working with general notation in MLwiN
A model with no variables specified in general notation looks like this.
A new first line is added stating that the response variable follows a Normal distribution. We now have the flexibility to specify alternative distributions for our response. We will explore these models later.
The $\beta_0$ coefficient now has an explanatory variable $x_0$ associated with it. The values $x_0$ takes determine the meaning of the $\beta_0$ coefficient. If $x_0$ is a vector of 1s then $\beta_0$ will estimate an intercept common to all individuals; in the absence of other predictors this would be the overall mean. If $x_0$ is a dummy variable, say 1 for boys and 0 for girls, then $\beta_0$ will estimate the mean for boys.
10.2 A simple variance function
The new notation allows us to set up this simple model, where $x_{0i}$ is a dummy variable for boy and $x_{1i}$ is a dummy variable for girl. This model estimates separate means and variances for the two groups.
This is an example of a variance function, because the variance changes as a function of explanatory variables. The function is:
$$\mathrm{var}(y_i) = \sigma^2_{e0} x_{0i} + \sigma^2_{e1} x_{1i}$$
10.3 Deriving the variance function
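We want to derive
$$\mathrm{var}(y_i) = \sigma^2_{e0} x_{0i} + \sigma^2_{e1} x_{1i} \qquad (1)$$
By taking the basic model
$$y_i = \beta_{0i} x_{0i} + \beta_{1i} x_{1i}, \qquad \beta_{0i} = \beta_0 + e_{0i}, \qquad \beta_{1i} = \beta_1 + e_{1i}$$
and rearranging it
$$y_i = \beta_0 x_{0i} + \beta_1 x_{1i} + e_{0i} x_{0i} + e_{1i} x_{1i}$$
we arrive at the expression
$$\mathrm{var}(y_i) = \mathrm{var}(e_{0i} x_{0i} + e_{1i} x_{1i}) = \mathrm{var}(e_{0i} x_{0i}) + 2\,\mathrm{cov}(e_{0i} x_{0i}, e_{1i} x_{1i}) + \mathrm{var}(e_{1i} x_{1i})$$
$$= \mathrm{var}(e_{0i})\, x_{0i}^2 + 2\,\mathrm{cov}(e_{0i}, e_{1i})\, x_{0i} x_{1i} + \mathrm{var}(e_{1i})\, x_{1i}^2$$
$$= \sigma^2_{e0} x_{0i}^2 + 2\sigma_{e01} x_{0i} x_{1i} + \sigma^2_{e1} x_{1i}^2$$
Because a student cannot be both a boy and a girl, $x_{0i} x_{1i} = 0$ and the covariance term drops out. Also, $x_{0i}$ and $x_{1i}$ are (0,1) variables, so $x_{0i}^2 = x_{0i}$ and $x_{1i}^2 = x_{1i}$, and so we arrive at (1).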
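The simplification in this derivation can be checked numerically: for boy and girl dummies the general quadratic form and expression (1) give the same variance whatever the covariance is. The Python sketch below uses the raw variances tabulated in section 10.0 (1.051 for boys, 0.940 for girls) as stand-ins for $\sigma^2_{e0}$ and $\sigma^2_{e1}$; the covariance value and the function names are arbitrary illustrative choices.

```python
def var_general(x0, x1, s2_e0, s_e01, s2_e1):
    """General quadratic form: s2_e0*x0^2 + 2*s_e01*x0*x1 + s2_e1*x1^2."""
    return s2_e0 * x0 ** 2 + 2 * s_e01 * x0 * x1 + s2_e1 * x1 ** 2


def var_simplified(x0, x1, s2_e0, s2_e1):
    """Expression (1), valid when x0 and x1 are mutually exclusive 0/1 dummies."""
    return s2_e0 * x0 + s2_e1 * x1


S2_BOY, S2_GIRL = 1.051, 0.940  # tabulated variances, used here as stand-in values
ARBITRARY_COV = 0.3             # any value: it never contributes for these dummies

boy = (1, 0)
girl = (0, 1)
for label, (x0, x1) in (("boy", boy), ("girl", girl)):
    general = var_general(x0, x1, S2_BOY, ARBITRARY_COV, S2_GIRL)
    simple = var_simplified(x0, x1, S2_BOY, S2_GIRL)
    print(label, general, simple)
```

The two forms only diverge if both variables can be non-zero at once, which is the situation treated in the later sections where one of them is a continuous predictor.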
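10.4 Variance functions at level 2
The notion of variance functions is powerful and not restricted to level 1 variances.
The random slopes model fitted earlier produces school-level predictions which show school-level variability increasing with intake score.
The model
$$y_{ij} = \beta_{0ij} x_0 + \beta_{1j} x_{1ij}$$
$$\beta_{0ij} = \beta_0 + u_{0j} + e_{0ij}$$
$$\beta_{1j} = \beta_1 + u_{1j}$$
$$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u01} & \sigma^2_{u1} \end{bmatrix}, \qquad e_{0ij} \sim N(0, \sigma^2_{e0})$$
can be rewritten as
$$y_{ij} = \beta_0 x_0 + \beta_1 x_{1ij} + u_{0j} x_0 + u_{1j} x_{1ij} + e_{0ij} x_0$$
with the same distributional assumptions for $u_{0j}$, $u_{1j}$ and $e_{0ij}$.
So the between-school variance is
$$\mathrm{var}(u_{0j} x_0 + u_{1j} x_{1ij}) = E\left[(u_{0j} x_0 + u_{1j} x_{1ij})^2\right] = \sigma^2_{u0} x_0^2 + 2\sigma_{u01} x_0 x_{1ij} + \sigma^2_{u1} x_{1ij}^2$$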
10.5 Two views of the level 2 variance
Given $x_0 = [1]$, we have
$$\mathrm{var}(u_{0j} x_0 + u_{1j} x_{1ij}) = \sigma^2_{u0} x_0^2 + 2\sigma_{u01} x_0 x_{1ij} + \sigma^2_{u1} x_{1ij}^2 = \sigma^2_{u0} + 2\sigma_{u01} x_{1ij} + \sigma^2_{u1} x_{1ij}^2$$
which shows that the level 2 variance is a polynomial (quadratic) function of $x_{1ij}$:
$$\mathrm{var}(u_{0j} x_0 + u_{1j} x_{1ij}) = a + b x_{1ij} + c x_{1ij}^2 = 0.9 + (2 \times 0.018) x_{1ij} + 0.015 x_{1ij}^2$$
• View 1: In terms of school lines, with predicted intercepts and slopes varying across schools.
• View 2: In terms of a variance function, which shows how the level 2 variance changes as a function of one or more explanatory variables.
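The second view can be made concrete by evaluating the quadratic variance function over a range of intake scores. The Python sketch below does this for hypothetical parameter values (chosen for easy arithmetic, not taken from the fitted model); the function names are likewise illustrative.

```python
def level2_variance(x1, s2_u0, s_u01, s2_u1):
    """Between-school variance at predictor value x1, with x0 = 1."""
    return s2_u0 + 2 * s_u01 * x1 + s2_u1 * x1 ** 2


def turning_point(s_u01, s2_u1):
    """Value of x1 at which the quadratic variance function is smallest."""
    return -s_u01 / s2_u1


# Hypothetical values for illustration only.
S2_U0, S_U01, S2_U1 = 0.10, 0.02, 0.01

for x1 in (-3, -2, -1, 0, 1, 2, 3):
    print(x1, round(level2_variance(x1, S2_U0, S_U01, S2_U1), 3))
print("minimum at x1 =", turning_point(S_U01, S2_U1))
```

With a positive covariance the minimum lies to the left of zero, so across most of the observed range the between-school variance rises with the predictor, matching the fanning-out pattern of the school lines in view 1.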
10.6 Elaborating the level 1 variance
Maybe the student-level departures around their schools' summary lines do not have constant variance.
Note that at level 2 we have two interpretations of the random variation: random coefficients (slopes and intercepts varying across level 2 units) and variance functions. In each level 1 unit we have, by definition, only one point, so the first interpretation does not exist: you cannot have a slope given a single data point.
[Plots: lines for 2 schools; points for 2 students]
10.7 Variance functions at level 1
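If we allow standlrt ($x_{1ij}$) to have a random term at level 1, we get

$$y_{ij} = \beta_0x_0 + \beta_1x_{1ij} + u_{0j}x_0 + u_{1j}x_{1ij} + e_{0ij}x_0 + e_{1ij}x_{1ij}$$
$$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u01} & \sigma^2_{u1} \end{bmatrix}$$
$$\begin{bmatrix} e_{0ij} \\ e_{1ij} \end{bmatrix} \sim N(0, \Omega_e), \quad \Omega_e = \begin{bmatrix} \sigma^2_{e0} & \\ \sigma_{e01} & \sigma^2_{e1} \end{bmatrix}$$

So the student level variance is now:

$$\operatorname{var}(e_{0ij}x_0 + e_{1ij}x_{1ij}) = \sigma^2_{e0}x_0^2 + 2\sigma_{e01}x_0x_{1ij} + \sigma^2_{e1}x_{1ij}^2 = 0.533 - (2 \times 0.015)x_{1ij} + 0.001x_{1ij}^2$$

The resulting graph shows the level 1 variance decreasing with standlrt. This accentuates the importance of school level factors in driving variation in the outcome score, particularly for high ability pupils.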
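The two variance functions can be evaluated side by side. The sketch below is only illustrative: the coefficients are the ones read from the variance functions in 10.5 and 10.7 (with the level 1 covariance taken as negative, in line with the decreasing level 1 variance described above), and the function names are made up for this example.

```python
def variance_function(x, var0, cov01, var1):
    """Quadratic variance function: var0 + 2*cov01*x + var1*x**2."""
    return var0 + 2.0 * cov01 * x + var1 * x ** 2

# Level 2 (school) coefficients as read from section 10.5.
LEVEL2 = {"var0": 0.9, "cov01": 0.018, "var1": 0.015}
# Level 1 (student) coefficients as read from section 10.7; the covariance
# is assumed negative so that the variance falls as standlrt rises.
LEVEL1 = {"var0": 0.533, "cov01": -0.015, "var1": 0.001}

def school_share(x):
    """Proportion of the total variance that lies at school level for intake x."""
    level2 = variance_function(x, **LEVEL2)
    level1 = variance_function(x, **LEVEL1)
    return level2 / (level1 + level2)

for standlrt in (-2.0, 0.0, 2.0):
    print(
        standlrt,
        round(variance_function(standlrt, **LEVEL2), 3),
        round(variance_function(standlrt, **LEVEL1), 3),
        round(school_share(standlrt), 3),
    )
```

Under these values the school share of the variance rises with intake score, which is the pattern the text attributes to high ability pupils.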
10.8 Modelling the mean and variance simultaneously
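In our model

$$y_{ij} = \beta_0x_0 + \beta_1x_{1ij} + u_{0j}x_0 + u_{1j}x_{1ij} + e_{0ij}x_0 + e_{1ij}x_{1ij}$$
$$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u01} & \sigma^2_{u1} \end{bmatrix}$$
$$\begin{bmatrix} e_{0ij} \\ e_{1ij} \end{bmatrix} \sim N(0, \Omega_e), \quad \Omega_e = \begin{bmatrix} \sigma^2_{e0} & \\ \sigma_{e01} & \sigma^2_{e1} \end{bmatrix}$$

The global mean is predicted by

$$\beta_0x_0 + \beta_1x_{1ij}$$

The jth school mean is predicted by

$$(\beta_0 + u_{0j})x_0 + (\beta_1 + u_{1j})x_{1ij}$$

The student level variance is

$$\operatorname{var}(e_{0ij}x_0 + e_{1ij}x_{1ij}) = \sigma^2_{e0} + 2\sigma_{e01}x_{1ij} + \sigma^2_{e1}x_{1ij}^2$$

The school level variance is

$$\operatorname{var}(u_{0j}x_0 + u_{1j}x_{1ij}) = \sigma^2_{u0} + 2\sigma_{u01}x_{1ij} + \sigma^2_{u1}x_{1ij}^2$$

Whereas ordinary regression:

$$y_i = \beta_0 + \beta_1x_{1i} + e_i, \quad e_i \sim N(0, \sigma^2_e)$$

estimates the global relationship and has a single catch-all bucket for the variance.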
11.00 Applied Paper – Example of Variance functions
Jenny Jenkins, Jon Rasbash and Tom O’Connor. Understanding the sources of differential parenting: the role of child and family level effects. Developmental Psychology 2003(1) 99-113
11.01 Mapping multilevel terminology to psychological terminology
• Level 2 : Family, shared environment
Variables : family ses, marital problems
• Level 1 : Child, non-shared environment, child specific
Variables : age, sex, temperament
11.02 Background
• Recent studies in developmental psychology and behavioural genetics emphasise that non-shared environment is much more important than shared environment in explaining children’s adjustment. This has led to a focus on non-shared environment (Plomin et al, 1994; Turkheimer & Waldron, 2000).
• Has this meant that we have ignored the role of the shared family context both empirically and conceptually?
11.03 Questions
• One key aspect of the non-shared environment that has been investigated is differential parental treatment of siblings.
• Differential treatment predicts differences in sibling adjustment
• What are the sources of differential treatment?
• Child specific/non-shared: age, temperament, biological relatedness
• Can family level shared environmental factors influence differential treatment?
11.04 The Stress/Resources Hypothesis
“Parents have a finite amount of resources in terms of time, attention, patience and support to give their children. In families in which most of these resources are devoted to coping with economic stress, depression and/or marital conflict, parents may become less consciously or intentionally equitable and more driven by preferences or child characteristics in their childrearing efforts”. Henderson et al 1996.
This is the hypothesis we wish to test. We operationalised the stress/resources hypothesis using four contextual variables: socioeconomic status, single parenthood, large family size, and marital conflict.
Do family contexts (shared environment) increase or decrease the extent to which children within the same family are treated differently?
11.05 How differential parental treatment has been analysed
Previous analyses in the literature exploring the sources of differential parental treatment ask the mother to rate the treatment (positive or negative) she gives to each of two siblings.
The difference between these two treatment scores is then analysed.
This approach has several major limitations…
11.06 The sibling pair difference model, for exploring determinants of differential parenting
$$(y_{1i} - y_{2i}) = \beta_0 + \beta_1x_{1i} + \ldots$$
Where $y_{1i}$ and $y_{2i}$ are parental ratings for siblings 1 and 2 in family i, and $x_{1i}$ is a family level variable, for example family ses.
Problems
• One measurement per family makes it impossible to separate shared and non-shared random effects.
• All information about the magnitude of the response is lost: ratings of (2,4) give the same difference as ratings of (22,24).
• It is not possible to introduce level 1 (non-shared) variables since the data have been aggregated to level 2.
• Family sizes larger than two cannot be handled.
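The loss of magnitude information can be seen with the two rating pairs quoted in the list of problems. This is a small illustrative sketch; the function names are invented for the example.

```python
def sibling_difference(ratings):
    """Difference score used by the sibling pair difference model."""
    first, second = ratings
    return first - second

def family_mean(ratings):
    """Overall level of parenting, which the difference score discards."""
    return sum(ratings) / len(ratings)

low_family = (2, 4)
high_family = (22, 24)

# Both families produce the same difference score...
print(sibling_difference(low_family), sibling_difference(high_family))
# ...although their overall level of parenting differs by 20 points.
print(family_mean(low_family), family_mean(high_family))
```

A multilevel model keeps both ratings as separate level 1 responses, so the level and the within-family difference are both retained.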
11.07 With a multilevel model…
$$y_{ij} = \beta_0 + \beta_1x_{1ij} + \beta_2x_{2j} + u_j + e_{ij}$$
$$u_j \sim N(0, \sigma^2_u), \quad e_{ij} \sim N(0, \sigma^2_e)$$
Where $y_{ij}$ is the j’th mother's rating of her treatment of her i’th child,
$x_{1ij}$ are child level (non-shared) variables and $x_{2j}$ are family level (shared) variables,
$u_j$ and $e_{ij}$ are family and child (shared and non-shared environment) random effects.
Note that the level 1 variance $\sigma^2_e$ is now a measure of differential parenting.
11.08 Advantages of the multilevel approach
• Can handle more than two kids per family
• Unconfounds family and child, allowing estimation of family and child level fixed and random effects
• Can model parenting level and differential parenting in the same model.
11.09 Overall Survey Design
• National Longitudinal Survey of Children and Youth (NLSCY)
• Statistics Canada Survey, representative sample of children across the provinces
• Nested design includes up to 4 children per family
• PMK respondent
• 4-11 year old children
• Criteria: another sibling in the age range, be living with at least one biological parent, 4 years of age or older
• 8,474 children
• 3,860 families
• Families with 4 children = 60, 3 children = 630, 2 children = 3157
11.10 Measures of parental treatment of child
Derived from factor analyses:
• PMK report of positive parenting: frequency of praise of child, talk or play focusing on child, activities enjoyed together (α = .81)
• PMK report of negative parenting: frequency of disapproval, annoyance, anger, mood related punishment (α = .71)
• Will talk today about positive parenting
PMK is the person most knowledgeable about the child.
Child specific factors
• Age
• Gender
• Child position in family
• Negative emotionality
• Biological relatedness to father and mother
Family context factors
• Socioeconomic status
• Family size
• Single parent status
• Marital dissatisfaction
11.12 Model 1: Null Model
$$y_{ij} = \beta_0 + u_j + e_{ij}$$
$$u_j \sim N(0, \sigma^2_u), \quad e_{ij} \sim N(0, \sigma^2_e)$$
$$\hat\beta_0 = 12.51\,(0.04), \quad \hat\sigma^2_u = 5.13\,(0.17), \quad \hat\sigma^2_e = 3.8\,(0.08)$$
The baseline estimate of differential parenting is 3.8. We can now add further shared and non-shared explanatory variables and judge their effect on differential parenting by the reduction in the level 1 variance.
11.13 Model 2 : expanded model
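[Model equation: the response $y_{ij}$ (positive parenting) is regressed, with coefficients indexed 0 to 13, on age, a second age term, girl, notBioM, notBioF, oldestSib, midSib, hses, famsize, loneParent, allGirls, mixedGender, maritalprb and a famsize*age term, with Normally distributed family level random effects $u$ and child level random effects $e_{ij}$.]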
11.14 Positive parenting
Child level predictors
• Strongest predictor of positive parenting is age. Younger siblings get more attention. This relationship is moderated by family membership.
• Non-bio mother and non-bio father reduce positive parenting
• Oldest sibling > youngest sibling > middle siblings
Family level predictors
• Household SES increases positive parenting
• Marital dissatisfaction, increasing family size, mixed or all girl sib-ships all decrease positive parenting
• Lone parenthood has no effect.
11.15 Differential parenting
Modelling age reduced the level 1 variance (our measure of differential parenting) from 3.8 to 2.3, a reduction of 40%. The other explanatory variables, both child specific and family (shared environment), provide no significant reduction in the level 1 variation.
Does this mean that there is no evidence to support the stress/resources hypothesis?
11.16 Testing the stress/resource hypothesis
• The mean and the variance are modelled simultaneously. So far we have modelled the mean in terms of shared environment but not the variance.
• We can elaborate model 2 by allowing the level 1 variance to be a function of the family level variables household socioeconomic status, large family size, and marital conflict. That is
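$$\sigma^2_{ej} = w_0 + w_1\,\text{hses}_j + w_2\,\text{hses}_j^2 + w_3\,\text{marital}_j + w_4\,(\text{maritalprb} \times \text{ses})_j + w_5\,\text{familysize}_j$$
Coefficient estimates (standard errors), magnitudes only:
$\hat{w}_0$: 1.84 (0.1), $\hat{w}_1$: 0.23 (0.04), $\hat{w}_2$: 0.17 (0.07), $\hat{w}_4$: 0.29 (0.13), $\hat{w}_5$: 0.11 (0.05)
The reduction in the deviance, with 7 df, is 78.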
11.17 Graphically …
[Figure: differential parenting (vertical axis, 1 to 5) plotted against household ses (horizontal axis, -2.0 to 2.0), with one curve for each of: family size = 2, no marital problems; family size = 2, marital problems; family size > 2, marital problems; family size > 2, no marital problems.]
11.18 Modelling the mean and variance simultaneously
We show a possible pattern of how the mean, within family variance and between family variance might behave as functions of HSES in the schematic diagram below.
[Figure: schematic diagram of positive parenting (vertical axis) against HSES (horizontal axis).]
Here are 5 families of increasing HSES (in the actual data set there are 3900 families).
We can fit a linear function of SES to the mean.
The family means now vary around the dashed trend line. This is now the between family variation, which is pretty constant with respect to HSES.
However, the within family variation (the measure of differential parenting) decreases with HSES. This supports the stress/resources hypothesis.
12 Multivariate response models
We may have data on two or more responses that we wish to analyse jointly. For example, we may have English and Maths scores on pupils within schools. We can consider the response type as a level below pupil.
S1 S2…
P1 P2 P3 P4….
E M E M E M E M
12.01 Rearranging data
Often data comes like this, with one row per person:

school  pupil  English  Maths
1       1      50       60
1       2      80       70
1       3      50       45
2       4      75       85
2       5      60       40

For MLwiN to analyse the data we require the data matrix to have one row per level 1 unit. That is, one row per response measurement.
x0 is 1 if the response for this record is English, 0 otherwise
x1 is 1 if the response for this record is Maths, 0 otherwise

school  pupil  response  x0  x1
1       1      50        1   0
1       1      60        0   1
1       2      80        1   0
1       2      70        0   1
1       3      50        1   0
1       3      45        0   1
2       4      75        1   0
2       4      85        0   1
2       5      60        1   0
2       5      40        0   1
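The rearrangement from one row per pupil to one row per response can be sketched as follows. This is an illustration in plain Python rather than anything MLwiN provides; the dictionary keys, including `response`, are labels chosen for this example.

```python
# One row per pupil, as in the first table of this section.
wide_rows = [
    {"school": 1, "pupil": 1, "english": 50, "maths": 60},
    {"school": 1, "pupil": 2, "english": 80, "maths": 70},
    {"school": 1, "pupil": 3, "english": 50, "maths": 45},
    {"school": 2, "pupil": 4, "english": 75, "maths": 85},
    {"school": 2, "pupil": 5, "english": 60, "maths": 40},
]

def wide_to_long(rows):
    """Return one record per response, with indicator variables x0 and x1."""
    long_rows = []
    for row in rows:
        # x0 flags an English response, x1 flags a Maths response.
        for subject, x0, x1 in (("english", 1, 0), ("maths", 0, 1)):
            long_rows.append({
                "school": row["school"],
                "pupil": row["pupil"],
                "response": row[subject],
                "x0": x0,
                "x1": x1,
            })
    return long_rows

long_rows = wide_to_long(wide_rows)
for record in long_rows:
    print(record)
```

Each pupil contributes two records, so a pupil with a missing response would simply contribute one.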
12.02 Writing down the model
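$$y_{1j} = \beta_{0j}x_{0ij}, \quad \beta_{0j} = \beta_0 + u_{0j}$$
$$y_{2j} = \beta_{1j}x_{1ij}, \quad \beta_{1j} = \beta_1 + u_{1j}$$
$$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u01} & \sigma^2_{u1} \end{bmatrix}$$
Where $y_{1j}$ is the English score for student j and $y_{2j}$ is the Maths score for student j. The means and variances for English and Maths ($\beta_0$, $\beta_1$, $\sigma^2_{u0}$, $\sigma^2_{u1}$) are estimated. Also the covariance between Maths and English, $\sigma_{u01}$, is estimated.
Note there is no level 1 ($e_{ij}$) variance. This can be seen if we consider the picture for one pupil.
[Figure: picture for one pupil, showing the English and Maths responses with the means $\beta_0$ and $\beta_1$ and the pupil's departures $u_{0j}$ and $u_{1j}$.]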
12.03 Advantages of framing a multivariate response model as a multilevel model
The model has the following advantages over traditional multivariate techniques:
• It can deal with missing responses, provided response data is missing completely at random (MCAR) or missing at random (MAR), that is, missingness is related to explanatory variables in the model.
• Covariates can be added, giving us the conditional covariance matrix of the responses.
• Further levels can be added to the model.
12.04 Example from MLwiN user guide
Pupils have two responses: written and coursework.
mean for written = 46.8
Variance(written) = 178.7
mean for coursework = 73.36
Variance(coursework) = 265.4
covariance(written, coursework) = 102.3
That is, we have two means and a covariance matrix, which we could get from any stats package. However, the data are unbalanced. Of the 1905 pupils, 202 are missing a written response and 180 are missing a coursework response.
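As a worked step that the handout does not state, the estimated covariance matrix implies a correlation between the two responses. Using only the figures quoted above:

```latex
\hat\rho(\text{written}, \text{coursework})
  = \frac{102.3}{\sqrt{178.7 \times 265.4}}
  \approx \frac{102.3}{217.8}
  \approx 0.47
```

So pupils who do well on the written paper tend to do well on coursework, but the association is moderate.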
12.05 Further extensions
We can add further explanatory variables, for example, female. We see that females do better than males on coursework and worse than males on written exams.
We can add further levels.
Here we partition the covariance structure into student and school components.
13.0 MCMC estimation in MLwiN
MCMC estimation is a big topic and is given a pragmatic and cursory treatment here. Interested students are referred to the manual “MCMC estimation in MLwiN” available from
http://www.cmm.bris.ac.uk/mlwin/download/manuals.shtml
In the workshop so far you have been using the IGLS (Iterative Generalised Least Squares) algorithm to estimate the models.
13.1 IGLS versus MCMC
IGLS | MCMC
Fast to compute | Slower to compute
Deterministic convergence, easy to judge | Stochastic convergence, harder to judge
Uses MQL/PQL approximations to fit discrete response models, which can produce biased estimates in some cases | Does not use approximations when estimating discrete response models, estimates are less biased
In samples with small numbers of level 2 units, confidence intervals for level 2 variance parameters assume Normality, which is inaccurate | In samples with small numbers of level 2 units, Normality is not assumed when making inferences for level 2 variance parameters
Cannot incorporate prior information | Can incorporate prior information
Difficult to extend to new models | Easy to extend to new models
13.2 Bayesian framework
MCMC estimation operates in a Bayesian framework. A Bayesian framework requires one to think about the prior information we have on the parameters we are estimating and to formally include that information in the model. We may decide that we are in a state of complete ignorance about the parameters we are estimating, in which case we must specify a so called “uninformative prior”. The “posterior” distribution for a parameter $\theta$, given that we have observed y, is subject to the following rule:
$$p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$$
Where
$p(\theta \mid y)$ is the posterior distribution for $\theta$ given we have observed y
$p(y \mid \theta)$ is the likelihood of observing y given $\theta$
$p(\theta)$ is the probability distribution arising from some statement of prior belief such as “we believe $\theta \sim N(1, 0.01)$”. Note that “we believe $\theta \sim N(1, 1)$” is a much weaker and therefore less influential statement of prior belief.
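The effect of a strong versus a weak prior can be illustrated with the simplest conjugate case. The sketch below assumes, purely for illustration, a Normal likelihood with known variance and five made-up observations; it treats the second argument of N(1, 0.01) as a variance, and none of these choices come from the handout.

```python
from statistics import fmean

def normal_posterior(prior_mean, prior_var, data, data_var):
    """Posterior mean and variance for a Normal mean with known data variance."""
    prior_precision = 1.0 / prior_var
    data_precision = len(data) / data_var
    posterior_var = 1.0 / (prior_precision + data_precision)
    posterior_mean = posterior_var * (
        prior_precision * prior_mean + data_precision * fmean(data)
    )
    return posterior_mean, posterior_var

# Hypothetical observations with sample mean 2.0 and assumed known variance 1.
observations = [1.6, 2.4, 2.0, 1.8, 2.2]

strong_mean, strong_var = normal_posterior(1.0, 0.01, observations, 1.0)
weak_mean, weak_var = normal_posterior(1.0, 1.0, observations, 1.0)

print(round(strong_mean, 3))  # stays close to the prior mean of 1
print(round(weak_mean, 3))    # moves most of the way to the data mean of 2
```

With the N(1, 0.01) prior the posterior mean barely moves from 1, while with the N(1, 1) prior the data dominate.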
13.3 Applying MCMC to multilevel models
In a two level variance components model we have the following unknowns
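$$\beta, \; u, \; \sigma^2_u, \; \sigma^2_e$$
Their joint posterior is
$$p(\beta, u, \sigma^2_u, \sigma^2_e \mid y) \propto p(y \mid \beta, u, \sigma^2_e)\,p(u \mid \sigma^2_u)\,p(\beta)\,p(\sigma^2_u)\,p(\sigma^2_e)$$
• Likelihood, $p(y \mid \beta, u, \sigma^2_e)\,p(u \mid \sigma^2_u)$: “what the data says”, estimated from data
• Prior belief, $p(\beta)\,p(\sigma^2_u)\,p(\sigma^2_e)$: supplied by the researcher
• Posterior, $p(\beta, u, \sigma^2_u, \sigma^2_e \mid y)$: final answers, a combination of likelihood and priors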
13.4 Gibbs sampling
Evaluating the expression for the joint posterior with all the parameters unknown is, for most models, virtually impossible. However, if we take each unknown parameter in turn and temporarily assume we know the values of the other parameters, then we can simulate from the so called “conditional posterior” distribution. The Gibbs sampling algorithm cycles through the following simulation steps. First we assume some starting values for our unknown parameters:
$$\beta_{(0)}, \; u_{(0)}, \; \sigma^2_{u(0)}, \; \sigma^2_{e(0)}$$
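13.5 Gibbs sampling (continued)
Sampling from the following conditional distributions in rotation, firstly
$$p(\beta \mid y, u_{(0)}, \sigma^2_{u(0)}, \sigma^2_{e(0)}) \text{ to get } \beta_{(1)}, \text{ then}$$
$$p(u \mid y, \beta_{(1)}, \sigma^2_{u(0)}, \sigma^2_{e(0)}) \text{ to get } u_{(1)}, \text{ then}$$
$$p(\sigma^2_u \mid y, \beta_{(1)}, u_{(1)}, \sigma^2_{e(0)}) \text{ to get } \sigma^2_{u(1)}, \text{ then finally}$$
$$p(\sigma^2_e \mid y, \beta_{(1)}, u_{(1)}, \sigma^2_{u(1)}) \text{ to get } \sigma^2_{e(1)}$$
We have now updated all the unknowns in the model. This process is repeated many times until eventually we converge on the distribution of each of the unknown parameters.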
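The rotation through the four conditional distributions can be sketched for the two level variance components model $y_{ij} = \beta + u_j + e_{ij}$. This is a minimal illustration, not MLwiN's implementation: it assumes a flat prior on $\beta$ and inverse gamma priors with small parameters (`eps`) on the two variances, and the data are simulated with made-up true values.

```python
import random
import statistics

def simulate_data(n_groups=20, n_per_group=10, beta=5.0, sd_u=1.0, sd_e=2.0, seed=42):
    """Simulate hypothetical two level data: y[j] holds the responses in group j."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_groups):
        u_j = rng.gauss(0.0, sd_u)
        data.append([beta + u_j + rng.gauss(0.0, sd_e) for _ in range(n_per_group)])
    return data

def gibbs_variance_components(y, n_iter=2000, burn_in=500, seed=1, eps=0.001):
    """Gibbs sampler for y_ij = beta + u_j + e_ij (illustrative priors, see lead-in)."""
    rng = random.Random(seed)
    n_groups = len(y)
    n_total = sum(len(group) for group in y)
    # Starting values for the unknowns.
    beta = statistics.fmean(v for group in y for v in group)
    u = [0.0] * n_groups
    var_u, var_e = 1.0, 1.0
    chain = []
    for iteration in range(n_iter):
        # 1. beta given u and var_e (flat prior): Normal around the mean of y - u.
        centre = sum(v - u[j] for j, group in enumerate(y) for v in group) / n_total
        beta = rng.gauss(centre, (var_e / n_total) ** 0.5)
        # 2. each u_j given beta, var_u and var_e: precision-weighted Normal.
        for j, group in enumerate(y):
            precision = len(group) / var_e + 1.0 / var_u
            mean = sum(v - beta for v in group) / var_e / precision
            u[j] = rng.gauss(mean, precision ** -0.5)
        # 3. var_u given u: inverse gamma, drawn as 1 / gamma(shape, scale).
        ss_u = sum(u_j * u_j for u_j in u)
        var_u = 1.0 / rng.gammavariate(eps + n_groups / 2.0, 1.0 / (eps + ss_u / 2.0))
        # 4. var_e given beta and u: inverse gamma on the level 1 residuals.
        ss_e = sum((v - beta - u[j]) ** 2 for j, group in enumerate(y) for v in group)
        var_e = 1.0 / rng.gammavariate(eps + n_total / 2.0, 1.0 / (eps + ss_e / 2.0))
        if iteration >= burn_in:
            chain.append((beta, var_u, var_e))
    return chain

y = simulate_data()
chain = gibbs_variance_components(y)
beta_mean = statistics.fmean(draw[0] for draw in chain)
var_u_mean = statistics.fmean(draw[1] for draw in chain)
var_e_mean = statistics.fmean(draw[2] for draw in chain)
print(round(beta_mean, 2), round(var_u_mean, 2), round(var_e_mean, 2))
```

Parameter estimates and intervals are then summaries of the stored chain, for example the mean and the 2.5th and 97.5th percentiles of each column.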
13.6 IGLS vs MCMC convergence
The IGLS algorithm converges deterministically to a point estimate.
The MCMC algorithm converges on a distribution. Parameter estimates and intervals are then calculated from the simulation chains.
13.7 Other MCMC issues
By default MLwiN uses flat, uninformative priors: see page 5 of MCMC estimation in MLwiN (MEM).
For specifying informative priors see chapter 6 of MEM.
For model comparison in MCMC using the DIC statistic see chapters 3 and 4 MEM.
For description of MCMC algorithms used in MLwiN see chapter 2 of MEM.
13.8 When to consider using MCMC in MLwiN
Some of the more advanced models in MLwiN are only available in MCMC. For example, factor analysis (chapter 19), measurement error in predictor variables (chapter 14) and CAR spatial models (chapter 16).
Other models can be fitted in IGLS but are handled more easily in MCMC, such as multiple imputation (chapter 17), cross-classified (chapter 14) and multiple membership models (chapter 15).
If you have discrete response data: binary, binomial, multinomial or Poisson (chapters 11, 12, 20 and 21). Often PQL gives quick and accurate estimates for these models. However, it is a good idea to check against MCMC to test for bias in the PQL estimates.
If you have few level 2 units and you want to make accurate inferences about the distribution of higher level variances.
All chapter references are to MCMC estimation in MLwiN.
14.0 Generalised Multilevel Models 1 : Binary Responses and Proportions
14.1 Generalised multilevel models
• So far: the response at level 1 has been a continuous variable and the associated level 1 random term has been assumed to have a Normal distribution
• Now: a range of other data types for the response. All can be handled routinely by MLwiN
• Achieved by 2 aspects:
– a non-linear link between response and predictors
– a non-Gaussian level 1 distribution
14.2 Typology of discrete responses
Response | Example | Model
Binary categorical | Yes/No | Logit or probit or log-log model with binomial L1 random term
Proportion | Proportion unemployed | Logit etc. with binomial L1 random term
Multiple categories | Travel by train, car, foot | Logit model with ordered or unordered multinomial random term
Count | No of crimes in area | Log model with L1 Poisson random term
Count | LOS | Log model with L1 NBD random term
14.3 Focus on modelling proportions
• Proportions, eg death rate or employment rate, can be conceived as the underlying probability of dying or the probability of being employed
• Four important attributes of a proportion that MUST be taken into account in modelling
(1) Closed range: bounded between 0 and 1
(2) Anticipated non-linearity between response and predictors: as the predicted response approaches the bounds, a greater and greater change in x is required to achieve the same change in outcome (examination analogy)
(3) Two numbers: the numerator is a subset of the denominator
(4) Heterogeneity: variance is not homoscedastic; two aspects
(a) the variance depends on the mean: as the proportion approaches the bounds of 0 and 1 there is less room to vary, ie the variance is a function of the predicted probability
(b) the variance depends on the denominator: small denominators result in highly variable proportions
14.4 Modelling Proportions
• Linear probability model: that is, use a standard regression model with a linear relationship and Gaussian random term
• But 3 problems
(1) Nonsensical predictions: predicted proportions are unbounded, and can fall outside the range of 0 and 1
(2) Anticipated non-linearity as the proportion approaches the bounds
(3) Heterogeneity: inherent unequal variance, dependent on the mean and on the denominator
• Logit model with Binomial random term resolves all three problems (could use probit or complementary log-log)
14.5 The logistic model: resolves problems 1 & 2
• Models not the proportion but a non-linear transformation of it (solves problems 1+2)
• The relationship between the probability and predictor(s) can be represented by a logistic function, which resembles an S-shaped curve
• L = log_e(p/(1-p))
• L = Logit = the log of the odds
• p = proportion having an attribute
• 1-p = proportion not having the attribute
• p/(1-p) = the odds of having an attribute compared to not having an attribute
• As p goes from 0 to 1, L goes from minus to plus infinity, so if we model L, we cannot get predicted proportions that lie outside 0 and 1 (ie solves problem 1)
• Easy to move between proportions, odds and logits
14.6 The Logit transformation
14.7 Proportions, Odds and Logits
   Proportion/Probability   Odds
A  5 out of 10              5 to 5
B  6 out of 10              6 to 4
C  8 out of 10              8 to 2

   Proportion (p)   Odds (p/(1-p))   Log of odds, log_e(p/(1-p))
A  0.5              1.0              0
B  0.6              1.5              0.41
C  0.8              4                1.39

   Logit to odds
A  e^0 = 1.0
B  e^0.41 = 1.5
C  e^1.39 = 4

   Logit to proportion
A  e^0/(1 + e^0) = 0.5
B  e^0.41/(1 + e^0.41) = 0.6
C  e^1.39/(1 + e^1.39) = 0.8
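Moving between proportions, odds and logits is a one-line calculation in each direction. The sketch below reproduces the rows A, B and C; the function names are chosen for this example.

```python
import math

def odds(p):
    """Odds of having the attribute: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Log of the odds."""
    return math.log(odds(p))

def antilogit(value):
    """Back-transform a logit to a proportion: e^L / (1 + e^L)."""
    return math.exp(value) / (1.0 + math.exp(value))

for label, p in (("A", 0.5), ("B", 0.6), ("C", 0.8)):
    print(label, p, round(odds(p), 2), round(logit(p), 2), round(antilogit(logit(p)), 2))
```

The round trip `antilogit(logit(p))` returns the original proportion, which is why predictions made on the logit scale always map back into the range 0 to 1.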
14.8 The logistic model
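The underlying probability or proportion is non-linearly related to the predictor:
$$\pi = \frac{e^{\beta_0 + \beta_1x_1}}{1 + e^{\beta_0 + \beta_1x_1}}$$
where e is the base of the natural logarithm
• linearized by the logit transformation (log = natural logarithm)
$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1x_1$$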
14.9 The logistic model: key characteristics
• The logit transformation produces a linear function of the parameters.
• Bounded between 0 and 1
• Thereby solving problems 1 and 2
14.10 Solving problem 3:assume Binomial variation
• Variance of the response in logistic models is presumed to be binomial:
$$\operatorname{Var}(y_i \mid \pi_i) = \frac{\pi_i(1 - \pi_i)}{n_i}$$
ie it depends on the underlying proportion and the denominator
• In practice this is achieved by replacing the constant variable at level 1 by a binomial weight, z, and constraining the level-1 variance to 1 for exact binomial variation
• The random (level-1) component can be written as
$$y_i = \hat\pi_i + e_iz_i, \quad z_i = \sqrt{\frac{\hat\pi_i(1 - \hat\pi_i)}{n_i}}, \quad \sigma^2_e = 1$$
14.11 Multilevel Logistic Model
• Assume the observed response comes from a Binomial distribution with a denominator for each cell, and an underlying probability/proportion
$$y_{ij} \sim \text{Binomial}(n_{ij}, \pi_{ij})$$
• Underlying proportions/probabilities, in turn, are related to a set of individual and neighbourhood predictors by the logit link function
• Linear predictor of the fixed part and the higher-level random part
$$\text{logit}(\pi_{ij}) = \ln\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \beta_0 + \beta_1x_{1ij} + \beta_2x_{2ij} + \beta_3x_{3ij} + u_{0j}$$
14.12 Estimation 1
• Quasi-likelihood (MQL/PQL, 1st and 2nd order)
– model linearised and IGLS applied
– 1st or 2nd order Taylor series expansion (to linearise the non-linear model)
– MQL versus PQL: whether higher-level effects are included in the linearisation
– MQL1 is the crudest approximation. Estimates may be biased downwards (especially if the within cluster sample size is small and the between cluster variance is large, eg households). But stable.
– PQL2 is the best approximation, but may not converge.
– Tip: start with MQL1 to get starting values for PQL.
14.13 Estimation 2
• MCMC methods: give the deviance of the model (DIC) for sequential model testing, and good quality estimates even where cluster size is small; start with MQL1 and then switch to MCMC
14.14 Variance Partition Coefficient
For a 2-level Normal response random intercept model:
$$\text{VPC} = \frac{\text{Level 2 variance}}{\text{Level 1 variance} + \text{Level 2 variance}}$$
For a binary response:
$$y_{ij} \sim \text{Binomial}(1, \pi_{ij}), \quad \text{logit}(\pi_{ij} \mid x_{ij}, u_j) = \beta_0 + \beta_1x_{1ij} + u_j$$
$$\operatorname{Var}(u_j) = \sigma^2_u$$
$$\operatorname{var}(y_{ij} - \pi_{ij}) = \pi_{ij}(1 - \pi_{ij})$$
The level 1 variance is a function of the predicted probability.
The level 2 variance $\sigma^2_u$ is on the logit scale and the level 1 variance $\operatorname{var}(y_{ij} - \pi_{ij})$ is on the probability scale, so they cannot be directly compared. Also the level 1 variance depends on $\pi_{ij}$ and therefore on $x_{1ij}$.
Possible solutions include i) set the level 1 variance = variance of a standard logistic distribution; ii) simulation method
14.15 VPC 1: Threshold Model
Formulate the logit model as:
$$y^*_{ij} = x_{ij}^T\beta + u_j + \epsilon_{ij}$$
where $y^*_{ij}$ is a continuous latent variable underlying $y_{ij}$, and $\epsilon_{ij}$ has a standard logistic distribution with variance $\pi^2/3 \approx 3.29$.
Then
$$\text{VPC} = \frac{\sigma^2_u}{\sigma^2_u + 3.29}$$
But this ignores the fact that the level 1 variance is not constant, but is a function of the mean probability, which depends on the predictors in the fixed part of the model.
14.16 VPC 2: Simulation Method
(i) Generate M values for the random effect u from $N(0, \hat\sigma^2_u)$: $u_{(1)}, u_{(2)}, \ldots, u_{(M)}$, say 5000 group-level logit values
(ii) For m = 1,…,M compute (for any chosen value x*):
$$\pi^*_{(m)} = [1 + \exp(-(\hat\beta^Tx^* + u_{(m)}))]^{-1} \quad \text{and} \quad v^*_{1(m)} = \pi^*_{(m)}(1 - \pi^*_{(m)})$$
(iii) The level 1 variance is the mean of $v^*_{1(m)}$ (m = 1,…,M) and the level 2 variance is the variance of $\pi^*_{(m)}$; then use the ordinary VPC
14.17 Multilevel modelling of binary data
• Exactly the same as proportions except
• The response is either 1 or 0
• The denominator is a set of 1’s
• So that a ‘Yes’ is 1 out of 1 , while a ‘No’ is 0 out of 1
14.18 Chapter 9 of Manual: Contraceptive Use in Bangladesh
• 2867 women nested in 60 districts
• y=1 if using contraception at time of survey, y=0 if not using contraception
• Covariates: age (mean centred), urban residence (vs. rural)
14.19 Random Intercept Model: PQL2
Parameter | Estimate (SE)
Fixed |
$\beta_0$ | -0.69 (0.08)
$\beta_1$ (urban) | 0.71 (0.10)
$\beta_2$ (age) | 0.015 (0.004)
Random (between-district) |
$\sigma^2_{u0}$ | 0.21 (0.06)
14.20 Variance Partition Coefficient
Threshold model approach: 0.21/(0.21+3.29) = 0.060
Simulation approach (M=5000, mean age):
Urban: 0.050
Rural: 0.045
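Both VPC calculations can be sketched in a few lines. The code below is an illustration, not the MLwiN macro: it plugs in the random intercept estimates quoted in 14.19 (intercept -0.69, urban 0.71, between-district variance 0.21) at mean age, and the random seed is an arbitrary choice, so the simulated values will only approximately match those quoted.

```python
import math
import random
import statistics

def antilogit(value):
    return 1.0 / (1.0 + math.exp(-value))

def vpc_threshold(var_u):
    """Threshold model VPC: level 1 variance fixed at pi^2 / 3 (about 3.29)."""
    return var_u / (var_u + math.pi ** 2 / 3.0)

def vpc_simulation(linear_predictor, var_u, m=5000, seed=1):
    """Simulation VPC for one chosen value of the fixed part of the model."""
    rng = random.Random(seed)
    sd_u = math.sqrt(var_u)
    # Steps (i) and (ii): simulate group effects and convert to probabilities.
    probabilities = [antilogit(linear_predictor + rng.gauss(0.0, sd_u)) for _ in range(m)]
    # Step (iii): level 1 variance is the mean of p(1-p), level 2 the variance of p.
    level1 = statistics.fmean(p * (1.0 - p) for p in probabilities)
    level2 = statistics.pvariance(probabilities)
    return level2 / (level1 + level2)

BETA_0, BETA_URBAN, VAR_U = -0.69, 0.71, 0.21  # estimates from 14.19, age at its mean

threshold_vpc = vpc_threshold(VAR_U)
rural_vpc = vpc_simulation(BETA_0, VAR_U)
urban_vpc = vpc_simulation(BETA_0 + BETA_URBAN, VAR_U)
print(round(threshold_vpc, 3), round(urban_vpc, 3), round(rural_vpc, 3))
```

The simulation VPC differs between urban and rural women because the level 1 variance depends on the predicted probability, which is the point the threshold approach ignores.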
14.21 MLwiN Gives
• MLwiN gives UNIT (or subject) SPECIFIC estimates: the fixed effects conditional on the higher level unit random effects
• NOT the POPULATION-AVERAGE estimates, ie the marginal expectation of the dependent variable across the population, "averaged" across the random effects
• In non-linear models these are different, and the PA estimates will generally be smaller in magnitude than the US estimates, especially as the size of the random effects grows
• Can derive PA from US but not vice versa (next version will give both)
14.22 Unit specific / Population average
• Probability of adverse reaction against dose
• Left: subject-specific curves; big differences between subjects for the middle dose (the between-patient variance is large)
• Right: the population average dose response curve
• Subject-specific curves have a steeper slope in the middle range of the dose variable
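Deriving a population-average probability from unit-specific estimates amounts to averaging the logistic curve over the distribution of the random effect. The sketch below does this by simple numerical integration; the linear predictor values and the variance of 4 are hypothetical, chosen only to make the attenuation visible.

```python
import math

def antilogit(value):
    return 1.0 / (1.0 + math.exp(-value))

def population_average_probability(linear_predictor, var_u, n_points=2001, width=8.0):
    """Average antilogit(linear_predictor + u) over u ~ N(0, var_u) (trapezoid rule)."""
    sd_u = math.sqrt(var_u)
    step = 2.0 * width * sd_u / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        u = -width * sd_u + k * step
        end_weight = 0.5 if k in (0, n_points - 1) else 1.0
        density = math.exp(-0.5 * (u / sd_u) ** 2) / (sd_u * math.sqrt(2.0 * math.pi))
        total += end_weight * antilogit(linear_predictor + u) * density * step
    return total

VAR_U = 4.0  # hypothetical, large between-subject variance
for eta in (-2.0, 0.0, 2.0):
    unit_specific = antilogit(eta)  # probability for a subject with u = 0
    population_average = population_average_probability(eta, VAR_U)
    print(eta, round(unit_specific, 3), round(population_average, 3))
```

The population-average probabilities are pulled towards 0.5 relative to the unit-specific ones, which is why the population-average curve is flatter.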
15.0 Multilevel Multinomial Models
Logistic models handle the situation where we have a binary response (two response categories, eg alive/dead or pass/fail).
Where we have a response variable with more than two categories we use multinomial models.
Two types of multinomial response:
Ordered: attitude scales (strongly disagree...strongly agree) or exam grades.
Unordered: eg voting preference (Labour, Tory, Lib Dem, other) or cause of death.
First we deal with unordered multinomial responses.
15.1 Extending a binary to a multinomial model
Take a binary variable ($y_i$) which is 1 if an individual votes Tory, 0 otherwise.
The underlying probability of individual i voting Tory is $\pi_i$.
We model the log odds of voting Tory as a function of explanatory variables:
$$\log[\pi_i/(1 - \pi_i)] = \beta_0 + \beta_1x_{1i} + \ldots \qquad (1)$$
Let's call $\pi_i = \pi_{1i}$ = prob of individual i voting Tory and
$\pi_{2i} = (1 - \pi_{1i})$ = prob of individual i not voting Tory.
We can now write (1) as
$$\log[\pi_{1i}/\pi_{2i}] = \beta_0 + \beta_1x_{1i} + \ldots$$
15.2 Moving to more than two response categories
Suppose now that $y_i$ can take three values {1,2,3}: vote Tory, vote Labour, vote Lib Dem.
$\pi_{1i}$ is the probability that individual i votes Tory
$\pi_{2i}$ is the probability that individual i votes Labour
$\pi_{3i}$ is the probability that individual i votes Lib Dem
Now we must choose a reference category, say vote Lib Dem, and model the log odds of all remaining categories against the reference category. Therefore with t categories we need t-1 equations to model this set of log odds ratios. In our case
$$\log[\pi_{1i}/\pi_{3i}] = \beta_0 + \beta_1x_{1i} + \ldots$$
$$\log[\pi_{2i}/\pi_{3i}] = \beta_2 + \beta_3x_{1i} + \ldots$$
15.3 Notation
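The MLwiN software uses the notation
$$\log[\pi_{1i}/\pi_{3i}] = \beta_0 + \beta_1x_{1i} + \ldots$$
$$\log[\pi_{2i}/\pi_{3i}] = \beta_2 + \beta_3x_{1i} + \ldots$$
Often in papers you will see the more succinct notational form
$$\log[\pi_i^{(s)}/\pi_i^{(t)}] = \beta_0^{(s)} + \beta_1^{(s)}x_{1i}, \quad s = 1, \ldots, t-1$$
Which becomes
For s = 1
$$\log[\pi_i^{(1)}/\pi_i^{(3)}] = \beta_0^{(1)} + \beta_1^{(1)}x_{1i} + \ldots$$
For s = 2
$$\log[\pi_i^{(2)}/\pi_i^{(3)}] = \beta_0^{(2)} + \beta_1^{(2)}x_{1i} + \ldots$$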
15.4 Interpretation(odds ratios)
$$\log[\pi_{1i}/\pi_{3i}] = \beta_0 + \beta_1x_{1i} + \ldots$$
$$\log[\pi_{2i}/\pi_{3i}] = \beta_2 + \beta_3x_{1i} + \ldots$$
We can interpret as with logistic regression. In the political example, {1,2,3} is vote Tory, vote Labour, vote Lib Dem.
$\beta_1$ is the change in the log odds of voting Tory as opposed to Lib Dem for a 1 unit increase in $x_{1i}$.
$\beta_3$ is the change in the log odds of voting Labour as opposed to Lib Dem for a 1 unit increase in $x_{1i}$.
$\exp(\beta_1)$ and $\exp(\beta_3)$ give odds ratios.
15.5 Interpretation(probabilities)
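$$\log[\pi_{1i}/\pi_{3i}] = \beta_0 + \beta_1x_{1i} + \ldots$$
$$\log[\pi_{2i}/\pi_{3i}] = \beta_2 + \beta_3x_{1i} + \ldots$$
Probability of voting Tory for individual i
$$\pi_{1i} = \frac{e^{(\beta_0 + \beta_1x_{1i})}}{1 + e^{(\beta_0 + \beta_1x_{1i})} + e^{(\beta_2 + \beta_3x_{1i})}}$$
Probability of voting Labour for individual i
$$\pi_{2i} = \frac{e^{(\beta_2 + \beta_3x_{1i})}}{1 + e^{(\beta_0 + \beta_1x_{1i})} + e^{(\beta_2 + \beta_3x_{1i})}}$$
Probability of voting Lib Dem for individual i
$$\pi_{3i} = 1 - \pi_{1i} - \pi_{2i}$$
Or in general notation
$$\pi_i^{(s)} = \frac{e^{(\beta_0^{(s)} + \beta_1^{(s)}x_{1i})}}{1 + \sum_{k=1}^{t-1} e^{(\beta_0^{(k)} + \beta_1^{(k)}x_{1i})}}, \qquad \pi_i^{(t)} = 1 - \sum_{k=1}^{t-1}\pi_i^{(k)}$$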
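The conversion from the t-1 log odds equations to category probabilities can be sketched as below. The coefficient values are hypothetical, chosen only to show the calculation for the three party example with Lib Dem as the reference category.

```python
import math

def multinomial_probabilities(linear_predictors):
    """Probabilities for t categories from t-1 log odds against the reference.

    linear_predictors holds one value per non-reference category; the
    reference category's probability is returned last.
    """
    exponentials = [math.exp(value) for value in linear_predictors]
    denominator = 1.0 + sum(exponentials)
    probabilities = [value / denominator for value in exponentials]
    probabilities.append(1.0 / denominator)
    return probabilities

# Hypothetical coefficients: (beta0, beta1) for Tory, (beta2, beta3) for Labour.
beta0, beta1, beta2, beta3 = 0.4, 0.5, 0.1, -0.3
x1 = 1.0
tory, labour, lib_dem = multinomial_probabilities([beta0 + beta1 * x1, beta2 + beta3 * x1])
print(round(tory, 3), round(labour, 3), round(lib_dem, 3))
```

Taking the log of `tory / lib_dem` recovers the first linear predictor, which confirms that the probabilities are consistent with the log odds equations.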
15.6 Multilevel Multinomial models
Suppose the individuals in the voting example are clustered into constituencies and we wish to include constituency effects in our model. We include intercept level residuals for each log odds equation in our model
$$\log[\pi_{1ij}/\pi_{3ij}] = \beta_0 + \beta_1x_{1ij} + u_{0j}$$
$$\log[\pi_{2ij}/\pi_{3ij}] = \beta_2 + \beta_3x_{1ij} + u_{2j}$$
$u_{0j}$ is the effect of constituency j on the log odds of voting Tory as opposed to Lib Dem. So if $u_{0j}$ is 1, the log odds of voting Tory as opposed to Lib Dem increase by 1 compared to a constituency where $u_{0j} = 0$ (the average constituency).
Likewise $u_{2j}$ is the effect of constituency j on the log odds of voting Labour as opposed to Lib Dem.
15.7 Variance of level 2 random effects
$$\log[\pi_{1ij}/\pi_{3ij}] = \beta_0 + \beta_1x_{1ij} + u_{0j}$$
$$\log[\pi_{2ij}/\pi_{3ij}] = \beta_2 + \beta_3x_{1ij} + u_{2j}$$
$$\begin{bmatrix} u_{0j} \\ u_{2j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u02} & \sigma^2_{u2} \end{bmatrix}$$
$\sigma^2_{u0}$ is the between constituency variance of the vote Tory : Lib Dem log odds ratio
$\sigma^2_{u2}$ is the between constituency variance of the vote Labour : Lib Dem log odds ratio
$\sigma_{u02}$ is the constituency level covariance between the Tory and Labour constituency level effects. A negative covariance means that in constituencies where Labour do well relative to the Lib Dems, the Tories tend to do badly relative to the Lib Dems, and vice versa.
16.0 Ordered categorical data
Where there is an underlying ordering to the categories a convenient parameterisation is to work with cumulative probabilities that an individual crosses a threshold. For example, with exam grades
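Grade | Probability | Threshold | Cumulative probability
D | $\pi_{1i}$ | D | $\gamma_{1i} = \pi_{1i}$
C | $\pi_{2i}$ | C: (C,D) | $\gamma_{2i} = \pi_{1i} + \pi_{2i}$
B | $\pi_{3i}$ | B: (B,C,D) | $\gamma_{3i} = \pi_{1i} + \pi_{2i} + \pi_{3i}$
A | $\pi_{4i}$ | A: (A,B,C,D) | $\gamma_{4i} = \pi_{1i} + \pi_{2i} + \pi_{3i} + \pi_{4i} = 1$
With an ordered multinomial we work with the set of cumulative probabilities $\gamma_{ki}$. As before, with t categories the model has t-1 threshold categories.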
16.1 Writing the ordered multinomial model
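$$\log(\gamma_{1i}/(1 - \gamma_{1i})) = \beta_0 \quad \text{(log odds of D)}$$
$$\log(\gamma_{2i}/(1 - \gamma_{2i})) = \beta_1 \quad \text{(log odds of C or below)}$$
$$\log(\gamma_{3i}/(1 - \gamma_{3i})) = \beta_2 \quad \text{(log odds of B or below)}$$
The threshold probabilities $\gamma_{ki}$ are given by antilogit($\beta_k$).
We must have $\beta_0 < \beta_1 < \beta_2$ to ensure $\gamma_{1i} < \gamma_{2i} < \gamma_{3i}$.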
16.2 Adding covariates to the model
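$$\log(\gamma_{1i}/(1 - \gamma_{1i})) = \beta_0 + h_i \quad \text{(log odds of D)}$$
$$\log(\gamma_{2i}/(1 - \gamma_{2i})) = \beta_1 + h_i \quad \text{(log odds of C or below)}$$
$$\log(\gamma_{3i}/(1 - \gamma_{3i})) = \beta_2 + h_i \quad \text{(log odds of B or below)}$$
$$h_i = \beta_3x_{1i} + \ldots$$
Note that the covariates $h_i$ are the same for each of the response threshold categories.
[Figure: log odds (vertical axis) against $x_i$ (horizontal axis), showing three parallel lines $\beta_0 + \beta_3x_{1i}$ (log odds of D), $\beta_1 + \beta_3x_{1i}$ (log odds of C or below) and $\beta_2 + \beta_3x_{1i}$ (log odds of B or below).]
This means that the log odds ratios and odds ratios for threshold category membership are independent of the predictor variables.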
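The step from threshold log odds to grade probabilities can be sketched as follows. The threshold values and the covariate effect used here are hypothetical; the point is only to show how the cumulative probabilities are differenced and that a shift in the covariate part moves every threshold by the same amount on the logit scale.

```python
import math

def antilogit(value):
    return 1.0 / (1.0 + math.exp(-value))

def logit(p):
    return math.log(p / (1.0 - p))

def ordered_probabilities(thresholds, h=0.0):
    """Cumulative and category probabilities for an ordered multinomial model.

    thresholds must be increasing (one per threshold category); h is the
    covariate contribution shared by every threshold equation.
    """
    cumulative = [antilogit(threshold + h) for threshold in thresholds]
    categories = [cumulative[0]]
    categories += [cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))]
    categories.append(1.0 - cumulative[-1])  # the top category takes the remainder
    return cumulative, categories

THRESHOLDS = (-1.0, 0.5, 2.0)  # hypothetical thresholds for D, C or below, B or below

baseline_cumulative, baseline_grades = ordered_probabilities(THRESHOLDS)
shifted_cumulative, shifted_grades = ordered_probabilities(THRESHOLDS, h=1.0)
print([round(p, 3) for p in baseline_grades])  # probabilities of D, C, B, A
print([round(p, 3) for p in shifted_grades])
```

The difference in log odds between the shifted and baseline cumulative probabilities equals `h` at every threshold, which is the proportional odds property.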
16.3 Proportional odds models
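So far we have assumed that the odds ratios of response category membership remain constant with respect to the predictor variables. This is known as the proportional odds assumption.
We can test the assumption that the odds ratios of response category membership are independent of the predictor variables by fitting:
$$\log(\gamma_{1i}/(1 - \gamma_{1i})) = \beta_0 + \beta_3x_{1i} \quad \text{(log odds of D)}$$
$$\log(\gamma_{2i}/(1 - \gamma_{2i})) = \beta_1 + \beta_4x_{1i} \quad \text{(log odds of C or below)}$$
$$\log(\gamma_{3i}/(1 - \gamma_{3i})) = \beta_2 + \beta_5x_{1i} \quad \text{(log odds of B or below)}$$
Now if our assumptions are correct, $\beta_3$, $\beta_4$ and $\beta_5$ will be very similar. We can formally test this using the intervals and tests window.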
16.4 Multilevel ordered multinomial models
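$$\log(\gamma_{1ij}/(1 - \gamma_{1ij})) = \beta_0 + h_{ij} \quad \text{(log odds of D)}$$
$$\log(\gamma_{2ij}/(1 - \gamma_{2ij})) = \beta_1 + h_{ij} \quad \text{(log odds of C or below)}$$
$$\log(\gamma_{3ij}/(1 - \gamma_{3ij})) = \beta_2 + h_{ij} \quad \text{(log odds of B or below)}$$
$$h_{ij} = \beta_3x_{1ij} + u_{0j}$$
$u_{0j}$ is a random effect for school j, which shifts all the threshold probabilities equally for all kids in school j. Again odds ratios for category membership are independent of $u_{0j}$.
[Figure: log odds (vertical axis) against $x_i$ (horizontal axis), showing the line $\beta_k + \beta_3x_{1i}$ with parallel lines $\beta_k + \beta_3x_{1i} + u_{0j}$ above it for positive $u_{0j}$ and below it for negative $u_{0j}$.]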
16.5 Higher level variances
$$u_{0j} \sim N(0, \sigma^2_{u0})$$
The greater $\sigma^2_{u0}$, the greater the variability in the school level shifts in the response threshold probabilities.
17 Non-hierarchical multilevel models
Two types :
•Cross-classified models
•Multiple membership models
17.01 Cross-classification
For example, hospitals by neighbourhoods. Hospitals will draw patients from many different neighbourhoods and the inhabitants of a neighbourhood will go to many hospitals. No pure hierarchy can be found and patients are said to be contained within a cross-classification of hospitals by neighbourhoods :
nbhd 1 nbhd 2 Nbhd 3
hospital 1 xx x
hospital 2 x x
hospital 3 xx x
hospital 4 x xxx
Hospital H1 H2 H3 H4
Patient P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12
Nbhd N1 N2 N3
17.02 Other examples of cross-classifications
• pupils within primary schools by secondary schools
• patients within GPs by hospitals
• interviewees within interviewers by surveys
• repeated measures within raters by individual(e.g. patients by nurses)
17.03 Notation
With hierarchical models we have subscript notation that has one subscript per level, and nesting is implied reading from the left. For example, subscript pattern ijk denotes the i’th level 1 unit within the j’th level 2 unit within the k’th level 3 unit.
If models become cross-classified we use the term classification instead of level. Notation that has one subscript per classification, and that also captures the relationship between classifications, can become very cumbersome. We propose an alternative notation that only has a single subscript no matter how many classifications are in the model.
17.04 Single subscript notation
Hospital H1 H2 H3 H4
Patient P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12
Nbhd N1 N2 N3
i   nbhd(i)  hosp(i)
1   1        1
2   2        1
3   1        1
4   2        2
5   1        2
6   2        2
7   2        3
8   3        3
9   3        4
10  2        4
11  3        4
12  3        4

We write the model as
$$y_i = \beta_0 + u^{(2)}_{\text{nbhd}(i)} + u^{(3)}_{\text{hosp}(i)} + e_i \qquad (1)$$
Where classification 2 is nbhd and classification 3 is hospital. Classification 1 always corresponds to the classification at which the response measurements are made, in this case patients. For patients 1 and 11 equation (1) becomes:
$$y_1 = \beta_0 + u^{(2)}_1 + u^{(3)}_1 + e_1$$
$$y_{11} = \beta_0 + u^{(2)}_3 + u^{(3)}_4 + e_{11}$$
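The single subscript notation is essentially a pair of lookup functions, nbhd(i) and hosp(i). The sketch below encodes the twelve patients from the table above; the intercept and the random effect values are hypothetical numbers used only to show how the linear predictor for a patient is assembled.

```python
# Lookup tables for the 12 patients: patient i -> neighbourhood and hospital.
NBHD = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 2, 8: 3, 9: 3, 10: 2, 11: 3, 12: 3}
HOSP = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 3, 8: 3, 9: 4, 10: 4, 11: 4, 12: 4}

def cross_classified_prediction(i, beta0, u_nbhd, u_hosp):
    """beta0 + u(2)_nbhd(i) + u(3)_hosp(i): equation (1) without the level 1 residual."""
    return beta0 + u_nbhd[NBHD[i]] + u_hosp[HOSP[i]]

# Hypothetical values for the intercept and the classification effects.
beta0 = 10.0
u_nbhd = {1: 0.5, 2: -0.2, 3: 0.1}
u_hosp = {1: 1.0, 2: -1.0, 3: 0.3, 4: -0.4}

for patient in (1, 11):
    print(patient, NBHD[patient], HOSP[patient],
          cross_classified_prediction(patient, beta0, u_nbhd, u_hosp))
```

Adding a further classification only requires another lookup table, which is why the notation stays compact however many classifications the model contains.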
17.05 Classification diagrams
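Classification diagram (nested): Patient → Hospital → Neighbourhood. Nested structure where hospitals are contained within neighbourhoods.
Classification diagram (crossed): Patient → Hospital and Patient → Neighbourhood, with no arrow between Hospital and Neighbourhood. Cross-classified structure where patients from a hospital come from many neighbourhoods and people from a neighbourhood attend several hospitals.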
In the single subscript notation we lose information about the relationship (crossed or nested) between classifications. A useful way of conveying this information is the classification diagram, which has one node per classification: nodes linked by arrows have a nested relationship and unlinked nodes have a crossed relationship.
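Whether two classifications are nested or crossed can also be checked directly from the membership columns. The sketch below uses the patient, neighbourhood and hospital memberships from section 17.04; the helper names `is_nested` and `relationship` are illustrative, not part of any package.

```python
from collections import defaultdict


def is_nested(lower, higher):
    """True if every unit of `lower` occurs with exactly one unit of `higher`."""
    partners = defaultdict(set)
    for lo, hi in zip(lower, higher):
        partners[lo].add(hi)
    return all(len(units) == 1 for units in partners.values())


def relationship(a, b):
    """Classify two membership columns as 'nested' or 'crossed'."""
    return "nested" if is_nested(a, b) or is_nested(b, a) else "crossed"


# One entry per patient i = 1..12 (memberships from section 17.04).
patient = list(range(1, 13))
nbhd = [1, 2, 1, 2, 1, 2, 2, 3, 3, 2, 3, 3]
hosp = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4]

print(relationship(patient, hosp))  # each patient attends one hospital
print(relationship(hosp, nbhd))     # hospital 1 draws from nbhd 1 and 2
```

The second call reports a crossed relationship because hospital 1 draws patients from neighbourhoods 1 and 2, and neighbourhood 1 sends patients to hospitals 1 and 2.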
17.06 Data example : Artificial insemination by donor
1901 women
279 donors
1328 donations
12100 ovulatory cycles
Response is whether conception occurs in a given cycle.

In terms of a unit diagram: women w1, w2, w3, each with cycles c1, c2, c3, c4…; donors m1, m2, m3 with donations d1 d2, d1 d2 d3 and d1 d2 respectively.

Or a classification diagram: Cycle → Woman; Cycle → Donation → Donor.
17.07 Model for artificial insemination data
We can write the model as

$$y_i \sim \mathrm{Binomial}(1, \pi_i)$$
$$\mathrm{logit}(\pi_i) = (X\beta)_i + u^{(2)}_{\mathrm{woman}(i)} + u^{(3)}_{\mathrm{donation}(i)} + u^{(4)}_{\mathrm{donor}(i)}$$
$$u^{(2)}_{\mathrm{woman}(i)} \sim N(0, \sigma^2_{u(2)}), \quad u^{(3)}_{\mathrm{donation}(i)} \sim N(0, \sigma^2_{u(3)}), \quad u^{(4)}_{\mathrm{donor}(i)} \sim N(0, \sigma^2_{u(4)})$$

Results:

Parameter  Description             Estimate(se)
β0         intercept               -4.04(2.30)
β1         azoospermia *           0.22(0.11)
β2         semen quality           0.19(0.03)
β3         womens age>35           -0.30(0.14)
β4         sperm count             0.20(0.07)
β5         sperm motility          0.02(0.06)
β6         insemination too early  -0.72(0.19)
β7         insemination too late   -0.27(0.10)
σ²u(2)     women variance          1.02(0.21)
σ²u(3)     donation variance       0.644(0.21)
σ²u(4)     donor variance          0.338(0.07)
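Because the model is on the logit scale, the estimates are easier to read after back-transformation. The sketch below converts a few of the reported estimates into an odds ratio and probabilities. It assumes that all covariates equal to zero defines a meaningful baseline cycle; the handout does not say how the covariates are coded, so treat the baseline probability as illustrative only.

```python
import math


def inv_logit(x):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


intercept = -4.04        # reported intercept estimate
age_over_35 = -0.30      # reported coefficient for womens age>35
woman_variance = 1.02    # reported between-women variance

# Probability of conception for a baseline cycle with all random effects at 0
# (assumes covariates coded so that zero is a sensible reference value).
baseline_p = inv_logit(intercept)

# Multiplicative change in the odds of conception for women aged over 35.
odds_ratio_age = math.exp(age_over_35)

# Same baseline cycle for a woman whose effect is one standard deviation
# above the average woman.
p_high_woman = inv_logit(intercept + math.sqrt(woman_variance))

print(round(baseline_p, 4), round(odds_ratio_age, 3), round(p_high_woman, 4))
```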
17.08 Multiple membership models
Multiple membership models apply where level 1 units are members of more than one higher level unit. For example:
• Pupils change schools/classes and each school/class has an effect on pupil outcomes
• Patients are seen by more than one nurse during the course of their treatment
17.09 Notation
$$y_i = (XB)_i + \sum_{j \in \mathrm{nurse}(i)} w^{(2)}_{i,j} u^{(2)}_j + e_i \qquad (1)$$
$$u^{(2)}_j \sim N(0, \sigma^2_{u(2)}), \qquad e_i \sim N(0, \sigma^2_e)$$

Note that nurse(i) now indexes the set of nurses that treat patient i and $w^{(2)}_{i,j}$ is a weighting factor relating patient i to nurse j. For example, with four patients and three nurses, we may have the following weights:

          n1(j=1)  n2(j=2)  n3(j=3)
p1(i=1)   0.5      0        0.5
p2(i=2)   1        0        0
p3(i=3)   0        0.5      0.5
p4(i=4)   0.5      0.5      0

Here patient 1 was seen by nurses 1 and 3 but not nurse 2, and so on. If we substitute the values of $w^{(2)}_{i,j}$, i and j from the table into (1) we get the series of equations:

$$y_1 = (XB)_1 + 0.5u^{(2)}_1 + 0.5u^{(2)}_3 + e_1$$
$$y_2 = (XB)_2 + 1u^{(2)}_1 + e_2$$
$$y_3 = (XB)_3 + 0.5u^{(2)}_2 + 0.5u^{(2)}_3 + e_3$$
$$y_4 = (XB)_4 + 0.5u^{(2)}_1 + 0.5u^{(2)}_2 + e_4$$
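The weighted sum in the multiple membership model is a row of the weight table multiplied into the vector of nurse effects. The sketch below uses the weights from the table; the nurse effect values are hypothetical and serve only to show the arithmetic.

```python
# Weight table from the handout: rows are patients p1..p4, columns nurses n1..n3.
W = [
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0],
]

# Hypothetical nurse effects u_1, u_2, u_3 (illustration only).
u = [0.8, -0.4, 0.2]


def mm_effect(weights, effects):
    """Weighted sum of higher-level effects for one level 1 unit."""
    return sum(w * e for w, e in zip(weights, effects))


# In this table each patient's weights sum to 1.
row_sums = [sum(row) for row in W]

nurse_contribution = [mm_effect(row, u) for row in W]
print(row_sums)
print(nurse_contribution)
```

Patient 2, who is seen only by nurse 1, receives that nurse's full effect, while the other patients receive the average of the two nurses who treated them.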
17.10 Classification diagrams for multiple membership relationships
Double arrows indicate a multiple membership relationship between classifications.

Classification diagram: patient ⇒ nurse (double arrow).

We can mix multiple membership, crossed and hierarchical structures in a single model.

Classification diagram: nodes patient, nurse, hospital and GP practice.

Here patients are multiple members of nurses, nurses are nested within hospitals and GP practice is crossed with both nurse and hospital.
17.11 Example involving nesting, crossing and multiple membership: Danish chickens
Production hierarchy: 10,127 child flocks, 725 houses, 304 farms
Breeding hierarchy: 10,127 child flocks, 200 parent flocks

As a unit diagram: farms f1, f2…; houses h1, h2 within each farm; child flocks c1, c2, c3… within each house; parent flocks p1, p2, p3, p4, p5….

As a classification diagram: nodes Child flock, House, Farm and Parent flock.
17.12 Model and results
$$y_i \sim \mathrm{Binomial}(1, \pi_i)$$
$$\mathrm{logit}(\pi_i) = (XB)_i + \sum_{j \in \mathrm{p.flock}(i)} w^{(2)}_{i,j} u^{(2)}_j + u^{(3)}_{\mathrm{house}(i)} + u^{(4)}_{\mathrm{farm}(i)}$$
$$u^{(2)}_j \sim N(0, \sigma^2_{u(2)}), \quad u^{(3)}_{\mathrm{house}(i)} \sim N(0, \sigma^2_{u(3)}), \quad u^{(4)}_{\mathrm{farm}(i)} \sim N(0, \sigma^2_{u(4)})$$

Results:

Parameter  Description            Estimate(se)
β0         intercept              -2.322(0.213)
β1         1996                   -1.239(0.162)
β2         1997                   -1.165(0.187)
β3         hatchery 2             -1.733(0.255)
β4         hatchery 3             -0.211(0.252)
β5         hatchery 4             -1.062(0.388)
σ²u(2)     parent flock variance  0.895(0.179)
σ²u(3)     house variance         0.208(0.108)
σ²u(4)     farm variance          0.927(0.197)
17.13 ALSPAC data
All the children born in the Avon area in 1990 followed up longitudinally
Many measurements made including educational attainment measures
Children span 3 school year cohorts (say 1994, 1995, 1996)
Suppose we wish to model development of numeracy over the schooling period. We may have the following attainment measures on a child:
m1 m2 m3 m4 m5 m6 m7 m8 (measures running from primary school through secondary school)
17.14 Structure for primary schools
• Measurement occasions within pupils
• At each occasion there may be a different teacher
• Pupils are nested within primary school cohorts
• All this structure is nested within primary school
• Pupils are nested within residential areas
Classification diagram nodes: M. Occasion, Pupil, P. Teacher, P School Cohort, Primary school, Area.
17.15 A mixture of nested and crossed relationships
Classification diagram nodes: M. occasions, Pupil, P. Teacher, P School Cohort, Primary school, Area.
Nodes directly connected by a single arrow are nested; otherwise nodes are cross-classified. For example, measurement occasions are nested within pupils. However, cohorts are cross-classified with primary teachers, that is, teachers teach more than one cohort and a cohort is taught by more than one teacher.

          T1  T2  T3
Cohort 1  95  96  97
Cohort 2  96  97  98
Cohort 3  98  99  00
17.16 Multiple membership
It is reasonable to suppose the attainment of a child in a particular year is influenced not only by the current teacher, but also by teachers in previous years. That is, measurement occasions are "multiple members" of teachers.
m1 m2 m3 m4
t1 t2 t3 t4
We represent this in the classification diagram by using a double arrow.
Classification diagram nodes: M. occasions, Pupil, P. Teacher, P School Cohort, Primary school, Area.
17.17 What happens if pupils move area?
If pupils move area, then pupils are no longer nested within areas. Pupils and areas are cross-classified. Also, it is reasonable to suppose that pupils' measured attainments are affected by the areas they have previously lived in. So measurement occasions are multiple members of areas.
Two classification diagrams, each with nodes M. occasions, Pupil, P. Teacher, P School Cohort, Primary school and Area:
• Classification diagram without pupils moving residential areas
• Classification diagram where pupils move between residential areas
BUT…
17.18 If pupils move area they will also move schools
If pupils move schools they are no longer nested within primary school or primary school cohort. Also we can expect, for the mobile pupils, both their previous and current cohort and school to affect measured attainments.
Two classification diagrams, each with nodes M. occasions, Pupil, P. Teacher, P School Cohort, Primary school and Area:
• Classification diagram where pupils move between areas but not schools
• Classification diagram where pupils move between schools and areas
17.19 If pupils move area they will also move schools (cont'd)
And secondary schools…
Classification diagram nodes: M. occasions, Pupil, P. Teacher, P School Cohort, Primary school, Area.
We could also extend the above model to take account of secondary school, secondary school cohort and secondary school teachers.
17.20 Other predictor variables
Remember we are partitioning the variability in attainment over time between primary school, residential area, pupil, p. school cohort, teacher and occasion. We also have predictor variables for these classifications, e.g. pupil social class, teacher training, school budget and so on. We can introduce these predictor variables to see to what extent they explain the partitioned variability.
18 Significance testing and model comparison
• Individual fixed part and random coefficients at each level
• Simultaneous and complex comparisons
• Comparing nested models: likelihood ratio test
• Use of the Deviance Information Criterion
18.1 Individual coefficients
• Akin to t tests in regression models
• Either specific fixed effect or specific variance-covariance component
– H0: $\beta_1$ is 0; H1: $\beta_1$ is not 0
– H0: $\sigma^2_{u0}$ is 0; H1: $\sigma^2_{u0}$ is not 0
• Procedure: divide the estimated coefficient by its standard error
– Judge against a z distribution
– If the absolute value of the ratio exceeds 1.96 then significant at 0.05 level
• Approximate procedure; asymptotic test, small sample properties not well-known.
• OK for fixed part coefficients but not for random (typically small numbers; variance distribution is likely to have + skew)
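The ratio test above can be sketched in a few lines. The example applies it to two fixed part estimates reported in section 17.07 (womens age>35 and sperm motility); the function name `wald_z` is illustrative. The two-sided p-value uses the standard normal tail, computed with `math.erfc` from the standard library.

```python
import math


def wald_z(estimate, se):
    """Return the z ratio and its two-sided p-value under a standard normal."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Estimates and standard errors as reported in the section 17.07 results table.
z_age, p_age = wald_z(-0.30, 0.14)
z_mot, p_mot = wald_z(0.02, 0.06)

print(round(z_age, 2), p_age < 0.05)   # |z| above 1.96
print(round(z_mot, 2), p_mot < 0.05)   # |z| below 1.96
```

As the handout warns, this is reasonable for fixed part coefficients but should not be relied on for variance parameters.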
18.2 Simultaneous/complex comparisons & recommended for random part testing
• Example: Testing H0: $\beta_2 - \beta_3 = 0$ AND $\beta_3 = 5$
• H0: [C][β] = [k]
• [C] is the contrast matrix (p by q) specifying the nature of the hypothesis (q is the number of parameters in the model; p is the number of simultaneous tests). Fill the contrast matrix with 1 if a parameter is involved, -1 if it is involved as a difference, and 0 otherwise
• [β] is a vector of parameters (fixed or random) of length q
• [k] is a vector of values that the parameters are contrasted against (usually the null); these have to be set
• For the example above:
– q = 4 (intercept and 3 slope terms)
– p = 2 (2 sets of tests)

$$\underbrace{\begin{bmatrix} 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 1 \end{bmatrix}}_{[C]} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}}_{[\beta]} = \underbrace{\begin{bmatrix} 0 \\ 5 \end{bmatrix}}_{[k]}$$

• Overall test against chi square with p degrees of freedom
• Output
– Result of the contrast
– Chi-square statistic for each test separately
– Chi-square statistic for overall test; all contrasts simultaneously
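The overall test for the example contrast can be sketched with the usual Wald form, W = (Cb - k)' (C V C')^-1 (Cb - k), compared with chi-square on p degrees of freedom. The handout does not spell out the formula, so this is an assumption about the calculation, and the estimates `beta_hat` and covariance matrix `V` below are invented purely for illustration. With p = 2 the chi-square upper tail has the closed form exp(-W/2).

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


C = [[0, 0, 1, -1],   # beta2 - beta3 = 0
     [0, 0, 0, 1]]    # beta3 = 5
k = [0.0, 5.0]

# Hypothetical estimates and (diagonal) covariance matrix, illustration only.
beta_hat = [1.0, 0.5, 4.6, 4.2]
V = [[0.04, 0.0, 0.0, 0.0],
     [0.0, 0.01, 0.0, 0.0],
     [0.0, 0.0, 0.09, 0.0],
     [0.0, 0.0, 0.0, 0.16]]

# Result of each contrast: C * beta_hat - k.
d = [sum(c * b for c, b in zip(row, beta_hat)) - kj for row, kj in zip(C, k)]

# Joint Wald statistic using the covariance of the contrasts, C V C'.
cvc_inv = inv2(matmul(matmul(C, V), transpose(C)))
wald = sum(d[i] * cvc_inv[i][j] * d[j] for i in range(2) for j in range(2))

# Upper tail of chi-square with 2 degrees of freedom.
p_value = math.exp(-wald / 2.0)
print(d, round(wald, 3), round(p_value, 4))
```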
Testing in fixed part
1 slope for Standlrt
2 BoySch from mixed
3 GirlSch from mixed
4 BoySch from GirlSch
Model > Intervals & tests > Fixed coefficients; 4 tests
Basic Statistics > Tail Areas Chi square;
CPRObability 1.586 1
0.20790

Testing in random part
1 school variance
2 difference between school and student variance
Model > Intervals & tests > Random coefficients; 2 tests
Basic Statistics > Tail Areas Chi square;
CPRObability 25.019 1
5.6768e-007
18.6 Do we need a quadratic variance function at level 2?
->CPRObability 32.126 3
4.9230e-007

Benchmarks
CPRO 4 1   0.046
CPRO 6 2   0.050
CPRO 8 3   0.046
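The tail areas returned by CPRObability can be reproduced outside MLwiN. The sketch below implements the chi-square upper tail for whole-number degrees of freedom using only the standard library, via the textbook finite-sum formulas for even and odd df. The function name `chi2_upper_tail` is illustrative; if SciPy is available, `scipy.stats.chi2.sf` gives the same quantity.

```python
import math


def chi2_upper_tail(x, df):
    """P(X > x) for X ~ chi-square with a positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite correction sum.
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)  # r = 1 term
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return tail + total


# (statistic, df) pairs that appear in the handout's CPRObability calls.
for stat, df in [(1.586, 1), (25.019, 1), (32.126, 3), (4, 1), (6, 2), (8, 3)]:
    print(stat, df, chi2_upper_tail(stat, df))
```

The benchmark calls show why 4, 6 and 8 are handy reference values: each gives a tail area close to 0.05 for 1, 2 and 3 degrees of freedom respectively.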
18.7 Comparing nested models: likelihood ratio test
• Akin to F tests in regression models, i.e., is a more complex model a significantly better fit to the data, or is a simpler model a significantly worse fit?
• Procedure:
– Calculate the difference in the deviance of the two models
– Calculate the change in complexity as the difference in the number of parameters between models
– Compare the difference in deviance with a chi-square distribution with df = difference in number of parameters
• Example: tutorial data
Do we get a significant improvement in the fit if we move from a constant variance function for schools to a quadratic involving Standlrt?
-2*log(lh) is 9305.78: quadratic
-2*log(lh) is 9349.42: constant
->calc b3 = b2-b1
43.644
->cpro 43.410 2
3.7466e-010
NB significantly worse fit; i.e. need quadratic
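The same likelihood ratio calculation can be sketched from the two reported deviances. The quadratic variance function adds two parameters, so the test has 2 degrees of freedom, for which the chi-square upper tail is simply exp(-x/2). Note that the rounded deviances printed above differ by 43.64, whereas the handout's cpro call uses 43.410; the sketch evaluates both, and either way the p-value is far below 0.05.

```python
import math

dev_constant = 9349.42   # -2*log(lh), constant level 2 variance
dev_quadratic = 9305.78  # -2*log(lh), quadratic variance function
extra_params = 2         # parameters added by the quadratic variance function


def chi2_tail_df2(x):
    """Upper tail of chi-square with 2 degrees of freedom (closed form)."""
    return math.exp(-x / 2.0)


lr_stat = dev_constant - dev_quadratic
p_from_deviances = chi2_tail_df2(lr_stat)

# Value passed to cpro in the handout output.
p_from_handout_call = chi2_tail_df2(43.410)

print(round(lr_stat, 2), p_from_deviances, p_from_handout_call)
```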
18.9 Deviance Information Criterion
• Diagnostic for model comparison
• Goodness of fit criterion that is penalized for model complexity
• Generalization of the Akaike Information Criterion (AIC; where df is known)
• Used for comparing non-nested models (e.g. same number but different variables)
• Valuable in MLwiN for testing improved goodness of fit of non-linear model (e.g. logit) because the likelihood (and hence the deviance) is incorrect
• Estimated by MCMC sampling; on output get

Bayesian Deviance Information Criterion (DIC)
Dbar      D(thetabar)  pD    DIC
9763.54   9760.51      3.02  9766.56

Dbar: the average deviance from the complete set of iterations
D(thetabar): the deviance at the expected value of the unknown parameters
pD: the estimated degrees of freedom consumed in the fit, i.e. Dbar - D(thetabar)
DIC: Fit + Complexity; Dbar + pD
NB lower values = better parsimonious model
• Somewhat controversial!
Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B 64: 583-640.
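The DIC arithmetic can be checked from the two deviance summaries in the output. The sketch below recomputes pD and DIC from the printed Dbar and D(thetabar); small differences in the last decimal place from the printed pD and DIC are expected, because the printed inputs are themselves rounded.

```python
dbar = 9763.54         # average deviance over the MCMC iterations
d_thetabar = 9760.51   # deviance at the posterior mean of the parameters


def dic_summary(dbar, d_thetabar):
    """Return (pD, DIC) with pD = Dbar - D(thetabar) and DIC = Dbar + pD."""
    p_d = dbar - d_thetabar
    return p_d, dbar + p_d


p_d, dic = dic_summary(dbar, d_thetabar)
print(round(p_d, 2), round(dic, 2))
```

Writing DIC as Dbar + pD makes the trade-off explicit: a model is only preferred if its improvement in average fit outweighs the effective number of parameters it uses.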
18.10 Some guidance
• Any decrease in DIC suggests a better model
• But MCMC is stochastic; so, with a small difference in DIC you should confirm that this is a real difference by checking the results with different seeds and/or starting values.
More experience with AIC, and common rules of thumb………
18.11 Example: Tutorial dataset example
Model 1: NULL model: a constant and level 1 variance
Model 2: additionally include slope for Standlrt
Model 3: 65 fixed school effects (64 dummies and constant)
Model 4: school as random effects
Model 5: 65 fixed school intercepts and slopes
Model 6: random slopes model; quadratic variance function

Best = Model 6
Note: random models (4 & 6) have more nominal parameters than their fixed equivalents but fewer effective parameters and a lower DIC value (due to distributional assumptions)