National Institute for Applied Statistics Research Australiaweb/... · 2015-06-07 · Health...

Copyright © 2013 by the National Institute for Applied Statistics Research Australia, UOW. Work in progress, no part of this paper may be reproduced without permission from the Institute.

National Institute for Applied Statistics Research Australia, University of Wollongong, Wollongong NSW 2522. Phone +61 2 4221 5435, Fax +61 2 4221 4845. Email:

[email protected]

National Institute for Applied Statistics Research Australia

The University of Wollongong

Working Paper

15-13

Empirical Zoning Distributions for Small Area Health Data

Dr Sandy Burden and Professor David Steel

mailto:[email protected]�

Empirical Zoning Distributions for Small Area

Health Data

Dr Sandy Burden ∗1 and Professor David Steel2

1Research Fellow, National Institute of Applied Statistics Research Australia, School

of Mathematics and Applied Statistics, University of Wollongong, Australia.

2Director, National Institute of Applied Statistics Research Australia, School of

Mathematics and Applied Statistics, University of Wollongong, Australia

Abstract

Many health studies use aggregate data such as the means or totals

for areal units or zones when individual level data are not available. An

ecological analysis using these data typically produces estimates that dif-

fer from those obtained using the corresponding individual level analysis.

This is due to the modifiable area unit problem (MAUP) whereby the re-

sults of the analysis depend on the scale and zoning effects. In this paper

empirical zoning distributions are used to study the effect of zoning on

the parameter estimates from ecological analyses. The zoning distribution

is defined as the distribution of the parameter estimates obtained from a

given ecological analysis which is repeated for different sets of M zones.

For simulated population data, generated using the 2007-2008 Aus-

tralian National Health Survey and 2006 Australian Census, we create

zoning distributions for estimates from ecological regression models at

∗Email: [email protected]

1

multiple scales of analysis. These distributions are typically relatively

symmetrical and unimodal and appear to be normally distributed. They

often have appreciable variation, which should not be ignored. Using the

distribution, the “ecological average” or mean of the empirical estimates

at each scale of analysis, displays systematic variation with the number

of zones. The variance of the zoning distribution is related to the average

zone population per zone.

Using empirical zoning distributions, the parameter estimates obtained

for a given set of zones at the same or a different scale can be compared

and the influence of zones on the results of an ecological analysis.

Keywords: modifiable areal unit problem, ecological bias, scale, zoning,

regression analysis, multilevel modelling

1 Introduction

A set of zones is formed when a given study region is partitioned into M ge-

ographically contiguous, non-overlapping areal units. Individuals in the pop-

ulation are assigned to the zones using indicators of geographic location, such

that all individuals belong to exactly one zone and each zone has Ng ≥ 1 ob-

servations, g = 1, ...,M . Health studies ideally use individual level data in a

multilevel modelling framework to incorporate the clustered nature of the data

in the analysis. The model has a hierarchical structure with individual ob-

servations nested within the cluster or group to which they belong, which for

geographical data are the zones. For privacy reasons, or to allow data from

different sources to be brought together, frequently only aggregate geographic

summaries are available and an ecological analysis is used which substitutes area

level summaries for the individual level data. However, ecological parameter es-

timates may be biased and the amount and direction of bias depends on the

zones used to analyse the data.

2

This sensitivity is called the modifiable area unit problem (MAUP) and it

occurs because aggregation removes the direct link between individual response

and covariate values. It has two aspects: the scale effect, which occurs when

the number of areas in the study region is changed and the zoning effect which

arises when the areas at a given scale are defined using different boundaries

(Flowerdew et al., 2001). Changing either factor may alter the estimates which

are obtained. Changes in scale substantially affect the variance of the estimates

in a somewhat systematic way (Steel and Holt, 1996), but the effect of moving

the zone boundaries is not apparently systematic (Openshaw, 1984; Stafford

et al., 2008; Haynes et al., 2008). Although highly homogenous areas will result

in an analysis with more power and potentially a smaller bias due to the MAUP

(Briant et al., 2010) as in this case more of the variability between data values

is between areas.

While the MAUP remains unresolved, the results of area level or ecological

analyses can only legitimately be applied to the particular zones used in the

study, as estimates may change for alternative sets of zones, even at the same

scale. When the particular set of zones is of direct substantive interest, the

possibility of a different result for an alternative set of zones is irrelevant. In

other cases the effect of the zones on the parameter estimates and variation in

the results for an alternative set of zones can assist interpreting the results of the

analysis. Its importance has been recognised in many different types of analysis

including studies of health (Diez-Roux and Mair, 2010; Parenteau and Sawada,

2011; Swift et al., 2008; Cockings and Martin, 2005; Schuurman et al., 2007;

Best et al., 2001). However, the geographical scale of a study is still frequently

determined by data availability (Wakefield, 2004), as is the choice of zones.

The zones used to analyse data can either be derived from existing zoning

systems or created for the purpose of the analysis. Existing systems typically

3

utilise official or administrative boundaries which provide a convenient way to

disseminate data, such as post-codes, output areas, enumeration districts and

local authorities. However, it has long been recognised that the use of existing

zoning systems has limitations (Openshaw, 1977). For example the UK Census

Enumeration Districts display “wide variations in population size, geographical

shape, area and social composition” (Cockings and Martin, 2005, pp. 2732–

2733). Alternatives to administrative boundaries include zones formed using

geometric shapes (such as a rectangular grid), Voronoi tessellations (Swift et al.,

2008), local knowledge of the area, or automatic zone design procedures. A re-

cent review of zone design techniques is provided by Duque et al. (2007). Stand

alone zone design algorithms which have been used for small area health data

include ZDES (Openshaw and Rao, 1995) and AZTool (Cockings et al., 2011;

Martin, 2003) which are based on the AZP algorithm (Openshaw, 1977; Open-

shaw and Rao, 1995). Other zone design packages, including the scale-space

clustering method (Mu and Wang, 2008), are available for use with Geographic

Information System (GIS) packages.

Several aspects of the zoning effect have previously been studied, including

the appropriate zones to use in an analysis (Haynes et al., 2007); the definition

of neighbourhoods; and the inclusion of contextual or neighbourhood effects,

particularly for the analysis of individual level data (Diez-Roux and Mair, 2010).

The effect of scale and the ecological bias associated with using aggregate data

to estimate individual level relationships have been widely considered in the

fields of spatial epidemiology, geography and the social sciences (see Greenland,

2002; Steel et al., 2003; Richardson et al., 1987; Wakefield, 2004, for example).

However, these analyses do not consider the distribution of estimates obtained

using multiple sets of zones at each of several given scales, which is the focus of

this paper.

4

In this paper we define the zoning distribution of a parameter estimate as

the distribution or density function of the estimate over all possible sets of M

zones, given the scale and any constraints used in constructing the zones (such

as minimum population thresholds). It can be used to obtain the expected value

and variability of estimates at a given scale. The ecological average, defined as

the expected value of the zoning distribution, along with the variance of the

zoning distribution provide a way to compare and standardise the results for a

set of zones at a given scale. Zoning distributions may also be used to make

inferences about a parameter for one set of zones or at one scale, given the data

for another set of zones at a different scale.

There are presently no established rules or guidelines which can be used to

consider the form of the zoning distribution, other than in the case of random

aggregation, when the expected value of each estimate is unbiased for the ap-

propriate individual level parameters and their variation can be derived from

standard statistical theory (see Steel and Holt, 1996, for example). Since their

comprehensive demonstration by Openshaw (1984), zoning distributions have

not been widely considered in the literature, with the notable exception of Cock-

ings and Martin (2005) who create 10 sets of zones at several scales to determine

the sensitivity of a correlation coefficient to changes in scale and zoning.

In this paper empirical estimates of zoning distributions are created for the

parameter estimates from a statistical model for small area health data so that

they can be better understood. They are used throughout this paper to describe

the variation in estimates for different sets of zones at multiple scales; to iden-

tify their impact on statistical analyses; to identify any systematic changes with

scale; and to identify appropriate assumptions for zoning distributions so they

may be incorporated in future statistical analyses. In section 2, the methodol-

ogy used to create empirical zoning distributions for aggregate health data is

5

described. The results of the analysis are presented in section 3 and discussed

in section 4. In addition to the usual results on ecological bias and scale effects,

we focus on the characteristics of the zoning distribution such as its dispersion,

its shape, its variance and how it changes with scale and how important it is to

account for the zoning distribution in interpreting the results of an ecological

analysis.

2 Methodology

Evaluation of zoning distributions requires spatially detailed data which is not

available for the Australian population. Instead, these data were simulated for

6,378,163 individuals in the 11879 populated Census Collection Districts (CDs)

in New South Wales (NSW), Australia. Unit record data from the 2007-2008

National Health Survey (see Australian Bureau of Statistics, 2008, 2009, for a

description of the data) were combined with summary (benchmark) data defin-

ing the characteristics of CDs from the 2006 Australian Census (Australian

Bureau of Statistics, 2006a) using a spatial microsimulation model (MSM) as

described in Burden and Steel (2013). For this approach, for each area an initial

set of individual records was randomly selected from the National Health Sur-

vey. These records were then swapped with stochastically selected alternatives

using a simulated annealing procedure to optimise the fit of the data to the

pre-selected benchmarks for the area from the 2006 Australian Census. The re-

sulting simulated population reflects the characteristics of the survey data and

also the geographical structure at the CD level.

Health outcomes considered in the study were: measured body mass index

(BMI); type 2 diabetes mellitus (diabetes); and angina. The relationship be-

tween each outcome and three binary indicators: a sedentary lifestyle (little or

no physical activity); dietary fat (consumption of whole milk with ≥3% fat);

6

and current smoking status (for BMI) or obesity (BMI ≥ 25 - for angina and

diabetes) was investigated. Table 1 shows summary statistics for the distri-

bution of each variable across the CD’s. The variables age and sex were also

included in the model for BMI and used to calculate the expected cases of angina

and diabetes in each area, which were included as an offset in their respective

models.

[Table 1 about here.]

As health studies often includes a measure of socio-economic status, an area

level indicator of socioeconomic status (denoted HSEIA) was created, following

the procedure used to define the Australian Bureau of Statistics Socioeconomic

Index for Areas, Index of Relative Socio-Economic Advantage and Disadvantage

(SEIFA) (Australian Bureau of Statistics, 2006b, p.17–23). HSEIA was created

using principle component analysis applied to the CD-level correlation matrix

obtained from the simulated data. The index was standardised to have a mean

of 1000 and standard deviation of 100. Despite being created using different

datasets and some different variables (due to data availability limitations), the

distribution of deciles assigned to each area for HSEIA and SEIFA was simi-

lar. Overall, HSEIA included 16 variables (see Appendix 5 for details), had an

eigenvalue of 7.09 and explained 44% of the variation in the variables used in

the index. For comparison, SEIFA included 21 variables, had an eigenvalue of

9.16 and also explained 44% of the variation in the variables.

The final simulated dataset comprised a set of individual health records

(from the health survey) for the population of NSW in private dwellings with

location information known to the CD level. The simulated data were then

rezoned to higher levels of aggregation using the AZTool Software (Cockings

et al., 2011; Martin, 2003). AZTool randomly allocates CDs to analysis zones

whilst preserving geographic contiguity. It then iteratively swaps CDs between

7

adjacent zones to improve predefined targets and ensure each zone lies within

specified upper and lower population limits. At eight scales of analysis, 1000

sets of zones were defined, each using a single run of AZTool with 15 sets of swap

iterations. Table 2 shows the population constraints used for each scale and the

resulting average population statistics. At each scale, the average population

per zone was achieved and the range in population per zone was always narrower

than the specified limits. The coefficient of variation (standard deviation divided

by the mean) decreased with scale indicating less variability in zone population

with increasing scale.


Aggregate data summaries were produced for each set of zones and the sum-

maries were analysed to obtain regression parameter estimates, βE , for the

covariates in X.

For a given set of M zones, the average BMI in each zone (Yg; g = 1, ...,M)

was modelled in terms of the average for age, sex, each indicator variable and

HSEIA using the zone level regression model defined in Equation 1.

Yg|xg ∼ N(µg, σ2)

E[Yg|xg] = µg = xTg β

E (1)

where xg = 1Ng

∑Ng

i=1 Xi is the mean covariate value for individuals i = 1, ..., Ng

in area g.

Angina and diabetes were modelled as count variables. Making a rare disease

assumption, the ecological model defined in Equation 2 was specified using a

Poisson distributed response and log link function. A normally distributed

random effect νg was modelled in terms of the relative risk of disease for area g,

8

θg. This approach was used because for rare diseases and large group sizes, the

count of positive responses for each area can be approximated by an independent

Poisson random variable given the covariates and a normally distributed random

effect for the zones. An offset, Eg, was included in the model to account for

differences in the population at risk in each area according to its age and sex

structure. It was calculated as the weighted sum of the expected counts of

disease in each age×sex stratum in the population. The expected counts in

each stratum were calculated across all areas using the individual level data for

the population.

Yg|νEg , xg ∼ Pois(µg)

E[Yg|νEg , xg] = µg = Egθg (2)

θg = exp(xTg β

E + νEg )

νEg ∼ N(0, σ2ν)

Model parameters were estimated using either ordinary least squares (for

BMI) or second order penalised quasi-likelihood (PQL2) in the MlWiN software

(Rasbash et al., 2009). Markov Chain Monte Carlo (MCMC) techniques are also

frequently utilised to analyse these models and can provide better estimates.

However, these techniques are also time-consuming. Due to the large number of

simulations and as the aim of this project was development and demonstration

of zoning distributions comparing changes in estimates with scale and zoning,

MCMC techniques were not utilised. Using the resulting parameter estimates

for each covariate, zoning distributions were defined at each scale using ker-

nel density estimation with a Gaussian kernel in the R Statistical Software (R

Development Core Team, 2008).

9

For the rth set of zones, r = 1, ..., 1000, at scale Lk, k = 1, ..., 8, the es-

timate of the regression parameter is denoted βEr,k. The associated standard

variance estimate from the analysis is denoted V ar(βEr,k), from which the es-

timated standard error can be obtained as SE(βEr,k) =

√V ar(βE

r,k). We can

calculated the average of each of these quantities over the R = 1000 sets of

zones generated at scale Lk. For the zoning distribution this gives: Ek[βE ] =

1R

∑Rr=1 β

Er,k, the average of the ecological regression estimates, Ek[V ar(βE)] =

1R

∑Rr=1 V ar(βE

r,k), the average of the estimated variance of the regression es-

timates, and Ek[SE(βE)] =

√Ek[V ar(βE)], the average estimated standard

error of the regression estimates. The estimated variance is not the same as the

empirical variance of the zoning distribution, which is given by V ark(βE) =

1R

∑Rr=1

(βEr,k − Ek[β

E ])2

.

The individual level data were also analysed using appropriate models to pro-

vide a reference for comparison with the ecological estimates. A linear multilevel

statistical model was used to obtain parameter estimates for the regression coef-

ficients, βIr,k for r = 1, ..., 1000 at scale Lk, and the associated variance estimate

V ar(βIr,k), for each covariate on BMI (including age and sex). The statistical

model is defined in Equation 3 for a given set of M zones where νIg is a normally

distributed random effect with mean zero and variance σ2ν .

Yi|νg,xi ∼ N(µi, σ2)

E[Yi|νg,xi] = µi = xTi β

I + νIg (3)

νIg ∼ N(0, σ2νI )

Using equivalent notation, binary indicators of prevalence of angina and diabetes

were modelled using Equation 4 as Bernoulli random variables with a logistic

10

link and random effect νIg .

Yi|νg,xi ∼ Bern(µi)

E[Yi|νg,xi] = µi =exp(xT

i βI + νIg )

1 + exp(xTi β

I + νIg )(4)

νIg ∼ N(0, σ2νI )

The parameters in these models were estimated using each set of zones to

define the group level and a sample of data of size approximately equivalent to

that obtained from a large population survey. A 0.33% or 1% simple random

sample of individual records was selected with an equal probability of selection

for BMI and the binary variables respectively. Some logistic models at levels

one to three failed to converge using PQL2, due to the large number of small

areas, so first order marginal quasi-likelihood (MQL1) was used for estimation.

3 Results

3.1 Empirical Zoning Distributions

Zoning distribution density plots for the ecological regression parameter esti-

mates (βEr,k) are shown for BMI, angina and diabetes in Figures 1, 2 and 3

respectively. The domain of each distribution represents the range of parameter

estimates which may be obtained for the given covariate and scale. The den-

sity curve reflects the probability distribution for the zoning distribution of the

estimate for the given statistical model.

[Figure 1 about here.]



11

The zoning distributions were generally unimodal, reasonably symmetric and

were a similar shape for all response variables. With an increase in scale, that

is with smaller M , the ecological average of each parameter estimate generally

increased in absolute magnitude in a consistent direction, although the relative

size of the change diminished with scale. Due to the large size of the dataset,

parameter estimates were statistically significant at the 5% level for most sets of

zones at all scales, based on the regression estimate and the estimated variance

for a particular set of zones.

The variance of the distributions increased substantially with scale and the

proportional increase in variance with scale was similar for all covariates due

to the reduced power of the analysis resulting from fewer data values. This

demonstrates the implications for the inferences that can be made using ecolog-

ical parameter estimates. When the limits of the distribution extend to include

zero, inferences may not be significant and in some cases the apparent relation-

ship may change sign, i.e. the covariate obesity in the model for diabetes.

The Shapiro Wilk test for normality was performed (using the R statistical

software) for each zoning distribution. For BMI and diabetes, the null hypothe-

sis of normality was retained at the 5% level for approximately 90% and 96% of

the tests performed. For angina, the results were mixed, with the assumption of

normality retained for only 73% of the tests at the 5% level. This may be due

to reduced power with angina having fewer positive responses in the data. For

all response variables, the skewness of the zoning distributions for all parame-

ter estimates was well below one (with typical values of 0.01 to 0.03) and the

excess kurtosis was also low, confirming that the distributions are all relatively

symmetric and are consistent with normally distributed data.

A key advantage of the microsimulation approach is that estimates obtained

from ecological analysis and multilevel models can be compared using realistic

12

aggregate and individual level data from the same population. Table 3 provides

a summary the ecological average, the average estimated standard error and

the empirical variance of the zoning distributions for levels 1, 4, and 8 of the

ecological analyses. Average multilevel model parameter estimates, calculated

for 1000 sets of zones, and the equivalent CD level estimates are also included.

The multilevel model estimates are the average estimates using groups at level 4,

although very similar results were obtained for groups at all scales. For angina

and diabetes the estimates are obtained using a Bernoulli distributed response

variable (not Poisson). The estimated variance of the random effects is generally

small and decreases with scale in all cases.


For an individual level target of inference, the ecological parameter estimates

are consistently biased and have generally higher variances than multilevel mod-

els. With increasing scale, the change in bias of the average variance of the dif-

ference appears to behave systematically. However, predicting the magnitude

of the bias for a particular set of zones is much harder. If the zoning distri-

bution were known, the bias of a particular estimate may be standardised by

accounting for the variability due to zoning, although the problem still remains

that the relationship between the individual level and area level estimates is not

immediately predictable.

The scale effect is demonstrated by the change in the ecological average M

changes. In some cases this resulted in parameters changing sign. For example

in Figure 3, the distribution of the estimated parameter for the HSEIA covariate

takes positive and negative values at different scales. The results suggest that

a component of the bias is related to the scale of the analysis, particularly as

similar results were obtained for the minimum, median and maximum of the

zoning distributions (not shown). The relationship was not universal though, as

13

the pattern of covariate values with scale for some covariates was more complex

i.e. obesity in the diabetes model.

The potential magnitude of the zoning effect is quantified by variance pa-

rameters. These results suggest that the zoning distributions of the variances

vary systematically as a function of scale i.e. as a function of N and therefore

M , reflecting that the variance is a function of the degrees of freedom, M − 1.

The trend in the variance of the zoning distribution with scale seems consistent

with that expected when data are randomly aggregated even though the average

parameter estimates do not exhibit the same characteristics.

The implications of the zoning distribution can also be considered in terms

of a predictive interval for the parameter estimates obtained for a new set of

zones. Assuming approximate normality of the zoning distribution, the width

of a 95% prediction interval for a new parameter estimate calculated from a

different set of zones is equal to 2×1.96

√V ar(β). Table 4 shows these intervals

as a proportion of the average value of the estimate at the relevant scale. The

predictive intervals get much wider as the scale increases, in a similar fashion

to the variance of the zoning distribution. This suggests that the impact of the

zones increases for higher scales, and hence the importance of taking the zoning

distribution into consideration also increases.


For some parameter estimates, 95% of the estimates from a new set of zones

would lie within 10% of the average value of the parameter estimate, i.e. the

covariates for angina at the lower scales. In some studies this may represent a

reasonable level of variation due to zoning, in which case the zoning distribution

will not substantially affect the parameter estimates or inference. However in

many cases, even when the estimates are statistically significant, the prediction

interval for 95% of estimates from a new set of zones may be greater than 20%

14

and could lie between 30% and 100%. For these cases the variation due to

zoning may have a substantial impact on inference.

3.2 Relationships in the Data

For a linear model, when data are randomly aggregated, the parameter estimates

for a fully specified model are theoretically unbiased and the variance is inversely

proportional to the degrees of freedom, M−1 (Steel and Holt, 1996). As the size

of the total population is fixed, the variance is also proportional to N . Figure 4

plots the ecological average of the parameter estimates versus M for each scale

of analysis. It shows that there are clear trends in the variation of the average

ecological parameter estimates with scale and that the relationship is not linear.

In general the ecological average decreases in magnitude with increasing scale,

although the coefficient for dietary fat in the case of diabetes appears inversely

related to M . The key finding is that for statistically significant parameter

estimates there is a systematic relationship between the ecological average and

number of areas used in the analysis.


The relationship between the empirical variance of the zoning distribution

for each regression parameter estimate, (V ark(βE)), and N is shown in Figure

5. For the response BMI, the expected value of the estimated variance of the

parameter estimates is proportional to the average population size in each area,

N (and 1/M , which is not shown). Despite non-linearity in the models, similar

results are obtained for angina and diabetes. These results suggest that scale

effects for the regression parameter estimates are related to M with variances

linearly related to N . It is interesting to see that the same result appears to

hold for the parameter estimates for a non-linear model. An important finding

from these results is that even though the number of areas at the highest scale

15

(L8 M=317) is not small, the variance of the zoning distribution is substantial.

Moreover, even at smaller scales where M is large, there is appreciable variation

in the zoning distribution.


The average estimated variance of the parameter estimates, (Ek[V ar(βE)]),

over 1000 zones (not shown), also exhibits the same linear relationship with N

(and 1/M) with increasing scale for all of the response variables. Ek[V ar(βE)]

and V ark(βE) are compared directly in Figure 5 for each of the response vari-

ables. In all cases the relationship between the variance estimates is reasonably

linear, but not exactly the same. The average estimated variance is close to four

times greater than the zoning variance.


4 Discussion

These results provide valuable insights into zoning distributions, showing that

they exist and can have appreciable dispersion even when there are a large num-

ber of zones. For a continuous or a binary response variable, the zoning dis-

tributions of the ecological parameter estimates appear normally distributed.

The variance of each zoning distribution is a predictable function of N and

the average of the parameter estimates over the zoning distribution are related

to M . The variance of the zoning distribution is often appreciable, although

less than the standard variance estimates obtained from an ecological analysis,

and it should not be ignored when interpreting the results of statistical analysis

based on aggregate data for geographic zones. For example, zoning distribu-

tions at two different scales may overlap and in some cases the variation due to

zoning may exceed difference due to scale. Quantifying the zoning distribution

16

allows confidence intervals for the expectation of the estimates over the zoning

distribution to be obtained.

The results demonstrate several implications of using data aggregated to a

given set of zones for obtaining parameter estimates. In general, the aggregate

data estimator is biased compared with the individual level estimator and the

magnitude and direction of the bias depends on the estimator. The bias of a

given estimate cannot yet be predicted, but for statistically significant estimates

the average value of the estimator over the zones varies reasonably systematically

with scale for both linear and non-linear ecological models. An extension of this

result is that zoning distributions may also be used in the definition and analysis

of neighbourhood effects. For example, zoning distributions (particularly at

multiple scales) can be used to identify the scale above which zones can be

assumed to be randomly formed.

One result which is only available when multiple sets of zones are used to

analyse the data at each level is that parameter estimates may only be statis-

tically significant for some sets of zones at a given scale. When this occurs,

the zones chosen to analyse the data can affect the statistical significance of

the parameter estimates in ways that at present are not predictable. In many

cases the primary factor affecting the stability of the estimates is the scale of

the analysis. In all cases, a greater number of observations (i.e. zones) improves

the variance of the estimates and increases the probability of a statistically sig-

nificant result. However, even when there are over 1000 zones in the analysis

the zoning can have an impact on the parameter estimates.

In some cases, the expectation of the zoning distribution at a given scale

may itself be a reasonable target of inference. If the zoning distribution can be

characterised, we might then be able to draw conclusions about it. A finding

in this study is that in most cases, the bias caused by aggregation is more

17

substantial than the variation due to the zones used in the analysis. However

there are some cases - particularly at higher scales - when the variation due

to zoning is substantial and cannot be ignored. Similarly, compared with the

bias associated with the use of aggregate rather than individual level data, the

zoning effect was relatively minor at low scales although its impact increased

substantially as the scale of aggregation increased and the zoning distributions

became much flatter and wider.

Given a set of zones at a particular scale it is worthwhile evaluating the

zoning distribution to assess the sensitivity of the results of the analysis to

the zones used. If only one set of zones is available at the scale, the trend in

scale effects and the empirical variance of the zoning distribution observed in

this paper suggest that it is worthwhile evaluating the zoning distribution at a

higher scale to give some indication of the possible means and variance of the

zoning distribution at the scale of interest.

If the zoning distribution at a particular scale can be estimated, then given

results for one set of zones it may be possible to make a judgment regarding the

results which may be obtained for another set of zones at that scale. For exam-

ple, using prediction intervals, a prediction of the parameter estimates obtained

with a different set of zones can be made. For this study, relative prediction

intervals of ±10% to ±15% were frequently obtained from the zoning distri-

butions, although some parameter estimates were more substantially affected.

Consequently the use of a particular set of zones introduces an additional source

of error, and knowledge of the zoning distribution allows it to be quantified and

compared with the other sources of error.

By undertaking analyses at several scales or using several sets of zones at

a given scale, the average value of the zoning distribution may be obtained.

A major finding of the analyses conducted here is that the ecological average,

18

Ek[βE ], appears to vary systematically with scale allowing the effect of zoning

at a given scale to be predicted. Moreover, it is possible to re-aggregate the data

to a higher scale and then to predict the variance of the zoning distribution at a

lower scale based on the variance at the higher scale. The implications of these

results are that given one observation on a zoning distribution at one scale,

if the data are aggregated in a number of ways to several different scales, the

relationships between the scales can be exploited to help asses the possible mean

and variance of the zoning distribution for the scale of interest. Moreover at a

given level above the lowest scale, it is possible to make a partial adjustment

of the estimator to its average value, to account for the zoning distribution.

For example by extrapolating the observed relationship between the ecological

average and the scale of analysis down to the lowest scale at which data are

available.

To draw conclusions about a different scale requires an understanding of how

the expectation of the zoning distribution varies with scale. Obtaining parame-

ter estimates at a scale which is higher than the scale of interest is not difficult,

as with sufficient zones the data can be merged in multiple ways to a higher

scale. Going down a level is more difficult, but if the zoning distribution can

be related in a systematic way to the scale of the analysis, then prediction and

estimation at lower levels is possible. An example of this is that the zoning

distribution at level 1, say, may help us in assessing the CD level zoning distri-

bution which should be used with the single estimate that we have for the CD

level.

In conclusion, the characteristics of the zones used to aggregate the data are

an important aspect of the analysis for any type of study using small area health

data or when population grouping is involved. In all studies there is a need to

carefully consider the zones used in the analyses and the zoning distribution

19

that applies. This paper provides an extensive systematic investigation of the

characteristics of zoning distributions for parameter estimates obtained from

the analysis of small area health data using an ecological model.

5 Appendix 1


References

Australian Bureau of Statistics (2006a). Census Dictionary. cat. no. 2901.0

(reissue) ABS, Canberra.

Australian Bureau of Statistics (2006b). Socio-Economic Indexes for Areas

(SEIFA) - Technical Paper. cat. no. 2039.0.55.001. ABS, Canberra.

Australian Bureau of Statistics (2008). National Health Survey 2007-08. Basic

CURF, CD-ROM. Findings based on use of ABS CURF data.

Australian Bureau of Statistics (2009). National Health Survey: Users’

guide - Electronic Publication, 2007-08. Cat. No. 4363.0.55.001. Viewed

21 June 2012 http://www.ausstats.abs.gov.au/ausstats/subscriber.nsf/

0/CC0FB5A08570984ECA25762E0017CF2B/$File/4363055001 2007-08.pdf.

Best, N., Cockings, S., Bennett, J., Wakefield, J., and Elliott, P. (2001). Eco-

logical regression analysis of environmental benzene exposure and childhood

leukaemia: Sensitivity to data inaccuracies, geographical scale and ecologi-

cal bias (Pkg: p141-207). Journal of the Royal Statistical Society, Series A,

164(1):155–174.

Briant, A., Combes, P.-P., and Lafourcade, M. (2010). Dots to boxes: Do the

20

size and shape of spatial units jeopardize economic geography estimations.

Journal of Urban Economics, 67:287 – 302.

Burden, S. and Steel, D. (2013). Microsimulation of health data to retain spatial

structure for small areas. Working Paper 20-13, University of Wollongong.

Cockings, S., Harfoot, A., Martin, D., and Hornby, D. (2011). Maintaining ex-

isting zoning systems using automated zone-design techniques: methods for

creating the 2011 census output geographies for england and wales. Environ-

ment and Planning A, 43:2399 – 2418.

Cockings, S. and Martin, D. (2005). Zone design for environment and health

studies using pre-aggregated data. Social Science and Medicine, 60:2729–

2742.

Diez-Roux, A. and Mair, C. (2010). Neighbourhoods and health. Annals of the

New York Academy of Sciences, 1186:125 – 145.

Duque, J., Ramos, R., and Surinach, J. (2007). Supervised regionalisation meth-

ods: a survey. International Regional Science Review, 30(3):195 – 220.

Flowerdew, R., Geddes, A., and Green, M. (2001). Behaviour of regression mod-

els under random aggregation. pages 89–104. John Wiley and Sons, Chich-

ester.

Greenland, S. (2002). A review of multilevel theory for ecologic analyses. Statis-

tics in Medicine, 21(1):389–395.

Haynes, R., Daras, K., Reading, R., and Jones, A. (2007). Modifiable neigh-

bourhood units, zone design and residents perceptions. Health and Place,

13:812 – 825.

21

Haynes, R., Jones, A., Reading, R., Daras, K., and Emond, A. (2008). Neigh-

bourhood variations in child accidents and related child and maternal char-

acteristics: does area definition make a difference. Health and Place, 14:693

– 701.

Martin, D. (2003). Extending the automated zoning procedure to recon-

cile incompatible zoning systems. Int. J. Geographical Information Science,

17(2):181–196.

Mu, L. and Wang, F. (2008). A scale-space clustering method: mitigating the

effect of scale in the analysis of zone-based data. Annals of the Association

of American Geographers, 98(1):86 – 101.

Openshaw, S. (1977). A geographical solution to scale and aggregation problems

in region-building partitioning and spatial modelling. Transactions of the

Institute of British Geographers, 2(4):459–472.

Openshaw, S. (1984). The modifiable area unit problem, volume 38 of Concepts

and techniques in modern geography. Geobooks, Norwich.

Openshaw, S. and Rao, L. (1995). Algorithms for reengineering 1991 census

geography. Environment and Planning, 27:425–446.

Parenteau, M.-P. and Sawada, M. (2011). The modifiable areal unit problem

(maup) in the relationship between exposure to no2 and respiratory health.

International Journal of Health Geographics, 10:58.

R Development Core Team (2008). R: A Language and Environment for Statis-

tical Computing. R Foundation for Statistical Computing, Vienna, Austria.

ISBN 3-900051-07-0.

Rasbash, J., Charlton, C., Browne, W., Healy, M., and Cameron, B. (2009).

22

MLwiN version 2.1. Technical report, Centre for Multilevel Modelling,, Uni-

versity of Bristol.

Richardson, S., Stucker, I., and Hemon, D. (1987). Comparison of relative risks

obtained in ecological and individual studies: some methodological consider-

ations. International journal of epidemiology, 16:111–120.

Schuurman, N., Bell, N., Dunn, J., and Oliver, L. (2007). Deprivation indices,

population health and geography: an evaluation of the spatial effectiveness of

indices at multiple scales. Journal of Urban Health, 84:591 – 603.

Stafford, M., Duke-Williams, O., and Shelton, N. (2008). Small area inequalities

in health: are we underestimating them? Social Science and Medicine, 67:891

– 899.

Steel, D. and Holt, D. (1996). Rules for random aggregation. Environment and

Planning A, 28:957–978.

Steel, D., Tranmer, M., and Holt, D. (2003). Analysis combining survey and

geographically aggregated data. In Analysis of Survey Data, chapter 20, pages

323–343. John Wiley and Sons, London.

Swift, A., Liu, L., and Uber, J. (2008). Reducing maup bias of correlation

statistics between water quality and gi illness. Computers, Environment and

Urban Systems, 32:134 – 148.

Wakefield, J. (2004). A critique of statistical aspects of ecological studies in

spatial epidemiology. Environmental and Ecological Statistics, 11:31–54.

23

Table 1: CD level health outcome and risk factor summary data for the simu-lated population of NSW

Total Mean Standard Coeff. 2.5 % 97.5 %(’000) Deviation Variation percentile percentile

BMI 27.1 5.17 0.191 23.4 39.2Diabetes 199 0.031 0.0015 0.049 0.028 0.034Angina 121 0.019 0.0019 0.099 0.015 0.023Smoking 1049 0.164 0.0071 0.043 0.151 0.178Sedentary 1955 0.306 0.0070 0.023 0.293 0.320Obesity 763 0.120 0.0033 0.028 0.113 0.126Dietary Fat 2924 0.458 0.0070 0.015 0.445 0.472

24

Table 2: Population targets and constraints used to create sets of zones inAZTool and the resulting population statistics averaged for each scale.Scale Pop. Range No. Mean Std Min Max Coeff.

Target (’000) Zones Dev VariationCD 11879 537 258 3 2755 0.481

L1 1000 0.5 – 3 6214 1026 210 636 2755 0.205L2 2000 1–4 3168 2013 237 1373 3180 0.117L3 4000 2–8 1585 4024 309 2783 5502 0.077L4 6000 3–12 1056 6040 376 4429 8269 0.062L5 8000 4 – 16 792 8053 444 5904 10464 0.055L6 10000 5 – 20 634 10060 506 8004 12613 0.05L7 15000 7.5 – 30 423 15078 678 11881 18018 0.044L8 20000 10 – 40 317 20120 834 17037 23781 0.041

25

Table 3: Comparison of the ecological averages of the parameter estimates(Ek[β

Er,k]) for levels 8, 4 and 1 with the corresponding CD and multilevel esti-

mates (using groups at scale lk = 4) (line 1); and variance estimates V ark(βE)

(and Ek[SE(βEr,k)) (line 2) for the statistical models for BMI, angina and dia-

betes. For the multilevel estimates V ark(βE) < 1e−04 so they are not included.

Covariates for age and sex were also included in the model for BMI.Level 8 Level 4 Level 1 CD Model ML Model

BMIConstant 27.8⋄ 27.4 27.9 27.9 29.9

0.731 (1.76) 0.214 (0.841) 0.0262 (0.283) (0.543)Sedentary -4.4⋄ -2.36 -1.28 -0.235 0.508

0.488 (1.35) 0.128 (0.653) 0.0122 (0.234) (0.101)Smoking 6.3⋄ 4.56 2.24 1.68e-04 -0.203

0.191 (1.01) 0.0667 (0.511) 0.0108 (0.203) (0.122)Dietary Fat -1.78⋆ -1.98 -0.694 0.584 -1.26

0.395 (1.40) 0.11 (0.661) 0.0138 (0.217) (0.1)HSEIA -0.00242⋄ -0.00229 -0.00185 -0.00129 -0.00279

1.46e-07 (8.18e-04) 4.62e-08 (3.89e-04) 6.86e-09 (1.24e-04) (4.77e-04)

ANGINAConstant 4.48 3.34 1.59 0.8 -2.61

0.176 (0.702) 0.0633 (0.383) 0.00745 (0.157) (0.257)Sedentary 3.74 3.46 2.75 2.31 0.988

0.049 (0.478) 0.019 (0.272) 0.00323 (0.129) (0.0605)Obesity -6.38 -5.34 -3.6 -2.67 0.528

0.144 (0.641) 0.0483 (0.375) 0.00687 (0.19) (0.0734)Dietary Fat -5.65 -4.26 -1.83 -0.74 -0.55

0.102 (0.551) 0.0384 (0.325) 0.0054 (0.149) (0.0628)HSEIA -0.00223 -0.00178 -0.00117 -8.69e-04 -0.00167

3.56e-08 (3.18e-04) 1.28e-08 (1.75e-04) 1.65e-09 (7.42e-05) (2.54e-04)σ2ν 0.012 0.0185 0.0404 0.0545 0.151

DIABETESConstant -2.52 -1.74 -0.477 -0.0708 -2.44

0.0695 (0.475) 0.0221 (0.268) 0.0027 (0.114) (0.214)Sedentary 0.803⋄ 0.595 0.0697⋆ -0.138 0.629

0.0202 (0.321) 0.00864 (0.189) 0.00143 (0.0938) (0.0471)Obesity 0.416⋆ 0.333⋆ 0.538 0.822 1.33

0.0704 (0.443) 0.0216 (0.271) 0.00321 (0.141) (0.0495)Dietary Fat 3.6 2.84 1.58 1.07 -0.72

0.048 (0.38) 0.0157 (0.231) 0.00225 (0.108) (0.0502)HSEIA 0.000565⋄ 0.00021⋆ -0.000348 -4.98e-04 -0.00126

1.37e-08 (2.14e-04) 4.31e-09 (1.21e-04) 5.37e-10 (5.35e-05) (2.11e-04)σ2ν 0.0047 0.0082 0.0131 0.0174 0.0815

Standard errors in parentheses

⋄ indicates that parameter estimate significant for less than 90% of zones

⋆ indicates that parameter estimate significant for less than 50% of zones

26

Table 4: Relative width of the predictive interval for a parameter estimateobtained for a new set of zones (in the same study area) at the same scale. It iswritten as a percentage of the average value of the parameter estimate, X, i.e.

E[β](1±X) where X = 1.96

√V ark(β)/Ek[βE ]

Level 1 2 3 4 5 6 7 8BMIConstant 0.0114 0.0194 0.0271 0.0331 0.0383 0.0424 0.0512 0.0603Sedentary 0.169 0.245 0.298 0.297 0.302 0.301 0.298 0.311Smoking 0.091 0.0912 0.0894 0.111 0.119 0.123 0.136 0.136Diet. Fat 0.331 0.336 0.292 0.329 0.376 0.449 0.569 0.691HSEIA 0.088 0.129 0.153 0.184 0.198 0.227 0.283 0.31

ANGINAConstant 0.106 0.13 0.147 0.148 0.153 0.164 0.17 0.184Sedentary 0.0405 0.0556 0.0728 0.078 0.0852 0.0909 0.101 0.116Obesity 0.0452 0.0588 0.0713 0.0807 0.0854 0.0925 0.102 0.117Diet. Fat 0.0786 0.0867 0.0885 0.0903 0.0899 0.0973 0.102 0.111HSEIA 0.0679 0.0937 0.118 0.124 0.131 0.143 0.152 0.166

DIABETESConstant 0.214 0.21 0.179 0.167 0.169 0.177 0.183 0.205Sedentary 1.06 0.504 0.32 0.306 0.313 0.304 0.326 0.347Obesity 0.206 0.568 0.92 0.864 0.921 1.06 1.04 1.25Diet. Fat 0.0588 0.0759 0.0849 0.0864 0.0876 0.0982 0.106 0.119HSEIA 0.13 0.465 1.91 0.612 0.478 0.441 0.386 0.406

27

Table 5: Variables from NHS0708 used in the HSEIA index with their loadings.Weight NHS0708 Variable Description-0.334 % Gross weekly equivalent cash income (∼ 2nd - 3rd decile)-0.294 % Persons over 15 with no post-school qualification-0.290 % Persons with a government concession card-0.266 % Families: one parent with dependent offspring only-0.256 % Households renting from government/community organisation-0.226 % Persons less than 70 with disability requiring help with core activities-0.223 % Occupation: labourers-0.174 % Labour force status: unemployed-0.169 % Occupation: machinery operators and drivers

0.123 % occupied private dwellings with four or more bedrooms0.142 % Persons over 15 years at university or other tertiary institution0.159 % Occupation: Managers0.264 % Persons over 15 Diploma/adv. diploma only non-school qualification0.298 Gross weekly equiv. cash income (∼ 9th - 10th decile)0.313 % Occupation: professionals0.324 % Persons with private health insurance

28

−6 −5 −4 −3 −2 −1

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

Sed

enta

ry

2 3 4 5 6 7

01

23

Sm

okin

g

−3 −2 −1 0

0.0

1.0

2.0

3.0

Regression parameter estimate

Die

t. Fa

t

−0.0040 −0.0030 −0.0020

010

0030

0050

00


HS

EIA

L1L2L3L4L5L6L7L8

Figure 1: Density plots of the zoning distribution of the ecological regressioncoefficients for Sedentary, obesity, dietary fat, HSEIA on BMI at eight scales

29

2.5 3.0 3.5 4.0 4.5

01

23

45

67

Sed

enta

ry

−7 −6 −5 −4

01

23

4

Obe

sity

−6 −5 −4 −3 −2

01

23

45


Die

t. Fa

t

−0.0025 −0.0020 −0.0015 −0.0010

020

0040

0060

0080

00


HS

EIA

L1L2L3L4L5L6L7L8

Figure 2: Density plots of the zoning distribution for the ecological regressioncoefficients for Sedentary, obesity, dietary fat and HSEIA on angina at eightscales

30

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

02

46

810

Sed

enta

ry

0.0 0.5 1.0

01

23

45

6

Obe

sity

1.5 2.0 2.5 3.0 3.5 4.0

02

46

8


Die

t. Fa

t

−4e−04 0e+00 4e−04 8e−04

050

0010

000

1500

0


HS

EIA

L1L2L3L4L5L6L7L8

Figure 3: Density plots of the zoning distribution for the ecological regressioncoefficients for Sedentary, obesity, dietary fat and HSEIA on diabetes at eightscales

31

0 4000 8000 12000

−4

−2

02

46

BMI

M

Ek(β

)SedentarySmokingDietary FatHSEIA

0 4000 8000 12000

−6

−4

−2

02

4

Angina

M0 4000 8000 12000

01

23

Diabetes

M

SedentaryObesityDietary FatHSEIA

Figure 4: Plot of the ecological average of the parameter estimates Ek[βE ]

versus M for each scale of analysis for BMI, angina and diabetes. The legendfor angina is the same as for diabetes

32

5000 15000

0.0

0.5

1.0

1.5

2.0

N

BM

ISedentarySmokingDietary FatHSEIA

5000 15000

0.0

0.1

0.2

0.3

0.4

N

Ang

ina


5000 15000

0.00

0.05

0.10

0.15

0.20

N

Dia

bete

s


Figure 5: Ek[V ar(βE)] versus N for each scale of analysis for BMI, angina anddiabetes

33

0.0 0.5 1.0 1.5 2.0

0.0

0.5

1.0

1.5

2.0

BMI

Ek(Var(β))

Var

k(β)

SedentarySmokingDietary FatHSEIA

0.0 0.1 0.2 0.3 0.4

0.0

0.1

0.2

0.3

0.4

Angina

Ek(Var(β))

Var

k(β)


0.00 0.05 0.10 0.15 0.20

0.00

0.05

0.10

0.15

0.20

Diabetes

Ek(Var(β))

Var

k(β)


Figure 6: Ek[V ar(βE)] versus V ark(βE) for each scale of analysis for BMI,

angina and diabetes

34

Date post:	04-Jan-2020
Category:	Documents
Upload:	others
View:	0 times
Download:	0 times

National Institute for Applied Statistics Research Australiaweb/... · 2015-06-07 · Health...

Documents