Post on 22-May-2018
Tabular Rounding in Iterative Proportional Fitting for
Population Synthesis in Activity-Based Models

Word count: 5852 (Text) + 7 * 250 (Tables & Figures) = 7602
Submission date: August 1, 2014
Submission of revised manuscript: November 13, 2014

Abdoul-Ahad Choupani
Ph.D. Candidate
Department of Civil and Environmental Engineering
Tarbiat Modares University
Gisha Bridge, Tehran, Iran
Phone: +98-21-33-255-473
Fax: +98-21-8288-4925
Email: ahad.choupani@modares.ac.ir

Amir Reza Mamdoohi (Corresponding Author)
Assistant Professor
Department of Civil and Environmental Engineering
Tarbiat Modares University
Gisha Bridge, Tehran, Iran
Phone: +98-21-8288-4925
Fax: +98-21-8288-4925
Email: armamdoohi@modares.ac.ir
Choupani and Mamdoohi 2
Abstract
Activity-based travel demand modeling requires socioeconomic micro-data of the population under study. Since acquiring such data for the entire population is infeasible or highly expensive, techniques like iterative proportional fitting (IPF) have been applied extensively to estimate such data synthetically. Despite its many advantages, IPF produces non-integer values: fractions of households or individuals are obtained for zones. Although methods have been proposed to integerize non-integer tables, this problem has seldom been viewed as tabular rounding. This paper proposes a binary linear programming model for tabular rounding in which the totals and marginals of the integerized tables perfectly fit the input data obtained from the census bureau. Furthermore, it minimizes distortion to the correlation structure of the household-level and individual-level non-integer tables. It does not bias joint or marginal distributions of socioeconomic attributes of population units or the sampling of rare demographic groups (at the α=0.05 significance level). The empirical comparison of the proposed method with eight existing ones demonstrates that the proposed model outperforms the tested methods. Sensitivity analysis demonstrates that integerization of small values is as significant as that of large values. In this study, deterministic methods outperform stochastic methods in terms of accuracy and perfect fit to census data. Finally, a scoring and ranking tool is used to reflect concisely the advantages and disadvantages of these methods.

Keywords: activity-based models, population synthesis, iterative proportional fitting, tabular rounding, binary linear programming.
1. INTRODUCTION
Activity-based travel demand forecasting has received increasing attention from researchers and practitioners. This modeling approach requires micro-data of the study area population, which are usually inaccessible due to confidentiality concerns or costs. The two most widely applied techniques to produce the micro-data synthetically are iterative proportional fitting (IPF) [1] and combinatorial optimization (CO) [2]. Other techniques have also been proposed, such as a sample-free method to synthesize the Belgian population [3]. Farooq et al. [4] applied Markov Chain Monte Carlo simulation to synthesize the population of Switzerland. IPF, however, has been the workhorse for synthesizing populations due to its many advantages: requiring little census data, computational ease and speed [1, 5, 6], and guarantee of convergence [7, 8].
A disadvantage of IPF is that it produces non-integer values: fractions of households or individuals [1, 6, 9]. Although this is not a problem for many applications such as aggregate zonal-based modeling of travel demand, disaggregate agent-based modeling requires integer rather than fractional values. CO produces integer estimates of households/individuals for a given zone, which is why some researchers favor this technique [10]. However, CO needs considerable time for convergence [11] because it does not necessarily produce better results at every iteration, owing to the random selection of agents in the solution algorithms [6].
The usefulness of IPF in population synthesis can be enhanced if the integerization problem is solved. Lovelace and Ballas [6] proposed four integerization methods and compared them with the conventional rounding method. However, none of their methods and very few of the existing ones regard the problem as a tabular rounding problem.
The aims of this paper are as follows:
1. Propose a tabular rounding method that keeps aggregate totals and marginals the same and maintains a close similarity between the correlation structures of estimated tables and reference tables [8, 13, 14],
2. Investigate the bias caused by integerization in general and by deterministic methods in particular,
3. Compare stochastic and deterministic integerization methods, and
4. Study the effects of integerization on sampling of rare demographic groups.
The rest of this paper is organized as follows. Section 2 reviews the literature and familiarizes readers with IPF and the integerization problem. Section 3 describes the proposed and existing methods. In Section 4, the results of the integerization methods are discussed and their performances are compared and ranked. The paper ends with concluding remarks.
2. LITERATURE REVIEW
A three-way count table Sijk cross-classifies variables X, Y, and Z of sampled agents from a large area (region) into I, J, and K categories, respectively. Each cell Sijk is a count of observations classified into categories i, j, and k of the first, second, and third variables, respectively. However, such counts (nijk) are not available for the population of small areas (zones) and only the marginals are available (ni++, n+j+, and n++k). Each marginal cell $n_{i++} = \sum_{j}\sum_{k} n_{ijk}$ contains the total number of observations in category i of variable X (similarly for n+j+ and n++k). Population synthesizers generally use IPF, a well-known statistical technique similar to the Furness technique, for the estimation of nijk.
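The IPF step described above can be sketched in a few lines. This is a minimal illustration, assuming a three-way seed table stored as nested lists and three one-way marginals; the function name `ipf_3way` and its parameters are hypothetical and are not taken from the paper, whose implementation was written in MATLAB.

```python
def ipf_3way(seed, row_m, col_m, lay_m, max_iter=100, tol=1e-9):
    """Scale a 3-way seed table until it matches the marginals n_i++, n_+j+, n_++k."""
    I, J, K = len(row_m), len(col_m), len(lay_m)
    n = [[[float(seed[i][j][k]) for k in range(K)] for j in range(J)] for i in range(I)]

    # For each dimension: the list of cell groups (slices) and their target marginals.
    slices = [
        ([[(i, j, k) for j in range(J) for k in range(K)] for i in range(I)], row_m),
        ([[(i, j, k) for i in range(I) for k in range(K)] for j in range(J)], col_m),
        ([[(i, j, k) for i in range(I) for j in range(J)] for k in range(K)], lay_m),
    ]

    def total(cells):
        return sum(n[i][j][k] for i, j, k in cells)

    for _ in range(max_iter):
        # One IPF cycle: proportionally adjust each slice to its marginal, one dimension at a time.
        for groups, targets in slices:
            for cells, target in zip(groups, targets):
                s = total(cells)
                if s > 0:
                    for i, j, k in cells:
                        n[i][j][k] *= target / s
        # Stop when every marginal is reproduced within the tolerance.
        gap = max(abs(total(cells) - target)
                  for groups, targets in slices
                  for cells, target in zip(groups, targets))
        if gap < tol:
            break
    return n


# Hypothetical 2x2x2 example with a uniform seed and marginals that all sum to 10.
fit = ipf_3way([[[1, 1], [1, 1]], [[1, 1], [1, 1]]], [6, 4], [5, 5], [3, 7])
```

The fitted cells are generally fractional (here, for instance, 0.9 agents in cell (0, 0, 0)), which is the integerization problem the paper addresses.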
Once the joint distribution is determined, with each cell indicating the number of agents of a certain type residing in a specific zone, the number of agents given by each cell is selected randomly from the corresponding set of records available in the sample and placed into the zone population. However, the problem is that these tables are not integer.
Therefore, a similarly structured integerized table Nijk is desired, in which the marginals and totals of Nijk correspond to those of nijk, e.g. n+++ = N+++ and ni++ = Ni++.
Synthesizers use two approaches to deal with non-integer tables. The first approach integerizes non-integer tables and then performs the selection [13]. The second approach carries out integerization indirectly. Fitted tables are treated as joint probability mass functions (PMF) giving the probabilities of observing agents with specific demographics. n (n = table total) Monte Carlo draws with replacement are then used to select n agents from the sample records and to construct the zone population [6, 14-17]. The second approach considers all types of agents simultaneously for selection, which results in a longer list of selections and requires more time and memory. Theoretically, agents with higher probabilities are likely to appear more often, while agents with lower probabilities are less likely to appear in the synthesized population [6]. However, there is a nonzero chance that an agent with a lower probability (e.g. 0.002) is replicated more times than an agent with a higher probability (e.g. 0.004). On the other hand, a disadvantage of the first approach is that the integerized table is not the best solution in terms of information discrimination: a value of 0.501 is treated the same as 0.999 [11] (see TABLE 1 for further comparison of the two approaches). We revisit the first approach, which arguably outperforms the second, i.e. tables are integerized and then agents are selected from the sample records.
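The indirect (second) approach can be illustrated with a short sketch. It assumes the fitted table is held as a dictionary from cell index to fitted value; the function name `monte_carlo_counts` and the example values are hypothetical.

```python
import random
from collections import Counter


def monte_carlo_counts(table, rng):
    """Draw n cells with replacement, treating the fitted table as a PMF."""
    cells = list(table.keys())
    weights = [table[c] for c in cells]   # proportional to the cell probabilities
    n = round(sum(weights))               # table total (assumed to be an integer)
    draws = rng.choices(cells, weights=weights, k=n)
    return Counter(draws)                 # integer replication count per cell


# Hypothetical fitted values for three agent types in one zone.
table = {(0, 0, 0): 2.5, (0, 0, 1): 0.5, (0, 1, 0): 1.0}
counts = monte_carlo_counts(table, random.Random(42))
```

The counts always add up to the table total, but nothing prevents a low-probability cell from being drawn more often than a high-probability one in a given run, which is the drawback noted in the text.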
Conventional rounding (which rounds fractions down to 0 if they are smaller than 0.5, and up to 1 otherwise) minimizes the overall discrepancy between unrounded and rounded values; however, the rounded values may fail to add up to the original total [18]. The implicit assumption of conventional rounding is that the frequencies of fractional parts smaller and greater than 0.5 are equal, which would result in additive rounding. Since this assumption does not always hold true, non-additive rounding will occur in most cases.
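A small numerical illustration of non-additive rounding follows. The cell values are hypothetical and chosen only so that three of the four fractional parts fall below 0.5.

```python
import math


def round_half_up(x):
    """Conventional rounding: fractions below 0.5 go down, all others go up."""
    # Python's built-in round() rounds halves to the nearest even integer,
    # so floor(x + 0.5) is used to mirror the rule described in the text.
    return math.floor(x + 0.5)


cells = [0.4, 0.4, 0.4, 0.8]                  # hypothetical fitted values, total = 2
rounded = [round_half_up(x) for x in cells]   # [0, 0, 0, 1]
lost = round(sum(cells)) - sum(rounded)       # one agent is lost from the total
```

Each cell is rounded to its nearest integer, yet the rounded table contains one agent fewer than the fitted table.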
The construction of additive rounding in multi-way tables is much more involved because rounding should retain not only the additivity of the table total but also the additivity of the marginals [19]. Since the error in the marginals can be very large for large tables, sophisticated methods are needed for rounding multi-way tables.
IPF also results in sparse tables, i.e. the tables are composed of many cells, most of which have low values. FIGURE 1.a shows (for the case study of the current paper) that household-level tables contain values ranging from small values close to zero to values greater than 100, but 15% of values fall in the narrow interval [0, 1). Integerization of these small values is much more sensitive than that of large values.
The aim of this paper is to propose a tabular rounding method and to compare its performance with those of eight existing methods. Among the existing methods, an attempt was made to select the most competitive ones. However, to demonstrate that conventional methods (e.g. rounding) may not be appropriate, a few such methods are included in this study.
Because synthesizing population at both household and individual levels is becoming common practice [12, 16, 20], integerization effects at both levels are studied empirically for the case of Wyoming State, US.
TABLE 1 A Comparison of Two Approaches to Integerize Non-Integer Tables

No. | Criteria                                                                                                                        | Indirect approach         | Direct approach | Reference
1   | How the approach treats the fitted tables                                                                                       | Probability mass function | Count table     | [1]
2   | Shorter list of selections                                                                                                      |                           |                 | [9]
3   | Requiring less time and memory                                                                                                  |                           |                 | [9]
4   | Replicating an agent with higher replication count (or probability) more times than an agent with lower replication count (or probability) |                |                 | [5]
5   | Discrimination information                                                                                                      |                           |                 | [6, 11]
6   | Selection of agents with a specific order                                                                                       |                           |                 | a
7   | Possibility of investigating integerization and selection effects separately                                                    |                           |                 | a

a. Criterion proposed by the current paper.
FIGURE 1 Relative frequency of values (x) of estimated (a) Household-Level tables (observations=8223), and (b) Individual-Level tables (observations=6213) for Census Tracts (CTs) of Wyoming State. Shares shown: (a) 0≤x<1: 15%, 1≤x<10: 38%, 10≤x<100: 40%, 100≤x: 7%; (b) 0≤x<1: 5%, 1≤x<10: 43%, 10≤x<100: 24%, 100≤x: 28%.

3. METHODOLOGY
Most integerization methods truncate non-integer numbers to separate integer parts from their fractional parts. Integer parts are usually considered as deterministic numbers, which determine how many times agents of a certain type should be "cloned" and placed into the synthesized population. Then the remaining fractional parts are used to determine, deterministically or stochastically, whether an agent of a certain type should be placed into the population one more time or not. Once the fractional parts are rounded, they are added to the stored integer parts of the numbers to give the total number of times that an agent should be placed into the zone population.
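The truncation step can be written as a short helper. This is a sketch only; the function name `split_cells` and the example values are hypothetical.

```python
import math


def split_cells(values):
    """Separate each fitted cell value into its integer part and its fractional part f."""
    int_parts = [math.floor(v) for v in values]
    frac_parts = [v - i for v, i in zip(values, int_parts)]
    return int_parts, frac_parts


# Hypothetical fitted cell values: 2.75 agents means 2 certain clones plus a
# fractional part of 0.75 that still has to be rounded to 0 or 1.
ints, fracs = split_cells([2.75, 0.25, 3.0])
```

Every method discussed below differs only in how the fractional parts are turned into 0 or 1; the final count for a cell is its integer part plus that rounded value.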
In the following, we briefly describe how the existing methods integerize the fractional parts of numbers. Then, the proposed method is described in detail.
1. The first method, rounding, refers here to conventional rounding, where the fractional parts are rounded down if they are smaller than 0.5 and are rounded up otherwise.
2. PopGenRandom [20-22] uses a random integerization method that can be represented by:

$$F_{ijk} = \begin{cases} 1 & \text{with probability } p^{u}_{ijk} = f_{ijk} \\ 0 & \text{with probability } p^{d}_{ijk} = 1 - f_{ijk} \end{cases} \qquad (1)$$

where
Fijk = the integerized value of fijk,
puijk = the probability of rounding fijk up to 1,
pdijk = the probability of rounding fijk down to 0, and
fijk = the fractional part of cell values.
Then, it adds ±1 to rounded cells to account for the difference between the sums of the rounded and unrounded values. If the integerization underestimates table totals, it adds +1 to the rounded-down cells. If the integerization overestimates table totals, it adds -1 to the rounded-up cells.
3. PopGenBucket [21] applies bucket rounding to the cells of a specific table. This deterministic method keeps track of the accumulated rounding errors. The accumulated rounding error in the previous cell of a table is used to bias the rounding of the next cell in the same table. The procedure preserves the populations of agents (table totals) in the zone.
4. PopGenRounding [21] first uses conventional rounding. The sum of rounded values may not add to the table total in most cases. Therefore, PopGenRounding adds ±1 to rounded cells in the way described for PopGenRandom.
5. TRANSIMS [23] applies bucket rounding to a specific cell of all tables. This process preserves the total population of each demographic group in the study area, but may slightly change the total population of a given zone [23].
6. TRESIS [24] uses a simple rounding procedure, except that all non-zero values less than 1 are rounded up to 1. This method may be biased for values less than 1 and can yield aggregations that differ substantially from the control values [5].
7. ARC [25] and MORPC [26] round up, starting with the demographic groups with the largest fractional part, but avoid rounding up if it would cause a control value to be exceeded.
8. Truncate-Replicate-Sample (TRS) [6] is formally defined by:

$$F_{ijk} = \begin{cases} 1 & \text{with probability } p^{u}_{ijk} = \dfrac{f_{ijk}}{\sum_{i}\sum_{j}\sum_{k} f_{ijk}} \\ 0 & \text{with probability } p^{d}_{ijk} = 1 - p^{u}_{ijk} \end{cases} \qquad (2)$$

where
Fijk = the integerized value of fijk,
puijk = the probability of rounding fijk up to 1,
pdijk = the probability of rounding fijk down to 0, and
fijk = the fractional part of cell values.
The puijk is proportional to the share of fijk in $\sum_{i}\sum_{j}\sum_{k} f_{ijk}$. This method guarantees that higher fractional parts have a higher chance of being rounded up. Also, rounding until $\sum_{i}\sum_{j}\sum_{k} f_{ijk} = \sum_{i}\sum_{j}\sum_{k} F_{ijk}$ ensures that the population size remains the same.
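Two of the ideas in the list above, bucket rounding (methods 3 and 5) and TRS-style sampling (method 8), can be sketched as follows. These are simplified illustrations over a flat list of values, not the PopGen, TRANSIMS, or TRS code; in particular, drawing cells without replacement in `trs_round` is an assumption made here so that the number of rounded-up cells equals the sum of the fractional parts. All names are hypothetical.

```python
import math
import random


def bucket_round(values):
    """Deterministic bucket rounding: carry the accumulated rounding error forward."""
    out, carry = [], 0.0
    for v in values:
        target = v + carry               # bias the cell by the error accumulated so far
        r = math.floor(target + 0.5)     # round the biased value to the nearest integer
        out.append(r)
        carry = target - r               # error passed on to the next cell
    return out


def trs_round(fracs, rng):
    """Round up round(sum(fracs)) cells, sampled with weights equal to the fractional parts."""
    need = round(sum(fracs))
    remaining = [i for i, f in enumerate(fracs) if f > 0]
    F = [0] * len(fracs)
    for _ in range(min(need, len(remaining))):
        weights = [fracs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        F[pick] = 1
        remaining.remove(pick)           # a cell can be rounded up only once
    return F


fracs = [0.25, 0.5, 0.75, 0.5]           # hypothetical fractional parts, total = 2
bucket = bucket_round(fracs)             # [0, 1, 1, 0]
sampled = trs_round(fracs, random.Random(1))
```

Both functions preserve the total of the list, but neither looks at row, column, or layer marginals, which is the gap that tabular rounding is meant to close.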
The existing methods are not tabular rounding methods, in which the totals and marginals of integerized tables perfectly fit the input data obtained from the census bureau. Tabular rounding also minimizes distortion to the original non-integer table.
One way to perform tabular rounding of two-way tables is to apply the transportation problem [18]. The transportation problem is a well-known problem in economics that originally aimed at determining a minimal-cost shipping schedule between sources and destinations [27]. The classic statement of this problem uses a matrix with the rows representing sources and the columns representing destinations. The margins of the matrix show supplies and demands. The costs of shipping from sources to destinations are indicated by the entries in the matrix.
If the integerization of two-way tables containing positive non-integers is formulated similarly to the linear program of the transportation problem [18], the integerization problem may not, at first, appear to have integer solutions. The "triangular basis property" of the transportation problem is then invoked to prove that solutions do exist and that the optimal solutions are integer if the marginals are integer [28]. This property still holds when the integer values are restricted to the set {0, 1} by introducing capacity constraints on the decision variables.
The above properties of the transportation problem help to find the solution of tabular rounding of two-way tables without solving an integer programming model. However, it was proved [22] that the three-dimensional linear program does not always have integer solutions.
Thus, an integer programming model similar in construction to the transportation problem is proposed, with modifications, for integerizing three-way tables. It is given below as program A (P-A):

$$\text{Minimize} \quad \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} -f_{ijk}\,F_{ijk}$$

Subject to:

$$\sum_{j=1}^{J}\sum_{k=1}^{K} F_{ijk} = \sum_{j=1}^{J}\sum_{k=1}^{K} f_{ijk}, \quad i = 1, 2, \dots, I$$

$$\sum_{i=1}^{I}\sum_{k=1}^{K} F_{ijk} = \sum_{i=1}^{I}\sum_{k=1}^{K} f_{ijk}, \quad j = 1, 2, \dots, J$$

$$\sum_{i=1}^{I}\sum_{j=1}^{J} F_{ijk} = \sum_{i=1}^{I}\sum_{j=1}^{J} f_{ijk}, \quad k = 1, 2, \dots, K$$

$$F_{ijk} \in \{0, 1\} \qquad \text{(P-A)}$$

where
Fijk = the integerized value of fijk, and
fijk = the fractional part of cell values.
The constraints guarantee that the observed marginals do not change due to the integerization. If the marginals remain the same (e.g. f++k = F++k), the totals of the tables remain the same (n+++ = N+++) through integerization. Program A (P-A) minimizes the error introduced to the results of IPF. Like conventional rounding, this program gives cells with higher fijk a greater chance of being rounded up. If nijk is integer, there is no need to consider it in the program, which is typically solved for each zone and generally has I*J*K binary variables and I+J+K equality constraints.
P-A scales to a larger number of dimensions. If more dimensions (variables) are used, the rounding problem can still be converted to a binary linear programming (BLP) model, in which the objective function and constraints are convex. Therefore, the model always has an integer solution irrespective of the number of dimensions.
The characteristics of the existing methods and the proposed one are summarized in TABLE 2.
4. RESULTS AND DISCUSSION
This section compares integerization methods for Census Tracts (CTs) as zones in the state of Wyoming using year 2010 census data. Wyoming State had a population of 563,626 persons residing in 226,879 households in 132 CTs (1 CT had zero households). Marginal distributions of control variables were extracted from the Summary File 1 (SF-1), while the one-percent Public Use Microdata Sample (PUMS), which includes 5,665 persons residing in 2,475 households, served as the sample from which reference tables were established.
The household-level control variables were as follows:
1. Household size (7 categories),
2. Householder age (7 categories), and
3. Householder race (4 categories).
The individual-level control variables were as follows:
1. Gender (2 categories),
2. Age (7 categories), and
3. Race (4 categories).
The household-level and individual-level tables contain 196 (7*7*4) and 56 (2*7*4) cells, respectively.
IPF and all methods were coded in MATLAB. To guarantee representative results for the stochastic methods, they were run 30 times and the best-fit result with the minimum total absolute error (TAE) was selected.
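The best-of-30 selection for stochastic methods can be sketched as below. The stochastic rule used here is the cell-wise random rounding of equation (1) without the subsequent ±1 adjustment, which is a simplification; the function names, the seed, and the example values are hypothetical.

```python
import math
import random


def random_round(values, rng):
    """Round each cell up with probability equal to its fractional part."""
    out = []
    for v in values:
        base = math.floor(v)
        out.append(base + (1 if rng.random() < v - base else 0))
    return out


def tae(rounded, values):
    """Total absolute error between an integerized list and the fitted list."""
    return sum(abs(r - v) for r, v in zip(rounded, values))


def best_of_runs(values, runs=30, seed=0):
    """Run the stochastic rounding several times and keep the run with the lowest TAE."""
    rng = random.Random(seed)
    candidates = [random_round(values, rng) for _ in range(runs)]
    return min(candidates, key=lambda c: tae(c, values))


values = [0.25, 1.5, 2.75]               # hypothetical fitted cell values
best = best_of_runs(values)
```

Selecting on TAE alone tends to favor runs that happen to round most cells to their nearest integer.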
After applying the different integerization methods, the same method is applied for the selection of households/individuals. Since there is no variation between methods regarding selection, only the analysis of the integerization step is presented in this paper. For selecting households/individuals, the method proposed by Auld and Mohammadian [16], which aims at synthesizing population at both the household and person levels simultaneously, is used identically for all methods.
TABLE 2 Existing and Proposed Integerization Methods and Their Characteristics

No. | Method                    | Synthesizer               | Country | Integerization                                                                     | Perfect fit to table totals? | How obtained                       | Perfect fit to marginals? | Reference | TOPSIS weight | TOPSIS rank
1   | Rounding                  | TRANSIMS 3.0 & SimBRITAIN | US & UK | Conventional rounding                                                              | No                           |                                    | No                        | [6, 29]   | 0.5272        | 7
2   | PopGenRandom              | PopGen                    | US      | Random rounding of each cell separately                                            | Yes                          | Adding ±1 to integerized cells     | No                        | [21]      | 0.5407        | 6
3   | PopGenBucket              | PopGen                    | US      | Applying bucket rounding to cells of a specific table                              | Yes                          | Inherently results in perfect fit  | No                        | [21]      | 0.6894        | 3
4   | PopGenRounding            | PopGen                    | US      | Conventional rounding                                                              | Yes                          | Adding ±1 to integerized cells     | No                        | [21]      | 0.6778        | 4
5   | TRANSIMS                  | TRANSIMS 4.0.10           | US      | Applying bucket rounding to a specific cell of all tables                          | No                           |                                    | No                        | [23]      | 0.5272        | 7
6   | TRESIS                    | TRESIS                    | AU      | Rounding up values lower than 1 and rounding other cells                           | No                           |                                    | No                        | [24]      | 0.5272        | 7
7   | ARC                       | ARC & MORPC               | US      | Rounding up cells with the largest fractional parts if they do not exceed marginals | No                          |                                    | No                        | [25, 26]  | 0.6933        | 2
8   | TRS                       | SimBRITAIN                | UK      | Random rounding of all cells simultaneously                                        | Yes                          | Inherently results in perfect fit  | No                        | [6]       | 0.6056        | 5
9   | The proposed optimization | a                         | a       | Binary linear programming                                                          | Yes                          | Inherently results in perfect fit  | Yes                       | b         | 0.8437        | 1

a. This method is not intended for a specific synthesizer (or case) and is generic.
b. Proposed method of this paper.
4.1. Accuracy
The first statistic used for comparison of methods is absolute error. FIGURE 2 shows the cumulative percentage distributions of absolute errors of table totals and marginals. As FIGURE 2.a and c show for the totals of household-level and individual-level tables, 100% of absolute errors are zero in the proposed, TRS, and PopGen methods (note the single triangle at the Cartesian coordinate (0, 100) of these figures). ARC is the second-best method, since 90% and 100% of absolute errors of household-level and individual-level tables are equal to or less than 1, respectively (as can be seen in the aforementioned figures).
Marginals classify the population of agents into distinct groups. As FIGURE 2.b and d show for household-level and individual-level tables respectively, 100% of the absolute errors of marginals are zero in the proposed method. This is due to keeping all marginals the same during integerization using linear constraints in the optimization problem. It can also be seen in these figures that around 90% of the absolute errors of household-level or individual-level marginals are zero in ARC. This good performance is due to rounding up, starting with the demographic groups with the largest fractional part, but avoiding rounding up if it would cause a control value to be exceeded.
Total absolute error (TAE) is used to measure the aggregate errors of a table as shown below [30]:

$$TAE = \sum_{i}\sum_{j}\sum_{k} \left| N_{ijk} - n_{ijk} \right| \qquad (3)$$

where
TAE = total absolute error,
Nijk = the integerized value of cell ijk, and
nijk = the non-integer value of cell ijk.
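Equation (3) translates directly into code. The sketch below assumes both tables are nested lists of the same shape; the function name and the example values are hypothetical.

```python
def total_absolute_error(N, n):
    """Sum of |N_ijk - n_ijk| over all cells of two equally shaped 3-way tables."""
    return sum(
        abs(N[i][j][k] - n[i][j][k])
        for i in range(len(n))
        for j in range(len(n[i]))
        for k in range(len(n[i][j]))
    )


# Hypothetical 1x2x2 tables: integerized values N and fitted values n.
N_table = [[[1, 0], [2, 1]]]
n_table = [[[0.5, 0.5], [1.75, 1.25]]]
error = total_absolute_error(N_table, n_table)   # 0.5 + 0.5 + 0.25 + 0.25
```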
FIGURE 3.a and b show that the lowest TAEs belong to methods applying conventional rounding (including rounding and PopGenRounding), since they choose the closest integers to fractional values. However, stochastic rounding methods (including TRS and PopGenRandom) and bucket rounding methods (including TRANSIMS and PopGenBucket) have the largest TAEs. The errors of the remaining methods (including the proposed, TRESIS, and ARC) are moderate.
Choupani and Mamdoohi 13
(b) Household-Level marginals (observations=2358) (a) Household-Level tables totals (observations=131)
(d) Individual-Level marginals (observations=1703) (c) Individual-Level tables totals (observations=131)
1
FIGURE 2 Cumulative percentage distributions of absolute errors of (a) Household-Level tables totals (observations=131), (b) 2
Household-Level marginals (observations=2358), (c) Individual-Level tables totals (observations=131), and (d) Individual-Level 3
marginals (observations=1703). 4
0
10
20
30
40
50
60
70
80
90
100
0 1 2 3 4 5 6
Per
cent
Absolute error
Optimiztaion
ARC
PopGenRounding
TRS
PopGenRandom
0
10
20
30
40
50
60
70
80
90
100
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28
Per
cent
Absolute error
Optimization, PopGen, and TRS
ARC
Rounding
TRANSISM
TRESIS
0
10
20
30
40
50
60
70
80
90
100
0 1 2 3 4 5
Per
cent
Absolute error
Optimiztaion
ARC
PopGenBucket
PopGenRounding, TRESIS, and
RoundingTRS & TRANSISM
0
10
20
30
40
50
60
70
80
90
100
0 1 2 3 4 5 6 7 8
Per
cent
Absolute error
Optimization, PopGen, and TRS
ARC
Rounding
TRESIS
TRANSISM
Choupani and Mamdoohi 14
(a) Household-Level tables (observations=131) (b) Individual-Level tables (observations=131)
FIGURE 3 Sum of Total Absolute Errors (TAEs) of (a) Household-Level tables (observations=131), and (b) 1
Individual-Level tables (observations=131). 2
3
2000
2200
2400
2600
2800
3000T
ota
l A
bso
lute
Err
or
(TA
E)
Rounding methods
1500
1700
1900
2100
2300
Tota
l A
bso
lute
err
or
(TA
E)
Rounding methods
Choupani and Mamdoohi 15
To shed more light on the performances of deterministic and stochastic methods, the deterministic PopGenRounding is compared with the two stochastic methods, PopGenRandom and TRS.
Both PopGenRandom and PopGenRounding integerize each cell independently from other cells. The former randomly decides whether to round each cell up or down, whilst the latter rounds the cells deterministically to the closest integer. To account for the difference between the sum of the rounded values and the sum of the unrounded values, both add ±1 to rounded cells. TRS randomly rounds all cells simultaneously using equation (2) and does not need further adjustments to preserve table totals.
FIGURE 2.a and c show for the three methods that 100% of the absolute errors of household-level and individual-level totals are 0. However, the errors of marginals are not similar. FIGURE 2.b and d show that around 95% of the absolute errors of household-level and individual-level marginals are equal to or less than unity in PopGenRounding. In TRS and PopGenRandom, however, 85% and 60% of the absolute errors of marginals are equal to or less than unity, respectively.
Furthermore, PopGenRounding results in lower TAEs compared to PopGenRandom and TRS. The sums of TAEs of household-level and individual-level tables are approximately 2,050 and 1,550 in PopGenRounding, respectively (FIGURE 3.a and b). These values are respectively 2,650 and 2,050 in PopGenRandom. In TRS, the sums of TAEs of household-level and individual-level tables are approximately 3,050 and 2,300, respectively (FIGURE 3.a and b). Among these three methods, the lowest and highest TAEs therefore belong to PopGenRounding and TRS, respectively. PopGenRounding outperforms TRS and PopGenRandom in our case. Collectively, it is found that deterministic methods perform better than stochastic methods.
The Freeman-Tukey (FT) test investigates whether joint (or marginal) distributions of attributes (of agents) remain the same through integerization (the null hypothesis) or not (the alternative hypothesis) [30]. This test uses the χ²-type FT statistic as defined below:

$$FT = 4 \sum_{i}\sum_{j}\sum_{k} \left( \sqrt{N_{ijk}} - \sqrt{n_{ijk}} \right)^{2} \qquad (4)$$

where
FT = the statistic of the FT test,
Nijk = integer value of cell ijk, and
nijk = non-integer value of cell ijk.
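Equation (4) can be computed as in the sketch below, which takes the cells of the two tables as flat lists. The function name and the example values are hypothetical; the resulting statistic would be compared with a critical value such as those listed in TABLE 3.

```python
import math


def freeman_tukey(N, n):
    """FT = 4 * sum((sqrt(N) - sqrt(n))^2) over corresponding cells."""
    return 4.0 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(N, n))


# Hypothetical two-cell example: integerized values and fitted values.
ft = freeman_tukey([4, 1], [2.25, 0.25])   # 4 * (0.5**2 + 0.5**2)
```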
It is known that this test may perform poorly when dealing with cell frequencies less than 5. However, this condition can be relaxed following investigations showing that χ² can still be a good approximation for the FT statistic even when some cell values are lower than 1 [31]. It was added that this test might also perform poorly for sparse tables containing both very small and moderately large values [32]. However, as can be seen in FIGURE 1.a and b, the household-level and individual-level tables contain values ranging from small values close to zero to values greater than 100, and table cells include a wide range of small, moderate, and large values. Therefore, χ² can be a good approximation for the FT statistic even when some cell values are smaller than 1.
Results of the FT test (in TABLE 3) show that the null hypothesis is not rejected for the proposed method. This means that the proposed integerization method does not change joint (or marginal) distributions of agents' attributes and does not introduce bias into the synthesis process. The FT test (in TABLE 3) demonstrates that, with two exceptions, the existing integerization methods do not change joint (or marginal) distributions of attributes (at the α=0.05 significance level). TRESIS changes (at the α=0.05 significance level) marginal distributions of householder age and race in 2 and 7 CTs, respectively (collectively 9 CTs). PopGenRandom also changes joint distributions of individuals' attributes in 3 CTs (at the α=0.05 significance level). These 12 CTs are less populated zones whose populations are close to the minimum population or below the average.
It is desirable to ensure that the fit is good overall, with no problems at any point. The cell-by-cell Z test determines whether the probability of observing agents with specific socio-demographics in a specific zone is changed by integerization or not [30]. The test statistic is given by:

$$z_{ijk} = \frac{\hat{p}_{ijk} - \hat{P}_{ijk}}{\sqrt{\dfrac{\hat{p}_{ijk}\left(1 - \hat{p}_{ijk}\right)}{n}}} \qquad (5)$$
TABLE 3 Number of Census Tracts (CTs) in which Null Hypothesis was Rejected

(a) Household-Level tables

No. | Method                | FT test a, marginals: Household size (FTcritical=12.5916, df=6) | FT test a, marginals: Householder age (FTcritical=12.5916, df=6) | FT test a, marginals: Householder race (FTcritical=7.8147, df=3) | FT test a, entire table (FTcritical=133.2569, df=108 (6*6*3)) | Z test a, cells (Zcritical=±1.960)
1   | Rounding              | 0 | 0 | 0 | 0 | 0
2   | PopGenRandom          | 0 | 0 | 0 | 0 | 0
3   | PopGenBucket          | 0 | 0 | 0 | 0 | 0
4   | PopGenRounding        | 0 | 0 | 0 | 0 | 0
5   | TRANSIMS              | 0 | 0 | 0 | 0 | 0
6   | TRESIS                | 0 | 2 | 7 | 0 | 0
7   | ARC                   | 0 | 0 | 0 | 0 | 0
8   | TRS                   | 0 | 0 | 0 | 0 | 0
9   | Proposed optimization | 0 | 0 | 0 | 0 | 0

(b) Individual-Level tables

No. | Method                | FT test a, marginals: Gender (FTcritical=3.8416, df=1) | FT test a, marginals: Age (FTcritical=12.5916, df=6) | FT test a, marginals: Race (FTcritical=7.8147, df=3) | FT test a, entire table (FTcritical=28.8693, df=18 (1*6*3)) | Z test a, cells (Zcritical=±1.960)
1   | Rounding              | 0 | 0 | 0 | 0 | 0
2   | PopGenRandom          | 0 | 0 | 0 | 3 | 0
3   | PopGenBucket          | 0 | 0 | 0 | 0 | 0
4   | PopGenRounding        | 0 | 0 | 0 | 0 | 0
5   | TRANSIMS              | 0 | 0 | 0 | 0 | 0
6   | TRESIS                | 0 | 0 | 0 | 0 | 0
7   | ARC                   | 0 | 0 | 0 | 0 | 0
8   | TRS                   | 0 | 0 | 0 | 0 | 0
9   | Proposed optimization | 0 | 0 | 0 | 0 | 0

a. α = 0.05
where
zijk = the statistic of the Z test for cell ijk,
$\hat{p}_{ijk}$ = probability of observing agents with specific socio-demographics (ijk) in the non-integer table (nijk/n),
$\hat{P}_{ijk}$ = probability of observing agents with specific socio-demographics (ijk) in the integer table (Nijk/N),
nijk = non-integer value of cell ijk,
Nijk = integer value of cell ijk, and
N = n = table total.
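A cell-level Z statistic of this kind can be sketched as follows. The sketch assumes that the proportion from the non-integer table is the reference and therefore enters the standard error in the denominator; that choice, the function name, and the example values are assumptions made for illustration.

```python
import math


def cell_z(n_cell, N_cell, total):
    """Z statistic comparing one cell's proportion before and after integerization."""
    p = n_cell / total        # proportion in the non-integer table (reference, assumed)
    P = N_cell / total        # proportion in the integerized table
    return (p - P) / math.sqrt(p * (1.0 - p) / total)


# Hypothetical cell: fitted value 2.5, integerized value 3, table total 10.
z = cell_z(2.5, 3, 10)
significant = abs(z) > 1.960   # two-sided test at the 0.05 level
```

In this example the change from 2.5 to 3 agents is not significant. The statistic is undefined when the reference proportion is 0 or 1, so such cells would need to be skipped.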
The Z test shows that the proposed method does not change the probabilities significantly (at the α=0.05 level). This holds true for all of the studied existing methods.
Pritchard and Miller [11] remarked that deterministic rounding might bias the estimates, particularly for cells representing "rare" characteristics with a count under 0.5. The cell-by-cell Z test demonstrates that there is no strong evidence that cell proportions change due to integerization. Since the tables contain values lower than 0.5 (FIGURE 1), it can be concluded that deterministic methods do not bias the estimates for values lower than 0.5.
Muller and Axhausen [9] stated that any integerization method might bias the population synthesis. However, considering the FT and Z statistics, there is no significant evidence that deterministic or stochastic integerization methods bias the result. Since integerization is part of the synthesis process, it can be concluded that, generally, integerization does not introduce bias into the synthesis process. Only a few methods (including PopGenRandom and TRESIS) resulted in limited bias.
4.2. Computational Effort
The time spent on the integerization of IPF tables, used as the criterion of computational effort, was measured on an Intel Core i3 (2.10 GHz) machine with 4 GB of RAM running Windows 7. All methods ran in one second except for the proposed and TRS methods. The proposed method took 5 seconds longer, which is a negligible difference. A single run and multiple (30) runs of the stochastic TRS method took 6 and 37 seconds, respectively. The proposed method is slower because it solves one optimization problem with a relatively large number of binary variables (I*J*K variables) for each zone. TRS is the slowest because it randomly rounds all cells simultaneously. Overall, run time does not appear to be a concern.
4.3. Sensitivity Analysis
Integerization of values smaller than 1, which represent rare demographic groups, may be much more sensitive than integerization of large values. Therefore, a coefficient M is introduced into program A (P-A) to test how the proportions of rare demographic groups vary due to integerization, as follows:
$$a_{ijk} = \begin{cases} f_{ijk} & \text{if } n_{ijk} > 1 \\ M \, f_{ijk} & \text{if } n_{ijk} < 1 \end{cases} \qquad (6)$$
where
aijk = coefficients of the decision variables in the objective function (of P-A),
M = a coefficient defined for conducting sensitivity analysis, and
fijk = the fractional parts of cell values.
If M=1, then aijk = fijk (as it appears in P-A). M=1 implies that P-A treats the integerization of small cell values the same as large ones.
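The effect of M on the objective coefficients in equation 6 can be illustrated with a short Python sketch. The function name `objective_coefficients` and the example cell values are hypothetical; only the rule itself (fijk for nijk > 1, M*fijk for nijk < 1) comes from equation 6.

```python
import math

def objective_coefficients(cell_values, M=1.0):
    """Return a_ijk following equation 6 for a flattened list of cell values n_ijk."""
    coeffs = []
    for n in cell_values:
        f = n - math.floor(n)  # fractional part f_ijk of the cell value
        # Cells below 1 (rare groups) are scaled by M; other cells keep f_ijk.
        # A cell equal to exactly 1 has f_ijk = 0, so its coefficient is 0 either way.
        coeffs.append(M * f if n < 1 else f)
    return coeffs

# Hypothetical cell values for one zone (illustrative only).
cells = [0.25, 0.5, 3.75]
base = objective_coefficients(cells, M=1.0)    # [0.25, 0.5, 0.75]
raised = objective_coefficients(cells, M=2.0)  # [0.5, 1.0, 0.75]
```

With M=2 the two cells below 1 carry twice the weight in the objective, while the coefficient of the larger cell is unchanged.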
However, there is a concern that small values (nijk < 1) are rounded down to zero and the corresponding demographic groups disappear from the zone population. This may cause bias by not sampling enough population units of the rare groups. Thus, the proportions of rare demographic groups in the zone population are assessed. The assessments show that the proportions remain the same in the study area (at the α = 0.05 significance level). This implies that the proposed method, which treats all values similarly, does not result in biased sampling.
As M increases, equation 6 places higher priority on rounding up values smaller than 1, and this may lead to biased oversampling of rare types. Decreasing M may result in biased undersampling of rare groups.
FIGURE 4 shows how the proportions of rare groups change with M. FIGURE 4.a shows for household-level tables that if M varies from 0.745 to 1.133, sampling of rare households is not biased (see the darkened trapezoid in the middle of the figure). If M>1.133, however, the proportion of rare households increases significantly (at the α = 0.05 significance level), which results in biased over-sampling of rare households (see the lighter trapezoid on the right-hand side of FIGURE 4.a). On the other hand, for M<0.745, the proportion of rare households decreases significantly (at the α = 0.05 significance level), which results in biased under-sampling of rare households (see the lighter trapezoid on the left-hand side of FIGURE 4.a). Therefore, for unbiased sampling, M should be between 0.745 and 1.133; moving M outside this fairly narrow range around 1 results in biased sampling. The best value for M is thus 1, at which the integerization of all values is treated the same.
FIGURE 4 The sensitivity of the proportions of rare demographic groups to changes in M for (a) Households (observations=131), and (b) Individuals (observations=131).
4.4. Scoring and Ranking
Since several criteria are used to compare the integerization methods, a multiple attribute decision-making (MADM) tool should be applied to assess the overall performance of the methods and rank them. MADM is a sub-discipline of operations research that studies how to evaluate and choose alternatives based on the values (and the preferences) of the decision maker. MADM techniques are divided into two classes: 1. compensatory, and 2. non-compensatory [33]. Compensatory techniques allow trade-offs between criteria, where a poor result in one criterion can be offset by a good result in another. Non-compensatory techniques do not allow such trade-offs. Considering that all methods show strengths and weaknesses according to the criteria taken into account, a compensatory approach should be used for the assessment and ranking [34]. There are many compensatory decision-making techniques in the literature, such as TOPSIS (technique for order of preference by similarity to ideal solution), AHP (analytic hierarchy process), ELECTRE (elimination et choix traduisant la réalité, or elimination and choice translating reality), and SAW (simple additive weighting) [33]. For brevity, only TOPSIS is explained herein. Interested readers are referred to [33] for details of other MADM techniques. Of the above methods, TOPSIS is selected for the following two reasons:
1. TOPSIS, unlike AHP and SAW, does not require expert judgment for comparison or weighting, and thus the decision-making process cannot be biased by subjective judgments.
2. TOPSIS is more widely known than methods such as ELECTRE and SAW.
[FIGURE 4, panel (a): x-axis M (0.4 to 1.6); y-axis % of rare households (0.16 to 0.36); regions from left to right: biased under-sampling, unbiased sampling, biased over-sampling; boundary points (0.745, 0.251) and (1.133, 0.295).]
[FIGURE 4, panel (b): x-axis M (0.4 to 1.8); y-axis % of rare individuals (0.025 to 0.05); regions from left to right: biased under-sampling, unbiased sampling, biased over-sampling; boundary points (0.62, 0.030) and (1.443, 0.046).]
TOPSIS is based on the concept that the chosen alternative should have the shortest Euclidean distance from the positive ideal solution (PIS), which has the best score in each criterion, and the farthest Euclidean distance from the negative ideal solution (NIS), which has the worst score in each criterion [33].
Alternatives that are closer to the PIS are more similar to it and receive higher ranks. The TAEs of table totals, marginals, and cells were the three uncorrelated criteria applied for comparison. Inferential statistics were not considered for ranking because all integerization methods performed nearly the same on these statistics.
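The distance-based ranking described above can be sketched in Python. This is a minimal illustration of the usual TOPSIS steps (vector normalization, ideal solutions, Euclidean distances, relative closeness) under the assumption of equal weights and of criteria where lower is better, as with TAE. The function name `topsis_closeness`, the method labels, and the TAE values are hypothetical and do not reproduce TABLE 2.

```python
import math

def topsis_closeness(scores):
    """Relative closeness to the ideal solution; rows are alternatives, columns are cost criteria."""
    n_crit = len(scores[0])
    # Vector-normalize each criterion column (a column of zeros is left unchanged).
    norms = []
    for j in range(n_crit):
        norm = math.sqrt(sum(row[j] ** 2 for row in scores))
        norms.append(norm if norm > 0 else 1.0)
    r = [[row[j] / norms[j] for j in range(n_crit)] for row in scores]
    # For cost criteria the PIS takes the column minimum and the NIS the column maximum.
    pis = [min(row[j] for row in r) for j in range(n_crit)]
    nis = [max(row[j] for row in r) for j in range(n_crit)]

    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    closeness = []
    for row in r:
        d_pos, d_neg = distance(row, pis), distance(row, nis)
        total = d_pos + d_neg
        closeness.append(d_neg / total if total > 0 else 1.0)
    return closeness

# Hypothetical TAEs (totals, marginals, cells) for three illustrative methods.
methods = ["A", "B", "C"]
tae = [[0, 0, 10], [0, 4, 12], [3, 6, 20]]
closeness = topsis_closeness(tae)
# Higher closeness means nearer to the PIS and farther from the NIS.
ranking = [m for _, m in sorted(zip(closeness, methods), reverse=True)]
```

Method A has the lowest error on every criterion, so it coincides with the PIS and ranks first; method C coincides with the NIS and ranks last.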
The results of TOPSIS (summarized in TABLE 2) show that the proposed deterministic method has the highest rank and ARC takes second place. The stochastic methods, TRS and PopGenRandom, are ranked 5th and 6th, respectively.
Although the best-performing method is the proposed deterministic method, and there are other deterministic methods (e.g., PopGenBucket) that outperform the stochastic methods, we propose that stochasticity be addressed in the selection stage. Once the integer frequency of each cell is obtained, the corresponding agents can be drawn randomly from the available sample records. Randomness is applied in the selection phase because the exact places of agents' residence (within a given region) are not known. The second reason for random selection is that agents that are similar in the controlled attributes (the attributes used for building tables) may be completely dissimilar in other attributes (uncontrolled variables). Moreover, if desired, stochastic integerization can also be applied to produce multiple solutions and quantitatively appraise the uncertainty introduced through integerization.
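The random selection stage can be sketched in Python as follows. This is a minimal sketch that assumes records are drawn with replacement and that sample records are already grouped by the controlled attributes of each cell; the function name `draw_agents`, the cell keys, and the record identifiers are hypothetical.

```python
import random

def draw_agents(integer_counts, sample_by_cell, seed=None):
    """Draw, with replacement, as many sample records per cell as its integer count."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    population = []
    for cell, count in integer_counts.items():
        if count == 0:
            continue  # nothing to draw for this cell
        candidates = sample_by_cell.get(cell, [])
        if not candidates:
            raise ValueError(f"no sample records available for cell {cell}")
        population.extend(rng.choices(candidates, k=count))
    return population

# Hypothetical integerized counts and sample records for one zone.
counts = {("male", "18-24"): 2, ("female", "65+"): 1, ("male", "65+"): 0}
sample = {("male", "18-24"): ["rec_01", "rec_07"], ("female", "65+"): ["rec_03"]}
population = draw_agents(counts, sample, seed=42)
```

The integer counts stay deterministic, while the records chosen within each cell vary from one seed to another, which is where the uncontrolled attributes introduce variation.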
5. CONCLUSION
The proposed binary linear programming model outperforms the other integerization methods tested in this paper and has the highest rank. As a tabular rounding method, it is inspired by the transportation problem, which stems from economics and was used previously to integerize two-way tables. The proposed model decides whether to round the fractional part of each cell value up or down while minimizing deviation from the IPF non-integer tables and keeping table totals and marginals unchanged through integerization.
Hypothesis testing demonstrates that the integerization methods do not bias the joint or marginal distributions of the attributes of agents in zones. They also do not significantly alter the proportions of demographic groups in the zones of the study area. In particular, the proposed method does not bias the joint or marginal distributions of agents' attributes or the sampling of infrequent types of agents (at the α = 0.05 significance level). Sensitivity tests reveal that the integerization of small and large values can be treated similarly.
Furthermore, deterministic and stochastic rounding methods are compared. The assessments demonstrate that deterministic methods produce better results in terms of accuracy and of fitting the census data perfectly.
Although the proposed method is deterministic and the studied deterministic methods outperform the stochastic methods, randomness can be addressed in the selection phase, when agents are drawn from the sample and replicated into the zone population. Agents are drawn randomly since their exact place of residence in a given region is not known. Also, agents that are similar in controlled variables may be completely different in uncontrolled variables.
Our results provide insight into the advantages and disadvantages of nine integerization methods and guidance to researchers aiming to use IPF and produce integer values. Existing methods do not treat the integerization of tables as a tabular rounding problem. This paper proposed a tabular rounding method in the context of iterative proportional fitting to be applied in population synthesis for activity-based models.
References
1. Beckman, R. J., K. A. Baggerly, and M. D. McKay. Creating Synthetic Baseline Populations. Transportation Research Part A, Vol. 30, No. 6, 1996, pp. 415-429.
2. Ryan, J., H. Maoh, and P. Kanaroglou. Population Synthesis: Comparing the Major Techniques Using a Small, Complete Population of Firms. Geographical Analysis, Vol. 41, No. 2, April 2009, pp. 181-203.
3. Barthelemy, J., and P. L. Toint. Synthetic Population Generation without a Sample. Transportation Science, Vol. 47, No. 2, 2013, pp. 266-279.
4. Farooq, B., M. Bierlaire, R. Hurtubia, and G. Flötteröd. Simulation Based Population Synthesis. Transportation Research Part B: Methodological, Vol. 58, 2013, pp. 243-263.
5. Bowman, J. L. A Comparison of Population Synthesizers Used in Microsimulation Models of Activity and Travel Demand. 2004, http://jbowman.net/papers/, Accessed on 06/24/2014.
6. Lovelace, R., and D. Ballas. Truncate, Replicate, Sample: A Method for Creating Integer Weights for Spatial Microsimulation. Computers, Environment and Urban Systems, Vol. 39, May 2013, pp. 172-181.
7. Fienberg, S. E. An Iterative Procedure for Estimation in Contingency Tables. Annals of Mathematical Statistics, Vol. 41, 1970, pp. 907-917.
8. Pukelsheim, F. Biproportional Matrix Scaling and the Iterative Proportional Fitting Procedure. Annals of Operations Research, Vol. 215, No. 1, 2014, pp. 269-283.
9. Muller, K., and K. W. Axhausen. Population Synthesis for Microsimulation: State of the Art. Paper presented at the 90th Annual Meeting of the Transportation Research Board, Washington, D.C., January 2011.
10. Williamson, P., M. Birkin, and H. Rees. The Estimation of Population Microdata by Using Data from Small Area Statistics and Samples of Anonymized Records. Environment and Planning A, Vol. 30, No. 5, 1998, pp. 785-816.
11. Pritchard, R. D., and E. J. Miller. Advances in Population Synthesis: Fitting Many Attributes per Agent and Fitting to Household and Person Margins Simultaneously. Transportation, Vol. 39, No. 3, 2012, pp. 685-704.
12. Guo, J. Y., and C. R. Bhat. Population Synthesis for Microsimulating Travel Behavior. In Transportation Research Record: Journal of the Transportation Research Board, Vol. 2014, No. 12, Transportation Research Board of the National Academies, Washington, D.C., 2007, pp. 92-101.
13. Bowman, J. L. Population Synthesizers. Traffic Engineering and Control, Vol. 49, No. 9, 2009, pp. 342.
14. Parsons Brinckerhoff Consult, HBA Specto Inc., and EcoNorthwest. Oregon2 Model Development, HA Module Description at Finalization. Oregon Department of Transportation, USA, 2003.
15. Auld, J., and A. K. Mohammadian. PopSyn-Win V 4.1 Methodology and Program Documentation. Prepared for Chicago Metropolitan Agency for Planning, 2007, http://www.travelbehavior.com/PopSynWINVersion 4.1 Users Guide.html, Last viewed on 05/26/2014.
16. Auld, J., and A. K. Mohammadian. Efficient Methodology for Generating Synthetic Populations with Multiple Control Levels. In Transportation Research Record: Journal of the Transportation Research Board, No. 2175, Transportation Research Board of the National Academies, Washington, D.C., 2010, pp. 138-147.
17. Auld, J., A. K. Mohammadian, and K. Wies. Population Synthesis with Sub-Region Level Control Variable Aggregation. ASCE Journal of Transportation Engineering, September 2009, pp. 632-639.
18. Causey, B. D., L. H. Cox, and L. R. Ernst. Applications of Transportation Theory to Statistical Problems. Journal of the American Statistical Association, Vol. 80, No. 392, 1985, pp. 903-909.
19. Šíma, J. Aunt's Problem: Table Rounding. Computers and Artificial Intelligence, Vol. 18, 1999, pp. 175-189.
20. Ye, X., K. Konduri, R. M. Pendyala, B. Sana, and P. Waddell. A Methodology to Match Distributions of Both Household and Person Attributes in the Generation of Synthetic Populations. Paper presented at the 88th Annual Meeting of the Transportation Research Board, Washington, D.C., January 2009.
21. SimTRAVEL Research Group. Synthetic Population Generation for Travel Demand Forecasting. 2010, http://urbanmodel.asu.edu/popgen/trainingmaterials.html, Accessed on 06/24/2014.
22. Salazar-González, J. J. Controlled Rounding and Cell Perturbation: Statistical Disclosure Limitation Methods for Tabular Data. Mathematical Programming, Vol. 105, No. 2-3, 2006, pp. 583-603.
23. Los Alamos National Laboratories (LANL). Population Synthesizer: TRANSIMS 4.0.10 User Manual. 2010, http://sourceforge.net/projects/transims/files/documentation/4.0.06/, Accessed on 06/24/2014.
24. Hensher, D. A., and T. Ton. TRESIS: A Transportation, Land Use and Environmental Strategy Impact Simulator for Urban Areas. Transportation, Vol. 29, No. 4, 2002, pp. 439-457.
25. Atlanta Regional Commission. Activity-Based Travel Model Specifications: Coordinated Travel – Regional Activity Based Modeling Platform (CT-RAMP) for the Atlanta Region. Updated December 2012.
26. Parsons Brinckerhoff Consult. Task 2: Household and Population Synthesis. Report prepared for the Mid-Ohio Regional Planning Commission (MORPC) as part of The MORPC Model Improvement Project, March 2003.
27. Hitchcock, F. L. The Distribution of Product from Several Sources to Numerous Localities. Journal of Mathematical Physics, Vol. 20, No. 2, 1941, pp. 224-230.
28. Dantzig, G. B. Linear Programming and Extensions. Princeton University Press, Princeton, New Jersey, 1963.
29. Hobeika, A. TRANSIMS Fundamentals: Population Synthesizer. U.S. Department of Transportation, Washington, D.C., 2005.
30. Voas, D., and P. Williamson. Evaluating Goodness-of-Fit Measures for Synthetic Microdata. Geographical & Environmental Modeling, Vol. 5, No. 2, 2011, pp. 177-200.
31. Read, T. R., and N. A. Cressie. Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer, 1988.
32. Koehler, K. J. Goodness-of-Fit Tests for Log-Linear Models in Sparse Contingency Tables. Journal of the American Statistical Association, Vol. 81, No. 394, 1986, pp. 483-493.
33. Tzeng, G. H., and J. J. Huang. Multiple Attribute Decision Making: Methods and Applications. CRC Press, 2011.
34. Jeffreys, I. The Use of Compensatory and Non-Compensatory Multi-Criteria Analysis for Small-scale Forestry. Small-scale Forest Economics, Management and Policy, Vol. 3, No. 1, 2004, pp. 99-117.