Tabular Rounding in Iterative Proportional Fitting for Population Synthesis in Activity-Based Models

Word count: 5852 (Text) + 7 * 250 (Tables & Figures) = 7602

Submission date: August 1, 2014

Submission of revised manuscript: November 13, 2014

Abdoul-Ahad Choupani
Ph.D. Candidate
Department of Civil and Environmental Engineering
Tarbiat Modares University
Gisha Bridge, Tehran, Iran
Phone: +98-21-33-255-473
Fax: +98-21-8288-4925
Email: ahad.choupani@modares.ac.ir

Amir Reza Mamdoohi (Corresponding Author)
Assistant Professor
Department of Civil and Environmental Engineering
Tarbiat Modares University
Gisha Bridge, Tehran, Iran
Phone: +98-21-8288-4925
Fax: +98-21-8288-4925
Email: armamdoohi@modares.ac.ir


Abstract

Activity-based travel demand modeling requires socioeconomic micro-data of the population under study. Since acquiring such data for the entire population is infeasible or highly expensive, techniques like iterative proportional fitting (IPF) have been applied extensively to estimate such data synthetically. Despite its many advantages, IPF produces non-integer values: fractions of households or individuals are obtained for zones. Although methods have been proposed to integerize non-integer tables, this problem has seldom been viewed as tabular rounding. This paper proposes a binary linear programming model for tabular rounding in which the totals and marginals of the integerized tables fit perfectly to input data obtained from the census bureau. Furthermore, it minimizes distortion to the correlation structure of the household-level and individual-level non-integer tables. It does not bias joint or marginal distributions of socioeconomic attributes of population units or the sampling of rare demographic groups (at the α=0.05 significance level). An empirical comparison of the proposed method with eight existing ones demonstrates that the proposed model outperforms the tested methods. Sensitivity analysis demonstrates that integerization of small values is as significant as that of large values. In this study, deterministic methods outperform stochastic methods in terms of accuracy and perfect fit to census data. Finally, a scoring and ranking tool is used to reflect concisely the advantages and disadvantages of these methods.

Keywords: activity-based models, population synthesis, iterative proportional fitting, tabular rounding, binary linear programming.

1. INTRODUCTION

Activity-based travel demand forecasting has received increasing attention from researchers and practitioners. This modeling approach requires micro-data of the study area population, which are usually inaccessible due to confidentiality concerns or costs. The two most widely applied techniques to produce the micro-data synthetically are iterative proportional fitting (IPF) [1] and combinatorial optimization (CO) [2]. Other techniques have also been proposed, such as a sample-free method to synthesize the Belgian population [3]. Farooq et al. [4] applied Markov Chain Monte Carlo simulation to synthesize the population of Switzerland. IPF, however, has been the workhorse of population synthesis due to its many advantages: it requires little census data, is computationally easy and fast [1, 5, 6], and is guaranteed to converge [7, 8].

A disadvantage of IPF is that it produces non-integer values: fractions of households or individuals [1, 6, 9]. Although this is not a problem for many applications such as aggregate zonal-based modeling of travel demand, disaggregate agent-based modeling requires integer rather than fractional values. CO produces integer estimates of households/individuals for a given zone, which is why some researchers favor this technique [10]. However, CO needs considerable time to converge [11] because it does not necessarily produce better results at every iteration, due to the random selection of agents in the solution algorithms [6].

The usefulness of IPF in population synthesis can be enhanced if the integerization problem is solved. Lovelace and Ballas [6] proposed four integerization methods and compared them with the conventional rounding method. However, none of their methods and very few of the existing ones regard the problem as a tabular rounding problem.

The aims of this paper are as follows:

1. Propose a tabular rounding method that keeps aggregate totals and marginals the same and maintains a close similarity between the correlation structures of estimated tables and reference tables [8, 13, 14],
2. Investigate the bias caused by integerization in general and by deterministic methods in particular,
3. Compare stochastic and deterministic integerization methods, and
4. Study the effects of integerization on the sampling of rare demographic groups.

The rest of this paper is organized as follows. Section 2 reviews the literature and familiarizes readers with IPF and the integerization problem. Section 3 describes the proposed and existing methods. In Section 4, the results of the integerization methods are discussed and their performances are compared and ranked. The paper ends with concluding remarks.

2. LITERATURE REVIEW

A three-way count table Sijk cross-classifies variables X, Y, and Z of agents sampled from a large area (region) into I, J, and K categories, respectively. Each cell Sijk is a count of observations classified into categories i, j, and k of the first, second, and third variables, respectively. However, such counts (nijk) are not available for the population of small areas (zones); only the marginals are available (ni++, n+j+, and n++k). Each marginal cell ni++ = ∑j ∑k nijk contains the total number of observations in category i of variable X (similarly for n+j+ and n++k). Population synthesizers generally use IPF, a well-known statistical technique similar to the Furness technique, to estimate nijk.
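To make the role of IPF concrete, the following is a minimal two-way sketch rather than the three-way procedure used in the paper. The function name `ipf_2d`, the seed table, and the marginal targets are all hypothetical values chosen for illustration. The sketch alternately rescales rows and columns until both sets of marginals are matched, and it shows why the fitted cells are generally non-integer.

```python
def ipf_2d(seed, row_targets, col_targets, iterations=100):
    """Rescale a positive seed table until its row and column sums match the targets.

    Illustrative two-way version only; the paper works with three-way tables.
    """
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(iterations):
        # Row step: scale every row to its target marginal.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            table[i] = [cell * target / row_sum for cell in table[i]]
        # Column step: scale every column to its target marginal.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
    return table


# Hypothetical sample counts (seed) and zone marginals.
seed = [[1.0, 2.0], [3.0, 4.0]]
fitted = ipf_2d(seed, row_targets=[40, 60], col_targets=[35, 65])
for row in fitted:
    print([round(cell, 3) for cell in row])
```

The fitted cells reproduce the marginals but are fractional, which is the integerization problem discussed in the rest of this section.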

Once the joint distribution is determined, with each cell indicating the number of agents of a certain type residing in a specific zone, the number of agents given by each cell is selected randomly from the corresponding set of records available in the sample and placed into the zone population. The problem, however, is that these tables are not integer.

Therefore, a similarly structured integerized table Nijk is desired, in which the marginals and totals of Nijk correspond to those of nijk, e.g. n+++ = N+++ and ni++ = Ni++.

Synthesizers take two approaches to deal with non-integer tables. The first approach integerizes non-integer tables and then performs the selection [13]. The second approach carries out integerization indirectly. Fitted tables are treated as joint probability mass functions (PMFs) giving the probabilities of observing agents with specific demographics. Then n Monte Carlo draws with replacement (n = table total) are used to select n agents from the sample records and to construct the zone population [6, 14-17]. The second approach considers all types of agents simultaneously for selection, which results in a longer list of selections and requires more time and memory. Theoretically, agents with higher probabilities are likely to appear more often, while agents with lower probabilities are less likely to appear in the synthesized population [6]. However, there is a nonzero chance that an agent with a lower probability (e.g. 0.002) is replicated more times than an agent with a higher probability (e.g. 0.004). A disadvantage of the first approach, on the other hand, is that the integerized table is not the best solution in terms of information discrimination: a value of 0.501 is treated the same as 0.999 [11] (see TABLE 1 for further comparison of the two approaches). We revisit the first approach, which arguably outperforms the second, i.e. tables are integerized and then agents are selected from the sample records.

Conventional rounding (which rounds fractions down to 0 if they are smaller than 0.5, and up to 1 otherwise) minimizes the overall discrepancy between unrounded and rounded values; however, the rounded values may fail to add up to the original total [18]. The implicit assumption of conventional rounding is that fractional parts smaller and greater than 0.5 occur equally often, which would make the rounding additive. Since this assumption does not always hold, non-additive rounding will occur in most cases.
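A small numerical illustration of non-additive rounding is sketched below. The cell values are hypothetical and the helper `conventional_round` is an illustrative name, not part of any synthesizer. All four fractional parts are below 0.5, so every cell is rounded down and the total of 4 drops to 3.

```python
import math


def conventional_round(x):
    """Round down if the fractional part is below 0.5, and up otherwise."""
    return math.floor(x + 0.5)


# Hypothetical cell values whose fractional parts are all below 0.5.
cells = [1.25, 2.25, 0.25, 0.25]
rounded = [conventional_round(x) for x in cells]

print(sum(cells))    # total before rounding
print(rounded)       # each cell rounded independently
print(sum(rounded))  # total after rounding no longer matches
```

Python's built-in `round` is avoided here on purpose, because it rounds exact halves to the nearest even integer rather than always up.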

The construction of additive rounding in multi-way tables is much more demanding because rounding should retain not only the additivity of the table total but also the additivity of the marginals [19]. Since the error in the marginals can be very large for large tables, sophisticated methods are needed for rounding multi-way tables.

IPF also results in sparse tables, i.e. tables composed of many cells, most of which have low values. FIGURE 1.a shows (for the case study of the current paper) that household-level tables contain values ranging from close to zero to greater than 100, but 15% of values fall in the narrow interval [0, 1). Integerization of these small values is much more sensitive than that of large values.

The aim of this paper is to propose a tabular rounding method and to compare its performance with those of eight existing methods. Among the existing methods, an attempt was made to select the most competitive ones. However, to demonstrate that conventional methods (e.g. rounding) may not be appropriate, a few such methods are included in this study.

Because synthesizing the population at both household and individual levels is becoming common practice [12, 16, 20], integerization effects at both levels are studied empirically for the case of Wyoming State, US.

TABLE 1 A Comparison of Two Approaches to Integerize Non-Integer Tables

No. | Criterion | Indirect approach | Direct approach | Reference
1 | How the approach treats the fitted tables | Probability function | Table | [1]
2 | Shorter list of selections | | | [9]
3 | Requiring less time and memory | | | [9]
4 | Replicating an agent with higher replication count (or probability) more times than an agent with lower replication count (or probability) | | |
5 | Discrimination information | | | [6, 11]
6 | Selection of agents with a specific order | | | a
7 | Possibility of investigating integerization and selection effects separately | | | a

a. Criterion proposed by the current paper.

FIGURE 1 Relative frequency of values (x) of estimated (a) Household-Level tables (observations=8223), and (b) Individual-Level tables (observations=6213) for Census Tracts (CTs) of Wyoming State.

[Figure legend bins: 0 ≤ x < 1, 1 ≤ x < 10, 10 ≤ x < 100, 100 ≤ x]

3. METHODOLOGY

Most integerization methods truncate non-integer numbers to separate the integer parts from the fractional parts. Integer parts are usually treated as deterministic numbers, which determine how many times agents of a certain type should be ‘cloned’ and placed into the synthesized population. The remaining fractional parts are then used to determine, deterministically or stochastically, whether an agent of a certain type should be placed into the population one more time. Once rounded, the fractional parts are added to the stored integer parts to give the total number of times that an agent should be placed into the zone population.

In the following, we briefly describe how the existing methods integerize the fractional parts of numbers. Then, the proposed method is described in detail.

1. The first method, rounding, refers here to conventional rounding, where the fractional parts are rounded down if they are smaller than 0.5 and rounded up otherwise.

2. PopGenRandom [20-22] uses a random integerization method that can be represented by:

F_{ijk} = \begin{cases} 1 & \text{with probability } pu_{ijk} = f_{ijk} \\ 0 & \text{with probability } pd_{ijk} = 1 - f_{ijk} \end{cases}    (1)

where
Fijk = the integerized value of fijk,
puijk = the probability of rounding fijk up to 1,
pdijk = the probability of rounding fijk down to 0, and
fijk = the fractional part of cell values.

Then, it adds ±1 to rounded cells to account for the difference between the sums of the rounded and unrounded values. If the integerization underestimates the table total, it adds +1 to rounded-down cells. If the integerization overestimates the table total, it adds -1 to rounded-up cells.

3. PopGenBucket [21] applies bucket rounding to the cells of a specific table. This deterministic method keeps track of the accumulated rounding errors. The rounding error accumulated up to the previous cell of a table is used to bias the rounding of the next cell in the same table. The procedure preserves the population of agents (table total) in the zone.

4. PopGenRounding [21] first uses conventional rounding. In most cases the sum of the rounded values does not add up to the table total. Therefore, PopGenRounding adds ±1 to rounded cells in the way described for PopGenRandom.

5. TRANSIMS [23] applies bucket rounding to a specific cell across all tables. This process preserves the total population of each demographic group in the study area, but may slightly change the total population of a given zone [23].

6. TRESIS [24] uses a simple rounding procedure, except that all non-zero values less than 1 are rounded up to 1. This method may be biased for values less than 1 and can yield aggregations that differ substantially from the control values [5].

7. ARC [25] and MORPC [26] round up, starting with the demographic groups with the largest fractional part, but avoid rounding up if it would cause a control value to be exceeded.

8. Truncate-Replicate-Sample (TRS) [6] is formally defined by:

F_{ijk} = \begin{cases} 1 & \text{with probability } pu_{ijk} = \dfrac{f_{ijk}}{\sum_i \sum_j \sum_k f_{ijk}} \\ 0 & \text{with probability } pd_{ijk} = 1 - pu_{ijk} \end{cases}    (2)

where
Fijk = the integerized value of fijk,
puijk = the probability of rounding fijk up to 1,
pdijk = the probability of rounding fijk down to 0, and
fijk = the fractional part of cell values.

The probability puijk is proportional to the share of fijk in \sum_i \sum_j \sum_k f_{ijk}. This method guarantees that higher fractional parts have a higher chance of being rounded up. Also, rounding until \sum_i \sum_j \sum_k f_{ijk} = \sum_i \sum_j \sum_k F_{ijk} ensures that the population size remains the same.
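The contrast between a deterministic and a stochastic scheme from the list above can be sketched as follows. This is an illustrative simplification, not the code of PopGen or any other synthesizer: `bucket_round` is one common form of bucket rounding (the rounding error of each cell is carried into the next), and `random_round` rounds each cell up with probability equal to its fractional part, as in the PopGenRandom description, but without the subsequent ±1 adjustment. The function names and cell values are hypothetical.

```python
import math
import random


def bucket_round(values):
    """Deterministic bucket rounding: carry each cell's rounding error into the next cell."""
    carry = 0.0
    rounded = []
    for v in values:
        target = v + carry                # value plus accumulated error so far
        r = math.floor(target + 0.5)      # conventional rounding of the biased value
        carry = target - r                # error passed on to the next cell
        rounded.append(r)
    return rounded


def random_round(values, rng):
    """Round each cell up with probability equal to its fractional part."""
    result = []
    for v in values:
        base = math.floor(v)
        frac = v - base
        result.append(base + (1 if rng.random() < frac else 0))
    return result


# Hypothetical cell values that sum to 3.
cells = [0.25, 0.25, 0.25, 0.25, 1.5, 0.5]
print(bucket_round(cells))                    # total is preserved
print(random_round(cells, random.Random(0)))  # total may or may not be preserved
```

Because the carried error never leaves the table, the bucket-rounded cells add up to the original total, whereas the independent random draws give no such guarantee, which is why PopGenRandom needs its ±1 correction step.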

The existing methods are not tabular rounding methods, in which the totals and marginals of integerized tables fit perfectly to input data obtained from the census bureau. Tabular rounding also minimizes distortion to the original non-integer table.

One way to perform tabular rounding of two-way tables is to apply the transportation problem [18]. The transportation problem is a well-known problem in economics that originally aimed at determining a minimal-cost shipping schedule between sources and destinations [27]. The classic statement of this problem uses a matrix with rows representing sources and columns representing destinations. The margins of the matrix show supplies and demands. The costs of shipping from sources to destinations are given by the entries of the matrix.

If the integerization of two-way tables containing positive non-integers is formulated as a linear program similar to the transportation problem [18], it is not obvious at first that the problem has integer solutions. The “triangular basis property” of the transportation problem is then invoked to prove that solutions do exist and that the optimal solutions are integer if the marginals are integer [28]. This property still holds when the integer values are restricted to the set {0, 1} by introducing capacity constraints on the decision variables.

The above properties of the transportation problem help to find the solution of tabular rounding of two-way tables without solving an integer programming model. However, it has been proved [22] that the three-dimensional linear program does not always have integer solutions.

Thus, an integer programming model similar in construct to the transportation problem is proposed, with modifications, for integerizing three-way tables. It is given below as program A (P-A):

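\text{Minimize } \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} -f_{ijk} F_{ijk}

Subject to:

\sum_{j=1}^{J} \sum_{k=1}^{K} F_{ijk} = \sum_{j=1}^{J} \sum_{k=1}^{K} f_{ijk}, \quad i = 1, 2, \dots, I

\sum_{i=1}^{I} \sum_{k=1}^{K} F_{ijk} = \sum_{i=1}^{I} \sum_{k=1}^{K} f_{ijk}, \quad j = 1, 2, \dots, J

\sum_{i=1}^{I} \sum_{j=1}^{J} F_{ijk} = \sum_{i=1}^{I} \sum_{j=1}^{J} f_{ijk}, \quad k = 1, 2, \dots, K

F_{ijk} \in \{0, 1\}

(P-A)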

where
Fijk = the integerized value of fijk, and
fijk = the fractional part of cell values.

The constraints guarantee that the observed marginals do not change due to integerization. If the marginals remain the same (e.g. f++k = F++k), the table totals also remain the same (n+++ = N+++) through integerization. Program A (P-A) minimizes the error introduced to the results of IPF. Like conventional rounding, this program gives higher fijk a greater chance of being rounded up. If nijk is an integer, there is no need to include it in the program, which is typically solved for each zone and generally has I*J*K binary variables and I+J+K equality constraints.
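The structure of P-A can be illustrated on a tiny table by exhaustive enumeration. The sketch below is not the authors' MATLAB implementation and does not use a BLP solver. It simply tries every 0/1 assignment of a hypothetical 2x2x2 table of fractional parts and keeps the feasible one with the largest \sum f_{ijk} F_{ijk}, which is equivalent to minimizing the P-A objective. The function name `solve_pa_bruteforce` and the values in `f` are assumptions made for illustration, and the approach is only practical for a handful of cells.

```python
from itertools import product


def solve_pa_bruteforce(f):
    """Enumerate all 0/1 tables F and return the feasible one maximizing sum(f * F).

    Feasible means every one-way marginal of F equals the matching marginal of f.
    Illustrative only: the search space has 2**(I*J*K) candidates.
    """
    I, J, K = len(f), len(f[0]), len(f[0][0])
    cells = list(product(range(I), range(J), range(K)))

    def marginals(value_of):
        by_i = [sum(value_of(i, j, k) for j in range(J) for k in range(K)) for i in range(I)]
        by_j = [sum(value_of(i, j, k) for i in range(I) for k in range(K)) for j in range(J)]
        by_k = [sum(value_of(i, j, k) for i in range(I) for j in range(J)) for k in range(K)]
        return by_i + by_j + by_k

    # The fractional parts sum to integers along each marginal; round() removes float noise.
    targets = [round(m) for m in marginals(lambda i, j, k: f[i][j][k])]

    best, best_value = None, float("-inf")
    for bits in product((0, 1), repeat=len(cells)):
        F = dict(zip(cells, bits))
        if marginals(lambda i, j, k: F[(i, j, k)]) != targets:
            continue  # violates an equality constraint of P-A
        value = sum(f[i][j][k] * F[(i, j, k)] for i, j, k in cells)
        if value > best_value:
            best, best_value = F, value
    return best, best_value


# Hypothetical fractional parts f[i][j][k]; every one-way marginal sums to 2.
f = [[[0.85, 0.55], [0.35, 0.25]],
     [[0.15, 0.45], [0.65, 0.75]]]
F, objective = solve_pa_bruteforce(f)
rounded_up = sorted(cell for cell, bit in F.items() if bit)
print(rounded_up, round(objective, 2))
```

In this example the four largest fractional parts happen to satisfy all six marginal constraints, so they are the cells rounded up. In general the constraints can force a smaller fraction to be rounded up in place of a larger one.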

P-A is scalable to a large number of dimensions. If more dimensions (variables) are used, the rounding problem can still be converted to a binary linear programming (BLP) model, in which the objective function and constraints are convex. Therefore, the model always has an integer solution irrespective of the number of dimensions.

The characteristics of the existing methods and the proposed one are summarized in TABLE 2.

4. RESULTS AND DISCUSSION

This section compares integerization methods for Census Tracts (CTs) as zones in the state of Wyoming using year 2010 census data. Wyoming had a population of 563,626 persons residing in 226,879 households in 132 CTs (1 CT had zero households). Marginal distributions of control variables were extracted from Summary File 1 (SF-1), while the one-percent Public Use Microdata Sample (PUMS), which includes 5,665 persons residing in 2,475 households, served as the sample from which reference tables were established.

The household-level control variables were as follows:

1. Household size (7 categories),
2. Householder age (7 categories), and
3. Householder race (4 categories).

The individual-level control variables were as follows:

1. Gender (2 categories),
2. Age (7 categories), and
3. Race (4 categories).

The household-level and individual-level tables contain 196 (7*7*4) and 56 (2*7*4) cells, respectively.

IPF and all methods were coded in MATLAB. To guarantee representative results for the stochastic methods, they were run 30 times and the best-fit result with the minimum total absolute error (TAE) was selected.

After the different integerization methods are applied, the same method is used to select households/individuals: the method proposed by Auld and Mohammadian [16], which aims at synthesizing the population at the household and person levels simultaneously. Since the methods do not differ in the selection step, only the analysis of the integerization step is presented in this paper.

TABLE 2 Existing and Proposed Integerization Methods and Their Characteristics

No. | Method | Synthesizer | Country | Integerization | Perfect fit to table totals: obtained? | Perfect fit to table totals: how? | Perfect fit to marginals? | Reference | TOPSIS weight | TOPSIS rank
1 | Rounding | TRANSIMS 3.0 & SimBRITAIN | US & UK | Conventional rounding | | | | [6, 29] | 0.5272 | 7
2 | PopGenRandom | PopGen | US | Random rounding of each cell separately | | Adding ±1 to integerized cells | | | 0.5407 | 6
3 | PopGenBucket | PopGen | US | Applying bucket rounding to cells of a specific table | | Inherently results in perfect fit | | | 0.6894 | 3
4 | PopGenRounding | PopGen | US | Conventional rounding | | Adding ±1 to integerized cells | | | 0.6778 | 4
5 | TRANSIMS | TRANSIMS 4.0.10 | US | Applying bucket rounding to a specific cell of all tables | | | | [23] | 0.5272 | 7
6 | TRESIS | TRESIS | AU | Rounding up values lower than 1 and rounding other cells | | | | [24] | 0.5272 | 7
7 | ARC | ARC & MORPC | US | Rounding up cells with the largest fractional parts if they do not exceed marginals | | | | [25, 26] | 0.6933 | 2
8 | TRS | SimBRITAIN | UK | Random rounding of all cells simultaneously | | perfect fit | | [6] | 0.6056 | 5
9 | The proposed optimization | a | a | Binary linear programming | | perfect fit | | b | 0.8437 | 1

a. This method is not intended for a specific synthesizer (or case) and is generic.
b. Proposed method of this paper.

4.1. Accuracy

The first statistic used for comparing the methods is absolute error. FIGURE 2 shows the cumulative percentage distributions of absolute errors of table totals and marginals. As FIGURE 2.a and c show for the totals of household-level and individual-level tables, 100% of absolute errors are zero in the proposed, TRS, and PopGen methods (note the single triangle at the Cartesian coordinate (0, 100) in these figures). ARC is the second-best method, since 90% and 100% of the absolute errors of household-level and individual-level tables, respectively, are equal to or less than 1.

Marginals classify the population of agents into distinct groups. As FIGURE 2.b and d show for household-level and individual-level tables, respectively, 100% of the absolute errors of marginals are zero in the proposed method. This is because all marginals are kept the same during integerization by the linear constraints of the optimization problem. The same figures show that around 90% of the absolute errors of household-level or individual-level marginals are zero in ARC. This good performance is due to rounding up starting with the demographic groups with the largest fractional part, while avoiding rounding up if it would cause a control value to be exceeded.

Total absolute error (TAE) is used to measure the aggregate error of a table, as shown below [30]:

TAE = \sum_i \sum_j \sum_k \left| N_{ijk} - n_{ijk} \right|    (3)

where
TAE = total absolute error,
Nijk = the integerized value of cell ijk, and
nijk = the non-integer value of cell ijk.
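As a worked instance of the TAE definition, the sketch below sums the absolute cell differences between a hypothetical non-integer table `n` and a hypothetical integerized table `N`, both stored as nested lists indexed [i][j][k]. The function name and the values are illustrative assumptions.

```python
def total_absolute_error(N, n):
    """Sum |N_ijk - n_ijk| over all cells of two equally shaped three-way tables."""
    return sum(
        abs(N_cell - n_cell)
        for N_slab, n_slab in zip(N, n)
        for N_row, n_row in zip(N_slab, n_slab)
        for N_cell, n_cell in zip(N_row, n_row)
    )


# Hypothetical 1x2x2 tables with the same total (5).
n = [[[0.25, 1.75], [2.5, 0.5]]]
N = [[[0, 2], [3, 0]]]
print(total_absolute_error(N, n))  # 0.25 + 0.25 + 0.5 + 0.5
```

Note that the two tables share the same total, yet the TAE is 1.5: a perfect fit to totals or marginals does not imply a small cell-level error.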

FIGURE 3.a and b show that the lowest TAEs belong to methods applying conventional rounding (rounding and PopGenRounding), since they choose the closest integers to fractional values. In contrast, stochastic rounding methods (TRS and PopGenRandom) and bucket rounding methods (TRANSIMS and PopGenBucket) have the largest TAEs. The errors of the remaining methods (the proposed method, TRESIS, and ARC) are moderate.

FIGURE 2 Cumulative percentage distributions of absolute errors of (a) Household-Level tables totals (observations=131), (b) Household-Level marginals (observations=2358), (c) Individual-Level tables totals (observations=131), and (d) Individual-Level marginals (observations=1703).

[Figure panels: horizontal axis "Absolute error"; legend entries include Optimization, PopGenRounding, PopGenRandom, PopGenBucket, Rounding, TRANSIMS, TRESIS, and TRS.]

FIGURE 3 Sum of Total Absolute Errors (TAEs) of (a) Household-Level tables (observations=131), and (b) Individual-Level tables (observations=131).

[Figure axis label: Rounding methods]

To shed more light on the performance of deterministic and stochastic methods, the deterministic PopGenRounding is compared with the two stochastic methods, PopGenRandom and TRS.

Both PopGenRandom and PopGenRounding integerize each cell independently of the other cells. The former randomly decides whether to round each cell up or down, whilst the latter rounds each cell deterministically to the closest integer. To account for the difference between the sum of the rounded values and the sum of the unrounded values, both add ±1 to rounded cells. TRS randomly rounds all cells simultaneously using equation (2) and needs no further adjustment to preserve table totals.

FIGURE 2.a and c show that, for the three methods, 100% of the absolute errors of household-level and individual-level totals are 0. However, the errors of the marginals differ. FIGURE 2.b and d show that around 95% of the absolute errors of household-level and individual-level marginals are equal to or less than unity in PopGenRounding. In TRS and PopGenRandom, however, 85% and 60% of the absolute errors of marginals are equal to or less than unity, respectively.

Furthermore, PopGenRounding results in lower TAEs than PopGenRandom and TRS. The sums of TAEs of household-level and individual-level tables are approximately 2,050 and 1,550 in PopGenRounding, respectively (FIGURE 3.a and b). These values are 2,650 and 2,050, respectively, in PopGenRandom. In TRS, the sums of TAEs of household-level and individual-level tables are approximately 3,050 and 2,300, respectively (FIGURE 3.a and b). Among the three methods, the lowest and highest TAEs belong to PopGenRounding and TRS, respectively. PopGenRounding thus outperforms TRS and PopGenRandom in our case. Collectively, these results indicate that deterministic methods perform better than stochastic methods in this case study.

The Freeman-Tukey (FT) test investigates whether the joint (or marginal) distributions of attributes (of agents) remain the same through integerization (the null hypothesis) or not (the alternative hypothesis) [30]. This test uses the χ²-type FT statistic defined below:

FT = 4 \sum_i \sum_j \sum_k \left( \sqrt{N_{ijk}} - \sqrt{n_{ijk}} \right)^2    (4)

where
FT = the statistic of the FT test,
Nijk = the integer value of cell ijk, and
nijk = the non-integer value of cell ijk.
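The FT statistic can be computed directly from its definition. The sketch below applies it to a hypothetical four-category marginal stored as flat lists and compares the result with the critical value 7.8147 that TABLE 3 reports for the four-category race marginals. The function name and the cell values are illustrative assumptions, not data from the study.

```python
import math


def freeman_tukey(N, n):
    """FT = 4 * sum((sqrt(N) - sqrt(n))**2) over paired integer and non-integer cells."""
    return 4.0 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(N, n))


# Hypothetical four-category marginal before (n) and after (N) integerization.
n = [2.25, 0.25, 0.25, 6.25]
N = [2, 0, 1, 6]

FT_CRITICAL = 7.8147  # critical value listed in TABLE 3 for a four-category marginal
ft = freeman_tukey(N, n)
print(ft, ft > FT_CRITICAL)  # second value is True only if the null hypothesis is rejected
```

Most of the statistic in this example comes from the two cells with n = 0.25, which illustrates how small cells dominate the test even when larger cells are rounded almost exactly.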

It is known that this test may perform poorly when dealing with cell frequencies less than 5. However, this condition can be relaxed following investigations showing that χ² can still be a good approximation for the FT statistic even when some cell values are lower than 1 [31]. It was added that this test might also perform poorly for sparse tables containing both very small and moderately large values [32]. However, as can be seen in FIGURE 1.a and b, the household-level and individual-level tables contain values ranging from close to zero to greater than 100, so table cells cover a wide range of small, moderate, and large values. Therefore, χ² can be a good approximation for the FT statistic even when some cell values are smaller than 1.

Results of the FT test (TABLE 3) show that the null hypothesis is not rejected for the proposed method. This means that the proposed integerization method does not change the joint (or marginal) distributions of agents' attributes and does not introduce bias into the synthesis process. The FT test (TABLE 3) also demonstrates that none of the existing integerization methods, except two, changes the joint (or marginal) distributions of attributes (at the α=0.05 significance level). TRESIS changes the marginal distributions of householder age and race in 2 and 7 CTs, respectively (9 CTs collectively). PopGenRandom changes the joint distributions of individuals' attributes in 3 CTs. These 12 CTs are less populated zones whose populations are close to the minimum population or below the average.

It is desirable to ensure that the fit is good overall, with no problems at any point. The cell-by-cell Z test determines whether the probability of observing agents with specific socio-demographics in a specific zone is changed by integerization [30]. This test statistic is given by:

z_{ijk} = \frac{\dot{P}_{ijk} - \dot{p}_{ijk}}{\sqrt{\dot{p}_{ijk} \left( 1 - \dot{p}_{ijk} \right)}}    (5)

TABLE 3 Number of Census Tracts (CTs) in which Null Hypothesis was Rejected

(a) Household-Level tables

No. | Method | FT test (a), marginals: Household size (FTcritical=12.5916) | FT test (a), marginals: Householder age (FTcritical=12.5916) | FT test (a), marginals: Householder race (FTcritical=7.8147) | FT test (a), entire table (FTcritical=133.2569, df=108 (6*6*3)) | Z test (a), cells (Zcritical=±1.960)
1 | Rounding | 0 | 0 | 0 | 0 | 0
2 | PopGenRandom | 0 | 0 | 0 | 0 | 0
3 | PopGenBucket | 0 | 0 | 0 | 0 | 0
4 | PopGenRounding | 0 | 0 | 0 | 0 | 0
5 | TRANSIMS | 0 | 0 | 0 | 0 | 0
6 | TRESIS | 0 | 2 | 7 | 0 | 0
7 | ARC | 0 | 0 | 0 | 0 | 0
8 | TRS | 0 | 0 | 0 | 0 | 0
9 | Proposed optimization | 0 | 0 | 0 | 0 | 0

(b) Individual-Level tables

No. | Method | FT test (a), marginals: Gender (FTcritical=3.8416) | FT test (a), marginals: Age (FTcritical=12.5916) | FT test (a), marginals: Race (FTcritical=7.8147) | FT test (a), entire table (FTcritical=28.8693, (1*6*3)) | Z test (a), cells (Zcritical=±1.960)
1 | Rounding | 0 | 0 | 0 | 0 | 0
2 | PopGenRandom | 0 | 0 | 0 | 3 | 0
3 | PopGenBucket | 0 | 0 | 0 | 0 | 0
4 | PopGenRounding | 0 | 0 | 0 | 0 | 0
5 | TRANSIMS | 0 | 0 | 0 | 0 | 0
6 | TRESIS | 0 | 0 | 0 | 0 | 0
7 | ARC | 0 | 0 | 0 | 0 | 0
8 | TRS | 0 | 0 | 0 | 0 | 0
9 | Proposed optimization | 0 | 0 | 0 | 0 | 0

a. α = 0.05

where
zijk = the statistic of the Z test for cell ijk,
ṗijk = the probability of observing agents with specific socio-demographics (ijk) in the non-integer table (nijk/n),
Ṗijk = the probability of observing agents with specific socio-demographics (ijk) in the integer table (Nijk/N),
nijk = the non-integer value of cell ijk,
Nijk = the integer value of cell ijk, and
N = n = table total.

The Z test shows that the proposed method does not change the probabilities significantly (at the α=0.05 level). This also holds for all of the existing methods studied.

Pritchard and Miller [11] remarked that deterministic rounding might bias the estimates, particularly for cells representing ‘‘rare’’ characteristics with a count under 0.5. The cell-by-cell Z test shows no strong evidence that cell proportions change due to integerization. Since the tables contain values lower than 0.5 (FIGURE 1), it can be concluded that deterministic methods do not bias the estimates for values lower than 0.5.

Muller and Axhausen [9] stated that any integerization method might bias the population synthesis. However, considering the FT and Z statistics, there is no significant evidence that deterministic or stochastic integerization methods bias the result. Since integerization is part of the synthesis process, it can be concluded that, in general, integerization does not introduce bias into the synthesis process. Only a few methods (PopGenRandom and TRESIS) resulted in limited bias.

4.2. Computational Effort

The time spent on the integerization of IPF tables, as a criterion of computational effort, was measured on an Intel Core i3 (2.10 GHz) machine with 4 GB of RAM running Windows 7. All methods' runtimes were one second except for the proposed and TRS methods. The proposed method took 5 seconds longer, an insignificant difference. A single run and multiple (30) runs of the stochastic TRS method took 6 and 37 seconds, respectively. The longer runtime of the proposed method is due to solving, for each zone, one optimization problem with a relatively large number of binary variables (I*J*K variables). TRS is the slowest method because it randomly rounds all cells simultaneously. Overall, run time does not appear to be a concern.

4.3. Sensitivity Analysis

Integerization of values smaller than 1, which represent rare demographic groups, may be much more sensitive than integerization of large values. Therefore, a coefficient M is introduced into program A (P-A) to test how the proportions of rare demographic groups vary due to integerization, as follows:

$$
a_{ijk} =
\begin{cases}
f_{ijk} & \text{if } n_{ijk} > 1 \\
M \cdot f_{ijk} & \text{if } n_{ijk} < 1
\end{cases}
$$

where

$a_{ijk}$ = coefficients of the decision variables in the objective function (of P-A),

$M$ = a coefficient defined for conducting the sensitivity analysis, and

$f_{ijk}$ = the fractional parts of the cell values.

If M = 1, then $a_{ijk} = f_{ijk}$ (as it appears in P-A); that is, P-A treats the integerization of small cell values the same as that of large ones.
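The weighting rule above can be sketched in a few lines of Python. This is only an illustration of how the objective coefficients would be computed for a given M; the function name `objective_coefficient` and the sample cell values are hypothetical, and the fractional part is taken as the cell value minus its floor.

```python
import math

def objective_coefficient(n_ijk, M=1.0):
    """Return a_ijk for one cell, following the piecewise rule in the text.

    Cells with n_ijk < 1 (rare groups) have their fractional part scaled by M;
    all other cells keep their fractional part unchanged. A cell equal to 1
    has a fractional part of 0, so either branch gives 0 for it.
    """
    f_ijk = n_ijk - math.floor(n_ijk)  # fractional part of the cell value
    return M * f_ijk if n_ijk < 1 else f_ijk

# Hypothetical cells: one rare group (0.4) and one common group (3.25).
baseline_rare = objective_coefficient(0.4, M=1.0)   # same as f_ijk
boosted_rare = objective_coefficient(0.4, M=2.0)    # rare cell weighted up
common_cell = objective_coefficient(3.25, M=2.0)    # unaffected by M
```

With M = 1 the rare cell receives exactly its fractional part, so small and large values are treated alike; changing M only moves the coefficients of cells below 1.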

However, there is a concern that small values ($n_{ijk} < 1$) are rounded down to zero and the corresponding demographic groups disappear from the zone population. This may bias the synthesis by under-sampling population units of the rare groups. Thus, the proportions of rare demographic groups in the zone population are assessed. The assessments show that the proportions remain the same in the study area (at the α = 0.05 significance level). This implies that the proposed method, which treats all values alike, does not result in biased sampling.

As M increases, equation 6 places higher priority on rounding up values smaller than 1, which may lead to biased over-sampling of rare types. Decreasing M may result in biased under-sampling of rare groups.

FIGURE 4 shows how the proportions of rare groups change with M. FIGURE 4.a shows, for household-level tables, that if M varies from 0.745 to 1.133, sampling of rare households is not biased (see the darkened trapezoid in the middle of the figure). If M > 1.133, however, the proportion of rare households increases significantly (at the α = 0.05 significance level), which results in biased over-sampling of rare households (see the lighter trapezoid on the right-hand side of FIGURE 4.a). On the other hand, for M < 0.745, the proportion of rare households decreases significantly (at the α = 0.05 significance level), which results in biased under-sampling of rare households (see the lighter trapezoid on the left-hand side of FIGURE 4.a). Therefore, for unbiased sampling, M should be between 0.745 and 1.133. Because this range is narrow, even a modest deviation of M from 1 beyond it results in biased sampling. The best value for M is therefore 1, at which the integerization of all values is treated the same.

FIGURE 4 The sensitivity of the proportions of rare demographic groups to changes in M for (a) households (observations = 131) and (b) individuals (observations = 131).

4.4. Scoring and Ranking

Since several criteria are used to compare the integerization methods, a multiple attribute decision-making (MADM) tool should be applied to assess the overall performance of the methods and rank them. MADM is a sub-discipline of operations research that studies how to evaluate and choose alternatives based on the values and preferences of the decision maker. MADM techniques are divided into two classes: compensatory and non-compensatory [33]. Compensatory techniques allow trade-offs between criteria, so a poor result in one criterion can be offset by a good result in another; non-compensatory techniques do not. Because all methods show strengths and weaknesses on the criteria considered, a compensatory approach should be used for the assessment and ranking [34]. The literature offers many compensatory decision-making techniques, such as TOPSIS (technique for order of preference by similarity to ideal solution), AHP (analytic hierarchy process), ELECTRE (elimination and choice expressing reality), and SAW (simple additive weighting) [33]. For brevity, only TOPSIS is explained here; interested readers are referred to [33] for details of the other MADM techniques. Of these methods, TOPSIS is selected for the following two reasons:

1. TOPSIS, unlike AHP and SAW, does not require expert judgment for comparison or weighting, and thus the decision-making process cannot be biased by subjective judgments.

2. TOPSIS is better known than methods such as ELECTRE and SAW.

[FIGURE 4 plot annotations: along the M axis, each panel is divided into regions labeled "Biased under-sampling", "Unbiased sampling", and "Biased sampling". The labeled points are (0.745, 0.251) and (1.133, 0.295) in panel (a), and (0.62, 0.030) and (1.443, 0.046) in panel (b).]

TOPSIS is based on the concept that the chosen alternative should have the shortest Euclidean distance from the positive ideal solution (PIS), which has the best score in each criterion, and the farthest Euclidean distance from the negative ideal solution (NIS), which has the worst score in each criterion [33].

Alternatives that are closer to the PIS receive higher ranks. The TAEs of table totals, marginals, and cells were the three uncorrelated criteria applied for comparison. Inferential statistics were not considered for ranking because all integerization methods performed nearly the same on them.
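The TOPSIS idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes vector normalization, equal weights, and that every criterion is a cost (a lower TAE is better). The function name `topsis_cost_criteria`, the method labels, and the TAE values are all hypothetical and are not taken from TABLE 2.

```python
import math

def topsis_cost_criteria(scores):
    """Rank alternatives by TOPSIS relative closeness.

    scores: dict mapping alternative name -> list of cost criteria
            (lower is better); equal weights are assumed.
    Returns (names sorted best-first, closeness per alternative).
    """
    names = list(scores)
    n_criteria = len(scores[names[0]])
    # Vector normalization of each criterion column (guard all-zero columns).
    norms = [math.sqrt(sum(scores[a][j] ** 2 for a in names)) or 1.0
             for j in range(n_criteria)]
    normalized = {a: [scores[a][j] / norms[j] for j in range(n_criteria)]
                  for a in names}
    # For cost criteria the PIS takes the minimum and the NIS the maximum.
    pis = [min(normalized[a][j] for a in names) for j in range(n_criteria)]
    nis = [max(normalized[a][j] for a in names) for j in range(n_criteria)]
    closeness = {}
    for a in names:
        d_pos = math.dist(normalized[a], pis)  # Euclidean distance to PIS
        d_neg = math.dist(normalized[a], nis)  # Euclidean distance to NIS
        total = d_pos + d_neg
        closeness[a] = d_neg / total if total > 0 else 0.0
    ranking = sorted(names, key=lambda a: closeness[a], reverse=True)
    return ranking, closeness

# Hypothetical TAEs of (totals, marginals, cells) for three methods.
example_scores = {
    "method_A": [0, 0, 10],
    "method_B": [0, 4, 12],
    "method_C": [6, 20, 30],
}
ranking, closeness = topsis_cost_criteria(example_scores)
```

In this made-up example, the method with the lowest error on every criterion coincides with the PIS and is ranked first, while the method with the highest error on every criterion coincides with the NIS and is ranked last.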

The results of TOPSIS (summarized in TABLE 2) show that the proposed deterministic method has the highest rank and ARC takes second place. The stochastic methods, TRS and PopGenRandom, are ranked 5th and 6th, respectively.

Although the best-performing method is the proposed deterministic method, and other deterministic methods (e.g., PopGenBucket) also outperform the stochastic methods, we propose that stochasticity be addressed in the selection stage. Once the integer frequency of each cell is obtained, the corresponding agents can be drawn randomly from the available sample records. Randomness is applied in the selection phase for two reasons. First, the exact places of agents' residence (in a given region) are not known. Second, agents that are similar in the controlled attributes (the attributes used for building the tables) may be completely dissimilar in other attributes (uncontrolled variables). Moreover, if desired, stochastic integerization can also be applied to produce multiple answers and to quantify the uncertainty introduced through integerization.
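The random selection stage described above can be sketched as follows. This is an illustrative sketch under stated assumptions: agents are drawn with replacement (so a sample record may be replicated), cells are keyed by tuples of controlled attributes, and the names `draw_agents`, `household_sample`, and `zone_counts` as well as all records are hypothetical.

```python
import random

def draw_agents(integer_counts, sample_records, seed=None):
    """Draw agents for one zone from the sample, cell by cell.

    integer_counts: dict mapping cell key -> integerized frequency
    sample_records: dict mapping cell key -> list of sample records
                    sharing that cell's controlled attributes
    Records are drawn uniformly at random with replacement (assumption).
    """
    rng = random.Random(seed)
    synthetic = []
    for cell, count in integer_counts.items():
        candidates = sample_records.get(cell, [])
        if count > 0 and candidates:
            synthetic.extend(rng.choices(candidates, k=count))
    return synthetic

# Hypothetical sample records: same controlled attributes within a cell,
# but different uncontrolled attributes (here, number of cars).
household_sample = {
    ("size_1", "income_low"): [{"id": 1, "cars": 0}, {"id": 2, "cars": 1}],
    ("size_2", "income_high"): [{"id": 3, "cars": 2}],
}
zone_counts = {("size_1", "income_low"): 3, ("size_2", "income_high"): 0}
zone_population = draw_agents(zone_counts, household_sample, seed=42)
```

Because the draw is random within each cell, repeated runs with different seeds yield zone populations that match the integer table exactly but differ in uncontrolled attributes.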

5. CONCLUSION

The proposed binary linear programming model outperforms the other integerization methods tested in this paper and has the highest rank. As a tabular rounding method, it is inspired by the transportation problem, which stems from economics and was previously used to integerize two-way tables. The proposed model decides whether to round the fractional part of each cell value up or down while minimizing deviation from the non-integer IPF tables and keeping table totals and marginals unchanged through integerization.

Hypothesis testing demonstrates that integerization methods do not bias the joint or marginal distributions of agents' attributes in zones, nor do they significantly alter the proportions of demographic groups across the zones of the study area. In particular, the proposed method does not bias the joint or marginal distributions of agents' attributes or the sampling of infrequent types of agents (at the α = 0.05 significance level). Sensitivity tests reveal that the integerization of small and large values can be treated alike.

Furthermore, deterministic and stochastic methods of rounding are compared. The assessments demonstrate that deterministic methods produce better results in terms of accuracy and of fitting the census data perfectly.

Although the proposed method is deterministic and the studied deterministic methods outperform the stochastic methods, randomness can be addressed in the selection phase, when agents are drawn from the sample and replicated into the zone population. Agents are drawn randomly because their exact place of residence in a given region is not known. Also, agents that are similar in controlled variables may be completely different in uncontrolled variables.

Our results provide insight into the advantages and disadvantages of nine integerization methods and offer guidance to researchers aiming to use IPF and produce integer values. Existing methods do not treat the integerization of tables as a tabular rounding problem. This paper proposed a tabular rounding method in the context of iterative proportional fitting, to be applied in population synthesis for activity-based models.

References

1. Beckman, R. J., K. A. Baggerly, and M. D. McKay. Creating Synthetic Baseline Populations. Transportation Research Part A, Vol. 30, No. 6, 1996, pp. 415–429.

2. Ryan, J., H. Maoh, and P. Kanaroglou. Population Synthesis: Comparing the Major Techniques Using a Small, Complete Population of Firms. Geographical Analysis, Vol. 41, No. 2, April 2009, pp. 181–203.

3. Barthelemy, J., and P. L. Toint. Synthetic Population Generation without a Sample. Transportation Science, Vol. 47, No. 2, 2013, pp. 266–279.

4. Farooq, B., M. Bierlaire, R. Hurtubia, and G. Flötteröd. Simulation Based Population Synthesis. Transportation Research Part B: Methodological, Vol. 58, 2013, pp. 243–263.

5. Bowman, J. L. A Comparison of Population Synthesizers Used in Microsimulation Models of Activity and Travel Demand. 2004, http://jbowman.net/papers/, Accessed on 06/24/2014.

6. Lovelace, R., and D. Ballas. Truncate, Replicate, Sample: A Method for Creating Integer Weights for Spatial Microsimulation. Computers, Environment and Urban Systems, Vol. 39, May 2013, pp. 172–181.

7. Fienberg, S. E. An Iterative Procedure for Estimation in Contingency Tables. Annals of Mathematical Statistics, Vol. 41, 1970, pp. 907–917.

8. Pukelsheim, F. Biproportional Matrix Scaling and the Iterative Proportional Fitting Procedure. Annals of Operations Research, Vol. 215, No. 1, 2014, pp. 269–283.

9. Muller, K., and K. W. Axhausen. Population Synthesis for Microsimulation: State of the Art. Paper presented at the 90th Annual Meeting of the Transportation Research Board, Washington, D.C., January 2011.

10. Williamson, P., M. Birkin, and H. Rees. The Estimation of Population Microdata by Using Data from Small Area Statistics and Samples of Anonymized Records. Environment and Planning A, Vol. 30, No. 5, 1998, pp. 785–816.

11. Pritchard, R. D., and E. J. Miller. Advances in Population Synthesis: Fitting Many Attributes per Agent and Fitting to Household and Person Margins Simultaneously. Transportation, Vol. 39, No. 3, 2012, pp. 685–704.

12. Guo, J. Y., and C. R. Bhat. Population Synthesis for Microsimulating Travel Behavior. In Transportation Research Record: Journal of the Transportation Research Board, Vol. 2014, No. 12, Transportation Research Board of the National Academies, Washington, D.C., 2007, pp. 92–101.

13. Bowman, J. L. Population Synthesizers. Traffic Engineering and Control, Vol. 49, No. 9, 2009, pp. 342.

14. Parsons Brinckerhoff Consult, HBA Specto Inc., and EcoNorthwest. Oregon2 Model Development, HA Module Description at Finalization. Oregon Department of Transportation, USA, 2003.

15. Auld, J., and A. K. Mohammadian. PopSyn-Win V 4.1 Methodology and Program Documentation. Prepared for Chicago Metropolitan Agency for Planning, 2007, http://www.travelbehavior.com/PopSynWINVersion 4.1 Users Guide.html, Last viewed on 05/26/2014.

16. Auld, J., and A. K. Mohammadian. Efficient Methodology for Generating Synthetic Populations with Multiple Control Levels. In Transportation Research Record: Journal of the Transportation Research Board, No. 2175, Transportation Research Board of the National Academies, Washington, D.C., 2010, pp. 138–147.

17. Auld, J., A. K. Mohammadian, and K. Wies. Population Synthesis with Sub-Region Level Control Variable Aggregation. ASCE Journal of Transportation Engineering, September 2009, pp. 632–639.

18. Causey, B. D., L. H. Cox, and L. R. Ernst. Applications of Transportation Theory to Statistical Problems. Journal of the American Statistical Association, Vol. 80, No. 392, 1985, pp. 903–909.

19. Šíma, J. Aunt's Problem: Table Rounding. Computers and Artificial Intelligence, Vol. 18, 1999, pp. 175–189.

20. Ye, X., K. Konduri, R. M. Pendyala, B. Sana, and P. Waddell. A Methodology to Match Distributions of Both Household and Person Attributes in the Generation of Synthetic Populations. Paper presented at the 88th Annual Meeting of the Transportation Research Board, Washington, D.C., January 2009.

21. SimTRAVEL Research Group. Synthetic Population Generation for Travel Demand Forecasting. 2010, http://urbanmodel.asu.edu/popgen/trainingmaterials.html, Accessed on 06/24/2014.

22. Salazar-González, J. J. Controlled Rounding and Cell Perturbation: Statistical Disclosure Limitation Methods for Tabular Data. Mathematical Programming, Vol. 105, No. 2-3, 2006, pp. 583–603.

23. Los Alamos National Laboratories (LANL). Population Synthesizer: TRANSIMS 4.0.10 User Manual. 2010, http://sourceforge.net/projects/transims/files/documentation/4.0.06/, Accessed on 06/24/2014.

24. Hensher, D. A., and T. Ton. TRESIS: A Transportation, Land Use and Environmental Strategy Impact Simulator for Urban Areas. Transportation, Vol. 29, No. 4, 2002, pp. 439–457.

25. Atlanta Regional Commission. Activity-Based Travel Model Specifications: Coordinated Travel – Regional Activity Based Modeling Platform (CT-RAMP) for the Atlanta Region. Updated December 2012.

26. Parsons Brinckerhoff Consult. Task 2: Household and Population Synthesis. Report prepared for the Mid-Ohio Regional Planning Commission (MORPC) as part of The MORPC Model Improvement Project, March 2003.

27. Hitchcock, F. L. The Distribution of Product from Several Sources to Numerous Localities. Journal of Mathematical Physics, Vol. 20, No. 2, 1941, pp. 224–230.

28. Dantzig, G. B. Linear Programming and Extensions. Princeton University Press, Princeton, New Jersey, 1963.

29. Hobeika, A. TRANSIMS Fundamentals: Population Synthesizer. U.S. Department of Transportation, Washington, D.C., 2005.

30. Voas, D., and P. Williamson. Evaluating Goodness-of-Fit Measures for Synthetic Microdata. Geographical & Environmental Modeling, Vol. 5, No. 2, 2011, pp. 177–200.

31. Read, T. R., and N. A. Cressie. Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer, 1988.

32. Koehler, K. J. Goodness-of-Fit Tests for Log-Linear Models in Sparse Contingency Tables. Journal of the American Statistical Association, Vol. 81, No. 394, 1986, pp. 483–493.

33. Tzeng, G. H., and J. J. Huang. Multiple Attribute Decision Making: Methods and Applications. CRC Press, 2011.

34. Jeffreys, I. The Use of Compensatory and Non-Compensatory Multi-Criteria Analysis for Small-scale Forestry. Small-scale Forest Economics, Management and Policy, Vol. 3, No. 1, 2004, pp. 99–117.
