
Working Paper Series

IPL working papers are circulated for discussion and comment purposes. They have not been formally peer reviewed. © 2020 by Jeremy Ferwerda, Nicholas Adams-Cohen, Kirk Bansak, Jennifer Fei, Duncan Lawrence, Jeremy M. Weinstein, and Jens Hainmueller. All rights reserved.

Working Paper No. 20-06

Jeremy Ferwerda, Nicholas Adams-Cohen, Kirk Bansak, Jennifer Fei, Duncan Lawrence, Jeremy M. Weinstein, and Jens Hainmueller

Leveraging the Power of Place: A Data-Driven Decision Helper to Improve the Location Decisions of Economic Immigrants

arXiv:2007.13902v1 [cs.CY] 27 Jul 2020

Leveraging the Power of Place: A Data-Driven Decision Helper to Improve the Location Decisions of Economic Immigrants

Jeremy Ferwerda1,2,∗, Nicholas Adams-Cohen1,∗, Kirk Bansak1,3,∗, Jennifer Fei1, Duncan Lawrence1, Jeremy Weinstein1,4, and Jens Hainmueller1,4,5,†

1 Immigration Policy Lab, Stanford University
2 Department of Government, Dartmouth College
3 Department of Political Science, University of California San Diego
4 Department of Political Science, Stanford University
5 Graduate School of Business, Stanford University
∗ Equal contributor

†Project director and corresponding author. Contact: [email protected].

July 2020

A growing number of countries have established programs to attract immigrants who can contribute to their economy. Research suggests that an immigrant’s initial arrival location plays a key role in shaping their economic success. Yet immigrants currently lack access to personalized information that would help them identify optimal destinations. Instead, they often rely on availability heuristics, which can lead to the selection of sub-optimal landing locations, lower earnings, elevated outmigration rates, and concentration in the most well-known locations. To address this issue and counteract the effects of cognitive biases and limited information, we propose a data-driven decision helper that draws on behavioral insights, administrative data, and machine learning methods to inform immigrants’ location decisions. The decision helper provides personalized location recommendations that reflect immigrants’ preferences as well as data-driven predictions of the locations where they maximize their expected earnings given their profile. We illustrate the potential impact of our approach using backtests conducted with administrative data that links landing data of recent economic immigrants from Canada’s Express Entry system with their earnings retrieved from tax records. Simulations across various scenarios suggest that providing location recommendations to incoming economic immigrants can increase their initial earnings and lead to a mild shift away from the most populous landing destinations. Our approach can be implemented within existing institutional structures at minimal cost, and offers governments an opportunity to harness their administrative data to improve outcomes for economic immigrants.

1 Introduction

Immigration has long been recognized as a driver of economic growth (Peri, Shih and Sparber 2015; Kerr et al. 2016). Immigrants increase the size and diversity of the workforce, fill skill shortages, start businesses, and contribute to innovation (Hunt and Gauthier-Loiselle 2010; Borjas 1995; Burchardi et al. 2020). To encourage these positive effects, many countries have complemented family-based and humanitarian admission streams with economic immigration programs, which prioritize the admission of skilled professionals. A prominent example is Canada’s Express Entry system, through which approximately 100,000 immigrants are admitted each year. Applicants earn points for qualifications, such as language ability, educational degrees, and occupational experience, as well as other factors that have been shown to be associated with long-term economic success in Canada. Applicants who are above a certain threshold for that particular application round receive invitations to apply for permanent residence (Desiderio and Hooper 2016). Several other countries, such as Australia, New Zealand, and the United Kingdom, have implemented similar policies (Kerr et al. 2017).

The goal of these programs is to admit immigrants who are likely to succeed economically and contribute to the destination country. Yet despite these programs’ intentions, immigrants nevertheless face a number of barriers to economic success. For instance, while a subset of individuals will have a preexisting job offer, the majority must select an initial location within the destination country in which to settle and begin their job search. However, immigrants generally lack access to personalized information on the locations that are aligned with their preferences and skill sets. As a result, the initial location decision can be affected by a variety of decision-making biases, such as availability heuristics. When economic immigrants choose suboptimal landing locations, economic admission programs cannot realize their full potential. Indeed, previous studies have demonstrated that the initial location of immigrants has a sizable impact on their short- as well as long-term economic outcomes (Åslund and Rooth 2007; Damm 2014; Bansak et al. 2018).

In this study, we propose a data-driven decision helper that leverages administrative data and machine learning methods to improve the initial location decisions of economic immigrants. Drawing on behavioral insights that have been used to improve decisions in other policy domains (Thaler and Sunstein 2009; OECD 2017), our approach seeks to enhance the choice architecture that shapes immigrants’ decisions by providing them with systematic personalized information, delivered in the form of informational nudges, about which locations in the destination country would likely be beneficial to them. Building upon the outcome-based matching algorithm developed in Bansak et al. (2018), the decision helper provides newly invited immigrants with location recommendations that align with their preferences and maximize their expected economic outcomes. These recommendations draw on machine learning models applied to administrative data, which predict how immigrants with similar profiles have fared across possible landing locations, in combination with elicited preferences. The recommendations from the decision helper are not meant to be binding, but provide additional information that assists newly invited economic immigrants to make more informed location decisions.

To illustrate the potential of our approach, we evaluate administrative tax and landing data on recent cohorts from the economic immigration programs within Canada’s Express Entry system. We find that landing locations are highly concentrated, and many economic immigrants settle in destinations that are sub-optimal from the perspective of predicted earnings. Using backtests and simulations, we find that providing data-driven location recommendations could significantly increase the annual income of economic immigrants and more widely disperse the benefits of economic migration across Canada. These gains would be realized at limited marginal cost, since the Canadian government already collects the administrative data used to train the models and communicates regularly with economic immigrants throughout the application process. Although the decision helper should be tested prospectively via a randomized controlled trial to evaluate its full impact on a variety of outcomes, our results suggest that nudging incoming economic immigrants with personalized information could improve their outcomes and create an opportunity for governments to leverage their existing data to offer an innovative resource at scale.

2 A Data-Driven Decision Helper

2.1 Motivation

Our approach is motivated by a growing body of research that demonstrates the importance of the initial landing location in shaping immigrants’ outcomes. For example, studies have used quasi-experimental designs to demonstrate that an immigrant’s initial landing location has an impact on short- as well as long-term economic success (Åslund and Rooth 2007; Damm 2014; Bansak et al. 2018). Similarly, when examining outcomes among non-immigrants, experiments have shown that families who were randomly offered housing vouchers to move to lower-poverty neighborhoods had improved long-term outcomes in terms of earnings and educational attainment for their children (Chetty, Hendren and Katz 2016; Ludwig et al. 2013).

The initial destination choice is also consequential given that many immigrants tend to remain in their landing location (Kaida, Hou and Stick 2020; Mossad et al. 2020). For example, in Canada, more than 80% of recent economic class immigrants remained in their arrival cities ten years later (Kaida, Hou and Stick 2020). In addition, landing locations are often highly concentrated. For example, in Australia, more than half of all recent immigrants settled in Greater Sydney and Greater Melbourne, and only 14% settled outside the major capital cities (Tuli 2019). If initial settlement patterns concentrate immigrants in a few prominent landing regions, many areas of the country may not experience the economic growth associated with immigration. Moreover, undue concentration may impose costs in the form of congestion in local services, housing, and labor markets. To address the uneven distribution of immigrants, governments including Canada and Australia have implemented policy reforms to regionalize immigration and encourage settlement outside of well-known major cities (Taylor et al. 2014; Fotros 2018; Hugo 2008; Brezzi et al. 2010).

Although the evidence suggests that the initial landing location shapes immigrants’ outcomes, choosing an optimal destination from the large set of potential options is a formidable task. Research suggests that immigrants consider the location of family or friends, the perceived availability of employment opportunities, or preferences regarding climate, city size, and cultural diversity (Chiswick and Miller 2004; Hyndman, Schuurman and Fiedler 2006; Akbari and Harrington 2007; Massey 2008; Brezzi et al. 2010; Tonkin and Tonkin 1993; Damm 2009; Mossad et al. 2020). While some of these considerations may lead immigrants to correctly identify an optimal location, research also suggests that location decisions are impaired by common cognitive biases. One such bias, which has been well documented as a powerful influence across many choice settings with incomplete information (Tversky and Kahneman 1974; Thaler and Sunstein 2009), is an availability heuristic. This heuristic suggests that immigrants prioritize places they have heard about, and that they overlook less prominent locations even though these locations may actually align with their preferences and skills.

For example, in the Canadian context, studies indicate that many location decisions are linked to the international prominence of destinations (Di Biase and Bauder 2005). As Bégin-Gillis (2010) argues, “many immigrants choose Toronto simply because that is all they know of Canada” (also see McDonald (2004)). Similarly, Teo (2003) concludes that “unfamiliarity means that decisions regarding their initial destination are often reliant on secondary information sourced from earlier migrants, immigration companies, the Internet or other sources” (also see Fotros (2018)). Recognizing that perceptions of places can skew location decisions toward prominent cities, some provinces have attempted to influence perceptions by providing prospective immigrants with information about less prominent locations. Evaluating these programs, Bégin-Gillis (2010) notes that when “prospective immigrants are provided with more information and provided with more choices, they often choose differently.”

Interventions that use behavioral insights to counteract the effects of cognitive biases have been used in a wide variety of policy domains (Thaler and Sunstein 2009; OECD 2017). Our approach builds on these behavioral insights, and seeks to enhance the choice architecture for newly invited economic immigrants by systematically providing personalized recommendations as they make a decision about where to settle in the destination country. The recommendations from our decision helper reflect individuals’ location preferences as well as data-driven predictions about the locations where they are likely to attain the highest earnings given their profile. The recommendations thus act as informational nudges that counteract the effects of limited information and cognitive biases, assisting immigrants in making more informed location decisions.

The primary anticipated users of the tool are economic immigrants who have been invited to apply for permanent residence via an economic admissions stream and are in the process of selecting a landing location within the destination country. Since governments communicate regularly with immigrants during this transition period to provide information about the immigration process, the decision helper could be offered online via a user interface at minimal cost. We expect that immigrants’ likelihood of using the decision helper tool will depend on their prior level of certainty in their destination decision. As a result, the decision helper provides a complementary source of information with minimal disruption to existing sets of resources that guide decision making. Our decision helper could, with appropriate adjustments, also be useful for other immigrants or even Canadian-born residents as an informational tool in deciding if and where to move. Note that even Canadian-born residents typically do not have access to granular administrative data that would allow them to discern how workers with similar skill sets and backgrounds fare in various locations.

2.2 Design

The decision helper approach combines three main stages: modeling and prediction, preference constraints, and recommendations. Figure 1 is a flowchart of these different steps.

2.2.1 Modeling/Prediction

Our approach leverages individual-level administrative data from prior immigrant arrivals. Governments routinely gather information on applicants to economic immigration programs, ranging from individuals’ skills, education, and prior job experience to their age, gender, and national origin. Using unique identifiers, these background characteristics can be merged with applicants’ initial landing locations and economic outcomes, such as earnings or employment. Although governments with economic admission streams collect these data as part of normal program administration, to our knowledge they have not been systematically leveraged to predict how immigrants with different profiles fare across various landing locations.

Our approach leverages these historical data to fit a set of location-specific, supervised machine learning models that serve as the basis for recommended landing locations for future immigrants. These models learn how immigrants’ background characteristics and skill sets are related to taxable earnings within each potential landing location, while also accounting for local trends over time. The models can then be used to predict an economic immigrant’s expected earnings at any of the possible landing locations. To minimize bias in the models’ earnings predictions, a broad set of characteristics related to immigrants’ backgrounds, qualifications, and skills are included as predictors. Furthermore, to reduce the possibility that observed patterns are driven by location-specific self-selection bias, the data used to train the model could exclude prior immigrants with standing job offers, family ties to specific locations, or other special situations. See the Supplementary Material (SM) appendix for a formalization and decomposition of the possible selection bias in the models, along with discussion on how such bias can be limited.


[Flowchart with three stage labels: Modeling/Predictions, Preference Constraints, Recommendations. The text of each box follows.]

Administrative Data: In the administrative data, for each individual j, we have a vector of covariates x and outcome y. In addition, we have their landing location choice k.

Models: For each of the K total locations, we estimate a function beta using as inputs all j living in k. We use these betas to predict expected outcomes in k.

Client Data: In the client data, for each individual i, we have an equivalent vector of covariates x as used in the modeling step.

Predicted Outcome: For all individuals i in the client data, input covariates x to all K functions and obtain a vector of predicted outcomes in each location.

Underlying Preferences: We assume every person has a utility for each location k.

Restricted Set of Locations: We further assume every person has a utility threshold, below which they will not live in a location. Define S as the subset of locations for person i where utilities are above this utility threshold. Define t as the cardinality of set S for individual i. For each i, we find all locations in S with a survey method.

Recommend Locations: The final recommendation function takes as input the predicted outcomes and the set of restricted locations S, and outputs the t predicted outcomes, ranked, for the locations k in S. Our interface then recommends the top z locations in order.

Figure 1: Decision Helper


2.2.2 Preference Constraints

Next, the decision helper elicits and incorporates individuals’ location preferences. Even if a particular location is predicted to be the best for an immigrant in terms of expected earnings, if the immigrant strongly prefers not to live there, then recommending such a location will result in limited uptake and downstream dissatisfaction. To accommodate individual preferences, we identify the list of locations each user would consider unacceptable, and then limit the choice set to the remaining locations prior to optimizing on expected earnings.

Our approach is agnostic to the specific method used to rule out unacceptable locations prior to making recommendations. For instance, users can be directly queried to indicate regions of the country where they are unwilling to settle. Alternatively, users can be asked to provide preferences regarding specific location characteristics via direct questioning or conjoint survey tasks, spanning identifiable features such as urban density, climate, and the relative availability of amenities. These can be mapped onto observable characteristics of each landing location in order to identify acceptable locations according to the expressed preferences.
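One way to picture the two elicitation routes described above is the following minimal Python sketch. It is illustrative only: the function name, the trait names (`region`, `population`, `jan_mean_temp_c`), and all values are hypothetical and do not come from the paper, which leaves the elicitation method open.

```python
def acceptable_locations(location_traits, excluded_regions, requirements):
    """Return the locations a user has not ruled out.

    location_traits:  {location: {"region": str, trait_name: value, ...}}
    excluded_regions: regions the user is unwilling to settle in (direct query)
    requirements:     {trait_name: (minimum, maximum)} bounds the user accepts
    """
    keep = []
    for name, traits in location_traits.items():
        # Route 1: the user named this region as unacceptable.
        if traits["region"] in excluded_regions:
            continue
        # Route 2: stated preferences mapped onto observable characteristics.
        within_bounds = all(
            low <= traits[trait] <= high
            for trait, (low, high) in requirements.items()
        )
        if within_bounds:
            keep.append(name)
    return keep


# Hypothetical locations and traits, for illustration only.
traits = {
    "ER_A": {"region": "West", "population": 2_500_000, "jan_mean_temp_c": 4},
    "ER_B": {"region": "Prairies", "population": 300_000, "jan_mean_temp_c": -16},
    "ER_C": {"region": "Atlantic", "population": 450_000, "jan_mean_temp_c": -5},
}
chosen = acceptable_locations(
    traits,
    excluded_regions={"West"},
    requirements={"population": (100_000, 1_000_000), "jan_mean_temp_c": (-10, 30)},
)
```

Here `ER_A` is dropped by the direct regional exclusion and `ER_B` by the climate bound, leaving `ER_C` as the only acceptable location.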

2.2.3 Recommendations

After taking individual preferences into account to constrain the choice set, the remaining locations are then ranked with respect to the individual’s predicted earnings in each location. The decision helper delivers these recommendations to the user in the form of an informational nudge via an online interface. Users can be given either a reduced number of the top ranked locations (e.g. the top 3 locations) or a full ranked list of the locations, along with accompanying information. Users choose whether to use the tool and follow the recommendations. This approach is non-coercive, seeking only to inform those who can benefit while not interfering with immigrants who may already have solid plans or private information guiding their selection of a particular location.
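The ranking step can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs: `recommend`, its parameter `top_z`, and the income figures are hypothetical names and numbers, not part of the paper.

```python
def recommend(predicted_income, acceptable, top_z=None):
    """Rank acceptable locations by predicted income, highest first.

    predicted_income: {location: predicted earnings for this individual}
    acceptable:       locations remaining after the preference constraints
    top_z:            how many locations to return (None returns the full list)
    """
    candidates = [loc for loc in acceptable if loc in predicted_income]
    ranked = sorted(candidates, key=lambda loc: predicted_income[loc], reverse=True)
    return ranked if top_z is None else ranked[:top_z]


# Made-up predictions for one individual.
predicted = {"ER_A": 52000, "ER_B": 61000, "ER_C": 47000, "ER_D": 58000}
acceptable = ["ER_A", "ER_C", "ER_D"]  # ER_B was ruled out by the user

top_two = recommend(predicted, acceptable, top_z=2)
full_list = recommend(predicted, acceptable)
```

Note that `ER_B` has the highest predicted income but never appears in the output, because the preference constraint is applied before the ranking.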

3 Empirical Analysis

We illustrate the potential of a decision helper to recommend initial landing locations for newly invited immigrants by evaluating data from Canada’s flagship Express Entry system. Express Entry is a system that manages applications to Canada’s high-skilled economic immigration programs, which select skilled workers for admission through a points-based system.1 The Express Entry application process involves several stages and is designed to select applicants who are most likely to succeed in Canada (Immigration, Refugees and Citizenship Canada 2019). First, eligible candidates create and submit a profile to indicate their interest in moving to Canada. If candidates meet the minimum requirements for one of the programs managed by Express Entry, they are entered into the Express Entry pool, awarded points based on information in their profile and ranked according to the Comprehensive Ranking System (CRS). The CRS awards points based on human capital characteristics including language skills, education, work experience, age, employment and other aspects that previously have been shown to be associated with long-term economic success in Canada. Factors in the CRS are generally grouped under two categories: core points and additional/bonus points. Candidates with the highest rankings in the pool are invited to apply online for permanent residence following regular invitation rounds. If candidates’ CRS scores are above a specified threshold for an invitation round, they receive an invitation to apply for permanent residence, to be submitted within 90 days of receiving an invitation (Immigration, Refugees and Citizenship Canada 2019). If an application is approved by an Immigration, Refugees and Citizenship Canada (IRCC) officer, permanent resident visas are issued so that the applicant and his or her accompanying family members can be admitted to Canada. Processing times for Express Entry profiles vary depending on the program of admission, but the majority of applications are processed within six months.

1 These economic immigration programs are the Federal Skilled Worker Program, the Federal Skilled Trades Program, the Canadian Experience Class, and a portion of the Provincial Nominee Program.

Since the system initially launched in 2015, it has steadily expanded. In 2018, 280,000 Express Entry profiles were submitted, and 92,331 people were admitted to Canada through the Express Entry system (Immigration, Refugees and Citizenship Canada 2019). The growing importance of this system is mirrored by similar developments within other advanced economies. For instance, Australia and New Zealand also use a similar expression of interest process to determine applicants’ eligibility and offer invitations to apply for permanent residence. Countries such as Austria, Japan, South Korea, and the United Kingdom also have aspects of points-based admissions systems built into their economic immigration programs.

3.1 Data

We draw on data from the Longitudinal Immigration Database (IMDB), the integrated administrative database that Immigration, Refugees and Citizenship Canada uses to report on the outcomes of immigrants. IMDB was initially a basic linkage between tax files and the Permanent Residents Database. The IMDB (2019 release) includes more than 12 million immigrants who landed in Canada between 1952 and 2018 and income tax records from 1982 to 2017, as well as all temporary residents, Express Entry Comprehensive Ranking System scores, citizenship uptake, and service usage for settlement programs. We subset the data to include principal applicants who arrived between 2012 and 2017 under the following programs: the Federal Skilled Worker program, the Federal Skilled Trades program, and the Canadian Experience Class. Applications for these admission streams are managed by the Express Entry system. We further subset the data to exclude individuals who were minors at the time of arrival (≤18), as well as individuals who did not file a tax return while living in Canada. Given that the Express Entry system does not apply to Quebec, we also exclude all immigrants who first landed in Quebec or entered on an immigration program run by Quebec. The final sample consists of 203,290 unique principal applicants.

3.2 Measures

The outcome measure is immigrants’ individual annual employment income, measured at the close of the first full calendar year after arrival. We model this outcome as a function of a variety of predictors that are either prior to or contemporaneous with an immigrant’s arrival. Predictors used in the modeling stage include age at arrival, citizenship, continent of birth, education, family status, gender, intended occupation, skill level, English ability, French ability, having a prior temporary residence permit for study in Canada, having a prior temporary residence permit for work in Canada, having previously filed taxes in Canada, arrival month, arrival year, immigration category, and Express Entry indicator. See Table S2 in the SM appendix for more information and summary statistics on these measures.

We map immigrants’ landing locations to a specific Economic Region (ER) within Canada, using census subdivision codes. As regional predictors, we also include the population and the unemployment rate within the ER in the quarter of immigrants’ arrival. An ER is a Canadian census designation that groups neighboring census divisions to proxy regional economies. We use the ER as the primary unit throughout the analysis. There are 76 ERs in total, but in our analysis there are 52 after excluding Quebec and merging the smallest ERs using standard census practices.

3.3 Models

The modeling approach is based on the methodology developed in Bansak et al. (2018). We first merge historical data on immigrants’ background characteristics, economic outcomes, and geographic locations. Using supervised machine learning methods, we fit separate models across each ER estimating an immigrant’s annual employment income as a function of the predictors described above. These models serve as the basis for the decision helper tool’s recommendations, as they allow for the generation of annual employment income predictions across each ER that are personalized to each immigrant’s background characteristics.
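The "one model per ER, then predict everywhere" pattern can be sketched as follows. This is only a structural illustration: a single-predictor least-squares line stands in for the boosted tree models the paper actually uses, and the function names, the single predictor (years of education), and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least squares with one predictor (a stand-in learner, not boosted trees).

    Returns (intercept, slope).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # predictor is constant: fall back to the mean outcome
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_models_by_region(records):
    """records: iterable of (region, predictor, income). Fits one model per region."""
    grouped = defaultdict(lambda: ([], []))
    for region, x, income in records:
        grouped[region][0].append(x)
        grouped[region][1].append(income)
    return {region: fit_line(xs, ys) for region, (xs, ys) in grouped.items()}


def predict_all_regions(models, x):
    """Predicted income for one profile in every region."""
    return {region: a + b * x for region, (a, b) in models.items()}


# Made-up toy data: (region, years of education, first-year income).
history = [
    ("ER_A", 12, 30000), ("ER_A", 16, 42000), ("ER_A", 20, 54000),
    ("ER_B", 12, 36000), ("ER_B", 16, 40000), ("ER_B", 20, 44000),
]
models = fit_models_by_region(history)
predictions = predict_all_regions(models, 18)  # one new profile, scored in every ER
```

Because each region has its own fitted relationship, the same profile receives a different predicted income in each ER, which is what makes a personalized ranking of locations possible.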

As our modeling technique, we use stochastic gradient boosted trees. We use 10-fold cross-validation within the training data to select tuning parameter values, including the interaction depth, bag fraction, learning rate, and number of boosting iterations.2 More details are provided in the SM appendix.

2 The cross-validated R² for our primary set of models (where the units of analysis are principal applicants) is 0.54. Within the context of incomes, which are highly skewed and difficult to predict, this is relatively high. This represents a substantial improvement over the R² for an analogous linear regression model using cross-validation (0.34). See the SM appendix for more details, including a breakdown of the relative importance of the predictors in the boosted trees models.

3.4 Simulations

To estimate how our proposed decision helper would affect income and influence location decisions, we perform a series of backtests using historic Express Entry cohorts. Specifically, we implement a series of simulations in which the decision helper provides recommendations to individual immigrants. We then simulate uptake of these recommendations, and for individuals who follow the recommendation, we compare the expected income at that location and at the location where they actually landed.

After training the models, we input the background characteristics of 2015 and 2016 Express Entry principal applicants (n = 17,640) to obtain predicted income across ERs. These predictions serve as the basis for the simulated recommendations. The degree to which immigrants would follow such recommendations is unknown. To model these dynamics, the simulations vary two parameters that reflect different assumptions concerning the influence of the recommendation on immigrants’ location decisions.

The first parameter is the compliance rate, denoted by π, which is defined as the probability that individuals will follow the recommendations. We assume that the probability of following a recommendation decreases linearly across income quantiles. We apply an upper bound π_max to the individuals with the lowest actual income. We then linearly interpolate to a value of π = 0 across the income distribution. Each individual within the prediction set thus receives an individual compliance parameter, π_i. Functionally, the average compliance rate across the distribution is π_max/2. For example, in the simulations with π_max = .30, on average 15 percent of immigrants are expected to follow the recommendation, and the probability varies from a high of 30 percent for the lowest-income immigrants to close to zero percent for the highest-income immigrants. We chose to vary π_i as a function of income under the assumption that wealthier individuals are more likely to have self-selected into location-specific employment opportunities. Simulations that impose a uniform compliance parameter instead also suggest substantial gains (see SM appendix).
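The interpolation can be made concrete with a short Python sketch. It assumes that "linearly across income quantiles" means linear in each individual's rank in the income distribution, with the lowest earner at π_max and the highest at 0; the function name and the income values are hypothetical, and ties are ordered arbitrarily by the sort.

```python
def compliance_probabilities(incomes, pi_max):
    """Assign each individual a compliance probability that falls linearly
    with income rank: pi_max for the lowest income, 0 for the highest."""
    n = len(incomes)
    if n == 1:
        return [pi_max]
    # Indices of individuals ordered from lowest to highest income.
    order = sorted(range(n), key=lambda i: incomes[i])
    probs = [0.0] * n
    for rank, i in enumerate(order):
        probs[i] = pi_max * (1 - rank / (n - 1))
    return probs


# Made-up incomes for five individuals.
incomes = [38000, 91000, 52000, 27000, 64000]
pi = compliance_probabilities(incomes, pi_max=0.30)
average = sum(pi) / len(pi)
```

With these inputs the lowest earner (27,000) receives 0.30, the highest (91,000) receives 0, and the average is π_max/2 = 0.15, matching the worked example in the text.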

The second parameter is the number of acceptable locations, denoted by φ. Each applicant is assumed to have a set of idiosyncratic preferences regarding locations, which results in a ranked preference order of ERs, ranging from the most attractive (1) to the least attractive (52). The φ parameter determines how many top-preference-ranked ERs are included within the optimization. Location preferences are unmeasured within the administrative data, and must be inferred. Using the landing ER as the dependent variable, we fit a multinomial logit model and proxy immigrants’ preferences using their predicted probability of landing in each ER. After obtaining predictions for each Express Entry case, we rank order locations by each individual’s predicted probability of landing, randomly breaking ties. For each individual, the resulting preference ranks are then used in conjunction with the parameter φ to define their set of acceptable locations, which serves as the initial set of locations considered when selecting the locations with the highest expected incomes. To check that gains are not entirely driven by subsets of locations with certain characteristics, we run simulations entirely removing certain ERs from consideration, and find no major deviation from our core results (see SM appendix).
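The step from predicted landing probabilities to a set of φ acceptable locations might look like the sketch below. The multinomial logit fit itself is not shown; the probabilities are made-up stand-ins for its output, and `acceptable_set` is a hypothetical helper name.

```python
import random


def acceptable_set(landing_probs, phi, rng):
    """Return the phi locations with the highest predicted landing
    probability, breaking ties at random."""
    # Sort by descending probability; the random number orders equal values.
    keyed = [(-p, rng.random(), loc) for loc, p in landing_probs.items()]
    keyed.sort()
    return [loc for _, _, loc in keyed[:phi]]


rng = random.Random(0)  # seeded so the tie-breaking is reproducible
# Made-up predicted probabilities of landing in each ER for one case.
probs = {"ER_A": 0.55, "ER_B": 0.20, "ER_C": 0.15, "ER_D": 0.10}
top_three = acceptable_set(probs, phi=3, rng=rng)
```

A small φ keeps recommendations close to where similar immigrants have tended to land, while a φ equal to the number of ERs removes the preference constraint entirely.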

We conduct a simulation for various combinations of parameters. For each immigrant we consider only the top φ preference-ranked locations, and return the three locations within this subset that are expected to yield the highest employment income for the immigrant. We assume that individuals who follow the recommendations have an equal probability of selecting each of these three locations. However, with probability (1 − π_i), individuals will select their original location rather than any of the recommendations. For each case, we draw from a uniform distribution bounded by 0 and 1 to determine whether the case will take the recommendation or not. If (draw) > π_i, cases are assumed not to have followed the recommendation, and their location is recorded as their actual location. Their income is recorded as their predicted income within their actual location; for these immigrants there is no gain in income from using the tool. For immigrants where (draw) < π_i, we perform a second draw to determine which of the three locations they will select. For these immigrants, the expected gain in income is computed as the difference between the expected income they would earn in the recommended location and in the location they would have chosen without the tool. After performing these random draws for each of the 2015 and 2016 Express Entry users, we obtain the total expected difference in location counts and income.
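The two-draw procedure for a single case can be sketched as follows. The function and variable names are hypothetical, the incomes are made up, and the handling of a draw exactly equal to π_i (treated here as non-compliance) is an assumption, since the text does not specify that boundary.

```python
import random


def simulate_case(pred_income, actual_loc, acceptable, pi_i, rng):
    """Simulate one case; returns (recorded location, expected income gain)."""
    # The three acceptable locations with the highest predicted income.
    ranked = sorted(acceptable, key=lambda loc: pred_income[loc], reverse=True)
    recommended = ranked[:3]
    draw = rng.random()  # first draw: uniform on [0, 1)
    if draw >= pi_i or not recommended:
        return actual_loc, 0.0  # recommendation not followed: no gain
    choice = rng.choice(recommended)  # second draw: equal chance for each
    return choice, pred_income[choice] - pred_income[actual_loc]


rng = random.Random(42)
# Made-up predicted incomes for one individual who actually landed in ER_A.
pred = {"ER_A": 50000, "ER_B": 58000, "ER_C": 61000, "ER_D": 55000, "ER_E": 47000}
acceptable = ["ER_A", "ER_B", "ER_C", "ER_D"]

never = simulate_case(pred, "ER_A", acceptable, pi_i=0.0, rng=rng)
always = simulate_case(pred, "ER_A", acceptable, pi_i=1.0, rng=rng)
```

Summing the gains and tallying the recorded locations over all cases gives the aggregate differences in income and location counts that the simulations report.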

4 Results from Empirical Analysis

4.1 Current Settlement Patterns

As shown in Figure 2, economic immigrants who arrived in the 2015 and 2016 arrival cohorts through the Express Entry system are highly concentrated within a few regions of Canada. About 78% settled in one of the four largest ERs as their initial destination, and 31% of immigrants selected Toronto. In stark contrast, only about 44% of the overall population is concentrated in those four ERs.3
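A concentration share of this kind is simple arithmetic over landing counts, as in the sketch below. The data are made up, and the helper computes the share landing in the k most frequently chosen locations, which is an assumed reading of "largest" for the purpose of the example.

```python
from collections import Counter


def top_share(landing_locations, k):
    """Share of cases that landed in the k most frequently chosen locations."""
    counts = Counter(landing_locations)
    top = counts.most_common(k)
    return sum(n for _, n in top) / len(landing_locations)


# Made-up landing locations for ten cases.
landings = ["ER_A"] * 5 + ["ER_B"] * 3 + ["ER_C"] * 1 + ["ER_D"] * 1
share_top_two = top_share(landings, 2)
```

In this toy example eight of ten cases land in the two most popular locations, so the share is 0.8.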

To what extent do these concentrated settlement patterns support the goal of maximizing incomes? We evaluate this by estimating each economic immigrant’s expected income in every potential landing region as a function of their background characteristics and qualifications (see SM for details). Among economic immigrants who selected one of the four most selected locations, Figure 3 displays how the selected ER would rank in terms of expected earnings relative to all other potential ERs. For example, a rank of 1 for a given immigrant in the top left panel indicates that, at the time of arrival, the models estimate that Toronto ranked first (i.e. best) out of 52 possible landing locations in terms of the expected employment income for that immigrant.

Although the models suggest that a small subset of individuals selected an initial

3 We exclude Quebec from this computation to have an accurate comparison.

[Figure 2: bar chart showing the number of Express Entry cases (horizontal axis: Number of EE Cases) for each landing economic region.]

Figure 2: Landing Economic Regions for Express Entry Cohorts Arriving 2015-2016. N=17,640.

[Figure 3: four histogram panels (Toronto, Edmonton, Lower Mainland, Calgary) showing the number of immigrants (vertical axis: Count) by the expected-income rank of their landing region (horizontal axis: Rank).]

Figure 3: Estimated Annual Income: Rank of Landing Economic Region vs. All Economic Regions. N=17,640

location that maximizes their expected income, we find that for many economic immigrants the chosen location is far from optimal in terms of expected income. For instance, among economic immigrants who chose to settle in Toronto, that landing location only ranked approximately 20th on average out of the 52 ERs in terms of maximizing expected income in the year after arrival. In other words, the data suggest that for the average economic immigrant who settled in Toronto, there were 19 other ERs where that immigrant had a higher expected income than in Toronto. The situation is similar for other prominent locations, including Edmonton, Lower Mainland, and Calgary, where the average ranks are 28, 24, and 26, respectively. Across Canada as a whole, the average rank is a mere 26.5. This suggests that many immigrants do not select locations where individuals with similar background characteristics tend to achieve the best economic outcomes, and there is potential to improve immigrants’ landing choices.
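The rank statistic discussed above can be illustrated with a short sketch. The function name `rank_of_chosen` and the toy numbers are hypothetical, and the treatment of ties (tied locations share the better rank) is an assumption, since the text does not specify one.

```python
def rank_of_chosen(predicted, chosen):
    """Rank of the chosen location by predicted income (1 = best).

    The rank equals one plus the number of locations whose predicted
    income is strictly higher than that of the chosen location.
    """
    return 1 + sum(1 for income in predicted.values() if income > predicted[chosen])


# Toy example with four hypothetical regions (incomes in CAD).
predicted_income = {"Toronto": 50_000, "Region X": 60_000, "Region Y": 55_000, "Region Z": 40_000}
toronto_rank = rank_of_chosen(predicted_income, "Toronto")  # two regions rank higher
```

Under this definition, a rank of 20 means that 19 other regions have a higher predicted income for that immigrant, which matches the interpretation given in the text.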

4.2 Changes in Expected Income and Arrival Locations

Figure 4 displays the results from backtests that examine how a data-driven decision helper tool may influence the expected incomes and location decisions of economic immigrants who enter through the Express Entry system (see SM for details). The top panel shows the estimated effects on the average expected income one year after arrival for economic immigrants across the entire backtest cohort of 2015 and 2016 arrivals. We simulate effects using a varying set of parameters, including the share of immigrants who are assumed to follow the recommendations (horizontal axis) and the number of locations that are considered for each recommendation, based on the immigrants’ modeled location preferences (colors and symbols). These results report the average gain from 100 simulation runs.

The simulations suggest gains in expected annual incomes, even under scenarios in which compliance is low and/or location preferences are highly restrictive. For example, using the assumption that on average only 10% of immigrants settle in one of the recommended locations, and that individuals’ location preferences will rule out 42 of the 52 possible locations as unacceptable (the scenario labeled “Top 10”), the simulation yields an average gain in expected annual employment income one year after arrival of $1,100, averaged across the full cohort. This amounts to a cumulative gain of $55 million in total income for every 50,000 cases that enter Canada via the Express Entry system. Note that these gains are entirely driven by the 10% of immigrants who follow the recommendations, since we assume zero gains for the rest of the cohort. Immigrants who do follow the recommendation increase their expected annual employment income one year after arrival by $10,600 on average, relative to the estimated income at the location they would have selected without using the decision helper. These gains are large relative to the observed average first-year income within the prediction sample ($49,900).
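The arithmetic linking these figures can be checked directly. The snippet below only restates numbers reported in the text; the variable names are illustrative, and the small gap between the implied and reported cohort averages is assumed to reflect rounding of the reported values.

```python
# Figures reported in the text for the "Top 10" scenario.
compliance_share = 0.10          # share of immigrants following a recommendation
gain_per_follower = 10_600       # CAD, average gain among followers
reported_cohort_gain = 1_100     # CAD, average gain across the full cohort
cases = 50_000                   # reference number of Express Entry cases

# Non-followers gain nothing, so the cohort average is roughly the
# follower gain scaled by the compliance share (about 1,060 CAD).
implied_cohort_gain = compliance_share * gain_per_follower

# Cumulative gain for 50,000 cases at the reported cohort average.
cumulative_gain = reported_cohort_gain * cases  # 55,000,000 CAD
```

The implied cohort average of about $1,060 is roughly in line with the reported $1,100, and the cumulative figure of $55 million follows from multiplying the reported average by 50,000 cases.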

Across the full cohort, the total expected gains from implementing the decision helper would be larger if immigrants had less restrictive location preferences and/or more immigrants followed the recommendations. For example, under the assumption that 15% of immigrants follow the recommendations, and the recommended locations are chosen from a set of 25 acceptable locations, the average expected annual income one year after arrival across the cohort increases by $3,400. The SM demonstrates that these results are similar across various robustness checks, including replicating the analysis with cost of living adjusted income (Figure S3), recommending locations to maximize the joint income of principal applicants and their spouses (Figure S4), or removing smaller, larger, or growing ERs from consideration (Figure S7).

The lower panel in Figure 4 displays the anticipated impact on the distribution of economic immigrants across initial landing regions under the 15% compliance and Top 25 location simulation scenario. The results suggest that we would see a mild shift from the most populous destinations toward mid-sized landing regions. For instance, about 15% of immigrants who chose one of the four largest locations would have chosen an alternate location in Canada if they had followed the recommendation. Although there is not a marked redistribution of arrivals toward the smallest locations, the estimates for smaller locations are likely conservative given that the preferences for our simulation were derived from data on the existing residential patterns of immigrants across Canada. Figure S5 in the SM shows the expected distribution if no location preferences were taken into account. While these scenarios find that the majority of outflows continue to be associated with the four largest ERs, we find more movement into a subset of the smaller locations when we do not restrict locations based on the inferred preferences.

4.3 Changes in Expected Income for Subgroups

We also assess the distribution of potential gains across subgroups to understand the differential effects our approach could have for economic immigrants of various backgrounds. Figure 5 shows the estimates of the change in average expected incomes one year after arrival across a variety of different subgroups, again using the assumption that 15% of immigrants would follow the recommendations and that recommended locations are chosen from a set of 25 acceptable locations. While expected gains vary as a function of individuals’ characteristics, the overall increase in income does not appear to be the result of disproportionate benefit on the part of any particular demographic or socioeconomic groups. Instead, we find comparable average gains across a range of subgroups of economic immigrants, including groups stratified by gender, education level, case size, landing year, and immigration category.⁴

5 Potential Limitations

The impact assessment is limited to backtests applied to historical data. Such backtests are commonly used to examine the potential impact of new approaches, but they cannot

4 Results for case size subgroups are only shown for case sizes of 1 and 2 due to an insufficient number of cases of size greater than 2.

[Figure 4, top panel: average gain in annual adjusted income (CAD) one year after arrival (vertical axis) by percent of Express Entry cases who follow the recommendation (horizontal axis), shown separately by economic regions considered (Top 25, Top 15, Top 10). Bottom panel: number of Express Entry cases by economic region, actual versus counterfactual.]

Figure 4: Estimated Average Income Gains and Shifts in Arrival Locations. N=17,640

[Figure 5: average gain in annual adjusted income (CAD) one year after arrival (vertical axis), in six panels: Immigration Category (Skilled Worker, Skilled Trades, Canadian Experience), Landing Year (2015, 2016), Case Size (1, 2), Age (20−29, 30−39, 40+), Education (< BA, BA, MA, PhD), and Gender (Male, Female).]

Figure 5: Estimated Average Income Gains for Subgroups. N=17,640

fully capture all the factors that may affect the potential impact of our decision helper in a prospective implementation.

For instance, economic outcomes could be influenced by compositional effects if many immigrants with a similar profile receive the same location recommendation. Modeling such effects in a backtest is challenging because their direction and magnitude are theoretically ambiguous. An increased concentration could lower expected incomes due to saturation, or alternatively, increase incomes due to local agglomeration economies, which are common for high-skilled migration (Kerr et al. 2017). Similarly, economic immigrants account for a relatively small fraction of the local labor market, implying that the direct impact of the tool on labor market saturation is difficult to determine a priori.

Although we do not model compositional effects directly in the backtests, in a prospective application the decision helper can learn potential compositional effects over time as the models are continually updated based on observed data from new arrivals, as well as local economic conditions at the time of arrival. Therefore, should a particular location cease to be a good match for a particular immigrant profile due to increased concentration, the models would adjust to this pattern over time and no longer recommend that location. In addition, the approach can be adjusted to incorporate location-specific quotas if desired by governments.

The results in a prospective implementation could also differ from the backtests if the expected incomes predicted by the machine learning models over- or underestimate the actual incomes that newly invited immigrants would attain if they were to follow the recommendations. Such prediction errors could occur for a variety of reasons, including immigrants selecting into locations based on unobserved characteristics. Such unobserved characteristics can be separated into two broad categories. The first category includes unobserved characteristics that are unrelated to any particular location, and hence could be associated with higher (or lower) earnings potential in all possible locations. Examples would include an individual’s unmeasured abilities or motivation. The second category includes unobserved characteristics that are unique to specific locations, and hence represent an earnings advantage for a particular individual only in a select location (or select locations). Examples would include an individual’s unknown job offer or social network (e.g. family members) in a specific location. See the SM appendix for a more comprehensive discussion, formalization, and decomposition of the possible selection bias.

These concerns about selection bias driving the results in our backtests are partially addressed by the fact that we train flexible models on a rich set of covariates derived from application data to economic admissions programs. That is, conditional on the broad set of background and skills-based characteristics we observe, it is not likely that individuals have self-selected into locations as a function of unobserved non-location-specific characteristics that account for the full magnitude of the estimated gains we observe in the backtests. For instance, for individuals who are identical on the observed characteristics (e.g. same age, education, profession, skills, etc.), their variation in unmeasured variables such as motivation would need to both be integral in their location choices and significantly affect their earnings potential (see SM for details).

Finally, while the backtest results suggest the possibility of gains if the decision helper were implemented, it is important to obtain reliable estimates of impact through a prospective randomized-controlled trial (RCT). An RCT would randomly assign new arrivals to receive recommendations from the decision helper, allowing for a rigorous evaluation of the tool’s impact on a variety of outcomes, including incomes, satisfaction, and location patterns.

6 Potential Risks

To examine potential risks, it is important to consider how introducing a decision helper tool would influence the status quo. The decision helper provides additional information to incoming economic immigrants so that they can make more informed location decisions, from among the set of possible locations that match their preferences. In doing so, it does not limit immigrants’ agency to choose their final settlement location. Given that the recommendations are based on historical data, the predictions may be subject to error. As a result, the decision helper should be viewed as a complement to, rather than a replacement for, the existing information streams and processes that governments use to inform immigrants about potential destinations.

Care needs to be taken to transparently communicate the locational recommendations to users. In particular, users should be made aware that the recommendations are based upon the goal of maximizing the particular metric of near-term income (or whichever specific metric has been applied) and that the predictions reflect the outcomes that recent immigrants with similar profiles have attained in the past. This does not guarantee that the user’s realized income will be optimal at the recommended location, as expected income cannot take into account all possible factors that are unique to an individual. Nor does it imply that the recommended location would necessarily be optimal in terms of other possible life goals or long-term earnings. While we do not have evidence that selecting locations based on near-term earnings will have a negative impact on these longer-term outcomes, this is a theoretically possible risk that would need to be monitored over the course of a prospective implementation.

7 Discussion

A growing number of countries have implemented economic immigration programs to attract global talent and generate growth. However, economic immigrants lack access to personalized information that would help them identify their most beneficial initial settlement locations. In this study, we propose a data-driven decision helper that delivers informational nudges to counteract the effects of limited information and cognitive biases. The decision helper harnesses insights from administrative data to recommend the locations that would maximize immigrants’ expected incomes and align with their preferences. We illustrate its potential by conducting backtests on historical data from the Canadian Express Entry system. The results suggest that economic immigrants currently select sub-optimal locations, and that there could be gains in expected incomes from providing data-driven, personalized location recommendations. While the results from our backtests suggest potential gains in the Canadian context, it is important to assess the impact in the context of a pilot initiative with a randomized controlled trial.

The decision helper outlined in this study is adaptable and could be rapidly implemented within existing institutional structures in various countries. First, as we demonstrate via application to Canadian data, many governments are already collecting administrative data that can be used to generate recommendations. Moreover, governments administering economic admission programs engage in regular communication with applicants, implying that the decision helper can easily be made accessible to a large group of users. In light of the gains observed in the backtests, these limited costs suggest a positive return on investment, even in scenarios where only a small share of immigrants follow the recommendations. Second, the approach is flexible in terms of implementation and can be adjusted to the specific priorities identified by the destination government. For example, the decision helper could be used to improve other measurable integration outcomes (for example, longer-term measures of income), incorporate location-specific quotas, and allow for a wide variety of approaches to elicit preferences and display recommendations. Third, the approach is designed as a learning system such that the models for the predictions are continually updated using observed data from new arrivals and changing local economic conditions. The decision helper therefore learns synergies between personal characteristics and landing locations as they evolve over time and adjusts the recommendations accordingly. Finally, the approach supports individual agency. The decision helper provides immigrants with personalized recommendations that help them make a more informed decision, but immigrants decide whether to use it, and they can decline the recommendations. The decision helper thus complements rather than replaces existing information streams and processes.

In sum, a data-driven decision helper holds the potential to assist incoming economic immigrants in overcoming informational barriers and choosing better landing locations. In addition, the approach we outline offers governments the ability to leverage administrative data to increase economic returns within the structures of their existing admission process. Together, we expect these factors to improve the well-being of economic immigrants and the communities in which they settle.

8 Acknowledgments

This study was completed as part of a Data Partnership Arrangement between Immigration, Refugees and Citizenship Canada (IRCC) and the Immigration Policy Lab. The analysis, conclusions, opinions, and statements expressed in the material are those of the authors, and not necessarily those of the IRCC. This research received generous support from Eric and Wendy Schmidt by recommendation of Schmidt Futures. We also acknowledge funding from the Charles Koch Foundation. These funders had no role in the data collection, analysis, decision to publish, or preparation of the manuscript.

References

Akbari, Ather H and Jennifer S Harrington. 2007. Initial location choice of new immigrants to Canada. Technical report. Working Paper 05-2007, Atlantic Metropolis Centre.

Åslund, Olof and Dan-Olof Rooth. 2007. “Do when and where matter? Initial labour market conditions and immigrant earnings.” The Economic Journal 117(518):422–448.

Bansak, Kirk, Jeremy Ferwerda, Jens Hainmueller, Andrea Dillon, Dominik Hangartner, Duncan Lawrence and Jeremy Weinstein. 2018. “Improving refugee integration through data-driven algorithmic assignment.” Science 359(6373):325–329.

Bégin-Gillis, Margot. 2010. “Immigrant settlement in rural Nova Scotia: Impacting the location decisions of newcomers.” Papers in Canadian Economic Development 12:1–18.

Borjas, George J. 1995. “The economic benefits from immigration.” Journal of Economic Perspectives 9(2):3–22.

Brezzi, Monica, Jean-Christophe Dumont, Mario Piacentini and Cécile Thoreau. 2010. Determinants of localization of recent immigrants across OECD regions. Technical report. Paper for OECD Workshop “Migration and Regional Development”, June 7, 2010, Paris.

Burchardi, Konrad, Thomas Chaney, Tarek Alexander Hassan, Stephen Terry and Lisa Tarquinio. 2020. Immigration, Innovation, and Growth. Technical report. NBER Working Paper No. 27075.

Chetty, Raj, Nathaniel Hendren and Lawrence F Katz. 2016. “The effects of exposure to better neighborhoods on children: New evidence from the Moving to Opportunity experiment.” American Economic Review 106(4):855–902.

Chiswick, Barry R and Paul W Miller. 2004. “Where immigrants settle in the United States.” Journal of Comparative Policy Analysis: Research and Practice 6(2):185–197.

Damm, Anna Piil. 2009. “Determinants of recent immigrants’ location choices: Quasi-experimental evidence.” Journal of Population Economics 22(1):145–174.

Damm, Anna Piil. 2014. “Neighborhood quality and labor market outcomes: Evidence from quasi-random neighborhood assignment of immigrants.” Journal of Urban Economics 79:139–166.

Desiderio, Maria Vincenza and Kate Hooper. 2016. The Canadian expression of interest system: A model to manage skilled migration to the European Union? Technical report. Migration Policy Institute Europe.

Di Biase, Sonia and Harald Bauder. 2005. “Immigrant settlement in Ontario: Location and local labour markets.” Canadian Ethnic Studies 37(3):114.

Fotros, Homayoon. 2018. Destination matters: Policy options to balance the distribution of Iranian immigrants in Canada. Technical report. PhD Thesis, Simon Fraser University.

Hugo, Graeme. 2008. “Immigrant settlement outside of Australia’s capital cities.” Population, Space and Place 14(6):553–571.

Hunt, Jennifer and Marjolaine Gauthier-Loiselle. 2010. “How much does immigration boost innovation?” American Economic Journal: Macroeconomics 2(2):31–56.

Hyndman, Jennifer, Nadine Schuurman and Rob Fiedler. 2006. “Size matters: Attracting new immigrants to Canadian cities.” Journal of International Migration and Integration 7(1):1.

Immigration, Refugees and Citizenship Canada. 2019. Express Entry year-end report 2018. Technical report.

Kaida, Lisa, Feng Hou and Max Stick. 2020. “Are refugees more likely to leave initial destinations than economic immigrants? Recent evidence from Canadian longitudinal administrative data.” Population, Space and Place p. e2316.

Kerr, Sari Pekkala, William Kerr, Çaglar Özden and Christopher Parsons. 2016. “Global talent flows.” Journal of Economic Perspectives 30(4):83–106.

Kerr, Sari Pekkala, William Kerr, Çaglar Özden and Christopher Parsons. 2017. “High-skilled migration and agglomeration.” Annual Review of Economics 9:201–234.

Ludwig, Jens, Greg J Duncan, Lisa A Gennetian, Lawrence F Katz, Ronald C Kessler, Jeffrey R Kling and Lisa Sanbonmatsu. 2013. “Long-term neighborhood effects on low-income families: Evidence from moving to opportunity.” American Economic Review 103(3):226–31.

Massey, Douglas S. 2008. New Faces in New Places: The Changing Geography of American Immigration. Russell Sage Foundation.

McDonald, James Ted. 2004. “Toronto and Vancouver bound: The location choice of new Canadian immigrants.” Canadian Journal of Urban Research pp. 85–101.

Mossad, Nadwa, Jeremy Ferwerda, Duncan Lawrence, Jeremy M Weinstein and Jens Hainmueller. 2020. “In search of opportunity and community: Internal migration of refugees in the United States.” Science Advances (Forthcoming).

OECD. 2017. Behavioural insights and public policy: Lessons from around the world. Technical report.

Peri, Giovanni, Kevin Shih and Chad Sparber. 2015. “STEM workers, H-1B visas, and productivity in US cities.” Journal of Labor Economics 33(S1):S225–S255.

Taylor, Andrew J, Lauren Bell, Rolf Gerritsen et al. 2014. “Benefits of skilled migration programs for regional Australia: Perspectives from the Northern Territory.” Journal of Economic & Social Policy 16(1):35.

Teo, Sin Yih. 2003. Imagining Canada: Tracing the cultural logics of migration amongst PRC immigrants in Vancouver. Technical report. PhD Thesis, University of British Columbia.

Thaler, Richard H and Cass R Sunstein. 2009. Nudge: Improving Decisions about Health, Wealth, and Happiness. Penguin.

Tonkin, Sue and Sue Tonkin. 1993. Initial location decisions of immigrants: Results from the longitudinal survey of immigrants to Australia (LSIA) pilot. Technical report. Australian Government Pub. Service.

Tuli, Sajeda. 2019. “Migrants want to live in the big cities, just like the rest of us.” The Conversation March 31, 2019.

Tversky, Amos and Daniel Kahneman. 1974. “Judgment under uncertainty: Heuristics and biases.” Science 185(4157):1124–1131.

Supplementary Material

S1 Decision Helper Workflow

In this section, we provide additional details and formalize our decision helper approach. This workflow consists of three stages: modeling/prediction, preference constraints, and recommendations. We repeat the visualization of the workflow with additional formal notation in Figure 1, and describe each step of the process in detail.

S1.1 Modeling/prediction

In the first stage, we use training data to build a series of models that predict expected outcomes in a particular location. This process begins by gathering a set of Administrative Data, an individual-level dataset containing information about prior immigrant arrivals. This dataset must consist of individuals who are similar to the eventual users of the decision helper tool.

For each individual in the administrative data, we need three pieces of information: a collection of covariates, a measurable outcome, and a choice of landing location. Because our goal is to determine unique synergies between individual-level profiles and outcomes in a particular landing spot, we estimate models for each location separately.

For each individual in the administrative data $j = 1, \dots, m$, let the outcome of interest be denoted $y_j$ and the landing decision denoted $w_j \in \{1, \dots, K\}$. Let $\vec{x}_j$ represent a $p$-dimensional vector of relevant covariates for individual $j$, and $x_{jr}$ represent the $r$-th feature in the $p$-dimensional vector. Our goal in the model training portion of the workflow is to predict the outcome based on the relevant covariates and specified landing location; that is, to estimate a function $\beta$ mapping $\vec{x}_j$ to $y_j$. As we want to find separate functional forms for each location $k = 1, \dots, K$, we estimate $K$ total functions $\beta_k(\vec{x}_j \mid w_j = k)$. We find an approximation $\hat{\beta}_k$ to $\beta_k$ by minimizing the expected value of a specified loss function $L(y_j, \beta_k(\vec{x}_j))$ over the joint distribution of $(y, \vec{x})$:

$$\hat{\beta}_k = \arg\min_{\beta_k} \; E_{(y, \vec{x})} \, L(y, \beta_k(\vec{x}))$$

After estimating $\hat{\beta}_k$ for all $K$ landing locations in the administrative data, we then apply these models to the Client Data, the potential users of the decision helper tool. For each client $i = 1, \dots, n$, we have the same set of covariates used to train the models, $\vec{x}_i$. By inputting $\vec{x}_i$ to each of the $\hat{\beta}_k$ location functions, we obtain individual predicted outcomes in each possible location. The following delineates each step in this process:

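1. Denote the administrative data by matrix $A$:

$$
A = \begin{bmatrix}
y_1 & w_1 & x_{11} & \cdots & x_{1r} & \cdots & x_{1p} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
y_j & w_j & x_{j1} & \cdots & x_{jr} & \cdots & x_{jp} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
y_m & w_m & x_{m1} & \cdots & x_{mr} & \cdots & x_{mp}
\end{bmatrix}
$$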

2. Train a set of $K$ models,

$$L = \{\hat{\beta}_1(\vec{x}_j), \dots, \hat{\beta}_k(\vec{x}_j), \dots, \hat{\beta}_K(\vec{x}_j)\}$$

as follows.

For $k = 1, \dots, K$:

a) Subset $A$ to individuals for whom $w_j = k$ and call this $A_k$:

$$
A_k = \begin{bmatrix}
y_1 & x_{11} & \cdots & x_{1r} & \cdots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
y_j & x_{j1} & \cdots & x_{jr} & \cdots & x_{jp} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
y_{m_k} & x_{m_k 1} & \cdots & x_{m_k r} & \cdots & x_{m_k p}
\end{bmatrix}_{w=k}
= \begin{bmatrix}
y_1 & \vec{x}_1 \\
\vdots & \vdots \\
y_j & \vec{x}_j \\
\vdots & \vdots \\
y_{m_k} & \vec{x}_{m_k}
\end{bmatrix}_{w=k}
$$

where $m_k$ denotes the number of individuals in the administrative data for whom $w_j = k$.

b) Using the data in $A_k$, model and estimate $\beta_k$.

Note that while there are many ways to potentially model $\beta_k$, we have found that using supervised machine learning methods provides the best flexible solution for capturing complex non-linearities and interactions between covariates, and for automatically performing feature selection. To avoid overfitting on the training set, wherein $\hat{\beta}_k$ has very high predictive power in the training set but low out-of-sample predictive power, it is necessary to use cross-validation in the training process.

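3. Denote the client data by matrix $C$:

$$
C = \begin{bmatrix}
x_{11} & \cdots & x_{1r} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{j1} & \cdots & x_{jr} & \cdots & x_{jp} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nr} & \cdots & x_{np}
\end{bmatrix}
= \begin{bmatrix}
\vec{x}_1 \\ \vdots \\ \vec{x}_j \\ \vdots \\ \vec{x}_n
\end{bmatrix}
$$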

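4. For all clients in $C$ and all $K$ locations, estimate $\hat{\beta}_k : \vec{x}_i \to \hat{y}_i$ as follows:

For $i = 1, \dots, n$:

For $k = 1, \dots, K$:

Estimate $\hat{\beta}_k(\vec{x}_i)$ by applying the $k$-th model in $L$ to $\vec{x}_i$, where $\hat{\beta}_k(\vec{x}_i) = \hat{y}_{ik}$

Arrange $\hat{y}_{ik}$ into a vector $\vec{y}_i = [\hat{y}_{i1}, \hat{y}_{i2}, \dots, \hat{y}_{iK}]$

5. Produce a matrix of predicted outcomes $M$, with rows corresponding to clients and columns corresponding to potential landing locations, as follows.

$$
M = \begin{bmatrix}
\vec{y}_1 \\ \vdots \\ \vec{y}_i \\ \vdots \\ \vec{y}_n
\end{bmatrix}
= \begin{bmatrix}
\hat{y}_{11} & \cdots & \hat{y}_{1k} & \cdots & \hat{y}_{1K} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\hat{y}_{i1} & \cdots & \hat{y}_{ik} & \cdots & \hat{y}_{iK} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\hat{y}_{n1} & \cdots & \hat{y}_{nk} & \cdots & \hat{y}_{nK}
\end{bmatrix}
$$

This represents the final output of the modeling/prediction phase of the workflow.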
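Steps 1 through 5 can be sketched as follows. This is an illustrative outline only: a one-covariate least-squares line stands in for the supervised machine learning models described above, cross-validation is omitted, and the names `fit_line`, `train_models`, and `predict_matrix` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; a simple stand-in for beta_k-hat."""
    n = len(xs)
    if n == 0:
        raise ValueError("no training rows for this location")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def train_models(A, K):
    """A is a list of rows (y_j, w_j, x_j); returns one model per location."""
    models = {}
    for k in range(1, K + 1):
        A_k = [(y, x) for (y, w, x) in A if w == k]  # step 2a: subset to w_j = k
        models[k] = fit_line([x for _, x in A_k], [y for y, _ in A_k])  # step 2b
    return models


def predict_matrix(models, C, K):
    """Steps 4-5: rows are clients, columns are locations 1..K."""
    return [[models[k](x_i) for k in range(1, K + 1)] for x_i in C]


# Toy administrative data: (outcome, landing location, single covariate).
A = [(10.0, 1, 1.0), (20.0, 1, 2.0), (5.0, 2, 1.0), (5.0, 2, 3.0)]
M = predict_matrix(train_models(A, K=2), C=[3.0, 0.0], K=2)
```

In a real application each row would carry a full covariate vector, and the per-location models would be tuned with cross-validation as the text recommends.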

S1.2 Preference Constraint

The next stage of our approach involves eliciting clients’ underlying preferences and ruling out locations that are inconsistent with these preferences. Specifically, we assume that for every client $i = 1, \dots, n$, preferences for each location $k \in \{1, \dots, K\}$ can be expressed by a utility value $u_{ik} \in \mathbb{R}$. The set of utility values over the $K$ locations is arranged in a vector $\vec{u}_i = [u_{i1}, u_{i2}, \dots, u_{iK}]$.

We further assume that every individual has a utility threshold below which they will find a location unacceptable to live in. We denote this utility threshold $\psi_i$. We denote the subset of acceptable locations for each individual $i$ as $S_i \subset \{1, \dots, K\}$, and define acceptable locations as those locations $k'$ where an individual’s utility value is above their utility threshold value $\psi_i$:

$$S_i = \{k' \in \{1, \dots, K\} : u_{ik'} > \psi_i\}$$

Given that every individual has their own utility vector $\vec{u}_i$ and utility threshold value $\psi_i$, the cardinality of set $S_i$ can differ across individuals. Define $t_i = |S_i|$, the number of acceptable locations for person $i$. We assume $t_i \geq 1$; that is, $S_i$ is non-empty and at least one location is above the utility threshold.

We find the set $S_i$ of acceptable locations for each $i$ with a survey method. As expressed in the main paper, we are agnostic as to which survey device is used, as long as the method allows us to restrict the set of locations to those consistent with $i$’s underlying preferences.
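The preference constraint can be illustrated with a short sketch. The function name `acceptable_locations` and the toy utility values are hypothetical; in practice the utilities or the acceptable set itself would come from whichever survey device is used.

```python
def acceptable_locations(utilities, threshold):
    """Return S_i: the 1-based indices of locations whose utility
    strictly exceeds the individual's threshold psi_i."""
    S_i = {k for k, u in enumerate(utilities, start=1) if u > threshold}
    if not S_i:
        # The text assumes t_i >= 1, i.e. at least one acceptable location.
        raise ValueError("no location exceeds the utility threshold")
    return S_i


# Toy example: K = 3 locations, threshold psi_i = 0.4.
S_example = acceptable_locations([0.2, 0.9, 0.5], threshold=0.4)
t_example = len(S_example)  # number of acceptable locations, t_i
```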


S1.3 Recommendations

The final stage of our workflow uses as input the predicted outcome vector $\vec{y}_i$ and set of feasible locations $S_i$ to produce a final set of recommendations for individual $i$. This process is formalized as follows:

Define $\mu$ as a function with predicted outcome vector $\vec{y}_i$ and set of feasible locations $S_i$ as inputs. The function $\mu$ then outputs vector $\vec{y}_{iR}$, which ranks expected outcomes for all $t_i$ locations within feasible set $S_i$.

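$$
\mu : (\vec{y}_i, S_i) \longrightarrow \vec{y}_{iR} = [\hat{y}_{ik'_1}, \hat{y}_{ik'_2}, \dots, \hat{y}_{ik'_{t_i}}]
\quad \text{s.t.} \quad \hat{y}_{ik'_1} \geq \hat{y}_{ik'_2} \geq \dots \geq \hat{y}_{ik'_{t_i}} \text{ and } k' \in S_i
$$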

Define $z$ as the maximum number of recommendations to present to the user, and $z'_i$ as the minimum of $z$ and $t_i$:

$$z'_i = \min(z, t_i)$$

The final interface will recommend the top $z'_i$ locations to user $i$, in rank order:

$$(k'_1, k'_2, \dots, k'_{z'_i})$$

Various formats could be used to present the recommendations.
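The recommendation stage can be sketched as below. The names `rank_feasible` and `top_recommendations` are hypothetical. Unlike $\mu$, which is defined as returning the ranked outcomes, this sketch returns the ranked location indices directly, and breaking ties by the lower location index is an added assumption.

```python
def rank_feasible(y_i, S_i):
    """Order the feasible locations in S_i by predicted outcome, best first.

    y_i is the list of predicted outcomes for locations 1..K (1-based
    indexing into a 0-based list). Ties go to the lower location index.
    """
    return sorted(S_i, key=lambda k: (-y_i[k - 1], k))


def top_recommendations(y_i, S_i, z):
    """Return the top z'_i = min(z, t_i) feasible locations, in rank order."""
    ranked = rank_feasible(y_i, S_i)
    z_prime = min(z, len(ranked))
    return ranked[:z_prime]


# Toy example: K = 4 locations, location 4 is ruled out by preferences.
y_example = [50.0, 70.0, 60.0, 90.0]
recs_two = top_recommendations(y_example, {1, 2, 3}, z=2)
recs_all = top_recommendations(y_example, {1, 2, 3}, z=5)
```

In the toy example, location 4 has the highest predicted outcome but is never recommended because it lies outside the feasible set.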

S2 Properties of the Modeling/Prediction Stage

For individuals denoted by $i$, let $Y_i$ denote observed outcomes, $A_i$ denote their chosen locations, and $X_i$ denote their observed characteristics (which can denote a vector of covariates or a single fully stratifying variable). Further, let $Y_i(a)$ denote the potential outcome for individual $i$ in any location $a \in S_A$, where $S_A$ denotes the set of possible locations. In other words, $Y_i(a)$ represents the outcome that individual $i$ would achieve if that individual had chosen location $a$, and $Y_i = Y_i(A_i)$.⁵ In the modeling/prediction stage of the decision helper, the goal is to determine the optimal location for each individual as a function of their observed characteristics. In other words, for each stratum $X_i = x$ and at each location $a \in S_A$, the goal is to determine the following quantity of interest:

$$\theta_a(x) \equiv E[Y_i(a) \mid X_i = x]$$

where the expectation (and all expectations presented hereafter) is defined over the distribution of the population of interest (i.e. the population for whom the decision helper

5 Note that the definition of the potential outcomes implies the stable unit treatmentvalue assumption (SUTVA).

26

is targeted).The goal is then to use this quantity for all a ∈ SA to determine each individual’s

optimal location(s)—that is, the location(s) for which the quantity is highest, perhapssubject to additional constraints—and then deliver informational nudges to encourageindividuals to land in these locations.

However, a key impediment to using $\theta_a(x)$ in this ideal manner is that $\theta_a(x)$ is not necessarily identifiable with observed data. Instead, what is identified is the following:

$$
\theta'_a(x) \equiv E[Y_i(a) \mid X_i = x, A_i = a] = E[Y_i \mid X_i = x, A_i = a]
$$

That is, it is based upon $\theta'_a(x)$ (or estimates thereof) that optimal locations will be inferred for each individual, and these inferences may not perfectly match the true optimal locations as defined by $\theta_a(x)$.

Therefore, it is useful to characterize the potential bias of $\theta'_a(x)$ with respect to $\theta_a(x)$ in order to (a) understand the extent to which that bias may result in suboptimal informational nudges and (b) identify concrete actions that can be taken to limit or eliminate the bias. To do so, the following additional quantities are first defined:

$$
\theta''_a(x) \equiv E[Y_i(a) \mid X_i = x, A_i \neq a]
$$

$$
p_a(x) \equiv P(A_i = a \mid X_i = x)
$$

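In addition, assume that $0 < p_a(x)$ for all $a$ and $x$, and note that $\theta_a(x) = \theta'_a(x)\,p_a(x) + \theta''_a(x)\,(1 - p_a(x))$. Hence, the bias of $\theta'_a(x)$ with respect to $\theta_a(x)$ is bounded by the following:

$$
\begin{aligned}
B_a(x) &\equiv \lim_{p_a(x) \to 0} \left( \theta'_a(x) - \theta_a(x) \right) = \theta'_a(x) - \theta''_a(x) \\
&= E[Y_i(a) \mid X_i = x, A_i = a] - E[Y_i(a) \mid X_i = x, A_i \neq a]
\end{aligned}
$$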

This term is a form of selection bias that represents, within a stratum of $x$, the extent to which the mean potential outcome in location $a$ is higher for individuals who actually choose $a$ versus individuals who do not choose $a$.
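The step from the decomposition of $\theta_a(x)$ to the bound is short enough to spell out. The following is a worked restatement using only the quantities already defined; the numbers in the example afterwards are invented purely for illustration.

$$
\begin{aligned}
\theta'_a(x) - \theta_a(x) &= \theta'_a(x) - \theta'_a(x)\,p_a(x) - \theta''_a(x)\,(1 - p_a(x)) \\
&= (1 - p_a(x))\,\left(\theta'_a(x) - \theta''_a(x)\right)
\end{aligned}
$$

Because $0 < p_a(x) \leq 1$, the factor $1 - p_a(x)$ lies in $[0, 1)$, so the bias never exceeds $\theta'_a(x) - \theta''_a(x)$ in magnitude and approaches it as $p_a(x) \to 0$. For instance, with hypothetical values $\theta'_a(x) = 60{,}000$, $\theta''_a(x) = 50{,}000$, and $p_a(x) = 0.2$, we get $\theta_a(x) = 52{,}000$ and a bias of $8{,}000 = 0.8 \times 10{,}000$.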

If it can be assumed that $Y_i(a) \perp\!\!\!\perp A_i \mid X_i$, then the selection bias is eliminated (i.e. selection on observables). However, it could be that the potential outcomes are also related to unobserved characteristics of an individual that may also be correlated with location choices. Such unobserved characteristics can be separated into two broad categories. The first, denoted by $U_i$, are unobserved characteristics that are unrelated to any particular location. Examples would include an individual's unmeasured abilities or motivation. The second, denoted by $V_{ai}$, are unobserved characteristics that are unique to the particular location $a$ in question. Examples would include an individual's unknown job offer or social network (e.g. family members) in location $a$.

Taking these unobserved characteristics into account, for any location $a$ let the potential outcome $Y_i(a)$ be modeled as an arbitrary (and arbitrarily complex) function of $X_i$, $U_i$, and $V_{ai}$ as well as an exogenous error term:

$$
Y_i(a) = g_a(X_i, U_i, V_{ai}) + \varepsilon_i
$$

where $E[\varepsilon_i \mid X_i, A_i] = 0$. By extension, we have the following:

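$$
\begin{aligned}
B_a(x) &= E[Y_i(a) \mid X_i = x, A_i = a] - E[Y_i(a) \mid X_i = x, A_i \neq a] \\
&= E[g_a(X_i, U_i, V_{ai}) + \varepsilon_i \mid X_i = x, A_i = a] - E[g_a(X_i, U_i, V_{ai}) + \varepsilon_i \mid X_i = x, A_i \neq a] \\
&= \int g_a(x, u, v_a)\, dF_{U_i, V_{ai} \mid X_i = x, A_i = a}(u, v_a) - \int g_a(x, u, v_a)\, dF_{U_i, V_{ai} \mid X_i = x, A_i \neq a}(u, v_a) \\
&= \int\!\!\int g_a(x, u, v_a)\, dF_{V_{ai} \mid U_i = u, X_i = x, A_i = a}(v_a)\, dF_{U_i \mid X_i = x, A_i = a}(u) \\
&\quad - \int\!\!\int g_a(x, u, v_a)\, dF_{V_{ai} \mid U_i = u, X_i = x, A_i \neq a}(v_a)\, dF_{U_i \mid X_i = x, A_i \neq a}(u)
\end{aligned}
$$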

where $F_{U, V_a \mid X, A}$ denotes the joint conditional distribution function of $U$ and $V_a$ given $X$ and $A$; $F_{V_a \mid U, X, A}$ denotes the conditional distribution function of $V_a$ given $U$, $X$, and $A$; and $F_{U \mid X, A}$ denotes the conditional distribution function of $U$ given $X$ and $A$, all in the population of interest.

These results help to highlight what assumptions are required, and what corresponding design decisions could be made, to limit or eliminate this bias. For instance, provided a sufficiently rich set of covariates is observed in $X_i$, the following assumption may hold:

$$
U_i \perp\!\!\!\perp A_i \mid X_i
$$

In words, this assumption states that within strata of $X$, individuals are not self-selecting into locations as a function of unobserved non-location-specific characteristics. For instance, for individuals who are identical on $X_i$ (e.g. same age, education, profession, skills, etc.), their variation in unmeasured variables $U_i$ such as motivation is unrelated to their location choices $A_i$. Under the assumption that $U_i \perp\!\!\!\perp A_i \mid X_i$, we have $F_{U \mid X, A} = F_{U \mid X}$, and hence the bias term simplifies to:

$$
B_a(x) = \int\!\!\int g_a(x, u, v_a) \left\{ dF_{V_{ai} \mid U_i = u, X_i = x, A_i = a}(v_a) - dF_{V_{ai} \mid U_i = u, X_i = x, A_i \neq a}(v_a) \right\} dF_{U_i \mid X_i = x}(u)
$$

In other words, under this assumption, the bias is driven by the difference in the distribution of $V_{ai}$ between individuals who choose $a$ and those who do not, by joint strata of $X$ and $U$. If we make this assumption in the context of the simulated backtests applied to the Canada Express Entry applicants, then for the resulting estimated gains to be driven purely by bias, the average estimated gains among compliers would have to be accounted for by bias attributed solely to location-specific links or advantages that the individuals who chose any particular Economic Region had over otherwise identical individuals who did not choose that Economic Region. In other words, the individuals who select into a particular location would need to have an average annual employment income advantage at that location of between \$11,000 and \$22,700 (depending on the simulation scenario) due to pre-determined job offers or family/social network ties, compared to otherwise identical individuals who chose different locations.

Another assumption that could be made is that $V_{ai}$ is constant in the population of interest, which could be ensured by design by redefining the population of interest and excluding individuals accordingly, e.g. excluding all individuals likely to have pre-determined job offers.$^6$ Under the assumption that $V_{ai} = v_a$ for all individuals in the population of interest, the bias term simplifies to the following:

$$
B_a(x) = \int g_a(x, u, v_a)\, dF_{U_i \mid X_i = x, A_i = a}(u) - \int g_a(x, u, v_a)\, dF_{U_i \mid X_i = x, A_i \neq a}(u)
$$

If the previous assumption that $U_i \perp\!\!\!\perp A_i \mid X_i$ is added back in, the bias is completely eliminated:

$$
B_a(x) = \int g_a(x, u, v_a) \left\{ dF_{U_i \mid X_i = x}(u) - dF_{U_i \mid X_i = x}(u) \right\} = 0
$$

S3 Application of Decision Helper Workflow: Canada Express Entry

The empirical application of our proposed decision helper workflow analyzed data from Canada's Express Entry system. While we provide an overview of this method in the body of our paper, here we provide additional methodological details on how we implement the Model/Predictions and Preference Constraint portions of our workflow.

S3.1 Data Sources

We merged three datasets at the Federal Research Data Centre in Ottawa to conduct our analysis:

• IMDB Integrated Permanent and Non-permanent Resident File (PNRF) 1980-2018 (2019 release)

• IMDB Tax Year Files (t1ff) 2013-2017 (2019 release)

• Express Entry File (2018 release, case level data)

6 Note that excluding those with job offers from the training data set would have meant excluding a significant proportion of immigrants who came through Express Entry in the first 2 years (2015-2016). However, this limitation becomes less salient as the share of admissions with job offers has declined considerably in recent years with the reduction in the number of points for arranged employment in the CRS. For example, in 2017-2019 only about 10% of invited candidates had a job offer or arranged employment.

In addition, we leveraged several additional datasets to provide supplementary information on population levels, unemployment rates, and price indices by geographic region:

• Canada Labour Force Survey (LFS): A monthly survey providing data on the labour market, including estimates of employment and demographics of the working population. Estimates are available at different levels of geographic aggregation, including Economic Region.

• Canadian Rental Housing Index: Public index compiled by the BC Non-Profit Housing Association, based on the 2016 Census. The index reports the average rental price for a single-family apartment, by Census Subdivision (CSD).

A full list of considered variables is found in Table S1. Note that all analyses were conducted in the Federal Research Data Centre in Ottawa and all data output presented here was approved for release.

S3.1.1 Administrative Data

Although the population of interest consists of immigrants entering through Canada's Express Entry system, limited data on this relatively new initiative required us to expand our training data to similar economic immigrants entering Canada. Specifically, in addition to including all Express Entry clients who entered between 2015 and 2016 and filed a tax return, we expand our training set to include non-Express Entry clients who entered between 2012 and 2016, filed a tax return, and were admitted under one of four admission categories that would be managed by the Express Entry system: Federal Skilled Workers (A1111), Skilled Trades (A1120), Canadian Experience (A1130), and Provincial Nominees (A1300).

We restrict our training set by removing:

• Individuals who were selected by Quebec or landed in Quebec, given that Express Entry does not apply to this province.

• Individuals whose yearly income in the year after arrival exceeded the 99th percentile, to avoid overfitting to outliers.

• Accompanying children (records are kept only if LANDING_AGE > 18)

• Individuals who died in the year of arrival or in the following year

This set of training data corresponds to matrix A in the decision helper workflow.
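As a rough illustration, the four exclusion rules can be expressed as a single filter. This is a sketch under stated assumptions, not the authors' code: the record keys (`income`, `quebec`, `landing_age`, `died`) are hypothetical stand-ins for the administrative variables, the age condition is read as the rule for records that are kept, and the 99th percentile is computed with a nearest-rank rule over all input rows, which the text does not specify.

```python
import math
from typing import Dict, List


def percentile_nearest_rank(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def restrict_training_set(rows: List[Dict]) -> List[Dict]:
    """Apply the four exclusion rules to a list of (hypothetical) client records."""
    cutoff = percentile_nearest_rank([r["income"] for r in rows], 99)
    return [
        r for r in rows
        if not r["quebec"]           # selected by or landed in Quebec
        and r["income"] <= cutoff    # income above the 99th percentile is dropped
        and r["landing_age"] > 18    # accompanying children are dropped
        and not r["died"]            # died in the arrival year or the year after
    ]


# Toy data: 100 records with incomes 1..100, then three records flagged for removal.
rows = [{"income": float(i), "quebec": False, "landing_age": 30, "died": False}
        for i in range(1, 101)]
rows[0]["quebec"] = True
rows[1]["landing_age"] = 10
rows[2]["died"] = True
kept = restrict_training_set(rows)
print(len(kept))  # 96: one income outlier and three flagged records removed
```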


S3.1.2 Client Data

The client dataset consists of the Express Entry cohort who entered prior to 2017 (the final year available in the outcome data), along with their associated characteristics from the PNRF file. This represents the group of economic immigrants we consider in all our simulation results. A set of descriptive statistics for this subgroup is found in Table S2.

This set of prediction data corresponds to C in the general decision helper workflow.

S3.2 Modeling Decisions

Our workflow allows for a wide variety of potential models to be used in predicting outcomes. In this section, we describe the particular modeling decisions we made in the context of analyzing Canadian immigration data.

S3.2.1 Models

We used a supervised machine learning framework to fit and train our models. We use this class of models because, with proper model tuning, they can fit the training data flexibly while retaining high out-of-sample accuracy. While any number of supervised machine learning methods might be applicable, we chose to use gradient boosting machines due to their ability to automatically engage in feature selection and discover complex interactions between covariates.

We implement the modeling stage on a location-by-location basis. Specifically, for each economic region, we first subset the training data to those individuals who originally landed in that location, and fit the supervised learning model using individuals' background characteristics to predict their employment earnings. We model synergies using stochastic gradient boosted trees, which we run with a customized script using the gradient boosting machine (gbm) package within R.

In our implementation of gradient boosted trees, we used 10-fold cross-validation within the training data to select tuning parameter values, including the interaction depth (the maximum nodes per tree), bag fraction (the proportion of the training set considered at each tree expansion), learning rate (the size of each incremental step in the algorithm), and number of boosting iterations (number of trees considered).

To determine the best model, we first fix an interaction depth, bag fraction, and learning rate. For this fixed set of parameters, we then fit models over a sequence of boosting iterations (normally, 1 to 1,000 trees). For each model, we calculate the cross-validation root mean square error (RMSE), and choose the model which minimizes this error. To avoid potentially choosing a local minimum, if the best model is within 100 trees of the maximum number of trees we consider, we re-run this process after increasing the maximum number of trees considered by 500. We repeat this process as many times as necessary, and record the final tree count and RMSE for the fixed interaction depth, bag fraction, and learning rate.

We repeat the above process, tuning over different values of interaction depth, bag fraction, and learning rate. Finally, for each separate location model, we pick the set of parameters with the lowest cross-validation RMSE. The parameter values we consider are:

• Interaction Depth: 5-7

• Learning Rate: .1 and .01

• Bag Fraction: .5-.8 by .15

The set of final models, one for each location, corresponds to the set of L models in the general workflow.
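The tuning loop described above can be sketched as follows. This is a Python illustration of the search logic only, not the R script used in the analysis. The callable `make_cv_rmse` is a hypothetical stand-in for fitting a boosted model and returning its cross-validated RMSE at a given tree count, `toy_cv_rmse` is an invented error curve, `hard_cap` is a safety limit added here, and reading the grid as depths 5, 6, 7 and bag fractions 0.5, 0.65, 0.8 is an interpretation of the ranges listed.

```python
from itertools import product
from typing import Callable, Dict, Tuple


def best_tree_count(cv_rmse: Callable[[int], float], max_trees: int = 1000,
                    margin: int = 100, step: int = 500,
                    hard_cap: int = 20000) -> Tuple[int, float]:
    """Find the tree count with the lowest CV RMSE, extending the search
    by `step` trees whenever the minimum sits within `margin` of the limit."""
    scores: Dict[int, float] = {}
    while True:
        # Score only the tree counts that have not been evaluated yet.
        for n in range(len(scores) + 1, max_trees + 1):
            scores[n] = cv_rmse(n)
        best_n = min(scores, key=scores.get)
        # Stop once the minimum is at least `margin` trees below the limit
        # (hard_cap is an added safeguard against an unbounded search).
        if max_trees - best_n >= margin or max_trees >= hard_cap:
            return best_n, scores[best_n]
        max_trees += step


DEPTHS = [5, 6, 7]        # interaction depth
RATES = [0.1, 0.01]       # learning rate
BAGS = [0.5, 0.65, 0.8]   # bag fraction


def tune(make_cv_rmse):
    """Run the tree-count search for every parameter combination and keep the best."""
    results = {
        params: best_tree_count(make_cv_rmse(*params))
        for params in product(DEPTHS, RATES, BAGS)
    }
    best = min(results, key=lambda p: results[p][1])
    return best, results[best]


def toy_cv_rmse(depth, rate, bag):
    """Invented error curve: a parabola in the tree count with a known minimum."""
    optimum = 1200 if rate == 0.1 else 2300
    base = 30000 - 100 * depth - (200 if rate == 0.01 else 0) + 1000 * abs(bag - 0.65)
    return lambda n: base + 0.001 * (n - optimum) ** 2


best_params, (n_trees, rmse) = tune(toy_cv_rmse)
print(best_params, n_trees, rmse)
```

With the toy curve, a search capped at 1,000 trees would stop at the boundary; the extension rule is what lets it reach the true minimum at 1,200 or 2,300 trees.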

In order to investigate which covariates are the most predictive of income, we calculate a variable importance measure for each predictor in every separate tuned location model (see summary.gbm in the gbm package for details on how this statistic is calculated). We present these variable importance measures in Figure S1.

This figure demonstrates one of the advantages of fitting each location model separately: the importance measure of each covariate differs across models, demonstrating how the set of characteristics that lead to better or worse economic outcomes varies between ERs. Some overall trends emerge, with occupation and citizenship among the most influential covariates in every model. Whether or not a client had a previous temporary residence permit is also a highly influential predictor in certain Economic Regions, especially in Toronto. We further note that certain variables have little influence on predicted income across models, such as the language (French and English) indicators and landing year.

S3.2.2 Predicted Outcomes

We then apply each fitted model to the prediction set to estimate the income for new Express Entry clients should they select the economic region in question. For the prediction set, we remove any Express Entry client with the provincial nomination (A1300) category, as these clients do not have flexibility in choosing the initial landing location. This process is performed separately and independently for each location, which yields a vector of predicted income across possible economic regions for each individual within the prediction set. The final result is a matrix of predicted annual income with rows representing individual Express Entry clients and columns representing economic regions, corresponding to M in the general workflow.
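A minimal sketch of how such a matrix might be assembled is shown below. The per-location "models" are toy functions and the record keys (`category`, `years_education`) are hypothetical; in the actual analysis each model would be a tuned boosted-tree fit for one economic region.

```python
from typing import Callable, Dict, List, Tuple

Client = Dict[str, object]
Model = Callable[[Client], float]


def predict_matrix(clients: List[Client],
                   models: Dict[str, Model]) -> Tuple[List[str], List[List[float]]]:
    """Return column labels and a clients-by-locations matrix of predicted income."""
    locations = sorted(models)  # fixed column order
    # Provincial nominees (A1300) cannot choose their landing location, so drop them.
    eligible = [c for c in clients if c["category"] != "A1300"]
    matrix = [[models[loc](c) for loc in locations] for c in eligible]
    return locations, matrix


# Toy stand-ins for the fitted location models.
toy_models = {
    "Toronto": lambda c: 40000.0 + 1000.0 * c["years_education"],
    "Halifax": lambda c: 45000.0 + 500.0 * c["years_education"],
}
clients = [
    {"category": "A1111", "years_education": 16},
    {"category": "A1300", "years_education": 18},  # removed from the prediction set
    {"category": "A1130", "years_education": 12},
]
locations, M = predict_matrix(clients, toy_models)
print(locations)  # ['Halifax', 'Toronto']
print(M)          # [[53000.0, 56000.0], [51000.0, 52000.0]]
```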

In order to assess model fit, we compare predicted income within the principal applicants' actual location to their observed income in that location in the top panel of Figure S2. The bottom two panels show the histograms of predicted incomes and actual incomes, respectively. Overall, predictions are well calibrated, albeit slightly more conservative than observed income at the tails of the distribution.

In our implementation, the cross-validated $R^2$ for the tuned PA model is 0.54. Within the context of incomes, which are highly skewed and difficult to predict, this is relatively high. This represents a substantial improvement over the $R^2$ for an analogous linear regression model using cross-validation (0.34). The RMSE for the tuned model is 29,486, as compared to an observed mean income of 58,000 and a standard deviation of 41,600.

S3.3 Approximating Locational Preferences

While a prospective use case would use a survey to restrict locations to a set that aligns with an individual's underlying preferences, we were unable to engage in this exercise in our backtests. Therefore, we use the administrative data and existing migration patterns to estimate preferences. We explain the details of this methodology in this section.

S3.3.1 Underlying Preferences

Upon entering Canada, clients plausibly have a series of idiosyncratic locational preferences related to geographic location, climate, local demographics, labor markets, and other potential factors. We approximated individual locational preferences by examining how Express Entry clients with different background characteristics varied in terms of their original landing locations. Using the landing Economic Region as the dependent variable, we fit a multinomial logit model on Express Entry clients and a random subset of 20% of the non-Express Entry economic immigrants. Given that these preferences are proxies, we use a coarse set of covariates including education, birth region, age, immigration category, case size, and indicators for work and study permits. These predictions approximate the vector of utilities u in the workflow.

S3.3.2 Restricted Set of Locations

After obtaining predictions for each Express Entry case, we rank order locations by predicted preferences, randomly breaking ties. The resulting ranks are used in conjunction with the parameter $\varphi$ (the number of acceptable locations we consider) to determine the initial set of locations considered when selecting the locations with the highest predicted income. This set of $\varphi$ acceptable locations represents subset $S_i$ in the workflow.
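The interaction between the preference ranking and the income ranking can be sketched as below. The function names, the preference scores, and the income values are all hypothetical; the shuffle-then-stable-sort idiom is one simple way to break ties at random and is an assumption about implementation, not a description of the authors' code.

```python
import random
from typing import Dict, List


def acceptable_set(pref_scores: Dict[str, float], phi: int, rng: random.Random) -> List[str]:
    """Return the phi locations with the highest predicted preference."""
    locations = list(pref_scores)
    # Shuffle first, then stable-sort by score: locations with equal
    # scores end up in random relative order.
    rng.shuffle(locations)
    locations.sort(key=lambda k: pref_scores[k], reverse=True)
    return locations[:phi]


def best_location(pred_income: Dict[str, float], pref_scores: Dict[str, float],
                  phi: int, rng: random.Random) -> str:
    """Pick the highest predicted income among the phi acceptable locations."""
    s_i = acceptable_set(pref_scores, phi, rng)
    return max(s_i, key=lambda k: pred_income[k])


# Hypothetical preference probabilities and predicted incomes for one case.
prefs = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}
income = {"A": 50000.0, "B": 70000.0, "C": 90000.0, "D": 100000.0}
rng = random.Random(0)
for phi in (2, 3, 4):
    print(phi, best_location(income, prefs, phi, rng))
```

Widening the acceptable set from 2 to 4 locations moves the recommendation from B to C to D in this toy case, which mirrors the trade-off between respecting preferences and maximizing predicted income.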


S4 Robustness Checks

Below we discuss several robustness checks to our study. In particular, we outline how our backtest results change when considering 1) cost-of-living adjustments, 2) maximizing principal applicant plus spouse income, and 3) alternative simulation specifications.

S4.1 Cost-of-living Adjustments

An important consideration influencing relative quality of life across landing locations relates to living costs. To take this factor into account during the recommendation process, we ran alternate simulations where we define outcomes as total income less estimated yearly rental costs of a two-bedroom apartment. To our knowledge, rental prices are the most granular cost index currently available across small geographic regions. The estimates we pull are from the Canadian Rental Housing Index, a public index compiled by the BC Non-Profit Housing Association and based on the 2016 Census. The index reports the average rental price for a single-family apartment, by Census Subdivision (CSD).

In Figure S3, we replicate the results in our main paper, demonstrating average gains in employment under various simulation parameters in the top panel and visualizing expected movement patterns in the bottom panel.

S4.2 Principal Applicant and Spouse Model

While our main paper reports our findings for principal applicants only, we additionally fit a set of models that consider both principal applicants and their accompanying partners. As a simplifying assumption, we assume that both individuals in a case have similar (joint) locational preferences, and derive these preferences from a PA-only model. The income predictions, however, take into account the joint income of the PA and partner divided by the number of adults in the family unit (average family income). We then estimate the models assuming a family unit will move jointly. In Figure S4, we replicate the results in the main paper using this approach, with similar results.

S4.3 Alternative Simulation Specifications

Our simulations vary two parameters: the number of acceptable locations considered ($\varphi$) and the compliance rate ($\pi$). In this section, we consider the impact of further varying these parameters on our core results.

S4.3.1 No Locational Preferences

In our main analysis, we infer regional preferences by analyzing existing residential patterns and then using these estimated preferences to restrict the choice set in our simulations. However, expected income gains are maximized when no locational preferences are taken into account. In Figure S5, we show simulated movement patterns with no locational preferences under different compliance rates. Relative to the case presented in the main paper, the results similarly suggest that the majority of outflow is from the four largest locations, but display increased recommendations to smaller locations.

S4.3.2 Constant Compliance Rate

In the body of the paper, we present results where we vary the compliance rate $\pi$ as a function of income. Specifically, we assign an upper bound $\pi_{max}$ to the individuals with the lowest actual income and then linearly interpolate down to the value $\pi = 0$ across the entire income distribution. Thus, each individual receives a heterogeneous compliance parameter $\pi_i$, and the average compliance rate in a particular simulation run is reported as $\pi_{max}/2$.
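One way to realize this scheme is sketched below. It assumes the interpolation is linear in income rank rather than in income level; that reading is an assumption, chosen because it makes the average compliance rate exactly $\pi_{max}/2$, as reported. The function name and the incomes are hypothetical.

```python
from typing import List


def compliance_rates(incomes: List[float], pi_max: float) -> List[float]:
    """Assign pi_max to the lowest income and 0 to the highest, linear in income rank."""
    n = len(incomes)
    if n < 2:
        raise ValueError("need at least two individuals to interpolate")
    # Indices of individuals ordered from lowest to highest income.
    order = sorted(range(n), key=lambda i: incomes[i])
    rates = [0.0] * n
    for rank, i in enumerate(order):
        rates[i] = pi_max * (1 - rank / (n - 1))
    return rates


incomes = [30000.0, 90000.0, 50000.0, 70000.0, 10000.0]
rates = compliance_rates(incomes, pi_max=0.3)
print(rates)
print(sum(rates) / len(rates))  # approximately pi_max / 2 = 0.15
```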

To ensure that results are not driven by this modeling decision, we repeat our simulations with a constant compliance rate in an individual simulation run. In each of these tests, we set a single $\pi$ that represents each individual's likelihood of complying, which is constant across income. We present these results in Figure S6, which reveals very similar potential average gains in income across each simulation.

S4.3.3 Removing Economic Regions

In order to evaluate whether the gains we report in the main analysis of our paper are being driven by a specific subset of ERs, we rerun the simulations exactly as described in the body of the paper but remove from consideration certain subsets of landing locations. That is, if an individual 'complies' with probability $\pi_i$, we limit the set of locations they can potentially move to in the simulation.

We begin by specifying three alternative models, in each case excluding a subset of ERs that could potentially drive our results. In the first alternative model, we do not allow individuals to move to the largest ERs, those with a population greater than 1,500,000 according to the 2016 census. In the second, we extend this to include large and growing ERs, defined as all ERs with a population greater than 1,000,000 and a growth rate in population from the 2011 to 2016 census above 5%. In the third, we remove the smallest ERs, those with a population less than 100,000 in the 2016 census. A full list of removed ERs in each specification is given in Table S3.

The results of these alternative model runs are found in Figure S7. Overall, these plots demonstrate that the gains we find in our main analysis are not driven by one of the ER subsets we define above. In the top-left panel, we present results from a simulation considering "All ERs," effectively replicating our main results in the body of our paper. In each of the three alternative specifications, we see that average gains across each compliance and preference parameter do not substantially differ from this baseline.


Another way we check against the impact of a single ER on our core results is by running a "leave-one-out" robustness check. In this test, we run a series of 52 simulations, in each case dropping one of the 52 total ERs from consideration. Other than removing this single ER from the choice set, the simulations are run exactly as described in the body of the paper. For each simulation, we calculate the mean gain in annual income. We present the average of these gains and the 95% confidence interval across the 52 simulations in Figure S8. We again see little change to our core results, demonstrating that no single ER drives the average gain in income in our simulations.
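The structure of the leave-one-out check can be sketched as follows. The `simulate` callable is a hypothetical stand-in for one full backtest run restricted to a set of allowed ERs, the toy contribution values are invented, and the normal-approximation interval (mean ± 1.96 standard errors) is an assumption, since the text does not say how the 95% interval was computed.

```python
import statistics
from typing import Callable, FrozenSet, List, Tuple


def leave_one_out(ers: List[str],
                  simulate: Callable[[FrozenSet[str]], float]) -> Tuple[float, float, float]:
    """Run one simulation per dropped ER; return the mean gain and an approximate 95% interval."""
    all_ers = frozenset(ers)
    gains = [simulate(all_ers - {dropped}) for dropped in ers]
    mean = statistics.mean(gains)
    # Normal-approximation interval across simulations (an assumption of this sketch).
    half_width = 1.96 * statistics.stdev(gains) / len(gains) ** 0.5
    return mean, mean - half_width, mean + half_width


# Toy simulation: the mean gain is the sum of invented per-ER contributions.
contribution = {"A": 100, "B": 50, "C": 0, "D": 50}


def toy_simulate(allowed: FrozenSet[str]) -> float:
    return sum(contribution[er] for er in allowed)


mean_gain, low, high = leave_one_out(list(contribution), toy_simulate)
print(mean_gain, low, high)
```

If a single ER were driving the results, the run that drops it would stand out as a low outlier and widen the interval.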


S5 Figures

[Figure S1 image omitted. One panel per Economic Region; x-axis: Relative Variable Importance (0 to 40); y-axis: predictors (Speaks English, Landing Year, Speaks French, Prefiler, Express Entry, Landing Month, Family Size, ER Unemployment, Immig. Category, Education, Birth Region, ER Population, Gender, Age, Skill Level, Citizenship, Occupation, Temp. Resid. Permit).]

Figure S1: Variable Importance Statistics for Tuned Location Models (Principal Applicants)

[Figure S2 image omitted. Top panel: Outcome (CAD) against Prediction (CAD), both 0 to 200,000. Middle panel: histogram of Prediction (CAD), y-axis Freq. Bottom panel: histogram of Actual (CAD), y-axis Freq.]

Figure S2: Calibration Plot: Model Fit for Express Entry Clients

[Figure S3 image omitted. Top panel: Average Gain in Annual Income (CAD) against Percent of EE cases who follow recommendation; series: ERs Considered (Top 25, Top 15, Top 10). Bottom panels: Number of EE Cases by Economic Region; recovered panel titles: "Follow Recommendation: 15%, ERs Considered: All" and "Follow Recommendation: 10%, ERs Considered: Top 25".]

Figure S3: Estimated Average Income Gains and Shifts in Arrival Locations with CPI Adjustments

[Figure S4 image omitted. Top panel: Average Gain in Annual Income (CAD) against Percent of EE cases who follow recommendation; series: ERs Considered (Top 25, Top 15, Top 10). Bottom panels: Number of EE Cases by Economic Region.]

Figure S4: Estimated Average Income Gains and Shifts in Arrival Locations for Principal Applicant and Spouse Model
[Figure S5 image omitted. Panels: Number of EE Cases by Economic Region.]

Figure S5: Movement Under Various Simulation Parameters

[Figure S6 image omitted. Average Gain in Annual Adjusted Income (CAD) against Percent of Economic Migrants who follow recommendation; series: ERs Considered (25, 15, 10).]

Figure S6: Constant Compliance Rate: Estimated Average Income Gains. N=17,640

[Figure S7 image omitted. Four panels: All ERs Considered; No Movement to Smaller ERs; No Movement to Largest ERs; No Movement to Large and Growing ERs. y-axis: Average Gain in Annual Adjusted Income (CAD); series: ERs Considered (Top 25, Top 15, Top 10).]

Figure S7: Removing Subsets of ERs. N=17,640

[Figure S8 image omitted. Panel title: Range of Mean Gains: Leave-One-Out. y-axis: Average Gain in Annual Adjusted Income (CAD).]

Figure S8: Average Gains Across 52 Leave-One-Out Simulations. N=17,640

S6 Tables


| Original Variable Name | Notes | Source | Description |
|---|---|---|---|
| COUNTRY_BIRTH | 1, 4 | PNRF | |
| COUNTRY_CITIZENSHIP | 1, 4 | PNRF | |
| CSQ_IND | 2 | PNRF | Quebec program indicator |
| DEATH_INDICATOR | 2 | PNRF | |
| DESTINATION_ER | 1, 2, 3 | PNRF | Intended Economic Region (ER) of landing |
| EDUCATION_QUALIFICATION | 1 | PNRF | |
| EXPRESS_ENTRY_IND | 1, 2 | PNRF | Express Entry flag |
| FAMILY_STATUS | 1, 2 | PNRF | |
| GENDER | 1 | PNRF | |
| IMMIGRATION_CATEGORY_CENSUS | 1, 2 | PNRF | Admission category |
| LANDING_AGE | 1 | PNRF | |
| LANDING_MONTH | 1 | PNRF | |
| LANDING_YEAR | 1, 2 | PNRF | |
| LEVEL_OF_EDUCATION | 1 | PNRF | |
| NOC3_CD11 | 1 | PNRF | 3-digit expected occupation code |
| NUMBER_STUDY_PERMITS | 1 | PNRF | |
| NUMBER_WORK_PERMITS | 1 | PNRF | |
| OFFICIAL_LANGUAGE | 1 | PNRF | |
| SKILL_LEVEL_CD11 | 1 | PNRF | |
| EI___I | 1 | Tax | Individual employment income (excluding self-employment) |
| PREFILER_IND | 1 | Tax | Whether an individual filed a return on a TR permit |
| POPULATION_ER | 1 | LFS | Quarterly population |
| PRICE_INDEX | 3 | External | Average yearly rental price |
| UNEMPLOYMENT_ER | 1 | LFS | Quarterly unemployment |

Notes:
1 = Variable used to train machine learning models
2 = Variable used to subset the data
3 = Variable used to adjust final predictions
4 = Variable aggregated to the continent level for modeling

Table S1: Variable Names


| Variable | Mean | SD | Min | Max |
|---|---|---|---|---|
| Annual Income per Head (CAD) | 49900.00 | 33600.00 | 0 | 246600 |
| Age | 33.15 | 5.99 | 22 | 65 |
| Birth Region: The Americas | 0.09 | 0.29 | 0 | 1 |
| Birth Region: Europe | 0.24 | 0.43 | 0 | 1 |
| Birth Region: Africa | 0.06 | 0.23 | 0 | 1 |
| Birth Region: Asia | 0.59 | 0.49 | 0 | 1 |
| Birth Region: Oceania | 0.02 | 0.14 | 0 | 1 |
| Citizenship: United States | 0.03 | 0.17 | 0 | 1 |
| Citizenship: Mexico | 0.02 | 0.13 | 0 | 1 |
| Citizenship: Jamaica | 0.01 | 0.11 | 0 | 1 |
| Citizenship: Brazil | 0.01 | 0.11 | 0 | 1 |
| Citizenship: France | 0.03 | 0.16 | 0 | 1 |
| Citizenship: Germany | 0.01 | 0.11 | 0 | 1 |
| Citizenship: Poland | 0.01 | 0.10 | 0 | 1 |
| Citizenship: Russia | 0.01 | 0.10 | 0 | 1 |
| Citizenship: Ukraine | 0.01 | 0.12 | 0 | 1 |
| Citizenship: Ireland | 0.05 | 0.21 | 0 | 1 |
| Citizenship: United Kingdom | 0.06 | 0.24 | 0 | 1 |
| Citizenship: Nigeria | 0.02 | 0.15 | 0 | 1 |
| Citizenship: South Africa | 0.01 | 0.10 | 0 | 1 |
| Citizenship: Iran | 0.01 | 0.09 | 0 | 1 |
| Citizenship: China | 0.04 | 0.20 | 0 | 1 |
| Citizenship: South Korea | 0.03 | 0.16 | 0 | 1 |
| Citizenship: Philippines | 0.14 | 0.35 | 0 | 1 |
| Citizenship: India | 0.28 | 0.45 | 0 | 1 |
| Citizenship: Pakistan | 0.02 | 0.12 | 0 | 1 |
| Citizenship: Australia | 0.02 | 0.14 | 0 | 1 |
| Citizenship: Other | 0.18 | 0.38 | 0 | 1 |
| Education: Less than BA | 0.41 | 0.49 | 0 | 1 |
| Education: BA | 0.27 | 0.44 | 0 | 1 |
| Education: MA | 0.28 | 0.45 | 0 | 1 |
| Education: PhD | 0.03 | 0.18 | 0 | 1 |
| Male | 0.67 | 0.47 | 0 | 1 |
| Female | 0.33 | 0.47 | 0 | 1 |
| Unit Size | 1.43 | 0.50 | 1 | 3 |
| Landing Year: 2015 | 0.29 | 0.45 | 0 | 1 |
| Landing Year: 2016 | 0.71 | 0.45 | 0 | 1 |
| Landing Month (1-12) | 7.40 | 3.40 | 1 | 12 |
| Category: Skilled Worker program | 0.46 | 0.50 | 0 | 1 |
| Category: Skilled Trades program | 0.09 | 0.29 | 0 | 1 |
| Category: Canadian Experience Class | 0.45 | 0.50 | 0 | 1 |
| Industry: ArtCultureSport | 0.04 | 0.20 | 0 | 1 |
| Industry: Computer | 0.20 | 0.40 | 0 | 1 |
| Industry: Education_Law_Govt | 0.07 | 0.25 | 0 | 1 |
| Industry: Extraction | 0.00 | 0.03 | 0 | 1 |
| Industry: Finance | 0.09 | 0.29 | 0 | 1 |
| Industry: FoodTourism | 0.12 | 0.33 | 0 | 1 |
| Industry: Health | 0.04 | 0.21 | 0 | 1 |
| Industry: Management_Misc | 0.03 | 0.16 | 0 | 1 |
| Industry: ManualTrades | 0.10 | 0.31 | 0 | 1 |
| Industry: Manufacturing | 0.01 | 0.08 | 0 | 1 |
| Industry: NatResources_9980 | 0.01 | 0.10 | 0 | 1 |
| Industry: No_Info | 0.01 | 0.08 | 0 | 1 |
| Industry: Sales | 0.05 | 0.22 | 0 | 1 |
| Industry: Services | 0.18 | 0.38 | 0 | 1 |
| Industry: SocialServices | 0.01 | 0.12 | 0 | 1 |
| Industry: Technical | 0.03 | 0.18 | 0 | 1 |
| Skill Level: Managerial | 0.10 | 0.30 | 0 | 1 |
| Skill Level: Professionals | 0.35 | 0.48 | 0 | 1 |
| Skill Level: Skilled and Technical | 0.55 | 0.50 | 0 | 1 |
| English: No | 0.02 | 0.14 | 0 | 1 |
| English: Yes | 0.98 | 0.14 | 0 | 1 |
| French: No | 0.97 | 0.16 | 0 | 1 |
| French: Yes | 0.03 | 0.16 | 0 | 1 |
| Prefiler: No | 0.13 | 0.34 | 0 | 1 |
| Prefiler: Yes | 0.87 | 0.34 | 0 | 1 |
| TR: No TR | 0.12 | 0.32 | 0 | 1 |
| TR: Study | 0.01 | 0.09 | 0 | 1 |
| TR: Study+Work | 0.31 | 0.46 | 0 | 1 |
| TR: Work | 0.57 | 0.5 | 0 | 1 |

Table S2: Descriptive Statistics: Express Entry Principal Applicants


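| ER_CODE | Name | Large | Rapidly Growing | Small |
|---|---|---|---|---|
| 3530 | Toronto | X | X | |
| 5920 | Lower Mainland | X | X | |
| 4830 | Calgary | | X | |
| 4860 | Edmonton | | X | |
| 3540 | Kitchener | | X | |
| 4660 | Interlake | | | X |
| 4680 | Parklands and North | | | X |
| 4840 | Banff and Athabasca | | | X |
| 4740 | Yorkton | | | X |
| 1350 | Edmundston | | | X |
| 5980 | N.E. (B.C.) | | | X |
| 4620 | S. Central and N. Central | | | X |
| 5960 | North Coast and Nechako | | | X |
| 4640 | S. Central and N. Central | | | X |
| 6110 | N.W. Territories | | | X |
| 4670 | Parklands and North | | | X |
| 5970 | North Coast and Nechako | | | X |
| 4760 | Prince Albert and Northern | | | X |
| 1020 | S. Coast and Notre Dame | | | X |
| 6010 | Yukon | | | X |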

Table S3: Removing Economic Regions Robustness Check
