Loss Functions for Detecting Outliers in Panel Data

Post on 30-Dec-2015


Charles D. Coleman, Thomas Bryan, and Jason E. Devine, U.S. Census Bureau. Prepared for the Spring 2000 meetings of the Federal-State Cooperative Program for Population Estimates, Los Angeles, CA, March 2000.


Panel Data

A.k.a. “longitudinal data.”

$x_{it}$:

– $i$ indexes cross-sectional units, which retain their identities over time. Examples: geographic areas, persons, households, companies, autos.

– $t$ indexes time.

– Time can be chronological or nominal.

– Chronological time measures the time elapsed between two dates.

– Nominal time indexes different sets of estimates; it can also index true values.

Notation

• $B_i$ is the base value for unit $i$.

• $F_i$ is the “future” value for unit $i$.

• $F_{it}$ is the future value for unit $i$ at time $t$.

• $B_i, F_i, F_{it} > 0$.

• $\Delta_i = |F_i - B_i|$ is the absolute difference for unit $i$.

• Subscripts will be dropped when not needed.

What is an Outlier?

“[An outlier is] an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”

D.M. Hawkins, Identification of Outliers, 1980, p. 1.

Meaning of an Outlier

• Either

– Indication of a problem with the data generation process.

• Or

– A true, but unusual, statement about reality.

Loss Functions

• Motivations: The $\Delta_i$ come from unknown distributions. We want to compare multiple size classes on the same basis.

• $L(F_i; B_i) \equiv \ell(\Delta_i, B_i)$ is the loss function for observation $i$.

• Loss functions measure “badness.”

• Loss functions produce rankings of observations to be examined.

• Loss functions are empirically based, except for one special case in nominal time.

Assumption 1

Loss is symmetric in error:

$L(B + \Delta; B) = L(B - \Delta; B)$

Assumption 2

Loss increases in difference:

$\partial \ell / \partial \Delta > 0$

Assumption 3

Loss decreases in base value:

$\partial \ell / \partial B < 0$

Property 1

Loss associated with a given absolute percentage difference ($|\Delta / B|$) increases in $B$.

Simplest Loss Function

$L(F; B) = |F - B| B^q$ (1a)

or

$\ell(\Delta, B) = \Delta B^q$ (1b)

with

$0 > q > -1$.
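As a minimal sketch, equation (1a) can be computed directly. The function name `loss`, the default q = -0.5, and the sample values are illustrative assumptions, not part of the presentation. Writing the difference as a fixed proportion p of the base, the loss is p·B^(1+q), which rises with B whenever q > -1; this is Property 1.

```python
def loss(f, b, q=-0.5):
    """Equation (1a): L(F;B) = |F - B| * B**q, for B > 0 and -1 < q < 0."""
    if b <= 0:
        raise ValueError("base value B must be positive")
    if not -1 < q < 0:
        raise ValueError("q must satisfy -1 < q < 0")
    return abs(f - b) * b ** q


# Same 10% difference at two base sizes (hypothetical values):
# the larger unit receives the larger loss (Property 1).
small = loss(110, 100)
large = loss(11_000, 10_000)

# Same absolute difference of 10 at a larger base:
# the loss is smaller (Assumption 3).
same_gap_large_base = loss(10_010, 10_000)
```

The symmetric treatment of over- and under-shooting (Assumption 1) comes from the absolute value: `loss(90, 100)` equals `loss(110, 100)`.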

Loss as Weighted Combination of Absolute Difference and Absolute Percentage Difference

$\tilde{L}(F; B) = |F - B|^r \left| \frac{F - B}{B} \right|^s$

• This generates a loss function with $q = -s/(r + s)$.

• An infinite number of pairs $(r, s)$ correspond to any given $q$.
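The link between $(r, s)$ and $q$ can be checked in one step. This sketch assumes $B > 0$ and $r + s > 0$; the second condition is an assumption added here so that the root below is an increasing transformation and leaves the ranking of observations unchanged.

```latex
\tilde{L}(F;B) = |F-B|^{r}\left|\frac{F-B}{B}\right|^{s}
             = |F-B|^{r+s}\, B^{-s},
\qquad
\tilde{L}(F;B)^{1/(r+s)} = |F-B|\, B^{-s/(r+s)} = |F-B|\, B^{q}.
```

Only the ratio $s/(r+s)$ enters, so multiplying $r$ and $s$ by a common positive factor leaves $q$ unchanged, and $r, s > 0$ gives $0 > q > -1$.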

Outlier Criterion

• Outlier declared whenever $L(F; B) \equiv \ell(\Delta, B) > C$.

• C is “critical value.”

• C can be determined in advance, or as function of data (e.g., quantile or multiple of scale measure).
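The criterion can be sketched with both kinds of critical value. The function names, the sample data, and the rule "3 times the median loss" are hypothetical choices made for illustration; the presentation names quantiles and multiples of a scale measure only as examples.

```python
import statistics


def losses(pairs, q=-0.5):
    """Loss |F - B| * B**q for each (F, B) pair, with B > 0."""
    return [abs(f - b) * b ** q for f, b in pairs]


def flag_outliers(pairs, c, q=-0.5):
    """Indices of observations whose loss exceeds the critical value c."""
    return [i for i, value in enumerate(losses(pairs, q)) if value > c]


# Hypothetical (F, B) pairs.
data = [(105, 100), (2_600, 2_500), (10_100, 10_000), (48_000, 40_000)]

# C determined in advance.
fixed = flag_outliers(data, c=5.0)

# C as a function of the data: 3 times the median loss (an assumed rule).
c_data = 3 * statistics.median(losses(data))
data_driven = flag_outliers(data, c=c_data)
```

Both rules flag only the last pair here, whose loss is far above the others.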

Loss Function Variants

• Time-Invariant Loss Function

• Signed Loss Function

• Nominal Time

Time-Invariant Loss Function

• Idea: Compare multiple dates of data on same basis.

• Time need not be round number.

• $L(F_{it}; B_i, t) = |F_{it} - B_i| B_i^{tq}$

• Property 1 is satisfied as long as $t < -1/q$.

• Thus, useful horizon is limited.
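A short sketch of the horizon limit follows. The function names and sample values are assumptions for illustration. With a fixed percentage difference p, the loss is p·B^(1+tq), so it increases in B only while 1 + tq > 0.

```python
def time_invariant_loss(f_t, b, t, q=-0.5):
    """L(F_it; B_i, t) = |F_it - B_i| * B_i**(t*q), for B_i > 0 and -1 < q < 0."""
    return abs(f_t - b) * b ** (t * q)


def horizon_limit(q):
    """Exclusive upper bound on t for Property 1: t < -1/q."""
    return -1.0 / q


# With q = -0.5 the bound is t < 2.
limit = horizon_limit(-0.5)

# Same 10% difference at two base sizes (hypothetical values).
inside = (time_invariant_loss(110, 100, t=1.5),
          time_invariant_loss(11_000, 10_000, t=1.5))
beyond = (time_invariant_loss(110, 100, t=3),
          time_invariant_loss(11_000, 10_000, t=3))
```

Inside the horizon the larger unit has the larger loss; beyond it the ordering reverses, which is why the useful horizon is limited.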

Signed Loss Function

• Idea: Account for direction and magnitude of loss.

$S(F; B) = (F - B) B^q$

• Can use asymmetric critical values and “q”s:

– Declare outliers whenever

$S_+(F; B) = (F - B) B^{q_+} > C_+$

or

$S_-(F; B) = (F - B) B^{q_-} < C_-$

with $C_+ \neq -C_-$, $q_+ \neq q_-$.
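The asymmetric rule can be sketched as follows. The function names and the particular exponents and critical values are hypothetical, chosen only so that the two directions are treated differently.

```python
def signed_loss(f, b, q):
    """S(F;B) = (F - B) * B**q, for B > 0."""
    return (f - b) * b ** q


def is_signed_outlier(f, b, q_plus, c_plus, q_minus, c_minus):
    """Outlier if S+ > C+ (too high) or S- < C- (too low), with C+ > 0 > C-."""
    too_high = signed_loss(f, b, q_plus) > c_plus
    too_low = signed_loss(f, b, q_minus) < c_minus
    return too_high or too_low


# Hypothetical settings: decreases are judged more strictly than increases.
settings = dict(q_plus=-0.5, c_plus=5.0, q_minus=-0.4, c_minus=-3.0)

up_40 = is_signed_outlier(140, 100, **settings)    # S+ = 4.0, not above 5.0
down_40 = is_signed_outlier(60, 100, **settings)   # S- is about -6.3, below -3.0
```

A rise and a fall of the same size from the same base can therefore receive different verdicts.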

Nominal Time

• Compare two sets of estimates; one set can be the actual values, $A_i$.

• Assumptions:

– Unbiased: $E B_i = E F_i = A_i$.

– Proportionate variance: $\mathrm{Var}(B_i) = \mathrm{Var}(F_i) = \sigma^2 A_i$.

• $q = -1/2$.

• Either set of estimates can be used for $B_i$, $F_i$.

– Exception: $A_i$ can only be substituted for $B_i$.
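One way to see why $q = -1/2$ arises is sketched below. It additionally assumes that $B_i$ and $F_i$ are independent, which the presentation does not state, and it is not necessarily the authors' own derivation.

```latex
\mathrm{Var}(F_i - B_i) = \mathrm{Var}(F_i) + \mathrm{Var}(B_i) = 2\sigma^2 A_i
\quad\Longrightarrow\quad
\mathrm{Var}\!\left( (F_i - B_i)\, A_i^{-1/2} \right) = 2\sigma^2 .
```

The scaled difference then has the same variance for every unit, and replacing the unknown $A_i$ with $B_i$ gives the loss $|F_i - B_i| B_i^{-1/2}$.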

How to Use: No Preexisting Outlier Criteria

• Start with $q = -0.5$.

– Adjust by increments of 0.1 to get a “good” distribution of outliers.

• Alternative: Start with $q = \log(\text{range})/25 - 1$, where range is the range of the data (Bryan, 1999).

– Can adjust.
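The alternative starting value can be sketched as below. Two points are assumptions: "range" is taken to mean maximum minus minimum, and the base of the logarithm is not stated in the presentation, so the hypothetical function `starting_q` makes the caller choose it explicitly.

```python
import math


def starting_q(values, log_base):
    """Starting value q = log(range)/25 - 1, with range = max - min of the data.

    The base of the logarithm is not given in the source, so it is a
    required argument here rather than a default (assumption).
    """
    data_range = max(values) - min(values)
    if data_range <= 0:
        raise ValueError("data must span a positive range")
    return math.log(data_range, log_base) / 25 - 1


# Hypothetical data spanning 0 to 1,000,000, with base 10 chosen by the caller.
q0 = starting_q([0, 250_000, 1_000_000], log_base=10)
```

The result is only a starting point; it can then be adjusted in the same way as the default of -0.5.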

How to Use: Preexisting Discrete Outlier Criteria

• Start with a schedule of critical pairs $(\Delta_j, B_j)$.

– These pairs (approximately) satisfy the equation $\Delta B^q = C$ for some $q$ and $C$. They are the cutoffs between outliers and nonoutliers.

• Run the regression $\log \Delta_j = -q \log B_j + K$.

• Then, $C = e^K$.
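The regression step can be sketched with ordinary least squares on the logged pairs. The function name `fit_q_and_c` and the schedule below are hypothetical; the schedule was constructed to lie exactly on $\Delta B^{-0.5} = 2$ so the recovered values are easy to check.

```python
import math


def fit_q_and_c(critical_pairs):
    """Least-squares fit of log(delta) = -q*log(B) + K.

    critical_pairs is a list of (delta_j, B_j) cutoffs, all positive.
    Returns (q, C) with C = exp(K).
    """
    xs = [math.log(b) for _, b in critical_pairs]
    ys = [math.log(d) for d, _ in critical_pairs]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # estimates -q
    intercept = y_bar - slope * x_bar  # estimates K
    return -slope, math.exp(intercept)


# Hypothetical schedule of (delta_j, B_j) cutoffs.
schedule = [(20.0, 100.0), (200.0, 10_000.0), (2_000.0, 1_000_000.0)]
q, c = fit_q_and_c(schedule)
```

With a real schedule the pairs satisfy the equation only approximately, so the fitted q and C summarize the existing criteria rather than reproduce them exactly.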

Loss Functions and GIS

• Loss functions can be used with GIS to focus analyst’s attention on problem areas.

• Maps compare tax method county population estimates to unconstrained housing unit method estimates.

• q = –0.5 in loss function map.

[Map 1: Absolute Differences between the Two Sets of Population Estimates. Legend (persons): 0 - 5000, 5000 - 25000, 25000 - 50000, Over 50000, No Data. Note: The tax method estimates are the base.]

[Map 2: Absolute Percent Differences between the Two Sets of Population Estimates. Legend (percent): 0 - 5, 5 - 10, 10 - 20, Above 20, No Data. Note: The tax method estimates are the base.]

[Map 3: Loss Function Values. Legend (loss): 0 - 1000, 1000 - 2000, 2000 - 4000, Above 4000, No Data. Note: The tax method estimates are the base.]

Outliers Classified by Another Variable

• $D_i$ is a function of 2 successive observations.

• $R_i$ is the “reference” variable, used to classify outliers.

• Start with a schedule of critical pairs $(D_j, R_j)$.

• Run the regression $\log D_j = a - b \log R_j$.

• Then, $L(D, R) = D R^b$ and $C = e^a$.

What to Do with Negative Data

• From Coleman and Bryan (2000):

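$$L(F, B) = \begin{cases} |F - B| (|F| + |B|)^q, & B \neq 0 \text{ or } F \neq 0, \\ 0, & B = F = 0. \end{cases}$$

$$S(F, B) = \begin{cases} (F - B) (|F| + |B|)^q, & B \neq 0 \text{ or } F \neq 0, \\ 0, & B = F = 0. \end{cases}$$

• $0 > q > -1$. Suggest $q \approx -0.5$.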
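A minimal sketch of the two functions for data that may be zero or negative is given below. The function names and sample values are illustrative assumptions; the formulas follow the definitions attributed to Coleman and Bryan (2000) on this slide.

```python
def loss_any_sign(f, b, q=-0.5):
    """L(F,B) = |F - B| * (|F| + |B|)**q, defined as 0 when B = F = 0."""
    scale = abs(f) + abs(b)
    if scale == 0:
        return 0.0
    return abs(f - b) * scale ** q


def signed_loss_any_sign(f, b, q=-0.5):
    """S(F,B) = (F - B) * (|F| + |B|)**q, defined as 0 when B = F = 0."""
    scale = abs(f) + abs(b)
    if scale == 0:
        return 0.0
    return (f - b) * scale ** q


# Hypothetical values, including a sign change and a zero base.
sign_change = loss_any_sign(-50, 50)          # 100 * 100**-0.5
signed_change = signed_loss_any_sign(-50, 50)
zero_base = loss_any_sign(25, 0)              # 25 * 25**-0.5
both_zero = loss_any_sign(0, 0)
```

Using |F| + |B| as the size measure keeps the power well defined whenever at least one of the two values is nonzero.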

Summary

• Defined panel data.

• Defined outliers.

• Created several types of loss functions to detect outliers in panel data.

• Loss functions are empirical (except for nominal time).

• Showed several applications, including GIS.

URL for Presentation

http://chuckcoleman.home.dhs.org/fscpela.ppt