Assessment of Misclassification Error in Stratification Due to
Incomplete Frame Information
Donsig Jang, Xiaojing Lin, Amang Sukasih
Mathematica Policy Research, Inc.
Steve Cohen, Kelly Kang
National Science Foundation
ITSEW 2008
Research Triangle Park, NC, June 2, 2008
Disclaimer
The opinions and assertions are those of the authors and do not reflect the views or policies of the National Science Foundation
Survey Data Collection
Involves many complex processes, including
– Sampling frame construction
– Sample selection
– Data collection
– Data processing
– Estimation
Each process is subject to error
Attempt to decompose the total survey error into the separate stages of the process
Total Survey Errors
[Diagram: survey stages (Parameter, Sampling Frame, Sample, Respondent, Data, Estimator) annotated with the error sources below]
– Misclassification error
– Coverage error
– Sampling error
– Nonresponse error
– Measurement error
– Estimation error
Misclassification Error in Stratification
Focus of this talk
A part of non-sampling error
Important but often overlooked component
Stratification in Sampling
Enhance precision of survey estimates
Precision requirements for analytic domains
Often imperfect information on stratification variables
– Misclassification in stratification
– Trade-off: cost to gather stratification information at frame construction vs. optimal sample allocation
– Loss of effective sample sizes for some analytic domains
Misclassification Matrix
$\theta_{jk}$ = the proportion of units classified as category $j$ in true category $k$, and $\sum_{j=1}^{m} \theta_{jk} = 1$
\[
\boldsymbol{\Theta} =
\begin{pmatrix}
\theta_{11} & \theta_{12} & \cdots & \theta_{1m} \\
\theta_{21} & \theta_{22} & \cdots & \theta_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{m1} & \theta_{m2} & \cdots & \theta_{mm}
\end{pmatrix}
\]
Rows: stratification classification $A^*$
Columns: true classification $A$
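As a minimal sketch of the definition above, the misclassification matrix can be obtained by column-normalizing a cross-tabulation of stratification category (rows) by true category (columns). The function name `misclassification_matrix` and the three-category counts are hypothetical and only illustrate the arithmetic.

```python
def misclassification_matrix(counts):
    """Column-normalize a cross-tabulation.

    counts[j][k] is the (weighted) count of units placed in category j by the
    stratification classification A* whose true category under A is k.
    Returns theta with theta[j][k] = counts[j][k] / (total of column k).
    """
    m = len(counts)
    col_totals = [sum(counts[j][k] for j in range(m)) for k in range(m)]
    return [[counts[j][k] / col_totals[k] for k in range(m)] for j in range(m)]


# Hypothetical weighted counts (made-up numbers, not survey data).
counts = [
    [90, 5, 10],
    [8, 80, 10],
    [2, 15, 80],
]
theta = misclassification_matrix(counts)

# Each column of theta sums to 1, matching the constraint on theta_jk.
col_sums = [sum(theta[j][k] for j in range(3)) for k in range(3)]
```

The diagonal entries of `theta` are the correct-classification rates for each true category; off-diagonal entries show where misclassified units end up.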
Measures for Misclassification Effects
Bias
Effective sample size change
Bias Due to Misclassification
\[
\mathrm{Bias}(p_{A^*}) = (\boldsymbol{\Theta} - \mathbf{I})\, P_A
\]
where
$P_A = (P_A(1), \ldots, P_A(m))$ = true population proportions
$p_{A^*} = (p_{A^*}(1), \ldots, p_{A^*}(m))$ = sample proportions
$p_{A^*}(j) = \sum_{i \in s} w_i \, I(A^*_i = j)$, where $s$ denotes the sample, $w_i$ the sampling weight for unit $i$, and $I(\cdot)$ the indicator function
$\mathbf{I}$ = identity matrix
Kuha and Skinner 1997
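To make the formula concrete, here is a short worked case for two categories ($m = 2$). It is an illustration derived from the expression above, not an additional result from the talk. Using the column constraint $\theta_{11} + \theta_{21} = 1$:

\[
\mathrm{Bias}(p_{A^*}(1)) = (\theta_{11} - 1)\,P_A(1) + \theta_{12}\,P_A(2)
= \theta_{12}\,P_A(2) - \theta_{21}\,P_A(1),
\qquad
\mathrm{Bias}(p_{A^*}(2)) = -\,\mathrm{Bias}(p_{A^*}(1)).
\]

The bias is the net flow into category 1: units wrongly moved in from category 2 minus units wrongly moved out. It vanishes when the two flows offset each other, even if both misclassification rates are nonzero, which is consistent with the near-zero relative bias reported for gender later in the talk.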
Bias Estimation
If the true classification is available from the sample:
\[
\widehat{\mathrm{ReBias}}(p_{A^*}) = \mathbf{D}_{p_A}^{-1}\,(\hat{\boldsymbol{\Theta}} - \mathbf{I})\, p_A
\]
where
$\mathbf{D}_{p_A} = (d_{jk})$; $d_{jk} = p_A(j)$ if $j = k$; $d_{jk} = 0$ otherwise
$p_A(j) = \sum_{i \in s} w_i \, I(A_i = j)$
$\hat{\boldsymbol{\Theta}} = (\hat{\theta}_{jk})$,
\[
\hat{\theta}_{jk} = \frac{\sum_{i \in s} w_i \, I(A^*_i = j,\; A_i = k)}{\sum_{i \in s} w_i \, I(A_i = k)}
\]
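The estimator above can be sketched in a few lines of Python for unit-level sample data. The function name `estimate_rebias`, the toy weights, and the category labels are hypothetical. The sketch also assumes the weights are normalized by their total so that $p_A$ is a proportion, and that every category has at least one sampled unit under the true classification.

```python
def estimate_rebias(weights, a_star, a_true, categories):
    """Return (theta_hat, p_a, rebias) from weights and two classifications.

    a_star: stratification (possibly misclassified) category per unit.
    a_true: true category per unit, as observed in the sample.
    Assumption: weights are divided by their total to form proportions.
    """
    m = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    total_w = sum(weights)

    # joint[j][k]: weighted count of units with A* = j and A = k
    joint = [[0.0] * m for _ in range(m)]
    for w, s, t in zip(weights, a_star, a_true):
        joint[idx[s]][idx[t]] += w

    col = [sum(joint[j][k] for j in range(m)) for k in range(m)]
    p_a = [col[k] / total_w for k in range(m)]
    theta_hat = [[joint[j][k] / col[k] for k in range(m)] for j in range(m)]

    # (Theta_hat - I) p_A, then element-wise division by p_A (applies D^{-1})
    rebias = []
    for j in range(m):
        bias_j = sum(theta_hat[j][k] * p_a[k] for k in range(m)) - p_a[j]
        rebias.append(bias_j / p_a[j])
    return theta_hat, p_a, rebias


# Toy data (made up): six sampled units with unequal weights.
weights = [10, 10, 10, 20, 20, 30]
a_star = ["M", "M", "M", "F", "F", "F"]
a_true = ["M", "M", "F", "F", "F", "M"]
theta_hat, p_a, rebias = estimate_rebias(weights, a_star, a_true, ["M", "F"])
```

In this toy case the true proportions are 0.5 each while the stratification classification gives 0.3 for "M", so the estimated relative bias for "M" is -0.4 and for "F" is +0.4.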
Effective Sample Sizes and Variance Inflation Factors
Measures the inflation of variance due to weight variation
\[
n_{\mathrm{eff},d} = \frac{n_d}{\mathit{deff}_{w,d}},
\quad \text{where} \quad
\mathit{deff}_{w,d} = 1 + CV_{w,d}^2 = \frac{\sum_{i \in d} w_i^2}{n_d \, \bar{w}_d^2}
\]
\[
\mathit{VIF}_{w,d} = \frac{\mathit{deff}_{w,d}(A)}{\mathit{deff}_{w,d}(A^*)}
\]
$\mathit{deff}_{w,d}(A)$: for domain $d$ constructed based on the true value
$\mathit{deff}_{w,d}(A^*)$: for domain $d$ constructed based on the misclassified value
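A minimal sketch of these quantities, assuming the design effect from weighting takes the form written above (one plus the squared coefficient of variation of the weights within the domain). The function names and the toy weights are hypothetical.

```python
def deff_w(weights):
    """Design effect due to unequal weights: sum(w^2) / (n * mean(w)^2)."""
    n = len(weights)
    mean_w = sum(weights) / n
    return sum(w * w for w in weights) / (n * mean_w ** 2)


def effective_sample_size(weights):
    """Nominal sample size deflated by the weighting design effect."""
    return len(weights) / deff_w(weights)


def vif(weights_true_domain, weights_misclassified_domain):
    """Ratio of deff for the domain built on A to deff for the domain built on A*."""
    return deff_w(weights_true_domain) / deff_w(weights_misclassified_domain)


# Toy example (made-up weights): a domain defined by A* has equal weights,
# while the same domain defined by A picks up one unit from a stratum that
# was sampled at a much lower rate (larger weight).
domain_by_a_star = [10, 10, 10, 10]
domain_by_a = [10, 10, 10, 40]

ratio = vif(domain_by_a, domain_by_a_star)
n_eff = effective_sample_size(domain_by_a)
```

With equal weights the design effect is 1, so here the variance inflation factor equals the design effect of the reclassified domain, about 1.55, and four sampled units are worth about 2.58 effective units.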
Example: National Survey of Recent College Graduates (NSRCG)
Sponsored by the National Science Foundation
Collects education, employment, and demographic information from recent graduates with Bachelor's or Master's degrees in science, engineering, or health fields
For details:
– http://www.nsf.gov/statistics/srvyrecentgrads
NSRCG (Continued)
Two-stage sample design: school sample at the first stage and graduate sample at the second stage
Crucial to collect key sampling variables (degree date, degree level, field of major, race/ethnicity, and gender) from schools for eligibility determination and stratification (frame variables)
Sample was designed to have moderate weight variation within domains while meeting certain sample size thresholds
Quality of sampling variables compromised by schools' reluctance to release student information, non-standard formats used by schools, and inaccurate or incomplete administrative data
Jang and Lin (2007 JSM)
NSRCG (Continued)
The same information (degree date, degree level, field of major, race/ethnicity, and gender) was also collected from sampled graduates
Able to measure the quality of school-provided information for stratification by assessing discrepancies between school-provided information and reported values
Looking at data from two survey cycles (2003 and 2006 NSRCG)
Misclassification for Gender
Rows: stratification classification. Columns: response classification. Percentages are column percentages.
Prop_F: proportion (%) by stratification classification. Prop_R: proportion (%) by response.

NSRCG 2003
Stratification   Response: Male     Response: Female    Total
Male             472,866 (98.5%)    7,042 (1.0%)        479,908
Female           7,081 (1.5%)       667,266 (99.0%)     674,347
Total            479,946            674,309             1,154,255

          Prop_F   Prop_R
Male      41.58    41.58
Female    58.42    58.42
Total     100      100

ReBias for P_Male = -0.01%

NSRCG 2006
Stratification   Response: Male     Response: Female    Total
Male             824,984 (99.4%)    8,750 (0.8%)        833,734
Female           4,611 (0.6%)       1,095,674 (99.2%)   1,100,284
Total            829,594            1,104,424           1,934,018

          Prop_F   Prop_R
Male      43.11    42.89
Female    56.89    57.11
Total     100      100

ReBias for P_Male = 0.50%
Misclassification for Race/Ethnicity
Rows: stratification classification. Columns: response classification. Percentages are column percentages.
Prop_F: proportion (%) by stratification classification. Prop_R: proportion (%) by response.

NSRCG 2003
Stratification   Response: White     Response: Asian    Response: Minority   Total
White            678,516 (82.4%)     4,891 (3.4%)       12,586 (6.7%)        695,992
Asian            136,099 (16.5%)     134,386 (94.7%)    26,834 (14.2%)       297,320
Minority         8,546 (1.0%)        2,659 (1.9%)       149,739 (79.2%)      160,943
Total            823,161             141,936            189,158              1,154,255

           Prop_F   Prop_R
White      60.30    71.32
Asian      25.76    12.30
Minority   13.94    16.39
Total      100      100

NSRCG 2006
Stratification   Response: White     Response: Asian    Response: Minority   Total
White            1,196,301 (90.6%)   9,636 (3.5%)       28,473 (8.4%)        1,234,409
Asian            113,823 (8.6%)      262,197 (95.0%)    39,869 (11.8%)       415,889
Minority         9,841 (0.7%)        4,130 (1.5%)       269,749 (79.8%)      283,720
Total            1,319,964           275,963            338,091              1,934,018

           Prop_F   Prop_R
White      63.83    68.25
Asian      21.50    14.27
Minority   14.67    17.48
Total      100      100

Relative Bias
                                                   NSRCG 2003   NSRCG 2006
Relative Bias of P_White (White vs. Others)        -15.4%       -6.5%
Relative Bias of P_Asian (Asian vs. Others)        109.5%       50.7%
Relative Bias of P_Minority (Minority vs. Others)  -14.9%       -16.1%
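The relative bias figures can be checked directly from the row and column totals of the race/ethnicity cross-tabulations. This is a small sketch: the function name `relative_bias_pct` is hypothetical and the totals are copied from the tables in this talk. Because both proportions share the same grand total, the ratio of weighted totals equals the ratio of proportions.

```python
def relative_bias_pct(frame_totals, reported_totals):
    """Relative bias (%) of the stratification-based proportion per category."""
    return {
        group: 100.0 * (frame_totals[group] - reported_totals[group])
        / reported_totals[group]
        for group in frame_totals
    }


# Row totals (stratification) and column totals (response) from the tables.
frame_2003 = {"White": 695992, "Asian": 297320, "Minority": 160943}
reported_2003 = {"White": 823161, "Asian": 141936, "Minority": 189158}
frame_2006 = {"White": 1234409, "Asian": 415889, "Minority": 283720}
reported_2006 = {"White": 1319964, "Asian": 275963, "Minority": 338091}

rb_2003 = relative_bias_pct(frame_2003, reported_2003)
rb_2006 = relative_bias_pct(frame_2006, reported_2006)

for group in frame_2003:
    print(f"{group:9s} {rb_2003[group]:7.1f}% {rb_2006[group]:7.1f}%")
```

Rounded to one decimal place, this reproduces the reported values: -15.4, 109.5, and -14.9 for 2003, and -6.5, 50.7, and -16.1 for 2006.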
Effective Sample Sizes and Variance Inflation Factors
What if reported values are used for discrepant cases?
Results in more weight variation within domains based on reported values, because selection probabilities are unequal across classes
Check domain-specific sample sizes and variance inflation factors
[Figures not reproduced. Each figure shows NSRCG 2003 and NSRCG 2006 side by side, with plot symbols distinguishing White, Asian, and Minority.]
Domain: race/ethnicity by degree level by major field by gender
– Variance Inflation Factors
– Ratio of Sample Size, n_R / n_F
– Ratio of Effective Sample Size, n_R / n_F
Domain: race/ethnicity by degree level by major field
– Variance Inflation Factors
– Ratio of Sample Size, n_R / n_F
– Ratio of Effective Sample Size, n_R / n_F
Domain: race/ethnicity by gender
– Variance Inflation Factors
– Ratio of Sample Size, n_R / n_F
– Ratio of Effective Sample Size, n_R / n_F
Summary
Misclassification in stratification may reduce the effective sample sizes for domains that were sampled at high sampling rates
Crucial to have good classification in stratification, especially when selection probabilities are substantially unequal
Next Steps
Population counts for key domains available but based on misclassification
Estimation of population counts:
– Weighted sums of correct classification from the sample
– Use of misclassification parameter estimates, $\hat{T}_A = \hat{\boldsymbol{\Theta}}^{-1} T_{A^*}$, where $T_{A^*}$ is the vector with population counts of domains defined by $A^*$
Raking adjustments of the weights using $\hat{T}_A$
Comparison of key estimates
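The second estimation option amounts to solving the linear system $\hat{\boldsymbol{\Theta}}\, \hat{T}_A = T_{A^*}$. The sketch below uses a small hand-written Gaussian elimination so that it needs no external libraries; the function names are hypothetical. As an illustration only, it takes the 2006 gender cross-tabulation from this talk and uses the sample-based row totals as a stand-in for $T_{A^*}$. In an actual application $T_{A^*}$ would come from the frame, which is an assumption about how the proposed step would be carried out.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("misclassification matrix is singular or nearly so")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


# 2006 gender cross-tabulation (rows: stratification, columns: response).
counts = [
    [824984, 8750],
    [4611, 1095674],
]
m = len(counts)
col_totals = [sum(counts[j][k] for j in range(m)) for k in range(m)]
theta_hat = [[counts[j][k] / col_totals[k] for k in range(m)] for j in range(m)]

# Stand-in for T_{A*}: totals by stratification category (row sums).
t_a_star = [sum(row) for row in counts]

# Estimated totals by true category: T_A_hat = Theta_hat^{-1} T_{A*}.
t_a_hat = solve(theta_hat, t_a_star)
```

Because the matrix and the totals here come from the same table, the solution recovers the column totals by construction. The adjustment only changes the estimates when $T_{A^*}$ is taken from an independent source such as the frame.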