IDENTIFICATION AND INFERENCE IN NONLINEAR MODELS USING TWO SAMPLES WITH NONCLASSICAL MEASUREMENT ERRORS∗
By Xiaohong Chen† and Yingyao Hu
Yale University and Johns Hopkins University
This paper considers identification and inference of a general nonlinear Errors-in-Variables (EIV) model using two samples. Both samples consist of a dependent variable, some error-free covariates, and an error-ridden covariate, in which the measurement error has unknown distribution and could be arbitrarily correlated with the latent true values; and neither sample contains an accurate measurement of the corresponding true variable. We assume that the latent model of interest – the conditional distribution of the dependent variable given the latent true covariate and the error-free covariates – is the same in both samples, but the distributions of the latent true covariates vary with observed error-free discrete covariates. We first show that the general latent nonlinear model is nonparametrically identified using the two samples when both could have nonclassical errors, without requiring the existence of instrumental variables or independence between the two samples. When the two samples are independent and the latent nonlinear model is parameterized, we propose sieve Quasi Maximum Likelihood Estimation (Q-MLE) for the parameter of interest, and establish its root-n consistency and asymptotic normality under possible misspecification, and its semiparametric efficiency under correct specification. We also provide a sieve likelihood ratio model selection test to compare two possibly misspecified parametric latent models. A small Monte Carlo simulation is presented.
1. Introduction. Measurement error problems are frequently en-
countered by researchers conducting empirical studies in social and natural
sciences. A measurement error is called classical if it is independent of the
latent true values; otherwise, it is called nonclassical. There have been many
∗The authors would like to thank R. Blundell, R. Carroll, P. Cross, S. Donald, B. Fitzenberger, E. Mammen, L. Nesheim, M. Stinchcombe, C. Taber, and conference participants at the 2006 North American Summer Meeting of the Econometric Society and the 2006 Southern Economic Association annual meeting for valuable suggestions.
†Chen acknowledges partial support from the National Science Foundation.
AMS 2000 subject classifications: Primary 62H12, 62D05; secondary 62G08.
Keywords and phrases: data combination, nonlinear errors-in-variables model, nonclassical measurement error, nonparametric identification, misspecified parametric latent model, sieve likelihood estimation and inference.
studies on identification and estimation of linear, nonlinear, and even non-
parametric models with classical measurement errors (see, e.g., Fuller (1987),
Cheng and Van Ness (1999), Wansbeek and Meijer (2000), Carroll, Ruppert,
Stefanski and Crainiceanu (2006) for detailed reviews). However, numerous
validation studies in economic survey data sets indicate that the errors in
self-reported variables, such as earnings, are typically correlated with the
true values, and hence, are nonclassical (see, e.g., Bound, Brown, and Math-
iowetz (2001)). In fact, in many survey situations, a rational agent has an in-
centive to purposely report wrong values conditioning on his/her truth. This
motivates many recent studies on Errors-In-Variables (EIV) problems allow-
ing for nonclassical measurement errors. In this paper, we provide one so-
lution to the nonparametric identification of a general nonlinear EIV model
by combining two samples, where both samples contain mismeasured co-
variates and neither contains an accurate measurement of the latent true
variable. Our identification strategy does not require the existence of instru-
mental variables or repeated measurements, and both samples could have
nonclassical measurement errors and the two samples could be arbitrarily
correlated.
It is well known that, without additional parametric restrictions or sample
information, a general nonlinear model cannot be identified in the presence
of mismeasured covariates. There are currently three broad approaches to
regaining identification of nonlinear EIV models. The first one is to impose
parametric restrictions on measurement error distributions (see, e.g., Hsiao
(1989), Fan (1991), Murphy and Van der Vaart (1996), Wang, Lin, Gutier-
rez and Carroll (1998), Liang, Hardle and Carroll (1999), Taupin (2001),
Hong and Tamer (2003), and others). The second approach is to assume
the existence of Instrumental Variables (IVs), such as a repeated measure-
ment of the mismeasured covariates, that do not enter the latent model of
interest but do contain information to recover features of latent true vari-
ables (see, e.g., Amemiya and Fuller (1988), Carroll and Stefanski (1990),
Hausman, Ichimura, Newey, and Powell (1991), Wang and Hsiao (1995),
Buzas and Stefanski (1996), Li and Vuong (1998), Newey (2001), Li (2002),
Wang (2004), Schennach (2004), Carroll, Ruppert, Crainiceanu, Tosteson
and Karagas (2004), Lewbel (2007), Hu (2006) and Hu and Schennach
(2006), to name only a few). The third approach to identifying nonlinear
EIV models with nonclassical errors is to combine two samples (see, e.g.,
Hausman, Ichimura, Newey, and Powell (1991), Carroll and Wand (1991),
Lee and Sepanski (1995), Chen, Hong, and Tamer (2005), Chen, Hong and
Tarozzi (2007), Hu and Ridder (2006), and Ichimura and Martinez-Sanchis
(2006), to name only a few).
The approach of combining samples has the advantages of allowing for
arbitrary measurement errors in the primary sample, without the need of
finding IVs or imposing parametric assumptions on measurement error dis-
tributions. However, all the currently published papers using this approach
require that the auxiliary sample contain an accurate measurement of the
true value; such a sample might be difficult to find in some applications. See
Carroll, Ruppert, and Stefanski (1995) and Ridder and Moffitt (2006) for a
detailed survey of this approach.
In this paper, we provide nonparametric identification of a general non-
linear EIV model with measurement errors in covariates by combining a
primary sample and an auxiliary sample, in which each sample contains
only one measurement of the error-ridden explanatory variable, and the er-
rors in both samples may be nonclassical. Our approach differs from the IV
approach in that we do not require an IV excluded from the latent model of
interest, and all the variables in our samples may be included in the model.
Our approach is closer to the existing two-sample approach, since we also
require an auxiliary sample and allow for nonclassical measurement errors
in both samples. However, our identification strategy differs crucially from
the existing two-sample approach in that neither of our samples contains an
accurate measurement of the latent true variable.
We assume that both samples consist of a dependent variable (Y ), some
error-free covariates (W ), and an error-ridden covariate (X), in which the
measurement error has unknown distribution and could be arbitrarily cor-
related with the latent true values (X∗); and neither sample contains an
accurate measurement of the corresponding true variable. We assume that
the latent model of interest, fY |X∗,W , the conditional distribution of the
dependent variable given the latent true covariate and the error-free covari-
ates, is the same in both samples, but the marginal distributions of the latent
true variables differ across some contrasting subsamples. These contrasting
subsamples of the primary and the auxiliary samples may be different geo-
graphic areas, age groups, or other observed demographic characteristics. We
use the difference between the distributions of the latent true values in the
contrasting subsamples of both samples to show that the measurement error
distributions are identified. To be specific, we may identify the relationship
between the measurement error distribution in the auxiliary sample and the
ratio of the marginal distributions of latent true values in the subsamples. In
fact, the ratio of the marginal distributions plays the role of an eigenvalue of
an observed linear operator, while the measurement error distribution in the
auxiliary sample is the corresponding eigenfunction. Therefore, the measure-
ment error distribution may be identified through a diagonal decomposition
of an observed linear operator under the normalization condition that the
measurement error distribution in the auxiliary sample has zero mode (or
zero median or mean). The latent nonlinear model of interest, fY |X∗,W , may
then be nonparametrically identified. In this paper, we first illustrate our
identification strategy using a nonlinear EIV model with nonclassical errors
in discrete covariates of two samples. We then focus on nonparametric iden-
tification of a general latent nonlinear model with arbitrary measurement
errors in continuous covariates.
Our identification result allows for fully nonparametric EIV models and
also allows for two correlated samples. But, in most empirical applications,
the latent models of interest are parametric nonlinear models, and the two
samples are regarded as independent. Within this framework, we propose a
sieve Quasi-Maximum Likelihood Estimation (Q-MLE) for the latent non-
linear model of interest using two samples with nonclassical measurement
errors. Under possible misspecification of the latent parametric model, we
establish root-n consistency and asymptotic normality of the sieve Q-MLE of
the finite dimensional parameter of interest, as well as its semiparametric ef-
ficiency under correct specification. In addition, we provide a sieve likelihood
ratio model selection test to compare two possibly misspecified parametric
nonlinear EIV models with nonclassical errors.
In this paper, for any two possibly vector-valued random variables A and
B, we let fA|B denote the conditional density of A given B, fA denote the
density of A. We assume the existence of two samples. The primary sam-
ple is a random sample from (X,W, Y ), in which X is a mismeasured X∗;
and the auxiliary sample is a random sample from (Xa,Wa, Ya), in which
Xa is a mismeasured X∗a . These two samples could be correlated and could
have different joint distributions. The rest of the paper is organized as fol-
lows. Section 2 establishes the nonparametric identification of the latent
probability model of interest, fY |X∗,W , using two samples with (possibly)
nonclassical errors. Section 3 presents the two-sample sieve Q-MLE and the
sieve likelihood ratio model selection test under possibly misspecified para-
metric latent models. Section 4 provides a Monte Carlo study and Section 5
briefly concludes. The Appendix contains the proofs of the main theorems.
2. Nonparametric Identification.
2.1. The dichotomous case: an illustration. We first illustrate our iden-
tification strategy by describing a special case in which all the variables
(X∗,X,W, Y ) are 0-1 dichotomous. For example, suppose that we are inter-
ested in the effect of the latent true college education level X∗ on the em-
ployment status Y with the marital status W as a covariate, i.e., fY |X∗,W .
Instead of X∗ we observe self-reported college education level X.
In this illustration subsection, we use italic letters to highlight all the
assumptions imposed for the nonparametric identification of fY |X∗,W , while
detailed discussions of the assumptions are postponed to subsection 2.2.
First, we assume that the measurement error in X is independent of all other
variables in the model conditional on the true value X∗, i.e., fX|X∗,W,Y =
fX|X∗ . In our simple example, this assumption implies that all the people
with the same latent true education level have the same pattern of misre-
porting, although the true education level could depend on other individual
characteristics. Under this assumption, the probability distribution of the
observables equals
$$f_{X,W,Y}(x,w,y) = \sum_{x^*=0,1} f_{X|X^*}(x|x^*)\, f_{X^*,W,Y}(x^*,w,y) \quad \text{for all } x, w, y. \tag{2.1}$$
We define the matrix representations of fX|X∗ as follows:
$$L_{X|X^*} = \begin{pmatrix} f_{X|X^*}(0|0) & f_{X|X^*}(0|1) \\ f_{X|X^*}(1|0) & f_{X|X^*}(1|1) \end{pmatrix}.$$
Notice that the matrix LX|X∗ contains the same information as the condi-
tional density fX|X∗ . Equation (2.1) becomes for all w, y
$$\begin{pmatrix} f_{X,W,Y}(0,w,y) \\ f_{X,W,Y}(1,w,y) \end{pmatrix} = L_{X|X^*} \times \begin{pmatrix} f_{X^*,W,Y}(0,w,y) \\ f_{X^*,W,Y}(1,w,y) \end{pmatrix}. \tag{2.2}$$
Equation (2.2) implies that the density fX∗,W,Y is identified provided
that LX|X∗ is identifiable and invertible.
Equation (2.1) is equivalent to, for the subsamples of the married (W = 1)
and of the unmarried (W = 0)
$$f_{X,Y|W=j}(x,y) = \sum_{x^*=0,1} f_{X|X^*}(x|x^*)\, f_{Y|X^*,W=j}(y|x^*)\, f_{X^*|W=j}(x^*), \tag{2.3}$$
in which fX,Y |W=j(x, y) ≡ fX,Y |W (x, y|j) and j = 0, 1. By counting the
numbers of knowns and unknowns in equation (2.3), one can see that the
unknown densities fX|X∗ , fY |X∗,W=j and fX∗|W=j cannot be identified using
the primary sample alone.
In the auxiliary sample, we assume that the measurement error in Xa
satisfies the same conditional independence assumption as that in X, i.e.,
fXa|X∗a ,Wa,Ya = fXa|X∗a . Furthermore, we link the two samples by a stable
assumption that the effect of interest, i.e., the distribution of the employment
status conditional on the true education level and the marital status, is the
same in the two samples, i.e., fYa|X∗a ,Wa(y|x∗, w) = fY |X∗,W (y|x∗, w) for all
y, x∗, w. Therefore, we have for the subsamples of the married (Wa = 1) and
of the unmarried (Wa = 0):
$$f_{X_a,Y_a|W_a=j}(x,y) = \sum_{x^*=0,1} f_{X_a|X_a^*}(x|x^*)\, f_{Y|X^*,W=j}(y|x^*)\, f_{X_a^*|W_a=j}(x^*). \tag{2.4}$$
Since all the variables are 0-1 dichotomous and probabilities sum to one, equations (2.3) and (2.4) involve 12 distinct known probability values of fX,Y|W=j and fXa,Ya|Wa=j, and 12 distinct unknown values of fX|X∗, fY|X∗,W=j, fX∗|W=j, fXa|X∗a, and fX∗a|Wa=j, which makes exact identification (a unique solution) of the 12 distinct unknown values possible. However, because equations (2.3) and (2.4) are nonlinear in the unknown values, we need additional restrictions to ensure the existence of a unique solution.
Denote $\mathcal{W}_j = \{j\}$ for $j = 0, 1$. Define the matrix representations of relevant densities for the subsamples of the married ($\mathcal{W}_1$) and of the unmarried ($\mathcal{W}_0$) in the primary sample as follows: for $j = 0, 1$,

$$L_{X,Y|W_j} = \begin{pmatrix} f_{X,Y|W_j}(0,0) & f_{X,Y|W_j}(0,1) \\ f_{X,Y|W_j}(1,0) & f_{X,Y|W_j}(1,1) \end{pmatrix},$$

$$L_{Y|X^*,W_j} = \begin{pmatrix} f_{Y|X^*,W_j}(0|0) & f_{Y|X^*,W_j}(0|1) \\ f_{Y|X^*,W_j}(1|0) & f_{Y|X^*,W_j}(1|1) \end{pmatrix}^T,$$

$$L_{X^*|W_j} = \begin{pmatrix} f_{X^*|W_j}(0) & 0 \\ 0 & f_{X^*|W_j}(1) \end{pmatrix},$$
where the superscript $T$ stands for the transpose of a matrix. Let $\mathcal{W}_{aj} = \{j\}$ for $j = 0, 1$. We similarly define the matrix representations $L_{X_a,Y_a|W_{aj}}$, $L_{X_a|X_a^*}$, and $L_{X_a^*|W_{aj}}$ of the corresponding densities $f_{X_a,Y_a|W_{aj}}$, $f_{X_a|X_a^*}$, and $f_{X_a^*|W_{aj}}$ in the auxiliary sample. To simplify notation, in the following we use $\mathcal{W}_j$ instead of $\mathcal{W}_{aj}$ in the auxiliary sample.
Using the matrix notation, equation (2.3) becomes, for $j = 0, 1$,

$$\begin{aligned}
L_{X,Y|W_j} &= \begin{pmatrix} f_{X,Y|W_j}(0,0) & f_{X,Y|W_j}(0,1) \\ f_{X,Y|W_j}(1,0) & f_{X,Y|W_j}(1,1) \end{pmatrix} \\
&= \begin{pmatrix} f_{X|X^*}(0|0) & f_{X|X^*}(0|1) \\ f_{X|X^*}(1|0) & f_{X|X^*}(1|1) \end{pmatrix}\begin{pmatrix} f_{Y,X^*|W_j}(0,0) & f_{Y,X^*|W_j}(1,0) \\ f_{Y,X^*|W_j}(0,1) & f_{Y,X^*|W_j}(1,1) \end{pmatrix} \\
&= L_{X|X^*}\begin{pmatrix} f_{X^*|W_j}(0) & 0 \\ 0 & f_{X^*|W_j}(1) \end{pmatrix}\begin{pmatrix} f_{Y|X^*,W_j}(0|0) & f_{Y|X^*,W_j}(0|1) \\ f_{Y|X^*,W_j}(1|0) & f_{Y|X^*,W_j}(1|1) \end{pmatrix}^T,
\end{aligned}$$

that is,

$$L_{X,Y|W_j} = L_{X|X^*} L_{X^*,Y|W_j} = L_{X|X^*} L_{X^*|W_j} L_{Y|X^*,W_j}. \tag{2.5}$$
Similarly, equation (2.4) becomes

$$L_{X_a,Y_a|W_j} = L_{X_a|X_a^*} L_{X_a^*,Y_a|W_j} = L_{X_a|X_a^*} L_{X_a^*|W_j} L_{Y|X^*,W_j}. \tag{2.6}$$
We assume that the observable matrices $L_{X,Y|W_j}$ and $L_{X_a,Y_a|W_j}$ are invertible, that the diagonal matrices $L_{X^*|W_j}$ and $L_{X_a^*|W_j}$ are invertible, and that $L_{X_a|X_a^*}$ is invertible. Then equations (2.6) and (2.5) imply that $L_{Y|X^*,W_j}$ and $L_{X|X^*}$ are invertible. We can then eliminate $L_{Y|X^*,W_j}$ to obtain, for $j = 0, 1$,

$$L_{X_a,Y_a|W_j} L_{X,Y|W_j}^{-1} = L_{X_a|X_a^*} L_{X_a^*|W_j} L_{X^*|W_j}^{-1} L_{X|X^*}^{-1}.$$
Since this equation holds for j = 0, 1, we may then eliminate LX|X∗ , to have
$$\begin{aligned}
L_{X_a,X_a} &\equiv \left( L_{X_a,Y_a|W_1} L_{X,Y|W_1}^{-1} \right)\left( L_{X_a,Y_a|W_0} L_{X,Y|W_0}^{-1} \right)^{-1} = L_{X_a|X_a^*}\left( L_{X_a^*|W_1} L_{X^*|W_1}^{-1} L_{X^*|W_0} L_{X_a^*|W_0}^{-1} \right) L_{X_a|X_a^*}^{-1} \\
&= \begin{pmatrix} f_{X_a|X_a^*}(0|0) & f_{X_a|X_a^*}(0|1) \\ f_{X_a|X_a^*}(1|0) & f_{X_a|X_a^*}(1|1) \end{pmatrix}\begin{pmatrix} k_{X_a^*}(0) & 0 \\ 0 & k_{X_a^*}(1) \end{pmatrix}\begin{pmatrix} f_{X_a|X_a^*}(0|0) & f_{X_a|X_a^*}(0|1) \\ f_{X_a|X_a^*}(1|0) & f_{X_a|X_a^*}(1|1) \end{pmatrix}^{-1},
\end{aligned} \tag{2.7}$$

with

$$k_{X_a^*}(x^*) \equiv \frac{f_{X_a^*|W_1}(x^*)\, f_{X^*|W_0}(x^*)}{f_{X^*|W_1}(x^*)\, f_{X_a^*|W_0}(x^*)} \quad \text{for } x^* \in \{0,1\}.$$
Notice that the matrix LXa,Xa on the left-hand side of equation (2.7) can
be viewed as observed given the data. Equation (2.7) provides an eigenvalue-
eigenvector decomposition of LXa,Xa . If such a decomposition is unique, then
we may identify (or solve) LXa|X∗a , i.e., fXa|X∗a , from the observed matrix
LXa,Xa .
To ensure a unique eigenvalue-eigenvector decomposition of LXa,Xa , we
assume that the eigenvalues are distinct; i.e., kX∗a(0) ≠ kX∗a(1). This
assumption requires that the distributions of the latent education level of
the married or the unmarried in the primary sample are different from those
in the auxiliary sample, and that the distribution of the latent education
level of the married is different from that of the unmarried in one of the
two samples. Notice that each eigenvector is a column in LXa|X∗a , which is
a conditional density; hence each eigenvector is automatically normalized.
Therefore, for an observed LXa,Xa , we may have an eigenvalue-eigenvector
decomposition as follows:
$$L_{X_a,X_a} = \begin{pmatrix} f_{X_a|X_a^*}(0|x_1^*) & f_{X_a|X_a^*}(0|x_2^*) \\ f_{X_a|X_a^*}(1|x_1^*) & f_{X_a|X_a^*}(1|x_2^*) \end{pmatrix}\begin{pmatrix} k_{X_a^*}(x_1^*) & 0 \\ 0 & k_{X_a^*}(x_2^*) \end{pmatrix}\begin{pmatrix} f_{X_a|X_a^*}(0|x_1^*) & f_{X_a|X_a^*}(0|x_2^*) \\ f_{X_a|X_a^*}(1|x_1^*) & f_{X_a|X_a^*}(1|x_2^*) \end{pmatrix}^{-1}. \tag{2.8}$$
The value of each entry on the right-hand side of equation (2.8) can be
directly computed from the observed matrix LXa,Xa . The only ambiguity
left in equation (2.8) is the value of the indices x∗1 and x∗2, or the index-
ing of the eigenvalues and eigenvectors. In other words, the identification
of fXa|X∗a boils down to finding a 1-to-1 mapping between the two sets of
indices of the eigenvalues and eigenvectors: $\{x_1^*, x_2^*\} \Longleftrightarrow \{0,1\}$. For this, we make a normalization assumption that people with (or without) true college education in the auxiliary sample are more likely to report that they have (or do not have) college education; i.e., $f_{X_a|X_a^*}(x^*|x^*) > 0.5$ for $x^* = 0, 1$. (This assumption also implies the invertibility of $L_{X_a|X_a^*}$.) Since the values of $f_{X_a|X_a^*}(0|x_1^*)$ and $f_{X_a|X_a^*}(1|x_1^*)$ are known in equation (2.8), this assumption pins down the index $x_1^*$ as follows:

$$x_1^* = \begin{cases} 0 & \text{if } f_{X_a|X_a^*}(0|x_1^*) > 0.5, \\ 1 & \text{if } f_{X_a|X_a^*}(1|x_1^*) > 0.5. \end{cases}$$
The value of x∗2 may be found in the same way. In summary, we have iden-
tified LXa|X∗a , i.e., fXa|X∗a , from the decomposition of the observed matrix
LXa,Xa .
After identifying $L_{X_a|X_a^*}$, we can then identify $L_{X_a^*,Y_a|W_j}$ or $f_{X_a^*,Y_a|W_j}$ from equation (2.6) as $L_{X_a^*,Y_a|W_j} = L_{X_a|X_a^*}^{-1} L_{X_a,Y_a|W_j}$; hence the conditional density $f_{Y|X^*,W_j} = f_{Y_a|X_a^*,W_j}$ and the marginal density $f_{X_a^*|W_j}$ are identified. Moreover, we can then identify $L_{X,X^*|W_j}$ or $f_{X,X^*|W_j}$ from equation (2.5) as $L_{X,X^*|W_j} = L_{X,Y|W_j} L_{Y|X^*,W_j}^{-1}$; hence the densities $f_{X|X^*}$ and $f_{X^*|W_j}$ are identified.
This simple example with dichotomous variables demonstrates that we
can nonparametrically identify fY |X∗,W = fYa|X∗a ,Wa, fX|X∗ and fXa|X∗a using
the two samples in which both samples contain nonclassical measurement
errors. We next show that such a nonparametric identification strategy is in
fact generally applicable.
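To make the matrix algebra concrete, the following minimal numerical sketch (in Python) carries out the dichotomous-case identification steps on hypothetical densities; all numerical values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Hypothetical true densities for the 0-1 dichotomous example (illustrative only).
L_X = np.array([[0.8, 0.3],     # f_{X|X*}(x|x*), primary sample; columns indexed by x*
                [0.2, 0.7]])
L_Xa = np.array([[0.9, 0.2],    # f_{Xa|Xa*}(x|x*), auxiliary; diagonal > 0.5 (normalization)
                 [0.1, 0.8]])
fXs_W = {0: np.array([0.6, 0.4]), 1: np.array([0.3, 0.7])}     # f_{X*|W=j}
fXsa_W = {0: np.array([0.5, 0.5]), 1: np.array([0.55, 0.45])}  # f_{Xa*|Wa=j}
fY_XsW = {0: np.array([[0.7, 0.4], [0.3, 0.6]]),               # f_{Y|X*,W=j}(y|x*); columns = x*
          1: np.array([[0.2, 0.5], [0.8, 0.5]])}

# Observed matrices implied by (2.5) and (2.6): L_{X,Y|Wj} = L_{X|X*} L_{X*|Wj} L_{Y|X*,Wj}.
LXY = {j: L_X @ np.diag(fXs_W[j]) @ fY_XsW[j].T for j in (0, 1)}
LXaYa = {j: L_Xa @ np.diag(fXsa_W[j]) @ fY_XsW[j].T for j in (0, 1)}

# Equation (2.7): an observed matrix whose eigenvectors are the columns of L_{Xa|Xa*}.
LXaXa = (LXaYa[1] @ np.linalg.inv(LXY[1])) @ np.linalg.inv(LXaYa[0] @ np.linalg.inv(LXY[0]))
_, eigvec = np.linalg.eig(LXaXa)

# Each eigenvector is a conditional density, so rescale columns to sum to one;
# the zero-mode normalization (diagonal entry > 0.5) then pins down the column order.
cand = eigvec / eigvec.sum(axis=0, keepdims=True)
recovered = np.column_stack(sorted([cand[:, k] for k in range(2)],
                                   key=lambda c: int(np.argmax(c))))
print(np.round(recovered, 8))  # matches L_Xa up to numerical error
```

With $L_{X_a|X_a^*}$ in hand, the two displays above then deliver $f_{Y|X^*,W_j}$ and $f_{X|X^*}$ by matrix inversion, exactly as in the text.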
2.2. The continuous latent regressor case. We are interested in identify-
ing a latent (structural) probability model: fY |X∗,W (y|x∗, w), in which Y is
a continuous dependent variable, X∗ is an unobserved continuous regressor
subject to a possibly nonclassical measurement error, and W is an accu-
rately measured discrete covariate. For example, the discrete covariate W
may stand for subpopulations with different demographic characteristics,
such as marital status, race, gender, profession, and geographic location.
Suppose the supports of $X, W, Y$, and $X^*$ are $\mathcal{X} \subseteq \mathbb{R}$, $\mathcal{W} = \{w_1, w_2, \dots, w_J\}$, $\mathcal{Y} \subseteq \mathbb{R}$, and $\mathcal{X}^* \subseteq \mathbb{R}$, respectively. We assume
Assumption 2.1. $f_{X|X^*,W,Y}(x|x^*,w,y) = f_{X|X^*}(x|x^*)$ for all $x \in \mathcal{X}$, $x^* \in \mathcal{X}^*$, $w \in \mathcal{W}$, and $y \in \mathcal{Y}$.
Assumption 2.1 implies that the measurement error in X is independent
of all other variables in the model conditional on the true value X∗. The
measurement error in X may still be correlated with the true value X∗ in
an arbitrary way, and hence is nonclassical.
Assumption 2.2. (i) $X_a^*$, $W_a$, and $Y_a$ have the same supports as $X^*$, $W$, and $Y$, respectively; (ii) $f_{X_a|X_a^*,W_a,Y_a}(x|x^*,w,y) = f_{X_a|X_a^*}(x|x^*)$ for all $x \in \mathcal{X}$, $x^* \in \mathcal{X}^*$, $w \in \mathcal{W}$, and $y \in \mathcal{Y}$.
The next condition requires that the latent structural probability model
is the same in both samples, which is a reasonable stable assumption.
Assumption 2.3. $f_{Y_a|X_a^*,W_a}(y|x^*,w) = f_{Y|X^*,W}(y|x^*,w)$ for all $x^* \in \mathcal{X}^*$, $w \in \mathcal{W}$, and $y \in \mathcal{Y}$.
We note that, under assumption 2.3, the joint distributions of the true
value X∗ and covariate W in the primary sample may still be different from
those of X∗a and Wa in the auxiliary sample.
Let $L^p(\mathcal{X})$, $1 \leq p < \infty$, denote the space of functions with $\int_{\mathcal{X}} |h(x)|^p\, dx < \infty$, and let $L^\infty(\mathcal{X})$ be the space of functions with $\sup_{x\in\mathcal{X}} |h(x)| < \infty$. Then it is clear that for any fixed $w \in \mathcal{W}$, $y \in \mathcal{Y}$, $f_{X,W,Y}(\cdot,w,y) \in L^p(\mathcal{X})$ and $f_{X^*,W,Y}(\cdot,w,y) \in L^p(\mathcal{X}^*)$ for all $1 \leq p \leq \infty$. Let $\mathcal{H}_X \subseteq L^2(\mathcal{X})$ and $\mathcal{H}_{X^*} \subseteq L^2(\mathcal{X}^*)$. Define the integral operator $L_{X|X^*} : \mathcal{H}_{X^*} \to \mathcal{H}_X$ as:

$$\{L_{X|X^*} h\}(x) = \int_{\mathcal{X}^*} f_{X|X^*}(x|x^*)\, h(x^*)\, dx^* \quad \text{for any } h \in \mathcal{H}_{X^*},\ x \in \mathcal{X}.$$
Denote $\mathcal{W}_j = \{w_j\}$ for $j = 1, \dots, J$ and define the following operators for the primary sample:

$$\left(L_{X,Y|W_j} h\right)(x) = \int f_{X,Y|W}(x,u|w_j)\, h(u)\, du,$$
$$\left(L_{Y|X^*,W_j} h\right)(x^*) = \int f_{Y|X^*,W_j}(u|x^*)\, h(u)\, du,$$
$$\left(L_{X^*|W_j} h\right)(x^*) = f_{X^*|W_j}(x^*)\, h(x^*).$$
We also define the operators $L_{X_a|X_a^*}$, $L_{X_a,Y_a|W_j}$, $L_{Y_a|X_a^*,W_j}$, and $L_{X_a^*|W_j}$ for the auxiliary sample in the same way as their counterparts for the primary sample. Notice that the operators $L_{X^*|W_j}$ and $L_{X_a^*|W_j}$ are diagonal operators. In operator notation, assumption 2.1 implies

$$L_{X,Y|W_j} = L_{X|X^*} L_{X^*|W_j} L_{Y|X^*,W_j}$$

in the primary sample; assumptions 2.2 and 2.3 imply

$$L_{X_a,Y_a|W_j} = L_{X_a|X_a^*} L_{X_a^*|W_j} L_{Y|X^*,W_j}$$

in the auxiliary sample. We assume
Assumption 2.4. $L_{X_a|X_a^*} : \mathcal{H}_{X_a^*} \to \mathcal{H}_{X_a}$ is injective, i.e., the set $\{h \in \mathcal{H}_{X_a^*} : L_{X_a|X_a^*} h = 0\} = \{0\}$.
Assumption 2.4 is equivalent to assuming that the linear operator $L_{X_a|X_a^*}$ is invertible. Recall that the conditional expectation operator of $X_a^*$ given $X_a$, $E_{X_a^*|X_a} : L^2(\mathcal{X}^*) \to L^2(\mathcal{X}_a)$, is defined as

$$\{E_{X_a^*|X_a} h_0\}(x) = \int_{\mathcal{X}^*} f_{X_a^*|X_a}(x^*|x)\, h_0(x^*)\, dx^* = E[h_0(X_a^*)|X_a = x] \quad \text{for any } h_0 \in L^2(\mathcal{X}^*),\ x \in \mathcal{X}_a.$$

We have

$$\{L_{X_a|X_a^*} h\}(x) = \int_{\mathcal{X}^*} f_{X_a|X_a^*}(x|x^*)\, h(x^*)\, dx^* = E\left[\frac{f_{X_a}(x)\, h(X_a^*)}{f_{X_a^*}(X_a^*)}\,\middle|\, X_a = x\right]$$

for any $h \in \mathcal{H}_{X_a^*}$, $x \in \mathcal{X}_a$. Assumption 2.4 is thus equivalent to: $E\left[h(X_a^*)\frac{f_{X_a}(X_a)}{f_{X_a^*}(X_a^*)}\,\middle|\, X_a\right] = 0$ implies $h = 0$. If $0 < f_{X_a^*}(x^*) < \infty$ over $\mathrm{int}(\mathcal{X}^*)$ and $0 < f_{X_a}(x) < \infty$ over $\mathrm{int}(\mathcal{X}_a)$ (which are very minor restrictions), then assumption 2.4 is the same as the identification condition imposed in Newey and Powell (2003), and Carrasco, Florens, and Renault (2006), among others. As these authors point out, this condition is implied by the completeness of the conditional density $f_{X_a^*|X_a}$, which is satisfied, for example, when $f_{X_a^*|X_a}$ belongs to an exponential family. Moreover, if we are willing to assume $\sup_{x^*,w} f_{X_a^*,W_a}(x^*,w) \leq c < \infty$, then a sufficient condition for assumption 2.4 is the bounded completeness of the conditional density $f_{X_a^*|X_a}$
; see, e.g., Blundell, Chen, and
Kristensen (2007) and Chernozhukov, Imbens, and Newey (2007). Distrib-
utions that are complete are automatically bounded complete. There are
much larger families of distributions that are bounded complete (and some
of them may not be complete). See, e.g., Lehmann (1986, page 173), Hoeffd-
ing (1977) and Mattner (1993) for many examples. When Xa and X∗a are
discrete, assumption 2.4 requires that the support of Xa is not smaller than
that of X∗a .
Assumption 2.5. (i) $f_{X^*|W_j} > 0$ and $f_{X_a^*|W_j} > 0$; (ii) $L_{X,Y|W_j}$ is injective; (iii) $L_{X_a,Y_a|W_j}$ is injective.
Assumptions 2.4 and 2.5 imply that $L_{Y|X^*,W_j}$ and $L_{X|X^*}$ are invertible. In the Appendix we establish the diagonalization of an observed operator $L_{X_a,X_a}^{ij}$:

$$L_{X_a,X_a}^{ij} = L_{X_a|X_a^*}\, L_{X_a^*}^{ij}\, L_{X_a|X_a^*}^{-1} \quad \text{for all } i, j,$$

where the operator $L_{X_a^*}^{ij} \equiv \left(L_{X_a^*|W_j} L_{X^*|W_j}^{-1} L_{X^*|W_i} L_{X_a^*|W_i}^{-1}\right)$ is a diagonal operator defined as $\left(L_{X_a^*}^{ij} h\right)(x^*) = k_{X_a^*}^{ij}(x^*)\, h(x^*)$ with

$$k_{X_a^*}^{ij}(x^*) \equiv \frac{f_{X_a^*|W_j}(x^*)\, f_{X^*|W_i}(x^*)}{f_{X^*|W_j}(x^*)\, f_{X_a^*|W_i}(x^*)}.$$
In order to show the identification of $f_{X_a|X_a^*}$ and $k_{X_a^*}^{ij}(x^*)$, we assume

Assumption 2.6. $\sup_{x^*\in\mathcal{X}^*} k_{X_a^*}^{ij}(x^*) < \infty$ for all $i, j \in \{1, 2, \dots, J\}$.
Notice that the subsets W1,W2, ...,WJ ⊂ W do not need to be collectively
exhaustive. We may only consider those subsets in W in which these as-
sumptions are satisfied.
Assumption 2.7. For any $x_1^* \neq x_2^*$, there exist $i, j \in \{1, 2, \dots, J\}$ such that $k_{X_a^*}^{ij}(x_1^*) \neq k_{X_a^*}^{ij}(x_2^*)$.
Assumption 2.7 implies that, for any two different eigenfunctions $f_{X_a|X_a^*}(\cdot|x_1^*)$ and $f_{X_a|X_a^*}(\cdot|x_2^*)$, one can always find two subsets $\mathcal{W}_j$ and $\mathcal{W}_i$ such that the two different eigenfunctions correspond to two different eigenvalues $k_{X_a^*}^{ij}(x_1^*)$ and $k_{X_a^*}^{ij}(x_2^*)$ and, therefore, are identified. Although there may exist duplicate eigenvalues in each decomposition corresponding to a pair of $i$ and $j$, this assumption guarantees that each eigenfunction $f_{X_a|X_a^*}(\cdot|x^*)$ is uniquely determined by combining all the information from a series of decompositions of $L_{X_a,X_a}^{ij}$ for $i, j \in \{1, 2, \dots, J\}$.

We now provide an example of the marginal distribution of $X^*$ to illustrate that assumptions 2.6 and 2.7 are easily satisfied. Suppose that the distribution of $X^*$ in the primary sample is the standard normal, i.e., $f_{X^*|w_j}(x^*) = \psi(x^*)$ for $j = 1, 2, 3$, where $\psi$ is the probability density function of the standard normal, and the distribution of $X_a^*$ in the auxiliary
sample is, for $0 < \sigma < 1$ and $\mu \neq 0$,

$$f_{X_a^*|w_j}(x^*) = \begin{cases} \psi(x^*) & \text{for } j = 1, \\ \sigma^{-1}\psi\left(\sigma^{-1}x^*\right) & \text{for } j = 2, \\ \psi(x^* - \mu) & \text{for } j = 3. \end{cases} \tag{2.9}$$
It is obvious that assumption 2.6 is satisfied with

$$k_{X_a^*}^{ij}(x^*) = \begin{cases} \sigma^{-1}\exp\left(\dfrac{1-\sigma^{-2}}{2}(x^*)^2\right) & \text{for } i = 1,\ j = 2, \\[2ex] \dfrac{\psi(x^*-\mu)}{\psi(x^*)} & \text{for } i = 1,\ j = 3. \end{cases} \tag{2.10}$$
For $i = 1, j = 2$, any two eigenvalues $k_{X_a^*}^{12}(x_1^*)$ and $k_{X_a^*}^{12}(x_2^*)$ of $L_{X_a,X_a}^{12}$ may be the same if and only if $x_1^* = -x_2^*$. In other words, we cannot distinguish the eigenfunctions $f_{X_a|X_a^*}(\cdot|x_1^*)$ and $f_{X_a|X_a^*}(\cdot|x_2^*)$ in the decomposition of $L_{X_a,X_a}^{12}$ if and only if $x_1^* = -x_2^*$. Since $k_{X_a^*}^{ij}(x^*)$ for $i = 1, j = 3$ is not symmetric around zero, the eigenvalues $k_{X_a^*}^{13}(x_1^*)$ and $k_{X_a^*}^{13}(x_2^*)$ of $L_{X_a,X_a}^{13}$ are different for any $x_1^* = -x_2^*$. Notice that the operators $L_{X_a,X_a}^{12}$ and $L_{X_a,X_a}^{13}$ share the same eigenfunctions $f_{X_a|X_a^*}(\cdot|x_1^*)$ and $f_{X_a|X_a^*}(\cdot|x_2^*)$. Therefore, we may distinguish the eigenfunctions $f_{X_a|X_a^*}(\cdot|x_1^*)$ and $f_{X_a|X_a^*}(\cdot|x_2^*)$ with $x_1^* = -x_2^*$ in the decomposition of $L_{X_a,X_a}^{13}$. By combining the information obtained from the decompositions of $L_{X_a,X_a}^{12}$ and $L_{X_a,X_a}^{13}$, we can distinguish the eigenfunctions corresponding to any two different values of $x^*$.
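A quick numerical check of this argument (our illustration, with assumed values $\sigma = 0.5$ and $\mu = 1$) confirms that $k_{X_a^*}^{12}$ is symmetric in $x^*$ while $k_{X_a^*}^{13}$ is not:

```python
import numpy as np
from scipy.stats import norm

sigma, mu = 0.5, 1.0   # assumed values with 0 < sigma < 1 and mu != 0

def k12(x):            # eigenvalue of L^{12}_{Xa,Xa}: a function of (x*)^2 only
    return norm.pdf(x / sigma) / (sigma * norm.pdf(x))

def k13(x):            # eigenvalue of L^{13}_{Xa,Xa}: not symmetric around zero
    return norm.pdf(x - mu) / norm.pdf(x)

x = 0.7
print(np.isclose(k12(x), k12(-x)))   # True: L^{12} alone cannot separate x* = +/-0.7
print(np.isclose(k13(x), k13(-x)))   # False: L^{13} distinguishes the pair
```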
Remark 2.1. (1) Assumption 2.7 does not hold if $f_{X^*|W=w_j}(x^*) = f_{X_a^*|W=w_j}(x^*)$ for all $w_j$ and all $x^* \in \mathcal{X}^*$. This assumption requires that the two samples be from different populations. Given assumption 2.3 and the invertibility of the operator $L_{Y|X^*,W_j}$, one could check assumption 2.7 from the observed densities $f_{Y|W=w_j}$ and $f_{Y_a|W_a=w_j}$. In particular, if $f_{Y|W=w_j}(y) = f_{Y_a|W_a=w_j}(y)$ for all $w_j$ and all $y \in \mathcal{Y}$, then assumption 2.7 is not satisfied. (2) Assumption 2.7 does not hold if $f_{X^*|W=w_j}(x^*) = f_{X^*|W=w_i}(x^*)$ and $f_{X_a^*|W_a=w_j}(x^*) = f_{X_a^*|W_a=w_i}(x^*)$ for all $w_j \neq w_i$ and all $x^* \in \mathcal{X}^*$. This means that the marginal distribution of $X^*$ or $X_a^*$ should be different in the subsamples corresponding to different $w_j$ in at least one of the two samples. For example, if $X^*$ or $X_a^*$ are earnings and $w_j$ corresponds to gender, then assumption 2.7 requires that the earnings distribution of males be different from that of females in one of the samples (either the primary or the auxiliary). Given the invertibility of the operators $L_{X|X^*}$ and $L_{X_a|X_a^*}$, one could check assumption 2.7 from the observed densities $f_{X|W=w_j}$ and $f_{X_a|W_a=w_j}$. In particular, if $f_{X|W=w_j}(x) = f_{X|W=w_i}(x)$ for all $w_j \neq w_i$ and all $x \in \mathcal{X}$, then assumption 2.7 requires the existence of an auxiliary sample such that $f_{X_a|W_a=w_j}(X_a) \neq f_{X_a|W_a=w_i}(X_a)$ with positive probability for some $w_j \neq w_i$.
In order to fully identify each eigenfunction, i.e., $f_{X_a|X_a^*}$, we need to identify the exact value of $x^*$ in each eigenfunction $f_{X_a|X_a^*}(\cdot|x^*)$. Notice that the eigenfunction $f_{X_a|X_a^*}(\cdot|x^*)$ is identified up to the value of $x^*$. In other words, we have identified a probability density of $X_a$ conditional on $X_a^* = x^*$ with the value of $x^*$ unknown. An intuitive normalization assumption is that the value of $x^*$ is the mean of this identified probability density, i.e., $x^* = \int x\, f_{X_a|X_a^*}(x|x^*)\, dx$; this assumption implies that the measurement error in the auxiliary sample has zero mean conditional on the latent true values. An alternative normalization assumption is that the value of $x^*$ is the mode of this identified probability density, i.e., $x^* = \arg\max_x f_{X_a|X_a^*}(x|x^*)$; this assumption implies that the error distribution conditional on the latent true values has zero mode. The intuition behind this assumption is that people are more willing to report some values close to the latent true values than they are to report those far from the truth. Another normalization assumption may be that the value of $x^*$ is the median of the identified probability density, i.e., $x^* = \inf\left\{z : \int_{-\infty}^{z} f_{X_a|X_a^*}(x|x^*)\, dx \geq \frac{1}{2}\right\}$; this assumption implies that the error distribution conditional on the latent true values has zero median, and that people have the same probability of overreporting as that of underreporting. Obviously, the zero median condition can be generalized to an assumption that the error distribution conditional on the latent true values has a zero quantile.
Assumption 2.8. One of the following holds for all $x^* \in \mathcal{X}^*$: (i) (mean) $\int x\, f_{X_a|X_a^*}(x|x^*)\, dx = x^*$; or (ii) (mode) $\arg\max_x f_{X_a|X_a^*}(x|x^*) = x^*$; or (iii) (quantile) there is a $\gamma \in (0,1)$ such that $\inf\left\{z : \int_{-\infty}^{z} f_{X_a|X_a^*}(x|x^*)\, dx \geq \gamma\right\} = x^*$.
Assumption 2.8 requires that the support of Xa not be smaller than that
of X∗a , and that, although the measurement error in the auxiliary sample
(Xa−X∗a) could be nonclassical, it needs to satisfy some location regularity
such as zero conditional mean, or zero conditional mode or zero conditional
median. Recall that, in the dichotomous case, assumption 2.8 with zero
conditional mode also implies the invertibility of LXa|X∗a (i.e., assumption
2.4). However, this is not true in the general discrete case. For the general
discrete case, a comparable sufficient condition for the invertibility of LXa|X∗a is strict diagonal dominance (i.e., the diagonal entries of LXa|X∗a are all
larger than 0.5), but assumption 2.8 with zero mode only requires that
the diagonal entries of LXa|X∗a be the largest in each row, which cannot
guarantee the invertibility of LXa|X∗a when the support of X∗a contains more
than 2 values.
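The following small check (our construction, not an example from the paper) exhibits a 3-point-support misreporting matrix whose diagonal entry is the mode of every column, as the zero-mode normalization requires, yet which is singular because strict diagonal dominance fails:

```python
import numpy as np

# Columns are x* in {0, 1, 2}; each column is a density over reported values x,
# and each column's mode sits on the diagonal (zero conditional mode).
L = np.array([[0.40, 0.21, 0.305],
              [0.21, 0.40, 0.305],
              [0.39, 0.39, 0.390]])   # third column = average of the first two

print(L.sum(axis=0))             # [1. 1. 1.]: valid conditional densities
print(L.argmax(axis=0))          # [0 1 2]: diagonal entries are the modes
print(np.linalg.matrix_rank(L))  # 2: L is singular, hence not invertible
```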
We obtain the following identification result.
Theorem 2.1. Suppose assumptions 2.1—2.8 hold. Then the densities $f_{X,W,Y}$ and $f_{X_a,W_a,Y_a}$ uniquely determine $f_{Y|X^*,W}$, $f_{X|X^*}$, and $f_{X_a|X_a^*}$.
Remark 2.2. (1) When there exist extra common covariates in the two samples, we may consider more generally defined W and Wa, or relax assumptions on the error distributions in the auxiliary sample. On the one hand, the identification theorem still holds when we replace W and Wa with a scalar measurable function of W and Wa, respectively. On the other hand, we may relax assumptions 2.1 and 2.2(ii) to allow the error distributions to be conditional on the true values and the extra common covariates. (2) The identification theorem does not require that the two samples be independent of each other.
3. Sieve Quasi Likelihood Estimation and Inference. Our identification result is very general and does not require the two samples to be independent. However, for many applications, it is reasonable to assume that there are two random samples $\{X_i, W_i, Y_i\}_{i=1}^{n}$ and $\{X_{aj}, W_{aj}, Y_{aj}\}_{j=1}^{n_a}$ that are mutually independent.
As shown in Section 2, the densities fY |X∗,W , fX|X∗ , fX∗|W , fXa|X∗a , and
fX∗a |Waare nonparametrically identified under assumptions 2.1—2.8. Never-
theless, in empirical studies, we typically have either a semiparametric or a
parametric specification of the conditional density fY |X∗,W as the model of
interest. In this section, we treat the other densities fX|X∗ , fX∗|W , fXa|X∗a ,
and fX∗a |Waas unknown nuisance functions, but consider a parametrically
specified conditional density of Y given (X∗,W ):
$\{g(y|x^*,w;\theta) : \theta \in \Theta\}$, $\Theta$ a compact subset of $\mathbb{R}^{d_\theta}$, $1 \leq d_\theta < \infty$.
Define

$$\theta_0 \equiv \arg\max_{\theta\in\Theta} \int [\log g(y|x^*,w;\theta)]\, f_{Y|X^*,W}(y|x^*,w)\, dy.$$
The latent parametric model is correctly specified if $g(y|x^*,w;\theta_0) = f_{Y|X^*,W}(y|x^*,w)$ for almost all $y, x^*, w$ (and $\theta_0$ is called the true parameter value); otherwise it is misspecified (and $\theta_0$ is called the pseudo-true parameter value); see, e.g., White (1982).
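As a toy illustration of a pseudo-true value (our example, not the paper's): fit a unit-variance normal location model to a truth whose variance is wrong; the KLIC-best $\theta_0$ still recovers the true location:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Pseudo-true value under misspecification (illustrative assumed densities):
# truth is N(0.3, sd = 2); the misspecified model is g(y; theta) = N(theta, 1).
y, dy = np.linspace(-12, 12, 4001, retstep=True)
f_true = norm.pdf(y, loc=0.3, scale=2.0)

def neg_expected_loglik(theta):
    # -E[log g(Y; theta)] approximated on a grid
    return -np.sum(np.log(norm.pdf(y, loc=theta)) * f_true) * dy

res = minimize_scalar(neg_expected_loglik, bounds=(-5, 5), method="bounded")
print(res.x)  # ~0.3: the KLIC-best location, even though the variance is wrong
```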
In this section we provide a root-n consistent and asymptotically nor-
mally distributed sieve MLE of θ0, regardless of whether the latent para-
metric model g(y|x∗, w; θ) is correctly specified or not. When g(y|x∗, w; θ) is misspecified, the estimator is better called the "sieve quasi MLE" than the "sieve MLE." (In this paper we have used both terminologies since we allow the latent model g(y|x∗, w; θ) to either correctly or incorrectly specify the true latent conditional density fY |X∗,W .) Under the correct specifica-
tion of the latent model, we show that the sieve MLE of θ0 is automatically
semiparametrically efficient, and provide a simple consistent estimator of its
asymptotic variance. In addition, we provide a sieve likelihood ratio model
selection test of two non-nested parametric specifications of fY |X∗,W when
both could be misspecified.
3.1. Sieve likelihood estimation under possible misspecification. Let $\alpha_0 \equiv (\theta_0^T, f_{01}, f_{01a}, f_{02}, f_{02a})^T \equiv (\theta_0^T, f_{X|X^*}, f_{X_a|X_a^*}, f_{X^*|W}, f_{X_a^*|W_a})^T$ denote the true parameter value, in which $\theta_0$ is really "pseudo-true" when the parametric model $g(y|x^*,w;\theta)$ is incorrectly specified for the unknown true density $f_{Y|X^*,W}$. We introduce a dummy random variable S, with S = 1 indicating
the primary sample and S = 0 indicating the auxiliary sample. Then we
have a big combined sample

$$\left\{ Z_t^T \equiv \left(S_t X_t,\ S_t W_t,\ S_t Y_t,\ S_t,\ (1-S_t)X_t,\ (1-S_t)W_t,\ (1-S_t)Y_t\right) \right\}_{t=1}^{n+n_a}$$

such that $\{X_t, W_t, Y_t, S_t = 1\}_{t=1}^{n}$ is the primary sample and $\{X_t, W_t, Y_t, S_t = 0\}_{t=n+1}^{n+n_a}$ is the auxiliary sample. Denote $p \equiv \Pr(S_t = 1) \in (0,1)$. Then the observed joint likelihood for $\alpha_0$ is

$$\prod_{t=1}^{n+n_a}\left[p \times f(X_t,W_t,Y_t|S_t=1;\alpha_0)\right]^{S_t}\left[(1-p) \times f(X_t,W_t,Y_t|S_t=0;\alpha_0)\right]^{1-S_t},$$
in which

$$f(X,W,Y|S=1;\alpha_0) = f_W(W)\int f_{01}(X|x^*)\, g(Y|x^*,W;\theta_0)\, f_{02}(x^*|W)\, dx^*,$$
$$f(X,W,Y|S=0;\alpha_0) = f_{W_a}(W)\int f_{01a}(X|x^*)\, g(Y|x^*,W;\theta_0)\, f_{02a}(x^*|W)\, dx^*.$$
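Numerically, each observed-density evaluation above is a one-dimensional integral over $x^*$. A minimal sketch (with our own placeholder Gaussian choices for $f_{01}$ and $g$, and assuming $f_{02}(\cdot|w)$ is standard normal) evaluates it by Gauss-Hermite quadrature:

```python
import numpy as np
from scipy.stats import norm

# Gauss-Hermite_e nodes/weights integrate against the weight exp(-t^2/2).
nodes, weights = np.polynomial.hermite_e.hermegauss(40)

f01 = lambda x, xs: norm.pdf(x, loc=0.1 * xs, scale=0.6)  # placeholder error density
g = lambda y, xs, w: norm.pdf(y, loc=xs + xs * w)         # placeholder latent model
# f02(x*|w) = standard normal = exp(-t^2/2)/sqrt(2*pi): absorbed into the weights

def f_obs(x, y, w):
    # \int f01(x|x*) g(y|x*,w) f02(x*|w) dx*  ~  sum_i w_i f01(x|t_i) g(y|t_i,w) / sqrt(2*pi)
    return np.dot(weights, f01(x, nodes) * g(y, nodes, w)) / np.sqrt(2 * np.pi)

print(f_obs(x=0.2, y=1.0, w=1))  # one likelihood contribution f(X,W,Y|S=1)/f_W(W)
```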
Before we present a sieve (quasi-)MLE estimator $\hat{\alpha}$ for $\alpha_0$, we need to impose some mild smoothness restrictions on the unknown densities. The sieve method allows for unknown functions belonging to many different function spaces such as Sobolev space, Besov space, and others; see, e.g., Shen and Wong (1994) and Van de Geer (1993, 2000). But for the sake of concreteness and simplicity, we consider the widely used Holder space of functions. Let $\xi = (\xi_1,\xi_2)^T \in \mathbb{R}^2$, $a = (a_1,a_2)^T$, and let $\nabla^a h(\xi) \equiv \frac{\partial^{a_1+a_2} h(\xi_1,\xi_2)}{\partial\xi_1^{a_1}\,\partial\xi_2^{a_2}}$ denote the $(a_1+a_2)$-th derivative. Let $\|\cdot\|_E$ denote the Euclidean norm. Let $\mathcal{V} \subseteq \mathbb{R}^2$ and let $\underline{\gamma}$ be the largest integer satisfying $\gamma > \underline{\gamma}$. The Holder space $\Lambda^\gamma(\mathcal{V})$ of order $\gamma > 0$ is a space of functions $h : \mathcal{V} \mapsto \mathbb{R}$ such that the first $\underline{\gamma}$ derivatives are continuous and bounded, and the $\underline{\gamma}$-th derivative is Holder continuous with the exponent $\gamma - \underline{\gamma} \in (0,1]$. The Holder space $\Lambda^\gamma(\mathcal{V})$ becomes a Banach space under the Holder norm:

$$\|h\|_{\Lambda^\gamma} = \max_{a_1+a_2\leq\underline{\gamma}}\sup_\xi|\nabla^a h(\xi)| + \max_{a_1+a_2=\underline{\gamma}}\sup_{\xi\neq\xi'}\frac{\left|\nabla^a h(\xi) - \nabla^a h(\xi')\right|}{\left(\|\xi-\xi'\|_E\right)^{\gamma-\underline{\gamma}}} < \infty.$$
We define a Holder ball as $\Lambda_c^\gamma(\mathcal{V}) \equiv \{h \in \Lambda^\gamma(\mathcal{V}) : \|h\|_{\Lambda^\gamma} \leq c < \infty\}$. Denote

$$\mathcal{F}_1 = \left\{ f_1(\cdot|\cdot) \in \Lambda_c^{\gamma_1}(\mathcal{X}\times\mathcal{X}^*) : f_1(\cdot|x^*) > 0,\ \int_{\mathcal{X}} f_1(x|x^*)\, dx = 1 \text{ for all } x^* \in \mathcal{X}^* \right\},$$

$$\mathcal{F}_{1a} = \left\{ f_{1a}(\cdot|\cdot) \in \Lambda_c^{\gamma_{1a}}(\mathcal{X}_a\times\mathcal{X}^*) : \text{assumptions 2.4, 2.8 hold},\ f_{1a}(\cdot|x^*) > 0,\ \int_{\mathcal{X}_a} f_{1a}(x|x^*)\, dx = 1 \text{ for all } x^* \in \mathcal{X}^* \right\},$$

$$\mathcal{F}_2 = \left\{ f_2(\cdot|w) \in \Lambda_c^{\gamma_2}(\mathcal{X}^*) : \text{assumptions 2.6, 2.7 hold},\ f_2(\cdot|w) > 0,\ \int_{\mathcal{X}^*} f_2(x^*|w)\, dx^* = 1 \text{ for all } w \in \mathcal{W} \right\}.$$
We impose the following smoothness restrictions on the densities:

Assumption 3.1. (i) All the assumptions in theorem 2.1 hold; (ii) $f_{X|X^*}(\cdot|\cdot) \in \mathcal{F}_1$ with $\gamma_1 > 1$; (iii) $f_{X_a|X_a^*}(\cdot|\cdot) \in \mathcal{F}_{1a}$ with $\gamma_{1a} > 1$; (iv) $f_{X^*|W}(\cdot|w),\ f_{X_a^*|W_a}(\cdot|w) \in \mathcal{F}_2$ with $\gamma_2 > 1/2$ for all $w \in \mathcal{W}$.
Denote $\mathcal{A} = \Theta\times\mathcal{F}_1\times\mathcal{F}_{1a}\times\mathcal{F}_2\times\mathcal{F}_2$ and $\alpha = \left(\theta^T, f_1, f_{1a}, f_2, f_{2a}\right)^T$. Then the log-joint likelihood for $\alpha\in\mathcal{A}$ is given by:

$$\sum_{t=1}^{n+n_a}\left\{S_t\ln\left[p\times f(X_t,W_t,Y_t|S_t=1;\alpha)\right] + (1-S_t)\ln\left[(1-p)\times f(X_t,W_t,Y_t|S_t=0;\alpha)\right]\right\} = n\ln p + n_a\ln(1-p) + \sum_{t=1}^{n+n_a}\ell(Z_t;\alpha),$$

in which

$$\ell(Z_t;\alpha) \equiv S_t\,\ell_p(Z_t;\theta,f_1,f_2) + (1-S_t)\,\ell_a(Z_t;f_{1a},f_{2a}),$$
$$\ell_p(Z_t;\theta,f_1,f_2) = \ln\int f_1(X_t|x^*)\, g(Y_t|x^*,W_t;\theta)\, f_2(x^*|W_t)\, dx^* + \ln f_W(W_t),$$
$$\ell_a(Z_t;f_{1a},f_{2a}) = \ln\int f_{1a}(X_t|x_a^*)\, g(Y_t|x_a^*,W_t;\theta)\, f_{2a}(x_a^*|W_t)\, dx_a^* + \ln f_{W_a}(W_t).$$
Let $E[\cdot]$ denote the expectation with respect to the underlying true data generating process for $Z_t$. To stress that our combined data set consists of two samples, sometimes we let $Z_{pi} = (X_i,W_i,Y_i)^T$ denote the $i$-th observation in the primary data set, and $Z_{aj} = (X_{aj},W_{aj},Y_{aj})^T$ denote the $j$-th observation in the auxiliary data set. Then

$$\alpha_0 = \arg\sup_{\alpha\in\mathcal{A}} E\left[\ell(Z_t;\alpha)\right] = \arg\sup_{\alpha\in\mathcal{A}}\left[p\, E\{\ell_p(Z_{pi};\theta,f_1,f_2)\} + (1-p)\, E\{\ell_a(Z_{aj};f_{1a},f_{2a})\}\right].$$
Let $\mathcal{A}_n = \Theta\times\mathcal{F}_1^n\times\mathcal{F}_{1a}^n\times\mathcal{F}_2^n\times\mathcal{F}_2^n$ be a sieve space for $\mathcal{A}$, which is a sequence of approximating spaces that are dense in $\mathcal{A}$ under some pseudo-metric. The two-sample sieve quasi-MLE $\hat{\alpha}_n = \left(\hat{\theta}^T, \hat{f}_1, \hat{f}_{1a}, \hat{f}_2, \hat{f}_{2a}\right)^T\in\mathcal{A}_n$ for $\alpha_0\in\mathcal{A}$ is defined as:

$$\hat{\alpha}_n = \arg\max_{\alpha\in\mathcal{A}_n}\sum_{t=1}^{n+n_a}\ell(Z_t;\alpha) = \arg\max_{\alpha\in\mathcal{A}_n}\left[\sum_{i=1}^{n}\ell_p(Z_{pi};\theta,f_1,f_2) + \sum_{j=1}^{n_a}\ell_a(Z_{aj};f_{1a},f_{2a})\right].$$

We could apply infinite-dimensional approximating spaces as sieves $\mathcal{F}_j^n$
for Fj , j = 1, 1a, 2. However, in applications we shall use finite-dimensional
sieve spaces since they are easier to implement. For $j = 1, 1a, 2$, let $p_j^{k_{j,n}}(\cdot)$ be a $k_{j,n}\times 1$ vector of known basis functions, such as power series, splines, Fourier series, wavelets, Hermite polynomials, etc. Then we denote the sieve spaces for $\mathcal{F}_1$, $\mathcal{F}_{1a}$, and $\mathcal{F}_2$ as follows:

$$\mathcal{F}_1^n = \left\{ f_1(x|x^*) = p_1^{k_{1,n}}(x,x^*)^T\beta_1 \in \mathcal{F}_1 \right\},$$
$$\mathcal{F}_{1a}^n = \left\{ f_{1a}(x_a|x_a^*) = p_{1a}^{k_{1a,n}}(x_a,x_a^*)^T\beta_{1a} \in \mathcal{F}_{1a} \right\},$$
$$\mathcal{F}_2^n = \left\{ f_2(x^*|w) = \sum_{j=1}^{J} 1(w=w_j)\, p_2^{k_{2,n}}(x^*)^T\beta_{2j} \in \mathcal{F}_2 \right\}.$$
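In practice each sieve coefficient vector must respect the positivity and unit-mass constraints built into $\mathcal{F}_2$. One common device for doing so (a sketch under our own parameterization, not the paper's exact linear-in-$\beta$ form) squares a Hermite expansion against a Gaussian base density and rescales:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from scipy.integrate import quad

def sieve_density(beta):
    # (Hermite expansion)^2 * N(0,1) base, rescaled to integrate to one:
    # positivity and unit mass then hold by construction.
    base = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
    unnormalized = lambda x: hermeval(x, beta) ** 2 * base(x)
    mass, _ = quad(unnormalized, -np.inf, np.inf)
    return lambda x: unnormalized(x) / mass

f2_hat = sieve_density(beta=np.array([1.0, 0.3, -0.1]))  # k_{2,n} = 3 coefficients
print(quad(f2_hat, -np.inf, np.inf)[0])                  # ~1.0
```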
3.1.1. Consistency. Define a norm on $\mathcal{A}$ as: $\|\alpha\|_s = \|\theta\|_E + \|f_1\|_{\infty,\omega_1} + \|f_{1a}\|_{\infty,\omega_{1a}} + \|f_2\|_{\infty,\omega_2} + \|f_{2a}\|_{\infty,\omega_2}$, in which $\|h\|_{\infty,\omega_j} \equiv \sup_\xi|h(\xi)\,\omega_j(\xi)|$ with $\omega_j(\xi) = \left(1+\|\xi\|_E^2\right)^{-\varsigma_j/2}$, $\varsigma_j > 0$ for $j = 1, 1a, 2$. We assume each of $\mathcal{X}$, $\mathcal{X}_a$, $\mathcal{X}^*$ is $\mathbb{R}$, and
Assumption 3.2. (i) $\{X_i,W_i,Y_i\}_{i=1}^{n}$ and $\{X_{aj},W_{aj},Y_{aj}\}_{j=1}^{n_a}$ are i.i.d. and independent of each other. In addition, $\lim_{n\to\infty}\frac{n}{n+n_a} = p \in (0,1)$; (ii) $g(y|x^*,w;\theta)$ is continuous in $\theta\in\Theta$, and $\Theta$ is a compact subset of $\mathbb{R}^{d_\theta}$; and (iii) $\theta_0\in\Theta$ is the unique maximizer of $\int [\log g(y|x^*,w;\theta)]\, f_{Y|X^*,W}(y|x^*,w)\, dy$ over $\theta\in\Theta$.
Assumption 3.3. (i) $-\infty < E[\ell(Z_t;\alpha_0)] < \infty$, and $E[\ell(Z_t;\alpha)]$ is upper semicontinuous on $\mathcal{A}$ under the metric $\|\cdot\|_s$; (ii) there is a finite $\kappa > 0$ and a random variable $U(Z_t)$ with $E\{U(Z_t)\} < \infty$ such that $\sup_{\alpha\in\mathcal{A}_n : \|\alpha-\alpha_0\|_s\leq\delta}|\ell(Z_t;\alpha) - \ell(Z_t;\alpha_0)| \leq \delta^\kappa U(Z_t)$.
Assumption 3.4. (i) $p_2^{k_{2,n}}(\cdot)$ is a $k_{2,n}\times 1$ vector of spline wavelet basis functions on $\mathbb{R}$, and for $j = 1, 1a$, $p_j^{k_{j,n}}(\cdot,\cdot)$ is a $k_{j,n}\times 1$ vector of tensor products of spline wavelet basis functions on $\mathbb{R}^2$; (ii) $k_n \equiv \max\{k_{1,n}, k_{1a,n}, k_{2,n}\}\to\infty$ and $k_n/n\to 0$.
Assumption 3.2(i) is a typical condition used in cross-sectional analyses
with two samples. Assumption 3.2(ii—iii) are typical conditions for paramet-
ric (quasi-) MLE of θ0 if X∗ could be observed without error. Assumption
3.3(ii) requires that the log density be Holder continuous under the metric
$\|\cdot\|_s$ over the sieve space. The following consistency lemma is a direct application of lemma A.1 of Newey and Powell (2003) or theorem 3.1 (or remark 3.1(4), remark 3.3) of Chen (2006); hence, we omit its proof.

Lemma 3.1. Let $\hat{\alpha}_n$ be the two-sample sieve MLE. Under assumptions 3.1—3.4, we have $\|\hat{\alpha}_n - \alpha_0\|_s = o_p(1)$.
3.1.2. Convergence rate under a weaker metric. Given Lemma 3.1, we can now restrict our attention to a shrinking $\|\cdot\|_s$-neighborhood around $\alpha_0$. Let $\mathcal{A}_{0s} \equiv \{\alpha\in\mathcal{A} : \|\alpha-\alpha_0\|_s = o(1),\ \|\alpha\|_s \leq c_0 < c\}$ and $\mathcal{A}_{0sn} \equiv \{\alpha\in\mathcal{A}_n : \|\alpha-\alpha_0\|_s = o(1),\ \|\alpha\|_s \leq c_0 < c\}$. Then, for the purpose of establishing a convergence rate under a pseudo metric that is weaker than $\|\cdot\|_s$, we can treat $\mathcal{A}_{0s}$ as the new parameter space and $\mathcal{A}_{0sn}$ as its sieve space, and assume that both $\mathcal{A}_{0s}$ and $\mathcal{A}_{0sn}$ are convex parameter spaces. For any $\alpha_1,\alpha_2\in\mathcal{A}_{0s}$, we consider a continuous path $\{\alpha(\tau) : \tau\in[0,1]\}$ in $\mathcal{A}_{0s}$ such that $\alpha(0) = \alpha_1$ and $\alpha(1) = \alpha_2$. For simplicity we assume that for any $\alpha, \alpha+v\in\mathcal{A}_{0s}$, $\{\alpha+\tau v : \tau\in[0,1]\}$ is a continuous path in $\mathcal{A}_{0s}$, and that $\ell(Z_t;\alpha+\tau v)$ is twice continuously differentiable at $\tau = 0$ for almost all $Z_t$ and any direction $v\in\mathcal{A}_{0s}$. We define the pathwise first derivative as

$$\frac{d\ell(Z_t;\alpha)}{d\alpha}[v] \equiv \frac{d\ell(Z_t;\alpha+\tau v)}{d\tau}\bigg|_{\tau=0} \quad \text{a.s. } Z_t,$$

and the pathwise second derivative as

$$\frac{d^2\ell(Z_t;\alpha)}{d\alpha\, d\alpha^T}[v,v] \equiv \frac{d^2\ell(Z_t;\alpha+\tau v)}{d\tau^2}\bigg|_{\tau=0} \quad \text{a.s. } Z_t.$$

Following Ai and Chen (2007), for any $\alpha_1,\alpha_2\in\mathcal{A}_{0s}$, we define a pseudo metric $\|\cdot\|_2$ as follows:

$$\|\alpha_1-\alpha_2\|_2 \equiv \sqrt{-E\left(\frac{d^2\ell(Z_t;\alpha_0)}{d\alpha\, d\alpha^T}[\alpha_1-\alpha_2,\ \alpha_1-\alpha_2]\right)}.$$
We show that $\hat{\alpha}_n$ converges to $\alpha_0$ at a rate faster than $n^{-1/4}$ under the pseudo metric $\|\cdot\|_2$ and the following assumptions:

Assumption 3.5. (i) $\varsigma_j > \gamma_j$ for $j = 1, 1a, 2$; (ii) $k_n^{-\gamma} = o([n+n_a]^{-1/4})$ with $\gamma \equiv \min\{\gamma_1/2, \gamma_{1a}/2, \gamma_2\} > 1/2$.

Assumption 3.6. (i) $\mathcal{A}_{0s}$ is convex at $\alpha_0$ and $\theta_0 \in \mathrm{int}(\Theta)$; (ii) $\ell(Z_t;\alpha)$ is twice continuously pathwise differentiable with respect to $\alpha\in\mathcal{A}_{0s}$, and $\log g(y|x^*,w;\theta)$ is twice continuously differentiable at $\theta_0$.
Assumption 3.7. $\sup_{\tilde{\alpha}\in\mathcal{A}_{0s}}\sup_{\alpha\in\mathcal{A}_{0sn}}\left|\frac{d\ell(Z_t;\tilde{\alpha})}{d\alpha}\left[\frac{\alpha-\alpha_0}{\|\alpha-\alpha_0\|_s}\right]\right| \leq U(Z_t)$ for a random variable $U(Z_t)$ with $E\{[U(Z_t)]^2\} < \infty$.

Assumption 3.8. (i) $\sup_{v\in\mathcal{A}_{0s}:\|v\|_s=1} -E\left(\frac{d^2\ell(Z_t;\alpha_0)}{d\alpha\, d\alpha^T}[v,v]\right) \leq C < \infty$; (ii) uniformly over $\tilde{\alpha}\in\mathcal{A}_{0s}$ and $\alpha\in\mathcal{A}_{0sn}$, we have

$$-E\left(\frac{d^2\ell(Z_t;\tilde{\alpha})}{d\alpha\, d\alpha^T}[\alpha-\alpha_0,\ \alpha-\alpha_0]\right) = \|\alpha-\alpha_0\|_2^2\times\{1+o(1)\}.$$
Assumption 3.5 guarantees that the sieve approximation error under the strong norm $\|\cdot\|_s$ goes to zero faster than $[n+n_a]^{-1/4}$. Assumption 3.6 makes sure that the twice pathwise derivatives are well defined with respect to $\alpha\in\mathcal{A}_{0s}$; hence, the pseudo metric $\|\alpha-\alpha_0\|_2$ is well defined on $\mathcal{A}_{0s}$. Assumption 3.7 imposes an envelope condition. Assumption 3.8(i) implies that $\|\alpha-\alpha_0\|_2 \leq \sqrt{C}\,\|\alpha-\alpha_0\|_s$ for all $\alpha\in\mathcal{A}_{0s}$. Assumption 3.8(ii) implies that there are positive finite constants $C_1$ and $C_2$ such that for all $\alpha\in\mathcal{A}_{0sn}$, $C_1\|\alpha-\alpha_0\|_2^2 \leq E[\ell(Z_t;\alpha_0) - \ell(Z_t;\alpha)] \leq C_2\|\alpha-\alpha_0\|_2^2$; that is, $\|\alpha-\alpha_0\|_2^2$ is equivalent to the Kullback-Leibler discrepancy on the local sieve space $\mathcal{A}_{0sn}$. The following convergence rate theorem is a direct application of theorem 3.2 of Shen and Wong (2004) or theorem 3.2 of Chen (2006) to the local parameter space $\mathcal{A}_{0s}$ and the local sieve space $\mathcal{A}_{0sn}$; hence, we omit its proof.
Theorem 3.1. Under assumptions 3.1—3.8, if $k_n = O\left([n+n_a]^{\frac{1}{2\gamma+1}}\right)$, then

$$\|\hat{\alpha}_n-\alpha_0\|_2 = O_P\left(\max\left\{k_n^{-\gamma},\ \sqrt{\frac{k_n}{n+n_a}}\right\}\right) = O_P\left([n+n_a]^{\frac{-\gamma}{2\gamma+1}}\right).$$
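As a quick arithmetic illustration of the theorem (our assumed values, not from the paper): with $\gamma = 1$ and $n+n_a = 2500$, the prescribed $k_n$ balances the bias and variance terms so that all three expressions share the order $[n+n_a]^{-1/3}$:

```python
import numpy as np

gamma, N = 1.0, 2500                # assumed smoothness and combined sample size
k_n = N ** (1 / (2 * gamma + 1))    # k_n = O(N^{1/(2 gamma + 1)}) ~ 13.6
bias = k_n ** (-gamma)              # sieve approximation error
sd = np.sqrt(k_n / N)               # standard-deviation term
print(k_n, bias, sd, N ** (-gamma / (2 * gamma + 1)))  # bias ~ sd ~ N^{-1/3} ~ 0.0737
```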
3.1.3. Asymptotic normality under possible misspecification. We can derive the asymptotic distribution of the sieve quasi MLE $\hat{\theta}_n$ regardless of whether the latent parametric model $g(y|x^*,w;\theta_0)$ is correctly specified or not. First, we define an inner product corresponding to the pseudo metric $\|\cdot\|_2$:

$$\langle v_1, v_2\rangle_2 \equiv -E\left(\frac{d^2\ell(Z_t;\alpha_0)}{d\alpha\, d\alpha^T}[v_1, v_2]\right).$$
Let $\overline{\mathcal{V}}$ denote the closure of the linear span of $\mathcal{A}-\{\alpha_0\}$ under the metric $\|\cdot\|_2$. Then $\left(\overline{\mathcal{V}}, \|\cdot\|_2\right)$ is a Hilbert space and we can represent $\overline{\mathcal{V}} = \mathbb{R}^{d_\theta}\times\overline{\mathcal{U}}$ with $\overline{\mathcal{U}} \equiv \overline{\mathcal{F}_1\times\mathcal{F}_{1a}\times\mathcal{F}_2\times\mathcal{F}_2 - \{(f_{01},f_{01a},f_{02},f_{02a})\}}$. Let $h = (f_1,f_{1a},f_2,f_{2a})$ denote all the unknown densities. Then the pathwise first derivative can be written as

$$\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[\alpha-\alpha_0] = \frac{d\ell(Z_t;\alpha_0)}{d\theta^T}(\theta-\theta_0) + \frac{d\ell(Z_t;\alpha_0)}{dh}[h-h_0] = \left(\frac{d\ell(Z_t;\alpha_0)}{d\theta^T} - \frac{d\ell(Z_t;\alpha_0)}{dh}[\mu]\right)(\theta-\theta_0),$$

with $h-h_0 \equiv -\mu\times(\theta-\theta_0)$, and in which

$$\frac{d\ell(Z_t;\alpha_0)}{dh}[h-h_0] = \frac{d\ell(Z_t;\theta_0,h_0(1-\tau)+\tau h)}{d\tau}\bigg|_{\tau=0} = \frac{d\ell(Z_t;\alpha_0)}{df_1}[f_1-f_{01}] + \frac{d\ell(Z_t;\alpha_0)}{df_{1a}}[f_{1a}-f_{01a}] + \frac{d\ell(Z_t;\alpha_0)}{df_2}[f_2-f_{02}] + \frac{d\ell(Z_t;\alpha_0)}{df_{2a}}[f_{2a}-f_{02a}].$$
Note that

$$E\left(\frac{d^2\ell(Z_t;\alpha_0)}{d\alpha\, d\alpha^T}[\alpha-\alpha_0,\ \alpha-\alpha_0]\right) = (\theta-\theta_0)^T E\left(\frac{d^2\ell(Z_t;\alpha_0)}{d\theta\, d\theta^T} - 2\frac{d^2\ell(Z_t;\alpha_0)}{d\theta\, dh^T}[\mu] + \frac{d^2\ell(Z_t;\alpha_0)}{dh\, dh^T}[\mu,\mu]\right)(\theta-\theta_0),$$

with $h-h_0 \equiv -\mu\times(\theta-\theta_0)$, and in which

$$\frac{d^2\ell(Z_t;\alpha_0)}{d\theta\, dh^T}[h-h_0] = \frac{d\left(\partial\ell(Z_t;\theta_0,h_0(1-\tau)+\tau h)/\partial\theta\right)}{d\tau}\bigg|_{\tau=0},$$

$$\frac{d^2\ell(Z_t;\alpha_0)}{dh\, dh^T}[h-h_0,\ h-h_0] = \frac{d^2\ell(Z_t;\theta_0,h_0(1-\tau)+\tau h)}{d\tau^2}\bigg|_{\tau=0}.$$

For each component $\theta_k$ (of $\theta$), $k = 1,\dots,d_\theta$, suppose there exists a $\mu_k^*\in\overline{\mathcal{U}}$ that solves:

$$\mu_k^* :\ \inf_{\mu_k\in\overline{\mathcal{U}}} E\left\{-\left(\frac{\partial^2\ell(Z_t;\alpha_0)}{\partial\theta_k\,\partial\theta_k} - 2\frac{d^2\ell(Z_t;\alpha_0)}{\partial\theta_k\, dh^T}[\mu_k] + \frac{d^2\ell(Z_t;\alpha_0)}{dh\, dh^T}[\mu_k,\mu_k]\right)\right\}.$$
Denote $\mu^* = \left(\mu_1^*, \mu_2^*, \dots, \mu_{d_\theta}^*\right)$ with each $\mu_k^*\in\overline{\mathcal{U}}$, and

$$\frac{d\ell(Z_t;\alpha_0)}{dh}[\mu^*] = \left(\frac{d\ell(Z_t;\alpha_0)}{dh}\left[\mu_1^*\right], \dots, \frac{d\ell(Z_t;\alpha_0)}{dh}\left[\mu_{d_\theta}^*\right]\right),$$

$$\frac{d^2\ell(Z_t;\alpha_0)}{\partial\theta\, dh^T}[\mu^*] = \left(\frac{d^2\ell(Z_t;\alpha_0)}{\partial\theta\, dh}[\mu_1^*], \dots, \frac{d^2\ell(Z_t;\alpha_0)}{\partial\theta\, dh}[\mu_{d_\theta}^*]\right),$$

$$\frac{d^2\ell(Z_t;\alpha_0)}{dh\, dh^T}[\mu^*,\mu^*] = \begin{pmatrix} \frac{d^2\ell(Z_t;\alpha_0)}{dh\, dh^T}[\mu_1^*,\mu_1^*] & \cdots & \frac{d^2\ell(Z_t;\alpha_0)}{dh\, dh^T}[\mu_1^*,\mu_{d_\theta}^*] \\ \cdots & \cdots & \cdots \\ \frac{d^2\ell(Z_t;\alpha_0)}{dh\, dh^T}[\mu_{d_\theta}^*,\mu_1^*] & \cdots & \frac{d^2\ell(Z_t;\alpha_0)}{dh\, dh^T}[\mu_{d_\theta}^*,\mu_{d_\theta}^*] \end{pmatrix}.$$

Also denote

$$V_* \equiv -E\left(\frac{\partial^2\ell(Z_t;\alpha_0)}{\partial\theta\,\partial\theta^T} - 2\frac{d^2\ell(Z_t;\alpha_0)}{\partial\theta\, dh^T}[\mu^*] + \frac{d^2\ell(Z_t;\alpha_0)}{dh\, dh^T}[\mu^*,\mu^*]\right).$$
Now we consider a linear functional of $\alpha$, which is $\lambda^T\theta$ for any $\lambda\in\mathbb{R}^{d_\theta}$ with $\lambda\neq 0$. Since

$$\sup_{\alpha-\alpha_0\neq 0}\frac{|\lambda^T(\theta-\theta_0)|^2}{\|\alpha-\alpha_0\|_2^2} = \sup_{\theta\neq\theta_0,\,\mu\neq 0}\frac{(\theta-\theta_0)^T\lambda\lambda^T(\theta-\theta_0)}{(\theta-\theta_0)^T E\left\{-\left(\frac{d^2\ell(Z_t;\alpha_0)}{d\theta\, d\theta^T} - 2\frac{d^2\ell(Z_t;\alpha_0)}{d\theta\, dh^T}[\mu] + \frac{d^2\ell(Z_t;\alpha_0)}{dh\, dh^T}[\mu,\mu]\right)\right\}(\theta-\theta_0)} = \lambda^T(V_*)^{-1}\lambda,$$

the functional $\lambda^T(\theta-\theta_0)$ is bounded if and only if the matrix $V_*$ is nonsingular.
Suppose that $V_*$ is nonsingular. For any fixed $\lambda\neq 0$, denote $\upsilon^* \equiv (v_\theta^*, v_h^*)$ with $v_\theta^* \equiv (V_*)^{-1}\lambda$ and $v_h^* \equiv -\mu^*\times v_\theta^*$. Then the Riesz representation theorem implies: $\lambda^T(\theta-\theta_0) = \langle\upsilon^*, \alpha-\alpha_0\rangle_2$ for all $\alpha\in\mathcal{A}$. In the Appendix, we show that

$$\lambda^T\left(\hat{\theta}_n-\theta_0\right) = \langle\upsilon^*, \hat{\alpha}_n-\alpha_0\rangle_2 = \frac{1}{n+n_a}\sum_{t=1}^{n+n_a}\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[\upsilon^*] + o_p\left(\frac{1}{\sqrt{n+n_a}}\right).$$
Denote $\mathcal{N}_0 = \{\alpha\in\mathcal{A}_{0s} : \|\alpha-\alpha_0\|_2 = o([n+n_a]^{-1/4})\}$ and $\mathcal{N}_{0n} = \{\alpha\in\mathcal{A}_{0sn} : \|\alpha-\alpha_0\|_2 = o([n+n_a]^{-1/4})\}$. We impose the following additional conditions for asymptotic normality of the sieve quasi MLE $\hat{\theta}_n$:

Assumption 3.9. $\mu^*$ exists (i.e., $\mu_k^*\in\overline{\mathcal{U}}$ for $k = 1,\dots,d_\theta$), and $V_*$ is positive-definite.

Assumption 3.10. There is a $\upsilon_n^*\in\mathcal{A}_n-\{\alpha_0\}$ such that $\|\upsilon_n^*-\upsilon^*\|_2 = o(1)$ and $\|\upsilon_n^*-\upsilon^*\|_2\times\|\hat{\alpha}_n-\alpha_0\|_2 = o_P\left(\frac{1}{\sqrt{n+n_a}}\right)$.

Assumption 3.11. There is a random variable $U(Z_t)$ with $E\{[U(Z_t)]^2\} < \infty$ and a non-negative measurable function $\eta$ with $\lim_{\delta\to 0}\eta(\delta) = 0$, such that, for all $\alpha\in\mathcal{N}_{0n}$,

$$\sup_{\overline{\alpha}\in\mathcal{N}_0}\left|\frac{d^2\ell(Z_t;\overline{\alpha})}{d\alpha\, d\alpha^T}[\alpha-\alpha_0,\ \upsilon_n^*]\right| \leq U(Z_t)\times\eta(\|\alpha-\alpha_0\|_s).$$

Assumption 3.12. Uniformly over $\overline{\alpha}\in\mathcal{N}_0$ and $\alpha\in\mathcal{N}_{0n}$,

$$E\left(\frac{d^2\ell(Z_t;\overline{\alpha})}{d\alpha\, d\alpha^T}[\alpha-\alpha_0,\ \upsilon_n^*] - \frac{d^2\ell(Z_t;\alpha_0)}{d\alpha\, d\alpha^T}[\alpha-\alpha_0,\ \upsilon_n^*]\right) = o\left(\frac{1}{\sqrt{n+n_a}}\right).$$

Assumption 3.13. $E\left\{\left(\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[\upsilon_n^*-\upsilon^*]\right)^2\right\}$ goes to zero as $\|\upsilon_n^*-\upsilon^*\|_2$ goes to zero.

Assumption 3.9 is critical for obtaining the $\sqrt{n}$ convergence of the sieve quasi MLE $\hat{\theta}_n$ to $\theta_0$ and its asymptotic normality. We notice that it is possible that $\theta_0$ is uniquely identified but assumption 3.9 is not satisfied. If this happens, $\theta_0$ can still be consistently estimated, but the best achievable convergence rate is slower than the $\sqrt{n}$-rate. Assumption 3.10 implies that the asymptotic bias of the Riesz representer is negligible. Assumptions 3.11 and 3.12 control the remainder term. Assumption 3.13 is automatically satisfied when the latent parametric model is correctly specified, since $E\left\{\left(\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[\upsilon_n^*-\upsilon^*]\right)^2\right\} = \|\upsilon_n^*-\upsilon^*\|_2^2$ under correct specification. Denote

$$S_{\theta_0} \equiv \frac{d\ell(Z_t;\alpha_0)}{d\theta^T} - \frac{d\ell(Z_t;\alpha_0)}{dh}[\mu^*] \quad \text{and} \quad I_* \equiv E\left[S_{\theta_0}^T S_{\theta_0}\right].$$
The following asymptotic normality result applies to possibly misspecified models.

Theorem 3.2. Under assumptions 3.1—3.13, we have $\sqrt{n+n_a}\left(\hat{\theta}_n-\theta_0\right)\xrightarrow{d} N\left(0,\ V_*^{-1} I_* V_*^{-1}\right)$.
3.1.4. Semiparametric efficiency under correct specification. In this subsection we assume that $g(y|x^*,w;\theta_0)$ correctly specifies the true unknown conditional density $f_{Y|X^*,W}(y|x^*,w)$. We can then establish the semiparametric efficiency of the two-sample sieve MLE $\hat{\theta}_n$ for the parameter of interest $\theta_0$. First we recall the Fisher metric $\|\cdot\|$ on $\mathcal{A}$: for any $\alpha_1,\alpha_2\in\mathcal{A}$,

$$\|\alpha_1-\alpha_2\|^2 \equiv E\left\{\left(\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[\alpha_1-\alpha_2]\right)^2\right\},$$

and the Fisher norm-induced inner product:

$$\langle v_1,v_2\rangle \equiv E\left\{\left(\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v_1]\right)\left(\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v_2]\right)\right\}.$$
Under correct specification, $g(y|x^*,w;\theta_0) = f_{Y|X^*,W}(y|x^*,w)$, it can be shown that $\|v\| = \|v\|_2$ and $\langle v_1,v_2\rangle = \langle v_1,v_2\rangle_2$. Thus, the space $\overline{\mathcal{V}}$ is also the closure of the linear span of $\mathcal{A}-\{\alpha_0\}$ under the Fisher metric $\|\cdot\|$. For each parametric component $\theta_k$ of $\theta$, $k = 1,2,\dots,d_\theta$, an alternative way to obtain $\mu^* = \left(\mu_1^*,\mu_2^*,\dots,\mu_{d_\theta}^*\right)$ is to compute $\mu_k^* \equiv \left(\mu_1^{*k},\mu_{1a}^{*k},\mu_2^{*k},\mu_{2a}^{*k}\right)^T\in\overline{\mathcal{U}}$ as the solution to

$$\inf_{\mu_k\in\overline{\mathcal{U}}} E\left\{\left(\frac{d\ell(Z_t;\alpha_0)}{d\theta_k} - \frac{d\ell(Z_t;\alpha_0)}{dh}\left[\mu_k\right]\right)^2\right\} = \inf_{(\mu_1,\mu_{1a},\mu_2,\mu_{2a})^T\in\overline{\mathcal{U}}} E\left\{\left(\frac{d\ell(Z_t;\alpha_0)}{d\theta_k} - \frac{d\ell(Z_t;\alpha_0)}{df_1}[\mu_1] - \frac{d\ell(Z_t;\alpha_0)}{df_{1a}}[\mu_{1a}] - \frac{d\ell(Z_t;\alpha_0)}{df_2}[\mu_2] - \frac{d\ell(Z_t;\alpha_0)}{df_{2a}}[\mu_{2a}]\right)^2\right\}.$$
Then

$$S_{\theta_0} \equiv \frac{d\ell(Z_t;\alpha_0)}{d\theta^T} - \frac{d\ell(Z_t;\alpha_0)}{dh}[\mu^*]$$

becomes the semiparametric efficient score for $\theta_0$, and under correct specification, $I_* \equiv E\left[S_{\theta_0}^T S_{\theta_0}\right] = V_*$, which is the semiparametric information bound for $\theta_0$.
Given the expression of the density function, the pathwise first derivative at $\alpha_0$ can be written as

$$\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[\alpha-\alpha_0] = S_t\,\frac{d\ell_p(Z_t;\theta_0,f_{01},f_{02})}{d\alpha}[\alpha-\alpha_0] + (1-S_t)\,\frac{d\ell_a(Z_t;f_{01a},f_{02a})}{d\alpha}[\alpha-\alpha_0].$$

See the Appendix for the expressions of $\frac{d\ell_p(Z_t;\theta_0,f_{01},f_{02})}{d\alpha}[\alpha-\alpha_0]$ and $\frac{d\ell_a(Z_t;f_{01a},f_{02a})}{d\alpha}[\alpha-\alpha_0]$. Thus

$$I_* \equiv E\left[S_{\theta_0}^T S_{\theta_0}\right] = p\, I_p^* + (1-p)\, I_a^*,$$

with

$$I_p^* = E\left[\left(\frac{d\ell_p(Z_t;\theta_0,f_{01},f_{02})}{d\theta^T} - \sum_{j=1}^{2}\frac{d\ell_p(Z_t;\theta_0,f_{01},f_{02})}{df_j}\left[\mu_j^*\right]\right)^T\left(\frac{d\ell_p(Z_t;\theta_0,f_{01},f_{02})}{d\theta^T} - \sum_{j=1}^{2}\frac{d\ell_p(Z_t;\theta_0,f_{01},f_{02})}{df_j}\left[\mu_j^*\right]\right)\right],$$

$$I_a^* = E\left[\left(\sum_{j=1}^{2}\frac{d\ell_a(Z_t;f_{01a},f_{02a})}{df_{ja}}\left[\mu_{ja}^*\right]\right)^T\left(\sum_{j=1}^{2}\frac{d\ell_a(Z_t;f_{01a},f_{02a})}{df_{ja}}\left[\mu_{ja}^*\right]\right)\right].$$
Therefore, the influence function representation of our two-sample sieve MLE is:

$$\lambda^T\left(\hat{\theta}_n-\theta_0\right) = \frac{1}{n+n_a}\left\{\sum_{i=1}^{n}\frac{d\ell_p(Z_{pi};\theta_0,f_{01},f_{02})}{d\alpha}[\upsilon^*] + \sum_{j=1}^{n_a}\frac{d\ell_a(Z_{aj};f_{01a},f_{02a})}{d\alpha}[\upsilon^*]\right\} + o_p\left(\frac{1}{\sqrt{n+n_a}}\right),$$

and the asymptotic distribution of $\sqrt{n+n_a}\left(\hat{\theta}_n-\theta_0\right)$ is $N\left(0,\ I_*^{-1}\right)$. Combining our theorem 3.2 and theorem 4 of Shen (1997), we immediately obtain

Theorem 3.3. Suppose that $g(y|x^*,w;\theta_0) = f_{Y|X^*,W}(y|x^*,w)$ for almost all $y, x^*, w$, that $I_*$ is positive definite, and that assumptions 3.1—3.12 hold. Then the two-sample sieve MLE $\hat{\theta}_n$ is semiparametrically efficient, and

$$\sqrt{n}\left(\hat{\theta}_n-\theta_0\right)\xrightarrow{d} N\left(0,\ \left[I_p^* + \tfrac{1-p}{p}\, I_a^*\right]^{-1}\right) = N\left(0,\ p\, I_*^{-1}\right).$$
Following Ai and Chen (2003), the asymptotic efficient variance, $I_*^{-1}$, of the sieve MLE $\hat{\theta}_n$ (under correct specification) can be consistently estimated by $\hat{I}_*^{-1}$, with

$$\hat{I}_* = \frac{1}{n+n_a}\sum_{t=1}^{n+n_a}\left(\frac{d\ell(Z_t;\hat{\alpha})}{d\theta^T} - \frac{d\ell(Z_t;\hat{\alpha})}{dh}[\hat{\mu}^*]\right)^T\left(\frac{d\ell(Z_t;\hat{\alpha})}{d\theta^T} - \frac{d\ell(Z_t;\hat{\alpha})}{dh}[\hat{\mu}^*]\right),$$

in which $\hat{\mu}^* = \left(\hat{\mu}_1^*,\hat{\mu}_2^*,\dots,\hat{\mu}_{d_\theta}^*\right)$ and $\hat{\mu}_k^* \equiv \left(\hat{\mu}_1^{*k},\hat{\mu}_{1a}^{*k},\hat{\mu}_2^{*k},\hat{\mu}_{2a}^{*k}\right)^T$ solves the following sieve minimization problem: for $k = 1,2,\dots,d_\theta$,

$$\min_{\mu^k\in\mathcal{F}^n}\sum_{t=1}^{n+n_a}\left(\frac{d\ell(Z_t;\hat{\alpha})}{d\theta_k} - \frac{d\ell(Z_t;\hat{\alpha})}{df_1}\left[\mu_1^k\right] - \frac{d\ell(Z_t;\hat{\alpha})}{df_{1a}}\left[\mu_{1a}^k\right] - \frac{d\ell(Z_t;\hat{\alpha})}{df_2}\left[\mu_2^k\right] - \frac{d\ell(Z_t;\hat{\alpha})}{df_{2a}}\left[\mu_{2a}^k\right]\right)^2,$$

in which $\mathcal{F}^n \equiv \mathcal{F}_1^n\times\mathcal{F}_{1a}^n\times\mathcal{F}_2^n\times\mathcal{F}_2^n$. Denote

$$\frac{d\ell(Z_t;\hat{\alpha})}{dh}\left[\hat{\mu}_k^*\right] \equiv \frac{d\ell(Z_t;\hat{\alpha})}{df_1}\left[\hat{\mu}_1^{*k}\right] + \frac{d\ell(Z_t;\hat{\alpha})}{df_{1a}}\left[\hat{\mu}_{1a}^{*k}\right] + \frac{d\ell(Z_t;\hat{\alpha})}{df_2}\left[\hat{\mu}_2^{*k}\right] + \frac{d\ell(Z_t;\hat{\alpha})}{df_{2a}}\left[\hat{\mu}_{2a}^{*k}\right],$$

and

$$\frac{d\ell(Z_t;\hat{\alpha})}{dh}\left[\hat{\mu}^*\right] = \left(\frac{d\ell(Z_t;\hat{\alpha})}{dh}\left[\hat{\mu}_1^*\right], \dots, \frac{d\ell(Z_t;\hat{\alpha})}{dh}\left[\hat{\mu}_{d_\theta}^*\right]\right).$$

3.2. Sieve likelihood ratio model selection test. In many empirical appli-
cations, researchers often estimate different parametrically specified structural models in order to select the one that fits the data "best". We shall consider two non-nested, possibly misspecified, parametric latent structural models: $\{g_1(y|x^*,w;\theta_1) : \theta_1\in\Theta_1\}$ and $\{g_2(y|x^*,w;\theta_2) : \theta_2\in\Theta_2\}$. If $X^*$
were observed without error in the primary sample, researchers could ap-
ply Vuong’s (1989) likelihood ratio test to select a “best” parametric model
that is closest to the true underlying conditional density fY |X∗,W (y|x∗, w) according to the KLIC. In this subsection, we shall extend Vuong's result to
the case in which X∗ is not observed in either sample.
Consider two parametric families of models $\{g_j(y|x^*,w;\theta_j) : \theta_j\in\Theta_j\}$, $\Theta_j$ a compact subset of $\mathbb{R}^{d_{\theta_j}}$, $j = 1, 2$, for the latent true conditional density $f_{Y|X^*,W}$. Define

$$\theta_{0j} \equiv \arg\max_{\theta_j\in\Theta_j}\int[\log g_j(y|x^*,w;\theta_j)]\, f_{Y|X^*,W}(y|x^*,w)\, dy.$$
According to Vuong (1989), the two models are nested if $g_1(y|x^*,w;\theta_{01}) = g_2(y|x^*,w;\theta_{02})$ for almost all $y\in\mathcal{Y}$, $x^*\in\mathcal{X}^*$, $w\in\mathcal{W}$; the two models are non-nested if $g_1(Y|X^*,W;\theta_{01}) \neq g_2(Y|X^*,W;\theta_{02})$ with positive probability.
For $j = 1, 2$, denote $\alpha_{0j} = (\theta_{0j}^T, f_{01}, f_{01a}, f_{02}, f_{02a})^T\in\mathcal{A}_j$ with $\mathcal{A}_j = \Theta_j\times\mathcal{F}_1\times\mathcal{F}_{1a}\times\mathcal{F}_2\times\mathcal{F}_2$, and let $\ell_j(Z_t;\alpha_{0j})$ denote the log-likelihood according to model $j$ evaluated at data $Z_t$. Following Vuong (1989), we select model 1 if $H_0$ holds, in which

$$H_0 : E\{\ell_2(Z_t;\alpha_{02}) - \ell_1(Z_t;\alpha_{01})\} \leq 0,$$

and we select model 2 if $H_1$ holds, in which

$$H_1 : E\{\ell_2(Z_t;\alpha_{02}) - \ell_1(Z_t;\alpha_{01})\} > 0.$$
For $j = 1, 2$, denote $\mathcal{A}_{j,n} = \Theta_j\times\mathcal{F}_1^n\times\mathcal{F}_{1a}^n\times\mathcal{F}_2^n\times\mathcal{F}_2^n$ and define the sieve quasi MLE for $\alpha_{0j}\in\mathcal{A}_j$ as

$$\hat{\alpha}_j = \arg\max_{\alpha_j\in\mathcal{A}_{j,n}}\sum_{t=1}^{n+n_a}\ell_j(Z_t;\alpha_j) = \arg\max_{\alpha_j\in\mathcal{A}_{j,n}}\left[\sum_{t=1}^{n}\ell_{j,p}(Z_{pt};\theta_j,f_1,f_2) + \sum_{t=1}^{n_a}\ell_{j,a}(Z_{at};f_{1a},f_{2a})\right].$$
In the following, we denote $\sigma^2 \equiv \mathrm{Var}\left(\ell_2(Z_t;\alpha_{02}) - \ell_1(Z_t;\alpha_{01})\right)$ and

$$\hat{\sigma}^2 = \frac{1}{n+n_a}\sum_{t=1}^{n+n_a}\left[\{\ell_2(Z_t;\hat{\alpha}_2) - \ell_1(Z_t;\hat{\alpha}_1)\} - \frac{1}{n+n_a}\sum_{s=1}^{n+n_a}\{\ell_2(Z_s;\hat{\alpha}_2) - \ell_1(Z_s;\hat{\alpha}_1)\}\right]^2.$$
Theorem 3.4. Suppose both models 1 and 2 satisfy assumptions 3.1—3.8,
and $\sigma^2 < \infty$. Then

$$\frac{1}{\sqrt{n+n_a}}\sum_{t=1}^{n+n_a}\left(\{\ell_2(Z_t;\hat{\alpha}_2) - \ell_1(Z_t;\hat{\alpha}_1)\} - E\{\ell_2(Z_t;\alpha_{02}) - \ell_1(Z_t;\alpha_{01})\}\right)$$
$$= \frac{1}{\sqrt{n+n_a}}\sum_{t=1}^{n+n_a}\left(\{\ell_2(Z_t;\alpha_{02}) - \ell_1(Z_t;\alpha_{01})\} - E\{\ell_2(Z_t;\alpha_{02}) - \ell_1(Z_t;\alpha_{01})\}\right) + o_P(1) \xrightarrow{d} N\left(0,\ \sigma^2\right).$$

Suppose models 1 and 2 are non-nested; then

$$\frac{1}{\hat{\sigma}\sqrt{n+n_a}}\sum_{t=1}^{n+n_a}\left(\{\ell_2(Z_t;\hat{\alpha}_2) - \ell_1(Z_t;\hat{\alpha}_1)\} - E\{\ell_2(Z_t;\alpha_{02}) - \ell_1(Z_t;\alpha_{01})\}\right) \xrightarrow{d} N(0,1).$$

Thus, under the least favorable null hypothesis of $E\{\ell_2(Z_t;\alpha_{02}) - \ell_1(Z_t;\alpha_{01})\} = 0$, we have $\frac{1}{\hat{\sigma}\sqrt{n+n_a}}\sum_{t=1}^{n+n_a}\{\ell_2(Z_t;\hat{\alpha}_2) - \ell_1(Z_t;\hat{\alpha}_1)\} \xrightarrow{d} N(0,1)$, which can be used to provide a sieve likelihood ratio model selection test of $H_0$ against $H_1$.
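Given the pointwise log-likelihood contributions of the two fitted models, the test statistic is one line. A minimal sketch (assuming arrays `ell1`, `ell2` holding the fitted values $\ell_j(Z_t;\hat{\alpha}_j)$ have already been computed) is:

```python
import numpy as np

def sieve_lr_stat(ell1, ell2):
    # 1/(sigma_hat * sqrt(n + n_a)) * sum_t {ell_2(Z_t) - ell_1(Z_t)};
    # approximately N(0,1) under the least favorable null E{ell_2 - ell_1} = 0.
    d = np.asarray(ell2) - np.asarray(ell1)
    return d.sum() / (d.std() * np.sqrt(d.size))  # d.std() is sigma_hat (ddof = 0)

# sanity check on synthetic inputs: the statistic is roughly standard normal
rng = np.random.default_rng(0)
print(sieve_lr_stat(rng.normal(size=2500), rng.normal(size=2500)))
```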
4. Simulation. In this section we present a simulation study to illus-
trate the finite sample performance of the two-sample sieve MLE. The true
latent probability density model fY |X∗,W is:
$$f_{Y|X^*,W}(y|x^*,w;\theta) = \phi\left(y - m(x^*,w;\theta)\right),$$

where $\phi(\cdot)$ is the pdf of the standard normal distribution and

$$m(X^*,W;\theta) = \beta_1 X^* + \beta_2 X^* W + \beta_3\left(X^{*2}W + X^*W^2\right)/2,$$
in which θ = (β1, β2, β3)T is unknown and W ∈ {−1, 0, 1}. We have two
independent random samples: {Xi,Wi, Yi}ni=1 and {Xaj ,Waj , Yaj}naj=1, withn = 1500 and na = 1000. In the primary sample, we let θ0 = (1, 1, 1)T ,
X∗|W ∼ N(0, 1), and Pr(W = 1) = Pr(W = 0) = 1/3. The mismeasured
value X equals
X = 0.1X∗ + e−0.1X∗ε with ε ∼ N(0, 0.36).
In the auxiliary sample we generate $W_a$ in the same way that we generate $W$ in the primary sample. We set the unknown true conditional density $f_{X^*_a|W_a}$ as follows:
\[
f_{X^*_a|W_a}(x^*_a|w_a) = \begin{cases} \psi(x^*_a) & \text{for } w_a = -1, \\ 0.25\,\psi(0.25\,x^*_a) & \text{for } w_a = 0, \\ \psi(x^*_a - 0.5) & \text{for } w_a = 1. \end{cases}
\]
The mismeasured value $X_a$ equals
\[
X_a = X^*_a + 0.5\, e^{-X^*_a}\nu \quad \text{with } \nu \sim N(0,1),
\]
which implies that $x^*_a$ is the mode of the conditional density $f_{X_a|X^*_a}(\cdot|x^*_a)$.
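The two samples in this design can be generated in a few lines. A sketch, taking $\psi$ to be the standard normal pdf purely for illustration (an assumption on our part; under it the $w_a = 0$ branch $0.25\,\psi(0.25\,x^*_a)$ corresponds to an $N(0,16)$ draw):

```python
import numpy as np

rng = np.random.default_rng(42)
beta = np.array([1.0, 1.0, 1.0])                 # theta_0 = (1, 1, 1)^T

def m(x_star, w, b):
    # m(X*, W; theta) = b1 X* + b2 X* W + b3 (X*^2 W + X* W^2) / 2
    return b[0]*x_star + b[1]*x_star*w + b[2]*(x_star**2*w + x_star*w**2)/2

# Primary sample: W uniform on {-1, 0, 1}, X*|W ~ N(0, 1).
n = 1500
w = rng.choice([-1, 0, 1], size=n)
x_star = rng.normal(size=n)
y = m(x_star, w, beta) + rng.normal(size=n)      # f_{Y|X*,W} = phi(y - m)
x = 0.1*x_star + np.exp(-0.1*x_star)*rng.normal(scale=0.6, size=n)  # var 0.36

# Auxiliary sample: W_a drawn as W; X*_a|W_a as in the display above,
# with psi read as the standard normal pdf (an assumption).
na = 1000
wa = rng.choice([-1, 0, 1], size=na)
xa_star = np.where(wa == -1, rng.normal(size=na),
           np.where(wa == 0, rng.normal(scale=4.0, size=na),
                             rng.normal(loc=0.5, size=na)))
ya = m(xa_star, wa, beta) + rng.normal(size=na)
xa = xa_star + 0.5*np.exp(-xa_star)*rng.normal(size=na)
```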
We use the simple sieve expression $p^{k_{1,n}}_1(x_1,x_2)^T\beta_1 = \sum_{j=0}^{J_n}\sum_{k=0}^{K_n} \gamma_{jk}\, p_j(x_1 - x_2)\, q_k(x_2)$ to approximate the conditional densities $f_{X|X^*}(x_1|x_2)$ and $f_{X_a|X^*_a}(x_1|x_2)$, with $k_{1,n} = (J_n+1)(K_n+1)$; and $p^{k_{2,n}}_2(x^*)^T\beta_2(w) = \sum_{k=1}^{k_{2,n}} \gamma_k(w)\, q_k(x^*)$ to approximate the conditional densities $f_{X^*|W}(\cdot|w)$ and $f_{X^*_a|W_a}(\cdot|w)$ for $w = -1, 0, 1$. The bases $\{p_j(\cdot)\}$ and $\{q_k(\cdot)\}$ are Hermite polynomial bases.
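A sketch of how such a tensor-product Hermite sieve can be assembled (the Gaussian damping factor that keeps the basis functions integrable, and all names, are our own illustrative choices; the normalization that turns $p^{k_{1,n}}_1(x_1,x_2)^T\beta_1$ into a proper conditional density is omitted):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def herm_basis(u, deg):
    """Probabilists' Hermite polynomials He_0..He_deg at u, damped by a
    Gaussian factor so each basis function is integrable (our choice)."""
    cols = [hermeval(u, [0.0]*j + [1.0]) for j in range(deg + 1)]
    return np.column_stack(cols) * np.exp(-u**2/4)[:, None]

def sieve_design(x1, x2, Jn=5, Kn=3):
    """Tensor-product terms p_j(x1 - x2) q_k(x2) used to approximate
    f_{X|X*}(x1|x2); returns a (len(x1), (Jn+1)*(Kn+1)) design matrix."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    P = herm_basis(x1 - x2, Jn)                   # p_j(x1 - x2)
    Q = herm_basis(x2, Kn)                        # q_k(x2)
    return (P[:, :, None] * Q[:, None, :]).reshape(len(x1), -1)

X = sieve_design(np.linspace(-2, 2, 7), np.zeros(7))
print(X.shape)                                    # (7, 24) = (7, (5+1)*(3+1))
```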
estimator is the standard probit MLE using the primary sample {Xi,Wi, Yi}ni=1alone as if it were accurate; this estimator is inconsistent and its bias should
dominate the squared root of mean square error (root MSE). The second
estimator is the standard probit MLE using accurate data {Yi,X∗i ,Wi}ni=1.
This estimator is consistent and most efficient; however, we call it “infeasi-
ble MLE” since X∗i is not observed in practice. The third estimator is the
two-sample sieve MLE developed in this paper. In the last column, we also
report the square root of the sum of the three mean square errors of β1, β2,
and β3. The simulation repetition times is 400. The simulation results show
that the 2-sample sieve MLE has a much smaller bias (and a slightly bigger
standard error) than the estimator ignoring measurement error. Moreover,
the 2-sample sieve MLE has a smaller total root MSE than the inconsistent
estimator. In summary, our 2-sample sieve MLE performs well in this Monte
Carlo simulation.
Table 1
Simulation results (n = 1500, n_a = 1000, reps = 400)

true value of β:            β1 = 1     β2 = 1     β3 = 1     total
ignoring meas. error:
  mean estimate             0.1753     0.3075     0.5953
  standard error            0.08422    0.1227     0.1879
  root MSE                  0.8290     0.7033     0.4461     1.175
infeasible MLE:
  mean estimate             0.9998     1.001      1.000
  standard error            0.02792    0.03382    0.03549
  root MSE                  0.02792    0.03382    0.03549    0.0564
2-sample sieve MLE:
  mean estimate             1.024      1.038      0.9866
  standard error            0.08670    0.1229     0.2290
  root MSE                  0.08999    0.1286     0.2293     0.2779

Note: $J_n = 5$, $K_n = 3$ in $\hat f_{X|X^*}$ and $\hat f_{X_a|X^*_a}$; $k_{2,n} = 4$ for $\hat f_{X^*|W}$ and $\hat f_{X^*_a|W_a}$. The total column reports the square root of the sum of the three mean squared errors.
5. Conclusion. This paper considers nonparametric identification and
semiparametric estimation of a general nonlinear model using two random
samples. Both samples consist of a dependent variable, some error-free co-
variates and an error-ridden covariate, in which the measurement error has
unknown distribution and could be arbitrarily correlated with the latent
true values. We provide reasonable conditions so that the latent nonlinear
model is nonparametrically identified using the two samples. The advantage
of our identification strategy is that, in addition to allowing for nonclassical
measurement errors in both samples, neither sample is required to contain
an accurate measurement of the latent true covariate, and only one mea-
surement of the error-ridden covariate is assumed in each sample. Moreover,
our identification result does not require that the primary sample contain
an IV excluded from the nonlinear model of interest, nor does it require that
the two samples be independent.
Since the latent nonlinear model is nonparametrically identified without requiring the two samples to be independent, we could estimate the latent nonlinear model nonparametrically using two potentially correlated samples, provided
that we impose some structure on the correlation of the two samples. In
particular, the panel data structure in Horowitz and Markatou (1996) could
be borrowed to model two correlated samples. We shall investigate this in
future research.
6. Appendix: Mathematical Proofs.

Proof: (Theorem 2.1) Under assumption 2.1, the probability density of the observed vectors equals
\[
f_{X,W,Y}(x,w,y) = \int_{\mathcal{X}^*} f_{X|X^*}(x|x^*)\, f_{X^*,W,Y}(x^*,w,y)\, dx^* \quad \text{for all } x, w, y. \tag{A.1}
\]
For each value $w_j$ of $W$, assumption 2.1 implies that
\[
f_{X,Y|W}(x,y|w_j) = \int f_{X|X^*}(x|x^*)\, f_{Y|X^*,W}(y|x^*,w_j)\, f_{X^*|W_j}(x^*)\, dx^* \tag{A.2}
\]
in the primary sample. Similarly, assumptions 2.2 and 2.3 imply that
\[
f_{X_a,Y_a|W_a}(x,y|w_j) = \int f_{X_a|X^*_a}(x|x^*)\, f_{Y|X^*,W}(y|x^*,w_j)\, f_{X^*_a|W_j}(x^*)\, dx^* \tag{A.3}
\]
in the auxiliary sample.
By equation (A.2) and the definition of the operators, we have, for any function $h$,
\begin{align*}
\left(L_{X,Y|W_j} h\right)(x) &= \int f_{X,Y|W}(x,u|w_j)\, h(u)\, du \\
&= \int \left( \int f_{X|X^*}(x|x^*)\, f_{Y|X^*,W}(u|x^*,w_j)\, f_{X^*|W_j}(x^*)\, dx^* \right) h(u)\, du \\
&= \int f_{X|X^*}(x|x^*)\, f_{X^*|W_j}(x^*) \left( \int f_{Y|X^*,W}(u|x^*,w_j)\, h(u)\, du \right) dx^* \\
&= \int f_{X|X^*}(x|x^*)\, f_{X^*|W_j}(x^*) \left(L_{Y|X^*,W_j} h\right)(x^*)\, dx^* \\
&= \int f_{X|X^*}(x|x^*) \left(L_{X^*|W_j} L_{Y|X^*,W_j} h\right)(x^*)\, dx^* \\
&= \left(L_{X|X^*} L_{X^*|W_j} L_{Y|X^*,W_j} h\right)(x).
\end{align*}
This means we have the operator equivalence
\[
L_{X,Y|W_j} = L_{X|X^*} L_{X^*|W_j} L_{Y|X^*,W_j} \tag{A.4}
\]
in the primary sample. Similarly, equation (A.3) and the definition of the operators imply
\[
L_{X_a,Y_a|W_j} = L_{X_a|X^*_a} L_{X^*_a|W_j} L_{Y|X^*,W_j} \tag{A.5}
\]
in the auxiliary sample. While the left-hand sides of equations (A.4) and (A.5) are observed, the right-hand sides contain unknown operators corresponding to the error distributions ($L_{X|X^*}$ and $L_{X_a|X^*_a}$), the marginal distributions of the latent true values ($L_{X^*|W_j}$ and $L_{X^*_a|W_j}$), and the conditional distribution of the dependent variable ($L_{Y|X^*,W_j}$).
Assumptions 2.4 and 2.5 imply that all the operators involved in equations (A.4) and (A.5) are invertible. Under assumptions 2.4 and 2.5, for any given $W_j$ we can eliminate $L_{Y|X^*,W_j}$ in equations (A.4) and (A.5) to obtain
\[
L_{X_a,Y_a|W_j} L^{-1}_{X,Y|W_j} = L_{X_a|X^*_a} L_{X^*_a|W_j} L^{-1}_{X^*|W_j} L^{-1}_{X|X^*}. \tag{A.6}
\]
This equation holds for all $W_i$ and $W_j$. We may then eliminate $L_{X|X^*}$ to have
\[
L^{ij}_{X_a,X_a} \equiv \left(L_{X_a,Y_a|W_j} L^{-1}_{X,Y|W_j}\right) \left(L_{X_a,Y_a|W_i} L^{-1}_{X,Y|W_i}\right)^{-1} = L_{X_a|X^*_a} \left(L_{X^*_a|W_j} L^{-1}_{X^*|W_j} L_{X^*|W_i} L^{-1}_{X^*_a|W_i}\right) L^{-1}_{X_a|X^*_a} \equiv L_{X_a|X^*_a} L^{ij}_{X^*_a} L^{-1}_{X_a|X^*_a}. \tag{A.7}
\]
The operator $L^{ij}_{X_a,X_a}$ on the left-hand side is observed for all $i$ and $j$. An important observation is that the operator $L^{ij}_{X^*_a} \equiv \left(L_{X^*_a|W_j} L^{-1}_{X^*|W_j} L_{X^*|W_i} L^{-1}_{X^*_a|W_i}\right)$ is a diagonal operator defined as
\[
\left(L^{ij}_{X^*_a} h\right)(x^*) \equiv k^{ij}_{X^*_a}(x^*)\, h(x^*) \tag{A.8}
\]
with
\[
k^{ij}_{X^*_a}(x^*) \equiv \frac{f_{X^*_a|W_j}(x^*)\, f_{X^*|W_i}(x^*)}{f_{X^*|W_j}(x^*)\, f_{X^*_a|W_i}(x^*)}.
\]
Equation (A.7) implies a diagonalization of the observed operator $L^{ij}_{X_a,X_a}$: an eigenvalue of $L^{ij}_{X_a,X_a}$ equals $k^{ij}_{X^*_a}(x^*)$ for a value of $x^*$, and the corresponding eigenfunction is $f_{X_a|X^*_a}(\cdot|x^*)$.
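The mechanics of this diagonalization are easy to see in a discretized toy version: on a finite grid, $L_{X_a|X^*_a}$ becomes a matrix $A$ whose columns are discretized conditional densities, and (A.7) says the observable matrix equals $A\,\mathrm{diag}(k)\,A^{-1}$, so an ordinary eigendecomposition recovers the columns of $A$ up to scale and order; the sum-to-one normalization and an ordering rule (the roles played below by the density normalization and assumption 2.8) remove the remaining ambiguities. A sketch with made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(7)
G = 5                                     # grid points for x* (and x_a)

# Toy discretized f_{Xa|X*a}: each column is a conditional density on the
# grid, so columns sum to one (the normalization used in the proof).
A = rng.random((G, G)) + 0.1
A /= A.sum(axis=0, keepdims=True)

# Toy eigenvalue function k^{ij}(x*) with distinct values (assumption 2.7).
k = np.array([0.5, 0.8, 1.0, 1.3, 1.7])

L = A @ np.diag(k) @ np.linalg.inv(A)     # the observable operator in (A.7)

eigvals, eigvecs = np.linalg.eig(L)
# Rescale each eigenvector to sum to one: this removes the scale (and sign)
# ambiguity, mimicking the conditional-density normalization.
eigvecs = eigvecs / eigvecs.sum(axis=0, keepdims=True)

# Reorder eigenpairs; in the text the labeling of x* is pinned down by
# assumption 2.8 (e.g., a zero-conditional-mean restriction).
order = np.argsort(eigvals.real)
print(np.allclose(np.sort(k), eigvals.real[order]))            # True
print(np.allclose(A, eigvecs.real[:, order], atol=1e-6))       # True
```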
We now show the identification of $f_{X_a|X^*_a}$ and $k^{ij}_{X^*_a}(x^*)$. First, we require the operator $L^{ij}_{X_a,X_a}$ to be bounded so that the diagonal decomposition may be unique; see, e.g., Dunford and Schwartz (1971). Equation (A.7) implies that the operator $L^{ij}_{X_a,X_a}$ has the same spectrum as the diagonal operator $L^{ij}_{X^*_a}$. Since an operator is bounded by the largest element of its spectrum, assumption 2.6 guarantees that the operator $L^{ij}_{X_a,X_a}$ is bounded. Second, although equation (A.7) implies a diagonalization of the operator $L^{ij}_{X_a,X_a}$, it does not guarantee distinct eigenvalues. If duplicate eigenvalues exist, then two linearly independent eigenfunctions correspond to the same eigenvalue, and any linear combination of the two is again an eigenfunction for that eigenvalue. Therefore, the eigenfunctions may not be identified in each single decomposition corresponding to a pair $(i,j)$. However, this ambiguity can be eliminated by noting that the observed operators $L^{ij}_{X_a,X_a}$ for all $i, j$ share the same eigenfunctions $f_{X_a|X^*_a}(\cdot|x^*)$. Assumption 2.7 guarantees that, for any two different eigenfunctions $f_{X_a|X^*_a}(\cdot|x^*_1)$ and $f_{X_a|X^*_a}(\cdot|x^*_2)$, one can always find two subsets $W_j$ and $W_i$ such that the two eigenfunctions correspond to two different eigenvalues $k^{ij}_{X^*_a}(x^*_1)$ and $k^{ij}_{X^*_a}(x^*_2)$ and, therefore, are identified. The third ambiguity is that, for a given value of $x^*$, an eigenfunction $f_{X_a|X^*_a}(\cdot|x^*)$ times a constant is still an eigenfunction corresponding to $x^*$. To eliminate this ambiguity, we normalize each eigenfunction. Notice that $f_{X_a|X^*_a}(\cdot|x^*)$ is a conditional probability density for each $x^*$; hence, $\int f_{X_a|X^*_a}(x|x^*)\, dx = 1$ for all $x^*$. This property of a conditional density provides a natural normalization condition.
Fourth, in order to fully identify each eigenfunction $f_{X_a|X^*_a}$, we need to identify the exact value of $x^*$ indexing each eigenfunction $f_{X_a|X^*_a}(\cdot|x^*)$. So far, each eigenfunction is identified only up to the value of $x^*$: we have identified a probability density of $X_a$ conditional on $X^*_a = x^*$ with the value of $x^*$ unknown. Assumption 2.8 pins down the exact value of $x^*$ for each eigenfunction $f_{X_a|X^*_a}(\cdot|x^*)$. For example, an intuitive assumption is that the value of $x^*$ is the mean of this identified probability density, i.e., $x^* = \int x f_{X_a|X^*_a}(x|x^*)\, dx$; this assumption is equivalent to requiring that the measurement error in the auxiliary sample ($X_a - X^*_a$) has zero mean conditional on the latent true values.
After fully identifying the density function $f_{X_a|X^*_a}$, we now show that the densities of interest $f_{Y|X^*,W}$ and $f_{X|X^*}$ are also identified. By equation (A.3), we have $f_{X_a,Y_a|W_a} = L_{X_a|X^*_a} f_{Y_a,X^*_a|W_a}$. By the injectivity of the operator $L_{X_a|X^*_a}$, the joint density $f_{Y_a,X^*_a|W_a}$ may be identified as follows:
\[
f_{Y_a,X^*_a|W_a} = L^{-1}_{X_a|X^*_a} f_{X_a,Y_a|W_a}.
\]
Assumption 2.3 implies that $f_{Y_a|X^*_a,W_a} = f_{Y|X^*,W}$, so that we may identify $f_{Y|X^*,W}$ through
\[
f_{Y|X^*,W}(y|x^*,w) = \frac{f_{Y_a,X^*_a|W_a}(y,x^*|w)}{\int f_{Y_a,X^*_a|W_a}(y,x^*|w)\, dy} \quad \text{for all } x^* \text{ and } w.
\]
By equation (A.4) and the injectivity of the identified operator $L_{Y|X^*,W_j}$, we have
\[
L_{X|X^*} L_{X^*|W_j} = L_{X,Y|W_j} L^{-1}_{Y|X^*,W_j}. \tag{A.9}
\]
The left-hand side of equation (A.9) equals an operator with the kernel function $f_{X,X^*|W=w_j} \equiv f_{X|X^*} f_{X^*|W=w_j}$. Since the right-hand side of equation (A.9) has been identified, the kernel $f_{X,X^*|W=w_j}$ on the left-hand side is also identified. We may then identify $f_{X|X^*}$ through
\[
f_{X|X^*}(x|x^*) = \frac{f_{X,X^*|W=w_j}(x,x^*)}{\int f_{X,X^*|W=w_j}(x,x^*)\, dx} \quad \text{for all } x^* \in \mathcal{X}^*.
\]
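In the same discretized picture, this last step is a linear solve: with $A \approx L_{X_a|X^*_a}$ recovered, $f_{Y_a,X^*_a|W_a} = L^{-1}_{X_a|X^*_a} f_{X_a,Y_a|W_a}$ becomes a call to `np.linalg.solve`, followed by a normalization over $y$. A sketch with made-up grids (all inputs illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
G = 5                                      # grid points for x_a, x*, and y

# Recovered error matrix A[x_a, x*] (columns sum to one) and a made-up
# latent joint density f_{Ya,X*a|Wa}(y, x*|w), tabulated as [x*, y].
A = rng.random((G, G)) + 0.1
A /= A.sum(axis=0, keepdims=True)
f_latent_true = rng.random((G, G))
f_latent_true /= f_latent_true.sum()

f_obs = A @ f_latent_true                  # eq. (A.3) on the grid: [x_a, y]

# Invert the error operator to recover f_{Ya,X*a|Wa} ...
f_latent = np.linalg.solve(A, f_obs)
# ... then normalize over y to obtain f_{Y|X*,W}(y|x*, w), as in the proof.
f_y_given_xstar = f_latent / f_latent.sum(axis=1, keepdims=True)
print(np.allclose(f_latent, f_latent_true))                    # True
```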
Proof: (Theorem 3.2) The proof is a simplified version of that for theorem 4.1 in Ai and Chen (2007). Recall the neighborhoods $\mathcal{N}_{0n} = \{\alpha \in \mathcal{A}^n_{0s}: \|\alpha - \alpha_0\|_2 = o([n+n_a]^{-1/4})\}$ and $\mathcal{N}_0 = \{\alpha \in \mathcal{A}_{0s}: \|\alpha - \alpha_0\|_2 = o([n+n_a]^{-1/4})\}$. For any $\alpha \in \mathcal{N}_{0n}$, define
\[
r[Z_t;\alpha,\alpha_0] \equiv \ell(Z_t;\alpha) - \ell(Z_t;\alpha_0) - \frac{d\ell(Z_t;\alpha_0)}{d\alpha}[\alpha - \alpha_0].
\]
Denote the centered empirical process indexed by any measurable function $h$ as
\[
\mu_n(h(Z_t)) \equiv \frac{1}{n+n_a} \sum_{t=1}^{n+n_a} \{h(Z_t) - E[h(Z_t)]\}.
\]
Let $\varepsilon_n > 0$ be of order $o([n+n_a]^{-1/2})$. By the definition of the two-sample sieve quasi MLE $\hat\alpha_n$, we have
\begin{align*}
0 &\le \frac{1}{n+n_a} \sum_{t=1}^{n+n_a} \left[\ell(Z_t;\hat\alpha) - \ell(Z_t;\hat\alpha \pm \varepsilon_n v^*_n)\right] \\
&= \mu_n\left(\ell(Z_t;\hat\alpha) - \ell(Z_t;\hat\alpha \pm \varepsilon_n v^*_n)\right) + E\left(\ell(Z_t;\hat\alpha) - \ell(Z_t;\hat\alpha \pm \varepsilon_n v^*_n)\right) \\
&= \mp\varepsilon_n \times \frac{1}{n+n_a} \sum_{t=1}^{n+n_a} \frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v^*_n] + \mu_n\left(r[Z_t;\hat\alpha,\alpha_0] - r[Z_t;\hat\alpha \pm \varepsilon_n v^*_n,\alpha_0]\right) \\
&\quad + E\left(r[Z_t;\hat\alpha,\alpha_0] - r[Z_t;\hat\alpha \pm \varepsilon_n v^*_n,\alpha_0]\right).
\end{align*}
In the following we will show that:
\[
\frac{1}{n+n_a} \sum_{t=1}^{n+n_a} \frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v^*_n - v^*] = o_P\!\left(\frac{1}{\sqrt{n+n_a}}\right); \tag{A.10}
\]
\[
E\left(r[Z_t;\hat\alpha,\alpha_0] - r[Z_t;\hat\alpha \pm \varepsilon_n v^*_n,\alpha_0]\right) = \pm\varepsilon_n \langle \hat\alpha - \alpha_0, v^*\rangle_2 + \varepsilon_n\, o_P\!\left(\frac{1}{\sqrt{n+n_a}}\right); \tag{A.11}
\]
\[
\mu_n\left(r[Z_t;\hat\alpha,\alpha_0] - r[Z_t;\hat\alpha \pm \varepsilon_n v^*_n,\alpha_0]\right) = \varepsilon_n \times o_P\!\left(\frac{1}{\sqrt{n+n_a}}\right). \tag{A.12}
\]
Notice that assumptions 3.1, 3.2(ii)(iii), and 3.6 imply $E\left(\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v^*]\right) = 0$. Under (A.10)–(A.12) we have:
\[
0 \le \frac{1}{n+n_a} \sum_{t=1}^{n+n_a} \left[\ell(Z_t;\hat\alpha) - \ell(Z_t;\hat\alpha \pm \varepsilon_n v^*_n)\right] = \mp\varepsilon_n \times \mu_n\!\left(\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v^*]\right) \pm \varepsilon_n \times \langle\hat\alpha - \alpha_0, v^*\rangle_2 + \varepsilon_n \times o_P\!\left(\frac{1}{\sqrt{n+n_a}}\right).
\]
Hence
\[
\sqrt{n+n_a}\, \langle\hat\alpha - \alpha_0, v^*\rangle_2 = \sqrt{n+n_a}\, \mu_n\!\left(\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v^*]\right) + o_P(1) \Rightarrow N\!\left(0, \sigma^2_*\right),
\]
with
\[
\sigma^2_* \equiv E\left\{\left(\frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v^*]\right)^2\right\} = (v^*_\theta)^T E\left[S^T_{\theta_0} S_{\theta_0}\right](v^*_\theta) = \lambda^T (V_*)^{-1} I_* (V_*)^{-1}\lambda.
\]
Thus, assumptions 3.2(i), 3.7, and 3.9 together imply that $\sigma^2_* < \infty$ and
\[
\sqrt{n+n_a}\, \lambda^T(\hat\theta_n - \theta_0) = \sqrt{n+n_a}\, \langle\hat\alpha - \alpha_0, v^*\rangle_2 + o_P(1) \Rightarrow N\!\left(0, \sigma^2_*\right).
\]
To complete the proof, it remains to establish (A.10)–(A.12). Notice that (A.10) is implied by the Chebyshev inequality, i.i.d. data, and assumptions 3.10 and 3.13. For (A.11) and (A.12) we notice that
\begin{align*}
r[Z_t;\hat\alpha,\alpha_0] - r[Z_t;\hat\alpha \pm \varepsilon_n v^*_n,\alpha_0] &= \ell(Z_t;\hat\alpha) - \ell(Z_t;\hat\alpha \pm \varepsilon_n v^*_n) - \frac{d\ell(Z_t;\alpha_0)}{d\alpha}[\mp\varepsilon_n v^*_n] \\
&= \mp\varepsilon_n \times \left(\frac{d\ell(Z_t;\tilde\alpha)}{d\alpha}[v^*_n] - \frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v^*_n]\right) \\
&= \mp\varepsilon_n \times \left(\frac{d^2\ell(Z_t;\bar\alpha)}{d\alpha\, d\alpha^T}[\tilde\alpha - \alpha_0, v^*_n]\right),
\end{align*}
in which $\tilde\alpha \in \mathcal{N}_{0n}$ lies between $\hat\alpha$ and $\hat\alpha \pm \varepsilon_n v^*_n$, and $\bar\alpha \in \mathcal{N}_0$ lies between $\tilde\alpha$ and $\alpha_0$. Therefore, for (A.11), by the definition of the inner product $\langle\cdot,\cdot\rangle_2$, we have:
\begin{align*}
E\left(r[Z_t;\hat\alpha,\alpha_0] - r[Z_t;\hat\alpha \pm \varepsilon_n v^*_n,\alpha_0]\right) &= \mp\varepsilon_n \times E\left(\frac{d^2\ell(Z_t;\bar\alpha)}{d\alpha\, d\alpha^T}[\tilde\alpha - \alpha_0, v^*_n]\right) \\
&= \pm\varepsilon_n \times \langle\tilde\alpha - \alpha_0, v^*_n\rangle_2 \\
&\quad \mp \varepsilon_n \times E\left(\frac{d^2\ell(Z_t;\bar\alpha)}{d\alpha\, d\alpha^T}[\tilde\alpha - \alpha_0, v^*_n] - \frac{d^2\ell(Z_t;\alpha_0)}{d\alpha\, d\alpha^T}[\tilde\alpha - \alpha_0, v^*_n]\right) \\
&= \pm\varepsilon_n \times \langle\hat\alpha - \alpha_0, v^*_n\rangle_2 \pm \varepsilon_n \times \langle\tilde\alpha - \hat\alpha, v^*_n\rangle_2 + o_P\!\left(\frac{\varepsilon_n}{\sqrt{n+n_a}}\right) \\
&= \pm\varepsilon_n \times \langle\hat\alpha - \alpha_0, v^*\rangle_2 + O_P(\varepsilon_n^2) + o_P\!\left(\frac{\varepsilon_n}{\sqrt{n+n_a}}\right),
\end{align*}
in which the last two equalities hold due to the definition of $\tilde\alpha$, assumptions 3.10 and 3.12, and
\[
\langle\hat\alpha - \alpha_0, v^*_n - v^*\rangle_2 = o_P\!\left(\frac{1}{\sqrt{n+n_a}}\right) \quad \text{and} \quad \|v^*_n\|^2_2 \to \|v^*\|^2_2 < \infty.
\]
Hence, (A.11) is satisfied. For (A.12), we notice
\[
\mu_n\left(r[Z_t;\hat\alpha,\alpha_0] - r[Z_t;\hat\alpha \pm \varepsilon_n v^*_n,\alpha_0]\right) = \mp\varepsilon_n \times \mu_n\!\left(\frac{d\ell(Z_t;\tilde\alpha)}{d\alpha}[v^*_n] - \frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v^*_n]\right),
\]
in which $\tilde\alpha \in \mathcal{N}_{0n}$ lies between $\hat\alpha$ and $\hat\alpha \pm \varepsilon_n v^*_n$. Since the class $\left\{\frac{d\ell(Z_t;\tilde\alpha)}{d\alpha}[v^*_n] : \tilde\alpha \in \mathcal{A}_{0s}\right\}$ is Donsker under assumptions 3.1, 3.2, 3.6, and 3.7, and since
\[
E\left\{\left(\frac{d\ell(Z_t;\tilde\alpha)}{d\alpha}[v^*_n] - \frac{d\ell(Z_t;\alpha_0)}{d\alpha}[v^*_n]\right)^2\right\} = E\left\{\left(\frac{d^2\ell(Z_t;\bar\alpha)}{d\alpha\, d\alpha^T}[\tilde\alpha - \alpha_0, v^*_n]\right)^2\right\}
\]
goes to zero as $\|\tilde\alpha - \alpha_0\|_s$ goes to zero under assumption 3.11, (A.12) holds.
For the sake of completeness, we write down the expressions of $\frac{d\ell_p(Z;\theta_0,f_{01},f_{02})}{d\alpha}[\alpha - \alpha_0]$ and $\frac{d\ell_a(Z;f_{01a},f_{02a})}{d\alpha}[\alpha - \alpha_0]$ that are needed in the calculation of the Riesz representer and the asymptotic efficient variance of the sieve MLE $\hat\theta$ in subsection 3.1.4:
\begin{align*}
&f_{X,Y|W}(X,Y|W;\theta_0,f_{01},f_{02}) \times \frac{d\ell_p(Z;\theta_0,f_{01},f_{02})}{d\alpha}[\alpha - \alpha_0] \\
&= \int_{\mathcal{X}^*} f_{01}(X|x^*)\, \frac{dg(Y|x^*,W;\theta_0)}{d\theta^T}\, f_{02}(x^*|W)\, dx^*\, [\theta - \theta_0] \\
&\quad + \int_{\mathcal{X}^*} \left[f_1(X|x^*) - f_{01}(X|x^*)\right] g(Y|x^*,W;\theta_0)\, f_{02}(x^*|W)\, dx^* \\
&\quad + \int_{\mathcal{X}^*} f_{01}(X|x^*)\, g(Y|x^*,W;\theta_0)\, \left[f_2(x^*|W) - f_{02}(x^*|W)\right] dx^*,
\end{align*}
and
\begin{align*}
&f_{X_a,Y_a|W_a}(X_a,Y_a|W_a;f_{01a},f_{02a}) \times \frac{d\ell_a(Z;f_{01a},f_{02a})}{d\alpha}[\alpha - \alpha_0] \\
&= \int_{\mathcal{X}^*} f_{01a}(X|x^*)\, \frac{dg(Y|x^*,W;\theta_0)}{d\theta^T}\, f_{02a}(x^*|W_a)\, dx^*\, [\theta - \theta_0] \\
&\quad + \int_{\mathcal{X}^*} \left[f_{1a}(X|x^*) - f_{01a}(X|x^*)\right] g(Y|x^*,W;\theta_0)\, f_{02a}(x^*|W_a)\, dx^* \\
&\quad + \int_{\mathcal{X}^*} f_{01a}(X|x^*)\, g(Y|x^*,W;\theta_0)\, \left[f_{2a}(x^*|W_a) - f_{02a}(x^*|W_a)\right] dx^*.
\end{align*}
Proof: (Theorem 3.4) Under the stated assumptions, we have, for model $j = 1, 2$,
\[
\frac{1}{\sqrt{n+n_a}} \sum_{t=1}^{n+n_a} \Big( \{\ell_j(Z_t;\hat\alpha_j) - \ell_j(Z_t;\alpha_{0j})\} - E\{\ell_j(Z_t;\hat\alpha_j) - \ell_j(Z_t;\alpha_{0j})\} \Big) = o_P(1),
\]
and
\[
E\{\ell_j(Z_t;\hat\alpha_j) - \ell_j(Z_t;\alpha_{0j})\} \asymp \|\hat\alpha_j - \alpha_{0j}\|^2_2 = o_P\!\left(\frac{1}{\sqrt{n+n_a}}\right);
\]
thus
\begin{align*}
&\frac{1}{\sqrt{n+n_a}} \sum_{t=1}^{n+n_a} \left(\{\ell_j(Z_t;\hat\alpha_j) - E[\ell_j(Z_t;\alpha_{0j})]\}\right) \\
&= \frac{1}{\sqrt{n+n_a}} \sum_{t=1}^{n+n_a} \left(\{\ell_j(Z_t;\hat\alpha_j) - \ell_j(Z_t;\alpha_{0j})\} - E\{\ell_j(Z_t;\hat\alpha_j) - \ell_j(Z_t;\alpha_{0j})\}\right) \\
&\quad + \frac{1}{\sqrt{n+n_a}} \sum_{t=1}^{n+n_a} \{\ell_j(Z_t;\alpha_{0j}) - E[\ell_j(Z_t;\alpha_{0j})]\} + \sqrt{n+n_a}\, E\{\ell_j(Z_t;\hat\alpha_j) - \ell_j(Z_t;\alpha_{0j})\} \\
&= \frac{1}{\sqrt{n+n_a}} \sum_{t=1}^{n+n_a} \{\ell_j(Z_t;\alpha_{0j}) - E[\ell_j(Z_t;\alpha_{0j})]\} + o_P(1).
\end{align*}
Under the stated conditions, it is immediate that $\hat\sigma^2 = \sigma^2 + o_P(1)$. Suppose models 1 and 2 are non-nested; then $\sigma > 0$. Thus,
\[
\frac{1}{\hat\sigma\sqrt{n+n_a}} \sum_{t=1}^{n+n_a} \Big( \{\ell_2(Z_t;\hat\alpha_2) - \ell_1(Z_t;\hat\alpha_1)\} - E\{\ell_2(Z_t;\alpha_{02}) - \ell_1(Z_t;\alpha_{01})\} \Big) \stackrel{d}{\to} N(0,1).
\]
REFERENCES

Ai, C., and X. Chen (2003): "Efficient Estimation of Models with Conditional Moment Restrictions Containing Unknown Functions," Econometrica 71, 1795–1843.

Ai, C., and X. Chen (2007): "Estimation of Possibly Misspecified Semiparametric Conditional Moment Restriction Models with Different Conditioning Variables," forthcoming in Journal of Econometrics.

Amemiya, Y., and W. A. Fuller (1988): "Estimation for the Nonlinear Functional Relationship," Annals of Statistics 16, 147–160.

Blundell, R., X. Chen, and D. Kristensen (2007): "Semiparametric Engel Curves with Endogenous Expenditure," forthcoming in Econometrica.

Bound, J., C. Brown, and N. Mathiowetz (2001): "Measurement Error in Survey Data," in Handbook of Econometrics, vol. 5, ed. by J. J. Heckman and E. Leamer, Elsevier Science.

Buzas, J., and L. Stefanski (1996): "Instrumental Variable Estimation in Generalized Linear Measurement Error Models," Journal of the American Statistical Association 91, 999–1006.

Carrasco, M., J.-P. Florens, and E. Renault (2006): "Linear Inverse Problems and Structural Econometrics: Estimation Based on Spectral Decomposition and Regularization," in Handbook of Econometrics, vol. 6, ed. by J. J. Heckman and E. Leamer, Elsevier Science.

Carroll, R. J., D. Ruppert, C. Crainiceanu, T. Tosteson, and R. Karagas (2004): "Nonlinear and Nonparametric Regression and Instrumental Variables," Journal of the American Statistical Association 99, 736–750.

Carroll, R. J., D. Ruppert, and L. A. Stefanski (1995): Measurement Error in Nonlinear Models: A Modern Perspective. New York: Chapman & Hall.

Carroll, R. J., D. Ruppert, L. A. Stefanski, and C. Crainiceanu (2006): Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. Chapman & Hall/CRC.

Carroll, R. J., and L. A. Stefanski (1990): "Approximate Quasi-likelihood Estimation in Models with Surrogate Predictors," Journal of the American Statistical Association 85, 652–663.

Carroll, R. J., and M. P. Wand (1991): "Semiparametric Estimation in Logistic Measurement Error Models," Journal of the Royal Statistical Society B 53, 573–585.

Chen, X. (2006): "Large Sample Sieve Estimation of Semi-nonparametric Models," in Handbook of Econometrics, vol. 6, ed. by J. J. Heckman and E. Leamer, Elsevier Science.

Chen, X., H. Hong, and E. Tamer (2005): "Measurement Error Models with Auxiliary Data," Review of Economic Studies 72, 343–366.

Chen, X., H. Hong, and A. Tarozzi (2007): "Semiparametric Efficiency in GMM Models with Nonclassical Measurement Error," forthcoming in Annals of Statistics.

Cheng, C. L., and J. W. Van Ness (1999): Statistical Regression with Measurement Error. London: Arnold.

Chernozhukov, V., G. Imbens, and W. Newey (2007): "Instrumental Variable Identification and Estimation of Nonseparable Models via Quantile Conditions," Journal of Econometrics.

Dunford, N., and J. T. Schwartz (1971): Linear Operators. New York: John Wiley & Sons.

Fan, J. (1991): "On the Optimal Rates of Convergence for Nonparametric Deconvolution Problems," Annals of Statistics 19, 1257–1272.

Fuller, W. (1987): Measurement Error Models. New York: John Wiley & Sons.

Hausman, J., H. Ichimura, W. Newey, and J. Powell (1991): "Identification and Estimation of Polynomial Errors-in-Variables Models," Journal of Econometrics 50, 273–295.

Hoeffding, W. (1977): "Some Incomplete and Boundedly Complete Families of Distributions," Annals of Statistics 5, 278–291.

Hong, H., and E. Tamer (2003): "A Simple Estimator for Nonlinear Error in Variable Models," Journal of Econometrics 117(1), 1–19.

Horowitz, J., and M. Markatou (1996): "Semiparametric Estimation of Regression Models for Panel Data," Review of Economic Studies 63, 145–168.

Hsiao, C. (1989): "Consistent Estimation for Some Nonlinear Errors-in-Variables Models," Journal of Econometrics 41, 159–185.

Hu, Y. (2006): "Identification and Estimation of Nonlinear Models with Misclassification Error Using Instrumental Variables," working paper, University of Texas at Austin.

Hu, Y., and G. Ridder (2006): "Estimation of Nonlinear Models with Measurement Error Using Marginal Information," working paper, University of Southern California.

Hu, Y., and S. M. Schennach (2006): "Identification and Estimation of Nonclassical Nonlinear Errors-in-Variables Models with Continuous Distributions Using Instruments," Cemmap working paper, Centre for Microdata Methods and Practice.

Ichimura, H., and E. Martinez-Sanchis (2006): "Identification and Estimation of GMM Models by Combining Two Data Sets," working paper, University College London.

Lee, L.-F., and J. H. Sepanski (1995): "Estimation of Linear and Nonlinear Errors-in-Variables Models Using Validation Data," Journal of the American Statistical Association 90(429).

Lehmann, E. L. (1986): Testing Statistical Hypotheses, 2nd ed. New York: Wiley.

Lewbel, A. (2007): "Estimation of Average Treatment Effects with Misclassification," Econometrica 75, 537–551.

Li, T., and Q. Vuong (1998): "Nonparametric Estimation of the Measurement Error Model Using Multiple Indicators," Journal of Multivariate Analysis 65, 139–165.

Li, T. (2002): "Robust and Consistent Estimation of Nonlinear Errors-in-Variables Models," Journal of Econometrics 110, 1–26.

Liang, H., W. Härdle, and R. Carroll (1999): "Estimation in a Semiparametric Partially Linear Errors-in-Variables Model," Annals of Statistics 27(5), 1519–1535.

Mattner, L. (1993): "Some Incomplete but Boundedly Complete Location Families," Annals of Statistics 21, 2158–2162.

Murphy, S. A., and A. W. van der Vaart (1996): "Likelihood Inference in the Errors-in-Variables Model," Journal of Multivariate Analysis 59(1), 81–108.

Newey, W. (2001): "Flexible Simulated Moment Estimation of Nonlinear Errors-in-Variables Models," Review of Economics and Statistics 83, 616–627.

Newey, W., and J. Powell (2003): "Instrumental Variables Estimation of Nonparametric Models," Econometrica 71, 1557–1569.

Ridder, G., and R. Moffitt (2006): "The Econometrics of Data Combination," in Handbook of Econometrics, vol. 6, ed. by J. J. Heckman and E. Leamer, Elsevier Science.

Schennach, S. (2004): "Estimation of Nonlinear Models with Measurement Error," Econometrica 72(1), 33–76.

Shen, X. (1997): "On Methods of Sieves and Penalization," Annals of Statistics 25, 2555–2591.

Shen, X., and W. Wong (1994): "Convergence Rate of Sieve Estimates," Annals of Statistics 22, 580–615.

Taupin, M. L. (2001): "Semi-parametric Estimation in the Nonlinear Structural Errors-in-Variables Model," Annals of Statistics 29, 66–93.

Van de Geer, S. (1993): "Hellinger-Consistency of Certain Nonparametric Maximum Likelihood Estimators," Annals of Statistics 21, 14–44.

Van de Geer, S. (2000): Empirical Processes in M-Estimation. Cambridge: Cambridge University Press.

Vuong, Q. (1989): "Likelihood Ratio Tests for Model Selection and Non-nested Hypotheses," Econometrica 57, 307–333.

Wang, L. (2004): "Estimation of Nonlinear Models with Berkson Measurement Errors," Annals of Statistics 32(6), 2559–2579.

Wang, L., and C. Hsiao (1995): "Simulation-Based Semiparametric Estimation of Nonlinear Errors-in-Variables Models," working paper, University of Southern California.

Wang, N., X. Lin, R. Gutierrez, and R. Carroll (1998): "Bias Analysis and SIMEX Approach in Generalized Linear Mixed Measurement Error Models," Journal of the American Statistical Association 93(441), 249–261.

Wansbeek, T., and E. Meijer (2000): Measurement Error and Latent Variables in Econometrics. Amsterdam: North-Holland.

White, H. (1982): "Maximum Likelihood Estimation of Misspecified Models," Econometrica 50, 143–161.
Department of Economics
Yale University
E-mail: [email protected]

Department of Economics
Johns Hopkins University
E-mail: [email protected]