
Challenges in Fiducial Inference

Parts of this talk are joint work with T. C. M. Lee (UC Davis), Randy Lai (U of Maine), H. Iyer (NIST), J. Williams, and Y. Cui (UNC).

BFF 2017

Jan Hannig

University of North Carolina at Chapel Hill

NSF support acknowledged.


Outline

Introduction

Definition

Sparsity

Regularization

Conclusions


Introduction

Fiducial?

▶ Oxford English Dictionary
  ▶ adjective, technical: (of a point or line) used as a fixed basis of comparison.
  ▶ Origin: from Latin fiducia ‘trust, confidence’.
▶ Merriam-Webster dictionary
  1. taken as a standard of reference: a fiducial mark
  2. founded on faith or trust
  3. having the nature of a trust: fiduciary


Aim of this talk

▶ Explain the definition of the generalized fiducial distribution (GFD).
▶ The challenge of extra information:
  ▶ Sparsity
  ▶ Regularization
▶ My point of view is frequentist:
  ▶ Justified using asymptotic theorems and simulations.
  ▶ GFI tends to work well.


Definition

Comparison to likelihood

▶ Density is the function $f(x, \xi)$, where $\xi$ is fixed and $x$ is variable.
▶ Likelihood is the function $f(x, \xi)$, where $\xi$ is variable and $x$ is fixed.
▶ Likelihood as a distribution?



General Definition

▶ Data generating equation $X = G(U, \xi)$, e.g. $X_i = \mu + \sigma U_i$.
▶ A distribution on the parameter space is a generalized fiducial distribution if it can be obtained as a limit (as $\varepsilon \downarrow 0$) of

  $\arg\min_\xi \|x - G(U^\star, \xi)\| \;\Big|\; \left\{\min_\xi \|x - G(U^\star, \xi)\| \le \varepsilon\right\} \qquad (1)$

▶ Similar to ABC, with generation from the prior replaced by minimization.
▶ Is this practical? Can we compute it? (A sketch follows.)
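As a concrete reading of (1), here is a minimal accept/reject sketch for the normal location–scale example $X_i = \mu + \sigma U_i$, $U_i \sim N(0,1)$. The tolerance `eps`, the tiny sample size, and the crude handling of the $\sigma > 0$ constraint are illustrative assumptions; the scheme is only practical for very small $n$, which is exactly why the explicit density on the next slide matters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=4)      # tiny observed sample (mu=2, sigma=1.5)

def argmin_fit(u_star, x):
    """arg min over (mu, sigma) of ||x - (mu + sigma * u_star)||: a least-squares fit."""
    A = np.column_stack([np.ones_like(u_star), u_star])
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    return coef, np.linalg.norm(x - A @ coef)

eps, draws = 1.0, []
while len(draws) < 500:
    u_star = rng.normal(size=x.size)            # independent copy U* of the pivot
    (mu, sigma), resid = argmin_fit(u_star, x)
    if resid <= eps and sigma > 0:              # keep the arg-min only when eps-close
        draws.append((mu, sigma))

print(np.mean(draws, axis=0))                   # approximate fiducial mean of (mu, sigma)
```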



Explicit limit (1)

▶ Assume $X \in \mathbb{R}^n$ is continuous; parameter $\xi \in \mathbb{R}^p$.
▶ The limit in (1) has density (H, Iyer, Lai & Lee, 2016)

  $r(\xi \mid x) = \dfrac{f_X(x \mid \xi)\, J(x, \xi)}{\int_\Xi f_X(x \mid \xi')\, J(x, \xi')\, d\xi'}$,

  where $J(x, \xi) = D\!\left(\frac{d}{d\xi} G(u, \xi)\big|_{u = G^{-1}(x,\xi)}\right)$.

▶ $n = p$ gives $D(A) = |\det A|$.
▶ $\|\cdot\|_2$ gives $D(A) = (\det A^\top A)^{1/2}$. Compare to Fraser, Reid, Marras & Yi (2010).
▶ $\|\cdot\|_\infty$ gives $D(A) = \sum_{i=(i_1,\dots,i_p)} |\det (A)_i|$, summing over the $p \times p$ row-submatrices $(A)_i$.
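As a sanity check on these functionals, the sketch below evaluates $D(A)$ for a random $n \times p$ matrix; by Cauchy–Binet the $\|\cdot\|_2$ version is the square root of the sum of squared $p \times p$ subdeterminants, while the $\|\cdot\|_\infty$ version sums their absolute values.

```python
import itertools
import numpy as np

def D_l2(A):
    """||.||_2 functional: (det A^T A)^{1/2}."""
    return float(np.sqrt(np.linalg.det(A.T @ A)))

def D_sup(A):
    """||.||_inf functional: sum of |det| over all p x p row-submatrices (A)_i."""
    n, p = A.shape
    return sum(abs(np.linalg.det(A[list(rows), :]))
               for rows in itertools.combinations(range(n), p))

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))
print(D_l2(A), D_sup(A))   # with n = p both would reduce to |det A|
```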



Example -- Linear Regression

▶ Data generating equation $Y = X\beta + \sigma Z$.
▶ $\frac{d}{d\theta} Y = (X, Z)$ and $Z = (Y - X\beta)/\sigma$.
▶ The $L^2$ Jacobian is

  $J(y, \beta, \sigma) = \left(\det\!\left( \left(X, \tfrac{y - X\beta}{\sigma}\right)^{\!\top} \left(X, \tfrac{y - X\beta}{\sigma}\right) \right)\right)^{1/2} = \sigma^{-1}\, |\det(X^\top X)|^{1/2}\, (\mathrm{RSS})^{1/2}$

▶ The fiducial distribution happens to coincide with the independence-Jeffreys posterior, with an explicit normalizing constant.
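Since the GFD coincides with the independence-Jeffreys posterior here, it can be sampled in closed form. A minimal sketch using the standard decomposition $\sigma^2 \mid y \sim \mathrm{RSS}/\chi^2_{n-p}$ and $\beta \mid \sigma, y \sim N(\hat\beta, \sigma^2 (X^\top X)^{-1})$ for that posterior; the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.8 * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ (X.T @ y)
rss = np.sum((y - X @ beta_hat) ** 2)

B = 5000
sigma2 = rss / rng.chisquare(df=n - p, size=B)        # sigma^2 | y ~ RSS / chi^2_{n-p}
beta = beta_hat + np.sqrt(sigma2)[:, None] * \
       rng.multivariate_normal(np.zeros(p), XtX_inv, size=B)
# beta | sigma, y ~ N(beta_hat, sigma^2 (X^T X)^{-1})

print(np.quantile(beta, [0.025, 0.975], axis=0))      # 95% fiducial intervals for beta
```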



Example -- Uniform(θ, θ²)

▶ $X_i$ i.i.d. $U(\theta, \theta^2)$, $\theta > 1$.
▶ Data generating equation $X_i = \theta + (\theta^2 - \theta) U_i$, $U_i \sim U(0,1)$.
▶ Compute the Jacobian: $\frac{d}{d\theta}\left[\theta + (\theta^2 - \theta) U_i\right] = 1 + (2\theta - 1) U_i$, with $U_i = \frac{X_i - \theta}{\theta^2 - \theta}$.
▶ Using $\|\cdot\|_\infty$ we have $J(x, \theta) = n\,\dfrac{\bar{x}(2\theta - 1) - \theta^2}{\theta^2 - \theta}$.
▶ Reference prior $\pi(\theta) = \dfrac{e^{\psi(2\theta/(2\theta-1))}(2\theta - 1)}{\theta(\theta - 1)}$, Berger, Bernardo & Sun (2009) – complicated to derive.
▶ In simulations the fiducial solution was marginally better than the reference prior, which in turn was much better than the flat prior.
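Because $\theta$ is scalar, the GFD $r(\theta \mid x) \propto f_X(x \mid \theta)\, J(x, \theta)$ can simply be evaluated on a grid and normalized numerically. A sketch under the formulas above; the grid and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n = 3.0, 30
x = rng.uniform(theta0, theta0 ** 2, size=n)

def gfd_density(theta, x):
    """Unnormalized r(theta|x) = likelihood times the sup-norm Jacobian above."""
    n = x.size
    lik = np.where((theta < x.min()) & (x.max() < theta ** 2),
                   (theta ** 2 - theta) ** -n, 0.0)
    jac = n * (x.mean() * (2 * theta - 1) - theta ** 2) / (theta ** 2 - theta)
    return lik * jac

grid = np.linspace(1.01, 6.0, 4000)
r = gfd_density(grid, x)
r /= np.trapz(r, grid)                     # normalize on the grid
print(grid[np.argmax(r)])                  # fiducial mode, close to theta0
```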



Important Simple Observations

▶ GFD is always proper.
▶ GFD is invariant to re-parametrizations (same as Jeffreys).
▶ GFD is not invariant to smooth transformations of the data if $n > p$.
▶ It does not satisfy the likelihood principle.



Various Asymptotic Results

$r(\xi \mid x) \propto f_X(x \mid \xi)\, J(x, \xi)$, where $J(x, \xi) = D\!\left(\frac{d}{d\xi} G(u, \xi)\big|_{u = G^{-1}(x,\xi)}\right)$

▶ Most results start with $C_n^{-1} J(x, \xi) \to J(\xi_0, \xi)$.
▶ A Bernstein–von Mises theorem for fiducial distributions provides asymptotic correctness of fiducial CIs: H (2009, 2013), Sonderegger & H (2013).
▶ Consistency of model selection: H & Lee (2009), Lai, H & Lee (2015), H, Iyer, Lai & Lee (2016).
▶ Regular higher-order asymptotics in Pal Majumdar & H (2016+).

Sparsity

Model Selection

▶ $X = G(M, \xi_M, U)$, $M \in \mathcal{M}$, $\xi_M \in \Xi_M$.

Theorem (H, Iyer, Lai, Lee 2016): Under assumptions,

  $r(M \mid y) \propto q^{|M|} \int_{\Xi_M} f_M(y, \xi_M)\, J_M(y, \xi_M)\, d\xi_M$

▶ A penalty is needed – in the fiducial framework it enters through additional equations $0 = P_k$, $k = 1, \dots, \min(|M|, n)$.
▶ Default value $q = n^{-1/2}$ (motivated by MDL).



Alternative to penalty

▶ The penalty is used to discourage models with many parameters.
▶ The real issue is not too many parameters as such, but that a smaller model can do almost the same job:

  $r(M \mid y) \propto \int_{\Xi_M} f_M(y, \xi_M)\, J_M(y, \xi_M)\, h_M(\xi_M)\, d\xi_M$,

  $h_M(\xi_M) = \begin{cases} 0 & \text{if a smaller model predicts nearly as well} \\ 1 & \text{otherwise} \end{cases}$

▶ Motivated by the non-local priors of Johnson & Rossell (2009).



Regression

▶ $Y = X\beta + \sigma Z$
▶ First idea: $h_M(\beta_M) = I_{\{|\beta_i| > \epsilon,\ i \in M\}}$ – issue: collinearity.
▶ Better (see the sketch below):

  $h_M(\beta_M) := I_{\left\{\frac{1}{2}\|X^\top(X_M\beta_M - X b_{\min})\|_2^2 \,\ge\, \varepsilon(n, |M|)\right\}}$

  where $b_{\min}$ solves

  $\min_{b \in \mathbb{R}^p} \tfrac{1}{2}\|X^\top(X_M\beta_M - Xb)\|_2^2 \quad \text{subject to } \|b\|_0 \le |M| - 1.$

▶ Algorithm: Bertsimas et al. (2016).
▶ Similar to the Dantzig selector of Candès & Tao (2007), with a different norm and target.
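For small $p$, $b_{\min}$ can be found by exhaustive search over supports rather than the mixed-integer optimization of Bertsimas et al.; a minimal sketch (the threshold `eps` is left as an input, and the function names are mine):

```python
import itertools
import numpy as np

def b_min_value(X, cols_M, beta_M, k):
    """min_b 0.5||X^T(X_M beta_M - X b)||^2 s.t. ||b||_0 <= k, by support enumeration."""
    target = X.T @ (X[:, cols_M] @ beta_M)          # X^T X_M beta_M
    best = 0.5 * np.linalg.norm(target) ** 2        # b = 0 is always feasible
    G = X.T @ X
    for size in range(1, k + 1):
        for S in itertools.combinations(range(X.shape[1]), size):
            A = G[:, list(S)]                       # columns of X^T X on support S
            b_S, *_ = np.linalg.lstsq(A, target, rcond=None)
            best = min(best, 0.5 * np.linalg.norm(target - A @ b_S) ** 2)
    return best

def h_M(X, cols_M, beta_M, eps):
    """1 if no model with fewer than |M| predictors reproduces X_M beta_M nearly as well."""
    return float(b_min_value(X, cols_M, beta_M, k=len(cols_M) - 1) >= eps)

# Hypothetical usage: h_M(X, [0, 2], np.array([1.0, -1.0]), eps=5.0)
```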



GFD

$r(M \mid y) \propto \pi^{|M|/2}\, \Gamma\!\left(\tfrac{n - |M|}{2}\right)\, \mathrm{RSS}_M^{-\frac{n - |M| - 1}{2}}\, E\!\left[h^\varepsilon_M(\beta^\star_M)\right]$

Observations:

▶ The expectation is taken with respect to the within-model GFD (the usual $t$); see the sketch below.
▶ $r(M \mid y)$ is negligibly small for large models because of $h$; e.g., $|M| > n$ implies $r(M \mid y) = 0$.
▶ Implemented using Grouped Independence Metropolis–Hastings (Andrieu & Roberts, 2009).
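The explicit factor above is directly computable, and $E[h^\varepsilon_M(\beta^\star_M)]$ can be estimated by Monte Carlo. A sketch reusing `h_M` from the previous block, under the assumption that within-model draws follow the normal-times-$\mathrm{RSS}/\chi^2$ structure of the regression GFD:

```python
import numpy as np
from scipy.special import gammaln

def log_r_M(y, X, cols_M, eps, B=500, seed=0):
    """log r(M|y) up to a model-independent constant, via the closed form above."""
    rng = np.random.default_rng(seed)
    n, k = y.size, len(cols_M)
    XM = X[:, cols_M]
    XtX_inv = np.linalg.inv(XM.T @ XM)
    beta_hat = XtX_inv @ (XM.T @ y)
    rss = float(np.sum((y - XM @ beta_hat) ** 2))

    # explicit factor: pi^{|M|/2} * Gamma((n-|M|)/2) * RSS_M^{-(n-|M|-1)/2}
    log_c = 0.5 * k * np.log(np.pi) + gammaln(0.5 * (n - k)) \
            - 0.5 * (n - k - 1) * np.log(rss)

    # Monte Carlo estimate of E[h_M^eps(beta*)] under the within-model GFD
    hits = 0.0
    for s2 in rss / rng.chisquare(n - k, size=B):
        beta_star = rng.multivariate_normal(beta_hat, s2 * XtX_inv)
        hits += h_M(X, cols_M, beta_star, eps)   # h_M from the previous sketch
    return log_c + np.log(max(hits / B, 1e-300))
```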



Main Result

Theorem (Williams & H, 2017+). Suppose the true model is $M_T$. Then under certain conditions, for a fixed positive constant $\alpha < 1$,

  $r(M_T \mid y) = \dfrac{r(M_T \mid y)}{\sum_{j=1}^{n^\alpha} \sum_{M:\,|M| = j} r(M \mid y)} \;\xrightarrow{P}\; 1 \quad \text{as } n, p \to \infty.$


Some Conditions

▶ Number of predictors: $\liminf_{n,p\to\infty} \dfrac{n^{1-\alpha}}{\log(p)} > 2$.
▶ For the true model/parameter: $p_T < \log(n)^\gamma$ and

  $\varepsilon_{M_T}(n, p) \le \tfrac{1}{18}\|X^\top(\mu_T - X b_{\min})\|_2^2$,

  where $b_{\min}$ minimizes the norm subject to $\|b\|_0 \le p_T - 1$.
▶ For a large model, $|M| > p_T$, and $n$ or $p$ large enough,

  $\tfrac{9}{2}\|X^\top(H_M - H_{M(-1)})\mu_T\|_2^2 < \varepsilon_M(n, p)$,

  where $H_M$ and $H_{M(-1)}$ are the projection matrices for $M$ and for $M$ with a covariate removed, respectively.



Simulation

▶ Setup from Rockova & George (2015).
▶ $n = 100$, $p = 1000$, $p_T = 8$.
▶ Columns of $X$ either (a) independent or (b) correlated with $\rho = 0.6$.
▶ $\varepsilon_M(n, p) = \Lambda_M\, \hat\sigma^2_M \left(\dfrac{n^{0.51}}{9} + \dfrac{|M| \log(p\pi)^{1.1}}{9} - \log(n)^\gamma\right)_{\!+}$ with $\gamma = 1.45$.
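A sketch of the design and threshold as read off this slide; the equicorrelated form of the dependent design and the `lambda_M`, `sigma2_hat` inputs are assumptions on my part.

```python
import numpy as np

def make_X(n=100, p=1000, rho=0.0, seed=0):
    """Design matrix with equicorrelated columns (assumed form of the rho = 0.6 case)."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, p))
    z0 = rng.normal(size=(n, 1))
    return np.sqrt(1 - rho) * Z + np.sqrt(rho) * z0   # corr(col_j, col_k) = rho

def eps_M(n, p, m, lambda_M, sigma2_hat, gamma=1.45):
    """Threshold eps_M(n, p) from the slide; (.)_+ is the positive part."""
    core = n ** 0.51 / 9 + m * np.log(p * np.pi) ** 1.1 / 9 - np.log(n) ** gamma
    return lambda_M * sigma2_hat * max(core, 0.0)
```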


Highlight of simulation results

▶ See Jon Williams' poster for details on the theory and simulations.
▶ When the columns of $X$ are independent, the procedure usually selects the correct model.
▶ When $X$ is correlated, it usually selects too small a model.
  ▶ The conditions of the Theorem are violated.
  ▶ Guided by the conditions, decreasing $p$ to 500 so that they hold improves performance.



Comments

▶ Is there a standardized way of measuring closeness in other models?
▶ What if a small model is not the right target, e.g., gene interactions?


Regularization

Recall

▶ A distribution on the parameter space is a generalized fiducial distribution if it can be obtained as a limit (as $\varepsilon \downarrow 0$) of

  $\arg\min_\xi \|x - G(U^\star, \xi)\| \;\Big|\; \left\{\min_\xi \|x - G(U^\star, \xi)\| \le \varepsilon\right\}$

▶ Conditioning $U^\star$ on $\{x = G(U^\star, \xi)\}$ – "regularization by the model".



Most general iid model

▶ Data generating equation:

  $X_i = F^{-1}(U_i)$, $U_i$ i.i.d. Uniform(0,1)

▶ Inverting (solving for $F$) we get

  $F^*(x_i^-) \le U_i^* \le F^*(x_i)$.

  There is a solution iff the order of the $U_i^*$ matches the order of the $x_i$.

[Figure: sorted data points $x_i$ with the interval constraints $F^*(x_i^-) \le U_i^\star \le F^*(x_i)$ and the resulting lower and upper step-function envelopes for $F^*$.]

▶ See Yifan Cui's poster for an extension to censored data.
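A sketch of the sampler this display suggests: draw $U^\star$ as sorted uniforms so that its order matches the sorted data, which yields, for any fixed $t$, the fiducial band $F(t) \in [u_{(i)}, u_{(i+1)}]$ with $i = \#\{x_j \le t\}$. The data and evaluation point are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.exponential(scale=2.0, size=15))     # observed data, sorted

def fiducial_F_at(t, x, rng, B=2000):
    """Fiducial draws of the band F(t) in [u_(i), u_(i+1)], i = #{x_j <= t}."""
    i = int(np.searchsorted(x, t, side="right"))
    u = np.sort(rng.uniform(size=(B, x.size)), axis=1)   # U* sorted to match the data
    lo = u[:, i - 1] if i > 0 else np.zeros(B)           # convention u_(0) = 0
    hi = u[:, i] if i < x.size else np.ones(B)           # convention u_(n+1) = 1
    return lo, hi

lo, hi = fiducial_F_at(2.0, x, rng)
print(np.quantile(lo, 0.025), np.quantile(hi, 0.975))    # conservative band for F(2)
```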


Additional Constraints

▶ Location–scale family with known density $f(x)$ and cdf $F(x)$, e.g., $N(\mu, \sigma^2)$.
▶ Condition $U_i^\star$ on the existence of $\mu^\star, \sigma^\star$ such that

  $F\big(\sigma^{\star-1}(x_i - \mu^\star)\big) = U_i^\star$ for all $i$.

[Figure: the interval constraints on $U_i^\star$ together with a fitted $N(4.5, 3^2)$ cdf passing through them.]

▶ The GFD is $r(\mu, \sigma) \propto \sigma^{-1} \prod_{i=1}^n \sigma^{-1} f\big(\sigma^{-1}(x_i - \mu)\big)$.



Constraint complications

Toy example: $X = \mu + Z$ with $Z \sim N(0,1)$ and $\mu > 0$.

▶ Option 1: condition $Z^\star$ on $\{x - Z^\star > 0\}$
  ▶ $r(\mu) = \dfrac{\varphi(x - \mu)}{\Phi(x)}\, I_{\{\mu > 0\}}$
  ▶ Lower confidence bounds do not have correct coverage.
▶ Option 2: projection onto $\mu \ge 0$
  ▶ $r(\mu) = (1 - \Phi(x))\, I_{\{0\}} + \varphi(x - \mu)\, I_{\{\mu > 0\}}$
  ▶ Correct coverage; possible to get $\{0\}$ as the CI – a sure bet against.
▶ Option 3: mixture
  ▶ $r(\mu) = \min\!\big(\tfrac12,\, 1 - \Phi(x)\big)\, I_{\{0\}} + \max\!\big(\tfrac{1}{2\Phi(x)},\, 1\big)\, \varphi(x - \mu)\, I_{\{\mu > 0\}}$
  ▶ Correct/conservative coverage; no $\{0\}$ for reasonable $\alpha$ CIs.
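The three options can be compared by simulation, since each $r(\mu)$ has a closed-form quantile function. A sketch; the quantile formulas below are my own inversions of the displayed mixtures, and the true $\mu$ is an illustrative choice near the boundary.

```python
import numpy as np
from scipy.stats import norm

def lower_bound(x, alpha, option):
    """alpha-quantile of r(mu), i.e. the 1-alpha lower confidence bound for mu."""
    if option == 1:                                   # conditioning: no mass at 0
        return x - norm.ppf((1 - alpha) * norm.cdf(x))
    w0 = (1 - norm.cdf(x)) if option == 2 else min(0.5, 1 - norm.cdf(x))
    if w0 >= alpha:                                   # point mass at 0 already >= alpha
        return 0.0
    c = 1.0 if option == 2 else max(1 / (2 * norm.cdf(x)), 1.0)
    return x - norm.ppf(norm.cdf(x) - (alpha - w0) / c)

rng = np.random.default_rng(0)
mu_true, alpha, B = 0.2, 0.05, 5000
xs = mu_true + rng.normal(size=B)
for opt in (1, 2, 3):
    cover = np.mean([lower_bound(x, alpha, opt) <= mu_true for x in xs])
    print(opt, cover)   # option 1 noticeably under-covers; options 2 and 3 reach >= 0.95
```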



Shape restrictions - preliminary results

▶ Example: positive i.i.d. data with a concave cdf (the MLE is the Grenander estimator).
▶ Condition $U^\star$ on the existence of a concave solution (Gibbs sampler).
▶ Project the unrestricted GFD onto the space of concave functions (a quadratic program; see the sketch after the figure).

[Figure: fiducial cdf samples under the two approaches, in panels labeled Condition and Projection.]
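A sketch of the projection step, assuming the cdf is represented by its values on an equally spaced grid: the $L_2$ projection onto nondecreasing concave functions bounded by 1 is then a small quadratic program. cvxpy is used here for brevity; it is not necessarily what the authors used.

```python
import cvxpy as cp
import numpy as np

def project_concave(F_vals):
    """L2-project cdf values on an equally spaced grid onto the concave cdfs."""
    m = len(F_vals)
    G = cp.Variable(m)
    constraints = [cp.diff(G) >= 0,       # nondecreasing
                   cp.diff(G, 2) <= 0,    # concave: nonpositive second differences
                   G[0] >= 0, G[-1] <= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(G - F_vals)), constraints).solve()
    return G.value

# Illustration: project a (non-concave) empirical cdf of positive data
rng = np.random.default_rng(0)
x = rng.exponential(size=30)
grid = np.linspace(0.0, x.max(), 80)
F_emp = np.searchsorted(np.sort(x), grid, side="right") / x.size
F_conc = project_concave(F_emp)
```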


Comments

▶ When should one use conditioning vs. projection?
▶ Connections to ancillarity and IM.
▶ Is computational cost a consideration?


Conclusions

Fiducial Future

▶ What is it that we provide?
  ▶ GFI: a general-purpose method that often works well.
▶ Computational convenience and efficiency:
  ▶ Fiducial options in software.
▶ Theoretical guarantees.
▶ Applications:
  ▶ The proof is in the pudding.



List of successful applications

▶ General linear mixed models: E, H & Iyer (2008); Cissewski & H (2012).
▶ Confidence sets for wavelet regression: H & Lee (2009); free-knot splines: Sonderegger & H (2014).
▶ Extreme-value data (generalized Pareto), maximum mean, and model comparison: Wandler & H (2011, 2012ab).
▶ Uncertainty quantification for ultra-high-dimensional regression: Lai, H & Lee (2015); Wandler & H (2017+).
▶ Volatility estimation for high-frequency data: Katsoridas & H (2016+).
▶ Logistic regression with random effects (response models): Liu & H (2016, 2017).


I have a dream …

▶ One famous statistician said (I paraphrase):

  "I use Bayes because there is no need to prove an asymptotic theorem; it is correct."

▶ I have a dream that by the time I retire, people will have similar trust in fiducial-inspired approaches.

Thank you!
