(Slides: w3.impa.br/~rimfo/EBP_subgaussian.pdf)

Sub-Gaussian estimators under heavy tails

Roberto Imbuzeiro Oliveira

XIX Escola Brasileira de Probabilidade, Maresias, August 6th, 2015

Joint with Luc Devroye (McGill), Matthieu Lerasle (CNRS/Nice) and Gábor Lugosi (ICREA/UPF)

Our problem (and why it's interesting)

We want to estimate the mean of a probability distribution over the real line from an i.i.d. sample. This is (related to) many fundamental statistical tasks.

We assume finite variance, but as little else as possible. Interesting in theory, important in practice.

We want nearly optimal tail bounds, uniformly over large classes of distributions. High-confidence estimates are sometimes necessary.

Formal statement

Given: $\mathcal{P}$, a family of probability distributions over $\mathbb{R}$. For $P \in \mathcal{P}$, $\mu_P$ and $\sigma_P^2$ are the mean and variance of $P$.

Want: for each large enough $n \in \mathbb{N}$, an estimator $\widehat{E}_n : \mathbb{R}^n \to \mathbb{R}$ and a parameter $\delta_{\min} = \delta_{\min,n} \in [0,1)$ such that, if $X_1^n = (X_1,\dots,X_n)$ is i.i.d. from $P \in \mathcal{P}$, then

$$\forall \delta \in [\delta_{\min}, 1):\quad \mathbb{P}\left( |\widehat{E}_n(X_1^n) - \mu_P| > L\,\sigma_P\sqrt{\frac{1+\ln(1/\delta)}{n}} \right) \le \delta.$$

Here $\mathcal{P}$ should be very large (nonparametric), $\delta_{\min}$ should be very small (exponentially small in $n$?), and $L$ is a constant (it may depend on the family).

Why sub-Gaussian?

What we ask for is basically that the estimator has Gaussian-like fluctuations around the mean.

Catoni: Gaussian-like fluctuations are optimal for "reasonable" families of distributions (more on this below).

$$\mathbb{P}\left( |\widehat{E}_n(X_1^n) - \mu_P| > \frac{t\,\sigma_P}{\sqrt{n}} \right) \le C_1\, e^{-t^2/C_2}.$$
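To connect this with the defining inequality above (a routine reparametrization, with my own bookkeeping of constants): setting $\delta := C_1 e^{-t^2/C_2}$, i.e. $t = \sqrt{C_2 \ln(C_1/\delta)}$, the display becomes

$$\mathbb{P}\left( |\widehat{E}_n(X_1^n) - \mu_P| > \sigma_P\sqrt{\frac{C_2\big(\ln C_1 + \ln(1/\delta)\big)}{n}} \right) \le \delta,$$

which is the sub-Gaussian property with an $L$ determined by $C_1$ and $C_2$ (for $C_1 \ge e$ one may take $L = \sqrt{C_2 \ln C_1}$).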

Why is this interesting?

The estimator must turn heavy tails into light tails! (Tail surgery?)

Related (weaker) estimators have been applied to problems in statistics and machine learning, and our notion could improve these results: Audibert and Catoni, and Hsu and Sabato (least squares); Bubeck et al. (bandits); Brownlees et al. (empirical risk minimization).

When is this possible?

This is the main subject of our paper.

We present our results before we move on.

Our results

First result

Assumption: variance known up to an interval.

Partially known variance

Example: $\mathcal{P}_2^{[\sigma_1^2,\sigma_2^2]}$ := all distributions with variance $\sigma_P^2 \in [\sigma_1^2, \sigma_2^2]$. We let $R := \sigma_2/\sigma_1$ (may depend on $n$).

Theorem: If $R$ is bounded, then for all large enough $n$ there exist $\widehat{E}_n : \mathbb{R}^n \to \mathbb{R}$, $\delta_{\min} \approx e^{-cn}$ and a constant $L$ such that, when $P \in \mathcal{P}_2^{[\sigma_1^2,\sigma_2^2]}$ and $X_1^n =_d P^{\otimes n}$,

$$\forall \delta \in [\delta_{\min}, 1):\quad \mathbb{P}\left( |\widehat{E}_n(X_1^n) - \mu_P| > L\,\sigma_P\sqrt{\frac{1+\ln(1/\delta)}{n}} \right) \le \delta.$$

If $R$ is unbounded, any sequence $\delta_{\min} \to 0$ fails.

This is optimal up to the exact values of $c>0$ and $L>0$. Truly different behavior!

Second result

Assumption: (slightly) higher moments.

Higher moments

Example: $\mathcal{P}_{\alpha,\eta}$ := all distributions with $\mathbb{E}_P|X-\mu_P|^\alpha \le (\eta\,\sigma_P)^\alpha$ (here $\alpha \in (2,3)$ is fixed, $\eta \ge \eta_0$ may depend on $n$).

Theorem: for all large enough $n$, if $k_{\alpha,\eta} := (C\eta)^{2\alpha/(\alpha-2)}$, there exist $\widehat{E}_n : \mathbb{R}^n \to \mathbb{R}$, $\delta_{\min} \approx e^{-cn/k_{\alpha,\eta}}$ and a constant $L$ such that, when $P \in \mathcal{P}_{\alpha,\eta}$ and $X_1^n =_d P^{\otimes n}$,

$$\forall \delta \in [\delta_{\min}, 1):\quad \mathbb{P}\left( |\widehat{E}_n(X_1^n) - \mu_P| > L\,\sigma_P\sqrt{\frac{1+\ln(1/\delta)}{n}} \right) \le \delta.$$

This is optimal up to the value of $c>0$.

An extension

It suffices to assume that the distributions in $\mathcal{P}$ are $k$-regular:

$$\exists k \in \mathbb{N},\ \forall P \in \mathcal{P},\ \forall j \ge k:\ \text{if } X_1^j =_d P^{\otimes j}, \quad \mathbb{P}\left( \pm\frac{1}{j}\sum_{i=1}^{j} (X_i - \mu_P) \ge 0 \right) \ge \frac{1}{3}.$$

For instance, symmetric distributions are 1-regular.

Third result

Assumption: bounded kurtosis. Under this assumption one can get a nearly optimal constant, $L = \sqrt{2} + \varepsilon$.

This will be further discussed later.

Some background

History

Typical analyses of estimators for means are based on expectations, not deviations.

Exceptions do exist (e.g. Kolmogorov's CLT for medians), but the assumptions and goals are different.

Catoni’s paper (AIHP Prob. Stat. 2012) seems to be the first to focus on deviations as a fundamental problem.

We’ll mention some more applied results later.

Gaussian lower bound

Recall the normal cumulative distribution function:

$$\Phi(r) := \int_{-\infty}^{r} e^{-x^2/2}\,\frac{dx}{\sqrt{2\pi}}, \qquad \Phi^{-1}(1-\delta) \sim \sqrt{2\ln(1/\delta)} \quad \text{for } \delta \ll 1.$$
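The asymptotic follows from the standard Gaussian tail estimate (a one-line check):

$$1-\Phi(r) = (1+o(1))\,\frac{e^{-r^2/2}}{r\sqrt{2\pi}} \quad (r\to\infty), \qquad \text{so } 1-\Phi(r)=\delta \text{ forces } r = (1+o(1))\sqrt{2\ln(1/\delta)} \text{ as } \delta \to 0.$$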

Family: $\mathcal{P}^{\sigma^2}_{\mathrm{Gauss}}$, all Gaussian distributions over $\mathbb{R}$ with variance $\sigma^2 > 0$.

Thm (Catoni): for any $n$,

$$\inf_{\widehat{E}_n}\ \sup_{\substack{P\in\mathcal{P}^{\sigma^2}_{\mathrm{Gauss}}\\ X_1^n =_d P^{\otimes n}}} \mathbb{P}\left( \widehat{E}_n(X_1^n) - \mu_P \ge \Phi^{-1}(1-\delta)\,\frac{\sigma_P}{\sqrt{n}} \right) = \delta.$$

A similar result holds for the lower tail.

The deviation $\Phi^{-1}(1-\delta)$ is asymptotic to $L\sqrt{\ln(1/\delta)}$ with $L = \sqrt{2}$; compare with the defining inequality in the formal statement above.

The empirical mean

It follows from Catoni's result that the empirical mean

$$\widehat{E}_n(X_1^n) := \frac{1}{n}\sum_{i=1}^{n} X_i$$

has optimal deviations for all Gaussian distributions. This is the exception rather than the rule.

Empirical mean fails

Example: $\mathcal{P}_2^{\sigma^2}$, all distributions with variance $\sigma_P^2 = \sigma^2$.

Thm (Catoni): Chebyshev is basically optimal.

$$\sup_{\substack{P\in\mathcal{P}_2^{\sigma^2}\\ X_1^n =_d P^{\otimes n}}} \mathbb{P}\left( \left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu_P\right| > \frac{c\,\sigma_P}{\sqrt{\delta n}} \right) \ge \delta.$$

Empirical mean fails

Example: $\mathcal{P}^{\kappa}_{\mathrm{krt}}$, all distributions with kurtosis $\kappa_P := \mathbb{E}_P|X-\mu_P|^4/\sigma_P^4 \le \kappa$.

Thm (Catoni): if $n$ is large and $\delta \ge 1/n$,

$$\sup_{\substack{P\in\mathcal{P}^{\kappa}_{\mathrm{krt}}\\ X_1^n =_d P^{\otimes n}}} \mathbb{P}\left( \left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu_P\right| > \frac{c\,\sigma_P}{(\delta n)^{1/4}} \right) \ge \delta.$$

Positive results

Catoni obtained sharp sub-Gaussian estimators in some settings.

Unfortunately, they depend on the confidence level!

One example

Example: $\mathcal{P}_2^{\sigma^2}$, all distributions with variance $\sigma_P^2 = \sigma^2$.

Thm (Catoni): Set $\delta_{\min} := e^{-o(n)}$. Then $\forall \delta \in [\delta_{\min}, 1)$ there exists a $\delta$-dependent estimator $\widehat{E}_{n,\delta}$ with

$$\sup_{\substack{P\in\mathcal{P}_2^{\sigma^2}\\ X_1^n =_d P^{\otimes n}}} \mathbb{P}\left( |\widehat{E}_{n,\delta}(X_1^n) - \mu_P| > \sigma_P\sqrt{\frac{(2+o(1))\ln(2/\delta)}{n}} \right) \le \delta.$$

Why is this bad?

Suppose you want high confidence. The only guarantee is that the probability of a huge error is very low. Nothing is guaranteed about moderate-to-large errors at more typical confidence levels.

Statistical and machine learning applications (Bubeck et al., Brownlees et al., Hsu/Sabato) had to cope with this dependence on the confidence level.

In all cases, something was lost.

Our results are “better”

… or rather, genuinely different.

Our results imply that parameter-dependent estimators are easier to obtain.

We’ll see that right now.

Median of means

Simple construction of a sub-Gaussian parameter-dependent estimator that only requires finite second moments.

Known for a long time, in many forms, in different communities (Nemirovski/Yudin, Alon/Matias/Szegedy, Levin, Jerrum/Sinclair, Hsu, ...). "Pre-history".

Example: $\mathcal{P}_2^{\sigma^2}$, all distributions with variance $\sigma_P^2 = \sigma^2$.

Thm: Set $\delta_{\min} := e^{1-n/2}$. Then $\forall \delta \in [\delta_{\min}, 1)$ there exists a $\delta$-dependent estimator $\widehat{E}_{n,\delta}$ with

$$\sup_{\substack{P\in\mathcal{P}_2^{\sigma^2}\\ X_1^n =_d P^{\otimes n}}} \mathbb{P}\left( |\widehat{E}_{n,\delta}(X_1^n) - \mu_P| > L\,\sigma_P\sqrt{\frac{1+\ln(2/\delta)}{n}} \right) \le \delta.$$

Median of means

Sample: $X_1^n := (X_1, X_2, \dots, X_n)$ from distribution $P$.

Blocks: split $\{1, 2, \dots, n\} = B_1 \cup B_2 \cup \dots \cup B_b$ into disjoint blocks of size $n/b$.

Means: for each block $B_\ell$, define
$$Y_\ell := \frac{b}{n}\sum_{i\in B_\ell} X_i.$$

Median of means:
$$\widehat{E}_{n,\delta}(X_1^n) := \text{median of } (Y_1, Y_2, \dots, Y_b).$$
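As a concrete illustration, here is a minimal Python sketch of this construction; the function name, the choice $b \approx \ln(1/\delta)$ (anticipating the analysis below), and the rounding conventions are mine, not the talk's.

```python
import numpy as np

def median_of_means(x, delta):
    """Median-of-means estimate of the mean of the sample x, tuned to
    confidence level delta (a sketch; block count and rounding are
    illustrative choices, not the talk's exact ones)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # roughly b ~ ln(1/delta) blocks, at least 1 and at most n
    b = int(min(n, max(1, np.ceil(np.log(1.0 / delta)))))
    blocks = np.array_split(x, b)              # disjoint blocks of (nearly) equal size
    block_means = [blk.mean() for blk in blocks]
    return float(np.median(block_means))

# usage: heavy-tailed sample (Pareto with shape 2.5), true mean 2.5/1.5
rng = np.random.default_rng(0)
sample = rng.pareto(2.5, size=10_000) + 1.0
print(median_of_means(sample, delta=0.01), sample.mean())
```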

Analysis

Consider the interval $\big[\mu_P - L\sigma_P\sqrt{b/n},\ \mu_P + L\sigma_P\sqrt{b/n}\big]$ on the real line.

Want: the median of $Y_1, \dots, Y_b$ lands in this interval. Sufficient: more than half of the $Y_\ell$'s are in there.

Each $Y_\ell = \frac{b}{n}\sum_{i\in B_\ell} X_i$, with the $X_i$ i.i.d. from $P$, so $\mathbb{E}(Y_\ell) = \mu_P$ and $\mathrm{Var}(Y_\ell) = b\,\sigma_P^2/n$.

By Chebyshev, $\mathbb{P}(Y_\ell \notin \text{interval}) \le L^{-2}$. The blocks are disjoint, so these events are independent.

The probability that $\ge b/2$ of the $Y_\ell$'s fall outside the interval is therefore bounded by a binomial tail probability: if $L$ is large, $\mathbb{P}\big(\mathrm{Bin}(b, L^{-2}) \ge b/2\big) \le e^{-b}$.

Take $b \approx \ln(1/\delta)$ and we are done: the median is within $L\sigma_P\sqrt{b/n} \approx L\sigma_P\sqrt{\ln(1/\delta)/n}$ of $\mu_P$ except with probability $\le \delta$.
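The binomial step can be made explicit with a standard Chernoff-type bound (my own constants, not the slides' exact bookkeeping):

$$\mathbb{P}\big(\mathrm{Bin}(b,p)\ge b/2\big)\ \le\ \binom{b}{\lceil b/2\rceil}\,p^{b/2}\ \le\ (4p)^{b/2},$$

which is at most $e^{-b}$ as soon as $p = L^{-2} \le 1/(4e^2)$, i.e. $L \ge 2e$; with $b = \lceil\ln(1/\delta)\rceil$ this gives failure probability at most $e^{-b} \le \delta$.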

Our proof ideas

Exponential is optimal

Family: $\mathcal{P}_{\mathrm{La}}$, all Laplace distributions $\mathrm{La}_\lambda$ with $\lambda \in \mathbb{R}$ and

$$\frac{d\mathrm{La}_\lambda}{dx}(x) = \frac{e^{-|x-\lambda|}}{2}.$$

Property:
$$e^{-|\lambda| n} \le \frac{d\mathrm{La}_\lambda^{\otimes n}}{d\mathrm{La}_0^{\otimes n}}(x) \le e^{|\lambda| n}.$$

Consequence: any estimator with constant $L$ will mistake a $\mathrm{La}_0$ sample for a $\mathrm{La}_{10L^2}$ sample with probability $\approx e^{1-5L^2 n}$.
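One way to unpack this consequence (a sketch, with rough constants of my own rather than the slides' exact ones): take $\lambda = 10L^2$ and let $A := \{\widehat{E}_n(X_1^n) \ge \lambda/2\}$. Under $\mathrm{La}_\lambda^{\otimes n}$ the event $A$ has probability close to 1, so the likelihood-ratio bound gives

$$\mathrm{La}_0^{\otimes n}(A) \ \ge\ e^{-\lambda n}\,\mathrm{La}_\lambda^{\otimes n}(A)\ \gtrsim\ e^{-10L^2 n}.$$

But if the sub-Gaussian property with constant $L$ held at the confidence level corresponding to the deviation $\lambda/2 = 5L^2$ (for the Laplace law $\sigma_P^2 = 2$, so that level is $\delta \approx e^{-12.5\,L^2 n}$), it would force $\mathrm{La}_0^{\otimes n}(A) \lesssim e^{-12.5\,L^2 n}$, a contradiction for large $n$. Hence $\delta_{\min}$ cannot go below $e^{-\Theta(n)}$: exponentially small $\delta_{\min}$ is the best one can hope for.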

Partially known variance

Recall: $\mathcal{P}_2^{[\sigma_1^2,\sigma_2^2]}$ := all distributions with variance $\sigma_P^2 \in [\sigma_1^2, \sigma_2^2]$, and $R := \sigma_2/\sigma_1$. If $R$ is bounded, a sub-Gaussian estimator with $\delta_{\min} \approx e^{-cn}$ and constant $L$ exists; if $R$ is unbounded, any sequence $\delta_{\min} \to 0$ fails.

Why unbounded fails

Family: $\mathcal{P}^{[c/n,\,Rc/n]}_{\mathrm{Po}}$, Poisson random variables with very small means $c/n \le \mu_P \le Rc/n$. Recall mean = variance for the Poisson!

$X_1^n$ := sample with mean $c/n$, $S_X := X_1 + \dots + X_n$.

$Y_1^n$ := sample with mean $Rc/n$, $S_Y := Y_1 + \dots + Y_n$.

Why unbounded fails

Assume a good estimator $\widehat{E}_n$ with constant $L$. Applying the sub-Gaussian property to the $Y$ sample (mean $Rc/n$, variance $Rc/n$),

$$\mathbb{P}\left( n\,\widehat{E}_n(Y_1^n) \ge Rc/2 \right) \ge 1 - e^{1 - \frac{Rc}{4L^2}}.$$

In particular, $\mathbb{P}\left( n\,\widehat{E}_n(Y_1^n) \ge Rc/2 \mid S_Y = Rc \right) \approx 1$.

The same holds for $X$ as for $Y$: the sample sum is a sufficient statistic, so the conditional law of the sample given its sum does not depend on the Poisson mean.

Why unbounded fails

By sufficiency, also $\mathbb{P}\left( n\,\widehat{E}_n(X_1^n) \ge Rc/2 \mid S_X = Rc \right) \approx 1$.

So
$$\mathbb{P}\left( n\,\widehat{E}_n(X_1^n) \ge Rc/2 \right) \ \ge\ \mathbb{P}(S_X = Rc) \ \approx\ e^{-Rc\ln R}.$$

On the other hand, by the sub-Gaussian estimation property this probability should be $\approx e^{-R^2 c/L^2}$. Contradiction for $R$ large.
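The estimate for $\mathbb{P}(S_X = Rc)$ is Stirling's formula applied to the $\mathrm{Poisson}(c)$ law of $S_X$ (a standard computation, assuming $Rc$ is an integer):

$$\mathbb{P}(S_X = Rc) = e^{-c}\,\frac{c^{Rc}}{(Rc)!} \approx \frac{e^{-c}}{\sqrt{2\pi Rc}}\left(\frac{e}{R}\right)^{Rc} = e^{-Rc\ln R + Rc - c - \frac12\ln(2\pi Rc)} = e^{-(1+o(1))\,Rc\ln R}$$

as $R \to \infty$, while the sub-Gaussian property would make this probability at most roughly $e^{-R^2 c/L^2}$, which is far smaller once $R \gg L^2 \ln R$.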

The positive result

Recall the theorem for $\mathcal{P}_2^{[\sigma_1^2,\sigma_2^2]}$: if $R = \sigma_2/\sigma_1$ is bounded, then for all large enough $n$ there is a single estimator $\widehat{E}_n$ with $\delta_{\min} \approx e^{-cn}$ and a constant $L$ satisfying the sub-Gaussian property over the whole family.

Confidence intervals

Use median of means to get a confidence interval:

$$\widehat{I}_{n,\delta}(X_1^n) := \left[ \widehat{E}_{n,\delta}(X_1^n) \pm L\,\sigma_2\sqrt{\frac{1+\ln(1/\delta)}{n}} \right],$$

$$\mathbb{P}\left( \mu_P \in \widehat{I}_{n,\delta}(X_1^n) \ \text{and}\ |\widehat{I}_{n,\delta}(X_1^n)| \le 2LR\,\sigma_P\sqrt{\frac{1+\ln(1/\delta)}{n}} \right) \ge 1 - \delta.$$

Confidence intervals

We'll combine sub-Gaussian confidence intervals to obtain a single sub-Gaussian estimator.

Similar in spirit to Lepskii’s adaptation method from nonparametric statistics.

Confidence intervals

Lemma: let $I_1, I_2, \dots, I_K$ be random nonempty closed intervals. Assume $\mu \in \mathbb{R}$ satisfies $\mathbb{P}(\mu \notin I_k) \le 2^{-k}$ for $1 \le k \le K$. Set

$$\widehat{K} := \min\{k \le K : \cap_{j=k}^{K} I_j \ne \emptyset\}, \qquad \widehat{E} := \text{midpoint of } \cap_{j=\widehat{K}}^{K} I_j.$$

Then for all $1 \le k \le K$: $\mathbb{P}\big( |\widehat{E} - \mu| > |I_k| \big) \le 2^{1-k}$.

Proof sketch

Recall $\widehat{K} := \min\{k \le K : \cap_{j=k}^{K} I_j \ne \emptyset\}$ and $\widehat{E} := $ midpoint of $\cap_{j=\widehat{K}}^{K} I_j$.

Assume $\mu \in I_j$ for all $j \ge k$. Then $\cap_{j=k}^{K} I_j \ne \emptyset$, so $\widehat{K} \le k$, and the intersection defining $\widehat{E}$ is contained in $I_k$. Hence both $\widehat{E}$ and $\mu$ lie in $I_k$, so $|\widehat{E} - \mu| \le |I_k|$.

Therefore $\mathbb{P}\big( |\widehat{E} - \mu| > |I_k| \big) \le \sum_{j \ge k} \mathbb{P}(\mu \notin I_j) \le 2^{1-k}$.
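A minimal Python sketch of this combination scheme, building the intervals from median-of-means at confidence levels $2^{-k}$; the half-width constant, the level grid, and the block-count rule are illustrative placeholders, not the paper's tuned values.

```python
import numpy as np

def mom_interval(x, delta, sigma_max, L=3.0):
    """Median-of-means confidence interval at level delta.
    sigma_max plays the role of sigma_2 (an a-priori upper bound on the
    standard deviation); L is an illustrative constant."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    b = int(min(n, max(1, np.ceil(np.log(1.0 / delta)))))
    means = [blk.mean() for blk in np.array_split(x, b)]
    center = float(np.median(means))
    half = L * sigma_max * np.sqrt((1.0 + np.log(1.0 / delta)) / n)
    return center - half, center + half

def combined_estimator(x, sigma_max, K=20):
    """Lepskii-style combination: intervals I_k at levels 2^{-k},
    K_hat = smallest k whose suffix of intervals still intersects,
    estimate = midpoint of that intersection."""
    intervals = [mom_interval(x, 2.0 ** (-k), sigma_max) for k in range(1, K + 1)]
    lo, hi = -np.inf, np.inf
    best = None
    # scan k = K, K-1, ..., 1, maintaining the intersection of I_k, ..., I_K
    for left, right in reversed(intervals):
        lo, hi = max(lo, left), min(hi, right)
        if lo <= hi:                 # suffix still intersects: K_hat can move down
            best = 0.5 * (lo + hi)
        else:
            break                    # intersection became empty; stop
    return best

rng = np.random.default_rng(1)
sample = rng.standard_t(df=3, size=5000)   # heavy-tailed, true mean 0, variance 3
print(combined_estimator(sample, sigma_max=np.sqrt(3.0)))
```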

Other uses

Recall the higher-moments theorem for $\mathcal{P}_{\alpha,\eta}$: the same scheme yields a sub-Gaussian estimator with $\delta_{\min} \approx e^{-cn/k_{\alpha,\eta}}$ and a constant $L$.

Use quantiles of means (instead of medians of means) to build the confidence intervals. Berry-Esseen-type bounds show that the block empirical means are nearly symmetric.

Different ideas - kurtosis

Under bounded kurtosis, can use the empirical mean of truncated random variables.

The truncation is data driven and uses preliminary estimates of mean and variance.

Use empirical processes to show this is similar to truncating at the exact mean and variance. Sharp bounds!
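As a rough illustration of the idea (not the paper's actual estimator: the sample split, the truncation level, and the constants below are placeholders), one can estimate the mean and standard deviation on one half of the data, truncate the other half at a data-driven level of order $\hat\sigma\sqrt{n/\ln(1/\delta)}$, and average.

```python
import numpy as np

def truncated_mean(x, delta):
    """Illustrative truncated-mean estimator: preliminary mean/std estimates
    from the first half drive the truncation applied to the second half.
    The truncation level ~ sigma * sqrt(m / log(1/delta)) is a placeholder
    choice, not the talk's tuned one."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    first, second = x[: n // 2], x[n // 2 :]
    mu0 = first.mean()                     # preliminary mean estimate
    sigma0 = first.std()                   # preliminary std estimate
    level = sigma0 * np.sqrt(len(second) / (1.0 + np.log(1.0 / delta)))
    clipped = np.clip(second, mu0 - level, mu0 + level)
    return float(clipped.mean())

rng = np.random.default_rng(2)
data = rng.standard_t(df=5, size=10_000)   # heavy-ish tails, true mean 0
print(truncated_mean(data, delta=1e-3))
```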

Open problems

Sharp constants are essential for statisticians.

Are sub-Gaussian confidence intervals somehow equivalent to sub-Gaussian estimators?

Efficient extensions to vector-valued data and to risk minimization problems.

Optimal deviation bounds for Poissons, Bernoullis, etc.

Thank you! (References in the next slides.)

Our preprint

It should be posted to the arXiv in a few weeks. Available upon request from

roboliv AT gmail.com

Catoni’s work

Catoni’s estimation paper + companion paper on least squares (with Audibert).

J.-Y. Audibert & O. Catoni. "Robust linear least squares regression.” Ann. Stat. 39 no. 5 (2011)

O. Catoni. "Challenging the empirical mean and empirical variance: A deviation study.” Ann. Inst. H. Poincaré Probab. Statist. 48 no. 4 (2012)

Median of means

D. Hsu. http://www.inherentuncertainty.org/2010/12/robust-statistics.html (see also L. Levin. "Notes for Miscellaneous Lectures." arXiv:cs/0503039)

N. Alon, Y. Matias & M. Szegedy. "The Space Complexity of Approximating the Frequency Moments." J. Comput. Syst. Sci. 58 no. 1 (1999)

A. Nemirovski & D. Yudin. Problem complexity and method efficiency in optimization. Wiley (1983)

Some applications

C. Brownlees, E. Joly & G. Lugosi. "Empirical risk minimization for heavy-tailed losses." To appear in Ann. Stat.

S. Bubeck, N. Cesa-Bianchi & G. Lugosi. "Bandits with heavy tail." IEEE Transactions on Information Theory 59 no. 11 (2013)

D. Hsu & S. Sabato. "Loss minimization and parameter estimation with heavy tails." arXiv:1307.1827. Abstract in ICML proceedings (2014)