Date post: 13-Jan-2016

Reasoning about Uncertainty in High-dimensional Data Analysis

Adel Javanmard

Stanford University

What is high-dimensional data?

• Modern data sets are both massive and fine-grained.

# Features (variables) > # Observations (samples)

• A trend in modern data analysis.

High-Dimensional Data: an example

• Allergies: type, reaction, severity, start year, stop year, …

• Diagnosis info: ICD9 codes, description, start year, stop year, …

• Medications: name, strength, schedule, …

• Transcript records: age, gender, BMI, heart rate, billing information, …

• Lab results: HL7 text, value, abnormality, obs. year, …

• Medical images

What can we do with such data?

• Extract useful, actionable information.

• Predictive models for: clinical outcomes, patient evolution, readmission rate, …

• Design (or advise): treatment, clinical interventions and trials

[Images: Health Care Reform, HITECH, Heritage Health Prize]

• More than 71 million persons are admitted to hospitals each year.

• Over $30 billion was spent on unnecessary hospital readmissions (2006).

Diabetes Example

• n = 500 (patients)

• p = 805 (variables): medical information such as medications, lab results, diagnosis, …

[Data from Practice Fusion, posted on Kaggle]

• Find significant variables in predicting type 2 diabetes.

• “People with higher bilirubin are more susceptible to diabetes.”

How certain are we about this claim?

Problem of Uncertainty Assessment

[Figure: estimated parameters plotted against their indices]

How stable are these estimates? What can we say about the true parameters?

Confidence intervals

[Figure: confidence intervals for the parameters plotted against index; one entry is labeled "Blood pressure"]

Why is it hard?

• Low-dimensional regime (p fixed, n → ∞): large sample theory.

• The situation in the high-dimensional regime is very different!

• Much progress has been achieved for high-dimensional parameter estimation, high-dimensional variable/feature selection, and high-dimensional prediction.

[Annotation: Supp(·), the support of the parameter vector]

[Tibshirani, Donoho, Cai, Zhou, Candès, Tao, Bickel, van de Geer, Ritov, Bühlmann, Meinshausen, Zhao, Yu, Wainwright, …]

How to assign measures of uncertainty to each single parameter?

Other examples

• Targeted online advertising

• Personalized medicine

• Social networks

• Collaborative filtering

• Genomics

Overview of Regularized Estimators

Regularized Estimators

• Investigate low-dimensional structures in data:

minimize Loss + λ · Model Complexity

• Mitigates spurious correlations, noise accumulation, and instability (to noise and sampling).

• This comes at a price: the estimator is biased (towards small complexity), nonlinear, and non-explicit.

Diabetes Example

• Response: whether the patient gets type 2 diabetes

• Covariates: variables of the patient

• Estimator: argmin of the logistic loss plus a regularization term; each coefficient measures the contribution of one feature [formula missing from transcript]

• Convex optimization

• Variable selection (some of the coefficients are exactly zero)
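The formula on this slide did not survive transcription. One common way to write a regularized logistic regression of this kind is sketched below; the symbols (y_i for the binary outcome, x_i for the patient's variables, θ for the coefficients) and the choice of an ℓ1 penalty are assumptions made for illustration, not a reconstruction of the slide.

```latex
\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p}
  \frac{1}{n} \sum_{i=1}^{n}
  \Big[ \log\!\big(1 + e^{\langle x_i, \theta \rangle}\big)
        - y_i \langle x_i, \theta \rangle \Big]
  + \lambda \|\theta\|_1 ,
\qquad y_i \in \{0, 1\}
```

The first term is the logistic loss averaged over patients, and the ℓ1 term is what drives some coefficients to exactly zero.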

Selects 62 interesting features.

We want to construct confidence intervals for each parameter.

What is a confidence interval?

• We observe data generated by [formula missing from transcript]

• Confidence intervals: [formula missing from transcript]

Confidence intervals are random objects.

Confidence level
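To make the statement that confidence intervals are random objects concrete, here is a minimal simulation sketch. It uses a deliberately simple setting that is not from the talk (the mean of Gaussian data with known standard deviation); the function names and parameter values are hypothetical. Each repetition draws fresh data, so the interval moves, and the confidence level is the long-run fraction of intervals that contain the fixed true value.

```python
import math
import random
from statistics import NormalDist


def mean_confidence_interval(samples, sigma, alpha=0.05):
    """Two-sided (1 - alpha) interval for a Gaussian mean with known sigma."""
    n = len(samples)
    center = sum(samples) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half_width = z * sigma / math.sqrt(n)
    return center - half_width, center + half_width


def empirical_coverage(true_mean, sigma, n, trials, alpha=0.05, seed=0):
    """Fraction of repeated experiments whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # New data each time: the interval is random, the target is fixed.
        samples = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lo, hi = mean_confidence_interval(samples, sigma, alpha)
        hits += lo <= true_mean <= hi
    return hits / trials


if __name__ == "__main__":
    print(empirical_coverage(true_mean=1.0, sigma=2.0, n=50, trials=2000))
```

With alpha = 0.05 the printed fraction should land close to 0.95, up to simulation noise.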

Why uncertainty assessment?

Scientific discoveries

[Figure labels: 99% confidence, 70% confidence]

• Curry increases the cognitive capacity of the brain. [Tze-Pin Ng, 2006]

• Beautiful parents have more daughters than ugly parents. [Kanazawa, 2006]

• Left-handedness in males has a significant effect on wage level. [Ruebeck, 2006]

“Why Most Published Research Findings Are False” [John P. A. Ioannidis]

Why uncertainty assessment?

Decision making

• System state: [symbol missing from transcript]

• We take measurements [formula missing from transcript]

[Figure: state space divided into a normal zone and an abnormal zone]

Why uncertainty assessment?

Optimization / stopping rules

First-order methods for large-scale data (coordinate descent, mirror descent, Nesterov’s method, …)

[Figure: a curve plotted against iteration, with a stopping point marked]

Optimization is a tool, not the goal!

Reasoning about Uncertainty

Setup

[Figure: linear model diagram; the response equals an n × p design matrix times a p-dimensional parameter vector, plus noise]

Gaussian noise with mean zero and covariance [symbol missing from transcript].
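The setup diagram can be written as a single equation. The sketch below uses assumed notation (y for the response, X for the design matrix, θ_0 for the true parameter vector, w for the noise); the isotropic noise covariance σ²I is also an assumption, since the covariance symbol is missing from the transcript.

```latex
y = X\theta_0 + w, \qquad
X \in \mathbb{R}^{n \times p}, \quad
\theta_0 \in \mathbb{R}^{p}, \quad
w \sim \mathsf{N}\big(0, \sigma^2 I_{n \times n}\big)
```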

Lasso

[Tibshirani 1996, Chen, Donoho 1996]

[Formula missing from transcript: the Lasso estimator, defined as an argmin; annotations mark quantities in it as deterministic or random]

What is the distribution of the Lasso estimate?
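For readers who want to see the Lasso in action, here is a small pure-Python sketch. It assumes the common scaling of the objective, (1/(2n))·||y − Xθ||² + λ·||θ||₁, and solves it by cyclic coordinate descent with soft-thresholding. The function names are hypothetical, and the solver is a stand-in chosen for brevity, not the one used in the talk.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Minimize (1/(2n)) * ||y - X theta||^2 + lam * ||theta||_1.

    X is a list of n rows, each a list of p floats; y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    residual = list(y)  # residual = y - X theta, and theta starts at zero
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes theta[j].
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n
            rho += col_scale[j] * theta[j]
            new_value = soft_threshold(rho, lam) / col_scale[j]
            delta = new_value - theta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                theta[j] = new_value
    return theta
```

On the toy input X = [[1, 0], [0, 1]], y = [4, 0.5] with lam = 1, the result is [2.0, 0.0]: the large coefficient is shrunk toward zero and the small one is set exactly to zero. That shrinkage is the bias the rest of the talk sets out to remove.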

Approach 1: Sample splitting

[Diagram: run the Lasso on one part of the data to obtain a subset S of variables; run least squares on the other part, restricted to S; the distribution of the resulting estimate is known explicitly]

[Wasserman, Roeder 2009; Bühlmann, Meier, Meinshausen 2009]

Problems with sample splitting

• Have to cut the data in half.

• Assumes the Lasso step selects all relevant features (plus some).

• The result depends on the splitting.

Approach 2: Bootstrap

[Diagram: data → resampled data]

Fails because of the bias!

Our approach: de-biasing Lasso

Classical setting (n > p):

[Formula missing from transcript: the least-squares estimator, annotated as an unbiased estimator with a precise distributional characterization and a Gaussian error term]

Problem in high dimension (n < p):

XᵀX is not invertible!
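The classical picture can be spelled out in one line. This is the textbook least-squares calculation under the linear model with Gaussian noise, written in assumed notation (y, X, θ_0, w, σ); it is offered as a reminder, not as the slide's exact formula.

```latex
\hat{\theta}^{\mathrm{LS}}
  = (X^{\mathsf T} X)^{-1} X^{\mathsf T} y
  = \theta_0 + (X^{\mathsf T} X)^{-1} X^{\mathsf T} w
\quad\Longrightarrow\quad
\hat{\theta}^{\mathrm{LS}} \sim
  \mathsf{N}\big(\theta_0,\; \sigma^2 (X^{\mathsf T} X)^{-1}\big)
```

When p > n, the p × p matrix XᵀX has rank at most n, so the inverse above does not exist and this route is closed.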

Let's (try to) subtract the bias.

[Formula missing from transcript: the de-biased estimator, with its error split into a bias term and a Gaussian error term]

Use your favorite M.
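The de-biasing step itself is a one-line correction. The sketch below assumes the form θ̂ᵘ = θ̂ + (1/n)·M·Xᵀ(y − Xθ̂), which is how the de-biased Lasso is usually written in the literature; the helper names are hypothetical and the matrices are plain Python lists to keep the example dependency-free.

```python
def mat_vec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    """Return the transpose of matrix A as a list of rows."""
    return [list(col) for col in zip(*A)]


def debias(theta_hat, X, y, M):
    """Return theta_hat + (1/n) * M * X^T * (y - X * theta_hat)."""
    n = len(X)
    fitted = mat_vec(X, theta_hat)
    residual = [yi - fi for yi, fi in zip(y, fitted)]
    score = [g / n for g in mat_vec(transpose(X), residual)]  # (1/n) X^T r
    correction = mat_vec(M, score)
    return [t + c for t, c in zip(theta_hat, correction)]


if __name__ == "__main__":
    # Toy low-dimensional case where the sample covariance is invertible.
    X = [[1.0, 0.0], [0.0, 1.0]]
    y = [4.0, 0.5]
    theta_hat = [2.0, 0.0]        # a shrunken (biased) estimate
    M = [[2.0, 0.0], [0.0, 2.0]]  # inverse of X^T X / n = 0.5 * I
    print(debias(theta_hat, X, y, M))  # prints [4.0, 0.5]
```

In this toy case M is the exact inverse of the sample covariance, so the correction recovers the least-squares solution. In high dimension no exact inverse exists, which is why the choice of M becomes the central question.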

Geometric interpretation

[Figure: a ball and a subgradient; the remaining labels are missing from the transcript]

How should we choose M?

[Formula missing from transcript: the error of the de-biased estimator split into a bias term and an error term]

We want small bias and small error.
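The bias and error terms can be made explicit with a short derivation. It assumes the linear model y = Xθ_0 + w, the de-biased estimator in its usual form, and the notation Σ̂ = XᵀX/n for the sample covariance; these symbols are assumptions, since the slide's own formula is missing.

```latex
\hat{\theta}^{u} = \hat{\theta} + \frac{1}{n} M X^{\mathsf T} (y - X\hat{\theta}),
\qquad \hat{\Sigma} = \frac{1}{n} X^{\mathsf T} X

\sqrt{n}\,\big(\hat{\theta}^{u} - \theta_0\big)
  = \underbrace{\sqrt{n}\,\big(M\hat{\Sigma} - I\big)\big(\theta_0 - \hat{\theta}\big)}_{\text{bias}}
  + \underbrace{\frac{1}{\sqrt{n}}\, M X^{\mathsf T} w}_{\text{Gaussian error}}
```

Given X, the error term is Gaussian with covariance σ² M Σ̂ Mᵀ, and the bias is small when M Σ̂ is close to the identity. Those two quantities are what the choice of M has to balance.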

Choosing M?

minimize Var(Error_i)

subject to |Bias_i| ≤ ξ

• m_i: i-th row of M

• e_i: (0, 0, …, 1, 0, …, 0)

[Figure: variance plotted against the bias bound ξ, showing the feasible set, an infeasible region, and the point ξ = ξ*]

[Formula missing from transcript: the resulting program, "minimize … subject to … ξ"]
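The program whose formula is missing is commonly written as follows in the de-biased Lasso literature. The notation (m for a candidate row, Σ̂ = XᵀX/n, e_i for the i-th standard basis vector) is assumed, and the form is given as a sketch consistent with the "variance subject to a bias bound" description, not as a verbatim copy of the slide.

```latex
\begin{aligned}
\text{minimize}   \quad & m^{\mathsf T} \hat{\Sigma}\, m \\
\text{subject to} \quad & \big\| \hat{\Sigma}\, m - e_i \big\|_{\infty} \le \xi
\end{aligned}
```

The objective is proportional to the conditional variance of the i-th error coordinate, and the constraint controls the bias through Hölder's inequality: |Bias_i| ≤ √n · ||Σ̂ m − e_i||_∞ · ||θ̂ − θ_0||₁. The program is solved once per coordinate i, and its solution is used as the i-th row of M.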

What does it look like?

The de-biased estimate is not sparse!

Distribution of our estimator?

Neglecting the bias: [formula missing from transcript]

[Figure: histogram and Q-Q plot of the estimator; ‘ground truth’ from n_tot = 10000 records]

Confidence intervals

[Figure: confidence intervals for the coefficients plotted against indices; one entry is labeled "Blood pressure"]

Coverage: 93.6%
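Intervals like the ones in the figure can be computed from the de-biased estimate with a few lines. The sketch below assumes the usual normal-approximation form, θ̂ᵘ_i ± z·σ̂·sqrt([M Σ̂ Mᵀ]_ii / n), where σ̂ is an estimate of the noise level; the function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def diag_entry(M, Sigma_hat, i):
    """Compute [M Sigma_hat M^T]_ii = m_i^T Sigma_hat m_i for row m_i of M."""
    m = M[i]
    p = len(m)
    return sum(m[a] * Sigma_hat[a][b] * m[b] for a in range(p) for b in range(p))


def debiased_interval(theta_u_i, sigma_hat, q_ii, n, alpha=0.05):
    """Interval theta_u_i +/- z * sigma_hat * sqrt(q_ii / n).

    q_ii stands for the i-th diagonal entry of M * Sigma_hat * M^T.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half_width = z * sigma_hat * math.sqrt(q_ii / n)
    return theta_u_i - half_width, theta_u_i + half_width


if __name__ == "__main__":
    q = diag_entry([[2.0, 0.0], [0.0, 2.0]], [[0.5, 0.0], [0.0, 0.5]], 0)
    print(debiased_interval(theta_u_i=1.0, sigma_hat=2.0, q_ii=q, n=100))
```

The half-width depends on the noise level, the sample size, and one diagonal entry of M Σ̂ Mᵀ, with no explicit dependence on the number of variables.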

Main Theorem

Theorem (Javanmard, Montanari 2013)

Assume X has i.i.d. subgaussian rows with covariance Σ. Also, the eigenvalues of Σ are bounded as the sample size grows. Then, asymptotically as n → ∞, with [sample-size condition missing from transcript], [conclusion missing from transcript].

What is s? The number of truly significant variables (the number of nonzero parameters).

Consequences

• Confidence interval for each individual parameter

• The length of the confidence intervals does not depend on p.

• This is optimal.

Summary (so far)

• High dimensionality and regularized estimators

• Uncertainty assessment for parameter estimates

• Optimality

An R package will be available soon!

Further insights and related work

Two questions

• How general?

• What about smaller sample size?

Question 1: How to generalize it?

Regularized estimators: argmin of loss plus regularizer [formula missing from transcript]

Suppose that the loss decomposes over samples: [formula missing from transcript]

• De-bias the regularized estimator [formula missing from transcript]

• Find M by solving the same optimization problem [formula missing from transcript: "minimize … subject to … ξ", with one term annotated as the Fisher information]
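One way to write the generalization, under assumed notation (ℓ for the per-sample loss, R for the regularizer, L for the averaged loss), is sketched below. The gradient-based correction is the form usually given for de-biasing general regularized estimators; treat it as an illustration, since the slide's formulas are missing.

```latex
\hat{\theta} \in \arg\min_{\theta}
  \big\{ L(\theta) + \lambda R(\theta) \big\},
\qquad
L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; x_i, y_i)

\hat{\theta}^{u} = \hat{\theta} - M\, \nabla L(\hat{\theta})
```

For the squared loss ℓ(θ; x_i, y_i) = ½(y_i − ⟨x_i, θ⟩)², the gradient is ∇L(θ) = −(1/n)·Xᵀ(y − Xθ), so the correction reduces to the linear-model form θ̂ + (1/n)·M·Xᵀ(y − Xθ̂). The "Fisher information" annotation presumably marks the matrix that takes the place of the sample covariance when M is chosen.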

Question 2: How about smaller sample size?

• Estimation, prediction: [sample-size condition missing from transcript] [Candès, Tao 2007; Bickel et al. 2009]

• Uncertainty assessment, confidence intervals: [sample-size condition missing from transcript] [This talk]

Can we match the optimal sample size?

• Javanmard, Montanari, 2013: sample size [condition missing from transcript]; Gaussian designs; exact asymptotic characterization.

• Javanmard, Montanari, 2013: sample size [condition missing from transcript]; confidence intervals have (nearly) optimal average length.

Related work

• Lockhart, Taylor, Tibshirani, Tibshirani, 2012: test significance along the Lasso path.

• Zhang, Zhang, 2012; van de Geer, Bühlmann, Ritov, 2013: assume structure on X. For random designs, Σ⁻¹ is assumed to be sparse. Optimality in terms of semiparametric efficiency.

• Bühlmann, 2012: tests are overly conservative.

Future directions

Two directions

• Uncertainty assessment for predictions

• Other applications

Thank you!
