Lecture 2 – Modern Statistical Modeling, an Overview

8/31/2006 1

Rice ELEC 697

Farinaz Koushanfar

Fall 2006

Summary

• A little bit of history

• The culture of statistical modeling

– Classic

– Modern

• Exploratory data analysis

– Exploratory vs. confirmatory

– Examples

A little bit of history

• Statistics is the science of learning from data to understand its meaning, structure, relationships, etc.; it has been practiced for hundreds of years

• For a brief history of pre-20th-century statistics, see: http://www.bized.ac.uk/timeweb/reference/statisticians.htm

• Statistics began to separate from mathematics as an independent discipline about 70 years ago

• Like many other disciplines in science and engineering, statistics has undergone a major revolution in the past 30 years

• Earlier, most data was collected manually, and we were dealing with small data sets. Now we have terabit-scale databases that we would like to capture and model

The scientists behind what I will talk about today…

• "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem" – John Tukey

• Wrote his PhD thesis, Convergence and Uniformity in Topology, at Princeton (1939)

• Recognized the importance of statistics during World War II

• Mathematics is just a tool to facilitate addressing sound problems

• Many contributions, including the fast Fourier transform, the jackknife, and exploratory data analysis

John Wilder Tukey, 1915-2000

John W. Tukey, "We Need Both Exploratory and Confirmatory," The American Statistician, Vol. 34, No. 1 (Feb., 1980), pp. 23-25

The scientists behind what I will talk about today…

• PhD in math at Berkeley in 1954

• Became a professor of probability in the UCLA math department

• Left in 1967, having realized that abstract mathematics has very little to do with real life

• Wrote a book and worked as an independent consultant for 13 years

• Finally could solve interesting and important real-world problems!!

• Got a Berkeley position in 1980, this time to help fund the right department for him

Leo Breiman, 1928-2005

Leo Breiman, "Statistical Modeling: The Two Cultures," Statistical Science, Vol. 16, No. 3 (Aug., 2001), pp. 199-215

The culture of statistical modeling

• Statistics really starts with data

• Two main goals

– Prediction (estimation)

– Information (detection)

• Two different cultures

– Stochastic models (classic), e.g., response var = f(predictor var, random noise, parameters); model selection, prediction, evaluation

– Algorithmic models (modern): the relating function is an algorithm that operates on the input x to predict the response y

[Figure: input x → Nature → response y]

Breiman argues that,

The focus on classical data models and the neglect of modern methods has:

• Led to irrelevant theory and questionable scientific conclusions

• Kept statisticians from using more suitable models

• Prevented classical statisticians from working on exciting new problems

• In this course, we will cover mostly classical methods and a few modern ones

Back to the history

• Upon his return to academia, Breiman realized that all articles (at the time) began and ended with data models

• Data models have had success in analyzing data and in gaining information about the mechanism producing the data

• Misuse of data models has led to many questionable conclusions about the underlying system

• Algorithmic models are mostly developed in the machine learning community

• Modern learning has led to changes in perception!

The model becomes the truth!

• Invent or use a reasonably good parametric class of models for a complex mechanism

• Estimate the parameters and draw conclusions:

– The conclusions are about the model's mechanism, not about nature's mechanism

– If the model is a poor approximation of nature, the conclusions are wrong!

• Example: $y = b_0 + \sum_{m=1}^{M} b_m x_m + \varepsilon$

• Assume that the data are iid and follow the above model

• The coefficients $\{b_m\}$ are to be estimated; the noise $\varepsilon$ is $N(0, \sigma^2)$

• Tests of hypotheses, confidence intervals, distribution of the residual sum of squares, etc.

• Thousands of articles have been published on related proofs

• Conclusions are drawn without checking whether the models are valid
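
To make the point concrete, here is a minimal sketch under stated assumptions: a hypothetical "nature" in which y = x^2 plus small Gaussian noise, and a straight-line data model (the M = 1 case of the linear model). The helper names `fit_line` and `r_squared` are illustrative, not from the lecture. The fitted slope comes out near zero, so a conclusion read off the model ("x has no effect on y") would describe the model, not nature.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x (the M = 1 case)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def r_squared(xs, ys, b0, b1):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)
# Hypothetical "nature": y depends strongly on x, but not linearly.
xs = [i / 50.0 - 1.0 for i in range(101)]  # evenly spaced on [-1, 1]
ys = [x * x + rng.gauss(0.0, 0.05) for x in xs]

b0, b1 = fit_line(xs, ys)
r2 = r_squared(xs, ys, b0, b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, R^2 = {r2:.3f}")
```

A plot of the residuals against x would expose the curvature immediately, which is the kind of check the exploratory approach later in the lecture encourages.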

More problems with classical data models

• Multiplicity of data models

– Which model is the best?

– Each model gives a different picture of reality and leads to different conclusions

• Predictive accuracy

– This is a function of the number of parameters used, so it is not a good measure on its own

• Other limitations of data models (next slide)
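
The dependence of fit on the number of parameters can be sketched as follows. This is an illustration under assumptions of my own choosing, not the lecture's example: the data come from a hypothetical sine curve plus noise, and the model is a piecewise-constant fit with k equal-width bins (k parameters). The error on the data used for fitting falls mechanically as k grows, reaching zero at one bin per point, while the error on a fresh sample from the same process does not.

```python
import math
import random

def fit_bin_means(xs, ys, k):
    """Piecewise-constant model on [0, 1): one mean per bin, k parameters."""
    sums = [0.0] * k
    counts = [0] * k
    for x, y in zip(xs, ys):
        j = int(x * k)
        sums[j] += y
        counts[j] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def rss(xs, ys, means):
    """Residual sum of squares of the binned model on (xs, ys)."""
    k = len(means)
    return sum((y - means[int(x * k)]) ** 2 for x, y in zip(xs, ys))

rng = random.Random(1)
n = 64
xs = [(i + 0.5) / n for i in range(n)]
truth = [math.sin(2 * math.pi * x) for x in xs]   # hypothetical mechanism
y_train = [t + rng.gauss(0.0, 0.3) for t in truth]
y_test = [t + rng.gauss(0.0, 0.3) for t in truth]  # fresh noise, same x

ks = [1, 2, 4, 8, 16, 32, 64]
train_rss, test_rss = [], []
for k in ks:
    means = fit_bin_means(xs, y_train, k)
    train_rss.append(rss(xs, y_train, means))
    test_rss.append(rss(xs, y_test, means))
    print(f"k = {k:2d}  train RSS = {train_rss[-1]:7.3f}  test RSS = {test_rss[-1]:7.3f}")
```

The bin counts double at each step so that the models are nested, which guarantees the training error can only go down.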

Limitations of data models

• Multivariate analysis is just not working

– Nobody really believes in the multivariate Normal, but everybody uses it

– If all a man has is a hammer, then every problem looks like a nail... As data become more complex, the simplicity of the model-based approach diminishes

– Approaching the problem by looking for a data model restricts statisticians from dealing with more interesting and realistic problems

Algorithmic models

• Have been around for some time; pioneers among statisticians include Olshen, Friedman, Wahba, Zhang, and Singer

• Many new problems have been attacked, including speech, image, and handwriting recognition, nonlinear time series, and financial market prediction

• Shift from data models to the properties of the algorithms

• Characterizing convergence and complexity

• Example: Vapnik constructed informative bounds on the generalization error of a classification algorithm that depend on the capacity of the algorithm! (Support Vector Machines)

Examples of recent advances

• Multiplicity of good models (Rashomon)

– Bagging is a solution

• The conflict between simplicity and accuracy (Occam)

– Occam dilemma: accuracy requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors

• Dimensionality – curse or blessing (Bellman)

– How to extract and put together many small pieces of information
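
Bagging (bootstrap aggregating) can be sketched in a few lines. The choices here are assumptions for illustration only: the base model is a one-split regression stump, the data are a hypothetical noisy line, and the names `fit_stump`, `bag_stumps`, and `predict_bagged` are made up for this sketch. Each bootstrap resample yields a slightly different but comparably good stump; averaging them replaces the arbitrary choice of one model with an aggregate.

```python
import random

def fit_stump(points):
    """Fit a one-split regression stump to (x, y) pairs by least squares."""
    pts = sorted(points)
    n = len(pts)
    total = sum(y for _, y in pts)
    best = None  # (score, threshold, left_mean, right_mean)
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pts[i - 1][1]
        if pts[i - 1][0] == pts[i][0]:
            continue  # cannot split between equal x values
        right_sum = total - left_sum
        # Minimizing the RSS is equivalent to maximizing this score.
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if best is None or score > best[0]:
            threshold = (pts[i - 1][0] + pts[i][0]) / 2
            best = (score, threshold, left_sum / i, right_sum / (n - i))
    if best is None:  # all x equal: fall back to a constant predictor
        mean = total / n
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def bag_stumps(points, n_models, rng):
    """Fit one stump per bootstrap resample (sampling with replacement)."""
    return [fit_stump([rng.choice(points) for _ in points])
            for _ in range(n_models)]

def predict_bagged(stumps, x):
    """Aggregate by averaging the individual predictions."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

rng = random.Random(2)
data = [(i / 39, i / 39 + rng.gauss(0.0, 0.2)) for i in range(40)]  # hypothetical
grid = [i / 20 for i in range(21)]

single = [predict_stump(fit_stump(data), x) for x in grid]
stumps = bag_stumps(data, 25, rng)
bagged = [predict_bagged(stumps, x) for x in grid]
for x, s, b in zip(grid, single, bagged):
    print(f"x = {x:.2f}  single = {s:6.3f}  bagged = {b:6.3f}")
```

A single stump can only output two values, while the bagged predictor blends the split points of all the resampled stumps into a smoother curve.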

Breiman's concluding remarks

• Nowhere is it written in stone what kind of model should be used

• Breiman is not against data models, but he thinks the emphasis has to be on the problem, not on the model

• Find a way to manage complex environments

• E.g. microarray data, Internet traffic, ad-hoc network complexity, ULSI variability, etc.

• The root of science is to check theory against reality

• Need this philosophy to address real-world problems

Exploratory data analysis (EDA)

• Analysis can be done by various techniques

– Mathematical

– Logical

– Tabular

– Graphical

– …

EDA

• EDA mostly uses graphical techniques, but it is really a different philosophy for approaching the problem

• Differs from classical methods, also referred to as confirmatory data analysis (CDA)

EDA vs. CDA

• CDA

– A general problem to explore

– Collects some data

– Makes a hypothesis on the models

– Carries out an analysis of the data based on models

– Draws conclusions based on the model features

• EDA

– A general problem to explore

– Collects some data

– Carries out an analysis of the data

– Infers a model that is appropriate

– Draws conclusions based on the data features

EDA vs. CDA (Cont’d)

• Rigor

– CDA is rigorous, formal, and objective

– EDA is suggestive and subject to the analyst's view

• Data treatment

– In CDA, a few numbers summarize the data properties

– In EDA, all of the data is in focus

• Assumptions

– In CDA, one discovers statistically significant variations from the assumed model, assuming it was correct

– In EDA, the assumptions are few; analysis of the data has priority

Why Exploratory Data Analysis?

• EDA is oriented toward the future, rather than the past

– Utilize data to understand, rather than summarize

– Really important in research

• A good feel for data is invaluable

– Gain insights into the process behind the data

– Understand what is NOT in the data

• Can (almost) only be obtained by graphical techniques

– Graphs give information that no number can replace

– Rely on the human ability to recognize patterns and to compare

Typical Assumptions for Measurement Process

• The data from a process is:

– Random drawings (one data point should not influence another)

– From a fixed distribution (and thus generalizable)

– The distribution has a fixed location (the expectation is fixed)

– And a fixed variation (the way the data differ from the expectation is fixed)

• We measure mean and variance to assess the last two assumptions
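
One simple way to probe the fixed-location and fixed-variation assumptions numerically is to compare the mean and variance across consecutive segments of the series. The sketch below is an illustration under assumptions: both series are simulated (a hypothetical stable process and one with a slow upward drift), and `segment_summaries` is a made-up helper name.

```python
import random
import statistics

def segment_summaries(ys, n_segments=4):
    """Split ys into consecutive segments; return (mean, variance) of each."""
    size = len(ys) // n_segments
    out = []
    for s in range(n_segments):
        seg = ys[s * size:(s + 1) * size]
        out.append((statistics.mean(seg), statistics.variance(seg)))
    return out

rng = random.Random(3)
# Hypothetical processes: one with fixed location, one whose location drifts.
stable = [rng.gauss(10.0, 1.0) for _ in range(200)]
drifting = [rng.gauss(10.0 + 0.05 * i, 1.0) for i in range(200)]

for name, series in (("stable", stable), ("drifting", drifting)):
    print(name)
    for mean, var in segment_summaries(series):
        print(f"  mean = {mean:6.2f}  variance = {var:5.2f}")
```

Segment means that stay close together are consistent with a fixed location; means that march steadily upward, as in the drifting series, are not.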

EDA Techniques

• Plot many aspects of the data using a variety of techniques, including scatter plots, bar plots, histograms, pie charts, and factor plots

• E.g. run sequence plot for the mean and variance assumptions

– All values of y_i are plotted against the index i: y_i on the y-axis, i on the x-axis

– Graphically check the fixed location

– Graphically check the fixed variation
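
A run sequence plot is simple enough to sketch as text output. The version below is a crude stand-in, assuming simulated data and a made-up helper `run_sequence_rows`; in practice a plotting library would be used. Each column is one index i and the star marks the level of y_i, so a shift in location shows up as the stars jumping to a different band.

```python
import random

def run_sequence_rows(ys, height=8):
    """Render y_i against index i as text rows (top row = largest values)."""
    lo, hi = min(ys), max(ys)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant series
    rows = [[" "] * len(ys) for _ in range(height)]
    for i, y in enumerate(ys):
        level = int((y - lo) / span * (height - 1) + 0.5)  # 0 = bottom row
        rows[height - 1 - level][i] = "*"
    return ["".join(r) for r in rows]

rng = random.Random(4)
# Hypothetical series: one steady, one whose location shifts halfway through.
steady = [rng.gauss(0.0, 1.0) for _ in range(60)]
shifted = [rng.gauss(0.0 if i < 30 else 4.0, 1.0) for i in range(60)]

for name, series in (("steady", steady), ("shifted", shifted)):
    print(name)
    print("\n".join(run_sequence_rows(series)))
    print()
```

Comparing the two outputs side by side is the graphical counterpart of checking the fixed-location and fixed-variation assumptions.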

Example: EDA

• Run sequence plot (compare the two)
