
70-208: Regression Spring 2016 John Gasper Lecture 1: Introduction to Regression Analysis.

Date post: 08-Jan-2018
Upload: lily-glenn
Page 1:

70-208: Regression Spring 2016

John Gasper

Lecture 1: Introduction to Regression Analysis

Page 2:

Welcome

What is Regression? Why should we care? What can we do with it?
• How much do sales increase with every advertisement placed?
• How do wages of employees depend on education?
• How will the price of a stock change?
• Estimating demand (optimal pricing)
• Estimating effects and Prediction/Forecasting

Page 3:

Welcome

Teaching staff:
• Who am I?
• Who are you? Stop by my office.
• Office hours: To Be Announced this week.
  • And by appointment
• Teaching Assistants for the course:
  • TA: Aniish Sridhar (aniishs@andrew )
    • Office hours: TBA
  • Undergrad Course Assistants: office hours TBA
    • Yousuf, Anas, and Zehni

Page 4:

Course Details
• Textbooks:
  • Statistics for Business (main text; you should have it)
  • Next Generation Excel (supplemental text – in the Library)
• Attendance and participation
  • Required. Clickers: bring them to every class.
  • Blackboard + Piazza discussion site
• Cell phones and laptops
  • Turn off your phones.
  • Computers are OK for taking notes and working through data. NOT OK to check news, Facebook, Twitter, YouTube…
  • Seriously. If a TA or I see you, odds are that I’ll ask you to leave. It’s disrespectful to me and other students.

Page 5:

Course Details
• Grades: (aka what you stress over but shouldn’t)
• How do you get a good grade in this class? The only way to learn the material is to do it.
• Homework Exercises = 10%
  • Problem Sets graded on Check System.
• Lab Quizzes (x5) = 4% each (20%)
  • Attend 90% of classes and your best 4 of 5 are scored.
• Midterm Exams (x3) = 15% each (45%)
• Final Exam = 25%

Page 6:

Academic Integrity

I take academic integrity very seriously
• I know all professors say that – but trust me…
• I do my best to make sure that it doesn’t make sense for you to cheat:
  • Homeworks
  • Exams
• If I suspect a violation, I will report it.
• Before you copy someone’s homework (that won’t increase your grade), remember that penalties for violations can include suspension or even expulsion.

Page 7:

Why Excel?
• The wisdom of Willie Sutton
  • Who is Willie Sutton?
  • “…because that’s where the money is.”
• There are lots of great statistical packages (I’m a fan of R)
• But there are many benefits of being comfortable with Excel
• IMO those benefits outweigh the pain of using Excel for something it wasn’t really designed to do.
• If you’re unfamiliar with Excel, I would highly suggest reviewing videos via Lynda.com:
  • http://www.cmu.edu/lynda/

Page 8:

Course Details
• Warning: There is a lot of material in the course and we’ll move quickly.
• Any questions?

Page 9:

Review

Data: what is it?
• Types of measurements: nominal, ordinal, interval, and ratio
• Categorical data
  • Measures of Centrality: mode, median (if data are ordered)
  • Cross tabulations often useful
• Numerical
  • Measures of Centrality: median, mode, mean
  • Measures of Spread: variance/standard dev, range, interquartile range, etc.

Page 10:

Review: Describing Data

There are many ways to describe and examine data, and at a basic level that is what we’ll be doing in this class: summarizing relationships between variables.
• You should be familiar with and know how to calculate:
  • Categorical
    • 1 variable: bar charts, pie charts, etc.
    • 2 variables: Contingency tables (x-tabs); Chi-sq tests
  • Numerical
    • 1 variable: histograms, boxplots, cumulative distribution
    • 2 variables: scatterplots, correlation, t-test, etc…

Page 11:

Common Plots

• Histogram, PDF and CDF of exam scores
• Scatterplot of Exam 1 and Exam 2 (correlation = 0.39)

Page 12:

Review: Graphical summaries ?

Page 13:

Review: Graphical summaries

• boxplot
• histogram

Page 14:

Review: Graphical summaries

Center: Median?
• 3.5

Page 15:

Review: Graphical summaries

Interquartile range?
• First to third quartile

Page 16:

Review: Graphical summaries

Center: Mean?
• 3.8 Why?

Page 17:

Review: Graphical summaries

Center: Mean?
• The mean is greater than the median here (3.8 vs. 3.5) because the data are slightly skewed to the right.
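
As a rough illustration of why skew pulls the mean above the median, the sketch below uses made-up values (not the course data) chosen so that the mean is 3.8 and the median is 3.5, as on the slide. The single large value drags the mean upward while the median stays put.

```python
import statistics

# Hypothetical values (NOT the course data), with a longer right tail.
values = [1, 2, 3, 3, 3, 4, 4, 5, 5, 8]

# The mean uses every value, so the large value 8 pulls it upward.
mean_value = statistics.mean(values)

# The median depends only on the middle of the sorted data.
median_value = statistics.median(values)

if __name__ == "__main__":
    print(f"mean = {mean_value}, median = {median_value}")
```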

Page 18:

Excel example

Dataset: cars1.xlsx

Page 19:

Probability Review
• What does ‘P(heads) = .5’ mean?
• What about ‘P(“Alice will get an A in Regression”) = .75’?
• Frequentist vs Bayesian interpretations. Differences don’t matter for this class and I’ll use language from both.
• Basic properties:
  • 0 ≤ P(A) ≤ 1
  • P(A) = 1 – P(Aᶜ), where Aᶜ means “not A”
  • P(A or B) = P(A) + P(B) – P(A and B)
  • Events A and B are independent if the occurrence of one doesn’t tell you anything about the occurrence of the other.
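
The basic properties above can be checked on a small example. The sketch below assumes one roll of a fair six-sided die (an example that is not in the slides) and uses the product rule P(A and B) = P(A)P(B), which is the standard formal version of the verbal definition of independence.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (a hypothetical example).
OUTCOMES = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of outcomes), all outcomes equally likely."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

A = {2, 4, 6}  # "roll is even"
B = {5, 6}     # "roll is greater than 4"

# P(A) = 1 - P(not A)
complement_rule = prob(A) == 1 - prob(OUTCOMES - A)

# P(A or B) = P(A) + P(B) - P(A and B)
addition_rule = prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Independence: P(A and B) = P(A) * P(B)
independent = prob(A & B) == prob(A) * prob(B)

if __name__ == "__main__":
    print(prob(A), prob(B), prob(A & B), prob(A | B))
    print(complement_rule, addition_rule, independent)
```

In this example, knowing the roll is greater than 4 leaves the chance of an even roll at 1/2, so the two events are independent.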

Page 20:

Conditional Probability

• P(A and B) is often called the “joint probability”
• P(A) is the “marginal probability”
  • P(A and B) + P(A and ~B) = P(A)
• The conditional probability
  • P(A|B) = P(A and B) / P(B)
  • P(A|B) is very different from P(B|A).

        B              ~B
A       P(A and B)     P(A and ~B)     P(A)
~A      P(~A and B)    P(~A and ~B)    P(~A)
        P(B)           P(~B)           1
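
To make the table concrete, here is a minimal sketch that fills it with made-up joint probabilities (the numbers are assumptions, chosen only so that they sum to 1) and then computes the marginals and both conditional probabilities.

```python
from fractions import Fraction

# Hypothetical joint probabilities for the 2x2 table (they sum to 1).
joint = {
    ("A", "B"): Fraction(10, 100),
    ("A", "~B"): Fraction(20, 100),
    ("~A", "B"): Fraction(30, 100),
    ("~A", "~B"): Fraction(40, 100),
}

def marginal(label, position):
    """Add up the joint cells whose row (position 0) or column (position 1) is `label`."""
    return sum(p for key, p in joint.items() if key[position] == label)

p_a = marginal("A", 0)  # row total:    P(A and B) + P(A and ~B)
p_b = marginal("B", 1)  # column total: P(A and B) + P(~A and B)

p_a_given_b = joint[("A", "B")] / p_b  # P(A|B) = P(A and B) / P(B)
p_b_given_a = joint[("A", "B")] / p_a  # P(B|A) = P(A and B) / P(A)

if __name__ == "__main__":
    print(f"P(A) = {p_a}, P(B) = {p_b}")
    print(f"P(A|B) = {p_a_given_b}, P(B|A) = {p_b_given_a}")
```

With these numbers P(A|B) = 1/4 while P(B|A) = 1/3, which shows that the two conditionals need not agree.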

Page 21:

More review: Normal Distribution

What is the Normal distribution?
• Often called the “Bell Shaped Curve.”
• This isn’t quite right. It is bell shaped, but there are many bell shaped distributions that aren’t the Normal dist.

Normal, or Gaussian, distributions are going to be very important for us.
• Often we’ll need to assume that a random variable X is Normally distributed, denoted X ~ N(μ, σ²)
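
For readers who want to see the N(μ, σ²) density written out, the sketch below codes the formula by hand and compares it with Python's standard-library `statistics.NormalDist`. The parameter values are assumptions for illustration. Note that `NormalDist` takes the standard deviation σ, whereas the N(μ, σ²) notation lists the variance.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, written out from the formula."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical parameters, for illustration only.
mu, sigma = 100.0, 15.0
dist = NormalDist(mu, sigma)  # second argument is sigma, not sigma squared

# Probability of landing within one standard deviation of the mean.
within_one_sd = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)

if __name__ == "__main__":
    print(normal_pdf(110, mu, sigma), dist.pdf(110))
    print(f"P(|X - mu| <= sigma) = {within_one_sd:.4f}")
```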

Page 22:

Normal Distribution

[Figure: Normal density curves with different μ and with different σ]

Page 23:

Random Variables
• Random doesn’t mean haphazard. Consider an uncertain investment: X
  • X could lose 1000 (with probability = .3)
  • X could gain 10000 (with probability = .2)
  • X could gain 100 (with probability = .5)
• X is a Random Variable. What is the expectation of X?
  • E(X) = p(x₁)x₁ + p(x₂)x₂ + … + p(xₙ)xₙ
  • E(X) = 0.5*100 + 0.2*10000 + 0.3*(–1000) = 1750 = μ
• Variance of X?
  • Var(X) = E[(X – μ)²] = σ²
  • = (x₁ – μ)² p(x₁) + (x₂ – μ)² p(x₂) + … + (xₙ – μ)² p(xₙ)
  • = (100 – 1750)² * 0.5 + (10000 – 1750)² * 0.2 + (–1000 – 1750)² * 0.3 = 17,242,500
• And higher order moments: Skew, Kurtosis, etc.
• Regression is basically about Conditional Expectation: E(Y|X)
  • I.e., what do we expect about Y given we have some information X
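
The expectation and variance calculations for the investment can be sketched in a few lines. The outcomes and probabilities come from the slide; the function names are illustrative, and exact fractions are used only to avoid floating-point rounding.

```python
import math
from fractions import Fraction

# The investment from the slide, as (outcome, probability) pairs.
investment = [
    (-1000, Fraction(3, 10)),
    (10000, Fraction(2, 10)),
    (100, Fraction(5, 10)),
]

def expectation(dist):
    """E(X) = sum of p(x) * x over all outcomes."""
    return sum(p * x for x, p in dist)

def variance(dist):
    """Var(X) = sum of p(x) * (x - mu)^2 over all outcomes."""
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist)

mu = expectation(investment)
var = variance(investment)
sd = math.sqrt(var)  # standard deviation, in the same units as X

if __name__ == "__main__":
    print(f"E(X) = {mu}, Var(X) = {var}, SD(X) = {sd:.2f}")
```

The standard deviation (about 4152) is more than twice the expected gain of 1750, which is one way to see how risky this investment is.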

Page 24:

More on the Normal Dist

Normality
• Why assume Normality? The Central Limit Theorem tells us that we’re often OK: The probability distribution of a mean (or sum) of IID random variables tends to a Normal distribution (asymptotically)
  • Several versions of the CLT but we won’t go through the proofs here (they can be a little nasty)
• So why are we OK?
  • Observed data are often (not always) the accumulation of many small factors (e.g., the value of the stock market depends on many investors, or scores on an exam)
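
A small simulation can make the Central Limit Theorem claim tangible. The sketch below is an illustration under assumed settings (an exponential distribution with mean 1, samples of size 50, 5000 repetitions, a fixed seed), none of which come from the slides. Even though single draws are strongly skewed, the sample means cluster around 1 with spread close to 1/√50, and roughly 68% of them fall within one standard deviation of the center, as a Normal distribution would predict.

```python
import random
import statistics

def sample_means(n, reps, seed=0):
    """Means of `reps` independent samples of size `n` from a skewed distribution."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

means = sample_means(n=50, reps=5000)

center = statistics.fmean(means)  # should be near the population mean, 1
spread = statistics.stdev(means)  # should be near 1 / sqrt(50), about 0.141

# Share of sample means within one standard deviation of the center.
share_within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

if __name__ == "__main__":
    print(f"center = {center:.3f}, spread = {spread:.3f}")
    print(f"share within one SD = {share_within_one_sd:.3f}")
```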

Page 25:

Quantile Plots
• A visual check on Normality
• Why wouldn’t just looking at the density or histogram work?
  • Sometimes skew, kurtosis, etc., is easy to see but often it is not unless you look at a quantile plot

If the data track the diagonal line, it is reasonable to treat them as Normally distributed.
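
One way to build the points of a Normal quantile plot is sketched below. It assumes the plotting positions (i + 0.5)/n, which is one common convention among several, and it uses simulated data rather than anything from the course. The correlation between theoretical and observed quantiles serves as a rough numeric stand-in for "tracks the diagonal".

```python
import random
import statistics
from statistics import NormalDist

def qq_points(data):
    """Return (theoretical, observed) quantile pairs for a Normal quantile plot."""
    n = len(data)
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    # Observed quantiles: the sorted data, standardized.
    observed = sorted((x - mean) / sd for x in data)
    # Theoretical quantiles: standard Normal at plotting positions (i + 0.5) / n.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

def qq_correlation(data):
    """Pearson correlation of the quantile pairs; near 1 means close to the diagonal."""
    xs, ys = zip(*qq_points(data))
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Simulated data for illustration: one Normal sample, one skewed sample.
rng = random.Random(1)
normal_sample = [rng.gauss(0, 1) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

if __name__ == "__main__":
    print(f"Normal sample: r = {qq_correlation(normal_sample):.4f}")
    print(f"Skewed sample: r = {qq_correlation(skewed_sample):.4f}")
```

Plotting the pairs from `qq_points` as a scatterplot gives the quantile plot itself; the skewed sample bends away from the diagonal in the tails.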

Page 26:

Standardizing a Variable: z-scores

What is a z-score?
• Transforms a variable to standard deviation units away from the mean: z = (x – μ) / σ. Centered at 0.
• Why would we use it?
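
A minimal sketch of standardizing, using z = (x − μ)/σ. The exam scores are made up for illustration, and the population standard deviation is used here, which is an assumption; a sample standard deviation would give slightly different z-scores.

```python
import statistics

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def standardize(data):
    """Convert every value to a z-score using the data's own mean and SD."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # population SD (an assumption of this sketch)
    return [z_score(x, mu, sigma) for x in data]

# Hypothetical exam scores, for illustration only.
scores = [55, 60, 70, 80, 85]
z_scores = standardize(scores)

if __name__ == "__main__":
    print([round(z, 2) for z in z_scores])
```

After standardizing, the values have mean 0 and standard deviation 1, so variables measured in different units can be compared on the same scale.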

Page 27:

Probabilities and Percentiles

1. What is P(X = 600)?
2. What is P(X >= 600)?

Page 28:

Percentiles
• The lifetime (in km) of a certain brand of automobile tires is a normally distributed random variable,
  • X ~ N(μ = 40,000 km, σ = 2,000 km)
• In a shipment of 3000 tires, how many tires are expected to have a lifetime that is less than 35,000 km?
  • E(# of tires) = P(X < 35000) * 3000
• So how do we calculate P(X < 35000)?
  • Z-scores. Or very easy in Excel: NORM.DIST()
  • NORM.DIST(x, μ, σ, cumulative?)
  • NORM.DIST(35000, 40000, 2000, TRUE) = .0062
  • E(# of tires) = .0062 * 3000 = 18.6 ≈ 19
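
The same tire calculation can be reproduced outside Excel. The sketch below uses Python's standard-library `statistics.NormalDist`, whose `cdf` method plays the role of NORM.DIST with the cumulative flag set to TRUE; the variable names are illustrative.

```python
from statistics import NormalDist

# Tire lifetime from the slide: mean 40,000 km, standard deviation 2,000 km.
lifetime = NormalDist(mu=40_000, sigma=2_000)

# P(X < 35000), the counterpart of NORM.DIST(35000, 40000, 2000, TRUE).
p_short = lifetime.cdf(35_000)

# Expected number of such tires in a shipment of 3000.
expected_short = p_short * 3_000

# The z-score route: standardize, then use the standard Normal.
z = (35_000 - 40_000) / 2_000  # -2.5
p_from_z = NormalDist().cdf(z)

if __name__ == "__main__":
    print(f"P(X < 35000) = {p_short:.4f}")
    print(f"Expected tires = {expected_short:.1f}")
```

Both routes give the same probability, about 0.0062, so roughly 19 of the 3000 tires are expected to last less than 35,000 km.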

Page 29:

Next time
• If any of the topics today seem hazy, review those chapters (take note of chapters 4, 12, and 15).
• Problem Set 1 due next Monday 9am.
• First quiz next Wednesday
• Pick up your clicker this week
  • Must have it by next Monday’s class

