Regression

Albert Satorra

Mètodes Estadístics, UPF, winter 2013 (GRAU en CP)


Contents

1 Univariate statistics
    Summary statistics
    Standardization
    Transforming non-normal data
    Inference: estimation of the mean
    ANOVA: variation across groups

2 Regression analysis
    More data sets, for examples


Univariate statistics

Summary statistics

Summary of the distribution of Nota (grade)

> summary(Nota)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.011   5.297   6.444   6.475   7.613  10.000

length(Nota) = 365
mean(Nota) = 6.475341
sd(Nota) = 1.56445


Figure: histogram of Nota (x-axis: Nota, 3 to 10; y-axis: density, 0 to 0.25)


Standardization

Percentiles under the normal distribution

mean(Nota) = 6.48; sd(Nota) = 1.56

Standardized variable: Znota = (Nota - 6.48) / 1.56

What proportion of students score above 5? Under normality, 1 - pnorm((5 - 6.48) / 1.56) = 0.828618, i.e. about 83%.

Compare with the observed proportion, sum(Nota > 5) / length(Nota) = 0.8109589, i.e. about 81%.

PERCENTILES:

5 is the 17.14th percentile, since pnorm((5 - 6.48) / 1.56) = 0.171382

7 is the 63.06th percentile, since pnorm((7 - 6.48) / 1.56) = 0.6305587

In general, e.g. the 32nd percentile is:

mean(Nota) + qnorm(0.32) * sd(Nota) = 5.74365
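The percentile calculations on this slide can be wrapped in two small helpers. This is only a sketch: the function names `normal_rank` and `normal_percentile` are illustrative and not part of the course material, and the inputs are the rounded summaries printed on the slides rather than the raw Nota vector.

```r
# Summary values as printed on the slides
m <- 6.48   # mean(Nota), rounded
s <- 1.56   # sd(Nota), rounded

# Percentile rank of a value x under a normal model (hypothetical helper)
normal_rank <- function(x, mu, sigma) pnorm((x - mu) / sigma)

# Value located at percentile p under a normal model (hypothetical helper)
normal_percentile <- function(p, mu, sigma) mu + qnorm(p) * sigma

rank5  <- normal_rank(5, m, s)   # share at or below 5
above5 <- 1 - rank5              # share above 5
rank7  <- normal_rank(7, m, s)   # share at or below 7

# The qnorm(0.32) call from the slide, using the unrounded mean and sd
p32 <- normal_percentile(0.32, 6.475341, 1.56445)

print(round(c(rank5 = rank5, above5 = above5, rank7 = rank7, p32 = p32), 4))
```

Keeping the rank and the percentile as inverse functions makes it easy to check that the probability passed to `qnorm` matches the percentile being reported.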


Property values (Valor): non-normality

data = read.table("http://www.econ.upf.edu/~satorra/dades/inmob.txt", header = T)

> summary(Valor)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 160000  381300  751400 1031000 1388000 4937000

We normalize by taking the log of Valor:

logValor = log(Valor)

> summary(logValor)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  11.98   12.85   13.53   13.54   14.14   15.41

mean(logValor) = 13.54
sd(logValor) = 0.78

For the 3rd quartile, 1388000, we have

> pnorm((log(1388000) - mean(logValor)) / sd(logValor))
[1] 0.7796363

which is close to 0.75, the value expected for a third quartile.

The 60th percentile on the log scale uses qnorm(0.6) = 0.2533471, so on the original scale it is

exp(13.53899 + qnorm(0.6) * 0.7839275) = 925043.2

The observed percentile rank of 925043.2 is

sum(Valor < 925043.2) / length(Valor) = 0.585, that is, 58.5%.

This accuracy would not be obtained without the log transformation of Valor. The normal approximation of the 60th percentile on the raw scale is

mean(Valor) + qnorm(0.60) * sd(Valor) = 1246721

but 1246721 is actually the 71st percentile, since sum(Valor < 1246721) / length(Valor) = 0.7072222.
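A minimal sketch of the back-transformation used above, assuming that log(Valor) is approximately normal. The helper name `lognormal_percentile` is hypothetical, and the inputs are the log-scale mean and standard deviation printed on the slide, so the raw data file is not needed.

```r
# Log-scale summaries as printed on the slide
mean_log <- 13.53899
sd_log   <- 0.7839275

# Percentile p of Valor when log(Valor) is treated as normal (hypothetical helper)
lognormal_percentile <- function(p, mean_log, sd_log) {
  exp(mean_log + qnorm(p) * sd_log)
}

# 60th percentile on the original scale
p60 <- lognormal_percentile(0.60, mean_log, sd_log)

# Normal-model percentile rank of the observed third quartile
q3_rank <- pnorm((log(1388000) - mean_log) / sd_log)

print(c(p60 = p60, q3_rank = q3_rank))
```

The percentile is computed on the log scale and exponentiated at the end. Because exp() is increasing, percentiles carry over directly, which is not true of the mean.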


Figure: histogram of Valor (x-axis: Valor, 0 to 5e+06; y-axis: density, 0 to 7e-07)


Figure: histogram of logValor (x-axis: logValor, 12 to 15; y-axis: density, 0 to 0.6)


Inference: estimation of the mean

Estimation of the population mean using a sample

> sample1 = sample(round(Nota, 2), 20)
> sample1
 [1] 4.03 8.17 5.16 9.51 5.96 6.77 6.57 4.39 7.61 8.30 5.66 6.54 6.38 5.20 9.25 6.96 7.49 5.99 8.46 3.44

> summary(sample1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.440   5.545   6.555   6.592   7.750   9.510

sd(sample1) = 1.678046

> t.test(sample1)
data:  sample1
t = 17.5682, df = 19, p-value = 3.314e-13
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 5.80665 7.37735
(we know that the true value is 6.48)

> t.test(sample1, conf.level = 0.9)
...
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 5.943191 7.240809
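The intervals reported by t.test can be reproduced by hand from the sample mean, the sample standard deviation and a t quantile. The sketch below uses the 20 values of sample1 printed on the slide; the helper name `t_interval` is illustrative only.

```r
# The 20 sampled grades printed on the slide
sample1 <- c(4.03, 8.17, 5.16, 9.51, 5.96, 6.77, 6.57, 4.39, 7.61, 8.30,
             5.66, 6.54, 6.38, 5.20, 9.25, 6.96, 7.49, 5.99, 8.46, 3.44)

# Two-sided t confidence interval for the mean (hypothetical helper)
t_interval <- function(x, conf.level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)                            # standard error of the mean
  tq <- qt(1 - (1 - conf.level) / 2, df = n - 1)   # t quantile with n - 1 df
  mean(x) + c(-1, 1) * tq * se
}

ci95 <- t_interval(sample1)
ci90 <- t_interval(sample1, conf.level = 0.90)
print(rbind(ci95, ci90))
```

Lowering the confidence level only changes the t quantile, which is why the 90% interval is narrower than the 95% one around the same sample mean.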


Estimation of the population mean using a second sample

> sample2 = sample(round(Nota, 2), 20)
> summary(sample2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.160   5.535   6.120   6.308   7.268   8.570

> sd(sample2)
[1] 1.201925

> t.test(sample2)
data:  sample2
t = 23.4727, df = 19, p-value = 1.702e-15
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 5.745982 6.871018


ANOVA: variation across groups

Is there variation across groups? (groups are classrooms)

ANOVA:

> aov2 = aov(TestExamenFinal ~ as.factor(GrupTeoria))
> summary(aov2)
                       Df Sum Sq Mean Sq F value   Pr(>F)
as.factor(GrupTeoria)   4    996  249.10   8.998 6.17e-07 ***
Residuals             360   9966   27.68
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes: the F test rejects equal group means (p = 6.17e-07).
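The F value and its p-value follow directly from the mean squares and degrees of freedom in the ANOVA table. This sketch recomputes them from the printed table entries, not from the raw data, so small rounding differences against the slide are expected.

```r
# Entries read off the ANOVA table on the slide
df_between <- 4        # groups minus 1
df_within  <- 360      # residual degrees of freedom
ms_between <- 249.10   # mean square for as.factor(GrupTeoria)
ms_within  <- 27.68    # residual mean square

# F statistic: between-group variation relative to within-group variation
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(F = f_value, p = p_value))
```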


Figure: class group variation of Nota (x-axis: class groups 1 to 5; y-axis: 0 to 20)


Is there variation across groups? (groups are types of exam)

> aggregate(Nota, list(Tipus), mean)
  Group.1        x
1       1 10.90449
2       2 10.73333
3       3 10.84239
4       4 11.04255

boxplot(Nota ~ Tipus, col = 2:5, main = "per tipus (1 a 4) d'examen")
dev.copy2pdf(file = "/AlbertNou/COURSES/Metodes05_07_10/boxplot2.pdf")

ANOVA

> aov1 = aov(Nota ~ as.factor(Tipus))
> summary(aov1)
                  Df Sum Sq Mean Sq F value Pr(>F)
as.factor(Tipus)   3      5   1.534   0.051  0.985
Residuals        361  10958  30.354

No: there is no evidence of differences across exam types (p = 0.985).


Figure: boxplot of Nota by exam type, titled "per tipus (1 a 4) d'examen" (x-axis: types 1 to 4; y-axis: 0 to 20)


On relations among categorical data

Examples of categorical variables include marital status (never married, married, divorced, or widowed) or race (black, white, Asian, Native American, and Inuit). We are interested in the study of associations between variables that may have more than two categories.

More details on relations among categorical variables can be found here: Slides


Regression analysis

The regression model relates a single outcome to one or more covariates. The model embodies basic ideas about explaining variation, statistical control, and the functional form of social and behavioral relationships.


Variation of the final exam grade (Y) across different levels of the midterm exam (X)

Figure: scatterplot of Y vs X (x-axis: MitjanaControls, 40 to 100; y-axis: TestExamenFinal, 0 to 20)


Simple Linear Regression

> out = lm(TestExamenFinal ~ MitjanaControls)
> summary(out)

Call:
lm(formula = TestExamenFinal ~ MitjanaControls)

Residuals:
     Min       1Q   Median       3Q      Max
-12.7665  -3.7240   0.0601   3.7760  13.2760

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       -4.415      1.984  -2.226   0.0267 *
MitjanaControls    0.179      0.023   7.782 7.49e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.087 on 363 degrees of freedom
Multiple R-squared:  0.143,  Adjusted R-squared:  0.1406
F-statistic: 60.56 on 1 and 363 DF,  p-value: 7.491e-14
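To read the fitted line, it can help to plug a few midterm averages into the estimated equation. The sketch below uses the rounded coefficients from the table above. The midterm values 50, 70 and 90 are hypothetical inputs chosen for illustration, and `predict_final` is not a function from the slides.

```r
# Coefficients copied from the summary(out) table on the slide
b0 <- -4.415   # intercept
b1 <- 0.179    # slope for MitjanaControls

# Fitted final-exam score for a given midterm average (hypothetical helper)
predict_final <- function(midterm) b0 + b1 * midterm

midterms <- c(50, 70, 90)   # hypothetical midterm averages
fitted   <- predict_final(midterms)
print(data.frame(midterms, fitted))

# In simple regression the F statistic can be recovered from R-squared:
# F = R^2 / (1 - R^2) * residual degrees of freedom
f_from_r2 <- 0.143 / (1 - 0.143) * 363
print(f_from_r2)
```

With the fitted model object available, `predict(out, newdata = data.frame(MitjanaControls = c(50, 70, 90)))` would give the same fitted values without rounding the coefficients.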


library(foreign)
help(read.spss)

# Read the SPSS file into a data frame, keeping value labels as factor levels
data = read.spss("http://www.econ.upf.edu/~satorra/dades/europeancountries.sav",
                 use.value.labels = TRUE, to.data.frame = TRUE, max.value.labels = Inf,
                 trim.factor.names = FALSE, trim_values = TRUE, reencode = NA,
                 use.missings = TRUE)
attach(data)
names(data)

# Empty plot with axes only, so that countries can be drawn as text labels
plot(GNPCAP, LIFE_EXPECTANCY, type = "n", axes = F)
axis(1)
axis(2)

# All countries except row 27, in blue
text(GNPCAP[-27], LIFE_EXPECTANCY[-27], COUNTRY[-27], col = "blue", cex = 0.8)
# Fitted regression line
abline(lm(LIFE_EXPECTANCY ~ GNPCAP), lty = 4, col = "red", lwd = 3)
# Row 27 highlighted in red
text(GNPCAP[27], LIFE_EXPECTANCY[27], COUNTRY[27], col = "red", cex = 1.2)

dev.copy2pdf(file = "/AlbertNou/COURSES/Metodes05_07_10/YvsX2.pdf")


Variation of life expectancy (Y) against GNP per capita (X)

Figure: scatterplot of LIFE_EXPECTANCY (y-axis, 70 to 78) against GNPCAP (x-axis, 5000 to 20000), with each point labelled by country name


> summary(lm(LIFE_EXPECTANCY ~ GNPCAP))

Call:
lm(formula = LIFE_EXPECTANCY ~ GNPCAP)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9348 -1.1209  0.3521  0.8021  4.0403

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.039e+01  6.373e-01 110.461  < 2e-16 ***
GNPCAP      3.804e-04  4.996e-05   7.615 2.14e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.711 on 29 degrees of freedom
Multiple R-squared:  0.6666,  Adjusted R-squared:  0.6551
F-statistic: 57.99 on 1 and 29 DF,  p-value: 2.142e-08
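The slope is tiny in raw units because GNPCAP is measured in single dollars. A sketch of how one might rescale it and attach a 95% confidence interval, using only the estimate, standard error and residual degrees of freedom printed in the table (rounded, so results are approximate):

```r
# Values copied from the coefficient table on the slide
slope    <- 3.804e-04   # change in LIFE_EXPECTANCY per unit of GNPCAP
se_slope <- 4.996e-05   # standard error of the slope
df_resid <- 29          # residual degrees of freedom

# 95% confidence interval for the slope
tq       <- qt(0.975, df_resid)
ci_slope <- slope + c(-1, 1) * tq * se_slope

# Re-express per 1000 units of GNPCAP for readability
slope_per_1000 <- 1000 * slope
ci_per_1000    <- 1000 * ci_slope
print(slope_per_1000)
print(ci_per_1000)

# The reported t value is the estimate divided by its standard error
t_value <- slope / se_slope
print(t_value)
```

With the fitted model object at hand, `confint()` on the lm fit would return the same interval without rounding.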


Simple Linear Regression using IBM SPSS Statistics

These are the steps to produce the chart using syntax:

1 Open the data file (in the IBM SPSS Statistics main menu, select FILE ... OPEN ... DATA, browse to the folder where the data file resides, select the data file and click Open).

2 From the main menu, select FILE ... NEW ... SYNTAX (not FILE ... NEW ... SCRIPT; this is where the difference between "syntax" and "script" matters). This opens a Syntax window.

3 Copy and paste the lines below into the new Syntax window.

4 From the main menu in the Syntax window, select RUN ... ALL.

With the data file open, paste the following lines into the Syntax window and run them by selecting RUN ... ALL in the main menu.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=GNPCAP LIFE_EXPECTANCY COUNTRY MISSING=LISTWISE
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: GNPCAP=col(source(s), name("GNPCAP"))
  DATA: LIFE_EXPECTANCY=col(source(s), name("LIFE_EXPECTANCY"))
  DATA: COUNTRY=col(source(s), name("COUNTRY"), unit.category())
  GUIDE: axis(dim(1), label("GNP per Capita (US Dollars)"))
  GUIDE: axis(dim(2), label("Life expectancy"))
  ELEMENT: point(position(GNPCAP*LIFE_EXPECTANCY), label(COUNTRY))
END GPL.


Property data: Valor versus SuperficieConstruida (built area)

> data = read.table("http://www.econ.upf.edu/~satorra/dades/inmob.txt", header = T); attach(data); names(data)
[1] "TipologiaInm" "Calidad" "TamanoPoblacion" "SuperficieConstruida" "Valor" "NumHabita" "NumBanos"
[8] "NumGarajes" "Ascensor"

# First attempt has the axes swapped (Valor on the x-axis)
plot(Valor, SuperficieConstruida)
# Outcome on the y-axis, matching the regression of Valor on SuperficieConstruida
plot(SuperficieConstruida, Valor)
abline(lm(Valor ~ SuperficieConstruida), col = "red", lty = 3, lwd = 2)

> out1 = lm(Valor ~ SuperficieConstruida)
> summary(out1)

Residuals:
     Min       1Q   Median       3Q      Max
-1764579  -151065   -19344   153227  1308771

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          -186807.3    14760.1  -12.66   <2e-16 ***
SuperficieConstruida   11727.7      119.6   98.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 338800 on 1798 degrees of freedom
Multiple R-squared:  0.8425,  Adjusted R-squared:  0.8424
F-statistic: 9616 on 1 and 1798 DF,  p-value: < 2.2e-16
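Two identities tie the pieces of this output together when there is a single predictor: the F statistic is the square of the slope's t value, and the correlation between the two variables is the square root of R-squared, with the sign of the slope. A short check using the rounded figures printed above:

```r
# Figures copied from the summary(out1) output on the slide
t_slope   <- 98.06    # t value for SuperficieConstruida
r_squared <- 0.8425   # Multiple R-squared
slope     <- 11727.7  # estimated slope (positive)

# With one predictor, F equals the squared t value of the slope
f_stat <- t_slope^2

# Correlation between Valor and SuperficieConstruida, signed like the slope
r <- sign(slope) * sqrt(r_squared)

print(c(F = f_stat, r = r))
```

The small gap between `f_stat` and the printed 9616 comes from rounding the t value to two decimals.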


More data sets, for examples

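Multiple Regression (CPS data)

> data = read.table("http://www.econ.upf.edu/~satorra/dades/CurrentPopulationSurveyData99ext.raw.txt", header = T)
> names(data)
[1] "female" "age" "mstat" "edyr" "hgc" "reth" "hw"

# Make hw and the other columns visible by name
attach(data)

# Normality checks before and after the log transformation
qqnorm(hw)
qqnorm(log(hw))
hist(hw)

out1 <- lm(log(hw) ~ edyr + age + female + mstat, data = data)
summary(out1)

# Added-variable plots for each covariate
library(car)
avPlots(out1)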
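Since the outcome is log(hw), each coefficient is read as an approximate proportional change in hw. The sketch below illustrates that reading on simulated data, so that it runs without downloading the CPS file. The simulated values, the sample size and the coefficients of the data-generating model are all assumptions made for illustration, and mstat is left out for brevity.

```r
set.seed(1)
n <- 500

# Simulated stand-ins for the CPS variables (not the real data)
edyr   <- sample(8:20, n, replace = TRUE)    # years of education
age    <- sample(18:64, n, replace = TRUE)
female <- rbinom(n, 1, 0.5)

# Assumed data-generating model on the log scale
log_hw <- 1 + 0.08 * edyr + 0.01 * age - 0.20 * female + rnorm(n, sd = 0.4)
sim    <- data.frame(hw = exp(log_hw), edyr, age, female)

fit <- lm(log(hw) ~ edyr + age + female, data = sim)
b   <- coef(fit)

# Percent change in hw for a one-unit increase in each covariate
pct_change <- 100 * (exp(b[-1]) - 1)
print(round(pct_change, 1))
```

For small coefficients, 100 * b is close to the exact percent change 100 * (exp(b) - 1); the two diverge as the coefficient grows in absolute value.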


Other data sets: fatality data, Iraq data, welfare data

### Fatality data
library(foreign)
fata = read.dta("http://www.econ.upf.edu/~satorra/dades/TraficFatality.dta")

### Iraq data
iraq = read.table("http://www.econ.upf.edu/~satorra/dades/datairaq.raw", header = T)

### Welfare data
dades = read.table("http://www.econ.upf.edu/~satorra/dades/Datawelfare.raw", header = T)
