Regression

Albert Satorra

Mètodes Estadístics, UPF, winter 2013 (GRAU en CP)

Contents

1 Univariate statistics
    Summary statistics
    Standardization
    Transforming non-normal data
    Inference: estimation of the mean
    ANOVA: variation across groups

2 Regression analysis
    More data sets, for examples


Univariate statistics

Summary statistics

Summary of the distribution of Nota

summary(Nota)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.011   5.297   6.444   6.475   7.613  10.000

length(Nota) = 365

mean(Nota) = 6.475341

sd(Nota) = 1.56445


Figure: histogram of Nota (x-axis: Nota, 3 to 10; y-axis: density, 0.00 to 0.25)



Standardization

Percentiles under the normal distribution

mean(Nota) = 6.48; sd(Nota) = 1.56

Standardized variable: Znota = (Nota - 6.48) / 1.56

What proportion of students score above 5? 1 - pnorm((5 - 6.48) / 1.56) = 0.828618, i.e. 83%

Compare with the observed proportion: sum(Nota > 5) / length(Nota) = 0.8109589, i.e. 81%

PERCENTILES:

5 is the 17.14 percentile, since pnorm((5 - 6.48) / 1.56) = 0.171382

7 is the 63.06 percentile, since pnorm((7 - 6.48) / 1.56) = 0.6305587

In general, e.g. the 32nd percentile is:

mean(Nota) + qnorm(0.32) * sd(Nota) = 5.74365
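
The percentile calculations above can be reproduced with a short R sketch that uses only the rounded mean and standard deviation reported on the slide. The helper names pct_of and value_at are illustrative and not part of the original material.

```r
# Rounded summary values reported for Nota
m <- 6.48
s <- 1.56

# Percentile rank (0-100) of a score x under a normal approximation (hypothetical helper)
pct_of <- function(x, mean, sd) 100 * pnorm((x - mean) / sd)

# Score located at percentile p (0-100) under the same approximation (hypothetical helper)
value_at <- function(p, mean, sd) mean + qnorm(p / 100) * sd

rank_5 <- pct_of(5, m, s)      # about 17.14
rank_7 <- pct_of(7, m, s)      # about 63.06
above_5 <- 100 - rank_5        # share above 5, about 82.86

# The two helpers are inverses of each other: this recovers the score 5
back_to_5 <- value_at(rank_5, m, s)
```

Because both helpers rely on the normal approximation, their answers are only as good as the normality of Nota; the comparison with the observed proportion (81% against 83%) gives a feel for the size of the error.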


Standardization

Property values (Valor immobles): non-normality

data=read.table("http://www.econ.upf.edu/~satorra/dades/inmob.txt",header = TRUE); attach(data)

summary(Valor)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 160000  381300  751400 1031000 1388000 4937000

We normalize by taking the log of Valor

logValor = log(Valor)

summary(logValor)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  11.98   12.85   13.53   13.54   14.14   15.41

mean(logValor) = 13.54

sd(logValor) = 0.78

For the 3rd Qu., 1388000, we have that

pnorm((log(1388000) - mean(logValor))/sd(logValor))

[1] 0.7796363

which is close to 0.75, the value expected for the third quartile.

The 60th percentile of the standard normal is qnorm(0.6) = 0.2533471

So, back on the original scale: exp(13.53899 + qnorm(0.6)*0.7839275) = 925043.2

Note that the observed percentile rank of 925043.2 is

sum(Valor < 925043.2)/length(Valor) = 0.585; that is, 58.5%

This accuracy would not be obtained without the log transformation of Valor:

The normal approximation of the 60th percentile on the raw scale is mean(Valor) + qnorm(0.60)*sd(Valor) = 1246721

but 1246721 is actually the 71st percentile, since sum(Valor < 1246721)/length(Valor) = 0.7072222
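
To illustrate why the log transformation helps, the sketch below uses simulated right-skewed data rather than the inmob.txt file. The lognormal parameters merely echo the reported mean and sd of logValor, and all variable names are illustrative assumptions.

```r
set.seed(1)
# Synthetic stand-in for Valor: lognormal, so log(v) is exactly normal (an assumption)
v <- rlnorm(5000, meanlog = 13.54, sdlog = 0.78)

# Normal approximation of the 60th percentile on the raw scale and on the log scale
q_raw <- mean(v) + qnorm(0.6) * sd(v)
q_log <- exp(mean(log(v)) + qnorm(0.6) * sd(log(v)))

# Empirical percentile rank actually attained by each approximation
rank_raw <- mean(v < q_raw)
rank_log <- mean(v < q_log)

# The log-scale value should sit much closer to 0.60 than the raw-scale one
c(raw = rank_raw, log = rank_log)
```

On skewed data the raw-scale approximation overshoots, which mirrors the slide's finding that the raw-scale "60th percentile" of Valor is in fact near the 71st.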


Figure: histogram of Valor (x-axis: Valor, 0e+00 to 5e+06; y-axis: density, 0e+00 to 7e-07)


Figure: histogram of logValor (x-axis: logValor, 12 to 15; y-axis: density, 0.0 to 0.6)


Estimation of the population mean using a sample

sample1=sample(round(Nota,2),20)

sample1

[1] 4.03 8.17 5.16 9.51 5.96 6.77 6.57 4.39 7.61 8.30 5.66 6.54 6.38 5.20 9.25 6.96 7.49 5.99 8.46 3.44

summary(sample1)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.440   5.545   6.555   6.592   7.750   9.510

sd(sample1) = 1.678046

t.test(sample1)

data: sample1
t = 17.5682, df = 19, p-value = 3.314e-13
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.80665 7.37735 (the interval contains the true value, which we know is 6.48)

t.test(sample1, conf.level = 0.9)

...

90 percent confidence interval:
5.943191 7.240809
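
The intervals reported by t.test() can be rebuilt by hand from the 20 values printed above. This is a minimal sketch; t_interval is a hypothetical helper name, not a function from the original slides.

```r
# The 20 sampled grades printed on the slide
sample1 <- c(4.03, 8.17, 5.16, 9.51, 5.96, 6.77, 6.57, 4.39, 7.61, 8.30,
             5.66, 6.54, 6.38, 5.20, 9.25, 6.96, 7.49, 5.99, 8.46, 3.44)

n  <- length(sample1)
se <- sd(sample1) / sqrt(n)   # standard error of the mean

# Two-sided t interval for a given confidence level (hypothetical helper)
t_interval <- function(x, level) {
  half <- qt(1 - (1 - level) / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
  c(lower = mean(x) - half, upper = mean(x) + half)
}

ci95 <- t_interval(sample1, 0.95)
ci90 <- t_interval(sample1, 0.90)

# t statistic for H0: mean = 0, which is what t.test() tests by default
t_stat <- mean(sample1) / se
```

The 90% interval is narrower than the 95% one because it uses a smaller t quantile with the same standard error.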


Estimation of the population mean using a sample

sample2=sample(round(Nota,2),20)

Sample 2:

summary(sample2)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.160   5.535   6.120   6.308   7.268   8.570

sd(sample2)

[1] 1.201925

t.test(sample2)

data: sample2
t = 23.4727, df = 19, p-value = 1.702e-15
95 percent confidence interval:
5.745982 6.871018
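
Different samples give different intervals, and a 95% interval is expected to cover the true mean in roughly 95% of samples. The sketch below checks this by simulation on a synthetic population; the normal population with the reported mean and sd of Nota is an assumption, since the real grades are not reproduced here.

```r
set.seed(2013)
# Synthetic population standing in for Nota (assumed normal, 365 students)
population <- rnorm(365, mean = 6.48, sd = 1.56)
true_mean  <- mean(population)

# Draw many samples of size 20 and record whether each 95% interval covers the true mean
covers <- replicate(2000, {
  s  <- sample(population, 20)
  ci <- t.test(s)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

coverage <- mean(covers)   # expected to be near 0.95
```

Sampling without replacement from a finite population makes the actual coverage slightly higher than the nominal 95%.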


Group variation? (groups are classrooms)

ANOVA:

aov2= aov(TestExamenFinal ~ as.factor(GrupTeoria))

summary(aov2)

                       Df Sum Sq Mean Sq F value   Pr(>F)
as.factor(GrupTeoria)   4    996  249.10   8.998 6.17e-07 ***
Residuals             360   9966   27.68
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes: the means differ significantly across classrooms (p = 6.17e-07).
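
The F value in the table is the ratio of the two mean squares, and its p-value comes from the F distribution. A minimal sketch using the rounded figures from the table follows; because the inputs are rounded, the result matches the printed 8.998 only approximately. The eta-squared line is an added illustration, not something computed on the slide.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 996;  df_between <- 4
ss_within  <- 9966; df_within  <- 360

ms_between <- ss_between / df_between   # mean square between groups
ms_within  <- ss_within / df_within     # mean square within groups (residual)

f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Share of total variation attributable to classroom (eta squared)
eta_sq <- ss_between / (ss_between + ss_within)
```

Although the difference between classrooms is highly significant, eta_sq shows that classroom accounts for only about 9% of the total variation in the exam score.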


Class group variation of Nota

Figure: Nota by class group (x-axis: groups 1 to 5; y-axis: 0 to 20)


Group variation? (groups are types of exam)

aggregate(Nota, list(Tipus),mean)

  Group.1        x
1       1 10.90449
2       2 10.73333
3       3 10.84239
4       4 11.04255

boxplot(Nota ~ Tipus, col=2:5, main="per tipus (1 a 4) d’examen")

dev.copy2pdf(file="/AlbertNou/COURSES/Metodes05_07_10/boxplot2.pdf")

ANOVA

aov1= aov(Nota ~ as.factor(Tipus))

summary(aov1)

                  Df Sum Sq Mean Sq F value Pr(>F)
as.factor(Tipus)   3      5   1.534   0.051  0.985
Residuals        361  10958  30.354

No: there is no evidence that the mean differs across exam types (p = 0.985).


Figure: boxplots of Nota by exam type (x-axis: types 1 to 4; y-axis: 0 to 20), titled "per tipus (1 a 4) d'examen" (by exam type, 1 to 4)


On relations between categorical variables

Examples of categorical variables include marital status (never married, married, divorced, or widowed) or race (black, white, Asian, Native American, and Inuit). We are interested in the study of associations between variables that may have more than two categories.

More details on relations among categorical variables can be found in these slides:

http://www.econ.upf.edu/~satorra/dades/AssociationCategoricalVariables.pdf

Regression analysis

The regression model relates a single outcome to one or a number of covariates. The model embodies basic ideas about explaining variation, statistical control, and the functional form of social and behavioral relationships.


Regression analysis

Variation of the final exam grade (Y) across different levels of the midterm exam (X)

Figure: scatterplot of Y (TestExamenFinal, 0 to 20) vs X (MitjanaControls, 40 to 100)


Regression analysis

Simple Linear Regression

> out=lm(TestExamenFinal ~ MitjanaControls)

> summary(out)

Call:
lm(formula = TestExamenFinal ~ MitjanaControls)

Residuals:
     Min       1Q   Median       3Q      Max
-12.7665  -3.7240   0.0601   3.7760  13.2760

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       -4.415      1.984  -2.226   0.0267 *
MitjanaControls    0.179      0.023   7.782 7.49e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.087 on 363 degrees of freedom
Multiple R-squared: 0.143, Adjusted R-squared: 0.1406
F-statistic: 60.56 on 1 and 363 DF, p-value: 7.491e-14
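
The coefficient table can be read as follows: each t value is the estimate divided by its standard error, and the fitted line gives the expected final-exam score for a given midterm average. The sketch below works from the rounded numbers printed above, so results agree with the output only approximately; fitted_final is a hypothetical helper name.

```r
# Estimates and standard errors copied from the summary(out) table (rounded as printed)
b0 <- -4.415; se_b0 <- 1.984
b1 <-  0.179; se_b1 <- 0.023
df_resid <- 363

# t value and two-sided p-value for the slope
t_b1 <- b1 / se_b1
p_b1 <- 2 * pt(abs(t_b1), df = df_resid, lower.tail = FALSE)

# Fitted final-exam score for a given midterm average (hypothetical helper)
fitted_final <- function(x) b0 + b1 * x

fits <- fitted_final(c(50, 70, 90))   # 4.535, 8.115, 11.695
```

A 10-point increase in MitjanaControls is therefore associated with an expected increase of about 1.8 points in TestExamenFinal.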

Regression analysis

library(foreign)

help(read.spss)

# Read the SPSS file as a data frame, keeping value labels as factor levels
data = read.spss("http://www.econ.upf.edu/~satorra/dades/europeancountries.sav", use.value.labels = TRUE, to.data.frame = TRUE)

attach(data)

names(data)

# Empty plot frame; the axes are drawn separately
plot(GNPCAP, LIFE_EXPECTANCY, type = "n", axes = FALSE)

axis(1)

axis(2)

# Country names as point labels, all observations except number 27 in blue
text(GNPCAP[-27], LIFE_EXPECTANCY[-27], COUNTRY[-27], col="blue", cex=0.8)

# Fitted regression line
abline(lm(LIFE_EXPECTANCY ~ GNPCAP), lty=4, col="red", lwd=3)

# Observation 27 highlighted in larger red text
text(GNPCAP[27], LIFE_EXPECTANCY[27], COUNTRY[27], col="red", cex=1.2)

dev.copy2pdf(file="/AlbertNou/COURSES/Metodes05_07_10/YvsX2.pdf")


Regression analysis

Variation of life expectancy (Y) against GNP per capita (X)

Figure: scatterplot of LIFE_EXPECTANCY (y-axis, 70 to 78) against GNPCAP (x-axis, 5000 to 20000), with points labelled by country: Austria, Belarus, Belgium, Bosnia, Bulgaria, Croatia, Czech Rep., Denmark, Estonia, Finland, France, Georgia, Germany, Greece, Hungary, Iceland, Ireland, Italy, Latvia, Lithuania, Netherlands, Norway, Poland, Portugal, Romania, Russia, Sweden, Switzerland, Ukraine, United Kingdom, Spain

Regression analysis

> summary(lm(LIFE_EXPECTANCY ~ GNPCAP))

Call:
lm(formula = LIFE_EXPECTANCY ~ GNPCAP)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9348 -1.1209  0.3521  0.8021  4.0403

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.039e+01  6.373e-01 110.461  < 2e-16 ***
GNPCAP      3.804e-04  4.996e-05   7.615 2.14e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.711 on 29 degrees of freedom

F-statistic: 57.99 on 1 and 29 DF, p-value: 2.142e-08
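
In a regression with a single covariate, R-squared can be recovered from the F statistic and the residual degrees of freedom as F / (F + df). The sketch below applies this identity to the printed output and rescales the slope to a more readable unit. It works from rounded printed values, and it assumes that GNPCAP is in US dollars and life expectancy in years.

```r
# Values copied from the summary(lm(LIFE_EXPECTANCY ~ GNPCAP)) output
intercept <- 7.039e+01
slope     <- 3.804e-04
f_stat    <- 57.99
df_resid  <- 29

# In simple regression, R^2 = F / (F + residual df)
r_squared <- f_stat / (f_stat + df_resid)

# Expected change in life expectancy per 1000 units of GNPCAP
slope_per_1000 <- slope * 1000

# Fitted life expectancy at GNPCAP = 10000
fitted_10000 <- intercept + slope * 10000
```

So GNP per capita accounts for roughly two thirds of the variation in life expectancy across these countries, and each additional 1000 units of GNPCAP goes with about 0.38 more years.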


Regression analysis

Simple Linear Regression using IBM SPSS Statistics

These are the steps to produce the chart by using syntax:

1 Open the data file (in the IBM SPSS Statistics main menu, select FILE ... OPEN ... DATA, browse to the folder where the data file resides, select the data file and click Open)

2 From the main menu, select FILE ... NEW ... SYNTAX (not FILE ... NEW ... SCRIPT; this is where the difference between "syntax" and "script" matters). This will open a Syntax window.

3 Copy and paste the lines below into the new Syntax window

4 From the main menu in the Syntax window, select RUN ... ALL

In short: with the data file open, paste the following lines into the Syntax window and run them by selecting RUN ... ALL in the main menu.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=GNPCAP LIFE_EXPECTANCY COUNTRY MISSING=LISTWISE
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: GNPCAP=col(source(s), name("GNPCAP"))
  DATA: LIFE_EXPECTANCY=col(source(s), name("LIFE_EXPECTANCY"))
  DATA: COUNTRY=col(source(s), name("COUNTRY"), unit.category())
  GUIDE: axis(dim(1), label("GNP per Capita (US Dollars)"))
  GUIDE: axis(dim(2), label("Life expectancy"))
  ELEMENT: point(position(GNPCAP*LIFE_EXPECTANCY), label(COUNTRY))
END GPL.


Regression analysis

Property data (Dades immobles): Valor versus SuperficieConstruida

data=read.table("http://www.econ.upf.edu/~satorra/dades/inmob.txt",header = TRUE); attach(data); names(data)

[1] "TipologiaInm" "Calidad" "TamanoPoblacion" "SuperficieConstruida" "Valor" "NumHabita" "NumBanos"
[8] "NumGarajes" "Ascensor"

plot(Valor, SuperficieConstruida)

plot(SuperficieConstruida, Valor)

abline(lm(Valor ~ SuperficieConstruida), col="red", lty=3, lwd=2)

out1= lm(Valor ~ SuperficieConstruida)

summary(out1)

Residuals:
     Min       1Q   Median       3Q      Max
-1764579  -151065   -19344   153227  1308771

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          -186807.3    14760.1  -12.66   <2e-16 ***
SuperficieConstruida   11727.7      119.6   98.06   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 338800 on 1798 degrees of freedom

F-statistic: 9616 on 1 and 1798 DF, p-value: < 2.2e-16


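Regression analysis

Multiple Regression (CPS data)

data=read.table("http://www.econ.upf.edu/~satorra/dades/CurrentPopulationSurveyData99ext.raw.txt",header=TRUE)

names(data)

[1] "female" "age" "mstat" "edyr" "hgc" "reth" "hw"

# Normal Q-Q plots of hw on the raw and on the log scale, and its histogram
qqnorm(data$hw)

qqnorm(log(data$hw))

hist(data$hw)

# Regression of log(hw) on several covariates
out1 <- lm(log(hw) ~ edyr + age + female + mstat, data = data)

summary(out1)

# Added-variable (partial regression) plots, one per covariate
library(car)

avPlots(out1)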
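
When the outcome is log-transformed, as in the model above, each coefficient b translates into a relative change of exp(b) - 1 in the outcome per unit of the covariate. The sketch below illustrates this on simulated data, not on the CPS file; the covariate ranges and the true coefficients are assumptions chosen only for the example.

```r
set.seed(99)
n <- 2000

# Simulated covariates loosely named after the CPS variables (assumed ranges, not real data)
edyr   <- sample(8:20, n, replace = TRUE)
age    <- sample(18:65, n, replace = TRUE)
female <- rbinom(n, 1, 0.5)

# Simulated log outcome with known coefficients
log_hw <- 1.0 + 0.08 * edyr + 0.01 * age - 0.20 * female + rnorm(n, sd = 0.5)
sim <- data.frame(hw = exp(log_hw), edyr, age, female)

fit <- lm(log(hw) ~ edyr + age + female, data = sim)

# Percentage change in hw associated with a one-unit change of each covariate
pct_effect <- 100 * (exp(coef(fit)[-1]) - 1)
```

With a true coefficient of 0.08 on edyr, pct_effect for edyr should come out near 8.3%, slightly above 8% because exp(b) - 1 exceeds b for positive b.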


Regression analysis

Other data sets: traffic fatality data, Iraq data, welfare data

### Fatality data

library(foreign)

fata=read.dta("http://www.econ.upf.edu/~satorra/dades/TraficFatality.dta")

### Iraq data

iraq=read.table("http://www.econ.upf.edu/~satorra/dades/datairaq.raw", header=TRUE)

## Welfare data

dades=read.table("http://www.econ.upf.edu/~satorra/dades/Datawelfare.raw", header=TRUE)

