Regression
Albert Satorra
Mètodes Estadístics, GRAU en CP, UPF, Winter 2013
Contents
1 Univariate statistics
Summary statistics
Standardization
Transforming non-normal data
Inference: estimation of the mean
ANOVA: variation across groups
2 Regression analysis
More data sets, for examples
Univariate statistics
Summary statistics
Summary of the distribution of Nota
summary(Nota)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.011   5.297   6.444   6.475   7.613  10.000
length(Nota) = 365
mean(Nota) = 6.475341
sd(Nota) = 1.56445
Figure: histogram of Nota (x-axis: Nota, from 3 to 10; y-axis: Density, from 0.00 to 0.25)
Standardization
Percentiles under the normal distribution
mean(Nota) = 6.48; sd(Nota) = 1.56
Standardized variable: Znota = (Nota - 6.48)/1.56
What proportion of students scores above 5? 1 - pnorm((5 - 6.48)/1.56) = 0.828618, i.e. 83%
Compare with the observed value: sum(Nota > 5)/length(Nota) = 0.8109589, i.e. 81%
PERCENTILES:
5 is the 17.14th percentile, since pnorm((5 - 6.48)/1.56) = 0.171382
7 is the 63.06th percentile, since pnorm((7 - 6.48)/1.56) = 0.6305587
In general, e.g. the 32nd percentile is:
mean(Nota) + qnorm(0.32)*sd(Nota) = 5.74365
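The calculations on this slide can be wrapped in three small helpers. This is only a sketch: the function names are illustrative, and the mean and standard deviation are the rounded values quoted above, not the raw Nota data.

```r
# Rounded values taken from the slide
m <- 6.48   # mean(Nota)
s <- 1.56   # sd(Nota)

# Proportion above a cutoff under the normal approximation
prop_above <- function(x, mu, sigma) 1 - pnorm((x - mu) / sigma)

# Percentile rank (in %) of a value, and the value at a given percentile
percentile_rank  <- function(x, mu, sigma) 100 * pnorm((x - mu) / sigma)
percentile_value <- function(p, mu, sigma) mu + qnorm(p) * sigma

p_above_5 <- prop_above(5, m, s)
rank_7    <- percentile_rank(7, m, s)

# Round trip: the value at the percentile rank of 7 should be 7 again
back_to_7 <- percentile_value(rank_7 / 100, m, s)

print(c(p_above_5 = p_above_5, rank_7 = rank_7, back_to_7 = back_to_7))
```

pnorm and qnorm are inverses of each other, which is why the round trip recovers the original value.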
Property values (Valor immobles): non-normality
data = read.table("http://www.econ.upf.edu/~satorra/dades/inmob.txt", header = TRUE)
attach(data)
summary(Valor)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
  160000   381300   751400  1031000  1388000  4937000
We normalize by taking the log of Valor
logValor = log(Valor)
summary(logValor)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  11.98   12.85   13.53   13.54   14.14   15.41
mean(logValor) = 13.54
sd(logValor) = 0.78
For the 3rd Qu., 1388000, we have
pnorm((log(1388000) - mean(logValor))/sd(logValor))
[1] 0.7796363
which is close to 0.75, the value expected for the third quartile.
The 60th percentile of the standard normal is qnorm(0.6) = 0.2533471
So the 60th percentile of Valor is approximately exp(13.53899 + qnorm(0.6)*0.7839275) = 925043.2
Note that the observed percentile of 925043.2 is
sum(Valor < 925043.2)/length(Valor) = 0.585; that is, 58.5%
This accuracy would not be obtained without the log transformation of Valor:
the normal approximation of the 60th percentile on the raw scale is mean(Valor) + qnorm(0.60)*sd(Valor) = 1246721
Note that 1246721 is actually the 71st percentile, since sum(Valor < 1246721)/length(Valor) = 0.7072222
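The same comparison can be tried without downloading the data. The sketch below uses simulated right-skewed values, not the inmob.txt data: the name valor_sim and the seed are illustrative, and the log-scale parameters only mimic the mean and sd of logValor reported above.

```r
set.seed(1)
# Simulated skewed "values" (an assumption, not the real Valor variable)
valor_sim <- rlnorm(1800, meanlog = 13.54, sdlog = 0.78)

# 60th percentile approximated on the raw scale and on the log scale
q_raw <- mean(valor_sim) + qnorm(0.60) * sd(valor_sim)
q_log <- exp(mean(log(valor_sim)) + qnorm(0.60) * sd(log(valor_sim)))

# Empirical share of observations below each approximation (target: 0.60)
share_raw <- mean(valor_sim < q_raw)
share_log <- mean(valor_sim < q_log)

print(c(share_raw = share_raw, share_log = share_log))
```

On skewed data of this kind the log-scale approximation should land much closer to the 0.60 target than the raw-scale one.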
Figure: histogram of Valor (x-axis: Valor, from 0e+00 to 5e+06; y-axis: Density, from 0e+00 to 7e-07)
Figure: histogram of logValor (x-axis: logValor, from 12 to 15; y-axis: Density, from 0.0 to 0.6)
Inference: estimation of the mean
Estimation of the population mean using a sample
sample1=sample(round(Nota,2),20)
sample1
[1] 4.03 8.17 5.16 9.51 5.96 6.77 6.57 4.39 7.61 8.30 5.66 6.54 6.38 5.20 9.25 6.96 7.49 5.99 8.46 3.44
summary(sample1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.440   5.545   6.555   6.592   7.750   9.510
sd(sample1)= 1.678046
t.test(sample1)
data: sample1
t = 17.5682, df = 19, p-value = 3.314e-13
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.80665 7.37735 (the interval covers the true mean, which we know is 6.48)
t.test(sample1, conf.level = 0.9)
...
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
5.943191 7.240809
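The intervals reported by t.test can be reproduced by hand from the sample printed above. A minimal sketch, where ci_mean is an illustrative helper name and the data are the 20 values of sample1 as listed on the slide:

```r
sample1 <- c(4.03, 8.17, 5.16, 9.51, 5.96, 6.77, 6.57, 4.39, 7.61, 8.30,
             5.66, 6.54, 6.38, 5.20, 9.25, 6.96, 7.49, 5.99, 8.46, 3.44)

# t-based confidence interval for the mean: mean +/- t quantile * standard error
ci_mean <- function(x, conf.level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)
  tq <- qt(1 - (1 - conf.level) / 2, df = n - 1)
  mean(x) + c(-1, 1) * tq * se
}

ci95 <- ci_mean(sample1)
ci90 <- ci_mean(sample1, conf.level = 0.90)

print(ci95)
print(ci90)
```

A lower confidence level uses a smaller t quantile, so the 90% interval is narrower than the 95% one.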
Estimation of the population mean using a sample
sample2=sample(round(Nota,2),20)
Sample 2:
summary(sample2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.160   5.535   6.120   6.308   7.268   8.570
sd(sample2)
[1] 1.201925
t.test(sample2)
data: sample2
t = 23.4727, df = 19, p-value = 1.702e-15
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.745982 6.871018
ANOVA: variation across groups
Is there variation across groups? (groups are classrooms)
ANOVA:
aov2 = aov(TestExamenFinal ~ as.factor(GrupTeoria))
summary(aov2)
                       Df Sum Sq Mean Sq F value   Pr(>F)
as.factor(GrupTeoria)   4    996  249.10   8.998 6.17e-07 ***
Residuals             360   9966   27.68
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes: the group means differ significantly (p = 6.17e-07).
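The F value and p-value in the ANOVA table follow directly from the two mean squares. A small sketch using the rounded figures printed above (the variable names are illustrative):

```r
# Mean squares and degrees of freedom read off the ANOVA table
ms_between <- 249.10; df_between <- 4
ms_within  <- 27.68;  df_within  <- 360

# F statistic: between-group mean square over within-group mean square
f_stat <- ms_between / ms_within

# p-value: upper tail of the F distribution with (4, 360) degrees of freedom
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

Because the inputs are rounded, the result agrees with the table only up to rounding.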
Figure: class group variation of Nota (x-axis: class groups 1 to 5; y-axis: from 0 to 20)
Is there variation across groups? (groups are types of exam)
aggregate(Nota, list(Tipus), mean)
  Group.1        x
1       1 10.90449
2       2 10.73333
3       3 10.84239
4       4 11.04255
boxplot(Nota ~ Tipus, col=2:5, main="per tipus (1 a 4) d'examen")
dev.copy2pdf(file="/AlbertNou/COURSES/Metodes05_07_10/boxplot2.pdf")
ANOVA
aov1 = aov(Nota ~ as.factor(Tipus))
summary(aov1)
#                   Df Sum Sq Mean Sq F value Pr(>F)
# as.factor(Tipus)   3      5   1.534   0.051  0.985
# Residuals        361  10958  30.354
Figure: boxplot of Nota by type of exam, titled "per tipus (1 a 4) d'examen" (x-axis: types 1 to 4; y-axis: from 0 to 20)
On relations among categorical data
Examples of categorical variables include marital status (never married, married, divorced, or widowed) or race (black, white, Asian, Native American, and Inuit). We are interested in the study of associations between variables that may have more than two categories.
More details on relations among categorical variables are to be found here: Slides
Regression analysis
The regression model relates a single outcome to one or a number of covariates. The model embodies basic ideas about explaining variation, statistical control, and the functional form of social and behavioral relationships.
Variation of the final exam grade (Y) across different levels of the midterm exam grade (X)
Figure: scatterplot of Y (TestExamenFinal, from 0 to 20) vs X (MitjanaControls, from 40 to 100)
Simple Linear Regression
> out=lm(TestExamenFinal ~ MitjanaControls)
> summary(out)
Call:
lm(formula = TestExamenFinal ~ MitjanaControls)
Residuals:
     Min       1Q   Median       3Q      Max
-12.7665  -3.7240   0.0601   3.7760  13.2760
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       -4.415      1.984  -2.226   0.0267 *
MitjanaControls    0.179      0.023   7.782 7.49e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.087 on 363 degrees of freedom
Multiple R-squared: 0.143, Adjusted R-squared: 0.1406
F-statistic: 60.56 on 1 and 363 DF, p-value: 7.491e-14
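The quantities in this output are linked by simple arithmetic. The sketch below works only from the rounded numbers printed above (the names predict_final, pred_80 and so on are illustrative), so the results match the output up to rounding.

```r
# Rounded estimates copied from the regression output
b0 <- -4.415; b1 <- 0.179; se_b1 <- 0.023
r2 <- 0.143;  df_resid <- 363

# Fitted line: expected final exam grade for a given midterm average
predict_final <- function(x) b0 + b1 * x
pred_80 <- predict_final(80)

# t value of the slope: estimate divided by its standard error
t_slope <- b1 / se_b1

# With a single covariate, F = R^2 / (1 - R^2) * residual df, which also equals t^2
f_stat <- r2 / (1 - r2) * df_resid

print(c(pred_80 = pred_80, t_slope = t_slope, f_stat = f_stat))
```

An R-squared of 0.143 means the midterm average accounts for about 14% of the variation in the final exam grade, even though the slope is clearly different from zero.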
library(foreign)
help(read.spss)
# Read the SPSS file into a data frame; the remaining arguments keep their defaults
data = read.spss("http://www.econ.upf.edu/~satorra/dades/europeancountries.sav",
                 use.value.labels = TRUE, to.data.frame = TRUE)
attach(data)
names(data)
# Empty plot first, then axes, then country names in place of points
plot(GNPCAP, LIFE_EXPECTANCY, type = "n", axes = FALSE)
axis(1)
axis(2)
text(GNPCAP[-27], LIFE_EXPECTANCY[-27], COUNTRY[-27], col="blue", cex=0.8)
# Fitted regression line
abline(lm(LIFE_EXPECTANCY ~ GNPCAP), lty=4, col="red", lwd=3)
# Highlight observation 27 in red
text(GNPCAP[27], LIFE_EXPECTANCY[27], COUNTRY[27], col="red", cex=1.2)
dev.copy2pdf(file="/AlbertNou/COURSES/Metodes05_07_10/YvsX2.pdf")
Variation of life expectancy (Y) against GNP per capita (X)
Figure: scatterplot of LIFE_EXPECTANCY (from 70 to 78) against GNPCAP (from 5000 to 20000), with points labelled by country name and Spain highlighted
> summary(lm(LIFE_EXPECTANCY ~ GNPCAP))
Call:
lm(formula = LIFE_EXPECTANCY ~ GNPCAP)
Residuals:
    Min      1Q  Median      3Q     Max
-3.9348 -1.1209  0.3521  0.8021  4.0403
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.039e+01  6.373e-01 110.461  < 2e-16 ***
GNPCAP      3.804e-04  4.996e-05   7.615 2.14e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.711 on 29 degrees of freedom
Multiple R-squared: 0.6666, Adjusted R-squared: 0.6551
F-statistic: 57.99 on 1 and 29 DF, p-value: 2.142e-08
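A slope of 3.804e-04 is easier to read per 1000 units of GNPCAP. The sketch below rescales it and attaches a 95% confidence interval, using only the estimate, standard error and degrees of freedom printed above; the variable names are illustrative.

```r
# Slope, standard error and residual df copied from the output
b1 <- 3.804e-04; se_b1 <- 4.996e-05; df_resid <- 29

# 95% confidence interval for the slope: estimate +/- t quantile * standard error
ci_slope <- b1 + c(-1, 1) * qt(0.975, df_resid) * se_b1

# Expected change in life expectancy (years) per additional 1000 units of GNPCAP
gain_per_1000 <- 1000 * b1
ci_per_1000   <- 1000 * ci_slope

print(gain_per_1000)
print(ci_per_1000)
```

With the original data attached, confint(lm(LIFE_EXPECTANCY ~ GNPCAP)) should give the same interval on the unscaled slope, up to rounding.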
Simple Linear Regression using IBM SPSS Statistics
These are the steps to produce the chart by using syntax:
1 Open the data file (in the IBM SPSS Statistics main menu, select FILE ... OPEN ... DATA, browse to the folder where the data file resides, select the data file and click Open).
2 From the main menu, select FILE ... NEW ... SYNTAX (not FILE ... NEW ... SCRIPT; this is where the difference between "syntax" and "script" comes in...). This will open a Syntax window.
3 Copy and paste the lines below into the new Syntax window.
4 From the main menu in the Syntax window, select RUN ... ALL.
In short: with the data file open, paste the following lines into the Syntax window and run them by selecting RUN ... ALL in the main menu.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=GNPCAP LIFE_EXPECTANCY COUNTRY MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: GNPCAP=col(source(s), name("GNPCAP"))
DATA: LIFE_EXPECTANCY=col(source(s), name("LIFE_EXPECTANCY"))
DATA: COUNTRY=col(source(s), name("COUNTRY"), unit.category())
GUIDE: axis(dim(1), label("GNP per Capita (US Dollars)"))
GUIDE: axis(dim(2), label("Life expectancy"))
ELEMENT: point(position(GNPCAP*LIFE_EXPECTANCY), label(COUNTRY))
END GPL.
Property data (Dades immobles): Valor versus SuperficieConstruida
data = read.table("http://www.econ.upf.edu/~satorra/dades/inmob.txt", header = TRUE); attach(data); names(data)
[1] "TipologiaInm" "Calidad" "TamanoPoblacion" "SuperficieConstruida" "Valor" "NumHabita" "NumBanos"
[8] "NumGarajes" "Ascensor"
plot(Valor, SuperficieConstruida)
plot(SuperficieConstruida, Valor)
abline(lm(Valor ~ SuperficieConstruida), col="red", lty=3, lwd=2)
out1 = lm(Valor ~ SuperficieConstruida)
summary(out1)
Residuals:
     Min       1Q   Median       3Q      Max
-1764579  -151065   -19344   153227  1308771
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          -186807.3    14760.1  -12.66   <2e-16 ***
SuperficieConstruida   11727.7      119.6   98.06   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 338800 on 1798 degrees of freedom
Multiple R-squared: 0.8425, Adjusted R-squared: 0.8424
F-statistic: 9616 on 1 and 1798 DF, p-value: < 2.2e-16
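Once a model like this is fitted, predict() gives point predictions together with prediction intervals. The sketch below uses simulated data rather than inmob.txt: the names surface, value and sim are hypothetical, and the intercept, slope and noise level are only borrowed from the output above to make the simulation resemble it.

```r
set.seed(3)
# Simulated data (an assumption); coefficients and noise sd mimic the fitted model
n <- 200
surface <- runif(n, min = 50, max = 300)
value   <- -186807 + 11728 * surface + rnorm(n, sd = 338800)
sim <- data.frame(surface = surface, value = value)

fit <- lm(value ~ surface, data = sim)

# Point prediction and 95% prediction interval for a new unit with surface = 100
pred <- predict(fit, newdata = data.frame(surface = 100), interval = "prediction")
print(pred)
```

The prediction interval is wide because it reflects the residual standard error, not just the uncertainty in the estimated line.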
More data sets, for examples
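Multiple Regression (CPS data)
data = read.table("http://www.econ.upf.edu/~satorra/dades/CurrentPopulationSurveyData99ext.raw.txt", header = TRUE)
attach(data)  # so that hw can be referenced directly below
names(data)
[1] "female" "age" "mstat" "edyr" "hgc" "reth" "hw"
# Normal Q-Q plots of hw and of log(hw), and histogram of hw
qqnorm(hw)
qqnorm(log(hw))
hist(hw)
# Regression of log(hw) on several covariates
out1 <- lm(log(hw) ~ edyr + age + female + mstat, data = data)
summary(out1)
library(car)
avPlots(out1)  # added-variable plots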
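When the outcome is log(hw), each coefficient is a change on the log scale, and 100*(exp(b) - 1) turns it into an approximate percentage change in hw. The sketch below illustrates this on simulated data, not the CPS file: the data frame sim, the seed, and the "true" effects 0.08 and -0.20 are assumptions chosen for illustration.

```r
set.seed(2)
n <- 500
edyr   <- sample(8:18, n, replace = TRUE)   # simulated years of education
female <- rbinom(n, 1, 0.5)                 # simulated 0/1 indicator

# Assumed effects on the log scale: +0.08 per year of education, -0.20 for female
log_hw <- 1.0 + 0.08 * edyr - 0.20 * female + rnorm(n, sd = 0.4)
sim <- data.frame(hw = exp(log_hw), edyr = edyr, female = female)

fit <- lm(log(hw) ~ edyr + female, data = sim)

# Percentage change in hw associated with a one-unit change in each covariate
pct_effect <- 100 * (exp(coef(fit)[-1]) - 1)
print(round(pct_effect, 1))
```

For small coefficients the percentage effect is close to 100 times the coefficient itself; for larger ones, such as the female effect here, the exp() conversion matters.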
Other data sets: traffic fatality data, Iraq data, welfare data
### Fatality data
library(foreign)
fata = read.dta("http://www.econ.upf.edu/~satorra/dades/TraficFatality.dta")
### Iraq data (stored under its own name so that it does not overwrite fata)
iraq = read.table("http://www.econ.upf.edu/~satorra/dades/datairaq.raw", header = TRUE)
## Welfare data
dades = read.table("http://www.econ.upf.edu/~satorra/dades/Datawelfare.raw", header = TRUE)