Linear Statistical Analysis I
Fall 2013
() Linear Statistical Analysis I Fall 2013 1 / 101
Outline
Introduction to R statistical software
Introduction to regression and model building
R language
R is a popular statistical software package especially suitable for
data analysis and graphical representation.
It is free, open-source software.
R can be downloaded from http://cran.us.r-project.org/, where you
can also find useful information about R. Note that there are
different versions for different operating systems such as Windows,
Linux, and MacOS.
You can also find online tutorials by searching for the keywords
R language tutorial.
R language
Open the R package.
type commands after the prompt symbol
For example, to compute 3/(9*4-5) and the square root of 9:
> 3/(9*4-5)
[1] 0.09677419
> sqrt(9)
[1] 3
If you want to make comments on the commands, use the symbol
#. The expression after this symbol will not be executed.
> 3/(9*4-5) # This is a comment.
[1] 0.09677419
> sqrt(9) # compute the square root of 9.
[1] 3
workspace and working directory
The workspace is the place in your computer where R reads or
saves data or R objects. The directory of the workspace is the
working directory.
Find the current working directory
> getwd()
[1] "F:/Teaching/2013_spring_Computing"
change the current working directory
> setwd("F:/Teaching")
> getwd()
[1] "F:/Teaching"
or you can use the File menu in the Console window of R.
If there is an error when you input data, first check whether the
data is in the working directory.
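If a data file cannot be found, it helps to check what R can actually see; a small sketch (the file name mydata.dat is hypothetical):

```r
# Where R is currently looking for files
print(getwd())

# What files are visible in the working directory
print(list.files())

# Check whether a particular (hypothetical) file is present
file.exists("mydata.dat")
```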
arithmetic expressions and variables
arithmetic expressions: type expressions after the prompt symbol.
R will evaluate their values.
> 2+3
[1] 5
> 3-4
[1] -1
> 3*5
[1] 15
> 3/5
[1] 0.6
> 13%%5 # compute the remainder of 13 divided by 5
[1] 3
> 13%/%5 #the integer quotient of 13 divided by 5
[1] 2
> 3^2
[1] 9
arithmetic expressions and variables
> sqrt(9)
[1] 3
> log(3)
[1] 1.098612
> log2(2)
[1] 1
> log10(100)
[1] 2
> exp(3)
[1] 20.08554
> factorial(3) #compute 3!
[1] 6
> log((1+exp(5)))*3^(-2)
[1] 0.5563017
arithmetic expressions and variables
variable: the name is case sensitive, that is, upper and lower case
letters are distinct.
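A quick sketch of the case sensitivity just described: x and X are two distinct variables.

```r
x <- 1
X <- 2
x == X   # FALSE: lowercase x and uppercase X are different variables
```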
assignment operator: <-
> x <- 13+exp(2)
> x # check the value of x
[1] 20.38906
equivalently, you can use
> x=13+exp(2)
> x
[1] 20.38906
arithmetic expressions and variables
you can include variables in expressions; however, the variables
must be assigned values before the expressions are evaluated.
> x=2+3
> (2+3)*4
[1] 20
> x=2+3
> x*4
[1] 20
> y*4
Error: object 'y' not found
> y=x^2
> y
[1] 25
Relational and Logical Operators
logical values: TRUE (can be denoted by T) and FALSE (can be
denoted by F).
we can apply the arithmetic operators to logical values; in this
case, TRUE is treated as 1 and FALSE as 0.
> TRUE+TRUE
[1] 2
> TRUE+FALSE
[1] 1
> TRUE*FALSE
[1] 0
> x=TRUE
> x
[1] TRUE
> y=FALSE
> y-x
[1] -1
Relational and Logical Operators
The relational operators are <, <=, >, >=, == (equal to), and != (not equal to).
> 1>2 # logical expression with two possible values
[1] FALSE
> 1<2
[1] TRUE
> 1>=2
[1] FALSE
> (1+3)==4
[1] TRUE
> 3*2!=6
[1] FALSE
> x=(3*2!=6)
> x
[1] FALSE
Relational and Logical Operators
logical operators: can be used to connect two or more logical
expressions
> (1>2)&((1+3)==4) # "&" means "and". The combined
# expression is true only if
# both expressions are true.
[1] FALSE
> (1<2)&((1+3)==4)
[1] TRUE
> TRUE&FALSE
[1] FALSE
> (1>2)|((1+3)==4) # "|" means "or". The combined
# expression is true if and only if
# at least one of the two
# expressions is true.
[1] TRUE
Relational and Logical Operators
> (1<2)|((1+3)==5)
[1] TRUE
> (1>2)|((1+3)==5)
[1] FALSE
> TRUE|FALSE
[1] TRUE
> ((1+2)==1)
[1] FALSE
> !((1+2)==1) #"!" means "NOT".
[1] TRUE
> !TRUE
[1] FALSE
> !FALSE
[1] TRUE
basic data types in R
vector: concatenate several numbers or vectors into a vector by
using the function c()
> c(1,3,2)
[1] 1 3 2
> c(c(1,2),c(3,5),c(3,0,0))
[1] 1 2 3 5 3 0 0
> x=c(3,4,9)
> x
[1] 3 4 9
> y=c(x,x,c(0,1,2))
> y
[1] 3 4 9 3 4 9 0 1 2
basic data types in R
Special vectors:
> 1:9
[1] 1 2 3 4 5 6 7 8 9
> 5:1
[1] 5 4 3 2 1
> rep(0.5,6) # repeat 0.5 six times
[1] 0.5 0.5 0.5 0.5 0.5 0.5
> rep(-3,8)
[1] -3 -3 -3 -3 -3 -3 -3 -3
> seq(0,1, length.out=11) # get 11 numbers from
#0 to 1 with equal space.
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> seq(1,2,length.out=21)
[1] 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35
[9] 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75
[17] 1.80 1.85 1.90 1.95 2.00
basic data types in R
logical vector: consists of logical values
> x=c(T,T,F,F) #TRUE can be denoted by T
> x
[1] TRUE TRUE FALSE FALSE
> x=1:5
> x==1 # each component of x is compared with 1
[1] TRUE FALSE FALSE FALSE FALSE
> y=(x!=1)
> y
[1] FALSE TRUE TRUE TRUE TRUE
basic data types in R
extract or modify subsets of a vector by using index vector
> x=c(3,2,5,5,7,8)
> x[c(1,3,5)] # extract the 1st, 3rd, and 5th numbers in x
[1] 3 5 7
> y=x[c(3,1,5)]
> y
[1] 5 3 7
> z=x[-c(2,4)] # remove the 2nd and 4th numbers in x
> z
[1] 3 5 7 8
> x[1:3]
[1] 3 2 5
> x[1:3]=c(1,2,3)
> x
[1] 1 2 3 5 7 8
basic data types in R
find all possible values in a vector
> x=c(3,2,5,5,7,8,7,8,9)
> unique(x)
[1] 3 2 5 7 8 9
basic data types in R
> x[c(1,3,5)]=0
> x
[1] 0 2 0 5 0 8
extract or modify subsets of a vector by using logical vector
> x=c(3,2,5,5,7,8,3,6,9) # we will extract the numbers larger than 4
> index=(x>4)
> index
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
> x[index] # only values corresponding to TRUE
>          # in "index" will be selected
[1] 5 5 7 8 6 9
basic data types in R
extract or modify subsets of a vector by using logical vector
> x=1:10 # extract numbers less than or equal
>        # to 6 and larger than 3
> x[(x<=6)&(x>3)]
[1] 4 5 6
> x=c(3,2,5,5,7,8,3,6,9)
> y=c(1,3,5,7,6,4,2,3,5)
> #we extract the numbers in
> #the positions corresponding to y>4
> x[y>4]
[1] 5 5 7 9
basic data types in R
extract the indices corresponding to the numbers which satisfy
specific conditions in a vector.
> x=c(T,T,F,T,F,F,T,T,T,F)
> which(x)
[1] 1 2 4 7 8 9
> x=c(3,2,5,5,7,8,3,6,9)
> which(x>5)
[1] 5 6 8 9
> x[which(x>5)]
[1] 7 8 6 9
operations on vectors
The elementary arithmetic operators +, -, *, /, ^ (for raising to a
power), and functions such as log, exp, sin, cos, tan, and sqrt can
be applied to vectors in an element-by-element sense.
> x=1:5
> x
[1] 1 2 3 4 5
> y=10:15
> y
[1] 10 11 12 13 14 15
> y-x
[1] 9 9 9 9 9 14
Warning message:
In y - x : longer object length is not
a multiple of shorter object length
operations on vectors
> x
[1] 1 2 3 4 5
> y=11:15 # x and y should have the same length.
> y-x
[1] 10 10 10 10 10
> 2*x
[1] 2 4 6 8 10
> x^2
[1] 1 4 9 16 25
> y/x
[1] 11.000000 6.000000 4.333333 3.500000 3.000000
> x+1
[1] 2 3 4 5 6
> x*y
[1] 11 24 39 56 75
operations on vectors
get the length, minimum, maximum, mean, variance of a vector
> z=c(43, 45, 5, 44, 767, 57, 68, 33, 111)
> length(z)
[1] 9
> min(z)
[1] 5
> max(z)
[1] 767
> which(z==min(z))
[1] 3
> which(z==max(z))
[1] 5
> mean(z)
[1] 130.3333
> var(z)
[1] 57815.75
> sd(z) # standard deviation
[1] 240.4491
operations on vectors
sort the numbers in a vector in an increasing or decreasing order
> z=c(43, 45, 5, 44, 767, 57, 68, 33, 111)
> sort(z)
[1] 5 33 43 44 45 57 68 111 767
> sort(z,decreasing = T)
[1] 767 111 68 57 45 44 43 33 5
> y=sort(z,index.return = T)
> y
$x
[1] 5 33 43 44 45 57 68 111 767
$ix
[1] 3 8 1 4 2 6 7 9 5
> y$x
[1] 5 33 43 44 45 57 68 111 767
> z[y$ix]
[1] 5 33 43 44 45 57 68 111 767
matrix
we start with some simple matrices
> matrix(1.5, nrow=2, ncol=3)
[,1] [,2] [,3]
[1,] 1.5 1.5 1.5
[2,] 1.5 1.5 1.5
> diag(3.3,3)
[,1] [,2] [,3]
[1,] 3.3 0.0 0.0
[2,] 0.0 3.3 0.0
[3,] 0.0 0.0 3.3
> y=diag(-2,3,4)
> y
[,1] [,2] [,3] [,4]
[1,] -2 0 0 0
[2,] 0 -2 0 0
[3,] 0 0 -2 0
matrix
form a matrix from a vector
> x=matrix(c(1,2,3,4,5,6), 2,3)
> x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> matrix(c(1,2,3,4,5,6), 3,2)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> matrix(c(1,2,3,4,5,6), 3,2,byrow=T)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
matrix
form a matrix by combining vectors
> x=c(1,2,3)
> y=c(4,5,6)
> z=c(7,8,9)
> cbind(x,y,z)#combine three vectors as columns
x y z
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> rbind(x,y,z)#combine three vectors as rows
[,1] [,2] [,3]
x 1 2 3
y 4 5 6
z 7 8 9
matrix
form a matrix by combining matrices
> w=cbind(x,y)
> w
x y
[1,] 1 4
[2,] 2 5
[3,] 3 6
> v=cbind(z,y)
> v
z y
[1,] 7 4
[2,] 8 5
[3,] 9 6
matrix
form a matrix by combining matrices
> cbind(w,v)
x y z y
[1,] 1 4 7 4
[2,] 2 5 8 5
[3,] 3 6 9 6
> rbind(w,v)
x y
[1,] 1 4
[2,] 2 5
[3,] 3 6
[4,] 7 4
[5,] 8 5
[6,] 9 6
matrix
extract or modify subsets of a matrix
> x=matrix(1:15,3,5)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
> x[2,4] # the number in the second row and the fourth column
[1] 11
> y=x[1,]
># extract the first row
># which is converted to a vector.
> y
[1] 1 4 7 10 13
> str(y)
int [1:5] 1 4 7 10 13
matrix
> z=x[,2]
> # extract the second column
> z
[1] 4 5 6
> str(z)
int [1:3] 4 5 6
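As the str() output shows, extracting a single row or column drops the matrix structure and returns a plain vector. If you want to keep the result as a matrix, R's drop=FALSE argument (not shown on these slides) can be used:

```r
x <- matrix(1:15, 3, 5)

z  <- x[, 2]               # a plain integer vector
z2 <- x[, 2, drop = FALSE] # a 3-by-1 matrix

str(z)   # int [1:3]
str(z2)  # int [1:3, 1]
```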
> w=x[1:2,1:3]
> w
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
> v=x[c(1,3),]
> v
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 3 6 9 12 15
matrix
extract irregularly distributed subsets of a matrix by using a
two-column index matrix of (row, column) positions. (The definition
of index is lost in this copy of the slides; the matrix below is
reconstructed to match the output that follows.)
> x=matrix(1:15,3,5)
> index=rbind(c(2,2),c(2,4),c(3,5))
> x[index]=0
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 0 8 0 14
[3,] 3 6 9 12 0
> which(x==0, arr.ind = T)
row col
[1,] 2 2
[2,] 2 4
[3,] 3 5
> x[which(x==0, arr.ind = T)]
[1] 0 0 0
operations on matrices
The elementary arithmetic operators +, -, *, /, ^ (for raising to a
power), and functions such as log, exp, sin, cos, tan, and sqrt can
be applied to matrices in an element-by-element sense.
> x=matrix(1:6,2,3)
> x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> sqrt(x)
[,1] [,2] [,3]
[1,] 1.000000 1.732051 2.236068
[2,] 1.414214 2.000000 2.449490
operations on matrices
> y=matrix(11:16,2,3)
> y
[,1] [,2] [,3]
[1,] 11 13 15
[2,] 12 14 16
> x+y
[,1] [,2] [,3]
[1,] 12 16 20
[2,] 14 18 22
operations on matrices
transpose of a matrix
> y=matrix(11:16,2,3)
> y
[,1] [,2] [,3]
[1,] 11 13 15
[2,] 12 14 16
> t(y)
[,1] [,2]
[1,] 11 12
[2,] 13 14
[3,] 15 16
> y=1:3 # a vector is treated as a matrix with one column
> t(y)
[,1] [,2] [,3]
[1,] 1 2 3
matrix multiplication
> A=matrix(1:6,3,2)
> A
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> B=matrix(11:16,2,3)
> B
[,1] [,2] [,3]
[1,] 11 13 15
[2,] 12 14 16
> A%*%B # different from A*B
[,1] [,2] [,3]
[1,] 59 69 79
[2,] 82 96 110
[3,] 105 123 141
Linear equations and inversion of a matrix
For example, to solve the following equations
2x + y + 3z = 1
3x - y + z = 4
x + y + z = 2
> A=cbind(c(2,3,1),c(1,-1,1),c(3,1,1))
> A # the matrix of coefficients
[,1] [,2] [,3]
[1,] 2 1 3
[2,] 3 -1 1
[3,] 1 1 1
> a=c(1,4,2)
> solve(A,a)
[1] 2.333333 1.333333 -1.666667
> v=solve(A,a)
> A%*%v
[,1]
[1,]    1
[2,]    4
[3,]    2
Linear equations and inversion of a matrix
> A%*%v-a
[,1]
[1,] 0.000000e+00
[2,] -4.440892e-16
[3,] 4.440892e-16
> solve(A) # the inverse of A
[,1] [,2] [,3]
[1,] -0.3333333 0.3333333 0.6666667
[2,] -0.3333333 -0.1666667 1.1666667
[3,] 0.6666667 -0.1666667 -0.8333333
character vector and factor
A character vector is a vector whose components are character strings
instead of numbers. For example, we construct vectors of the names
and genders of five persons in one office.
> names=c("Emily","John", "Lily","Grace","William")
> str(names)
chr [1:5] "Emily" "John" "Lily" "Grace" "William"
> gender=c("F","M","F","F","M")
> str(gender)
chr [1:5] "F" "M" "F" "F" "M"
In the vector gender there are two categories, male and female,
labeled M and F. Sometimes we are interested in information such
as how many elements are in each category. In this case, we can
convert the character vector to a factor.
> gender=as.factor(gender)
> gender
[1] F M F F M
Levels: F M
> levels(gender)
[1] "F" "M"
Different levels denote different categories.
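Counting how many elements fall in each category, which is the motivation given above for factors, is a one-liner with table():

```r
gender <- as.factor(c("F","M","F","F","M"))

# Count the number of elements in each level
table(gender)
# gender
# F M
# 3 2
```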
data frame
The data frame is one of the most important data types in R.
It is similar to a matrix: its elements are organized in rows and
columns.
However, in a matrix all elements are numbers.
In a data frame, some elements may be numbers while others are
character strings.
When R reads data from an external data file, R saves the data as
a data frame.
It is not efficient to type the data directly into R if the data
size is large. We can read the data from a file where the data is
saved.
It is important that the file is in the working directory; otherwise
R cannot find the file, unless you tell R where the file is.
input data from a file
The data has been stored in a file called ch.1.ex.1.dat.
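The slide showing the actual read is lost in this copy; assuming ch.1.ex.1.dat is a plain whitespace-separated file with no header row (the file contents below are made up for illustration), it would presumably be read with read.table's defaults:

```r
# Create a small whitespace-separated file like ch.1.ex.1.dat
# (contents assumed for illustration)
writeLines(c("1 2 3", "2 4 7", "4 5 7"), "ch.1.ex.1.dat")

# read.table defaults: whitespace separator, no header row,
# so columns are named V1, V2, V3
data <- read.table("ch.1.ex.1.dat")
str(data)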
The data has been stored in a file called ch.1.ex.2.dat. The
numbers are separated by commas.
> data=read.table("ch.1.ex.2.dat",sep=",")
> str(data)
data.frame: 4 obs. of 3 variables:
$ V1: int 1 2 4 5
$ V2: int 2 4 5 6
$ V3: int 3 7 7 8
> data
V1 V2 V3
1 1 2 3
2 2 4 7
3 4 5 7
4 5 6 8
> data=read.table("ch.1.ex.3.dat",header=T)
> str(data)
data.frame: 4 obs. of 3 variables:
$ A: int 1 2 4 5
$ B: int 2 4 5 6
$ C: int 3 7 7 8
> data
A B C
1 1 2 3
2 2 4 7
3 4 5 7
4 5 6 8
input data from an excel file
The data has been stored in a csv excel file called ch.1.ex.4.csv.
> data=read.csv("ch.1.ex.4.csv")
> str(data)
data.frame: 4 obs. of 3 variables:
$ A: int 1 5 4 5
$ B: int 2 4 5 8
$ C: int 3 4 6 7
> data
A B C
1 1 2 3
2 5 4 4
3 4 5 6
4 5 8 7
input data from an excel file
The data has been stored in a csv excel file called ch.1.ex.5.csv.
> data=read.csv("ch.1.ex.5.csv",header=F)
> data
V1 V2 V3
1 1 2 3
2 5 4 4
3 4 5 6
4 5 8 7
data sets in R
R itself contains some data sets which can be loaded directly. You
can use
> data()
to check available data sets in R.
For example, let us consider a data set named sleep. We can
load the data set by using
> data(sleep)
> sleep # check the data
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
...........
...........
20 3.4 2 10
> str(sleep)
data.frame: 20 obs. of 3 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1
$ ID : Factor w/ 10 levels "1","2","3","4",..:
Data which show the effect of two drugs on 10 patients.
There are 20 observations and three variables:
the first is the increase in hours of sleep compared to control;
the second represents the type of drug given;
the third is the patient's ID.
extract or modify subsets of a data frame
> # two different ways to extract the first variable
> sleep$extra
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
[16] 4.4 5.5 1.6 4.6 3.4
> sleep[,1]
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
[16] 4.4 5.5 1.6 4.6 3.4
> sleep[1,1]
[1] 0.7
> sleep[1,]
extra group ID
1 0.7 1 1
> # extract the measurements for the second drug.
> sleep[sleep$group=="2",1]
[1] 1.9 0.8 1.1 0.1 -0.1 4.4
[7] 5.5 1.6 4.6 3.4
> # extract the measurements for the fifth patient
> sleep[sleep$ID=="5",1]
[1] -0.1 -0.1
> # extract the measurements for the third
> # patient when he took the second type of drug.
> sleep[(sleep$ID=="3")&(sleep$group=="2"),1]
[1] 1.1
> names(sleep) # the names of variables
[1] "extra" "group" "ID"
> names(sleep)=c("hours","Drug","Patient")
> sleep
hours Drug Patient
1 0.7 1 1
2 -1.6 1 2
If all elements of a data frame are numbers, the data frame can be
converted to a matrix.
> data=read.csv("ch.1.ex.5.csv",header=F)
> str(data)
data.frame: 4 obs. of 3 variables:
$ V1: int 1 5 4 5
$ V2: int 2 4 5 8
$ V3: int 3 4 6 7
> X=as.matrix(data)
> str(X)
int [1:4, 1:3] 1 5 4 5 2 4 5 8 3 4 ...
> X
V1 V2 V3
[1,] 1 2 3
[2,] 5 4 4
[3,] 4 5 6
[4,] 5 8 7
Conversely, a matrix can be converted to a data frame
> X=matrix(1:15,5,3)
> X
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
> str(X)
int [1:5, 1:3] 1 2 3 4 5 6 7 8 9 10 ...
> Y=data.frame(X)
> Y
X1 X2 X3
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
> str(Y)
data.frame: 5 obs. of 3 variables:
$ X1: int 1 2 3 4 5
$ X2: int 6 7 8 9 10
$ X3: int 11 12 13 14 15
control flow
In addition to typing the commands after the prompt symbol, you
can write a group of commands in a text editor such as Notepad or
MS Word, and then copy and paste these commands after the
prompt symbol in R.
For example, I write the following commands in Notepad,
control flow
Then I copy and paste them to R
> data=read.csv("ch.1.ex.5.csv",header=F)
> X=as.matrix(data)
> names(X)=c("x1","x2","x3")
> print(X)
V1 V2 V3
[1,] 1 2 3
[2,] 5 4 4
[3,] 4 5 6
[4,] 5 8 7
control flow
Control-flow constructs specify the order in which computations
are performed.
Some commonly used control-flow methods: the If-Else statement
and loops (including for and while statements).
If-Else statement. Example: compute the absolute value of a. I
write the following commands in Notepad,
control flow
Then I copy and paste them to R
> a=1
> if(a>0)
+ {absolute=a
+ print(absolute)
+ }else
+ {absolute=-a
+ print(absolute)
+ }
[1] 1
Example 2: a nested If-Else statement. (The inner branch of this
example is truncated in this copy of the slides; the outer condition
below is a plausible reconstruction.)
> x=-3
> if((x< -2)|(x>2))
+ {f=0
+ print(f)
+ }else
+ {...} # inner if-else on x, truncated in the source
[1] 0
> x=8
> if((x< -2)|(x>2))
+ {f=0
+ print(f)
+ }else
+ {...} # inner if-else on x, truncated in the source
[1] 0
the for loop
> sum=0 #set the initial values of the sum
> for (i in 1:6)
+ {sum=sum+i
+ print(c("i=",i, "sum=",sum))
+ }
[1] "i=" "1" "sum=" "1"
[1] "i=" "2" "sum=" "3"
[1] "i=" "3" "sum=" "6"
[1] "i=" "4" "sum=" "10"
[1] "i=" "5" "sum=" "15"
[1] "i=" "6" "sum=" "21"
the for loop
> sum=0 #set the initial value of the sum
> for (i in c(2,4,6,8,10))
+ {sum=sum+i
+ print(c("i=",i, "sum=",sum))
+ }
[1] "i=" "2" "sum=" "2"
[1] "i=" "4" "sum=" "6"
[1] "i=" "6" "sum=" "12"
[1] "i=" "8" "sum=" "20"
[1] "i=" "10" "sum=" "30"
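The slides list while among the loop constructs but show no example; a minimal sketch in the same style as the for loops above:

```r
# Sum the integers 1..6 with a while loop
sum <- 0
i <- 1
while (i <= 6) {
  sum <- sum + i   # accumulate
  i <- i + 1       # advance the loop variable explicitly
}
print(sum)   # 21, matching the first for-loop example
```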
Data management
The data input has been introduced. I will now introduce the topics
of data output, missing data, data manipulation, and merging,
combining, and subsetting datasets.
save an R object: for example,
save an R object
> X=matrix(1:15,3,5)
> X
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
> save(X,file="ex.RData")
> X=0
> X
[1] 0
> load("ex.RData")
> X
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
output data to external files
> data(sleep)
> str(sleep)
data.frame: 20 obs. of 3 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1
$ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2
> sleep.1=sleep[sleep$group=="1",c(1,3)]
> sleep.1
extra ID
1 0.7 1
2 -1.6 2
3 -0.2 3
...........
> write.table(sleep.1,file="sleep.1.dat")
output data to external files
both the row names and the column names are written in the file
> write.table(sleep.1,file="sleep.1.dat")
output data to external files
If we want the file to be saved as a csv file, use
> write.csv(sleep.1,file="sleep.1.csv")
missing data
It is very common in practice that there are missing values in a
data set. In the example file below, missing values are denoted by *.
missing data
> X=read.table("ch.1.ex.6.dat",na.strings = "*")
> X
V1 V2 V3 V4 V5
1 1 2 3 4 NA
2 1 3 5 7 8
3 7 NA 9 0 1
In R, all missing values are denoted by NA, which can be
identified by the following command,
> x=X[,2]
> x
[1] 2 3 NA
> is.na(x)
[1] FALSE FALSE TRUE
missing data
> is.na(X)
V1 V2 V3 V4 V5
[1,] FALSE FALSE FALSE FALSE TRUE
[2,] FALSE FALSE FALSE FALSE FALSE
[3,] FALSE TRUE FALSE FALSE FALSE
> sum(is.na(X)) # the number of missing values
[1] 2
> mean(X[,1])
[1] 3
> mean(X[,2])
[1] NA
> # calculate the mean with the missing values excluded
> mean(X[,2],na.rm=TRUE)
[1] 2.5
missing data
Example: write a function to replace the missing values in each
column of a data frame by the column mean computed with the
missing values excluded. If a whole column is missing, we remove
the column.
> missing.replace=function(X)
+ {
+ temp=NULL
+ for (i in 1:dim(X)[2])
+ {if(sum(is.na(X[,i]))==0)
+ {temp=cbind(temp,X[,i])
+ }else
+ {if(sum(!is.na(X[,i]))>0)
+ {y=X[,i]
+ y[is.na(X[,i])]=mean(y,na.rm=TRUE)
+ temp=cbind(temp,y)
+ }
+ }
+ }
+ temp=data.frame(temp)
+ temp
+ }
missing data
> X
V1 V2 V3 V4 V5
1 1 2 3 4 NA
2 1 3 5 7 8
3 7 NA 9 0 1
> missing.replace(X)
V1 y V3 V4 y.1
1 1 2.0 3 4 4.5
2 1 3.0 5 7 8.0
3 7 2.5 9 0 1.0
> Y=X
> Y[,2]=NA
> Y
V1 V2 V3 V4 V5
1 1 NA 3 4 NA
2 1 NA 5 7 8
3 7 NA 9 0 1
> missing.replace(Y)
V1 V2 V3 y
1 1 3 4 4.5
2 1 5 7 8.0
3 7 9 0 1.0
merging data
Suppose that X and Y are the data frames for the two data sets,
respectively.
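The slide defining the two data frames is lost in this copy; the data frames below are a reconstruction consistent with the merge output that follows (all values are read off that output):

```r
# Reconstructed person-year data, keyed by id
X <- data.frame(id     = c(1,1,1, 2,2,2, 3,3,3),
                year   = c(81,80,82, 82,80,81, 82,80,81),
                female = c(0,0,0, 1,1,1, 0,0,0),
                inc    = c(5500,5000,6000, 3300,2000,2200, 1000,3000,2000))

# Reconstructed per-id summary data; id 4 has no match in X
Y <- data.frame(id     = c(1,2,4),
                maxval = c(1800,2400,1900))
```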
> merge(X,Y,by="id")
id year female inc maxval
1 1 81 0 5500 1800
2 1 80 0 5000 1800
3 1 82 0 6000 1800
4 2 82 1 3300 2400
5 2 80 1 2000 2400
6 2 81 1 2200 2400
> merge(X,Y,by="id",all=T)
id year female inc maxval
1 1 81 0 5500 1800
2 1 80 0 5000 1800
3 1 82 0 6000 1800
4 2 82 1 3300 2400
5 2 80 1 2000 2400
6 2 81 1 2200 2400
7 3 82 0 1000 NA
8 3 80 0 3000 NA
9 3 81 0 2000 NA
10 4 NA NA NA 1900
graphical exploration of data
Example: This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
> X=read.table("cell.1.dat")
> str(X)
data.frame: 1460 obs. of 7 variables:
$ V1: int 0 0 0 0 0 1 1 1 1 1 ...
$ V2: int 100 100 100 100 100 100 100 100 100 100 ...$ V3: int 0 0 0 0 0 0 0 0 0 0 ...
$ V4: int 0 0 0 0 0 0 0 0 0 0 ...
$ V5: int 0 0 0 0 0 0 0 0 0 0 ...
$ V6: int 0 0 0 0 0 10 6 12 16 8 ...
$ V7: int 1600 1600 1604 1590 1581 1568 1569 1577 1570 1577 ...
> Y=read.table("cell.2.dat")
> Z=read.table("cell.3.dat")
scatter plot
> plot(iris$Sepal.Length,iris$Petal.Length)
[scatter plot of iris$Sepal.Length (x axis) against iris$Petal.Length (y axis)]
scatter plot
If we want to change the labels of x axis, y axis and the title of the plot.
> plot(iris$Sepal.Length,iris$Petal.Length,xlab="Sepal Length"
+ ,ylab="Petal Length",main="The Scatter plot")
[the same scatter plot with title "The Scatter plot", x label "Sepal Length", and y label "Petal Length"]
Matrix of scatterplots
If we want to draw the scatterplots of all pairs of the variables
> pairs(iris)
[matrix of pairwise scatterplots of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species]
boxplot
> boxplot(iris$Sepal.Length[iris$Species=="setosa"],iris$Sepal.Length[iris$Species=="versicolor"],
+ iris$Sepal.Length[iris$Species=="virginica"],names=c("setosa", "versicolor", "virginica"),
+ main="Sepal Length of three species")
[Side-by-side boxplots of Sepal.Length for setosa, versicolor, and virginica, titled "Sepal Length of three species"]
Regression
Regression is the study of the relationships, or dependence, between two sets of variables, based on data or observations made on these variables.
Based on the relationships, we can make predictions about one set of variables from the values of the other set.
Example 1: Predict the price of a stock in 6 months from now, on
the basis of company performance measures and economic data.
Example 2: Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person's blood.
Example 3: Examine the correlation between the level of prostate-specific antigen and a number of clinical measures such as cancer volume, prostate weight, age, and so on.
variables
Variables are quantitative or qualitative measurements of
characteristics of objects under our study. Typically, the values of
variables vary for different objects.
They are considered as random variables with some probability
distributions.
We will use uppercase letters, such as X and Y, to denote variables and lowercase letters, x and y, to denote their particular values.
variables
In this course, for each regression model, the variables will be classified into two categories:
independent variables, also called inputs, features, predictors, regressors, or explanatory variables; typically denoted by X1, X2, ...
dependent variables, also called outputs, responses, or outcome variables; typically denoted by Y1, Y2, ... In this course, we only consider one response.
The partition into dependent and independent variables is obvious in some data sets but may vary according to the purpose of the study in other data.
For example, suppose that we collected the temperature and precipitation data of a city over a period. If we want to construct a model to predict the temperature from the precipitation, then the response is temperature and the predictor is precipitation; however, if precipitation has to be predicted from temperature, then the response is precipitation and the predictor is temperature.
variables
Example 1: Y = price of the stock in 6 months from now; X1 = company performance measures and X2 = economic data.
Example 2: Y = the amount of glucose in the blood of a diabetic person; X = the infrared absorption spectrum of that person's blood.
Example 3: Y = level of prostate-specific antigen; X1 = cancer volume, X2 = prostate weight, X3 = age, and so on.
regression models
We want to construct a model from which, given any values of the predictors X1, X2, ..., we can give a good guess of the value of Y such that the difference between the guess and the true value is as small as possible. The guess is called the predicted value, denoted by Ŷ.
deterministic system: given any values of X1, X2, ..., the value of Y is deterministic. That is, for any two observations of X1, X2, ..., say x_i^(1) and x_i^(2), i = 1, 2, ..., with true values of Y equal to y^(1) and y^(2), if x_i^(1) = x_i^(2) for all i = 1, 2, ..., then y^(1) = y^(2).
Mathematically, Y can be considered as a function of X1, X2, ..., that is, Y = f(X1, X2, ...).
regression models
No randomness is involved in a deterministic system.
However, in practice, there are many factors affecting the responses, and we cannot find and measure all of those factors.
In statistics, we do not consider deterministic models. Instead, we will consider models having randomness.
These models are more flexible and more appropriate.
regression models
Specifically, consider
Specifically, consider

Y = f(X1, X2, ...) + ε,

where f is an (unknown) function of the variables X1, X2, ..., which are interesting and easy to observe or measure.
The term ε is a random variable which is used to account for the randomness due to the unobserved factors or variables, noises, or errors.
A standard assumption is that ε is independent of X1, X2, ... and its expectation is 0.
Then we have E(Y | X1, X2, ...) = f(X1, X2, ...). More assumptions will be imposed in the following classes.
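To make the model Y = f(X1, X2, ...) + ε concrete, here is a small simulated sketch in R (a hypothetical example; the function f and the error distribution are chosen for illustration only):

> set.seed(1)                     # for reproducibility
> x = runif(50, 0, 10)            # values of a single predictor X
> eps = rnorm(50, mean=0, sd=1)   # random error with expectation 0, independent of X
> y = 2 + 3*x + eps               # here f(x) = 2 + 3x
> plot(x, y)                      # the points scatter around the line f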
linear regression models
The function f(X1, X2, ...) is typically unknown and not a function with a simple form.
In this course, we will assume the function is linear, that is, if we have p explanatory variables, then

f(X1, X2, ..., Xp) = β0 + β1 X1 + β2 X2 + ... + βp Xp,

where β0, β1, ..., βp are coefficients which are typically unknown parameters; their estimation based on data is a main topic of the course.
Why do we assume f is a linear function?
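For instance (a hypothetical sketch; the lm() function used here is covered in detail later in the course), the unknown coefficients β0, β1, ..., βp can be estimated from simulated data as follows:

> set.seed(1)
> x1 = rnorm(100); x2 = rnorm(100)     # two explanatory variables
> y = 1 + 2*x1 - 0.5*x2 + rnorm(100)   # true coefficients: beta0=1, beta1=2, beta2=-0.5
> fit = lm(y ~ x1 + x2)                # fit the linear regression model
> coef(fit)                            # estimated coefficients, close to the true values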
linear regression models
The reasons we assume f is a linear function:
Since f can be a very general function, it is hopeless to find the exact form of f based on a limited sample.
Hence, an approximation to f is desirable. Actually, a linear function is the first-order approximation to any smooth function in a range of X1, X2, ... which is not too large.
Although there are better approximations, computation is a very important consideration, especially in the precomputer age of statistics. It is much easier to perform analysis with linear models than with other models.
linear regression models
Even in today's computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output.
For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio, or sparse data.
Finally, linear methods can be applied to transformations of the inputs, and this considerably expands their scope.
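As a hypothetical sketch of the last point, a model that is linear in the coefficients can still capture a curved relationship by transforming an input:

> set.seed(1)
> x = runif(100, 0, 5)
> y = 1 + x^2 + rnorm(100)    # a quadratic relationship plus noise
> fit = lm(y ~ x + I(x^2))    # linear in the coefficients, not in x
> coef(fit)                   # estimates of the intercept and the two slopes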