STAT 6601
Classification: Neural Networks (V&R 12.2)
By Gary Gongwer, Madhu Iyer, Mable Kong
Classification
Classification is a multivariate technique concerned with assigning data cases (i.e. observations) to one of a fixed number of possible classes (represented by nominal output variables).
The goal of classification is to sort observations into two or more labeled classes.
The emphasis is on deriving a rule that can be used to optimally assign new objects to the labeled classes.
In short, the aim of classification is to assign input cases to one of a number of classes.
Simple Pattern Classification Example
Consider a simple problem of distinguishing handwritten versions of the characters 'a' and 'b'.
We seek an algorithm which can distinguish as reliably as possible between the two characters.
The goal in this classification problem is therefore to develop an algorithm which will assign any image, represented by a vector x, to one of two classes, denoted Ck (k = 1, 2), so that class C1 corresponds to the character 'a' and class C2 corresponds to 'b'.
Example
A large number of input variables can present severe problems for pattern recognition systems. One technique to alleviate such problems is to combine input variables together to make a smaller number of new variables called features.
In the present example we could evaluate the ratio of the height of the character to its width (x1), and we might expect that characters from class C2 (corresponding to 'b') will typically have larger values of x1 than characters from class C1 (corresponding to 'a').
How can we make the best use of x1 to classify a new image so as to minimize the number of misclassifications?
One approach would be to build a classifier system which uses a threshold for the value of x1: classify as C2 any image for which x1 exceeds the threshold, and classify all other images as C1.
The number of misclassifications will be minimized if we choose the threshold to be at the point where the two class histograms cross.
This classification procedure is based on the evaluation of x1 followed by its comparison with a threshold.
Problem with this procedure: there is still significant overlap of the histograms, and many of the new characters we test will be misclassified.
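The thresholding rule above can be sketched in a few lines of R; the feature values and threshold here are illustrative, not taken from real character data.

```r
# Minimal sketch of the threshold classifier described above:
# any image whose feature x1 exceeds the threshold is assigned to C2,
# all others to C1. The values and threshold are illustrative.
classify <- function(x1, threshold) ifelse(x1 > threshold, "C2", "C1")
classify(c(0.8, 1.4, 2.1), threshold = 1.2)   # "C1" "C2" "C2"
```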
Now consider another feature, x2. We try to classify new images on the basis of the values of x1 and x2.
We see examples of patterns from the two classes plotted in the (x1, x2) space. It is possible to draw a line in this space, known as the decision boundary, which gives good separation of the two classes.
New patterns which lie above the decision boundary are classified as belonging to C1, while patterns falling below the decision boundary are classified as C2.
We could continue to consider a larger number of independent features in the hope of improving the performance.
Instead we could aim to build a classifier which has the smallest probability of making a mistake.
Classification Theory
In the terminology of pattern recognition, the given examples together with their classifications are known as the training set, and future cases form the test set.
Our primary measure of success is the error (or misclassification) rate.
The confusion matrix gives the number of cases with true class i classified as class j.
Assign a cost Lij to allocating a case of class i to class j. We are then interested in the average error cost rather than the error rate.
Average Error Cost
The average error cost is minimized by the Bayes rule, which is to allocate to the class c minimizing
    Σi Lic p(i|x)
where p(i|x) is the posterior distribution of the classes after observing x.
If the costs of all errors are the same, this rule amounts to choosing the class c with the largest posterior probability p(c|x).
The minimum average cost is known as the Bayes risk.
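As a small illustration, the Bayes rule can be computed directly once posterior probabilities and a cost matrix are in hand; the posteriors and costs below are made-up values, not from the text.

```r
# Bayes rule with unequal costs: allocate x to the class c that
# minimizes the expected cost sum_i L[i, c] * p(i | x).
# Posterior probabilities and the cost matrix are illustrative.
post <- c(a = 0.5, b = 0.3, c = 0.2)               # p(i | x)
L <- matrix(c(0, 1, 5,
              1, 0, 5,
              1, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(true = c("a", "b", "c"),
                            allocated = c("a", "b", "c")))
expected.cost <- colSums(L * post)                 # sum over true classes i
names(which.min(expected.cost))                    # Bayes-rule allocation: "a"
```

With equal error costs the same computation reduces to picking the class with the largest posterior p(c|x).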
Classification and Regression
We can represent the outcome of the classification in terms of a variable y which takes the value 1 if the image is classified as C1, and the value 0 if it is classified as C2.
    yk = yk(x; w)
where w denotes the vector of parameters, often called weights.
The importance of neural networks in this context is that they offer a very powerful and very general framework for representing non-linear mappings from several input variables to several output variables, where the form of the mapping is governed by a number of adjustable parameters.
Objective: Simulate the Behavior of a Human Nerve
Inputs are accumulated by a weighted sum.
This sum is the input for the output function φ.
A single neuron is not very flexible.
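A single neuron of this kind can be sketched in R; the weights, bias, and inputs below are arbitrary illustrative values, and φ is taken to be the logistic function.

```r
# One artificial neuron: a weighted sum of the inputs is passed
# through the output function phi (here the logistic sigmoid).
phi <- function(z) 1 / (1 + exp(-z))
neuron <- function(x, w, b) phi(sum(w * x) + b)
neuron(x = c(1, 0.5), w = c(2, -1), b = -0.5)   # phi(1) = 0.731...
```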
The input layer contains the value of each variable.
The hidden layer allows approximations by combining multiple logistic functions.
The output neuron with the highest probability determines the class.
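The layered structure just described can be sketched as a forward pass in R; all weights here are illustrative, not fitted to any data.

```r
# Forward pass through a network with one hidden layer of logistic
# units; the output neuron with the highest value determines the class.
# Weight matrices are illustrative, not trained.
phi <- function(z) 1 / (1 + exp(-z))
x  <- c(1.5, 0.2)                            # input layer: one value per variable
W1 <- matrix(c(0.5, -1, 1, 0.3), 2, 2)       # input -> hidden weights
W2 <- matrix(c(1, -1, -0.5, 0.5, 0.2, 0.1),
             2, 3)                           # hidden -> output weights
h  <- phi(as.vector(x %*% W1))               # hidden layer activations
o  <- as.vector(h %*% W2)                    # output scores, one per class
c("a", "b", "c")[which.max(o)]               # predicted class
```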
Regression = Learning
The weights are adjusted iteratively (batch or on-line).
Initially, they are random and small.
Weight decay (λ) keeps weights from becoming too large.
Backpropagation
Adjusts weights "back to front".
Uses partial derivatives and the chain rule to compute the gradient ∂E/∂wij of the error E with respect to each weight wij.
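One backpropagation-style update can be sketched for a single logistic neuron with squared error E = (y − t)²/2; the inputs, weights, target, and learning rate below are illustrative values.

```r
# One gradient step for a single logistic neuron, derived by the
# chain rule on E = (y - t)^2 / 2. All values are illustrative.
phi <- function(z) 1 / (1 + exp(-z))
x <- c(1, 0.5); w <- c(0.2, -0.3); t <- 1; eta <- 0.5
y     <- phi(sum(w * x))        # forward pass
dE.dy <- y - t                  # dE/dy
dy.dz <- y * (1 - y)            # derivative of the logistic function
grad  <- dE.dy * dy.dz * x      # chain rule: dE/dw for each weight
w.new <- w - eta * grad         # gradient-descent update
```

Here the target exceeds the output, so the gradient is negative and both weights are nudged upward.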
Avoiding Local Maxima
Make the initial weights random.
Use multiple runs and take the average.
An Example: Cushing’s Syndrome
Cushing's syndrome is a hypertensive disorder associated with over-secretion of cortisol by the adrenal gland.
Three recognized types of the syndrome:
a: adenoma
b: bilateral hyperplasia
c: carcinoma
u: unknown type
The observations are urinary excretion rates (mg/24hr) of the steroid metabolites tetrahydrocortisone (T) and pregnanetriol (P), and are considered on the log scale.
Cushing’s Syndrome Data
    Tetrahydrocortisone Pregnanetriol Type
a1                  3.1         11.70    a
a2                  3.0          1.30    a
a3                  1.9          0.10    a
a4                  3.8          0.04    a
a5                  4.1          1.10    a
a6                  1.9          0.40    a
b1                  8.3          1.00    b
b2                  3.8          0.20    b
b3                  3.9          0.60    b
b4                  7.8          1.20    b
b5                  9.1          0.60    b
b6                 15.4          3.60    b
b7                  7.7          1.60    b
b8                  6.5          0.40    b
b9                  5.7          0.40    b
b10                13.6          1.60    b
c1                 10.2          6.40    c
c2                  9.2          7.90    c
c3                  9.6          3.10    c
c4                 53.8          2.50    c
c5                 15.8          7.60    c
u1                  5.1          0.40    u
u2                 12.9          5.00    u
u3                 13.0          0.80    u
u4                  2.6          0.10    u
u5                 30.0          0.10    u
u6                 20.5          0.80    u
R Code
library(MASS); library(class); library(nnet)
cush <- log(as.matrix(Cushings[, -3]))[1:21, ]
tpi <- class.ind(Cushings$Type[1:21, drop = T])
xp <- seq(0.6, 4.0, length = 100); np <- length(xp)
yp <- seq(-3.25, 2.45, length = 100)
cushT <- expand.grid(Tetrahydrocortisone = xp,
                     Pregnanetriol = yp)

pltnn <- function(main, ...) {
  plot(Cushings[, 1], Cushings[, 2], log = "xy", type = "n",
       xlab = "Tetrahydrocortisone", ylab = "Pregnanetriol",
       main = main, ...)
  for(il in 1:4) {
    set <- Cushings$Type == levels(Cushings$Type)[il]
    text(Cushings[set, 1], Cushings[set, 2],
         as.character(Cushings$Type[set]), col = 2 + il)
  }
}
# pltnn plots T and P against each other by type (a, b, c, u)
> cush <- log(as.matrix(Cushings[, -3]))[1:21, ]
> cush
    Tetrahydrocortisone Pregnanetriol
a1            1.1314021    2.45958884
a2            1.0986123    0.26236426
a3            0.6418539   -2.30258509
a4            1.3350011   -3.21887582
a5            1.4109870    0.09531018
a6            0.6418539   -0.91629073
b1            2.1162555    0.00000000
b2            1.3350011   -1.60943791
b3            1.3609766   -0.51082562
b4            2.0541237    0.18232156
b5            2.2082744   -0.51082562
b6            2.7343675    1.28093385
b7            2.0412203    0.47000363
b8            1.8718022   -0.91629073
b9            1.7404662   -0.91629073
b10           2.6100698    0.47000363
c1            2.3223877    1.85629799
c2            2.2192035    2.06686276
c3            2.2617631    1.13140211
c4            3.9852735    0.91629073
c5            2.7600099    2.02814825
> tpi <- class.ind(Cushings$Type[1:21, drop = T])
> tpi
      a b c
 [1,] 1 0 0
 [2,] 1 0 0
 [3,] 1 0 0
 [4,] 1 0 0
 [5,] 1 0 0
 [6,] 1 0 0
 [7,] 0 1 0
 [8,] 0 1 0
 [9,] 0 1 0
[10,] 0 1 0
[11,] 0 1 0
[12,] 0 1 0
[13,] 0 1 0
[14,] 0 1 0
[15,] 0 1 0
[16,] 0 1 0
[17,] 0 0 1
[18,] 0 0 1
[19,] 0 0 1
[20,] 0 0 1
[21,] 0 0 1
plt.bndry <- function(size = 0, decay = 0, ...) {
  cush.nn <- nnet(cush, tpi, skip = T, softmax = T,
                  size = size, decay = decay, maxit = 1000)
  invisible(b1(predict(cush.nn, cushT), ...))
}

cush – matrix of x values of the examples.
tpi – target values of the examples (class indicators).
skip – switch to add skip-layer connections from input to output.
softmax – switch for softmax (log-linear model) and maximum conditional likelihood fitting.
size – number of units in the hidden layer.
decay – parameter for weight decay.
maxit – maximum number of iterations.
invisible – return a (temporarily) invisible copy of an object.
predict – generic function for predictions from the results of various model fitting functions. The function invokes particular _methods_ which depend on the 'class' of the first argument. Here: using cush.nn to predict cushT.

b1 <- function(Z, ...) {
  zp <- Z[, 3] - pmax(Z[, 2], Z[, 1])
  contour(exp(xp), exp(yp), matrix(zp, np),
          add = T, levels = 0, labex = 0, ...)
  zp <- Z[, 1] - pmax(Z[, 3], Z[, 2])
  contour(exp(xp), exp(yp), matrix(zp, np),
          add = T, levels = 0, labex = 0, ...)
}
par(mfrow = c(2, 2))

pltnn("Size = 2")
set.seed(1); plt.bndry(size = 2, col = 2)
set.seed(3); plt.bndry(size = 2, col = 3)
plt.bndry(size = 2, col = 4)

pltnn("Size = 2, lambda = 0.001")
set.seed(1); plt.bndry(size = 2, decay = 0.001, col = 2)
set.seed(2); plt.bndry(size = 2, decay = 0.001, col = 4)

pltnn("Size = 2, lambda = 0.01")
set.seed(1); plt.bndry(size = 2, decay = 0.01, col = 2)
set.seed(2); plt.bndry(size = 2, decay = 0.01, col = 4)

pltnn("Size = 5, 20 lambda = 0.01")
set.seed(2); plt.bndry(size = 5, decay = 0.01, col = 1)
set.seed(2); plt.bndry(size = 20, decay = 0.01, col = 2)
# functions pltnn and b1 are in the scripts
pltnn("Many local maxima")
Z <- matrix(0, nrow(cushT), ncol(tpi))
for(iter in 1:20) {
  set.seed(iter)
  cush.nn <- nnet(cush, tpi, skip = T, softmax = T, size = 3,
                  decay = 0.01, maxit = 1000, trace = F)
  Z <- Z + predict(cush.nn, cushT)
  cat("final value", format(round(cush.nn$value, 3)), "\n")
  b1(predict(cush.nn, cushT), col = 2, lwd = 0.5)
}
pltnn("Averaged")
b1(Z, lwd = 3)
References
Bishop, C.M. (1995) Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.