
Chemometrics and Intelligent Laboratory Systems 33 (1996) 35-46

Artificial neural networks in classification of NIR spectral data: Design of the training set

W. Wu a, B. Walczak a,1, D.L. Massart a,*, S. Heuerding b, F. Erni b, I.R. Last c, K.A. Prebble c

a ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussel, Belgium
b Sandoz Pharma AG, Analytical Research and Development, CH-4002 Basle, Switzerland
c Analytical Department Laboratories, The Wellcome Foundation Ltd, Dartford, Kent DA1 5AH, UK

* Corresponding author.
1 On leave from Silesian University, Katowice, Poland.

Received 24 May 1995; accepted 18 September 1995

Abstract

Artificial neural networks (NNs) with back-error propagation were used for classification with NIR spectra and applied to the classification of drugs of different strengths. Four training set selection methods were compared by applying each of them to three different data sets. The NN architecture was selected through a pruning method, and batching operation, adaptive learning rate and momentum were used to train the NN. The presented results demonstrate that selection methods based on the Kennard-Stone and D-optimal designs are better than those based on the Kohonen self-organised mapping and on random selection, and allow 100% correct classification for both recognition and prediction. The Kennard-Stone design is more practical than the D-optimal design. The Kohonen self-organised mapping method is better than the random selection method.

Keywords: Drug analysis; Neural network; NIR; Pattern recognition

1. Introduction

One observes an increasing interest in the application of neural networks (NNs) in chemical calibration and pattern recognition problems [1-13]. Although NNs do not require any assumptions about data distribution, they can be successfully applied only to sufficiently large and representative data sets.

The term sufficiently large is relative. The important factor is the ratio of the number of samples to the number of weights considered in the net architecture. Widrow [14] suggests as a rule of thumb that the training set size should be about 10 times the number of weights in a network. According to other authors [15], the maximum number of nodes in the hidden layer should be of the order g(m + 1), where m and g denote the number of input and output units, respectively. Although these suggestions differ to some extent, all NN users agree that the higher the ratio of the number of samples to the number of weights, the better the generalization ability of the NN.



For the given number of samples this ratio can be maximized by minimising the net architecture (reducing the input data, pruning redundant weights, etc.).

The second requirement, data representativity, means that the samples in the data set should be (evenly) spread over the expected range of data variability. In some cases it may be possible to generate such samples as the training set using experimental design techniques. However, in most cases, such as in the analysis of food samples, one does not have this possibility [16,17]. Usually, one needs to select the training (model) samples from a larger set of samples. The other samples can then be used to test the net. However, using all samples to train the net may lead to overfitting and to large prediction errors for the test set. To avoid this the net training must be monitored. This means that, apart from the training set, one needs two other data sets, the monitoring and test sets. In industrial practice, the sample size is not very large. One can use the same data set to monitor training and later evaluate the NN. Hence, at least two data sets (the training and test sets) are required. The principles for the design of these two sets are the same as the principles of design of any model set. Our study aims to evaluate different strategies of training set design, namely random selection, the Kohonen self-organising map approach [18], and two new approaches proposed by us, namely the Kennard and Stone design [19] and the D-optimal design [20,21].

    2. Theory

2.1. Notation

m    number of input variables for the NN
g    number of classes (i.e. number of output variables of the NN)
n    number of objects in the data set
l    number of objects in the training set
z    number of variables in the training set
X    matrix of the training set (n × m)
N    number of objects in the training set or in the test set
Y    target matrix (N × g)
out  output matrix of the NN (N × g)

    2.2. Design of training set

2.2.1. Random selection

There are several ways of selecting the training set. The simplest one is random selection, which means that no explicit selection criterion is applied. There is a risk that objects of some class are not selected into the training set at all. In order to avoid this risk, we select 3/4 of the objects separately from each class and put them together as the training set. If 3/4 of the number of objects is not an integer, the number is rounded down to the nearest integer.
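As an illustration only (not the authors' code), the per-class random selection with downward rounding might be sketched in Python as follows; `class_labels` is an assumed integer array of class memberships:

```python
import numpy as np

def random_training_selection(class_labels, fraction=0.75, rng=None):
    """Select floor(fraction * n_c) objects at random from each class c."""
    rng = np.random.default_rng() if rng is None else rng
    train_idx = []
    for c in np.unique(class_labels):
        members = np.flatnonzero(class_labels == c)
        n_train = int(np.floor(fraction * members.size))   # round down
        train_idx.extend(rng.choice(members, size=n_train, replace=False))
    return np.sort(np.array(train_idx))

# the remaining indices form the test set
```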

2.2.2. Kohonen self-organising maps [18,6]

Another possible procedure for selecting the training set is to apply clustering techniques. The Kohonen network can be applied as such [6]. Zupan et al. compared three kinds of methods in an example on the reactivity of chemical bonds, and found that the Kohonen self-organising map performed best [6]. The main goal of the Kohonen neural network is to map objects from m-dimensional into two-dimensional space. When objects have similar properties in the original space, they map to the same node. In this study, a (3 × 3) Kohonen network is chosen, containing 9 nodes. The learning rate is 0.1 at the beginning and is decreased linearly so that it reaches 0 at the last training cycle. The neighbourhood size is also decreased linearly, but reaches a minimum of 1 after one-quarter of the training cycles and remains 1 for the rest of the training. The network is stabilised after each pattern has been presented to the network about 500 times. 3/4 of the objects are randomly selected from the objects which map to the same node. If 3/4 of the number of objects is not an integer, we round it up to the nearest integer, since otherwise some nodes would have no objects after rounding. This procedure is applied to each class separately. All selected objects are put together as the training set.
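A minimal sketch of this selection strategy is given below, assuming the map is trained separately on each class (one reading of the text) and using a plain NumPy implementation of a (3 × 3) Kohonen map rather than any particular SOM library; the linear decay schedules follow the description above only approximately:

```python
import numpy as np

def kohonen_node_assignment(X, grid=(3, 3), cycles=500, lr0=0.1, rng=None):
    """Train a small Kohonen map on X and return the winning node of each object."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = X[rng.choice(len(X), rows * cols, replace=True)].astype(float)
    r0 = float(max(rows, cols) - 1)                 # initial neighbourhood radius
    total = cycles * len(X)
    step = 0
    for _ in range(cycles):
        for x in X[rng.permutation(len(X))]:
            lr = lr0 * (1.0 - step / total)         # linear decay towards 0
            radius = max(1.0, r0 - (r0 - 1.0) * min(1.0, 4.0 * step / total))
            winner = np.argmin(((W - x) ** 2).sum(axis=1))
            near = np.abs(coords - coords[winner]).max(axis=1) <= radius
            W[near] += lr * (x - W[near])
            step += 1
    return np.array([np.argmin(((W - x) ** 2).sum(axis=1)) for x in X])

def kohonen_training_selection(X, class_labels, fraction=0.75, rng=None):
    """3/4 of the objects of each node (rounded up), per class, go to the training set."""
    rng = np.random.default_rng() if rng is None else rng
    train_idx = []
    for c in np.unique(class_labels):
        members = np.flatnonzero(class_labels == c)
        nodes = kohonen_node_assignment(X[members], rng=rng)
        for node in np.unique(nodes):
            in_node = members[nodes == node]
            n_train = int(np.ceil(fraction * in_node.size))   # round up
            train_idx.extend(rng.choice(in_node, size=n_train, replace=False))
    return np.sort(np.array(train_idx))
```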

2.2.3. Kennard-Stone design [19,25,29]

The Kennard-Stone technique was originally used to produce a design when no standard experimental design can be applied. With this technique, all objects are considered as candidates for the training set. The design objects are chosen sequentially.


At each stage, the aim is to select the objects so that they are uniformly spaced over the object space. The first two objects are selected by choosing the two objects that are farthest apart. The third object selected is the one farthest from the first two objects, etc. Let d_ij denote the squared Euclidean distance from the ith object to the jth object. Suppose k objects have already been selected, where k [...]
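The rest of the description is not present in this transcript; the sketch below therefore assumes the standard Kennard-Stone rule, in which the next object chosen is the candidate whose smallest squared distance to the already selected objects is largest:

```python
import numpy as np

def kennard_stone(X, n_select):
    """Kennard-Stone selection based on squared Euclidean distances d_ij."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)      # the two farthest objects
    selected = [int(i), int(j)]
    candidates = set(range(len(X))) - set(selected)
    while len(selected) < n_select and candidates:
        remaining = sorted(candidates)
        # for each candidate, its distance to the nearest already selected object
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        chosen = remaining[int(np.argmax(min_d))]
        selected.append(chosen)
        candidates.remove(chosen)
    return np.array(selected)
```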

[...] and three kinds of placebo. Twenty different tablets of each dosage form (active and placebo) are measured four times through a glass plate on which the tablets are positioned. The average spectra of the four measurements were collected.

Data set 2 contains 160 NIR spectra (10001-4000 cm-1; 779 wavelengths) of capsules containing drugs


of different dosages (0.1, 0.25, 0.5, 1.0 and 2.5 mg) and three kinds of placebo. Twenty different capsules of each dosage form (active and placebo) are measured four times through the glass plate.

Data set 3 contains 135 NIR spectra (1100-2500 nm; 700 wavelengths) of tablets containing different dosages (20, 50, 100 and 200 mg) of the experimental active ingredient, a placebo and a clinical comparator. There are respectively 15, 17, 15 and 21 spectra in the classes of the different dosages, 47 spectra in the class of placebo and 20 spectra in the class of comparator. Spectra are measured through the blister package, which contributes to the spectrum at around 1700 nm.

3.1. Data preprocessing

The pre-processing step consists of trimming, data transformation, and training set selection. The first and last 15 wavelengths were trimmed from each of the spectra to remove edge effects. In this study, the standard normal variate (SNV) transformation [24] was applied to reduce the effects of scatter, particle size, etc. After transformation, each data set was divided into the training set and test set by the techniques described. The data of the training set were subjected to a principal component analysis (PCA). The first 10 principal components were taken into consideration. The scores of the objects from the test set in the PC space were calculated using the loadings obtained from the training set.
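A compact sketch of this preprocessing chain (trimming, SNV, PCA on the training set, projection of the test set with the training loadings) could look as follows; the SVD-based PCA is only a plausible reading of the text, not the authors' code:

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def preprocess(train_raw, test_raw, trim=15, n_pcs=10):
    """Trim edge wavelengths, apply SNV and compute PC scores for both sets."""
    X_train = snv(train_raw[:, trim:-trim])
    X_test = snv(test_raw[:, trim:-trim])
    centre = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - centre, full_matrices=False)
    loadings = Vt[:n_pcs].T                      # (wavelengths x n_pcs)
    T_train = (X_train - centre) @ loadings      # training scores
    T_test = (X_test - centre) @ loadings        # test scores via training loadings
    return T_train, T_test
```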

4. Neural network parameters

The multilayer feedforward network trained with the backpropagation learning algorithm was applied [22]. The goal of net training is to minimize the root mean square error (RMS):

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{g}\left(y_{ij}-\mathrm{out}_{ij}\right)^{2}}{Ng}}$$

where y_ij is the element of the target matrix Y (N × g) for the data considered (training set or test set), and out_ij is the element of the output matrix out (N × g) of the NN.
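In code, the same error measure is simply (a sketch, with `Y` and `out` as N × g NumPy arrays):

```python
import numpy as np

def rms(Y, out):
    """Root mean square error between the target matrix Y and the NN output out."""
    N, g = Y.shape
    return float(np.sqrt(((Y - out) ** 2).sum() / (N * g)))
```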

To make backpropagation faster the following three techniques were used: batching operation, adaptive learning rate and momentum [23]. In batching operation, one applies multiple input vectors simultaneously, instead of one input vector at a time, and obtains the network's response to each of them. Adding an adaptive learning rate can also decrease training time. This procedure increases the training speed, but only to the extent that the net can learn without large error increases. At each iteration new weights and biases are calculated using the current learning rate. The new output of the net and the error term are then calculated. If the new error exceeds the old error by more than a predefined ratio (typically 1.04), the new weights, biases, output and error are discarded, and in addition the learning rate is decreased (typically multiplied by 0.7). If the new error is less than the old error, the learning rate is increased (typically multiplied by 1.05). Otherwise the new weights, etc., are kept [23]. Momentum decreases backpropagation's sensitivity to small details in the error surface and helps the net to avoid getting stuck in shallow minima.
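The adaptive learning rate rule can be summarised in a short Python sketch; `weights` and `grad` are assumed NumPy arrays, and this is a generic restatement of the rule described above (momentum omitted for brevity), not the Neural Network Toolbox implementation [23]:

```python
def adaptive_lr_step(weights, grad, error_fn, lr, old_error,
                     max_ratio=1.04, lr_dec=0.7, lr_inc=1.05):
    """One batch update: error_fn(w) returns the RMS of candidate weights w,
    grad is the batch gradient at the current weights."""
    new_weights = weights - lr * grad
    new_error = error_fn(new_weights)
    if new_error > max_ratio * old_error:
        return weights, old_error, lr * lr_dec     # discard the step, slow down
    if new_error < old_error:
        lr *= lr_inc                               # accept and speed up
    return new_weights, new_error, lr              # otherwise accept, keep lr
```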

To avoid overfitting, the performance of the network is tested every hundred or thousand epochs during the training, and the weights for which the minimal RMS for the test set is observed are recorded.

The target vector describing the membership of an object in a class was set to binary values of 1 (for the corresponding class) and 0 (for the other classes). The final output of the net can be evaluated in two different ways: the object can be considered as correctly classified if the largest output, regardless of its absolute value, is observed on the node signalling the correct class; or the object can be considered as correctly classified if the largest output is observed on the node signalling the correct class and its value is higher than 0.5. The second criterion is stricter and was chosen in this study to evaluate NN performance. It allows soft modelling of the data, i.e. it can happen that an object is not classified into any of the predefined classes.

The performance of a classification system is additionally expressed as the percentage of correctly classified objects of the training and test sets, i.e. the number of correctly classified objects divided by the total number of objects present in these sets.
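The strict decision rule and the correct classification rate can be written as a short sketch (assumed NumPy arrays; marking unclassified objects with -1 is an arbitrary convention of this sketch):

```python
import numpy as np

def classify(out, threshold=0.5):
    """Predicted class = argmax node, accepted only if its output exceeds the threshold."""
    winners = out.argmax(axis=1)
    accepted = out[np.arange(len(out)), winners] > threshold
    return np.where(accepted, winners, -1)          # -1 = not classified

def ccr(out, true_classes, threshold=0.5):
    """Correct classification rate in percent."""
    return 100.0 * (classify(out, threshold) == true_classes).mean()
```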


    5. Results and discussion

    5.1. Selection of NN architecture

To compare different techniques of training set selection, one can compare them with the optimal structure obtained for each training set selection technique. As described by Zupan and Gasteiger [6], another way is to compare the performance using a fixed NN structure. We chose the latter procedure. The architecture of the network was first optimised for the data divided into training and test sets by the Kennard-Stone algorithm, and this structure was then used as the fixed structure. The input and output values were range-scaled between 0.1 and 0.9, variable by variable. The backpropagation learning rule with adaptive learning rate and momentum was used. The initial values of the learning rate and momentum were fixed at 0.1 and 0.3, respectively.

An effort was made to check the influence of the random initialisation of the net weights upon the final classification results. For instance, with data set 1, rerunning a neural net (4 nodes in the hidden layer) 10 times with randomised initial weights results in correct classification rates (CCRs) for the training and test sets equal to 100% each time, while the mean values of the RMS for the training and test sets are 0.0565 and 0.0619, with standard deviations of 0.0061 and 0.0059, respectively. This demonstrates that the CCRs for the training and test sets are stable with different seeds of the random generator, although the RMS values change. The adaptive learning rate makes the results more independent of the initial values of the weights.

The NN utilised in this study consisted of two active layers of nodes with a sigmoidal transfer function. The number of nodes in the output layer is determined by the number of classes. Normally the number of nodes in the input layer is also determined by the structure of the data. As already explained, for NIR data the number of variables is much larger than the number of objects and the variables are highly correlated. The data can be orthogonalized and reduced by principal component analysis, but then the number of input PCs should be optimised. According to Widrow's suggestion (see Section 1), the number of objects ought to be about 10 times the number of weights. However, in practical use the number of objects is limited and there are seldom so many available. Therefore, we relaxed this condition during the optimisation of the NN architecture: the ratio of the number of objects to the number of weights ought to be more than 1. If the numbers of input and output nodes are fixed, the maximum number of hidden nodes can be estimated using this rule. For instance, if there are 60 objects in the training set, we never train an NN having more than 60 weights. If we want to use 10 input nodes and 6 output nodes, then the number of hidden nodes cannot exceed 3. The NN with 3 hidden nodes, 10 input nodes and 6 output nodes has in total 57 (11 × 3 + 4 × 6) weights, counting the biases. For the net with 4 hidden nodes, the number of weights (11 × 4 + 5 × 6 = 74) is already larger than 60.
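The weight count used in this reasoning (biases included) and the resulting limit on the hidden layer can be reproduced with a few lines; this is only a worked restatement of the example in the text:

```python
def n_weights(n_in, n_hidden, n_out):
    """Weights including biases: (n_in + 1) * n_hidden + (n_hidden + 1) * n_out."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

def max_hidden_nodes(n_objects, n_in, n_out):
    """Largest hidden layer keeping the number of weights at or below the number of objects."""
    h = 1
    while n_weights(n_in, h + 1, n_out) <= n_objects:
        h += 1
    return h

# Example from the text: 60 training objects, 10 input nodes, 6 output nodes
assert n_weights(10, 3, 6) == 57     # 11 * 3 + 4 * 6
assert n_weights(10, 4, 6) == 74     # already larger than 60
assert max_hidden_nodes(60, 10, 6) == 3
```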

There is no standard way to optimise the architecture of an NN. The simplest way is to try systematically all combinations of nodes to find the optimal number of nodes in the input and the hidden layer. Data set 1 is used as an example.

Table 1
Data set 1: correctly classified rate of the training set (CCR) and test set (CCRt) of all combinations within 10 input nodes and 4 hidden nodes; maximum number of epochs 5000

Input   1 hidden node     2 hidden nodes    3 hidden nodes    4 hidden nodes
nodes   CCR     CCRt      CCR     CCRt      CCR     CCRt      CCR     CCRt
 2      28.6    28.6      59.1    65.7      69.5    77.1      80.0    77.1
 3      28.6    28.6      66.7    71.4      91.4    97.1      100     100
 4      28.6    28.6      71.4    71.4      96.2    97.1      100     100
 5      28.6    28.6      69.5    71.4      100     100       100     100
 6      28.6    28.6      71.4    71.4      85.7    85.7      100     100
 7      28.6    28.6      75.2    77.1      100     100       100     100
 8      28.6    28.6      79.1    80.0      98.1    97.1      100     100
 9      28.6    28.6      71.4    68.6      100     94.3      100     97.1
10      28.6    28.6      71.4    68.6      99.1    97.1      100     100


We consider all combinations of the number of input nodes varying from 2 to 10 (i.e. the first two to the first 10 PCs) and the number of hidden nodes varying from 1 to 4. The results of all these nets are shown in Table 1. There are eight NNs which give 100% classification for both recognition and prediction. However, this approach requires a lot of trials, and would be even more time consuming if we wished to take into account all possible combinations of two, three, etc., PCs.
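The systematic search could be expressed as a simple double loop; `train_and_score` is a hypothetical helper that trains the fixed backpropagation net for a given number of input PCs and hidden nodes and returns (CCR, CCRt):

```python
# train_and_score(n_inputs, n_hidden) is a hypothetical helper (see lead-in above)
results = {}
for n_inputs in range(2, 11):        # the first 2 ... first 10 PCs
    for n_hidden in range(1, 5):     # 1 ... 4 hidden nodes
        results[(n_inputs, n_hidden)] = train_and_score(n_inputs, n_hidden)

perfect = [arch for arch, (ccr_train, ccr_test) in results.items()
           if ccr_train == 100 and ccr_test == 100]
```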

A more efficient approach to optimise the number of nodes in the input and the hidden layers is as follows: first the number of input nodes is fixed at the maximal number of PC factors, and the number of hidden nodes is increased from small to large until the performance of the network does not improve any more or both the recognition and prediction percentages are 100%.

The maximum number of PCs to be entered can be decided by the variance explained (for instance, the number of PCs needed to explain 99% of the variance) or by the results of pilot experiments. In the latter case, we train the net using the first 10 PCs as input and the maximum number of hidden nodes estimated by the above rule. If the NN performs well, 10 can be used as the maximum number of PCs. If the performance is not satisfactory, more PCs are taken into account.

When the number of nodes in the hidden layer has been fixed and the maximum number of input nodes has been decided, the number of input nodes is pruned according to the values of the weights. If the weights connected to one input node are all large, this indicates that the variable corresponding to that input node plays an important role in the NN. If the weights connected to one input node are all small, this indicates that the corresponding variable plays a small role in the NN and can be pruned off. If the weights are intermediate, one can try to prune the variable and, if the NN still performs well, decide to prune it definitively. A hidden node can also be pruned if the weights connecting the hidden node to the input nodes, and the weights between the hidden node and the output nodes, are small.

Fig. 1. (a) The root mean square error (RMS) as a function of the number of training epochs; (b) the percentage of correctly classified objects as a function of the number of training epochs; network architecture (10 × 4 × 7); data set 1.


Fig. 2. (a) Hinton diagram of the weights between the nodes of the input layer and the nodes of the hidden layer in the network (10 × 4 × 7); (b) sum of the absolute values of the weights of each node in the input layer; (c) sum of the absolute values of the weights of each node in the hidden layer; data set 1.

Fig. 3. (a) The root mean square error (RMS) as a function of the number of training epochs; (b) the percentage of correctly classified objects as a function of the number of training epochs; network architecture (3 × 4 × 7); data set 1.


The magnitude of the weights can be easily displayed in a Hinton diagram. This diagram displays the elements of the weight matrix as squares whose areas are proportional to the weight magnitudes. The bias vector is separated from the other weights by a solid vertical line. The largest square corresponds to the weight with the largest magnitude and all others are drawn with sizes relative to the largest square [23]. The sum of the absolute values of the weights connected to a node can be used to estimate the importance of the role played by that node. This pruning is repeated until the performance of the network degrades.
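The same node-importance measure can be computed directly from the weight matrices; the pruning threshold below is an arbitrary illustration, whereas the paper prunes by inspecting the Hinton diagram and re-checking the performance after each step:

```python
import numpy as np

def node_importance(W):
    """Sum of absolute weights attached to each node (rows of W, biases excluded)."""
    return np.abs(W).sum(axis=1)

def candidate_prunes(W_input_to_hidden, rel_threshold=0.2):
    """Input nodes whose summed |weights| fall below a fraction of the largest one."""
    imp = node_importance(W_input_to_hidden)        # one value per input node
    return np.flatnonzero(imp < rel_threshold * imp.max())
```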

Table 2 demonstrates the results of classification for the sequence of steps in the optimisation of the net architecture for data set 1. As one can see, a 100% correct classification is observed for the NN with the first 10 PCs as input variables and 4 nodes in the hidden layer.

Fig. 1 demonstrates the performance of the network with 10 input and 4 hidden nodes during the training. Fig. 2 shows the Hinton diagram. The weights of input nodes 4 to 10 are much smaller than those of the first three nodes.

Table 2
Data set 1: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 5000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
10            3              99.1      97.1       568
10            4              100       100        534
 3            4              100       100        531
 2            4              80        77.1       627

This suggests that PCs 4 to 10 do not contribute significantly to the network performance and that the first three PC factors play an important role in the classification. After pruning them, the network performance does not decrease (Fig. 3). However, the recognition and prediction percentages become worse when the input nodes are reduced to 2 (PCs 3 to 10 rejected). Therefore, the optimal structure of the network for data set 1 is 3 input nodes and 4 nodes in the hidden layer. The final weights of the optimal network are shown in the Hinton diagrams (Figs. 4 and 5).

Fig. 4. (a) Hinton diagram of the weights between the nodes of the input layer and the nodes of the hidden layer in the optimal network (3 × 4 × 7); (b) sum of the absolute values of the weights of each node in the input layer; (c) sum of the absolute values of the weights of each node in the hidden layer; data set 1.


Fig. 5. (a) Hinton diagram of the weights between the nodes of the hidden layer and the nodes of the output layer in the optimal network (3 × 4 × 7); (b) sum of the absolute values of the weights of each node in the hidden layer; (c) sum of the absolute values of the weights of each node in the output layer; data set 1.

Fig. 6. The design of the training set by random selection, Kohonen self-organising map, Kennard-Stone algorithm and D-optimal design with a simulated data set; (*) objects of the training set; (-) objects of the test set.


Table 3
Data set 2: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 15000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
10            4              99.2      97.5       1670
10            5              100       100        2182
 9            5              100       100        1781
 8            5              99.2      100        1763

This indicates that the Hinton diagram can be used as a visual tool to reduce the number of input scores and to select the input variables. The optimal architecture is obtained after training the NN four times, as described in Table 2, whereas for the same data 36 trials are needed with the systematic trial method. This kind of pruning is therefore much faster than the systematic trial method. However, the pruning method can be effectively applied only when the performance of the NN is very good. The idea of the method is to reduce the size of the architecture without changing the performance of the NN. If the performance of the NN is bad, there is no sense in trying to improve it by pruning.

Using the pruning method, the optimal architecture for data set 2 is 5 hidden nodes with 9 input nodes (Table 3); for data set 3 it is 3 hidden nodes with 6 input nodes (Table 4). These architectures and parameters were used in the following experiments.

5.2. Comparison of the four techniques of training set selection

In order to compare the four techniques visually, two-dimensional data of 40 objects were first simulated.

Table 4
Data set 3: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 20000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
7             2              76.8      69.4       1699
7             3              100       100        1521
6             3              100       100        1867
5             3              99.0      100        1845

In this case, half of the objects were selected by each of the studied methods except for the Kohonen method. For the Kohonen method, 21 objects were selected, because sometimes half of the objects mapping to the same node was not an integer, and this was rounded up as described earlier.

Fig. 6 shows that the objects selected by the Kennard-Stone and D-optimal designs cover the whole data domain, while the objects selected by the random selection and Kohonen methods do not. The Kennard-Stone procedure selects the objects so that they are distributed evenly, and the D-optimal design selects the extreme objects. The number of objects lying outside the range of the selected objects is lower for the Kohonen method than for random selection. The Kennard-Stone and D-optimal methods seem to select objects that are more appropriate in the sense that they are more representative for building the class borders than those selected by the other methods.

Further, we studied the effect of the four techniques on the performance of the NN by keeping the architecture and parameters of the network constant. In the random selection method, the training set objects are randomly selected, and in the case of the Kohonen self-organising maps they are randomly selected from each cluster. This random selection step means that the selection is sometimes very good and sometimes very bad. To account for this, the random selection and Kohonen self-organising map methods were repeated three times. The results are shown in Tables 5-7. Training the NN once takes about 10 min for data set 1, and about half an hour for data sets 2 and 3.

Table 5
Data set 1: comparison of the four different techniques of training set selection; number of correctly classified objects divided by total number of objects expressed between parentheses

Method          CCR (%)            CCRt (%)          Time (s)
Random          100 (105/105)      97.1 (34/35)      624
Random          100 (105/105)      100 (35/35)       634
Random          100 (105/105)      94.3 (33/35)      636
Kohonen         100 (113/113)      96.3 (26/27)      566
Kohonen         100 (116/116)      100 (24/24)       577
Kohonen         100 (116/116)      100 (24/24)       684
Kennard-Stone   100 (105/105)      100 (35/35)       531
D-optimal       100 (105/105)      100 (35/35)       516


Table 6
Data set 2: comparison of the four different techniques of training set selection; number of correctly classified objects divided by total number of objects expressed between parentheses

Method          CCR (%)            CCRt (%)          Time (s)
Random          99.2 (119/120)     92.5 (37/40)      2218
Random          100 (120/120)      97.5 (39/40)      2213
Random          99.2 (119/120)     97.5 (39/40)      2212
Kohonen         100 (132/132)      96.4 (27/28)      1942
Kohonen         100 (130/130)      100 (30/30)       1956
Kohonen         99.2 (130/131)     96.6 (28/29)      2309
Kennard-Stone   100 (120/120)      100 (40/40)       1781
D-optimal       100 (120/120)      100 (40/40)       2246

Table 8
Data set 3: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by D-optimal design; maximum number of epochs 20000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
6             3              100       100        1881
5             3              100       100        1916
4             3              88.9      88.9       1915

For data set 1, there are no differences in the performance of recognition, the recognition percentages being 100% for all methods. There are differences, though, in the performance of prediction. The prediction percentages with the Kennard-Stone and D-optimal training sets are 100%. For the random selection method, one of the three replicates gives a 100% prediction percentage. For the Kohonen method, two of the three replicates reach 100% in prediction.

For data set 2, the Kennard-Stone and D-optimal training sets lead to perfect performance (100% correct classification) for both recognition and prediction. The results of the random selection are not satisfactory: none of the replicates allows 100% prediction, and only one of the three replicates gives 100% recognition. With the Kohonen method, the results of one replicate are good (100% recognition and 100% prediction), and the results of the other two replicates are bad.

Table 7
Data set 3: comparison of the four different techniques of training set selection; number of correctly classified objects divided by total number of objects expressed between parentheses

Method          CCR (%)            CCRt (%)          Time (s)
Random          100 (99/99)        94.4 (34/36)      1863
Random          100 (99/99)        88.9 (32/36)      1868
Random          100 (99/99)        100 (36/36)       1884
Kohonen         97.2 (103/106)     96.6 (28/29)      1667
Kohonen         100 (107/107)      100 (28/28)       1685
Kohonen         100 (110/110)      100 (25/25)       1632
Kennard-Stone   100 (99/99)        100 (36/36)       1867
D-optimal       100 (99/99)        100 (36/36)       1881

For data set 3, the Kennard-Stone and D-optimal training sets give the same perfect performance. With the random selection and Kohonen methods, the performances of the three replicates are sometimes good and sometimes bad. One of the three replicates gives good results for the random selection method, and so do two of the three replicates for the Kohonen method.

In order to compare the Kennard-Stone design and the D-optimal design further, we tried to optimize the architecture of the NN again for the D-optimal design. The architecture for data set 3 can be further improved using the pruning method (Table 8). The architectures for the other data sets cannot be pruned further. This suggests that the D-optimal selection might sometimes be slightly better than the Kennard-Stone selection. The D-optimal design selects the training set objects which describe the whole information as well as possible. More extreme objects are selected by this method than by the other methods (Fig. 6). For classification, the aim is to derive the border of every class, and therefore the extreme objects are more useful than the others during training.
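The D-optimal selection itself is not described in the part of the paper reproduced here; purely as an illustration of the idea (maximising det(X'X) of the selected objects, starting for example from the Kennard-Stone set), a generic exchange-type sketch, not necessarily the algorithm of [20,21], might look like this:

```python
import numpy as np

def d_optimal_exchange(X, initial_idx, max_sweeps=20):
    """Greedy exchange: swap a selected object for a candidate whenever the swap
    increases log det(Xs.T @ Xs) of the selected design matrix Xs."""
    selected = list(initial_idx)
    others = [i for i in range(len(X)) if i not in selected]

    def logdet(idx):
        Xs = X[idx]
        sign, val = np.linalg.slogdet(Xs.T @ Xs)
        return val if sign > 0 else -np.inf

    best = logdet(selected)
    for _ in range(max_sweeps):
        improved = False
        for si in range(len(selected)):
            for oi in range(len(others)):
                trial = selected.copy()
                trial[si] = others[oi]
                val = logdet(trial)
                if val > best:
                    selected[si], others[oi] = others[oi], selected[si]
                    best, improved = val, True
        if not improved:
            break
    return np.array(selected)

# e.g. d_optimal_exchange(scores, kennard_stone(scores, n_select))
```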

6. Conclusion

Artificial NNs are shown to be useful pattern recognition tools for the classification of NIR spectral data of drugs when the training sets are correctly selected. Comparing the four training set selection methods, the Kennard-Stone and D-optimal procedures are better than the random selection and Kohonen methods. The results of the D-optimal design may be slightly better than those of the Kennard-Stone design. However, the computing time of the D-optimal design (using the Kennard-Stone design as the initial points) is larger than that of the Kennard-Stone procedure. The random selection and Kohonen methods do not give good performance in our study.

The number of data sets studied is not sufficiently large to prove that these conclusions are always valid for any data set. However, they at least allow us to state that the Kennard-Stone procedure will be a useful approach in certain instances, and, in our opinion, in most instances.

References

[1] X.H. Song and R.Q. Yu, Chemom. Intell. Lab. Syst., 19 (1993) 101-109.
[2] C. Borggaard and H.H. Thodberg, Anal. Chem., 64 (1992) 545-551.
[3] Y.W. Li and P.V. Espen, Chemom. Intell. Lab. Syst., 25 (1994) 241-248.
[4] D. Wienke and G. Kateman, Chemom. Intell. Lab. Syst., 23 (1994) 309-329.
[5] T.B. Blank and S.D. Brown, J. Chemom., 8 (1994) 391-407.
[6] J. Zupan and J. Gasteiger, Neural Networks for Chemists: An Introduction, VCH, Weinheim, 1993.
[7] T. Naes, K. Kvaal, T. Isaksson and C. Miller, J. Near Infrared Spectrosc., 1 (1993) 1-11.
[8] P. de B. Harrington, Chemom. Intell. Lab. Syst., 19 (1993) 143-154.
[9] B.J. Wythoff, Chemom. Intell. Lab. Syst., 20 (1993) 129-148.
[10] G. Kateman, Chemom. Intell. Lab. Syst., 19 (1993) 135-142.
[11] J.R.M. Smits, L.W. Breedveld, M.W.J. Derksen and G. Kateman, Anal. Chim. Acta, 258 (1992) 1-25.
[12] A.P. Weijer, L. Buydens and G. Kateman, Chemom. Intell. Lab. Syst., 16 (1992) 77-86.
[13] J. Zupan and J. Gasteiger, Anal. Chim. Acta, 248 (1991) 1-30.
[14] B. Widrow, Adaline and Madaline, in: Proceedings of the IEEE 1st International Conference on Neural Networks, 1987, pp. 143-158.
[15] A. Maren, C. Harston and R. Pap, Handbook of Neural Computing Applications, Academic Press, San Diego, 1990.
[16] T. Naes, J. Chemom., 1 (1987) 121-134.
[17] T. Naes and T. Isaksson, Appl. Spectrosc., 43 (1989) 328-335.
[18] T. Kohonen, Self-Organisation and Associative Memory, Springer, Heidelberg, 1984.
[19] R.W. Kennard and L.A. Stone, Technometrics, 11 (1969) 137-148.
[20] R. Carlson, Design and Optimization in Organic Synthesis, Elsevier, Amsterdam, 1992.
[21] P.F. de Aguiar, B. Bourguignon, M.S. Khots and D.L. Massart, Chemom. Intell. Lab. Syst. (in press).
[22] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink and D.L. Alkon, Biol. Cybernet., 59 (1988) 257-263.
[23] H. Demuth and M. Beale, Neural Network Toolbox User's Guide, The MathWorks, Inc., 1993.
[24] R.J. Barnes, M.S. Dhanoa and S.J. Lister, Appl. Spectrosc., 43 (1989) 772-777.
[25] B. Bourguignon, P.F. de Aguiar, K. Thorns and D.L. Massart, J. Chromatogr. Sci., 32 (1994) 144-152.
[26] T.J. Mitchell, Technometrics, 16 (1974) 203-210.
[27] V.V. Fedorov, Theory of Optimal Experiments (translated by W.J. Studden and E.M. Klimko), Academic Press, New York, 1972.
[28] A.C. Atkinson, Chemom. Intell. Lab. Syst., 28 (1995) 35-47.
[29] B. Bourguignon, P.F. de Aguiar, M.S. Khots and D.L. Massart, Anal. Chem., 66 (1994) 893-904.

