Fault Detection With An Adaptive Distance For The k-Nearest Neighbors Rule
Ghislain Verdier, Ariane Ferreira
Ecole Nationale Superieure des Mines de Saint-Etienne, Centre de Microelectronique de Provence, Gardanne, France
([email protected],[email protected])
ABSTRACT
In recent years, fault detection has become a crucial issue in many industrial fields, notably semiconductor manufacturing, where process control engineers constantly try to improve equipment productivity by detecting abnormal behavior as quickly as possible. Because of the number of variables and the correlations between them in this type of application, statistical methods for fault detection need to be multivariate. The multivariate control chart procedures used in industry are usually derived from the Hotelling $T^2$. However, this rule can only be used when the observations are generated by a Gaussian distribution, an assumption rarely satisfied in practice. An alternative is to apply nonparametric control charts, which require no assumption on the distribution. A nonparametric rule, the k-Nearest Neighbors Detection rule, is studied in this paper. The approach consists in evaluating the distance of an observation to its nearest neighbors and declaring a fault if this distance is too large. In this paper, a new adaptive Mahalanobis distance is proposed. It takes into account the local correlation structure of the data and thereby increases the number of faults detected for a fixed false alarm rate, compared to a classic distance such as the Euclidean distance.
Keywords: Adaptive Mahalanobis distance, Correlated variables, Hotelling $T^2$, Multivariate methods, Statistical fault detection, k-Nearest Neighbors rule.
1. INTRODUCTION
Process control is crucial in many industries, especially in semiconductor manufacturing. A major step of process control is Fault Detection and Classification (FDC). The aim of FDC is to construct a decision rule able to detect as quickly as possible an abnormal evolution of the system (for example a machine) in order to prevent more critical problems in the future. Statistical methods, like Multivariate Control Charts (MCC), are among the most widely used methods for the construction of such decision rules. Generally, the problem is stated as follows: the statistical law of the observations under control is unknown or partially unknown. The goal is to get a decision rule from a learning sample of data collected during normal operating mode. The decision rule must be able to detect whether a new observation comes from normal or fault mode. The most common approach supposes that the learning sample observations (the observations without fault) are generated from a multidimensional Gaussian distribution with unknown mean and variance. In this case, the Hotelling $T^2$ rule is the most appropriate solution to detect a fault characterized by a shift in the mean of the observations (see [7]). However, the statistical law of the observations under control is often non-Gaussian. The Hotelling $T^2$ is then no longer appropriate and leads to many false alarms and non-detections. An alternative is to apply nonparametric detection rules. These approaches can be applied without any assumption on the statistical law generating the observations under control, since the detection rules are implemented only with the learning sample. Among these methods, we can cite the rules studied by Devroye and Wise [4] and Baillo et al. [2], based on a standard level set estimator, or by Baillo and Cuevas [1], in which a kernel estimator is used. In the literature, several papers deal with fault detection using one-class Support Vector Machines approaches (see for example Manevitz and Yousef [8]).
1978-1-4244-4136-5/09/$25.00 @2009 IEEE
More recently, He and Wang [5] proposed to adapt the classification rule of the k-nearest neighbors to the problem of Fault Detection (FD). The approach is to evaluate the distance of an observation from the normal operating region. The cumulative distance of this observation to its k nearest neighbors in the learning sample is calculated. If this distance is too high, the observation is considered out-of-control. He and Wang [5] apply this decision rule with the Euclidean distance. However, this distance is not always well suited to the problem since it does not take into account the correlations between variables. The aim of this paper is to propose a new distance for the k-Nearest Neighbors Detection rule (k-NND rule). It is an adaptive Mahalanobis distance based on the correlation structure of the nearest neighbors of each observation under monitoring.
The paper is organized as follows: in Section 2, the problem of fault detection is presented and the Hotelling $T^2$ rule is recalled. In Section 3, the k-Nearest Neighbors Detection rule is detailed and the new adaptive distance is introduced. The Hotelling $T^2$ and the k-NND rule (with Euclidean and adaptive distances) are studied on simulation trials in Section 4, and finally, Section 5 gives conclusions on the proposed method.
2. THE FAULT DETECTION PROBLEM AND THE HOTELLING $T^2$
2.1. The context
In the sequel, assume that $X_1, \ldots, X_N$ are measurements of a system under control, constituting a learning sample. A measurement is a set of values taken by the process variables and is assumed to be a random vector in $\mathbb{R}^d$. The variables $(X_i)_{i=1,\ldots,N}$ are independent and identically distributed according to an unknown law $\mathcal{L}_0$, with a probability measure denoted by $P_0$. When a new measurement $X_{N+1}$ arrives, the system is said to be out-of-control if the new observation is generated by a law $\mathcal{L}_1 \neq \mathcal{L}_0$. Since $\mathcal{L}_0$ and $\mathcal{L}_1$ are unknown, we decide that a change has occurred if the new observation is outside a tolerance region $S_0$ defined by a false alarm probability $\alpha$ such that

$$P_0[X_{N+1} \notin S_0] = \alpha.$$

$S_0$ is constructed from the learning sample $(X_i)_{i=1,\ldots,N}$.
2.2. Gaussian observations and Hotelling $T^2$
Suppose that the observations under control are generated by a multidimensional Gaussian distribution with an unknown mean $\mu$ and an unknown variance-covariance matrix $\Sigma$. The tolerance region can be defined with the Hotelling $T^2$ detection rule introduced in [6]. The mean and variance-covariance matrix are respectively estimated by:

$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$$

and

$$S = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})^t,$$

where $y^t$ is the usual notation for the transpose of a matrix $y$. The Hotelling $T^2$ test statistic for the observation $X_{N+1}$ is then

$$T^2(X_{N+1}) = (X_{N+1} - \bar{X})^t \, S^{-1} \, (X_{N+1} - \bar{X}), \tag{1}$$

and it is compared to an upper control limit defined by:

$$T^2_{UCL} = \frac{(N-1)(N+1)d}{N(N-d)} \, F_\alpha(d, N-d),$$

where $F_\alpha(d, N-d)$ is the $(1-\alpha)$-quantile of the Fisher distribution with parameters $d$ and $N-d$ (see Montgomery [7]).

When $T^2(X_{N+1}) \geq T^2_{UCL}$, the system is declared out-of-control. By construction, the probability that $T^2(X_{N+1})$ is greater than $T^2_{UCL}$ when $X_{N+1}$ has been generated by $\mathcal{L}_0$ tends to the false alarm probability $\alpha$ (fixed by the user) when $N$ is large.

Unfortunately, the distribution $\mathcal{L}_0$ is often non-Gaussian, notably in the semiconductor industry, where the distributions under control of most equipment can be multimodal with several operating points. In this case, the Hotelling $T^2$, which assumes a Gaussian distribution, is not appropriate. It leads to too many false alarms and non-detections.

In Figure 1, $N = 400$ observations are generated by a Gaussian distribution. The tolerance region obtained by the Hotelling $T^2$ is defined by an ellipse (any observation inside the ellipse is declared "in control") that is well suited to the observations of the learning sample. In Figure 2, by contrast, the variables $X_i$ are not generated by a Gaussian distribution. The tolerance region obtained with the $T^2$ is therefore clearly not adapted to the behavior of the system under control: for example, a point with coordinates $x = (6, 0)$, clearly atypical on the graph, would be declared "in control".

It is therefore important to develop detection methods that apply in the case of non-Gaussian distributions. If there is no a priori information on the statistical law of the data under control, the best approach is to apply nonparametric methods, based only on the observations of the learning sample. The k-NND rule discussed in the sequel is one of them.

Fig. 1: N=400 observations generated by a Gaussian distribution and the tolerance region of Hotelling $T^2$.

Fig. 2: N=400 observations generated by a non-Gaussian distribution and the tolerance region of Hotelling $T^2$.
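The Hotelling $T^2$ rule of Section 2.2 can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch: the function names `hotelling_t2` and `t2_upper_control_limit` are hypothetical, and the Fisher quantile is passed in as an argument rather than computed (it could be obtained, for instance, from `scipy.stats.f.ppf(1 - alpha, d, N - d)` if SciPy is available).

```python
import numpy as np

def hotelling_t2(learning, x_new):
    """Hotelling T^2 statistic of x_new with respect to the learning sample."""
    learning = np.asarray(learning, dtype=float)
    x_bar = learning.mean(axis=0)                               # empirical mean
    s = np.atleast_2d(np.cov(learning, rowvar=False, ddof=1))   # 1/(N-1) covariance
    diff = np.asarray(x_new, dtype=float) - x_bar
    # Solve S z = diff rather than forming the inverse explicitly.
    return float(diff @ np.linalg.solve(s, diff))

def t2_upper_control_limit(n, d, f_quantile):
    """Upper control limit; f_quantile is the (1 - alpha)-quantile of F(d, n - d)."""
    return (n - 1) * (n + 1) * d / (n * (n - d)) * f_quantile

# Usage: an alarm is raised when the statistic reaches the limit.
# alarm = hotelling_t2(learning, x_new) >= t2_upper_control_limit(N, d, f_quantile)
```

Solving the linear system instead of inverting $S$ is a numerical convenience only; it gives the same statistic.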
3. THE k-NEAREST NEIGHBORS RULE
3.1. Presentation
Initially, the k-nearest neighbors rule is a nonparametric supervised classification method. Supervised classification consists in predicting the unknown class or label of an observation (1 or 0, healthy or sick, etc. for a binary classification) given a learning sample of labeled observations (see [3] for an overview). The k-nearest neighbors classification rule assigns to an observation the label held by the majority of its k nearest neighbors in the learning sample. The idea is that if an observation is close to a group whose members almost all share the same label, this observation must belong to the same class.
He and Wang [5] propose to adapt this rule to the problem of fault detection. In this case, however, only one class is represented in the learning sample: the class of observations under control. The principle of the detection method is as follows: an observation under control will generally take its values in a close neighborhood of the learning sample. A new observation $X_{N+1}$ is therefore declared out-of-control if it is too far from the data under control. In order to assess the distance between $X_{N+1}$ and the observations under control $(X_i)_{i=1,\ldots,N}$, a cumulative distance is calculated between $X_{N+1}$ and its k nearest neighbors in the learning sample:

$$D_k^2(X_{N+1}) = \sum_{j=1}^{k} D^2(X_{N+1}, X_{(j)}),$$

where $k$ is a positive integer fixed by the experimenter, $D$ is a distance (for example the Euclidean distance) and $X_{(1)}, \ldots, X_{(k)}$ are the k nearest neighbors of $X_{N+1}$.
If $X_{N+1}$ is generated by $\mathcal{L}_0$, it is probably close to several observations of the learning sample, so the cumulative distance to its k nearest neighbors is small. If this cumulative distance is too high, $X_{N+1}$ is declared out-of-control. It is therefore necessary to determine a control limit for the test statistic $D_k^2(\cdot)$ given a false alarm rate $\alpha$. Contrary to the Hotelling $T^2$, the distribution of the test statistic cannot be rigorously expressed from a statistical point of view. The threshold is therefore chosen empirically from the learning sample. The cumulative distance is evaluated on each observation of the training data set. The set $(D_k^2(X_i))_{i=1,\ldots,N}$ obtained is an empirical distribution of the distance $D_k^2(\cdot)$ given $X_1, \ldots, X_N$. The control limit, denoted $h$, is defined as the $(1-\alpha)$ empirical quantile of $(D_k^2(X_i))_{i=1,\ldots,N}$, i.e. $(1-\alpha) \times 100\%$ of the values $(D_k^2(X_i))_{i=1,\ldots,N}$ are lower than $h$. By construction, when $N$ is large, the probability of false alarm satisfies:

$$P_0[D_k^2(X_{N+1}) > h \mid X_1, \ldots, X_N] \approx \alpha.$$
The Algorithm: Following He and Wang [5], the algorithm of the detection rule is divided into two parts: the first is a preliminary step concerning the choice of the threshold, and the second is the monitoring procedure.
Firstly, fix a positive integer $k$ and choose a distance $D$ on $\mathbb{R}^d$ (for example, the Euclidean distance).
Part 1 - Threshold choice
1- For all $i = 1, \ldots, N$, find the k-nearest neighbors of $X_i$ in the learning sample $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_N$ and calculate the cumulative distance:

$$d_i = D_k^2(X_i) = \sum_{j=1}^{k} D^2(X_i, X_{(j)}).$$

2- The threshold is chosen as the $(1-\alpha)$ empirical quantile of the distance distribution:

$$h = d_{([N(1-\alpha)])},$$

where $[N(1-\alpha)]$ is the integer part of $N(1-\alpha)$ and $d_{(1)}, \ldots, d_{(N)}$ are the order statistics of the sample.
Part 2 - Monitoring
1- Find the k-nearest neighbors of $X_{N+1}$ in the learning sample $X_1, \ldots, X_N$ and calculate the cumulative distance:

$$d_{N+1} = D_k^2(X_{N+1}) = \sum_{j=1}^{k} D^2(X_{N+1}, X_{(j)}).$$

2- Apply the decision rule: if $d_{N+1} \geq h$, then $X_{N+1}$ is declared out-of-control and an alarm is triggered. If $d_{N+1} < h$, the system is under control and the previous two steps are repeated for the next observation $X_{N+2}$.
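The two parts of the algorithm above can be sketched as follows for the Euclidean distance. This is a minimal illustration with NumPy and a brute-force neighbor search; the function names are hypothetical and are not taken from He and Wang [5].

```python
import numpy as np

def cumulative_knn_distance(x, sample, k):
    """Sum of squared Euclidean distances from x to its k nearest points in sample."""
    sample = np.asarray(sample, dtype=float)
    sq_dist = np.sum((sample - np.asarray(x, dtype=float)) ** 2, axis=1)
    return float(np.sort(sq_dist)[:k].sum())

def knnd_threshold(learning, k, alpha):
    """Part 1: leave-one-out cumulative distances and their (1 - alpha) empirical quantile."""
    learning = np.asarray(learning, dtype=float)
    n = len(learning)
    d = np.array([
        cumulative_knn_distance(learning[i], np.delete(learning, i, axis=0), k)
        for i in range(n)
    ])
    idx = int(np.floor(n * (1 - alpha)))      # [N(1 - alpha)], a 1-based rank
    return float(np.sort(d)[max(idx, 1) - 1])

def knnd_is_fault(x_new, learning, k, h):
    """Part 2: alarm when the cumulative distance reaches the threshold h."""
    return cumulative_knn_distance(x_new, learning, k) >= h
```

The brute-force search costs one pass over the learning sample per monitored observation; a spatial index could replace it for large samples, which is a design choice outside the scope of the text.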
This detection rule performs better than the Hotelling $T^2$ for non-Gaussian observations (see Section 4). For a fixed false alarm rate, the number of abnormal observations detected is greater for the nonparametric rule than for the parametric one.
3.2. The new adaptive distance
So far we have not discussed the choice of the distance in detail. He and Wang [5] use the classical Euclidean distance defined by

$$D^2(y_i, y_j) = (y_i - y_j)^t (y_i - y_j),$$

a distance for which all components of the vectors are treated equally. However, it seems to us necessary to use a distance better suited to the data, a statistical distance, which can take into account the covariances or correlations between variables. In statistics, the correlation coefficient between two variables (a quantity taking values in $[-1, 1]$) is used to study the linear relationship between these variables: if the correlation is high (as is the case in the data simulated in Figure 1), the variable $X_{i2}$ tends to increase when $X_{i1}$ increases (on the contrary, when the correlation between two variables is negative, one variable decreases when the other increases). Therefore, in Figure 1, a point with coordinates $x_1 = (1, -1)$ is statistically closer to the mean of the distribution (here, $\mu = (3, 1)$) than a point with coordinates $x_2 = (5, -1)$, while the Euclidean distance between each of these two points and $\mu$ is the same.
In the Hotelling $T^2$ rule, when the distribution of the observations is Gaussian, the correlation between two components is taken into account in the test statistic. Indeed, in the Hotelling $T^2$, the Mahalanobis distance, which is based on the variance-covariance matrix of the variables (or its estimate), is calculated between the variable of interest and the mean of the distribution (or its estimate), as in equation (1). Therefore, in the example of Figure 1, the point $x_1 = (1, -1)$ is in the tolerance region (in the control ellipse) whereas the point $x_2 = (5, -1)$ is out-of-control.
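As a worked illustration of this example, assume a hypothetical covariance matrix with unit variances and a correlation of 0.8 (the parameters used for Figure 1 are not given in the text, so these values are an assumption). The two points are at the same Euclidean distance from $\mu = (3, 1)$, but their squared Mahalanobis distances differ by a factor of nine:

```latex
x_1 - \mu = (-2, -2), \qquad x_2 - \mu = (2, -2), \qquad
\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}, \qquad
\Sigma^{-1} = \frac{1}{0.36} \begin{pmatrix} 1 & -0.8 \\ -0.8 & 1 \end{pmatrix}

\|x_1 - \mu\|^2 = \|x_2 - \mu\|^2 = 8

(x_1 - \mu)^t \, \Sigma^{-1} \, (x_1 - \mu) = \frac{4 + 4 - 6.4}{0.36} = \frac{1.6}{0.36} \approx 4.44

(x_2 - \mu)^t \, \Sigma^{-1} \, (x_2 - \mu) = \frac{4 + 4 + 6.4}{0.36} = \frac{14.4}{0.36} = 40
```

The point $x_1$ deviates from the mean along the direction of the positive correlation and is therefore statistically close, whereas $x_2$ deviates against it.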
For Gaussian distributions, the Mahalanobis distance is then the most efficient distance for the k-NND rule. The results obtained are close to those of the Hotelling $T^2$. For non-Gaussian distributions, we want to use the k-NND rule with a distance that takes into account the patterns of the data, notably the possible correlations, in order to obtain better results than a k-NND rule applied with the Euclidean distance. The Mahalanobis distance is no longer relevant without the Gaussian assumption. For example, in Figure 2, the observations (clearly non-Gaussian) are separated into two parts: a first part with a high negative correlation and another with a high positive correlation. The estimated variance-covariance matrix $S$ calculated on all the observations indicates that the correlation between the two components $X_{i1}$ and $X_{i2}$ is close to zero (that is why the ellipse obtained with the Hotelling $T^2$ is parallel to the $X_{i1}$-axis). Ideally, each part should be studied separately in order to avoid the smoothing that occurs when the covariance is estimated on the whole sample. This is the idea developed in the new approach proposed in this paper. We want to study the local correlation structures of the learning sample. For an observation $X_{N+1}$ under monitoring, the method consists in evaluating the correlation structure of the $K$ nearest observations of $X_{N+1}$ in the learning sample, and then applying the k-NND rule with a Mahalanobis distance that uses the covariance matrix estimated on these $K$ observations. It is therefore an adaptive distance since, for each new observation under monitoring, a different Mahalanobis distance is used.
The Algorithm: As in Section 3.1, the algorithm of the k-NND rule with the adaptive distance can be divided into two parts: the first for the threshold choice and the second for the monitoring procedure.
Firstly, fix positive integers $K$ and $k$, with $k \leq K$.
Part 1 - Threshold choice
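1- For all $i = 1, \ldots, N$, find the K-nearest neighbors (relative to the Euclidean distance) of $X_i$ in the learning sample $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_N$ and denote these neighbors $X_{(1)}, \ldots, X_{(K)}$.
2- Estimate the covariance structure $S_K(X_i)$ on $X_{(1)}, \ldots, X_{(K)}$ by:

$$S_K(X_i) = \frac{1}{K-1} \sum_{j=1}^{K} (X_{(j)} - \bar{X}_K)(X_{(j)} - \bar{X}_K)^t, \tag{2}$$

where $\bar{X}_K$ is the empirical mean of the $K$ observations $X_{(1)}, \ldots, X_{(K)}$.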
3- Find the k-nearest neighbors (relative to the Mahalanobis distance used with $S_K(X_i)$) of $X_i$ in the learning sample and calculate the cumulative distance of $X_i$ to its k-nearest neighbors:

$$d_i = \sum_{j=1}^{k} (X_i - X_{(j)})^t \, S_K(X_i)^{-1} \, (X_i - X_{(j)}), \tag{3}$$

where $X_{(1)}, \ldots, X_{(k)}$ are the k neighbors.
4- The threshold is chosen as the $(1-\alpha)$ empirical quantile of the distance distribution:

$$h = d_{([N(1-\alpha)])},$$

where $[N(1-\alpha)]$ is the integer part of $N(1-\alpha)$ and $d_{(1)}, \ldots, d_{(N)}$ are the order statistics of the sample.
Part 2 - Monitoring
1- Find the K-nearest neighbors (relative to the Euclidean distance) of $X_{N+1}$ in the learning sample and denote these neighbors $X_{(1)}, \ldots, X_{(K)}$.
2- Estimate the covariance structure $S_K(X_{N+1})$ on $X_{(1)}, \ldots, X_{(K)}$ as in equation (2).
3- Find the k-nearest neighbors (relative to the Mahalanobis distance used with $S_K(X_{N+1})$) of $X_{N+1}$ in the learning sample and calculate the cumulative distance $d_{N+1}$ of $X_{N+1}$ to its k-nearest neighbors as in equation (3).
4- Apply the decision rule: if $d_{N+1} \geq h$, then $X_{N+1}$ is declared out-of-control. If $d_{N+1} < h$, the system is under control and the previous steps are repeated for the next observation $X_{N+2}$.
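The adaptive procedure above can be sketched in Python with NumPy as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the local covariance is assumed to use the $1/(K-1)$ normalisation, and no safeguard is included for a singular local covariance matrix (which can occur if $K$ is too small relative to the dimension).

```python
import numpy as np

def adaptive_cumulative_distance(x, sample, k, K):
    """Cumulative adaptive Mahalanobis distance of x to its k nearest points in sample."""
    sample = np.asarray(sample, dtype=float)
    x = np.asarray(x, dtype=float)
    diff = sample - x
    # Step 1: K nearest neighbors of x for the Euclidean distance.
    local = sample[np.argsort(np.sum(diff ** 2, axis=1))[:K]]
    # Step 2: covariance estimated on these K points only (assumed 1/(K-1) factor).
    s_k = np.atleast_2d(np.cov(local, rowvar=False, ddof=1))
    # Step 3: squared Mahalanobis distances to all points, using the local covariance.
    sq_maha = np.einsum("ij,ij->i", diff, np.linalg.solve(s_k, diff.T).T)
    return float(np.sort(sq_maha)[:k].sum())

def adaptive_threshold(learning, k, K, alpha):
    """Part 1: leave-one-out distances and their (1 - alpha) empirical quantile."""
    learning = np.asarray(learning, dtype=float)
    n = len(learning)
    d = np.array([
        adaptive_cumulative_distance(learning[i], np.delete(learning, i, axis=0), k, K)
        for i in range(n)
    ])
    idx = int(np.floor(n * (1 - alpha)))      # [N(1 - alpha)], a 1-based rank
    return float(np.sort(d)[max(idx, 1) - 1])

def adaptive_is_fault(x_new, learning, k, K, h):
    """Part 2: alarm when the adaptive cumulative distance reaches h."""
    return adaptive_cumulative_distance(x_new, learning, k, K) >= h
```

Because the covariance is re-estimated for every monitored observation, each call performs one extra neighbor search and one small linear solve compared with the Euclidean rule.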
In the method proposed above, there are two parameters to choose: $k$ and $K$. The most important is probably the second one, since it determines the number of observations used for the covariance structure estimation. There is no general rule for choosing this parameter; the choice must be made on a case-by-case basis, depending on the number of observations in the learning sample and the general appearance of the data. The parameter $K$ must be large enough to ensure a good estimation but not too large, since the estimation must remain local. Note that when $K = N$, the covariance is estimated on the whole sample, and the adaptive distance simply becomes the classic Mahalanobis distance.
Computation time: The computations required by the adaptive distance are slightly longer than those of the classical approach, since an additional step is necessary: the determination of the $K$ neighbors and the estimation of a variance-covariance matrix on these $K$ observations. The procedure is nevertheless fast enough to be applied "on-line".
Broadly speaking, the k-NND approach is easily applicable as long as the number of variables is reasonable (for example $d < 6$). When the number of variables is too high, a dimension reduction method, e.g. PCA (Principal Component Analysis), is recommended.
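One way to carry out the dimension reduction mentioned above is a PCA fitted on the learning sample and then applied to each monitored observation. The sketch below uses a singular value decomposition with NumPy; the function names and the choice of the number of retained components are assumptions made for illustration.

```python
import numpy as np

def pca_fit(learning, n_components):
    """Return the mean and the first principal directions of the learning sample."""
    learning = np.asarray(learning, dtype=float)
    mean = learning.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(learning - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(x, mean, components):
    """Project one observation (or an array of observations) onto the components."""
    return (np.asarray(x, dtype=float) - mean) @ components.T
```

The same `mean` and `components` fitted on the learning sample would be reused for every new observation, so that the detection rule always operates in the same reduced space.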
Decision rules: Hotelling $T^2$ and k-NND rule (k = 5) with the Euclidean, Mahalanobis and adaptive distances.

| N = 250 | Hotelling $T^2$ | Euclidean distance | Mahalanobis distance | Adaptive K=20 | Adaptive K=50 | Adaptive K=80 | Adaptive K=110 | Adaptive K=140 | Adaptive K=170 |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 27.28 | 49.34 | 39.67 | 47.56 | 53.67 | 55.91 | 55.72 | 54.14 | 49.53 |
| Variance | 41.89 | 873.74 | 629.06 | 542.74 | 637.42 | 670.08 | 665.22 | 695.89 | 722.67 |

| N = 500 | Hotelling $T^2$ | Euclidean distance | Mahalanobis distance | Adaptive K=30 | Adaptive K=70 | Adaptive K=100 | Adaptive K=130 | Adaptive K=160 | Adaptive K=190 |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 27.23 | 73.95 | 67.20 | 67.82 | 73.30 | 74.96 | 75.59 | 75.90 | 75.72 |
| Variance | 23.10 | 383.30 | 384.66 | 286.54 | 255.71 | 245.31 | 246.53 | 248.58 | 261.58 |

| N = 1000 | Hotelling $T^2$ | Euclidean distance | Mahalanobis distance | Adaptive K=50 | Adaptive K=140 | Adaptive K=230 | Adaptive K=320 | Adaptive K=410 | Adaptive K=500 |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 29.40 | 82.40 | 78.84 | 77.11 | 81.02 | 82.32 | 82.56 | 81.89 | 81.28 |
| Variance | 14.94 | 122.37 | 161.44 | 99.90 | 99.45 | 88.64 | 91.63 | 106.75 | 81.28 |

Tab. 1: Mean and variance of the number of detections for the Hotelling $T^2$ and k-NND rules, for N = 250, N = 500 and N = 1000.
4. EXAMPLES AND APPLICATIONS
In this section, simulation trials are performed to compare the new adaptive distance with the Euclidean and Mahalanobis distances for the k-NND rule. We consider a learning sample of $N$ observations generated as shown in Figure 2: $N/2$ observations are generated by a Gaussian distribution with mean $\mu_1$ and covariance matrix

$$\Sigma_1 = \begin{pmatrix} 0.7 & 0.5 \\ 0.5 & 0.7 \end{pmatrix},$$

and $N/2$ observations are generated by a second Gaussian distribution, with parameters:
112 = ( ~) et ~2 = ( ~O:5 O~75 ) .
In order to compare the different fault detection methods, a false alarm rate is fixed ($\alpha = 1\%$) and the decision rules are all implemented with respect to this rate. Each rule is then evaluated on the number of faults detected. 100 out-of-control observations are simulated outside the theoretical control limits (see Figure 3). Obviously, if a simulated fault consists of a large shift, the k-NND rule detects the change whatever the distance used. That is why the fault trajectories are simulated near the theoretical tolerance region, where the change is more difficult to detect. The number of detections is compared for the following rules:
- Hotelling $T^2$
- the k-NND rule (k = 5) with Euclidean distance
- the k-NND rule (k = 5) with Mahalanobis distance
- the k-NND rule (k = 5) with the new adaptive distance (several values for K).

Fig. 3: 100 out-of-control observations simulated outside the theoretical control limits.
The previous simulation is repeated. Table 1 reports the mean and variance of the number of detections over 200 repetitions, for different values of the learning sample size $N$. As expected, the Hotelling $T^2$ rule gives the worst results, since fewer than one in three out-of-control trajectories is detected. Likewise, the Mahalanobis distance is not very efficient. More interestingly, when the parameter $K$ is well chosen, the new adaptive distance performs better than the Euclidean distance, notably when the learning sample size is quite small. In addition, the variance of the number of detections is smaller with the adaptive distance, which reflects a kind of stability of the method. We can expect to improve these results by working on the joint selection of $k$ and $K$.
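The repetition scheme behind Table 1 can be sketched generically as follows. This is an assumption-laden outline rather than the authors' simulation code: the callables `fit`, `detect`, `sample_learning` and `sample_faults` are hypothetical placeholders for a decision rule and for the data generators, each repetition is assumed to draw a fresh learning sample, and the variance is assumed to be the unbiased sample variance.

```python
import numpy as np

def detection_summary(fit, detect, sample_learning, sample_faults, n_rep, seed=0):
    """Mean and variance of the number of detected faults over n_rep repetitions."""
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_rep):
        learning = sample_learning(rng)   # fresh learning sample (assumption)
        faults = sample_faults(rng)       # out-of-control observations to detect
        model = fit(learning)             # e.g. threshold computed in Part 1
        counts.append(sum(bool(detect(model, x)) for x in faults))
    counts = np.asarray(counts, dtype=float)
    return counts.mean(), counts.var(ddof=1)
```

Any of the rules compared in this section could be plugged in through `fit` and `detect`, which keeps the comparison on identical learning samples and faults when the same seed is used.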
5. CONCLUSIONS AND FUTURE WORK
We propose here a new adaptive distance for the k-nearest neighbors detection rule (k-NND rule) proposed by He and Wang [5]. The k-NND rule is a nonparametric method for fault detection. Being based on the learning sample only, this type of method is of great interest in industry, where the Gaussian assumption is seldom satisfied. The new distance proposed in this paper is an adaptive Mahalanobis distance based on local estimates of the correlation structure. The first simulation results are convincing, since the k-NND rule performs better with the new adaptive distance than with the Euclidean distance. It is now important to study in more detail the choice of the parameters $k$ and $K$ used in the adaptive distance. This choice depends mainly on the learning sample: the sample size $N$ and the possible local correlation structures.
In this paper, the problem is only to decide whether one observation $X_{N+1}$ is in control or out-of-control. But, as for the Hotelling $T^2$, the k-NND rule can be combined with an Exponentially Weighted Moving-Average (EWMA) procedure in order to detect a change in a dynamic process more efficiently and as rapidly as possible.
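As an illustration of what such a combination might look like, the sketch below smooths a sequence of monitored statistics (for example successive cumulative distances) with the standard EWMA recursion $z_t = \lambda v_t + (1 - \lambda) z_{t-1}$. The text does not specify the smoothing constant, the starting value or the control limit, so `lam`, `start` and `limit` are assumptions supplied by the user, and the function names are hypothetical.

```python
def ewma(values, lam, start):
    """EWMA recursion z_t = lam * v_t + (1 - lam) * z_{t-1}, with z_0 = start."""
    z = start
    smoothed = []
    for v in values:
        z = lam * v + (1.0 - lam) * z
        smoothed.append(z)
    return smoothed

def first_alarm(values, lam, start, limit):
    """Index of the first smoothed value reaching the limit, or None if there is none."""
    for t, z in enumerate(ewma(values, lam, start)):
        if z >= limit:
            return t
    return None
```

A small `lam` gives more weight to past observations, which favors the detection of small persistent changes at the cost of a slower reaction to large ones.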
Future work will consist in implementing this kind of approach on a semiconductor manufacturing process and in studying the theoretical properties of these rules.
REFERENCES
[1] A. Baillo and A. Cuevas, "Parametric versus nonparametric tolerance regions in detection problems," Computational Statistics, vol. 21 (3-4), pp. 523-536, 2006.
[2] A. Baillo, A. Cuevas, and A. Justel, "Set estimation and nonparametric detection," Canadian Journal of Statistics, vol. 28, pp. 765-782, 2000.
[3] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, Berlin, 1996.
[4] L. Devroye and G. L. Wise, "Detection of abnormal behavior via nonparametric estimation of the support," SIAM Journal on Applied Mathematics, vol. 38 (3), pp. 480-488, 1980.
[5] Q. P. He and J. Wang, "Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes," IEEE Trans. Semiconduct. Manufact., vol. 20 (4), pp. 345-354, Nov. 2007.
[6] H. Hotelling, "Multivariate quality control, illustrated by the air testing of sample bombsights," in Techniques of statistical analysis, C. Eisenhart, M. W. Hastay, and W. A. Wallis, Eds. New York: McGraw-Hill, 1947, pp. 111-184.
[7] D. Montgomery, Introduction to Statistical Quality Control. Wiley, New York, 1996.
[8] L. M. Manevitz and M. Yousef, "One-class SVMs for document classification," Journal of Machine Learning Research, vol. 2, pp. 139-154, 2001.