A nonparametric change point model for multivariate phase-II statistical process control
Mark Holland, Douglas Hawkins
School of Statistics, University of Minnesota
May 24, 2011
Mark Holland (UMN) Nonparametric change point model 1
Statistical Process Control (SPC) definitions
Statistical Process Control refers to a collection of tools designed to detect a shift in distribution of a sequence of observations.
Phase-I SPC: Analysis is performed on a fixed set of historical data.
Phase-II SPC: Ongoing analysis is performed on a possibly never-ending stream of observations.
Common cause variability is inherent variability in a process, even when running as designed.
Special cause variability is not a normal part of the process, but is the result of the intrusion of an unexpected factor.
A process is in control when only common cause variability exists, but is out of control when special cause variability is introduced.
Statistical Process Control (SPC) applications
Traditionally used in manufacturing settings, but developments in modern industries have created demand for new monitoring techniques
Health care (Thor et al. 2007)
  - Laboratory setting, e.g. chemical assay methods
  - Direct patient care, e.g. ICU vital signs
Post-market product performance
Groundwater and air quality
Many current applications require multivariate nonparametric methods
Several measurements must be monitored simultaneously
Multivariate normal distribution rarely applies
Difficult to check whether a data set follows a multivariate normal distribution
Aluminum Smelter Data
Aluminum smelting refers to an electrolysis process to reduce refined aluminum ore into metallic aluminum.
Data set consists of alumina (Al2O3) content of a smelter feed along with several impurities: silica (SiO2), ferric oxide (Fe2O3), magnesium oxide (MgO), and calcium oxide (CaO).
As expected with compositional data, the contents of the compounds are negatively correlated.
Monitor for change in composition of alumina or any of the impurities.
Standard SPC tools
Some traditional phase-II SPC methods include
Shewhart chart, cumulative sum (CUSUM), exponentially weighted moving average (EWMA)
Limitations of traditional methods
In-control distribution including all parameters must be known.
In practice, parameter estimates from a phase-I training sample are typically substituted for the truth.
In some applications a large historical training sample is not available, so monitoring must begin shortly after data collection begins.
  - ICU vital signs
  - Pollution control monitoring
Must be “tuned” to detect a specific size of shift.
Change point approach to phase-II SPC
Hawkins, Qiu, and Kang (2003) proposed a change point model for phase-II SPC, which does not require knowledge of in- or out-of-control process parameters.
Skeleton of change point approach:
1. Choose a two-sample test statistic for comparing left and right segments of process readings, $\{X_1, \ldots, X_k\}$ and $\{X_{k+1}, \ldots, X_n\}$.
2. Apply the test for all possible split points, $k = 1, 2, \ldots, n - 1$.
3. If the maximum test statistic value is outside of the control limits, signal that a shift has occurred. Otherwise, collect another observation and repeat.
Originally implemented with a likelihood ratio test for a shift in mean for univariate normal data
Zamba and Hawkins (2006) extended using a likelihood ratio test for a shift in multivariate normal data
Deng (2009) extended using the univariate Wilcoxon-Mann-Whitney nonparametric test for difference in location
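The three-step skeleton can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are univariate, the two-sample statistic is a squared pooled t statistic chosen as a simple stand-in, and the names `two_sample_stat`, `scan_split_points`, `monitor`, and the control limit function `limit` are hypothetical rather than taken from the cited papers.

```python
import numpy as np

def two_sample_stat(left, right):
    """Squared pooled two-sample t statistic (a stand-in, not the cited tests)."""
    n1, n2 = len(left), len(right)
    resid = np.concatenate([left - left.mean(), right - right.mean()])
    pooled_var = resid @ resid / (n1 + n2 - 2)   # requires n1 + n2 >= 3
    return (left.mean() - right.mean()) ** 2 / (pooled_var * (1 / n1 + 1 / n2))

def scan_split_points(x, stat=two_sample_stat):
    """Steps 1-2: evaluate the statistic at every split k = 1, ..., n - 1."""
    n = len(x)
    values = [stat(x[:k], x[k:]) for k in range(1, n)]
    best = int(np.argmax(values))
    return float(values[best]), best + 1          # (maximum, maximizing k)

def monitor(stream, limit, start=4):
    """Step 3: signal at the first n whose maximum exceeds limit(n)."""
    x = np.asarray(stream, dtype=float)
    for n in range(start, len(x) + 1):
        t_max, k_hat = scan_split_points(x[:n])
        if t_max > limit(n):
            return n, k_hat                       # signal time, estimated split
    return None                                   # no signal in the data seen
```

In practice `limit(n)` would come from a table of control limits calibrated for the chosen statistic; a constant is used only for experimentation.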
Rank based multivariate change point model
We used an existing hypothesis test proposed by Choi and Marden (1997) to design a change point model for phase-II SPC use.
We observe $n$ random vectors from a multivariate location family distribution
\[ X_1, X_2, \ldots, X_k \sim F(\mu), \qquad X_{k+1}, X_{k+2}, \ldots, X_n \sim F(\mu + \delta), \]
and we wish to test
\[ H_0 : \delta = 0 \quad \text{vs.} \quad H_a : \delta \neq 0. \]
Multivariate nonparametric test (Choi and Marden 1997)
Suppose we observe a sample of $p \times 1$ random vectors $X_1, \ldots, X_n$. For $1 \le i, j \le n$, define
\[ D_{ij} = \frac{X_i - X_j}{\|X_i - X_j\|} \]
and for $1 \le i \le n$, define
\[ R_n(X_i) = \sum_{j=1}^{n} D_{ij}. \]
Then, $R_n(X_i)$ is the centered directional rank vector of $X_i$.
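A minimal NumPy sketch of the directional rank computation follows. The function name `directional_ranks` is hypothetical, and setting $D_{ij} = 0$ when $X_i = X_j$ (including $i = j$) is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def directional_ranks(X):
    """Return an n x p array whose i-th row is R_n(X_i) = sum_j D_ij."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]      # diff[i, j] = X_i - X_j
    norm = np.linalg.norm(diff, axis=2)       # ||X_i - X_j||
    norm[norm == 0] = 1.0                     # assumption: D_ij = 0 if X_i = X_j
    return (diff / norm[:, :, None]).sum(axis=1)
```

For univariate data with distinct values, $D_{ij}$ is the sign of $X_i - X_j$, so $R_n(X_i) = 2\,\mathrm{rank}(X_i) - (n + 1)$, which is the usual rank centered at zero.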
Multivariate nonparametric test (cont’d)
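Next, let
\[ \bar{R}_n^{(k)} = \frac{1}{k} \sum_{i=1}^{k} R_n(X_i) \]
and define the covariance matrix estimator
\[ \hat{\Sigma}_{R_{k,n}} = \frac{n - k}{(n - 1)\, n\, k} \sum_{i=1}^{n} R_n(X_i) R_n(X_i)'. \]
Finally, define the test statistic
\[ R_{k,n} = \bar{R}_n^{(k)\prime}\, \hat{\Sigma}_{R_{k,n}}^{-1}\, \bar{R}_n^{(k)}. \]
Under mild conditions, $R_{k,n}$ has asymptotic null distribution $\chi^2_p$.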
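The statistic can be computed directly from these definitions. The sketch below is one possible implementation, not the authors' code; the names `directional_ranks` and `rank_statistic` are hypothetical, and $D_{ij}$ is set to zero for coincident points as an assumption.

```python
import numpy as np

def directional_ranks(X):
    """Rows are R_n(X_i) = sum_j (X_i - X_j) / ||X_i - X_j||."""
    diff = X[:, None, :] - X[None, :, :]
    norm = np.linalg.norm(diff, axis=2)
    norm[norm == 0] = 1.0                     # assumption: D_ij = 0 if X_i = X_j
    return (diff / norm[:, :, None]).sum(axis=1)

def rank_statistic(X, k):
    """R_{k,n}: quadratic form in the mean directional rank of X_1..X_k."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    R = directional_ranks(X)
    r_bar = R[:k].mean(axis=0)                              # mean rank of left segment
    sigma = (n - k) / ((n - 1) * n * k) * (R.T @ R)         # covariance estimator
    return float(r_bar @ np.linalg.solve(sigma, r_bar))
```

Because each $D_{ij}$ is a unit vector, the statistic is unchanged when every observation is shifted by the same vector or multiplied by the same positive constant.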
Multivariate nonparametric change point model
Test statistic for existence of a change point
\[ R_{\max,n} = \max_{1 \le k \le n-1} R_{k,n} \]
Estimate of the location of the change point
\[ \hat{\tau}_{R,n} = \arg\max_{1 \le k \le n-1} R_{k,n} \]
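Since the sum of outer products $\sum_i R_n(X_i) R_n(X_i)'$ does not depend on $k$, all $n - 1$ statistics can be obtained from one set of ranks and a cumulative sum. The following sketch illustrates this; the function names are hypothetical and $D_{ij} = 0$ for coincident points is an assumption.

```python
import numpy as np

def directional_ranks(X):
    """Rows are R_n(X_i) = sum_j (X_i - X_j) / ||X_i - X_j||."""
    diff = X[:, None, :] - X[None, :, :]
    norm = np.linalg.norm(diff, axis=2)
    norm[norm == 0] = 1.0                     # assumption: D_ij = 0 if X_i = X_j
    return (diff / norm[:, :, None]).sum(axis=1)

def change_point_scan(X):
    """Return (R_max, tau_hat, all R_{k,n} for k = 1, ..., n - 1)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    R = directional_ranks(X)
    S_inv = np.linalg.inv(R.T @ R)            # shared by every split point
    ks = np.arange(1, n)
    r_bar = np.cumsum(R, axis=0)[:-1] / ks[:, None]        # mean rank of first k rows
    quad = np.einsum("ki,ij,kj->k", r_bar, S_inv, r_bar)
    stats = (n - 1) * n * ks / (n - ks) * quad             # R_{k,n} for each k
    best = int(np.argmax(stats))
    return float(stats[best]), int(ks[best]), stats
```

The scaling factor comes from inverting the covariance estimator: $\hat{\Sigma}_{R_{k,n}}^{-1} = \frac{(n-1)\,n\,k}{n-k} \left( \sum_i R_n(X_i) R_n(X_i)' \right)^{-1}$.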
Fixed-sample size simulation results
When both $k$ and $n - k$ are large, the distribution of $R_{k,n}$ is approximately $\chi^2_p$, as expected ($k = 100$, $n - k = 50$).
When $k$ or $n - k$ is small, the distributions of $R_{k,n}$, $R_{\max,n}$, and $\hat{\tau}_{R,n}$ are affected by the dependence structure of the simulated data.
The following plots show the estimated distribution of the location of the maximum $R_{k,n}$ value for a sample of $n = 200$ equicorrelated MVN random vectors with $\rho = 0$ and $\rho = 0.9$.
[Figure, four panels: Distribution of $\hat{\tau}_T$ and $\hat{\tau}_R$ (proportion vs. $k$) for $R_{k,n}$ and $T^2_{k,n}$ with $p = 10$. Panels: $\rho = 0$ and $\rho = 0.9$ over $k = 0$ to $200$, and the same two cases zoomed in to $k = 190$ to $198$.]
Quarantine
Problem: Distribution of $\hat{\tau}$ depends on the dependence structure of the data
  - Distribution only depends on the dependence structure when the split point is near the boundary of the sequence of data
Solution: Quarantine, that is, restrict the search for a change point to the interior of the sequence.
Quarantined Phase-II SPC procedure
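To use $R_{k,n}$ for phase-II SPC:
Collect observation $X_n$ and compute
\[ R_{\max,n,c} = \max_{c < k < n-c} R_{k,n}. \]
Signal that a shift has occurred if $R_{\max,n,c} > h_{n,\alpha,p,c}$, where the control limit $h_{n,\alpha,p,c}$ satisfies
\[ P\left[ R_{\max,n,c} > h_{n,\alpha,p,c} \mid R_{\max,j,c} \le h_{j,\alpha,p,c};\ j < n \right] = \alpha. \]
Use Monte Carlo simulation to obtain the sequence of control limits, $\{h_{n,\alpha,p,c}\}$.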
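One way such a Monte Carlo calibration could be organized is sketched below: simulate many in-control sequences, and at each $n$ take the empirical $(1 - \alpha)$ quantile of the quarantined maximum among the sequences that have not yet signalled. This is a sketch under assumptions, not the authors' procedure: the in-control data are taken to be standard multivariate normal, the quarantined search is taken to cover split points $c < k < n - c$, and all function names are hypothetical.

```python
import numpy as np

def directional_ranks(X):
    """Rows are R_n(X_i) = sum_j (X_i - X_j) / ||X_i - X_j||."""
    diff = X[:, None, :] - X[None, :, :]
    norm = np.linalg.norm(diff, axis=2)
    norm[norm == 0] = 1.0                     # assumption: D_ij = 0 if X_i = X_j
    return (diff / norm[:, :, None]).sum(axis=1)

def quarantined_max(X, c):
    """Maximum of R_{k,n} over split points with c < k < n - c (assumed range)."""
    n = X.shape[0]
    R = directional_ranks(X)
    S_inv = np.linalg.inv(R.T @ R)
    ks = np.arange(c + 1, n - c)
    r_bar = np.cumsum(R, axis=0)[ks - 1] / ks[:, None]     # mean rank of first k rows
    quad = np.einsum("ki,ij,kj->k", r_bar, S_inv, r_bar)
    return float(np.max((n - 1) * n * ks / (n - ks) * quad))

def simulate_control_limits(alpha, p, c, n_max, reps=2000, seed=0):
    """Estimate h_n for n = 2c + 2, ..., n_max by conditional quantiles."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((reps, n_max, p))   # assumed in-control model
    alive = np.ones(reps, dtype=bool)              # sequences with no signal so far
    limits = {}
    for n in range(2 * c + 2, n_max + 1):
        idx = np.flatnonzero(alive)
        r_max = np.array([quarantined_max(data[i, :n], c) for i in idx])
        h = float(np.quantile(r_max, 1 - alpha))   # conditional (1 - alpha) quantile
        limits[n] = h
        alive[idx[r_max > h]] = False              # these sequences have signalled
    return limits
```

For small $\alpha$ such as 1/500, far more replications than the default here would be needed for stable quantile estimates, since only a small fraction of sequences exceeds the limit at each step.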
Control limits
[Figure: Control limits $h_\alpha$ for the phase-II directional rank procedure ($p = 5$, $c = 15$), plotted against $n$ from 33 to 500, with one curve for each of $\alpha = 1/100$, $1/200$, $1/500$, $1/1000$, $1/2000$.]
Average run length (ARL) as a performance metric
The average run length (ARL) of a phase-II SPC procedure is the average number of observations collected before the first signal occurs.
Design the phase-II SPC procedure so that the in-control (IC) ARL is held to a minimum value, $1/\alpha$.
Subject to the constraint on the IC ARL, we would like to minimize the out-of-control (OOC) ARL.
Similar to the common goal in hypothesis testing
  - Minimize the Type-II error rate given that the Type-I error rate is controlled to level $\alpha$.
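The link between $\alpha$ and the in-control ARL can be checked numerically. If the probability of a signal at each observation, given no earlier signal, is a constant $\alpha$, the run length is geometric with mean $1/\alpha$. The sketch below simulates that mechanism directly; the function names are hypothetical and no change point statistic is involved.

```python
import numpy as np

def run_length(alpha, rng, max_obs=10**6):
    """Observations until the first signal when each step signals with prob. alpha."""
    for n in range(1, max_obs + 1):
        if rng.random() < alpha:
            return n
    return max_obs                              # censored at max_obs

def average_run_length(alpha, reps, seed=0):
    """Monte Carlo estimate of the ARL over `reps` independent sequences."""
    rng = np.random.default_rng(seed)
    return float(np.mean([run_length(alpha, rng) for _ in range(reps)]))
```

With $\alpha = 1/500$ the estimate should be close to 500, matching the nominal in-control ARL $1/\alpha$.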
In control ARL simulation results
Simulated equicorrelated data with correlation $\rho = 0, 0.5, 0.9$
Default quarantine values: $c = 9$ for $p = 2$; $c = 15$ for $p = 5, 10$
Multivariate Normal Data:
  - Default quarantine is sufficient to achieve IC ARL within 10% of nominal for all values of $p$ and $\rho$ considered
Multivariate Gamma Data:
  - Positive, right-skewed distribution. Not elliptically symmetric.
  - Default quarantine is sufficient to achieve IC ARL within 10% of nominal, except when $p = 5, 10$ and $\rho = 0.9$
Multivariate Cauchy Data:
  - Symmetric distribution, much heavier tails than MVN distribution.
  - Default quarantine is sufficient to achieve IC ARL within 10% of nominal, except when $p = 10$ and $\rho = 0.5, 0.9$
Out of control simulation methodology
1. Simulate $n = 32$ equicorrelated in-control observations from the multivariate normal distribution with $p = 5$ and mean vector $\mu = 0$.
2. Introduce mean vector shift $\boldsymbol{\delta} = (\delta, \ldots, \delta)^T$ and begin monitoring with quarantine $c = 15$ at observation $n = 33$.
3. Simulate the data sequence until a signal occurs, using control limits chosen to achieve in-control ARL $1/\alpha = 500$.
4. Record run length = number of observations collected since monitoring began.
5. Repeat for 100,000 simulated data sequences and compute the ARL.
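Steps 1 to 4 for a single simulated sequence could look roughly like the sketch below. It is an illustration only: the control limit is passed in as a hypothetical function `limit(n)` (the simulated limits themselves are not reproduced here), the quarantined search is assumed to cover $c < k < n - c$, the run is censored at `max_obs`, and all function names are hypothetical.

```python
import numpy as np

def directional_ranks(X):
    """Rows are R_n(X_i) = sum_j (X_i - X_j) / ||X_i - X_j||."""
    diff = X[:, None, :] - X[None, :, :]
    norm = np.linalg.norm(diff, axis=2)
    norm[norm == 0] = 1.0                     # assumption: D_ij = 0 if X_i = X_j
    return (diff / norm[:, :, None]).sum(axis=1)

def quarantined_max(X, c):
    """Maximum of R_{k,n} over split points with c < k < n - c (assumed range)."""
    n = X.shape[0]
    R = directional_ranks(X)
    S_inv = np.linalg.inv(R.T @ R)
    ks = np.arange(c + 1, n - c)
    r_bar = np.cumsum(R, axis=0)[ks - 1] / ks[:, None]
    quad = np.einsum("ki,ij,kj->k", r_bar, S_inv, r_bar)
    return float(np.max((n - 1) * n * ks / (n - ks) * quad))

def ooc_run_length(delta, rho, limit, p=5, c=15, n_ic=32, max_obs=200, seed=0):
    """Run length of one sequence with a mean shift after observation n_ic."""
    rng = np.random.default_rng(seed)
    sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))   # equicorrelated
    chol = np.linalg.cholesky(sigma)
    X = rng.standard_normal((n_ic + max_obs, p)) @ chol.T   # MVN(0, sigma) rows
    X[n_ic:] += delta                                       # shift each component
    for n in range(n_ic + 1, n_ic + max_obs + 1):
        if quarantined_max(X[:n], c) > limit(n):
            return n - n_ic                                 # observations monitored
    return max_obs                                          # censored: no signal
```

Averaging `ooc_run_length` over many seeds gives a Monte Carlo estimate of the out-of-control ARL in step 5. A call such as `ooc_run_length(1.0, 0.0, limit=lambda n: 20.0)` uses a constant placeholder limit, which is not one of the calibrated $h_{n,\alpha,p,c}$ values.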
Effect of quarantine on out of control ARL
[Figure: Quarantined directional rank OOC ARL, $p = 5$. log(ARL) plotted against shift vector length (0 to 3), with one curve for each quarantine value $c = 0$, $3$, $9$, $15$.]
Performance comparison with parametric method
[Figure: $R_{k,n}$ vs. ZH OOC ARL, $p = 5$. log(ARL) plotted against shift vector length (Mahalanobis distance, 0 to 3), with curves for $R_{k,n}$ with $\rho = 0$, $c = 15$; $R_{k,n}$ with $\rho = 0.9$, $c = 15$; and the ZH parametric method.]
Diagnostic to select degree of quarantine
Based on copula function: A copula is a $p$-dimensional distribution function on $[0, 1]^p$ with uniform univariate marginal distributions.
Sklar's theorem: any $p$-dimensional distribution function with continuous marginal distributions is associated with a unique copula function.
A copula can therefore be used to characterize the dependence between the components of a random vector.
Diagnostic based on the Anderson-Darling test for goodness-of-fit of the multivariate normal copula.
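To illustrate why a copula isolates dependence from the marginal distributions, the sketch below computes rank-based pseudo-observations, a common empirical stand-in for draws from the copula. It is not the Anderson-Darling diagnostic itself, whose details are not given here; the function name `pseudo_observations` is hypothetical and the data are assumed to have no ties.

```python
import numpy as np

def pseudo_observations(X):
    """Map each column to (0, 1) via its ranks divided by n + 1 (no ties assumed)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0) + 1   # column-wise ranks 1..n
    return ranks / (n + 1)
```

Applying any strictly increasing transformation to a column leaves its ranks, and hence the pseudo-observations, unchanged, so only the dependence between components remains.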
Analysis of Aluminum Smelter Data
[Figure: $R_{\max}$ and control limit $h_n$ plotted against observation number.]
In-control ARL: $1/\alpha = 500$
Control limit exceeded at observation $n = 71$
Estimated shift location $\hat{\tau}_{R,n} = 55$
Analysis of Aluminum Smelter Data
[Figure, five panels: content (%) plotted against observation number (0 to 70) for Al2O3, SiO2, Fe2O3, MgO, and CaO.]
Summary
Traditional SPC methods are not suitable for some modern applications
Change point model for phase-II SPC does not require a phase-I training sample
Nonparametric multivariate change point model:
  - Does not require the assumption of multivariate normality
  - Outperforms the parametric method for small to moderate shift sizes, even when the data follow a multivariate normal distribution
  - Detects large shifts more slowly than the parametric method
References
Choi, K. and Marden, J. (1997). An approach to multivariate rank tests in multivariate analysis of variance. Journal of the American Statistical Association 92(440), pp. 1581-1590.
Deng, Q. (2009). A nonparametric change-point model for phase II analysis. PhD thesis, University of Minnesota.
Hawkins, D. M., Qiu, P., and Kang, C. W. (2003). The changepoint model for statistical process control. Journal of Quality Technology 35(4), pp. 355-366.
Thor, J., Lundberg, J., Ask, J., Olsson, J., Carli, C., Harenstam, K., and Brommels, M. (2007). Application of statistical process control in healthcare improvement: systematic review. Quality and Safety in Health Care 16, pp. 387-399.
Zamba, K. D. and Hawkins, D. M. (2006). A multivariate change-point model for statistical process control. Technometrics 48(4), pp. 539-549.
Assumptions required for the asymptotic result for the Choi and Marden (1997) test statistic:
Under the null hypothesis,
\[ \Lambda = \operatorname{cov}(D_{ij}) \quad \text{and} \quad \Omega = \operatorname{cov}(D_{ij}, D_{il}) \]
are finite and positive definite when $i$, $j$, and $l$ are all distinct.
$k/n \to \lambda_0 \in (0, 1)$
Multivariate gamma distribution
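Let $Y_0, Y_1, \ldots, Y_p$ be independent gamma random variables with pdf's
\[ p_{Y_i}(y_i) = \frac{1}{\Gamma(\theta_i)}\, e^{-y_i}\, y_i^{\theta_i - 1}, \qquad y_i > 0,\ \theta_i > 0. \]
Define $X = (Y_0 + Y_1, Y_0 + Y_2, \ldots, Y_0 + Y_p)^T$.
Marginal distribution of each $X_i$ is a univariate gamma distribution with shape parameter $\theta_0 + \theta_i$.
\[ \rho_{ij} = \operatorname{corr}(X_i, X_j) = \frac{\theta_0}{\sqrt{(\theta_0 + \theta_i)(\theta_0 + \theta_j)}}. \]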
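The shared-component construction is easy to simulate, which also gives a numerical check of the correlation formula. The sketch below assumes unit scale for every gamma variable, as in the pdf above; the function names `multivariate_gamma` and `implied_correlation` are hypothetical.

```python
import numpy as np

def multivariate_gamma(theta0, theta, size, rng):
    """Draw rows X = (Y0 + Y1, ..., Y0 + Yp) with independent unit-scale gammas."""
    theta = np.asarray(theta, dtype=float)
    y0 = rng.gamma(shape=theta0, size=(size, 1))            # shared component
    y = rng.gamma(shape=theta, size=(size, theta.size))     # component-specific
    return y0 + y

def implied_correlation(theta0, theta):
    """Correlation matrix with entries theta0 / sqrt((theta0+theta_i)(theta0+theta_j))."""
    s = theta0 + np.asarray(theta, dtype=float)
    rho = theta0 / np.sqrt(np.outer(s, s))
    np.fill_diagonal(rho, 1.0)
    return rho
```

Setting all $\theta_i$ equal to a common $\theta$ gives equicorrelated components with $\rho = \theta_0 / (\theta_0 + \theta)$, so a target correlation can be reached by choosing $\theta_0$ and $\theta$ accordingly.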
Multivariate Cauchy distribution
Let $Y \sim N_p(\mu, \Sigma)$ and let $w \sim \chi^2_\nu$, independent of $Y$. Define
\[ X = \frac{1}{\sqrt{w/\nu}}\, Y. \]
Then, $X$ follows the multivariate $t$ distribution.
If $\nu = 1$, $X$ follows the multivariate Cauchy distribution.
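The scale-mixture construction translates directly into a sampler. The sketch below follows the definition as stated, with $w$ drawn independently of $Y$; the function name `multivariate_t` is hypothetical.

```python
import numpy as np

def multivariate_t(mu, sigma, nu, size, rng):
    """Draw rows X = Y / sqrt(w / nu) with Y ~ N_p(mu, sigma), w ~ chi^2_nu."""
    mu = np.asarray(mu, dtype=float)
    y = rng.multivariate_normal(mu, sigma, size=size)   # size x p normal draws
    w = rng.chisquare(nu, size=(size, 1))               # one mixing draw per row
    return y / np.sqrt(w / nu)
```

With $\mu = 0$, $\Sigma = I$ and $\nu = 1$, each component is standard Cauchy, so its sample median should be near 0 and its upper quartile near 1 even though the mean does not exist.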
Background
  - Definitions
  - Applications
  - Standard tools
Change point models in phase-II statistical process control
  - General framework
  - Multivariate nonparametric location test
  - Multivariate nonparametric change point model
Evaluation of performance
  - In-control performance
  - Out-of-control performance
Analysis of Aluminum Smelter Data