
Module Five: Outlier Detection for One Sample Case

In Module Four, we discussed methods for checking the normality of a response variable, and ways of dealing with extremes, if any exist.

In this unit, we discuss both numerical and modern graphical methods for detecting extremes.

We start with the one-variable case in inter-laboratory testing studies, and extend to two-sample cases in Module Six.


Detecting outliers for one variable case

Consider the TAPPI inter-laboratory testing study, in which 87 labs participated to test Sample GR35. The data reported were the lab averages.

NOTE: It is often the case that each lab tests the same sample twice or more, so that the within-lab variability as well as the between-lab variability can be investigated. Before an adequate analysis of within-lab and between-lab variability, it is critical that the testing procedure for each lab is standardized and the testing process is under statistical control. If very unusual testing results are found, one should look for possible causes, and decide whether to keep or delete the outliers for further analysis.

First thing to do in detecting outliers

The detection of outliers is usually a preliminary analysis to ensure the reliability of the data. Before applying any numerical or graphical approach, it is common practice to do the following to identify obvious mistakes from sampling or testing:

1. A quick visual check through the data values to identify obvious typos or impossible values based on the context of the study, for example, a misplaced decimal point, or a data value that is completely outside the possible range of the testing results.

2. A quick computation of descriptive statistics, which provides the minimum and maximum data values and helps us quickly check for typos or impossible data values as well.

3. Once these are done, we can apply numerical and graphical methods to investigate ‘not-so-obvious’ outliers.
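The preliminary check in steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration: the lab averages and the plausible range of 50 to 100 are made-up assumptions, not values from the TAPPI study, and `quick_screen` is a hypothetical helper name.

```python
import statistics


def quick_screen(values, low, high):
    """Return basic descriptive statistics and the values outside [low, high]."""
    summary = {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }
    # Values outside the plausible range are candidates for typos.
    impossible = [v for v in values if v < low or v > high]
    return summary, impossible


# Hypothetical lab averages; 7.75 mimics a misplaced decimal point (77.5 intended).
lab_means = [77.4, 77.9, 7.75, 77.1, 78.0, 77.6]
summary, impossible = quick_screen(lab_means, low=50.0, high=100.0)
print(summary)
print("Values to double-check:", impossible)
```

The plausible range has to come from the context of the study; the screen only points at values to look up, it does not decide whether they are wrong.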


Graphical and Numerical methods for Detecting Outliers

1. The use of Empirical Rule for identifying outliers

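Empirical Rule: When the distribution of the data is mound-shaped, a data value more than two s.d. away from the mean may be called a possible outlier (extreme), since there is only about a 2.5% chance of falling more than two s.d. below the mean, and about a 2.5% chance of falling more than two s.d. above it.

[Figure: mound-shaped distribution centered at $\bar{x}$, with an area of .025 in each tail, below $\bar{x} - 2s$ and above $\bar{x} + 2s$.]

Note: We replace $\mu$ by $\bar{x}$ and $\sigma$ by $s$ in the Empirical Rule, since $(\mu, \sigma)$ are not known, and we estimate them from the sample information.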
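As a small illustration of the Empirical Rule, the following Python sketch flags values outside the sample mean plus or minus two sample standard deviations. The readings are hypothetical (they are not the blood pressure data used in this module), and `empirical_rule_flags` is a made-up helper name.

```python
import statistics


def empirical_rule_flags(values, k=2.0):
    """Flag values outside xbar +/- k*s, where s uses the n-1 denominator."""
    xbar = statistics.mean(values)
    s = statistics.stdev(values)
    lower, upper = xbar - k * s, xbar + k * s
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged


# Hypothetical readings, with one very large value at the end.
readings = [110, 112, 108, 115, 120, 105, 118, 111, 109, 114, 116, 107, 113, 119, 210]
lower, upper, flagged = empirical_rule_flags(readings)
print(f"limits: ({lower:.1f}, {upper:.1f}); possible outliers: {flagged}")
```

Note that the extreme value itself inflates both the mean and s, which is one reason the rule is only a rough screen.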

2. Box-Plot for detecting outliers

A popular graphical tool for detecting outliers of a variable is the Box-Plot:

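[Figure: box plot showing Q1, the median m, Q3, the sample mean $\bar{x}$, the inner fences Q1 - (1.5)IQR and Q3 + (1.5)IQR, and the outer fences Q1 - 3IQR and Q3 + 3IQR.]

Here the Inter-Quartile Range, IQR = Q3 - Q1, is the range of the middle 50% of the data values.

NOTE: The *'s, the data values beyond Q1 - (1.5)IQR or Q3 + (1.5)IQR, are possible outliers.

Data values beyond Q1 - 3IQR or Q3 + 3IQR are very likely outliers. These data values are very far away from the center (mean and median).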
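The box-plot fences can be computed directly, as in the Python sketch below. The data are hypothetical, and the quartiles use the "inclusive" interpolation of Python's `statistics.quantiles`; software packages differ in their quartile conventions, so the fences from Minitab may not match these exactly.

```python
import statistics


def boxplot_fences(values):
    """Return the inner (1.5*IQR) and outer (3*IQR) fences as (low, high) pairs."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer


def classify(values):
    """Split flagged values into possible and likely outliers."""
    inner, outer = boxplot_fences(values)
    likely = [v for v in values if v < outer[0] or v > outer[1]]
    possible = [
        v for v in values
        if (v < inner[0] or v > inner[1]) and not (v < outer[0] or v > outer[1])
    ]
    return possible, likely


# Hypothetical systolic readings (not the data set plotted in this module).
readings = [70, 96, 100, 104, 108, 110, 112, 114, 118, 122, 126, 130, 170, 210]
possible, likely = classify(readings)
print("possible outliers:", possible)
print("likely outliers:", likely)
```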

Revisit the blood pressure data for 15-20 years old young adults

[Figure: Box plot of Systolic Blood Pressure for 15-20 year old young adults. The plot marks the sample mean and the median, and labels 210 as a likely outlier and the points near 70 and 170 as possible outliers.]

The box plot of the systolic blood pressure shows that 210 is a likely outlier, and that several others are possible outliers (such as 170 and 70).

This plot can easily be done by hand. Minitab also has this plot. Here are the steps for constructing a box plot using Minitab:

1. Go to the Graph menu, choose Box Plot.

2. In the dialog box, enter the variable name in Y. If we want to construct two box plots based on gender, then enter Gender in X.

3. In Data Display, one can add more displays than the default ones by adding a row (row 3, say), for example Mean Symbol for displaying the sample mean on the plot.

4. Annotation allows us to show outlier values, the mean and the median on the plot.

5. Frame allows us to display more than one plot on the same page.

[Figure: Box Plots for Systolic Blood Pressure, separated by Gender (X-axis: SEX, Male and Female; Y-axis: Systolic Blood Pressure), with outlier values, means and medians annotated.]

Box Plots of Blood Pressure, Comparing Male and Female

The likely outlier, 210, is a male. The distribution for males is somewhat skewed to the right. However, excluding 210, it is pretty much symmetric, although there are some potential extremes on either end.

The distribution shape for females is approximately symmetric, and therefore we can assume normality for females.

Hands-on activity

Use the inter-laboratory testing data (the TAPPI data) to construct a box plot for the GR-Lab35-Mean variable and for the GR-Lab35-Mean-1 variable, and identify the likely outliers for each variable.


3. Numerical Methods for Detecting Outliers

a. Studentized Residuals (also known as CPVs (Comparative Performance Values) or h-statistics in the literature of inter-laboratory testing studies).

b. Deleted Studentized Residuals

Consider the TAPPI lab data of sample GR35:

We use the notation {y1, y2, y3, …, yn} to represent the n data values, one from each lab.

When the same testing procedure is applied and each lab process is under statistical control, the expected testing result should be the same for every lab.

We will use the notation $\mu$ for the expected measurement.


A Simple model for describing the one sample testing

As we demonstrated in the ‘2 cm drawing activity’, there is always some uncertainty above and below the true measurement, and if there are no special causes or systematic biases, the deviation of each lab’s testing result from the expected measurement, $y_i - \mu$, should behave in a random fashion.

This suggests that each testing result can be expressed by the following model: $y_i = \mu + \varepsilon_i$ for i = 1, 2, 3, …, n labs.

This describes the expected situation in one-sample testing. We then use the observed lab testing data to estimate the expected testing result and to investigate the random deviations. Using the sample data, this is what we have: $y_i = \bar{y} + e_i$.

$\bar{y}$ is the average of all included labs (also known as the grand mean).

$e_i$ is what we call the residual. We also see that the average of the $e_i$'s is zero.


If the testing result $y_i$ from a lab is likely an outlier, its corresponding $e_i$ will be far away from the average, 0. Therefore, one can use the residual to detect labs with extreme testing results.

Instead of using the residual $e_i$ itself (whose value depends on the measurement units), we use a standardized form of $e_i$ to detect outliers, so that it does not depend on the measurement units.

A classical one is the standardized $e_i$ (also known as the CPV or the h-statistic in inter-laboratory studies).

How to compute the standardized $e_i$ for each lab?

1. Compute $\bar{y}$, the grand mean of all included labs.

2. Compute $e_i = y_i - \bar{y}$.

3. Compute the between-lab variance, $s^2$, and standard deviation, $s$:

$s^2 = \sum_i (y_i - \bar{y})^2 / (n - 1)$ and $s = \sqrt{s^2}$

4. Standardized $e_i = e_i / s$
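The four steps for the standardized residual can be sketched in Python as follows. The ten lab averages are hypothetical (the TAPPI values are not reproduced here), and `standardized_residuals` is a made-up helper name.

```python
import math


def standardized_residuals(y):
    """Follow steps 1-4: grand mean, residuals, between-lab s, and e_i / s."""
    n = len(y)
    ybar = sum(y) / n                             # step 1: grand mean
    e = [yi - ybar for yi in y]                   # step 2: residuals
    s2 = sum(ei ** 2 for ei in e) / (n - 1)       # step 3: between-lab variance
    s = math.sqrt(s2)
    h = [ei / s for ei in e]                      # step 4: standardized residuals
    return ybar, e, s, h


# Hypothetical lab averages; the last lab reports a noticeably higher value.
lab_means = [77.2, 77.6, 77.4, 77.9, 77.5, 77.3, 77.7, 77.5, 77.6, 79.6]
ybar, e, s, h = standardized_residuals(lab_means)
print(f"grand mean = {ybar:.3f}, s = {s:.3f}")
for i, value in enumerate(h, start=1):
    print(f"lab {i:2d}: standardized residual = {value:6.2f}")
```

In this made-up example only the last lab has a standardized residual larger than 2 in absolute value.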


How to use standardized residual (CPV or h-statistic) to detect outliers?

A quick rule:

•If the standardized residual is > 2 or < -2, then it is a possible outlier, since, based on the normal probability, there is only approximately a 2.5% chance of having a standardized residual > 2, and likewise < -2.

•If the standardized residual is > 2.6 or < -2.6, then it is a likely outlier. There is only approximately a 0.5% chance of it being > 2.6, and likewise < -2.6.

NOTE: 2.0 and 2.6 are rounded from the values 1.96 and 2.576 of the Z-distribution, N(0,1).

[Figure: density f(z) of the Z-distribution, with cutoffs at -2.576, -1.96, 0, 1.96 and 2.576; the area is .005 beyond each of ±2.576 and .025 beyond each of ±1.96.]
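The cutoffs in the quick rule can be checked with Python's standard library. This sketch only uses `statistics.NormalDist`; it prints the exact N(0,1) cutoffs for tail areas .025 and .005, and the tail areas that the rounded cutoffs 2.0 and 2.6 actually give.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution, N(0, 1)

# Exact cutoffs for the upper-tail areas used in the quick rule.
for tail in (0.025, 0.005):
    print(f"upper-tail area {tail}: z = {z.inv_cdf(1 - tail):.3f}")

# Upper-tail areas implied by the rounded cutoffs.
for cutoff in (2.0, 2.6):
    print(f"P(Z > {cutoff}) = {1 - z.cdf(cutoff):.4f}")
```

The rounded cutoffs are slightly more conservative than 1.96 and 2.576, so the stated 2.5% and 0.5% are approximations.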


A more precise rule:

•If the standardized residual is > t(.025, n-1) or < -t(.025, n-1), then it is a possible outlier.

•If the standardized residual is > t(.005, n-1) or < -t(.005, n-1), then it is a likely outlier.

NOTE: t($\alpha$, n-1) is a value of the t-distribution with upper-tail area $\alpha$. The standardized residual follows a t-distribution with n-1 degrees of freedom in this case. The t-distribution is very similar to the Z-distribution, but it depends on the sample size. As the sample size gets larger, t eventually becomes the same as Z.

[Figure: density f(t) of the t-distribution, with cutoffs at -t(.005,n-1), -t(.025,n-1), 0, t(.025,n-1) and t(.005,n-1); the area is .005 beyond each of ±t(.005,n-1) and .025 beyond each of ±t(.025,n-1).]


A more sensitive measure for detecting outliers:

Deleted Standardized Residual, $d_j$.

The steps for computing this measure:

1. Delete the jth case.

2. Compute $\bar{y}_{(j)}$, the mean without the jth case, and the residual $e_{i(j)} = y_i - \bar{y}_{(j)}$ for every case, including the jth case.

3. Compute $s^2_{(j)}$ and $s_{(j)}$ using the (n-1) residuals, excluding the jth case.

4. Compute the deleted standardized residual, $d_j = e_{j(j)} / s_{(j)}$.

5. Repeat steps 1-4 for cases j = 1, 2, 3, …, n.

Since the deleted standardized residual for the jth observation estimates all quantities with this observation deleted from the data set, the jth observation cannot influence these estimates. Therefore, unusual Y values clearly stand out. It is more sensitive than the classical standardized residual.
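The deletion steps can be sketched in Python as below. Two points are assumptions rather than statements from this module: the data are hypothetical, and the variance of the remaining (n-1) residuals is computed with an (n-2) denominator. Statistical packages usually apply an additional leverage adjustment, so stored values such as TRES1 in Minitab may differ slightly from this sketch.

```python
import math


def deleted_standardized_residuals(y):
    """For each case j, compute d_j = (y_j - ybar_(j)) / s_(j) with case j left out."""
    n = len(y)
    d = []
    for j in range(n):
        rest = y[:j] + y[j + 1:]                  # step 1: delete the jth case
        ybar_j = sum(rest) / (n - 1)              # step 2: mean without case j
        # step 3: s_(j) from the n-1 remaining residuals (n-2 denominator assumed)
        s2_j = sum((yi - ybar_j) ** 2 for yi in rest) / (n - 2)
        d.append((y[j] - ybar_j) / math.sqrt(s2_j))   # step 4
    return d                                      # step 5: done for every j


# Hypothetical lab averages; the last lab reports a noticeably higher value.
lab_means = [77.2, 77.6, 77.4, 77.9, 77.5, 77.3, 77.7, 77.5, 77.6, 79.6]
d = deleted_standardized_residuals(lab_means)
for i, value in enumerate(d, start=1):
    print(f"lab {i:2d}: deleted standardized residual = {value:6.2f}")
```

For the last lab the deleted value is far larger than its ordinary standardized residual (about 2.7 for the same data), which illustrates why the deleted version is described as more sensitive.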


How to use the Deleted Standardized Residual to detect outliers?

The same quick rule as the standardized residual applies here.

However, if we are to be more precise, we need to use the t-distribution. In applying the t-distribution, the degrees of freedom are now (n-2).

For most applications, the QUICK RULE is sufficient, unless the sample size n is very small. A common rule of thumb is that n < 30 is small. However, for practical purposes in outlier detection, it is appropriate to consider n < 20 to be small, in which case the t-distribution should be applied.

The key issue after detecting the outliers is to find out their possible causes.


The h-plot for Inter-laboratory Testing

The h-plot plots the CPV values on a two-dimensional plot with a center line and upper and lower limits along the X-axis. The X-axis is the Lab ID. The CPV values of the replications within each lab, if any, are grouped together. The Y-axis is the standardized residual (or the deleted studentized residual). An example is given in the following:

[Figure: h-plot for labs 1 to 12 on the X-axis, with a center line at 0 and limits at 2 and -2.]

One may use the more precise t-values for the upper and lower bounds: $\pm t(.025, df)$.

In this plot, there are 12 labs. Each lab has two replications. The length of each line is the standardized residual (h-value or CPV) or the deleted studentized residual.


The h-plot is a graphical view of the standardized residuals or deleted studentized residuals. The same plot is not available in Minitab. However, Minitab does provide all the needed numerical measurements, and we can create a similar graph using Minitab as well.

Outlier detection using residuals is a very useful tool. In the above case, we consider the simplest model that describes one-sample data, $y = \mu + e$. This model assumes

• Each lab is similar in its operation,

•The testing procedure is standardized,

•The operators have similar quality,

•The testing material is similar.

If any of these assumptions is seriously violated, this model is not adequate, and a more complicated model should be considered. Outlier detection should not be applied to the response variable directly if we know in advance that these assumptions are violated.


Use Minitab to compute numerical measurements for conducting outlier detection for one sample case

NOTE: This process involves a lot of computation. We do not do this by hand. Here are the steps for using Minitab to compute residuals, standardized residuals, and deleted standardized residuals.

The TAPPI study is used for demonstration here.

1. Create a column of 1’s, say, in C7:

a. Go to Calc, choose ‘Make Patterned Data’, select ‘Arbitrary Set of Numbers’. In the dialog box, enter C7 to store the data, enter ‘1’ in ‘Arbitrary set of Numbers’, list each value ‘87’ times (the sample size), and list the whole sequence ‘1’ time.

2. Go to Stat, choose Regression, then select ‘Regression’.

3. In the dialog box, enter the response variable, say C5, and enter the predictor C7, the column of all 1’s.

4. Click on ‘Options’, and deselect ‘Fit Intercept’.


Steps- Continued:

5. Click on ‘Storage’, and select Residuals, Standardized Residuals, Deleted Studentized Residuals, and Fits. Each of these will appear as a column in the worksheet.

The Residuals column is named: RESI1,

The Standardized Residuals column is named: SRES1,

The Deleted Studentized Residuals column is named: TRES1,

The Fitted Value is named: FITS1. In the one sample case, this is exactly the Grand Mean of all included labs.

The number at the end of each variable will increase by one, such as RESI2, SRES2, for additional storage in the later analysis.

We can change the variable names as we wish.


There are two additional selections in the Regression Procedure: Graphs, Results.

6. Click on ‘Graphs’. It allows you to conduct graphical detection using these residuals. Choose whichever graphs you wish to see. For example, one may choose ‘Standardized’ and then ‘Normal Plot of Residuals’ to construct a normal probability plot for the standardized residuals.

The graphs will appear in the Graph window.

7. Click on ‘Results’. It allows you to choose the amount of computer output needed. The last option gives the most extensive output.

The results will appear in the Session Window.


Use Minitab to construct the h-plot

Since Minitab does not have the same plot as the h-plot shown before, I will demonstrate how to use another procedure to construct a plot that is similar to the h-plot, using the TAPPI data.

1. Go to Stat, choose Control Charts, then select ‘Individuals’.

2. In the dialog box, enter ‘SRES1’ into the Variable box (or any variable of interest, such as the deleted studentized residuals).

3. Enter 0 for Historical Mean. This will be the center line on the plot.

4. There are five additional selections and three graph editing selections. Leave Test and Estimate as default.

Click on the ‘S-Limit’ selection, and enter 2 for the upper sigma limit and -2 for the lower sigma limit. You can also change the line color and line type.

5. Click on the ‘Stamp’ selection, and enter C1 as the Tick Labels. This will define the ticks on the X-axis using the laboratory names.

6. Click on the ‘Options’ selection. You can change the symbol attributes and connection line attributes.


Case Example: TAPPI Inter-laboratory Study

Let’s start with the SAMPLE GR35.

1. A quick visual check immediately suggests that the following cases are clear outliers, and they are removed from the outlier detection analysis immediately:

U3438: Lab mean = 80.55, U3531: Lab mean = 85.75

2. Now, we follow the procedure described above to compute the standardized residuals and deleted studentized residuals using the remaining data, together with a normal plot analysis.

The unusual observations are:

Unusual Observations

Lab Code  1's   GR35-Lab  Fit      SE Fit  Residual  St Resid
U2415     1.00  76.0630   77.5273  0.0652  -1.4643   -2.45R
U3154     1.00  79.5500   77.5273  0.0652   2.0227    3.39R
U3185     1.00  79.1000   77.5273  0.0652   1.5727    2.63R
U3216     1.00  79.1620   77.5273  0.0652   1.6347    2.74R
U3249     1.00  76.2630   77.5273  0.0652  -1.2643   -2.12R
U3292     1.00  79.1380   77.5273  0.0652   1.6107    2.70R
U3334     1.00  78.7750   77.5273  0.0652   1.2477    2.09R
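As a consistency check on the table, the St Resid column can be approximately recomputed from the Residual and SE Fit columns. The sketch below rests on two assumptions that are not stated in this module: for the intercept-only fit, SE Fit equals s/√n, and the regression output divides each residual by s√(1 − 1/n) rather than by s alone. It uses n = 85, the 87 labs minus the two removed above.

```python
import math

n = 85            # labs left after removing U3438 and U3531 from the 87
se_fit = 0.0652   # "SE Fit" column of the table

# (residual, reported St Resid) copied from the table above
rows = {
    "U2415": (-1.4643, -2.45),
    "U3154": (2.0227, 3.39),
    "U3185": (1.5727, 2.63),
    "U3216": (1.6347, 2.74),
    "U3249": (-1.2643, -2.12),
    "U3292": (1.6107, 2.70),
    "U3334": (1.2477, 2.09),
}

s = se_fit * math.sqrt(n)              # assumed relation: SE Fit = s / sqrt(n)
scale = s * math.sqrt(1 - 1 / n)       # assumed leverage of 1/n for every lab

recomputed = {lab: resid / scale for lab, (resid, _reported) in rows.items()}
print(f"implied between-lab s = {s:.3f}")
for lab, value in recomputed.items():
    print(f"{lab}: recomputed {value:6.2f}, reported {rows[lab][1]:6.2f}")
```

With 85 labs the factor √(1 − 1/n) is about 0.994, so dividing by s alone, as in the hand computation described earlier, gives nearly the same values.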


[Figure: Normal Probability Plot of the Standardized Residual (X-axis: Standardized Residual; Y-axis: Probability). Average: -0.0000000, StDev: 1.00593, N: 85. Anderson-Darling Normality Test: A-Squared: 3.044, P-Value: 0.000.]

The normal probability plot and normality test for the standardized residuals

The pattern does not follow a straight line well. The normality test suggests that the lab testing results clearly do not follow a normal distribution.


•The quick rule is used to detect the outliers in this case, since the sample size is large.

•Both standardized residuals and deleted studentized residuals give the same group of unusual labs.

•The labs whose testing results are found unusual will be notified. Further analysis is then undertaken to find out whether there are any special causes or reasons for these unusual lab results.

•NOTE: the result using the one-sample detection technique is somewhat different from the two-sample plot approach, since some labs that do not show outliers for this sample may show outliers when testing another sample. This is one reason why we should also construct two-sample plots.


[Figure: h-Plot for Standardized Residuals, created with the Minitab Individuals chart (X-axis: labs, labeled from C1; Y-axis: CPV Values; Mean=0, UCL=2, LCL=-2). One point is marked ‘1’.]

This plot is created by Minitab. It is not quite the same as the h-plot, but it serves the same function as the h-plot and more. The mark ‘1’ indicates the lab whose value is over 3, a definite outlier. The labs outside the upper and lower limits of 2 and -2 are considered outliers.

One can choose to use different upper and lower bounds.


Hands-on Activity

Detect the labs whose results are outliers in testing Sample GR36 of the TAPPI study.


Use of Basic Quality Control Chart Techniques for monitoring laboratory performances

Quality control charts were originally developed to monitor shifts in the mean and changes in the variation over time in a manufacturing process. For the inter-laboratory performance of testing a given material, we can apply the same charting method to monitor the performance of laboratories based on two measurements:

1. laboratory measurement means and

2. within-lab measurement variations.

The control charts to be discussed are called

$\bar{X}$-chart for monitoring between-laboratory measurement means.

R-chart for monitoring within-laboratory measurement variations.

Example: A study of a chromatographic method was conducted for determining malathion. Ten labs participated in the study; each lab received a subsample of a technical grade malathion (Tech), two wettable powders (25% WP and 50% WP), an emulsifiable concentrate (58% EC), and a dust. Each participant also received an internally tested standard of malathion (99.1%) along with the analytical method. (Wernimont, 1985).


Row  lab  Rep  WP25   WP50
1    1    1    26.17  50.76
2    1    2    26.22  50.67
3    1    3    25.85  50.81
4    1    4    25.80  50.72
5    2    1    26.44  50.82
6    2    2    26.57  50.90
7    2    3    25.80  51.04
8    2    4    26.06  50.96
9    3    1    26.95  52.53
10   3    2    26.91  52.54
11   3    3    26.98  52.55
12   3    4    26.91  52.47
13   5    1    26.23  50.20
14   5    2    26.00  50.47
15   5    3    26.22  50.39
16   5    4    26.18  50.43
17   6    1    25.45  51.65
18   6    2    25.62  51.67
19   6    3    27.01  51.72
20   6    4    25.72  52.07
21   7    1    26.14  50.53
22   7    2    26.78  50.75
23   7    3    26.04  49.99
24   7    4    25.97  50.92
25   8    1    25.70  50.00
26   8    2    25.90  50.30
27   8    3    25.80  50.50
28   8    4    25.70  50.60
29   9    1    26.13  50.26
30   9    2    26.13  50.36
31   9    3    25.91  50.97
32   9    4    25.86  50.44
33   10   1    26.22  50.23
34   10   2    26.20  50.27
35   10   3    25.84  50.29
36   10   4    25.84  49.97


Construction of the $\bar{X}$-chart and R-chart

Consider the above malathion testing study. Ten labs participated in the study. Each lab tested material WP50 with four replications. Lab 4 was excluded since it did not complete the testing.

Lab ID   Rep1    Rep2    Rep3    Rep4    Sample mean, $\bar{x}_i$   Range, $R_i$
1        x11     x12     x13     x14     $\bar{x}_1$                R1
2        x21     x22     x23     x24     $\bar{x}_2$                R2
3
5
6
7
8
9
10       x10,1   x10,2   x10,3   x10,4   $\bar{x}_{10}$             R10
Average                                  $\bar{\bar{x}}$            $\bar{R}$

Range = Largest - Smallest in each lab.


An X-bar chart is used to monitor the laboratory means. If the labs are consistent, then the averages of the labs should be close to one another. If all of them are equal, then the grand average is the same as each lab average. If the lab averages are very different (that is, some systematic lab biases exist), then there will be deviations between the grand mean and the lab means. This provides the basis of the X-bar chart.

[Figure: X-bar chart layout. X-axis: lab order (or time order). Center line: Grand Mean; Upper Limit: grand mean $+ 3\sigma_{\bar{x}}$; Lower Limit: grand mean $- 3\sigma_{\bar{x}}$.]

The lab averages are then plotted along the lab order.

The multiple ‘3’ is commonly applied in process control. Under the normality assumption, there is a 99.7% chance that a lab sample mean falls within the interval.

As the chart indicates, we need to estimate the grand mean and the SE of the lab mean.

Since the range is usually easier to compute, the population variance, and hence the SE of the lab mean, can also be estimated using the distribution of the range.


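The expected value of the range: $E(R) = d_2 \sigma_x$, where $d_2$ depends on the sample size (in the lab testing case, it is the # of replications, r, conducted by each lab). The values of $d_2$ will be provided in class.

Therefore, the estimate of $\sigma_x$ is given by

$\hat{\sigma}_x = \bar{R} / d_2$

And the SE of the sample mean is

$\hat{\sigma}_{\bar{x}} = \hat{\sigma}_x / \sqrt{r} = \bar{R} / (d_2 \sqrt{r})$

Upper Control Limit is: $\bar{\bar{x}} + 3\bar{R} / (d_2 \sqrt{r}) = \bar{\bar{x}} + A_2 \bar{R}$

Center Line is: $\bar{\bar{x}}$

Lower Control Limit is: $\bar{\bar{x}} - 3\bar{R} / (d_2 \sqrt{r}) = \bar{\bar{x}} - A_2 \bar{R}$

The multiples $A_2$ and $d_2$ depend on the sample size, the # of replications, r.

They will be provided in class.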
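The X-bar chart limits can be reproduced for the WP50 data listed earlier with a short Python sketch. The constant d2 = 2.059 for r = 4 replications is taken from standard control chart tables and is an assumption here, since the module defers the constants to class; `xbar_chart` is a made-up helper name.

```python
import math

# WP50 results by lab, copied from the malathion data table (Lab 4 excluded).
wp50 = {
    1: [50.76, 50.67, 50.81, 50.72],
    2: [50.82, 50.90, 51.04, 50.96],
    3: [52.53, 52.54, 52.55, 52.47],
    5: [50.20, 50.47, 50.39, 50.43],
    6: [51.65, 51.67, 51.72, 52.07],
    7: [50.53, 50.75, 49.99, 50.92],
    8: [50.00, 50.30, 50.50, 50.60],
    9: [50.26, 50.36, 50.97, 50.44],
    10: [50.23, 50.27, 50.29, 49.97],
}

D2_FOR_R4 = 2.059  # assumed table value of d2 for 4 replications


def xbar_chart(data, d2):
    """Return lab means, grand mean, average range, and the 3-SE control limits."""
    r = len(next(iter(data.values())))                 # replications per lab
    means = {lab: sum(x) / r for lab, x in data.items()}
    ranges = {lab: max(x) - min(x) for lab, x in data.items()}
    grand_mean = sum(means.values()) / len(means)
    rbar = sum(ranges.values()) / len(ranges)
    half_width = 3 * rbar / (d2 * math.sqrt(r))        # 3 * estimated SE of a lab mean
    return means, grand_mean, rbar, grand_mean - half_width, grand_mean + half_width


means, grand_mean, rbar, lcl, ucl = xbar_chart(wp50, D2_FOR_R4)
outside = [lab for lab, m in means.items() if m < lcl or m > ucl]
print(f"grand mean = {grand_mean:.2f}, R-bar = {rbar:.2f}")
print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}")
print("labs outside the limits:", outside)
```

Under these assumptions the sketch gives a center line near 50.88 and limits near 50.58 and 51.18, and most labs fall outside them, which points to large between-lab differences.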


Construction of R-chart

Similar to the X-bar chart, the R-chart has the form: $\bar{R} \pm 3\sigma_R$

To estimate $\sigma_R$, we use the following fact about the range: $\sigma_R = d_3 \sigma_x$.

An estimate from the data is $\hat{\sigma}_R = d_3 \hat{\sigma}_x = d_3 \bar{R} / d_2$

The Upper Limit of the R-chart is $\bar{R} + 3 d_3 \bar{R} / d_2 = D_4 \bar{R}$

The Lower Limit of the R-chart is $\bar{R} - 3 d_3 \bar{R} / d_2 = D_3 \bar{R}$

The multiples $D_3$ and $D_4$ depend on the sample size, and will be provided in class.
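The R-chart limits can be sketched the same way, here for the WP25 data listed earlier. The constants d2 = 2.059 and d3 = 0.880 for r = 4 replications come from standard control chart tables and are assumptions here. The sketch also truncates a negative lower limit at 0, a common convention, since a range cannot be negative.

```python
# WP25 results by lab, copied from the malathion data table (Lab 4 excluded).
wp25 = {
    1: [26.17, 26.22, 25.85, 25.80],
    2: [26.44, 26.57, 25.80, 26.06],
    3: [26.95, 26.91, 26.98, 26.91],
    5: [26.23, 26.00, 26.22, 26.18],
    6: [25.45, 25.62, 27.01, 25.72],
    7: [26.14, 26.78, 26.04, 25.97],
    8: [25.70, 25.90, 25.80, 25.70],
    9: [26.13, 26.13, 25.91, 25.86],
    10: [26.22, 26.20, 25.84, 25.84],
}

D2_FOR_R4 = 2.059  # assumed table value of d2 for 4 replications
D3_FOR_R4 = 0.880  # assumed table value of d3 for 4 replications


def r_chart(data, d2, d3):
    """Return lab ranges, the average range, and the R-chart control limits."""
    ranges = {lab: max(x) - min(x) for lab, x in data.items()}
    rbar = sum(ranges.values()) / len(ranges)
    sigma_r = d3 * rbar / d2                  # estimated s.d. of the range
    ucl = rbar + 3 * sigma_r                  # equals D4 * R-bar
    lcl = max(0.0, rbar - 3 * sigma_r)        # equals D3 * R-bar, truncated at 0
    return ranges, rbar, lcl, ucl


ranges, rbar, lcl, ucl = r_chart(wp25, D2_FOR_R4, D3_FOR_R4)
outside = [lab for lab, value in ranges.items() if value < lcl or value > ucl]
print(f"R-bar = {rbar:.4f}, LCL = {lcl:.3f}, UCL = {ucl:.3f}")
print("labs outside the limits:", outside)
```

Under these assumptions only Lab 6, whose range is 1.56, exceeds the upper limit.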


Analyzing the malathion data – the WP50 variable

[Figure: Xbar/R Chart for WP50%. X-axis: laboratory (1, 2, 3, 5, 6, 7, 8, 9, 10). Sample Mean panel: Mean=50.88, UCL=51.18, LCL=50.58, with several points marked ‘1’. Sample Range panel: R=0.41, UCL=0.9353, LCL=0.]

•The X-bar chart suggests that there are very large mean differences among labs. This is an indication of systematic lab bias. When compared with the standard proportion of 50%, Lab 3 shows a much higher lab average than the others. Some attention should be paid to Lab 3.

•The R-chart indicates that, in general, no lab has dramatically high within-lab variation. However, Lab 7 has somewhat higher within-lab variation.


Analyzing the Malathion Data – 25% Variable

[Figure: Xbar/R Chart for WP25. X-axis: lab (1, 2, 3, 5, 6, 7, 8, 9, 10). Sample Mean panel: Mean=26.15, UCL=26.53, LCL=25.76, with one point marked ‘1’. Sample Range panel: R=0.5233, UCL=1.194, LCL=0, with one point marked ‘1’.]

The X-bar chart for the WP25% variable also shows that Lab 3 has a significantly high lab average. A closer check is necessary.

The R-chart indicates that the within-lab variation of Lab 6 exceeds the upper limit. A review of Lab 6 for special causes is recommended.


Some General Comments on applying the control charts for monitoring laboratory means and within-lab variations

This X-bar, R-chart technique is valid under the following assumptions:

1. The response variable follows a normal distribution.

2. The same or very similar material is tested by every participating lab.

3. The operation of each lab is independent of the others.

In most laboratory studies, condition (3) is usually satisfied.

Condition (2) may be satisfied if the preparation and distribution of the material and the lab testing are carried out within a reasonable time period.

If more than one material is tested by the participating labs, we can construct a series of control charts to monitor each material. There are also multivariate control charts that can be applied to monitor more than one material at a time and to take the systematic laboratory biases into account.

Youden’s two-sample plots (to be discussed later) can be applied to diagnose the lab performance based on two samples at a time.


Other Control Charts that may be useful for monitoring inter-laboratory testing study

There are many other types of control charts, each developed for some specific purpose.

$\bar{X}$-, S-charts are similar to $\bar{X}$-, R-charts, for continuous, normal responses.

p-chart is for binomial data, particularly for monitoring the proportion of defectives in a sample over time.

U- or C-charts are for Poisson data, particularly for monitoring the # of defects in sampled parts over time.

These charts can be handy for monitoring laboratory testing.

If the proportion of a certain property is the measurement, then a p-chart can be applied.

If the measurement is the # of times that a targeted property is observed in each lab, u- or c-charts can be applied.


How to use Minitab to conduct control chart analysis?

Constructing X-bar and R-charts is straightforward even by hand. However, Minitab can do the charting and much more for us. Here are the steps for constructing the X-bar and R-charts:

1. Go to Stat, choose Control Charts, select Xbar-R…

2. The entries in the dialog box depend on the data arrangement in the worksheet. If the response is in one column and the lab # in another column, enter the response column into ‘single column’ and the lab ID column into ‘sub-group size’.

3. There are four selections. We have shown these before. Click on the ‘Stamp’ selection, and enter the column that contains the correct lab ID or name. The correct lab ID or name will show on the X-axis ticks for easier reading.


Hands-on Activity

Analyze the other variables in the malathion data, and draw your final conclusion about the lab consistency with regard to

(a) Lab averages,

(b) Within-lab variations


Date post:	02-Jan-2016
Upload:	joseph-malone