NMKL PROCEDURE
No. 27 (2013)
Measurement uncertainty in sensory analysis
Page: 1 of 24
Version: 1
Date: April 2013
NORDIC COMMITTEE ON FOOD ANALYSIS
Measurement uncertainty in sensory analysis
Contents
INTRODUCTION .......................................................................................................................... 3
SENSORY ANALYSIS .................................................................................................................... 4
SENSORY DATA ........................................................................................................................... 4
PROFILING DATA ........................................................................................................................ 5
BINOMIAL DATA ......................................................................................................................... 8
QUALITY CONTROL DATA ........................................................................................................... 8
STATISTICAL CONCEPTS AND DEFINITIONS ................................................................................ 8
MEAN VALUE .......................................................................................................................... 9
STANDARD DEVIATION ........................................................................................................... 9
ERROR ................................................................................................................................... 10
MEASUREMENT UNCERTAINTY ............................................................................................ 12
TRUENESS ............................................................................................................................. 12
ACCURACY ............................................................................................................................ 14
ANOVA NOTATION ............................................................................................................... 15
REPEATABILITY ..................................................................................................................... 16
REPRODUCIBILITY ................................................................................................................. 17
PRESENTATION OF MEASUREMENT UNCERTAINTY IN PRACTICE ........................................... 18
REFERENCES ............................................................................................................................. 23
ABBREVIATIONS ....................................................................................................................... 24
This procedure has been prepared by a project group consisting of:
Gunnar Forsgren Iggesund Paperboard, Sweden
Grethe Hyldig DTU Food, National Food Institute, Division of Industrial Food
Research, Technical University of Denmark
Päivi Kähkönen FINAS, Finnish Accreditation Service
Per Lea (project manager) Nofima, Norwegian Institute of Food, Fisheries and Aquaculture
Aðalheiður Ólafsdóttir Matís – Icelandic Food Research
Steffen Solem Eurofins, Norway and Vinmonopolet, Norway
Kolbrún Sveinsdóttir Matís – Icelandic Food Research
This publication may be obtained from the General Secretariat, Nordic Committee on Food Analysis
(NMKL) c/o Norwegian Veterinary Institute, P.O. Box 750 Sentrum, 0106 Oslo, Norway.
NMKL invites all readers of this NMKL publication to submit their points of view on its content to the
General Secretariat.
©NMKL
INTRODUCTION
The importance of measurement uncertainty has gained increased acceptance within most
fields of metrology – of main interest to NMKL are the fields of chemistry and microbiology.
Measurement uncertainty within NMKL has been dealt with in two procedures: NMKL Procedure No 5 (2nd Ed., 2003) and NMKL Procedure No 8 (3rd Ed., 2008), and there is also a
vast amount of literature on the subject within other areas. For sensory analysis, however, the literature is sparse in comparison; in fact, this is a subject that is only rarely touched upon among sensory scientists. One explanation for this may be that the sensory
community considers their data to be of a kind that does not easily lend itself to the treatment
laid down in the NMKL procedures above. The present work is an attempt to rectify this, and
to assure that NMKL has procedures for measurement uncertainty for all its three main areas
where measurements are performed: chemistry, microbiology and sensory.
EA-4/09 quotes ISO 17025 (5.4.6) on the subject of measurement uncertainty:
‘Sensory tests are usually supported by statistical data elaboration which establishes the level
of significance of the results.
Moreover, sensory tests come into the category of those that preclude the rigorous,
metrologically and statically (sic) valid calculation of uncertainty of measurement.
In some cases, when a numerical result is expressed, it could be possible to base the
estimation of uncertainty on repeatability and reproducibility data alone. In these cases the
individual components of uncertainty should be identified and demonstrated to be under
control. The estimation of the uncertainty depends on the method used and the objectives
evaluated and their importance in the quality and significance of the final result.’
One important aspect of the present procedure is to make sensory analysts conscious of the
fact that their data are encumbered with errors, and that this has consequences for the
conclusions drawn. When the concept of uncertainty is introduced to new audiences, a more
or less automatic response is “this is too difficult to take into account”, “this is far too
expensive”, “this is impractical” or similar excuses. We believe – no: we claim – that it is
extremely important to break down the opposition to these views. Even if it takes time to
introduce our recommendations in practice, a small victory is won if we can persuade the
sensory community that the concept of uncertainty is of overwhelming importance in sensory analysis – as it is in all other areas of data interpretation.
It is important to note that the word error in its scientific meaning is not synonymous with
mistake, fault, defect or flaw. Instead of error, the word residual could be – and often is – used.
Based on a survey conducted among sensory laboratories prior to the writing of the present
guidelines, we can report that the sensory community is not as a whole averse to the concept
of measurement uncertainty. This is not very surprising, seeing that sensory analysts are quite
often recruited from fields of analysis where this concept is well-known.
It is our hope that the present guidelines will provide the help and inspiration for sensory
analysts to take measurement uncertainty into consideration in their work.
SENSORY ANALYSIS
Sensory analysis, or sensory evaluation, is evaluating something using one or more of the
human senses: taste, smell, sight, hearing and feeling (tactility). Although the present
procedure is written with analysis of edibles in mind, sensory analysis is also used within
industries as far apart as cars (assessing the odours inside a car, the sound of the engine),
mobile phones (aspects of sound quality), TV (different aspects of picture quality), fabric
(smoothness) and so on.
What is evaluated is termed measurand by GUM (ISO Guide to the Expression of Uncertainty
in Measurement) and in other sources dealing with metrology. In sensory analysis we find it
more natural – and convenient – to use the word sample instead of measurand.
Sensory analysis is performed by a panel of trained panellists, usually 6-15 for profiling (descriptive analysis), normally more than 15 for binomial (triangle, duo-trio) tests, and 3-5 for quality control. In profiling, the measurements from trained panellists are considered to be objective measurements, as opposed to the subjective measurements from surveys, where an assessment of liking or preference for one product over another is measured.
In quality control, the panellists are not necessarily trained in general sensory evaluation, but
rather specialised in a particular food or group of foods: a quality control panellist working in
a beer brewery will normally judge only beer.
Binomial tests are mainly used in consumer analysis, but to some extent also in connection
with sensory panels – particularly when checking new panellists for their ability to recognise the basic tastes (sweet, salty, bitter, acidic, umami). Consumer tests as such are beyond
the scope of this procedure.
SENSORY DATA
Sensory analysis differs from traditional analyses (chemistry, microbiology, physics, …) in
that there are no true or strictly defined values as is the case when we measure temperature
with a thermometer or the length of an object using a ruler. Still, collaborative tests and
proficiency tests give valuable insight into a panel’s performance, and participation in such
programs is highly recommended.
The concepts, or attributes, such as sweetness, chewing resistance and whiteness can be
measured indirectly as sucrose, breaking force and NCS (Natural Colour System, Skandinaviska Färginstitutet AB). Although such measurements are strictly defined chemically or
physically, they are not necessarily relevant for the sensory analyst. One problem is
interaction: mixtures of, for example, sucrose and salt (whether as NaCl or KCl) may be
easily analysed chemically for each of these components, but in humans they will interact.
Also, sensory analysis is concerned with perceived sweetness and saltiness rather than the
amounts of sugar and salt.
The sensory characteristics of food can be complex, so sweetness can be perceived differently
in different foodstuffs although objectively the sucrose content is the same. The complicated
role of interaction between the food components and the presence of small amounts of
strongly flavoured contaminants may account for some of this. Therefore a person's perception of sweetness is of interest, although it cannot be stressed often enough that a trained
sensory panel is very much different from a group of consumers. In this procedure we will
concentrate exclusively on sensory panels and leave validation and verification of consumer
tests to others.
Sensory data can be divided into three main groups: profiling data, binomial data and quality
control data.
PROFILING DATA
Profiling data are data evaluated on a pre-defined interval, for example 1-5, 1-7, 1-9, 0-15 or
0-100. The evaluations can be given as numbers in the interval, either by entering the number
itself in a computerized data collection system, or marking a point on a line interval on a
screen. Sometimes even traditional pen-and-paper systems are used for the data collection.
Independently of the actual system used, the data end up consisting of something like 1, 2, 3,
4, 5, 6, 7, 8, 9 (9 separate values); 1.0, 1.1, 1.2, ..., 8.8, 8.9, 9.0 (81 separate values) or
similarly, depending on the partition and the end-points of the interval. Although the data set
potentially consists of a finite number of different values it is assumed that the underlying
distribution is continuous.
A typical sensory data set may look like the one presented in Table 1.
The dataset can be downloaded from www.nmkl.org > Software & Downloads and www.nofimasensory.org > Download Excel Spreadsheet.
Table 1: Sample data set, with three product varieties (A, B, C), nine panellists (1 – 9), two
replicates (1, 2) and 10 sensory attributes (A1 – A10)
Prod. Pan. Rep. A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
A 1 1 6,4 4,8 3,7 4,8 5,5 4,7 5,6 1,0 6,8 5,8
A 1 2 5,6 4,2 4,2 5,2 6,1 4,8 6,2 1,0 6,3 6,3
A 2 1 4,8 4,7 3,9 5,8 4,1 4,7 4,0 1,8 6,4 4,5
A 2 2 3,3 5,4 4,2 5,6 4,0 4,0 4,1 1,0 7,6 4,8
A 3 1 3,7 4,9 4,6 5,6 2,1 1,8 2,7 1,0 4,9 4,5
A 3 2 4,6 3,1 3,5 5,8 1,0 1,8 2,0 1,0 4,8 1,0
A 4 1 3,3 2,3 3,9 6,3 3,4 2,2 4,0 1,0 4,0 3,5
A 4 2 2,8 3,7 4,0 8,2 3,4 2,9 2,5 1,0 4,6 3,9
A 5 1 4,4 4,8 5,2 7,0 5,0 3,7 3,6 1,0 4,2 3,8
A 5 2 4,5 5,4 5,4 7,2 3,6 3,7 2,6 1,0 3,4 4,9
A 6 1 4,0 5,8 5,7 5,7 2,7 2,1 2,2 1,7 3,0 3,1
A 6 2 2,7 5,0 4,2 5,6 2,3 2,9 2,0 4,9 4,9 2,6
A 7 1 3,3 4,8 6,0 5,4 3,5 4,5 2,3 2,4 4,9 2,8
A 7 2 3,3 5,1 5,0 5,5 2,5 3,0 1,0 2,5 5,7 2,3
A 8 1 3,0 5,4 4,2 4,7 3,6 4,0 2,1 2,9 3,8 2,0
A 8 2 3,0 4,6 3,9 4,6 1,5 3,3 1,0 3,4 5,6 1,0
A 9 1 4,3 4,4 3,4 6,5 5,2 2,3 4,4 1,0 4,0 3,6
A 9 2 3,4 3,8 4,5 7,0 5,6 2,4 4,9 1,0 4,4 4,6
B 1 1 5,6 2,1 2,1 5,5 6,3 4,2 6,4 1,0 6,6 4,8
B 1 2 5,7 3,0 2,3 4,8 6,4 4,8 6,4 1,0 6,4 4,2
B 2 1 5,8 2,2 3,5 5,0 5,0 1,8 4,0 1,0 5,8 5,6
B 2 2 4,6 3,5 2,8 5,8 4,0 4,7 4,4 1,0 6,1 4,8
B 3 1 4,0 2,7 3,4 4,5 1,9 2,2 1,0 2,0 4,8 1,0
B 3 2 5,6 3,5 3,3 4,6 4,4 1,5 4,0 1,0 4,2 4,2
B 4 1 3,8 1,3 2,8 5,7 2,9 2,6 3,0 1,0 3,9 3,7
B 4 2 3,0 3,1 2,9 7,0 3,2 2,8 2,8 1,0 4,5 4,1
B 5 1 5,1 3,8 3,7 6,2 5,6 4,1 3,4 1,0 4,9 5,3
B 5 2 5,8 3,8 3,5 7,1 4,4 3,4 3,2 1,0 3,7 4,9
B 6 1 3,3 1,8 3,2 5,8 3,9 2,9 3,9 1,0 2,7 2,6
B 6 2 3,8 1,4 2,7 5,7 2,5 1,7 1,7 4,8 4,8 4,9
B 7 1 4,5 4,7 4,6 4,3 4,2 4,4 2,9 2,1 5,9 3,9
B 7 2 5,7 4,6 4,2 4,6 3,9 3,7 3,2 2,0 4,1 3,6
B 8 1 5,0 3,3 2,7 5,0 3,5 4,0 2,5 1,8 5,4 1,0
B 8 2 4,8 3,2 2,6 4,6 2,1 2,8 1,6 2,2 7,0 1,0
B 9 1 4,8 2,6 2,3 6,6 3,1 2,0 2,5 1,0 1,4 3,2
B 9 2 4,7 3,1 1,7 7,8 3,2 2,2 2,2 1,0 2,8 2,4
C 1 1 6,4 4,5 3,5 4,8 3,3 3,8 3,7 2,7 6,8 2,4
C 1 2 6,5 4,1 4,0 4,4 2,5 4,9 2,6 5,0 6,8 2,7
C 2 1 4,4 6,0 6,2 5,9 3,2 3,0 3,2 2,9 6,1 2,9
C 2 2 4,6 5,6 6,0 6,0 3,9 3,0 3,9 2,0 5,5 3,8
C 3 1 6,2 5,6 4,0 4,8 1,7 1,6 1,0 1,0 3,3 2,0
C 3 2 6,0 3,5 4,4 5,0 1,0 1,7 1,0 2,3 4,2 1,0
C 4 1 4,9 2,5 3,7 7,0 1,0 2,3 1,0 5,1 4,6 1,0
C 4 2 6,4 2,8 2,7 7,3 1,0 1,8 1,0 3,1 3,1 1,0
C 5 1 5,1 5,0 4,7 5,7 4,3 2,6 2,7 2,1 1,8 2,7
C 5 2 5,0 5,0 5,3 5,1 2,8 2,3 1,0 3,4 2,3 2,1
C 6 1 4,4 5,4 4,6 4,1 2,9 1,8 1,0 4,8 3,2 3,1
C 6 2 4,8 3,5 4,8 4,8 3,0 1,8 1,8 5,0 2,3 4,9
C 7 1 4,4 5,5 5,3 5,5 2,5 3,3 1,0 4,2 4,3 1,0
C 7 2 4,3 5,5 5,3 5,6 1,0 4,0 1,0 5,0 4,6 1,0
C 8 1 6,2 6,5 2,2 5,0 1,0 3,9 1,0 6,8 4,9 1,0
C 8 2 5,2 6,2 3,8 4,6 1,0 3,3 1,0 5,0 4,7 1,0
C 9 1 4,9 4,9 4,9 5,1 2,4 3,8 2,5 2,5 2,5 2,4
C 9 2 5,1 4,9 3,7 4,6 4,2 4,3 3,1 1,0 5,6 3,0
A data-set such as the one in Table 1 is sometimes referred to as a ‘standard sensory data set’.
The structure is simple and is commonly encountered in sensory analyses. In addition to
product varieties, the data could also be classified into a large number of groups: storage
temperature, storage time, recipes, type of feed (for experiments involving animals), type of –
and quantity of – fertilizer (in experiments involving crops), and many others. In addition,
these groups, in statistical terms called effects or factors, can be combined in different ways.
The simplest model is a complete factorial design where all levels of one factor are combined
with each level of all the other factors. Although this has an intuitive appeal and easily lends
itself to simple interpretations, it is restricted by time and cost limitations placed on most
projects. As an example, seven factors, each with only two levels, involve 2⁷ = 128
combinations. This means that 128 different samples will have to be tested, and with 2
replicates and 12 panellists the panel leaders will have to prepare 3072 samples for
evaluation. In other words, the data in Table 1 would have 3072 rows. In addition to the
practical problem of handling many samples, there is also the issue of panellist fatigue.
Consequently the sensory scientist is often faced with the need of reducing a planned
experiment. As an alternative to reducing the number of factors and/or factor levels, this can
also be achieved by using a fractional factorial design. We will not go into the details here,
but only refer the reader to the vast literature on experimental design available in statistical
text-books. Suffice it to say that if one wants to exclude certain factor combinations, this
should not be done in a haphazard way, but only after consulting statistical theory.
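The sample-count arithmetic above can be checked with a few lines of code. This is an illustrative sketch (the level names are invented; the counts are those used in the example):

```python
from itertools import product

# Seven factors, each with two levels: a full factorial design requires
# every combination of levels to be prepared and tasted.
levels = [("low", "high")] * 7
combinations = list(product(*levels))
print(len(combinations))  # 2**7 = 128 distinct samples

# With 12 panellists and 2 replicates, the number of individual
# evaluations (rows in a table like Table 1) multiplies accordingly.
panellists, replicates = 12, 2
print(len(combinations) * panellists * replicates)  # 3072
```

A fractional factorial design would select a carefully chosen subset of these 128 combinations rather than an arbitrary one.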
Sensory profiling data fall in the category interval data, meaning that there is no natural zero
as in a data-set of, for example, lengths, volumes or masses. But we do assume that sensory
differences are meaningful, e.g. the distance between 4 and 3 is the same as the distance
between 7 and 6, since the panellists have been trained to perceive them as such. It follows
that concepts such as mean values and standard deviations, and statistical tests involving
distributions such as the Normal (Gaussian), T-, χ²- and F-distributions, are all applicable.
The temperature scales of Kelvin and Celsius are often used to exemplify the difference between data on a ratio and on an interval scale: °C is an interval scale, and it is meaningless to state “10 °C is twice as hot as 5 °C”. On the other hand, stating that “100 K is twice as hot as 50 K” is meaningful, since K does indeed have a natural zero (0 K = −273.15 °C).
Although a sweetness score of 4 is not “twice as sweet” as a score of 2, we are still able to do computations involving multiplications, e.g. transforming the data using a linear transformation such as Y = aX + b for appropriate values of a and b. Occasionally this has to be done when, due to technical problems, some of the data have to be recorded using pen and
paper. Then the manual data are represented by lengths on a line (e.g. between 0 mm and 150
mm) and are transformed into values on the scale used by the computer system. The
transformation above is also used when combining data sets from different panels using
different end points on their scales.
BINOMIAL DATA
Binomial data are data of the form «A is different from B» (Triangle test, Duo-trio test) or «A is sweeter than B» (Pair test). These tests are often used in sensory evaluation when small differences are expected between samples, during training of panellists to test their ability to detect small differences, and also within the field of consumer tests.
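As a sketch of how binomial data are typically evaluated, the following computes the exact one-sided p-value for a triangle test, where the chance probability of a correct answer is 1/3. The function name and the example numbers are illustrative and not taken from this procedure:

```python
from math import comb

def triangle_test_pvalue(correct: int, n: int, p_chance: float = 1/3) -> float:
    """Exact binomial P(X >= correct) under the null hypothesis of guessing."""
    return sum(comb(n, k) * p_chance**k * (1 - p_chance)**(n - k)
               for k in range(correct, n + 1))

# Example: 13 of 24 panellists identify the odd sample in a triangle test.
p = triangle_test_pvalue(13, 24)
print(round(p, 3))  # ≈ 0.028, below the conventional 0.05 level
```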
QUALITY CONTROL DATA

Quality control data can sometimes come in a form that may look like profiling data (numbers on a scale), but actually the numbers can stand for:
OK product (0)
Minor, insignificant deviation from the standard (1)
Small deviation from the standard (2)
Major deviation from the standard (3)
Gross deviation from the standard (4).
Although these data may be confused with profiling data, we notice that this is not an interval
scale: it is hard to argue that the difference between 0 and 1 is the same as between 2 and 3.
Therefore, the classical statistical theory based on ratio or interval data is no longer
applicable.
STATISTICAL CONCEPTS AND DEFINITIONS
Although they probably are well-known to most analysts, we feel it is relevant to repeat some
important concepts and definitions.
Measurements are designated by capital, indexed letters: the sequence X1, X2, ... , X10 tells us
that there are 10 measurements, for example the following 10 assessments of sweetness of a
juice sample, performed by 10 sensory panellists: 6.4, 6.7, 7.3, 8.6, 5.7, 7.7, 7.4, 6.9, 7.0, 7.1.
Since the amount of data may vary from situation to situation, statisticians prefer to express
this in a more general way: “We have n measurements”, where n may be any number. In the
example above n=10.
MEAN VALUE
Since single values are seldom of interest by themselves, the most common way of summarising them is by computing the mean value, or more precisely: the arithmetic mean¹. This is defined as “the sum of the data divided by the number of them”, or in mathematical terms:

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i

where X1, X2, ..., Xn are the individual values in the data set.
Alternatively, particularly in connection with ANOVA, where several indices are used, the notation

\bar{X}_{ij\cdot} = \frac{1}{K} \sum_{k=1}^{K} X_{ijk}

is often practical, the dot in the subscript indicating which index is averaged over. In the present formula, the mean value over the K replicates for panellist i and variety j is computed.
A different kind of mean – the median – is sometimes used in consumer studies, where the
data often are considered to be on the ordinal level. The median is defined as the middle
observation, in the sense that one half of the data is larger than the median, one half is
smaller. Normally, the ordinary arithmetic mean is used on data from trained panellists.
STANDARD DEVIATION
The standard deviation is a measure of the spread of the data: each value in the data set is compared to the mean value, and the differences are squared and summed, divided by one less than the number of observations, and then the square root is computed:

S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}
For computational and practical reasons, the formula is often given in this form:
¹ There are also geometric and harmonic means, defined as the n'th root of the product of all (n) observations, and the inverse of the arithmetic mean of the inverse values, respectively.
S^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)
When all values are close together (Figure 1), the standard deviation is small; when they vary over a large range (Figure 2), the standard deviation becomes large.
Variance is often used to describe variation in a data set, but this is just the square of the standard deviation: V = S². One reason for using the standard deviation is that it is on the same measurement scale as the original data, whereas the variance is measured in squared units (m² if the original data set is in metres; g² if the original data set is in grams). For many units, square units make no physical sense. Admittedly, this is no serious problem in sensory analysis, since the scale used in profiling is abstract anyway, so it does not matter if the squares of these units are also difficult to relate to. In keeping with tradition, we will be using the standard deviation in the following.
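Both forms of the standard deviation formula can be verified to agree on the ten sweetness assessments used earlier; a minimal sketch:

```python
from math import sqrt

scores = [6.4, 6.7, 7.3, 8.6, 5.7, 7.7, 7.4, 6.9, 7.0, 7.1]
n = len(scores)
x_bar = sum(scores) / n

# Definition form: squared deviations from the mean.
s_def = sqrt(sum((x - x_bar) ** 2 for x in scores) / (n - 1))

# Computational form: sum of squares minus n times the squared mean.
s_comp = sqrt((sum(x * x for x in scores) - n * x_bar ** 2) / (n - 1))

print(round(s_def, 3))   # 0.774
print(round(s_comp, 3))  # 0.774 (the two routes agree)
```

The variance is simply the square of either result, in squared scale units.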
ERROR
In statistics, error is defined as the difference between the measurement result and the true
value, and it is divided into two parts: systematic error and random error. Systematic error
may be caused by:
Readings and/or recordings
A value is erroneously recorded (4.22 instead of 4.2; 6.5 instead of 5.6)
Misplaced decimal point: 14.5 instead of 1.45
Computing errors while transforming from markings on a line to a 1-9 scale
Figure 1 (values close together: small standard deviation)
Figure 2 (values spread over a wide range: large standard deviation)
Samples
Coding errors (mix-up of samples and/or coding labels)
Unintentional treatment effects (bacterial attacks during storage; cooler breaking down)
Panellists
Different use of the scale
Errors in readings or recordings are not very relevant, since the use of pen and paper has been
substituted by electronic recording systems in nearly all sensory laboratories. However, cases
have been known where the electronic equipment has broken down, and some evaluations had
to be recorded using pen and paper followed by manually transferring them to a computer,
often after transformation from millimetres to some specified scale².
Depending on their nature, systematic errors can in some cases be corrected, or new samples
can be provided for additional analyses.
One type of systematic error which has been the subject of many discussions is panellists who
do not use the scale in the same way. One aspect to bear in mind is that in certain situations this is not something to be bothered with: if we have a panellist who systematically gives values a specified number of units above (or below) the others, this is not a problem at all.
Those who are used to the thought process put forth in the ISO Guide to the Expression of
Uncertainty in Measurement (GUM) would probably consider this a serious systematic error.
But since sensory data often are analysed statistically by Analysis of Variance (ANOVA) or
by some multivariate techniques such as Principal Component Analysis (PCA) or Partial
Least Squares (PLS), the actual levels of the single panellists are not really important. The
ANOVA is robust to level differences between the panellists, as long as they rank the samples
in (approximately) the same order. This is a consequence of the fact that the ANOVA
formulas are made up of mean values and other linear combinations of the original data, and
constant level differences do not influence the relationships between the groups of data. The
same goes for PCA and PLS models when these are based on mean values over panellists and
replicates, and when based on individual scores, the data are often standardised in the
software anyway. Therefore, as long as the results are used only for internal comparisons of
groups and not for estimates to be compared with other panels, the effect of this particular
type of systematic error is no impediment for a satisfactory analysis.
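The robustness to constant level differences described above is easy to demonstrate: adding a fixed offset to one panellist's scores leaves every between-product comparison unchanged. A minimal sketch with invented numbers:

```python
# Two panellists score three products; panellist 2 sits one unit higher
# on the scale but ranks the products identically.
p1 = {"A": 4.0, "B": 5.0, "C": 6.5}
p2 = {key: value + 1.0 for key, value in p1.items()}

def pairwise_diffs(scores):
    # Between-product differences are what ANOVA-style comparisons rest on.
    return {(u, v): scores[v] - scores[u]
            for u in scores for v in scores if u < v}

print(pairwise_diffs(p1) == pairwise_diffs(p2))  # True: the offset cancels
```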
Random error is defined in GUM as the “result of a measurement minus the mean that would
result from an infinite number of measurements of the same measurand carried out under
repeatability conditions.”
Random errors are caused by the fact that the panellists are living persons and cannot be
expected to evaluate the same sample identically on two or more occasions. Also: the samples
² Data measured in mm along a line of length 150 mm are transformed to a 1-9 scale by the formula X_{Trans} = (8X_{Orig} + 150)/150, where X_{Trans} and X_{Orig} (measured in mm) are the transformed and original data, respectively. The general formula for transforming from [a,b] to [A,B] is: X_{Trans} = (X_{Orig}(B - A) + Ab - aB)/(b - a)
themselves represent a larger population, and apples (or sausages, or bottles of wine,
etc.,) are not 100% identical although they have the same origin and/or have received the
same treatment. The random error (or residual) helps us express the level of uncertainty in the
data – noting in passing that uncertainty is yet another word whose scientific meaning is
different from its everyday meaning.
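The scale transformation given in footnote 2 is a one-line computation. The function below is an illustrative sketch of the general [a,b] to [A,B] mapping (the name rescale is ours):

```python
def rescale(x: float, a: float, b: float, A: float, B: float) -> float:
    """Linearly map a value x from the interval [a, b] onto [A, B]."""
    return A + (x - a) * (B - A) / (b - a)

# Pen-and-paper marks in mm on a 150 mm line, mapped onto a 1-9 scale.
print(rescale(0, 0, 150, 1, 9))    # 1.0 (left end of the line)
print(rescale(75, 0, 150, 1, 9))   # 5.0 (midpoint)
print(rescale(150, 0, 150, 1, 9))  # 9.0 (right end)
```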
MEASUREMENT UNCERTAINTY
‘The word “uncertainty” means doubt, and thus in its broadest sense “uncertainty of measurement” means doubt about the validity of the results of a measurement. Because of the lack of
different words for this general concept of uncertainty and the specific quantities that provide
quantitative measures of the concept, for example, the standard deviation, it is necessary to
use the word “uncertainty” in these two different senses.’ (GUM, 2.2.1)
GUM also states (in Chapter 1 Scope):
'1.2 This Guide is primarily concerned with the expression of uncertainty in the measurement of a well-defined physical quantity – the measurand – that can be characterised by an
essentially unique value.'
Bearing in mind our description of sensory data above, GUM at first glance does not cover
the field of sensory analysis. However, we seek comfort in:
'1.4 This Guide provides general rules for evaluating and expressing uncertainty in measurements rather than detailed, technology-specific instructions. (...) It may therefore be necessary
to develop particular standards based on this Guide to deal with the problems peculiar to
specific fields of measurement or with the various uses of quantitative expressions of
uncertainty. These standards may be simplified versions of this Guide, but should include the
detail that is appropriate to the level of accuracy and complexity of the measurements and
uses addressed.'
The sentence ‘problems peculiar to specific fields of measurement or with the various uses of
quantitative expressions of uncertainty’ referred to above is actually what prompted the
present work.
The definition of uncertainty of measurement used in GUM is:
‘Parameter, associated with the result of a measurement that characterizes the dispersion of
the values that could reasonably be attributed to the measurand.’ (GUM, 2.2.3)
TRUENESS
ISO 5725-1 defines trueness as the closeness of agreement between the average value
obtained from a large series of test results and an accepted reference value. Almost the same
thing is formulated in VIM (ISO International Vocabulary of Basic and General Terms in Metrology, 2nd Ed., 1993): ‘Closeness of agreement between the average of an infinite number of replicate measured quantity values and a reference quantity value.’ The term trueness is
only seldom used in statistical texts, where the term bias is preferred. But because of the
negative connotations attached to this word, the more positive term trueness has won
precedence within most professions working with practical measurements.
A well-known figure demonstrating the difference between trueness and precision is the
shooting target depicted in Figure 3
Figure 3: The relationship between precision and trueness
Here, the shots in a) are not very precise, but on average quite close to the bull’s eye. The shots in b) illustrate a situation with shots both precise and on the mark. In c) the shots are widely spread and off the mark: the values are neither true nor precise. In d) they are still off
the mark, but rather precise. Figure 3 also illustrates another important aspect with analyses,
whether in sensory or other scientific disciplines: taking only one, or only a few, measure-
ments may give very misleading results. Each of the single shots in a) taken by itself gives a
rather poor result. As a consequence, a sensory panel with panellists not necessarily agreeing
in detail, may still give acceptable results (again referring to a)).
Figure 3 can also be used to illustrate other concepts: in a) the distances between the bull’s
eye and each of the shots are defined as the shots’ error. The trueness is best illustrated in d)
as the distance from the bull’s eye to the average of the shots. Thus: the concept of error is
related to a single measurement, whereas trueness is related to an average of several
measurements.
It may be argued that the distinction between trueness and precision that holds for chemical
and microbiological data is irrelevant for sensory profiling data since there are no true values
of a sensory attribute. Although the profiling scale may seem comparable to chemical or
physical measurements, the sensory profiling scale is quite abstract and cannot even be given
absolute definitions (although this statement is disputed by some): it is meaningless to talk
about an “absolute sweetness” that can be used on all kinds of products. Despite this, it may still be relevant to talk about true values in one specific field of sensory analysis, namely quality control.
Trueness is also a meaningful concept for some binomial data: in a triangle test a panellist is asked to tell which of three samples is different from the other two, so the panellist is either right or wrong. Precision can be defined in relation to how many such correct decisions the panellist makes.
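As an illustration (the exact binomial test shown here is the standard analysis of triangle-test counts, though this text does not prescribe it, and the function name is our own), the probability of obtaining at least a given number of correct answers under pure guessing (chance level 1/3) can be computed as:

```python
from math import comb

def triangle_p_value(correct: int, n: int, p0: float = 1/3) -> float:
    """Exact one-sided binomial p-value: the probability of at least
    `correct` right answers out of n triangle tests if the panellist
    is purely guessing (chance level p0 = 1/3)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(correct, n + 1))

# e.g. a panellist answering 12 of 24 triangle tests correctly
p = triangle_p_value(12, 24)
```

A small p-value indicates discrimination better than chance; how many correct decisions are needed for significance depends on the number of tests n.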
ACCURACY
The standard definition (GUM) is: ‘B.2.14 closeness of the agreement between the result of a
measurement and a true value of the measurand.’
Not surprisingly, this is often confused with the definition of trueness. Again referring to
Figure 3, going in the direction from c) to b), that is, from lower left to upper right, we have
increasing accuracy and decreasing uncertainty. Figure 4 highlights this fact and could be superimposed on Figure 3.
Consequently, accuracy can be decomposed into two components: trueness and precision.
Unfortunately, numerical values cannot be attached to either of these concepts.
Figure 4: Conceptual representation of trueness, accuracy, uncertainty and precision.
ANOVA NOTATION
At this point it is convenient to introduce the usual univariate ANOVA model related to data
sets like the one found in Table 1, taking into account only one sensory attribute. The follow-
ing notation will be used, with I panellists, J varieties and K replicates (in Table 1: I=9, J=3,
K=2):
X_ijk = μ + α_i + β_j + αβ_ij + e_ijk,    i = 1, …, I;  j = 1, …, J;  k = 1, …, K
Xijk is the evaluation from panellist i for variety j in replicate k, μ represents the total mean, αi
the effect of panellist i, βj the effect of variety j, αβij the effect of the panellist × variety
interaction and eijk the error (or rather: the residual), in plain words: the variation in the data
that cannot be attributed to any of the effects in the model. The notation used in different
statistical textbooks may vary in details, but basically they are equivalent; we follow Lea,
Næs and Rødbotten (1997).
If more factors are added to the experiment, the model is extended correspondingly, with 3-
factor or higher-order interactions and/or nested factors, according to the specific design.
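For illustration only (not part of the Procedure; the function name and array convention are our own), the sums of squares of this balanced two-way model can be computed directly from a data array of shape I × J × K:

```python
import numpy as np

def two_way_anova_ss(X):
    """Sums of squares for X_ijk = mu + a_i + b_j + ab_ij + e_ijk,
    where X has shape (I, J, K): panellists x varieties x replicates."""
    I, J, K = X.shape
    grand = X.mean()
    pan = X.mean(axis=(1, 2))      # panellist means (over varieties, replicates)
    var = X.mean(axis=(0, 2))      # variety means (over panellists, replicates)
    cell = X.mean(axis=2)          # panellist-by-variety cell means
    ss_pan = J * K * ((pan - grand) ** 2).sum()
    ss_var = I * K * ((var - grand) ** 2).sum()
    ss_int = K * ((cell - pan[:, None] - var[None, :] + grand) ** 2).sum()
    ss_err = ((X - cell[:, :, None]) ** 2).sum()   # residual
    return ss_pan, ss_var, ss_int, ss_err
```

For a balanced design these four terms add up exactly to the total sum of squares Σ(X_ijk − X̄)².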
An important distinction in ANOVA models that influences the computations of the F-tests is
that of random and fixed effects. An effect is considered fixed if we are interested in the
particular levels as such. In an experiment including different varieties of carrots, we are
normally interested in comparing just those particular varieties – that is why they are included
in the experiment. Thus, the variety effect is a fixed effect. On the other hand, we are
normally not interested in the particular assessors in the panel, we want to generalise the
results outside the particular assessors used. Thus, the assessor effect is a random effect. The
assessors are normally drawn, although not completely at random, from a population of
potential assessors, all having demonstrated a minimum level of competence in assessing
relevant products. Although this does not qualify for being random in the strictest sense of the
word, it is considered sufficient for practical purposes. The simplest way of handling fixed
versus random effects, is to define as fixed those effects where our interest is in the particular
levels of the effect, for instance the different carrot varieties in the example above. All other
effects (and the interactions where they appear) are random.
Another important distinction is that of crossed and nested (or hierarchical) effects. Normally the
assessor and variety effects are crossed, meaning that each assessor evaluates each variety. In
an experiment comparing feeding regimes for farmed fish, one fish only receives one type of
feed, and so the fish effect is nested within the feed effect. It is important to note that it is not always possible to determine, just by looking at the data, whether an effect is nested or not. The analyst must have access to additional information about the design and the practical aspects of the experiment to decide on the proper statistical analysis. Of course, all these details should be decided upon prior to the sensory sessions.
REPEATABILITY
Repeatability is a measure of how well the analysis can be repeated. In chemistry, this is
defined in GUM as:
'B.2.15 Closeness of the agreement between the results of successive measurements of the
same measurand carried out under the same conditions of measurement.'
In sensory analysis this is what is called “proper replicates”. Ideally, this is obtained for a
panellist when (s)he evaluates the same physical unit two or more times over a short period of
time. GUM does not clearly state what is meant by ‘a short period of time’, but in sensory
analysis this is often interpreted as ‘on the same day’, or even ‘within a few days’ if the
experiment is large. GUM also states that the same measurand should be evaluated. Since
sensory measurements involving taste and some textural attributes are destructive by nature,
the same physical sample cannot be tested more than once. Although the same sample in
principle could be evaluated repeatedly, and even by all the assessors in turn, this is seldom done. In most situations it is more practical to have the assessors evaluate the same sample for all the attributes in one sitting.
A problem might be that “the same sample” is close to impossible to obtain. Ideally, a sample consists of a homogeneous unit which can be divided into any number of identical subsamples, to be served to the panellists in as many replicates as one wishes. However, this is possible almost only with liquids (properly stirred), and not even with all liquids. If there are no limitations and the sample can be prepared in any size or volume, we are home free. But
this is not always the case. In addition, with for example wine, it may be important to evaluate
several bottles, as there often is between-bottle variation. To get an estimate of this variation
we cannot take several bottles and stir them into one large, homogeneous sample. With
coffee, it may be of interest to estimate the variation between different brews or different
coffee machines. With many foods, it is natural for the panellists to evaluate one complete
physical unit: grapes, nuts and the like are examples that spring to mind. Apples, hot dogs and hamburgers are foods where the complete unit is usually served, although the panellist is not required to devour all of it. In this case, the unit may or may not be divided into subsamples served to different panellists. Sometimes it is possible to mash or grind the sample into a relatively homogeneous patty, but again: it is often of interest to estimate the variation between units. When the units are so small that only one panellist can evaluate each of them, the variation between the units is confounded with the variation between the panellists. This means that the variation in the data may be caused by the units themselves, by the panellists, or both, and unfortunately we have no way of knowing which.
Since the measure of real importance is the average value over the panellists, the repeatability of a single panellist is not our main concern in daily work. It can be argued that what we would like to keep under control is the repeatability of the panel: in an ideal set of circumstances we would like the panel to perform in the same manner over time. This is a particular problem with an internal panel, that is, a panel consisting of people with other main duties within the organisation. For obvious economic reasons, many panels consist of laboratory technicians, secretaries, scientists, students, and so on, who are called upon when a sensory evaluation is to take place. Some of them are able to leave their daily tasks to serve on the panel, others are not. Consequently, the composition of the panel will vary from experiment to experiment, and often within an experiment.
On the basis of the definition of a measurement in GUM (‘B.2.5 Measurement. Set of operations having the object of determining the value of a quantity’), both a single assessor’s data and an average over all assessors’ data can be viewed as a measurement.
Depending on the design, these averages may or may not be computed separately for each
replicate. If the replicate factor is related to time, so that the assessors always evaluate
replicate 1 before replicate 2, this computation makes sense. But if the replicates are randomly
assigned over the whole experiment, then replicate 1 for one assessor may have nothing in
common with replicate 1 for another assessor. Consequently these averages are meaningless
and arbitrary.
There are several intuitive ways of expressing the standard deviation – all emphasising
different aspects of uncertainty, and they are in general not available from the ANOVA table.
These measures will be treated in later sections.
REPRODUCIBILITY
Reproducibility can be seen as a kind of extended repeatability and always comes with some kind of qualification, such as reproducibility over panellists (within a given panel) or reproducibility over panels, the latter being the most relevant. The GUM definition is:
‘B.2.16 Closeness of the agreement between the results of measurements of the same
measurand carried out under changed conditions of measurement.’
GUM gives a list of possible changed conditions, of which only measuring instrument and
location (both represented by the panel in sensory analysis) and time are relevant to us. The
reproducibility can only be properly estimated through a proficiency testing scheme.
PRESENTATION OF MEASUREMENT UNCERTAINTY IN PRACTICE
To explore the standard deviation, repeatability and reproducibility further, we will look at
different ways by which standard deviations of data from a sensory experiment can be
computed.
Treating sensory replicates as if they were proper replicates in the sense that they are
evaluated within a short time range and on identical material, and in the framework of the
ANOVA model defined earlier, a specific panellist's repeatability is measured by the standard
deviation over replicates. The repeatability is connected to a specified attribute and specified
variety. Thus, for panellist i and variety j,
S(i,j) = √[ (1/(K−1)) · Σ_{k=1}^{K} (X_ijk − X̄_ij.)² ]

is the standard deviation expressing the repeatability according to GUM’s basic definition.
Except for the notation that reflects the fact that we are considering the “standard sensory data
set” of Table 1, the formula above is identical to the definition in the chapter STANDARD
DEVIATION. In passing, we note that for K=2, i.e. 2 replicates, the formula simplifies to:
S(i,j) = |X_ij1 − X_ij2| / √2

- the absolute value (the positive value) of the difference between the two replicates, divided by √2.
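As a numerical sketch (with made-up scores, not values from Table 1), the K = 2 shortcut can be checked against the general formula:

```python
import numpy as np

# Two replicates for one panellist/variety cell (hypothetical scores)
x = np.array([5.3, 6.1])

# S(i,j): sample standard deviation with the 1/(K-1) divisor
s = np.std(x, ddof=1)

# For K = 2 this equals |x1 - x2| / sqrt(2)
shortcut = abs(x[0] - x[1]) / np.sqrt(2)
```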
For attribute A5 in Table 1, Table 2 shows the standard deviations for the different panellists
and products:
Table 2: Standard deviations for each product and panellist separately

            Product
Panellist   A      B      C
1           0,42   0,07   0,57
2           0,07   0,71   0,49
3           0,78   1,77   0,49
4           0,00   0,21   0,00
5           0,99   0,85   1,06
6           0,28   0,99   0,07
7           0,71   0,21   1,06
8           1,48   0,99   0,00
9           0,28   0,07   1,27
Similar tables could be computed for all attributes.
Although not touched upon in GUM, we may extend the repeatability to include the different
varieties, resulting in:
S(i) = √[ (1/(J(K−1))) · Σ_{j=1}^{J} Σ_{k=1}^{K} (X_ijk − X̄_ij.)² ]
which is the average repeatability for panellist i, averaged over products. (Note that this average is formed by averaging the variances S²(i,j) and then extracting the square root, not by averaging the standard deviations directly.) This is a summary measure indicating how well an assessor can repeat him/herself, averaged over the products.
With only 2 replicates, S(i) reduces to:

S(i) = √[ (1/(2J)) · Σ_{j=1}^{J} (X_ij1 − X_ij2)² ]
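Both expressions can be checked numerically; the replicate pairs below are invented for illustration (J = 3 varieties, K = 2 replicates):

```python
import numpy as np

# One pair of replicates per variety for a single panellist
cells = np.array([[5.3, 6.1],
                  [4.0, 4.3],
                  [7.2, 7.2]])

# Average the variances S^2(i,j) over varieties, then take the root
s_i = np.sqrt(np.mean(np.var(cells, axis=1, ddof=1)))

# K = 2 shortcut: summed squared replicate differences over 2J
J = cells.shape[0]
diff = cells[:, 0] - cells[:, 1]
shortcut = np.sqrt((diff ** 2).sum() / (2 * J))
```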
For attribute A5, we get Table 3:
Table 3: Average standard deviations for each panellist

Panellist   S(i)
1           0,41
2           0,50
3           1,15
4           0,12
5           0,97
6           0,60
7           0,75
8           1,03
9           0,75
Performing an ANOVA for each panellist separately, we can obtain Table 3 by computing

S(i) = √[ SS(Error) / (J(K−1)) ]

where SS(Error) – the error sum-of-squares, or residual sum of squares – is taken from the ANOVA output available in all statistical software providing traditional statistical analyses.
Averaging the squares of the entries in Table 3 will give us the error term from an ANOVA
model incorporating varieties, assessors and replicates.
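A minimal sketch of this equivalence, assuming a per-panellist data matrix of J varieties by K replicates (the helper is our own, not from the Procedure):

```python
import numpy as np

def s_i_from_anova(Xi):
    """S(i) recovered from a one-way ANOVA per panellist: Xi has shape
    (J, K); SS(Error) is the squared deviation from the variety (cell)
    means, and its degrees of freedom are J*(K-1)."""
    J, K = Xi.shape
    ss_error = ((Xi - Xi.mean(axis=1, keepdims=True)) ** 2).sum()
    return np.sqrt(ss_error / (J * (K - 1)))
```

This returns the same value as averaging the variances S²(i,j) directly and taking the square root.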
Normally the repeatability at the panellist level is not provided in sensory reports; its main interest lies in the fact that it can be used during training and as a quality control of the panellists. This last application assumes that the assessors evaluate a narrow selection of products, since comparing the assessors’ performance for an attribute over different food categories is hardly relevant.
Other intuitive ways of looking at the measurement uncertainty exist. One approach which is
relevant if the replicates are related to time (and should be treated as a specific factor in the
ANOVA model, not only as a part of the residual) is to average over assessors in each
replicate and compute the standard deviation over the replicates:
S(j) = √[ (1/(K−1)) · Σ_{k=1}^{K} (X̄_.jk − X̄_.j.)² ]
giving:
Variety   S(j)
A         0,40
B         0,18
C         0,15
Averaging over varieties, we get:

S = √[ (1/(J(K−1))) · Σ_{j=1}^{J} Σ_{k=1}^{K} (X̄_.jk − X̄_.j.)² ]

Or, in this particular example: S = 0,27 ( = √[(0,40² + 0,18² + 0,15²)/3] )
This uncertainty measure is, however, not often seen in practical work.
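The pooling by “averaging the variances, then taking the root” can be verified with the S(j) values from the small table above:

```python
import numpy as np

s_j = np.array([0.40, 0.18, 0.15])   # S(j) for varieties A, B, C
S = np.sqrt(np.mean(s_j ** 2))       # average the variances, then the root
# S comes out at approximately 0.27, matching the value quoted in the text
```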
A slightly different approach is to use the ANOVA table to estimate some of the measurement
uncertainties above. Analysing attribute A5 in Table 1 we end up with the ANOVA table
(Table 4):
Source                DF    SS       MS       F        p         E(MS)
Variety                2    24.144   12.072   10.079    0.0015   σ²_E + 2σ²_AV + 18σ²_V
Panellist              8    53.378    6.672   11.456   <0.0005   σ²_E + 6σ²_A
Panellist × Variety   16    19.163    1.198    2.056    0.048    σ²_E + 2σ²_AV
Error                 27    15.725    0.582                      σ²_E
where:
DF = Degrees of freedom
SS = Sum of squares
MS = Mean square ( = SS/DF)
F = MS for any Source (Variety, Panellist or Panellist × Variety) divided by the appropriate error term (the error term being the Panellist × Variety interaction for the Variety F-value, and Error for the Panellist and Panellist × Variety F-values3)
p = The probability of obtaining an F-value at least as large as the one actually observed, given that the hypothesis (‘null hypothesis’) of no influence of the Source in question is true.
E(MS) = Expected Mean Square, the expected value of MS expressed in terms of variance components4.
This type of table is produced by any run-of-the-mill general statistical package, except for
the E(MS) column which is only featured in a few packages.
The standard deviations, and variances, for the different Sources are estimated by starting with the bottom line of the table and replacing all the σ²’s in the E(MS) column with their estimates:
The variance representing the extent to which the panellists repeat themselves is then estimated by MS(Error), or 0.582. Or, if we want to put it on a standard deviation form: 0.763. Note that this standard deviation, as well as all other conclusions derived from Table 4, applies to all the assessors. To draw similar conclusions based on a single assessor, the ANOVA must be performed for each assessor separately.
In the line corresponding to Panellist × Variety, we replace σ²_E by the value just found (0.582) to obtain the estimate 0.308 for σ²_AV, or on standard deviation form: 0.555. In
3 A certain indetermination exists: some software programs use the Variety × Panellist interaction as the error term also for the Panellist effect. We will not probe further into this material here but only refer to the statistical literature.
4 σ_V, σ_A, σ_AV and σ_E represent the square roots of the variances of the Variety, Panellist, Variety × Panellist and Error effects, respectively. How to find the E(MS) values is outside the scope of this NMKL Procedure.
similar fashion we get the estimate 1.015 for σ²_A (on standard deviation form: 1.007). For σ²_V we get 0.604 (on standard deviation form: 0.777).
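The back-substitution described above can be written out with the mean squares from Table 4 and the design sizes I = 9, J = 3, K = 2 (variable names are ours):

```python
import math

# Mean squares from Table 4
ms_v, ms_a, ms_av, ms_e = 12.072, 6.672, 1.198, 0.582
I, J, K = 9, 3, 2

var_e = ms_e                         # sigma^2_E
var_av = (ms_av - var_e) / K         # from E(MS) = sigma^2_E + 2*sigma^2_AV
var_a = (ms_a - var_e) / (J * K)     # from E(MS) = sigma^2_E + 6*sigma^2_A
var_v = (ms_v - ms_av) / (I * K)     # from E(MS) = sigma^2_E + 2*sigma^2_AV + 18*sigma^2_V

sd_e, sd_av, sd_a, sd_v = (math.sqrt(v) for v in (var_e, var_av, var_a, var_v))
```

Solving from the bottom line upwards reproduces the estimates 0.582, 0.308, 1.015 and 0.604 (standard deviations 0.763, 0.555, 1.007 and 0.777) quoted in the text.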
Still another measure of uncertainty is to consider the error term for the F-test in the ANOVA
output – this depends on the model selected, but equals the variety by panellist interaction in
the default ANOVA model used here. In more complicated models, incorporating other
random effects in addition to the assessor effect, and/or nested effects, this error term may
become quite complex and even impossible to compute. In such cases it can often be
approximated by a linear combination – the Satterthwaite approximation (Satterthwaite,
1946).
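A sketch of the Satterthwaite formula for the approximate degrees of freedom of a linear combination Σ cᵢ·MSᵢ of independent mean squares (the function name is ours):

```python
def satterthwaite_df(ms, df, coefs):
    """Approximate degrees of freedom for the linear combination
    sum(c_i * MS_i) of independent mean squares (Satterthwaite, 1946):
    df = (sum c_i*MS_i)^2 / sum((c_i*MS_i)^2 / df_i)."""
    combo = sum(c * m for c, m in zip(coefs, ms))
    denom = sum((c * m) ** 2 / d for c, m, d in zip(coefs, ms, df))
    return combo ** 2 / denom
```

With a single mean square and coefficient 1, the formula reproduces that term’s own degrees of freedom, as it should.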
In addition to the variation due to the effects in the model, other sources must also be taken into account. Sampling is one important source, both the selection of the test samples from a larger population and the sub-sampling and preparation taking place in the laboratory. The variations in the material depend heavily on the nature of the test samples. The ideal material for a sensory test (on foods) is a homogeneous liquid which can be divided into subsamples as close to identical as possible before being served. For non-food materials, this homogeneity is probably easier to obtain than in most food samples. At the other extreme we find materials where the panellists have to evaluate different physical units (such as carrots). In such cases, uncertainty may be reduced by taking several units, cutting them into smaller pieces, mixing them all together and taking samples from this mixture to present to the panellists. In the carrot example, we could cut the carrots into small cubes (approximately 1 cm × 1 cm in size) and serve them in beakers of appropriate size. That way, the carrots are, in a manner of speaking, averaged in the mouths of the panellists.
Another problem that is hard to safeguard against – and which usually follows Murphy’s Law
of appearing when it is at its most inconvenient – is physical disturbances such as noise, foul
odours, or a fire alarm going off by mistake. This list can be extended indefinitely.
REFERENCES
EA-4/09, 2003: Accreditation for Sensory Testing Laboratories
ISO Guide to the Expression of Uncertainty in Measurement, 1993
ISO International Vocabulary of Basic and General Terms in Metrology, 2nd Ed., 1993
ISO 5725-1:1994: Accuracy (Trueness and Precision) of Measurement Methods and Results, Part 1: General Principles and Definitions
Joint Committee for Guides in Metrology (JCGM): Evaluation of measurement data – Guide to the expression of uncertainty in measurement (GUM with minor corrections), 2008
NMKL Procedure No. 4, 2nd Ed., 2005: Validation of chemical analytical methods
NMKL Procedure No. 5, 2nd Ed., 2003: Estimation and expression of measurement uncertainty in chemical analysis
NMKL Procedure No. 6, 1998: Generelle retningslinier for kvalitetssikring af sensoriske laboratorier [Eng.: General guidelines for quality assurance of sensory laboratories] (Available in Danish only)
NMKL Procedure No. 8, 3rd Ed., 2008: Measurement of uncertainty in quantitative microbiological examination of foods
NMKL Procedure No. 14, 2004: SENSVAL: Guidelines for internal control in sensory analysis laboratories
NMKL Procedure No. 16, 2005: Sensory quality control
NMKL Procedure No. 20, 2007: Evaluation of results from qualitative methods
Lea P, Næs T, Rødbotten M (1997): Analysis of Variance for Sensory Data. John Wiley & Sons. ISBN 0-471-96750-5
Satterthwaite F E (1946): An approximate distribution of estimates of variance components. Biometrics Bulletin, 2, 110-114.
ABBREVIATIONS
ANOVA: Analysis of Variance
EA: European Accreditation
GUM: Guide to the Expression of Uncertainty in Measurement
ISO: International Organization for Standardization
JCGM: Joint Committee for Guides in Metrology
NCS: Natural Colour System
NMKL: Nordisk MetodikKomite for Levnedsmidler/Nordic Committee on Food Analysis
PCA: Principal Component Analysis
PLS: Partial Least Squares