NMKL PROCEDURE
No. 27 (2013)
Measurement uncertainty in sensory analysis
Page: 1 of 24
Version: 1
Date: April 2013
NORDIC COMMITTEE ON FOOD ANALYSIS
Measurement uncertainty in sensory analysis
Contents
INTRODUCTION .......................................................................................................................... 3
SENSORY ANALYSIS .................................................................................................................... 4
SENSORY DATA ........................................................................................................................... 4
PROFILING DATA ........................................................................................................................ 5
BINOMIAL DATA ......................................................................................................................... 8
QUALITY CONTROL DATA ........................................................................................................... 8
STATISTICAL CONCEPTS AND DEFINITIONS ................................................................................ 8
MEAN VALUE .......................................................................................................................... 9
STANDARD DEVIATION ........................................................................................................... 9
ERROR ................................................................................................................................... 10
MEASUREMENT UNCERTAINTY ............................................................................................ 12
TRUENESS ............................................................................................................................. 12
ACCURACY ............................................................................................................................ 14
ANOVA NOTATION ............................................................................................................... 15
REPEATABILITY ..................................................................................................................... 16
REPRODUCIBILITY ................................................................................................................. 17
PRESENTATION OF MEASUREMENT UNCERTAINTY IN PRACTICE ........................................... 18
REFERENCES ............................................................................................................................. 23
ABBREVIATIONS ....................................................................................................................... 24
This procedure has been prepared by a project group consisting of:
Gunnar Forsgren Iggesund Paperboard, Sweden
Grethe Hyldig DTU Food, National Food Institute, Division of Industrial Food
Research, Technical University of Denmark
Päivi Kähkönen FINAS, Finnish Accreditation Service
Per Lea (project manager) Nofima, Norwegian Institute of Food, Fisheries and Aquaculture
Aðalheiður Ólafsdóttir Matís – Icelandic Food Research
Steffen Solem Eurofins, Norway and Vinmonopolet, Norway
Kolbrún Sveinsdóttir Matís – Icelandic Food Research
This publication may be obtained from the General Secretariat, Nordic Committee on Food Analysis
(NMKL) c/o Norwegian Veterinary Institute, P.O. Box 750 Sentrum, 0106 Oslo, Norway.
NMKL invites all readers of this NMKL publication to submit their points of view on its content to the
General Secretariat.
©NMKL
INTRODUCTION
The importance of measurement uncertainty has gained increased acceptance within most
fields of metrology – of main interest to NMKL are the fields of chemistry and microbiology.
Measurement uncertainty within NMKL has been dealt with in two procedures: NMKL Procedure No 5 (2nd Ed., 2003) and NMKL Procedure No 8 (3rd Ed., 2008), and there is also a
vast amount of literature on the subject within other areas. For sensory analysis, however, the literature is sparse in comparison; in fact, this is a subject that is only rarely touched upon among sensory scientists. One explanation for this may be that the sensory
community considers their data to be of a kind that does not easily lend itself to the treatment
laid down in the NMKL procedures above. The present work is an attempt to rectify this, and
to assure that NMKL has procedures for measurement uncertainty for all its three main areas
where measurements are performed: chemistry, microbiology and sensory.
EA-4/09 quotes ISO 17025 (5.4.6) on the subject of measurement uncertainty:
‘Sensory tests are usually supported by statistical data elaboration which establishes the level
of significance of the results.
Moreover, sensory tests come into the category of those that preclude the rigorous,
metrologically and statically (sic) valid calculation of uncertainty of measurement.
In some cases, when a numerical result is expressed, it could be possible to base the
estimation of uncertainty on repeatability and reproducibility data alone. In these cases the
individual components of uncertainty should be identified and demonstrated to be under
control. The estimation of the uncertainty depends on the method used and the objectives
evaluated and their importance in the quality and significance of the final result.’
One important aspect of the present procedure is to make sensory analysts conscious of the
fact that their data are encumbered with errors, and that this has consequences for the
conclusions drawn. When the concept of uncertainty is introduced to new audiences, a more
or less automatic response is “this is too difficult to take into account”, “this is far too
expensive”, “this is impractical” or similar excuses. We believe – no: we claim – that it is
extremely important to break down the opposition to these views. Even if it takes time to
introduce our recommendations in practice, a small victory is won if we can persuade the
sensory community that the concept of uncertainty is of overwhelming importance in sensory analysis – as it is in all other areas of data interpretation.
It is important to note that the word error in its scientific meaning is not synonymous with
mistake, fault, defect or flaw. Instead of error, the word residual could be – and often is – used.
Based on a survey conducted among sensory laboratories prior to the writing of the present
guidelines, we can report that the sensory community is not as a whole averse to the concept
of measurement uncertainty. This is not very surprising, seeing that sensory analysts are quite
often recruited from fields of analysis where this concept is well-known.
It is our hope that the present guidelines will provide the help and inspiration for sensory
analysts to take measurement uncertainty into consideration in their work.
SENSORY ANALYSIS
Sensory analysis, or sensory evaluation, is evaluating something using one or more of the
human senses: taste, smell, sight, hearing and feeling (tactility). Although the present
procedure is written with analysis of edibles in mind, sensory analysis is also used within
industries as far apart as cars (assessing the odours inside a car, the sound of the engine),
mobile phones (aspects of sound quality), TV (different aspects of picture quality), fabric
(smoothness) and so on.
What is evaluated is termed measurand by GUM (ISO Guide to the Expression of Uncertainty
in Measurement) and in other sources dealing with metrology. In sensory analysis we find it
more natural – and convenient – to use the word sample instead of measurand.
Sensory analysis is performed by a panel of trained panellists, usually 6-15 for profiling (descriptive analysis), normally more than 15 for binomial (triangle, duo-trio) tests, and 3-5 for quality control. In profiling, the measurements from trained panellists are considered to be objective measurements, as opposed to the subjective measurements from surveys, where an assessment of liking or preference for one product over another is measured.
In quality control, the panellists are not necessarily trained in general sensory evaluation, but
rather specialised in a particular food or group of foods: a quality control panellist working in
a beer brewery will normally judge only beer.
Binomial tests are mainly used in consumer analysis, but to some extent also in connection
with sensory panels – particularly when checking new panellists for their ability to recognise the basic tastes (sweet, salty, bitter, acidic, umami). Consumer tests as such are beyond
the scope of this procedure.
SENSORY DATA
Sensory analysis differs from traditional analyses (chemistry, microbiology, physics, …) in
that there are no true or strictly defined values as is the case when we measure temperature
with a thermometer or the length of an object using a ruler. Still, collaborative tests and
proficiency tests give valuable insight into a panel’s performance, and participation in such
programs is highly recommended.
The concepts, or attributes, such as sweetness, chewing resistance and whiteness can be
measured indirectly as sucrose, breaking force and NCS (Natural Colour System, Skandinaviska Färginstitutet AB). Although such measurements are strictly defined chemically or
physically, they are not necessarily relevant for the sensory analyst. One problem is
interaction: mixtures of, for example, sucrose and salt (whether as NaCl or KCl) may be
easily analysed chemically for each of these components, but in humans they will interact.
Also, sensory analysis is concerned with perceived sweetness and saltiness rather than the
amounts of sugar and salt.
The sensory characteristics of food can be complex, so sweetness can be perceived differently
in different foodstuffs although objectively the sucrose content is the same. The complicated
role of interaction between the food components and the presence of small amounts of
strongly flavoured contaminants may account for some of this. Therefore a person's perception of sweetness is of interest, although it cannot be stressed often enough that a trained
sensory panel is very much different from a group of consumers. In this procedure we will
concentrate exclusively on sensory panels and leave validation and verification of consumer
tests to others.
Sensory data can be divided into three main groups: profiling data, binomial data and quality
control data.
PROFILING DATA
Profiling data are data evaluated on a pre-defined interval, for example 1-5, 1-7, 1-9, 0-15 or
0-100. The evaluations can be given as numbers in the interval, either by entering the number
itself in a computerized data collection system, or marking a point on a line interval on a
screen. Sometimes even traditional pen-and-paper systems are used for the data collection.
Independently of the actual system used, the data end up consisting of something like 1, 2, 3,
4, 5, 6, 7, 8, 9 (9 separate values); 1.0, 1.1, 1.2, ..., 8.8, 8.9, 9.0 (81 separate values) or
similarly, depending on the partition and the end-points of the interval. Although the data set
potentially consists of a finite number of different values it is assumed that the underlying
distribution is continuous.
A typical sensory data set may look like the one presented in Table 1.
The dataset can be downloaded from www.nmkl.org > Software & Downloads and www.nofimasensory.org > Download Excel Spreadsheet.
Table 1: Sample data set, with three product varieties (A, B, C), nine panellists (1 – 9), two
replicates (1, 2) and 10 sensory attributes (A1 – A10)
Prod. Pan. Rep. A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
A 1 1 6,4 4,8 3,7 4,8 5,5 4,7 5,6 1,0 6,8 5,8
A 1 2 5,6 4,2 4,2 5,2 6,1 4,8 6,2 1,0 6,3 6,3
A 2 1 4,8 4,7 3,9 5,8 4,1 4,7 4,0 1,8 6,4 4,5
A 2 2 3,3 5,4 4,2 5,6 4,0 4,0 4,1 1,0 7,6 4,8
A 3 1 3,7 4,9 4,6 5,6 2,1 1,8 2,7 1,0 4,9 4,5
A 3 2 4,6 3,1 3,5 5,8 1,0 1,8 2,0 1,0 4,8 1,0
A 4 1 3,3 2,3 3,9 6,3 3,4 2,2 4,0 1,0 4,0 3,5
A 4 2 2,8 3,7 4,0 8,2 3,4 2,9 2,5 1,0 4,6 3,9
A 5 1 4,4 4,8 5,2 7,0 5,0 3,7 3,6 1,0 4,2 3,8
A 5 2 4,5 5,4 5,4 7,2 3,6 3,7 2,6 1,0 3,4 4,9
A 6 1 4,0 5,8 5,7 5,7 2,7 2,1 2,2 1,7 3,0 3,1
A 6 2 2,7 5,0 4,2 5,6 2,3 2,9 2,0 4,9 4,9 2,6
A 7 1 3,3 4,8 6,0 5,4 3,5 4,5 2,3 2,4 4,9 2,8
A 7 2 3,3 5,1 5,0 5,5 2,5 3,0 1,0 2,5 5,7 2,3
A 8 1 3,0 5,4 4,2 4,7 3,6 4,0 2,1 2,9 3,8 2,0
A 8 2 3,0 4,6 3,9 4,6 1,5 3,3 1,0 3,4 5,6 1,0
A 9 1 4,3 4,4 3,4 6,5 5,2 2,3 4,4 1,0 4,0 3,6
A 9 2 3,4 3,8 4,5 7,0 5,6 2,4 4,9 1,0 4,4 4,6
B 1 1 5,6 2,1 2,1 5,5 6,3 4,2 6,4 1,0 6,6 4,8
B 1 2 5,7 3,0 2,3 4,8 6,4 4,8 6,4 1,0 6,4 4,2
B 2 1 5,8 2,2 3,5 5,0 5,0 1,8 4,0 1,0 5,8 5,6
B 2 2 4,6 3,5 2,8 5,8 4,0 4,7 4,4 1,0 6,1 4,8
B 3 1 4,0 2,7 3,4 4,5 1,9 2,2 1,0 2,0 4,8 1,0
B 3 2 5,6 3,5 3,3 4,6 4,4 1,5 4,0 1,0 4,2 4,2
B 4 1 3,8 1,3 2,8 5,7 2,9 2,6 3,0 1,0 3,9 3,7
B 4 2 3,0 3,1 2,9 7,0 3,2 2,8 2,8 1,0 4,5 4,1
B 5 1 5,1 3,8 3,7 6,2 5,6 4,1 3,4 1,0 4,9 5,3
B 5 2 5,8 3,8 3,5 7,1 4,4 3,4 3,2 1,0 3,7 4,9
B 6 1 3,3 1,8 3,2 5,8 3,9 2,9 3,9 1,0 2,7 2,6
B 6 2 3,8 1,4 2,7 5,7 2,5 1,7 1,7 4,8 4,8 4,9
B 7 1 4,5 4,7 4,6 4,3 4,2 4,4 2,9 2,1 5,9 3,9
B 7 2 5,7 4,6 4,2 4,6 3,9 3,7 3,2 2,0 4,1 3,6
B 8 1 5,0 3,3 2,7 5,0 3,5 4,0 2,5 1,8 5,4 1,0
B 8 2 4,8 3,2 2,6 4,6 2,1 2,8 1,6 2,2 7,0 1,0
B 9 1 4,8 2,6 2,3 6,6 3,1 2,0 2,5 1,0 1,4 3,2
B 9 2 4,7 3,1 1,7 7,8 3,2 2,2 2,2 1,0 2,8 2,4
C 1 1 6,4 4,5 3,5 4,8 3,3 3,8 3,7 2,7 6,8 2,4
C 1 2 6,5 4,1 4,0 4,4 2,5 4,9 2,6 5,0 6,8 2,7
C 2 1 4,4 6,0 6,2 5,9 3,2 3,0 3,2 2,9 6,1 2,9
C 2 2 4,6 5,6 6,0 6,0 3,9 3,0 3,9 2,0 5,5 3,8
C 3 1 6,2 5,6 4,0 4,8 1,7 1,6 1,0 1,0 3,3 2,0
C 3 2 6,0 3,5 4,4 5,0 1,0 1,7 1,0 2,3 4,2 1,0
C 4 1 4,9 2,5 3,7 7,0 1,0 2,3 1,0 5,1 4,6 1,0
C 4 2 6,4 2,8 2,7 7,3 1,0 1,8 1,0 3,1 3,1 1,0
C 5 1 5,1 5,0 4,7 5,7 4,3 2,6 2,7 2,1 1,8 2,7
C 5 2 5,0 5,0 5,3 5,1 2,8 2,3 1,0 3,4 2,3 2,1
C 6 1 4,4 5,4 4,6 4,1 2,9 1,8 1,0 4,8 3,2 3,1
C 6 2 4,8 3,5 4,8 4,8 3,0 1,8 1,8 5,0 2,3 4,9
C 7 1 4,4 5,5 5,3 5,5 2,5 3,3 1,0 4,2 4,3 1,0
C 7 2 4,3 5,5 5,3 5,6 1,0 4,0 1,0 5,0 4,6 1,0
C 8 1 6,2 6,5 2,2 5,0 1,0 3,9 1,0 6,8 4,9 1,0
C 8 2 5,2 6,2 3,8 4,6 1,0 3,3 1,0 5,0 4,7 1,0
C 9 1 4,9 4,9 4,9 5,1 2,4 3,8 2,5 2,5 2,5 2,4
C 9 2 5,1 4,9 3,7 4,6 4,2 4,3 3,1 1,0 5,6 3,0
A data-set such as the one in Table 1 is sometimes referred to as a ‘standard sensory data set’.
The structure is simple and is commonly encountered in sensory analyses. In addition to
product varieties, the data could also be classified into a large number of groups: storage
temperature, storage time, recipes, type of feed (for experiments involving animals), type of –
and quantity of – fertilizer (in experiments involving crops), and many others. In addition,
these groups, in statistical terms called effects or factors, can be combined in different ways.
The simplest model is a complete factorial design where all levels of one factor are combined
with each level of all the other factors. Although this has an intuitive appeal and easily lends
itself to simple interpretations, it is restricted by time and cost limitations placed on most
projects. As an example, seven factors, each with only two levels, involve 2⁷ = 128
combinations. This means that 128 different samples will have to be tested, and with 2
replicates and 12 panellists the panel leaders will have to prepare 3072 samples for
evaluation. In other words, the data in Table 1 would have 3072 rows. In addition to the
practical problem of handling many samples, there is also the issue of panellist fatigue.
Consequently the sensory scientist is often faced with the need of reducing a planned
experiment. As an alternative to reducing the number of factors and/or factor levels, this can
also be achieved by using a fractional factorial design. We will not go into the details here,
but only refer the reader to the vast literature on experimental design available in statistical
text-books. Suffice it to say that if one wants to exclude certain factor combinations, this
should not be done in a haphazard way, but only after consulting statistical theory.
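The sample-count arithmetic above can be checked with a few lines of code. This is an illustrative sketch (the level names are invented; the counts are those used in the example):

```python
from itertools import product

# Seven factors, each with two levels: a full factorial design requires
# every combination of levels to be prepared and tasted.
levels = [("low", "high")] * 7
combinations = list(product(*levels))
print(len(combinations))  # 2**7 = 128 distinct samples

# With 12 panellists and 2 replicates, the number of individual
# evaluations (rows in a table like Table 1) multiplies accordingly.
panellists, replicates = 12, 2
print(len(combinations) * panellists * replicates)  # 3072
```

A fractional factorial design would select a carefully chosen subset of these 128 combinations rather than an arbitrary one.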
Sensory profiling data fall in the category interval data, meaning that there is no natural zero
as in a data-set of, for example, lengths, volumes or masses. But we do assume that sensory
differences are meaningful, e.g. the distance between 4 and 3 is the same as the distance
between 7 and 6, since the panellists have been trained to perceive them as such. It follows
that concepts such as mean values and standard deviations, and statistical tests involving
distributions such as the Normal (Gaussian), T-, χ²- and F-distributions, are all applicable.
The temperature scales of Kelvin and Celsius are often used to exemplify the difference between data on a ratio and on an interval scale: °C is an interval scale, and it is meaningless to state “10 °C is twice as hot as 5 °C”. On the other hand, stating that “100 K is twice as hot as 50 K” is meaningful, since K does indeed have a natural zero (0 K = −273.15 °C).
Although a sweetness score of 4 is not “twice as sweet” as a score of 2, we are still able to do computations involving multiplications, e.g. transforming the data using a linear transformation such as Y = aX + b for appropriate values of a and b. Occasionally this has to be done when, due to technical problems, some of the data have to be recorded using pen and
paper. Then the manual data are represented by lengths on a line (e.g. between 0 mm and 150
mm) and are transformed into values on the scale used by the computer system. The
transformation above is also used when combining data sets from different panels using
different end points on their scales.
BINOMIAL DATA
Binomial data are data of the form «A is different from B» (Triangle test, Duo-trio test) or «A is sweeter than B» (Pair test). These tests are often used in sensory evaluation when small differences are expected between samples, during training of panellists to test their ability to detect small differences, and also within the field of consumer tests.
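As a sketch of how binomial data are typically evaluated, the following computes the exact one-sided p-value for a triangle test, where the chance probability of a correct answer is 1/3. The function name and the example numbers are illustrative and not taken from this procedure:

```python
from math import comb

def triangle_test_pvalue(correct: int, n: int, p_chance: float = 1/3) -> float:
    """Exact binomial P(X >= correct) under the null hypothesis of guessing."""
    return sum(comb(n, k) * p_chance**k * (1 - p_chance)**(n - k)
               for k in range(correct, n + 1))

# Example: 13 of 24 panellists identify the odd sample in a triangle test.
p = triangle_test_pvalue(13, 24)
print(round(p, 3))  # ≈ 0.028, below the conventional 0.05 level
```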
QUALITY CONTROL DATA

Quality control data can sometimes come in a form that may look like profiling data (numbers on a scale), but actually the numbers can stand for:
OK product (0)
Minor, insignificant deviation from the standard (1)
Small deviation from the standard (2)
Major deviation from the standard (3)
Gross deviation from the standard (4).
Although these data may be confused with profiling data, we notice that this is not an interval
scale: it is hard to argue that the difference between 0 and 1 is the same as between 2 and 3.
Therefore, the classical statistical theory based on ratio or interval data is no longer
applicable.
STATISTICAL CONCEPTS AND DEFINITIONS
Although they probably are well-known to most analysts, we feel it is relevant to repeat some
important concepts and definitions.
Measurements are designated by capital, indexed letters: the sequence X1, X2, ... , X10 tells us
that there are 10 measurements, for example the following 10 assessments of sweetness of a
juice sample, performed by 10 sensory panellists: 6.4, 6.7, 7.3, 8.6, 5.7, 7.7, 7.4, 6.9, 7.0, 7.1.
Since the amount of data may vary from situation to situation, statisticians prefer to express
this in a more general way: “We have n measurements”, where n may be any number. In the
example above n=10.
MEAN VALUE
Since single values are seldom of interest by themselves, the most common way of summarising them is by computing the mean value, or more precisely: the arithmetic mean¹. This is defined as “the sum of the data divided by the number of them”, or in mathematical terms:

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i

where X1, X2, ..., Xn are the individual values in the data set.
Alternatively, particularly in connection with ANOVA, where several indices are used, the notation

\bar{X}_{ij\cdot} = \frac{1}{K} \sum_{k=1}^{K} X_{ijk}

is often practical, the dot in the subscript indicating which index is averaged over. In the present formula, the mean value over the K replicates for panellist i and variety j is computed.
A different kind of mean – the median – is sometimes used in consumer studies, where the
data often are considered to be on the ordinal level. The median is defined as the middle
observation, in the sense that one half of the data is larger than the median, one half is
smaller. Normally, the ordinary arithmetic mean is used on data from trained panellists.
STANDARD DEVIATION
The standard deviation is a measure of the spread of the data: each value in the data set is compared to the mean value, and the differences are squared and summed, divided by one less than the number of observations, and then the square root is computed:

S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}
For computational and practical reasons, the formula is often given in this form:
¹ There are also geometric and harmonic means, defined as the n'th root of the product of all (n) observations, and the inverse of the arithmetic mean of the inverse values, respectively.
S^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)
When all values are close together (Figure 1), the standard deviation is small; when they vary over a large range (Figure 2), the standard deviation becomes large.
Variance is often used to describe variation in a data set, but this is just the square of the standard deviation: V = S². One reason for using the standard deviation is that it is on the same measurement scale as the original data, whereas the variance is measured in squared units (m² if the original data set is in metres; g² if the original data set is in grams). For many units, square units make no physical sense. Admittedly, this is no serious problem in sensory analysis, since the scale used in profiling is abstract anyway, so it does not matter if the squares of these units are also difficult to relate to. In keeping with tradition, we will be using the standard deviation in the following.
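Both forms of the standard deviation formula can be verified to agree on the ten sweetness assessments used earlier; a minimal sketch:

```python
from math import sqrt

scores = [6.4, 6.7, 7.3, 8.6, 5.7, 7.7, 7.4, 6.9, 7.0, 7.1]
n = len(scores)
x_bar = sum(scores) / n

# Definition form: squared deviations from the mean.
s_def = sqrt(sum((x - x_bar) ** 2 for x in scores) / (n - 1))

# Computational form: sum of squares minus n times the squared mean.
s_comp = sqrt((sum(x * x for x in scores) - n * x_bar ** 2) / (n - 1))

print(round(s_def, 3))   # 0.774
print(round(s_comp, 3))  # 0.774 (the two routes agree)
```

The variance is simply the square of either result, in squared scale units.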
ERROR
In statistics, error is defined as the difference between the measurement result and the true
value, and it is divided into two parts: systematic error and random error. Systematic error
may be caused by:
Readings and/or recordings
A value is erroneously recorded (4.22 instead of 4.2; 6.5 instead of 5.6)
Misplaced decimal point: 14.5 instead of 1.45
Computing errors while transforming from markings on a line to a 1-9 scale
Figure 1 (values close together: small standard deviation)
Figure 2 (values spread over a wide range: large standard deviation)
Samples
Coding errors (mix-up of samples and/or coding labels)
Unintentional treatment effects (bacterial attacks during storage; cooler breaking down)
Panellists
Different use of the scale
Errors in readings or recordings are not very relevant, since the use of pen and paper has been
substituted by electronic recording systems in nearly all sensory laboratories. However, cases
have been known where the electronic equipment has broken down, and some evaluations had
to be recorded using pen and paper followed by manually transferring them to a computer,
often after transformation from millimetres to some specified scale².
Depending on their nature, systematic errors can in some cases be corrected, or new samples
can be provided for additional analyses.
One type of systematic error which has been the subject of many discussions is panellists who
do not use the scale in the same way. One aspect to bear in mind is that in certain situations this is not something to be bothered with: if we have a panellist who systematically gives values a specified number of units above (or below) the others, this is not a problem at all.
Those who are used to the thought process put forth in the ISO Guide to the Expression of
Uncertainty in Measurement (GUM) would probably consider this a serious systematic error.
But since sensory data often are analysed statistically by Analysis of Variance (ANOVA) or
by some multivariate techniques such as Principal Component Analysis (PCA) or Partial
Least Squares (PLS), the actual levels of the single panellists are not really important. The
ANOVA is robust to level differences between the panellists, as long as they rank the samples
in (approximately) the same order. This is a consequence of the fact that the ANOVA
formulas are made up of mean values and other linear combinations of the original data, and
constant level differences do not influence the relationships between the groups of data. The
same goes for PCA and PLS models when these are based on mean values over panellists and
replicates, and when based on individual scores, the data are often standardised in the
software anyway. Therefore, as long as the results are used only for internal comparisons of
groups and not for estimates to be compared with other panels, the effect of this particular
type of systematic error is no impediment for a satisfactory analysis.
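The robustness to constant level differences described above is easy to demonstrate: adding a fixed offset to one panellist's scores leaves every between-product comparison unchanged. A minimal sketch with invented numbers:

```python
# Two panellists score three products; panellist 2 sits one unit higher
# on the scale but ranks the products identically.
p1 = {"A": 4.0, "B": 5.0, "C": 6.5}
p2 = {key: value + 1.0 for key, value in p1.items()}

def pairwise_diffs(scores):
    # Between-product differences are what ANOVA-style comparisons rest on.
    return {(u, v): scores[v] - scores[u]
            for u in scores for v in scores if u < v}

print(pairwise_diffs(p1) == pairwise_diffs(p2))  # True: the offset cancels
```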
Random error is defined in GUM as the “result of a measurement minus the mean that would
result from an infinite number of measurements of the same measurand carried out under
repeatability conditions.”
Random errors are caused by the fact that the panellists are living persons and cannot be
expected to evaluate the same sample identically on two or more occasions. Also: the samples
² Data measured in mm along a line of length 150 mm are transformed to a 1-9 scale by the formula X_{Trans} = (8X_{Orig} + 150)/150, where X_{Trans} and X_{Orig} (measured in mm) are the transformed and original data, respectively. The general formula for transforming from [a,b] to [A,B] is: X_{Trans} = (X_{Orig}(B - A) + Ab - aB)/(b - a)
themselves represent a larger population, and apples (or sausages, or bottles of wine,
etc.,) are not 100% identical although they have the same origin and/or have received the
same treatment. The random error (or residual) helps us express the level of uncertainty in the
data – noting in passing that uncertainty is yet another word whose scientific meaning is
different from its everyday meaning.
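The scale transformation given in footnote 2 is a one-line computation. The function below is an illustrative sketch of the general [a,b] to [A,B] mapping (the name rescale is ours):

```python
def rescale(x: float, a: float, b: float, A: float, B: float) -> float:
    """Linearly map a value x from the interval [a, b] onto [A, B]."""
    return A + (x - a) * (B - A) / (b - a)

# Pen-and-paper marks in mm on a 150 mm line, mapped onto a 1-9 scale.
print(rescale(0, 0, 150, 1, 9))    # 1.0 (left end of the line)
print(rescale(75, 0, 150, 1, 9))   # 5.0 (midpoint)
print(rescale(150, 0, 150, 1, 9))  # 9.0 (right end)
```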
MEASUREMENT UNCERTAINTY
‘The word “uncertainty” means doubt, and thus in its broadest sense “uncertainty of measurement” means doubt about the validity of the results of a measurement. Because of the lack of
different words for this general concept of uncertainty and the specific quantities that provide
quantitative measures of the concept, for example, the standard deviation, it is necessary to
use the word “uncertainty” in these two different senses.’ (GUM, 2.2.1)
GUM also states (in Chapter 1 Scope):
'1.2 This Guide is primarily concerned with the expression of uncertainty in the measurement of a well-defined physical quantity – the measurand – that can be characterised by an
essentially unique value.'
Bearing in mind our description of sensory data above, GUM at first glance does not cover
the field of sensory analysis. However, we seek comfort in:
'1.4 This Guide provides general rules for evaluating and expressing uncertainty in measurements rather than detailed, technology-specific instructions. (...) It may therefore be necessary
to develop particular standards based on this Guide to deal with the problems peculiar to
specific fields of measurement or with the various uses of quantitative expressions of
uncertainty. These standards may be simplified versions of this Guide, but should include the
detail that is appropriate to the level of accuracy and complexity of the measurements and
uses addressed.'
The sentence ‘problems peculiar to specific fields of measurement or with the various uses of
quantitative expressions of uncertainty’ referred to above is actually what prompted the
present work.
The definition of uncertainty of measurement used in GUM is:
‘Parameter, associated with the result of a measurement that characterizes the dispersion of
the values that could reasonably be attributed to the measurand.’ (GUM, 2.2.3)
TRUENESS
ISO 5725-1 defines trueness as the closeness of agreement between the average value
obtained from a large series of test results and an accepted reference value. Almost the same
thing is formulated in VIM (ISO International Vocabulary of Basic and General Terms in Metrology, 2nd Ed., 1993): ‘Closeness of agreement between the average of an infinite number of replicate measured quantity values and a reference quantity value.’ The term trueness is
only seldom used in statistical texts, where the term bias is preferred. But because of the
negative connotations attached to this word, the more positive term trueness has won
precedence within most professions working with practical measurements.
A well-known figure demonstrating the difference between trueness and precision is the
shooting target depicted in Figure 3
Figure 3: The relationship between precision and trueness
Here, the shots in a) are not very precise, but on average quite close to the bull’s eye. The shots in b) illustrate a situation with shots both precise and on the mark. In c) the shots are widely spread and off the mark: the values are neither true nor precise. In d) they are still off
the mark, but rather precise. Figure 3 also illustrates another important aspect with analyses,
whether in sensory or other scientific disciplines: taking only one, or only a few, measure-
ments may give very misleading results. Each of the single shots in a) taken by itself gives a
rather poor result. As a consequence, a sensory panel with panellists not necessarily agreeing
in detail, may still give acceptable results (again referring to a)).
Figure 3 can also be used to illustrate other concepts: in a) the distances between the bull’s
eye and each of the shots are defined as the shots’ error. The trueness is best illustrated in d)
as the distance from the bull’s eye to the average of the shots. Thus: the concept of error is
related to a single measurement, whereas trueness is related to an average of several
measurements.
It may be argued that the distinction between trueness and precision that holds for chemical
and microbiological data is irrelevant for sensory profiling data since there are no true values
of a sensory attribute. Although the profiling scale may seem comparable to chemical or
physical measurements, the sensory profiling scale is quite abstract and cannot even be given
absolute definitions (although this statement is disputed by some): it is meaningless to talk
about an “absolute sweetness” that can be used on all kinds of products. Despite this, it may still be relevant to talk about true values in one specific field of sensory analysis, namely quality control.
Trueness is also a meaningful concept for some binomial data: in a triangle test a panellist is asked to tell which of three samples is different from the other two, so the panellist is either right or wrong. Precision can be defined in relation to how many such correct decisions the panellist makes.
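As an illustration (the exact binomial test shown here is the standard analysis of triangle-test counts, though this text does not prescribe it, and the function name is our own), the probability of obtaining at least a given number of correct answers under pure guessing (chance level 1/3) can be computed as:

```python
from math import comb

def triangle_p_value(correct: int, n: int, p0: float = 1/3) -> float:
    """Exact one-sided binomial p-value: the probability of at least
    `correct` right answers out of n triangle tests if the panellist
    is purely guessing (chance level p0 = 1/3)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(correct, n + 1))

# e.g. a panellist answering 12 of 24 triangle tests correctly
p = triangle_p_value(12, 24)
```

A small p-value indicates discrimination better than chance; how many correct decisions are needed for significance depends on the number of tests n.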
ACCURACY
The standard definition (GUM) is: ‘B.2.14 closeness of the agreement between the result of a
measurement and a true value of the measurand.’
Not surprisingly, this is often confused with the definition of trueness. Again referring to
Figure 3, going in the direction from c) to b), that is, from lower left to upper right, we have
increasing accuracy and decreasing uncertainty. Figure 4 highlights this fact and could be superimposed on Figure 3.
Consequently, accuracy can be decomposed into two components: trueness and precision.
Unfortunately, numerical values cannot be attached to either of these concepts.
Figure 4: Conceptual representation of trueness, accuracy, uncertainty and precision.
ANOVA NOTATION
At this point it is convenient to introduce the usual univariate ANOVA model related to data
sets like the one found in Table 1, taking into account only one sensory attribute. The follow-
ing notation will be used, with I panellists, J varieties and K replicates (in Table 1: I=9, J=3,
K=2):
X_ijk = μ + α_i + β_j + αβ_ij + e_ijk,    i = 1, …, I;  j = 1, …, J;  k = 1, …, K
Xijk is the evaluation from panellist i for variety j in replicate k, μ represents the total mean, αi
the effect of panellist i, βj the effect of variety j, αβij the effect of the panellist × variety
interaction and eijk the error (or rather: the residual), in plain words: the variation in the data
that cannot be attributed to any of the effects in the model. The notation used in different
statistical textbooks may vary in details, but basically they are equivalent; we follow Lea,
Næs and Rødbotten (1997).
If more factors are added to the experiment, the model is extended correspondingly, with 3-
factor or higher-order interactions and/or nested factors, according to the specific design.
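For illustration only (not part of the Procedure; the function name and array convention are our own), the sums of squares of this balanced two-way model can be computed directly from a data array of shape I × J × K:

```python
import numpy as np

def two_way_anova_ss(X):
    """Sums of squares for X_ijk = mu + a_i + b_j + ab_ij + e_ijk,
    where X has shape (I, J, K): panellists x varieties x replicates."""
    I, J, K = X.shape
    grand = X.mean()
    pan = X.mean(axis=(1, 2))      # panellist means (over varieties, replicates)
    var = X.mean(axis=(0, 2))      # variety means (over panellists, replicates)
    cell = X.mean(axis=2)          # panellist-by-variety cell means
    ss_pan = J * K * ((pan - grand) ** 2).sum()
    ss_var = I * K * ((var - grand) ** 2).sum()
    ss_int = K * ((cell - pan[:, None] - var[None, :] + grand) ** 2).sum()
    ss_err = ((X - cell[:, :, None]) ** 2).sum()   # residual
    return ss_pan, ss_var, ss_int, ss_err
```

For a balanced design these four terms add up exactly to the total sum of squares Σ(X_ijk − X̄)².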
An important distinction in ANOVA models that influences the computations of the F-tests is
that of random and fixed effects. An effect is considered fixed if we are interested in the
particular levels as such. In an experiment including different varieties of carrots, we are
normally interested in comparing just those particular varieties – that is why they are included
in the experiment. Thus, the variety effect is a fixed effect. On the other hand, we are
normally not interested in the particular assessors in the panel, we want to generalise the
results outside the particular assessors used. Thus, the assessor effect is a random effect. The
assessors are normally drawn, although not completely at random, from a population of
potential assessors, all having demonstrated a minimum level of competence in assessing
relevant products. Although this does not qualify for being random in the strictest sense of the
word, it is considered sufficient for practical purposes. The simplest way of handling fixed
versus random effects, is to define as fixed those effects where our interest is in the particular
levels of the effect, for instance the different carrot varieties in the example above. All other
effects (and the interactions where they appear) are random.
Another important distinction is that of crossed and nested (or hierarchical) effects. Normally the
assessor and variety effects are crossed, meaning that each assessor evaluates each variety. In
an experiment comparing feeding regimes for farmed fish, one fish only receives one type of
feed, and so the fish effect is nested within the feed effect. It is important to note that it is not always possible to determine, just by looking at the data, whether an effect is nested or not. The analyst must have access to additional information about the design and the practical aspects of the experiment to decide on the proper statistical analysis. Of course, all these details should be decided upon prior to the sensory sessions.
REPEATABILITY
Repeatability is a measure of how well the analysis can be repeated. In chemistry, this is
defined in GUM as:
'B.2.15 Closeness of the agreement between the results of successive measurements of the
same measurand carried out under the same conditions of measurement.'
In sensory analysis this is what is called “proper replicates”. Ideally, this is obtained for a
panellist when (s)he evaluates the same physical unit two or more times over a short period of
time. GUM does not clearly state what is meant by ‘a short period of time’, but in sensory
analysis this is often interpreted as ‘on the same day’, or even ‘within a few days’ if the
experiment is large. GUM also states that the same measurand should be evaluated. Since
sensory measurements involving taste and some textural attributes are destructive by nature,
the same physical sample cannot be tested more than once. Although the same sample in
principle could be evaluated repeatedly, and even by all the assessors in turn, this is seldom done. In most situations it is more practical to have the assessors evaluate the same sample for all the attributes in one sitting.
A problem might be that “the same sample” is close to impossible to obtain. Ideally, a sample consists of a homogeneous unit which can be divided into any number of identical subsamples, to be served to the panellists in as many replicates as one wishes. However, this is possible almost only with liquids (properly stirred), and not even with all liquids. If there are no limitations and the sample can be prepared in any size or volume, we are home free. But
this is not always the case. In addition, with for example wine, it may be important to evaluate
several bottles, as there often is between-bottle variation. To get an estimate of this variation
we cannot take several bottles and stir them into one large, homogeneous sample. With
coffee, it may be of interest to estimate the variation between different brews or different
coffee machines. With many foods, it is natural for the panellists to evaluate one complete
physical unit: grapes, nuts and the like are examples that spring to mind. Apples, hot dogs and hamburgers are foods where the complete unit is usually served, although the panellist is not required to devour all of it. In this case, the unit may or may not be divided into subsamples served to different panellists. Sometimes it is possible to mash or grind the sample into a relatively homogeneous patty, but again: it is often of interest to estimate the variation between units. When the units are so small that only one panellist can evaluate each of them, the variation between the units is confounded with the variation between the panellists. This means that the variation in the data may be caused by the units themselves, by the panellists, or both, and unfortunately we have no way of knowing which.
Since the measure of real importance is the average value over the panellists, the repeatability of a single panellist is not our main concern in daily work. It can be argued that what we would like to keep under control is the repeatability of the panel: in an ideal set of circumstances we would like the panel to perform in the same manner over time. This is a particular problem with an internal panel, that is, a panel consisting of people with other main duties within the organisation. For obvious economic reasons, many panels consist of laboratory technicians, secretaries, scientists, students, and so on, who are called upon when a sensory evaluation is to take place. Some of them are able to leave their daily tasks to serve on the panel, others are not. Consequently, the composition of the panel will vary from experiment to experiment, and often within an experiment.
On the basis of the definition of a measurement in GUM (‘B.2.5 Measurement. Set of operations having the object of determining the value of a quantity’), both a single assessor’s data and an average over all assessors’ data can be viewed as a measurement.
Depending on the design, these averages may or may not be computed separately for each
replicate. If the replicate factor is related to time, so that the assessors always evaluate
replicate 1 before replicate 2, this computation makes sense. But if the replicates are randomly
assigned over the whole experiment, then replicate 1 for one assessor may have nothing in
common with replicate 1 for another assessor. Consequently these averages are meaningless
and arbitrary.
There are several intuitive ways of expressing the standard deviation – all emphasising
different aspects of uncertainty, and they are in general not available from the ANOVA table.
These measures will be treated in later sections.
REPRODUCIBILITY
Reproducibility can be seen as a kind of extended repeatability and always comes with some kind of qualification, such as reproducibility over panellists (within a given panel) or reproducibility over panels, the latter being the most relevant. The GUM definition is:
‘B.2.16 Closeness of the agreement between the results of measurements of the same
measurand carried out under changed conditions of measurement.’
GUM gives a list of possible changed conditions, of which only measuring instrument and
location (both represented by the panel in sensory analysis) and time are relevant to us. The
reproducibility can only be properly estimated through a proficiency testing scheme.
PRESENTATION OF MEASUREMENT UNCERTAINTY IN PRACTICE
To explore the standard deviation, repeatability and reproducibility further, we will look at
different ways by which standard deviations of data from a sensory experiment can be
computed.
Treating sensory replicates as if they were proper replicates in the sense that they are
evaluated within a short time range and on identical material, and in the framework of the
ANOVA model defined earlier, a specific panellist's repeatability is measured by the standard
deviation over replicates. The repeatability is connected to a specified attribute and specified
variety. Thus, for panellist i and variety j,
S(i,j) = √[ (1/(K−1)) · Σ_{k=1}^{K} (X_ijk − X̄_ij.)² ]

is the standard deviation expressing the repeatability according to GUM’s basic definition.
Except for the notation that reflects the fact that we are considering the “standard sensory data
set” of Table 1, the formula above is identical to the definition in the chapter STANDARD
DEVIATION. In passing, we note that for K=2, i.e. 2 replicates, the formula simplifies to:
S(i,j) = |X_ij1 − X_ij2| / √2

- the absolute value (the positive value) of the difference between the two replicates, divided by √2.
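As a numerical sketch (with made-up scores, not values from Table 1), the K = 2 shortcut can be checked against the general formula:

```python
import numpy as np

# Two replicates for one panellist/variety cell (hypothetical scores)
x = np.array([5.3, 6.1])

# S(i,j): sample standard deviation with the 1/(K-1) divisor
s = np.std(x, ddof=1)

# For K = 2 this equals |x1 - x2| / sqrt(2)
shortcut = abs(x[0] - x[1]) / np.sqrt(2)
```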
For attribute A5 in Table 1, Table 2 shows the standard deviations for the different panellists
and products:
Table 2: Standard deviations for each product and panellist separately

            Product
Panellist   A      B      C
1           0,42   0,07   0,57
2           0,07   0,71   0,49
3           0,78   1,77   0,49
4           0,00   0,21   0,00
5           0,99   0,85   1,06
6           0,28   0,99   0,07
7           0,71   0,21   1,06
8           1,48   0,99   0,00
9           0,28   0,07   1,27
Similar tables could be computed for all attributes.
Although not touched upon in GUM, we may extend the repeatability to include the different
varieties, resulting in:
S(i) = √[ (1/(J(K−1))) · Σ_{j=1}^{J} Σ_{k=1}^{K} (X_ijk − X̄_ij.)² ]
which is the average repeatability for panellist i, averaged over products. (Note that this average is formed by averaging the variances S²(i,j) and then extracting the square root, not by averaging the standard deviations directly.) This is a summary measure indicating how well an assessor can repeat him/herself, averaged over the products.
With only 2 replicates, S(i) reduces to:

S(i) = √[ (1/(2J)) · Σ_{j=1}^{J} (X_ij1 − X_ij2)² ]
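Both expressions can be checked numerically; the replicate pairs below are invented for illustration (J = 3 varieties, K = 2 replicates):

```python
import numpy as np

# One pair of replicates per variety for a single panellist
cells = np.array([[5.3, 6.1],
                  [4.0, 4.3],
                  [7.2, 7.2]])

# Average the variances S^2(i,j) over varieties, then take the root
s_i = np.sqrt(np.mean(np.var(cells, axis=1, ddof=1)))

# K = 2 shortcut: summed squared replicate differences over 2J
J = cells.shape[0]
diff = cells[:, 0] - cells[:, 1]
shortcut = np.sqrt((diff ** 2).sum() / (2 * J))
```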
For attribute A5, we get Table 3:
Table 3: Average standard deviations for each panellist

Panellist   S(i)
1           0,41
2           0,50
3           1,15
4           0,12
5           0,97
6           0,60
7           0,75
8           1,03
9           0,75
Performing an ANOVA for each panellist separately, we can obtain Table 3 by computing

S(i) = √[ SS(Error) / (J(K−1)) ]

where SS(Error) – the error sum-of-squares, or residual sum of squares – is taken from the ANOVA output available in all statistical software providing traditional statistical analyses.
Averaging the squares of the entries in Table 3 will give us the error term from an ANOVA
model incorporating varieties, assessors and replicates.
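A minimal sketch of this equivalence, assuming a per-panellist data matrix of J varieties by K replicates (the helper is our own, not from the Procedure):

```python
import numpy as np

def s_i_from_anova(Xi):
    """S(i) recovered from a one-way ANOVA per panellist: Xi has shape
    (J, K); SS(Error) is the squared deviation from the variety (cell)
    means, and its degrees of freedom are J*(K-1)."""
    J, K = Xi.shape
    ss_error = ((Xi - Xi.mean(axis=1, keepdims=True)) ** 2).sum()
    return np.sqrt(ss_error / (J * (K - 1)))
```

This returns the same value as averaging the variances S²(i,j) directly and taking the square root.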
Normally the repeatability at the panellist level is not provided in sensory reports; its main interest lies in the fact that it can be used during training and as a quality control of the panellists. This last application assumes that the assessors evaluate a narrow selection of products, since comparing the assessors’ performance for an attribute over different food categories is hardly relevant.
Other intuitive ways of looking at the measurement uncertainty exist. One approach which is
relevant if the replicates are related to time (and should be treated as a specific factor in the
ANOVA model, not only as a part of the residual) is to average over assessors in each
replicate and compute the standard deviation over the replicates:
S(j) = √[ (1/(K−1)) · Σ_{k=1}^{K} (X̄_.jk − X̄_.j.)² ]
giving:
Variety   S(j)
A         0,40
B         0,18
C         0,15
Averaging over varieties, we get:

S = √[ (1/(J(K−1))) · Σ_{j=1}^{J} Σ_{k=1}^{K} (X̄_.jk − X̄_.j.)² ]

Or, in this particular example: S = 0,27 ( = √[(0,40² + 0,18² + 0,15²)/3] )
This uncertainty measure is, however, not often seen in practical work.
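The pooling by “averaging the variances, then taking the root” can be verified with the S(j) values from the small table above:

```python
import numpy as np

s_j = np.array([0.40, 0.18, 0.15])   # S(j) for varieties A, B, C
S = np.sqrt(np.mean(s_j ** 2))       # average the variances, then the root
# S comes out at approximately 0.27, matching the value quoted in the text
```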
A slightly different approach is to use the ANOVA table to estimate some of the measurement
uncertainties above. Analysing attribute A5 in Table 1 we end up with the ANOVA table
(Table 4):
Source                DF    SS       MS       F        p         E(MS)
Variety                2    24.144   12.072   10.079    0.0015   σ²_E + 2σ²_AV + 18σ²_V
Panellist              8    53.378    6.672   11.456   <0.0005   σ²_E + 6σ²_A
Panellist × Variety   16    19.163    1.198    2.056    0.048    σ²_E + 2σ²_AV
Error                 27    15.725    0.582                      σ²_E
where:
DF = Degrees of freedom
SS = Sum of squares
MS = Mean square ( = SS/DF)
F = MS for any Source (Variety, Panellist or Panellist × Variety) divided by the appropriate error term (the error term being the Panellist × Variety interaction for the Variety F-value, and Error for the Panellist and Panellist × Variety F-values3)
p = The probability of obtaining an F-value at least as large as the one actually observed, given that the hypothesis (‘null hypothesis’) of no influence of the Source in question is true.
E(MS) = Expected Mean Square, the expected value of MS expressed in terms of variance components4.
This type of table is produced by any run-of-the-mill general statistical package, except for
the E(MS) column which is only featured in a few packages.
The standard deviations, and variances, for the different Sources are estimated by starting with the bottom line of the table and replacing all the σ²’s in the E(MS) column with their estimates:
The variance representing the extent to which the panellists repeat themselves is then estimated by MS(Error), or 0.582. Or, if we want to put it on a standard deviation form: 0.763. Note that this standard deviation, as well as all other conclusions derived from Table 4, applies to all the assessors. To draw similar conclusions based on a single assessor, the ANOVA must be performed for each assessor separately.
In the line corresponding to Panellist × Variety, we replace σ²_E by the value just found (0.582) to obtain the estimate 0.308 for σ²_AV, or on standard deviation form: 0.555. In
3 A certain indetermination exists: some software programs use the Variety × Panellist interaction as the error term also for the Panellist effect. We will not probe further into this material here but only refer to the statistical literature.
4 σ_V, σ_A, σ_AV and σ_E represent the square roots of the variances of the Variety, Panellist, Variety × Panellist and Error effects, respectively. How to find the E(MS) values is outside the scope of this NMKL Procedure.
similar fashion we get the estimate 1.015 for σ²_A (on standard deviation form: 1.007). For σ²_V we get 0.604 (on standard deviation form: 0.777).
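The back-substitution described above can be written out with the mean squares from Table 4 and the design sizes I = 9, J = 3, K = 2 (variable names are ours):

```python
import math

# Mean squares from Table 4
ms_v, ms_a, ms_av, ms_e = 12.072, 6.672, 1.198, 0.582
I, J, K = 9, 3, 2

var_e = ms_e                         # sigma^2_E
var_av = (ms_av - var_e) / K         # from E(MS) = sigma^2_E + 2*sigma^2_AV
var_a = (ms_a - var_e) / (J * K)     # from E(MS) = sigma^2_E + 6*sigma^2_A
var_v = (ms_v - ms_av) / (I * K)     # from E(MS) = sigma^2_E + 2*sigma^2_AV + 18*sigma^2_V

sd_e, sd_av, sd_a, sd_v = (math.sqrt(v) for v in (var_e, var_av, var_a, var_v))
```

Solving from the bottom line upwards reproduces the estimates 0.582, 0.308, 1.015 and 0.604 (standard deviations 0.763, 0.555, 1.007 and 0.777) quoted in the text.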
Still another measure of uncertainty is to consider the error term for the F-test in the ANOVA
output – this depends on the model selected, but equals the variety by panellist interaction in
the default ANOVA model used here. In more complicated models, incorporating other
random effects in addition to the assessor effect, and/or nested effects, this error term may
become quite complex and even impossible to compute. In such cases it can often be
approximated by a linear combination – the Satterthwaite approximation (Satterthwaite,
1946).
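A sketch of the Satterthwaite formula for the approximate degrees of freedom of a linear combination Σ cᵢ·MSᵢ of independent mean squares (the function name is ours):

```python
def satterthwaite_df(ms, df, coefs):
    """Approximate degrees of freedom for the linear combination
    sum(c_i * MS_i) of independent mean squares (Satterthwaite, 1946):
    df = (sum c_i*MS_i)^2 / sum((c_i*MS_i)^2 / df_i)."""
    combo = sum(c * m for c, m in zip(coefs, ms))
    denom = sum((c * m) ** 2 / d for c, m, d in zip(coefs, ms, df))
    return combo ** 2 / denom
```

With a single mean square and coefficient 1, the formula reproduces that term’s own degrees of freedom, as it should.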
In addition to the variation due to the effects in the model, other sources must also be taken into account. Sampling is one important source, both the selection of the test samples from a larger population and the sub-sampling and preparation taking place in the laboratory. The variations in the material depend heavily on the nature of the test samples. The ideal material for a sensory test (on foods) is a homogeneous liquid which can be divided into subsamples as close to identical as possible before being served. For non-food materials, this homogeneity is probably easier to obtain than in most food samples. At the other extreme we find materials where the panellists have to evaluate different physical units (such as carrots). In such cases, uncertainty may be reduced by taking several units, cutting them into smaller pieces, mixing them all together and taking samples from this mixture to present to the panellists. In the carrot example, we could cut the carrots into small cubes (approximately 1 cm × 1 cm in size) and serve them in beakers of appropriate size. That way, the carrots are, in a manner of speaking, averaged in the mouths of the panellists.
Another problem that is hard to safeguard against – and which usually follows Murphy’s Law
of appearing when it is at its most inconvenient – is physical disturbances such as noise, foul
odours, or a fire alarm going off by mistake. This list can be extended indefinitely.
REFERENCES
EA-4/09, 2003: Accreditation for Sensory Testing Laboratories
ISO Guide to the Expression of Uncertainty in Measurement, 1993
ISO International Vocabulary of Basic and General Terms in Metrology, 2nd Ed., 1993
ISO 5725-1:1994: Accuracy (Trueness and Precision) of Measurement Methods and Results, Part 1: General Principles and Definitions
Joint Committee for Guides in Metrology (JCGM): Evaluation of measurement data – Guide to the expression of uncertainty in measurement (GUM with minor corrections), 2008
NMKL Procedure No. 4, 2nd Ed., 2005: Validation of chemical analytical methods
NMKL Procedure No. 5, 2nd Ed., 2003: Estimation and expression of measurement uncertainty in chemical analysis
NMKL Procedure No. 6, 1998: Generelle retningslinier for kvalitetssikring af sensoriske laboratorier [Eng.: General guidelines for quality assurance of sensory laboratories] (Available in Danish only)
NMKL Procedure No. 8, 3rd Ed., 2008: Measurement of uncertainty in quantitative microbiological examination of foods
NMKL Procedure No. 14, 2004: SENSVAL: Guidelines for internal control in sensory analysis laboratories
NMKL Procedure No. 16, 2005: Sensory quality control
NMKL Procedure No. 20, 2007: Evaluation of results from qualitative methods
Lea P, Næs T, Rødbotten M (1997): Analysis of Variance for Sensory Data. John Wiley & Sons. ISBN 0-471-96750-5
Satterthwaite F E (1946): An approximate distribution of estimates of variance components. Biometrics Bulletin, 2, 110-114.
ABBREVIATIONS
ANOVA: Analysis of Variance
EA: European Accreditation
GUM: Guide to the Expression of Uncertainty in Measurement
ISO: International Organization for Standardization
JCGM: Joint Committee for Guides in Metrology
NCS: Natural Colour System
NMKL: Nordisk MetodikKomite for Levnedsmidler/Nordic Committee on Food Analysis
PCA: Principal Component Analysis
PLS: Partial Least Squares