Accepted Manuscript

A Thurstonian Comparison of the Tetrad and Degree of Difference Tests

John M. Ennis, Rune Christensen

PII: S0950-3293(14)00086-X
DOI: http://dx.doi.org/10.1016/j.foodqual.2014.05.004
Reference: FQAP 2774
To appear in: Food Quality and Preference
Received Date: 4 November 2013
Revised Date: 4 May 2014
Accepted Date: 5 May 2014

Please cite this article as: Ennis, J.M., Christensen, R., A Thurstonian Comparison of the Tetrad and Degree of Difference Tests, Food Quality and Preference (2014), doi: http://dx.doi.org/10.1016/j.foodqual.2014.05.004



A Thurstonian Comparison of the Tetrad and Degree of Difference Tests

John M. Ennis¹* and Rune Christensen²

1. The Institute for Perception, 7629 Hull Street Road, Richmond, VA 23235

2. Technical University of Denmark, Department of Applied Mathematics and Computer Science, Richard Petersens Plads, Building 324, Room 220, 2800 Lyngby, Denmark

* Corresponding Author

Running Head: Comparison of Tetrad and Degree of Difference


Abstract

The recurring need to assess product reformulations has kept difference testing at the forefront of sensory science. Within the realm of difference testing, the Tetrad test has risen in popularity recently as its superiority over the Triangle test has been demonstrated both in theory and in practice. But it remains to compare the Tetrad test in detail with other commonly used testing methods such as the Degree of Difference (DOD) test. In this paper, we provide such a comparison by considering, from a theoretical perspective, the differences in both power and precision between the Tetrad and DOD tests. In particular, we show that, theoretically and for the range of sensory effect sizes likely to be of interest in consumer research, the Tetrad test is more powerful and more precise than the DOD test. Even so, if there is substantially more perceptual noise in the Tetrad test from the two additional stimuli, the performance of the DOD test could surpass that of the Tetrad test in practice. To investigate this possibility, we quantify the additional noise required to negate the theoretical advantage of the Tetrad test.

Keywords: Difference testing, Discrimination testing, Thurstonian, Tetrad, Degree of Difference



Highlights

The Tetrad and DOD tests are compared with respect to precision and power.
The Tetrad test is theoretically more powerful and more precise than the DOD test.
Perceptual noise could blunt the advantages of the Tetrad test in practice.



I. Introduction

Discrimination testing programs continue to play a key role connecting sensory science to business (Lawless & Heymann, 2010; Meilgaard, Civille, & Carr, 2007; Stone, Bleibaum, & Thomas, 2012). Among the standard discrimination testing methods, the Tetrad test has risen in prominence recently as its superiority over the Triangle test has become well-established both in theory (Bi & O'Mahony, 2013; Ennis & Christensen, 2013; Ennis & Jesionka, 2011) and practice (Delwiche & O'Mahony, 1996; Garcia, Ennis, & Prinyawiwatkul, 2012; Ishii, O'Mahony, & Rousseau, 2014; Masuoka, Hatjopoulos, & O'Mahony, 1995). See O'Mahony (2013) for a recent review of the Tetrad test.

Despite its apparent benefits, the Tetrad test has one drawback - it requires evaluation of four samples. A potential challenge then for Tetrad testing is that the evaluation of additional samples could lead to additional perceptual noise (Dessirier & O'Mahony, 1998; Lau, O'Mahony, & Rousseau, 2004; Lee & O'Mahony, 2007; Rousseau & O'Mahony, 1997; Rousseau, Rogeaux, & O'Mahony, 1999; Rousseau, Stroh, & O'Mahony, 2002; Stillman & Irwin, 1995). For example, the Tetrad test loses its theoretical advantage over the Triangle test once the evaluation of the additional stimulus leads to a 50% increase in perceptual noise (Ennis, 2012) - this large margin for error is consistent with the experimentally confirmed superiority of the Tetrad test over the Triangle test mentioned above.

In addition to the Triangle test, other unspecified testing methods such as the Duo-Trio (Kim & Lee, 2012) and Degree of Difference (DOD) test (Young et al., 2008) are in common use. Theoretical comparisons of the Tetrad and Duo-Trio tests are complicated by the fact that there are at least two decision rules that can be used in the Duo-Trio, and which decision rule is used is influenced heavily by the experimental conditions (Hautus, Shepherd, & Peng, 2011; van Hout, Hautus, & Lee, 2011). See also (Ennis, 2013; Rousseau & Ennis, 2013) for discussions of the impact of task instructions on decision rules and test performance. Thus, in this paper we focus on the theoretical relationship between the Tetrad and DOD tests. To this end we use Thurstonian analysis (Frijters, 1979; Thurstone, 1927) to compare the power (Ennis, 1990, 1993; Ennis & Jesionka, 2011) and the precision (Ennis & Christensen, 2013) of these two tests.

The methodology developed in this paper has been implemented in the free R package sensR (Christensen & Brockhoff, 2014), and all the figures in this paper can be reproduced using this package.

II. A Thurstonian Comparison

1. Tetrad Test Assumptions

For the Tetrad test, we assume that four samples - two from one product and two from the other - are presented to respondents along with instructions to group the samples into two groups of two (Rousseau & Ennis, 2013). These instructions give rise to the decision rule modeled in Ennis, Ennis, Yip, and O'Mahony (1998), and from this decision rule the power (Ennis & Jesionka, 2011) and precision (Bi & O'Mahony, 2013; Ennis & Christensen, 2013; Ennis, 2012) of the Tetrad test can be computed for all sample sizes and all sensory effect sizes (Christensen, 2011; Ennis, Rousseau, & Ennis, 2013).
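For readers who wish to explore this decision rule numerically, the following is a minimal Monte Carlo sketch in base R (the function name and simulation settings are ours, not the paper's). Under the unidimensional Thurstonian model, two samples are drawn from N(0, 1) and two from N(δ, 1); on a single perceptual dimension, grouping the two most similar samples together succeeds exactly when the two samples from one product are the two smallest or the two largest of the four values, and at δ = 0 this occurs with the guessing probability of 1/3.

# Monte Carlo estimate of the Tetrad proportion correct under the
# unidimensional Thurstonian model (comparison-of-distances grouping rule)
tetrad_pc <- function(delta, n_sim = 1e5) {
  a <- matrix(rnorm(2 * n_sim, mean = 0),     ncol = 2)  # product A pair
  b <- matrix(rnorm(2 * n_sim, mean = delta), ncol = 2)  # product B pair
  correct <- (pmax(a[, 1], a[, 2]) < pmin(b[, 1], b[, 2])) |
             (pmin(a[, 1], a[, 2]) > pmax(b[, 1], b[, 2]))
  mean(correct)                                          # proportion correct
}

set.seed(1)
tetrad_pc(0)     # approximately 1/3, the guessing level
tetrad_pc(1.5)   # well above chance for a moderate sensory difference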

2. DOD Test Assumptions

We use the standard Thurstonian model for the DOD, in which respondents are assumed to use scale boundaries to rate the degree of difference between the samples in a pair. Specifically, the perceptual difference between the samples in a pair is assumed to be normally distributed with standard deviation √2. The mean of this distribution is 0 for control pairs and δ for test pairs. See Figure 1 for an illustration and Appendix 1 for mathematical details.

INSERT FIGURE 1 ABOUT HERE

The model parameters (δ and the decision thresholds) can be estimated from data by maximization of the likelihood function, and the variance-covariance matrix of the model parameters can be estimated as the inverse of the Hessian of the log-likelihood function.
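As an illustration, here is a minimal sketch of this estimation in base R, independent of the sensR implementation. The category probabilities follow the model above (and the Appendix); the example counts are hypothetical, and the function names and parameterization (log threshold spacings and log δ) are ours.

# Category probabilities for the Thurstonian DOD model: ratings are based on
# the absolute perceptual difference, N(0, 2) for same-pairs and N(delta, 2)
# for different-pairs, cut at thresholds 0 < tau_1 < ... < tau_{J-1}
dod_probs <- function(delta, tau) {
  gamma_same <- 2 * pnorm(tau / sqrt(2)) - 1
  gamma_diff <- pnorm((tau - delta) / sqrt(2)) - pnorm((-tau - delta) / sqrt(2))
  list(same = diff(c(0, gamma_same, 1)),   # pi_j for same-pairs
       diff = diff(c(0, gamma_diff, 1)))   # pi_j for different-pairs
}

# Negative log-likelihood; par holds log threshold spacings and log(delta)
dod_negloglik <- function(par, x_same, x_diff) {
  J <- length(x_same)
  tau   <- cumsum(exp(par[1:(J - 1)]))
  delta <- exp(par[J])
  p <- dod_probs(delta, tau)
  -sum(x_same * log(p$same)) - sum(x_diff * log(p$diff))
}

# Hypothetical counts for 100 same-pairs and 100 different-pairs, 4 categories
x_same <- c(46, 30, 16, 8)
x_diff <- c(26, 28, 24, 22)
fit <- optim(rep(0, 4), dod_negloglik, x_same = x_same, x_diff = x_diff,
             hessian = TRUE)
delta_hat <- exp(fit$par[4])      # maximum likelihood estimate of delta
vcov_hat  <- solve(fit$hessian)   # approximate vcov (on the transformed scale)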

The performance (precision of d-prime and power of statistical hypothesis tests) of the DOD protocol depends on five factors:

1. Size of δ
2. Number of response categories
3. Location of decision thresholds
4. Sample size
5. Ratio of same-pairs to different-pairs

Our goal is to consider how the precision and power of the DOD test depend on the above factors, and to compare the DOD to the Tetrad test with respect to these measures. To simplify our exploration, we begin by detailing an approach towards finding optimal decision thresholds. When such optimal thresholds are used, we then show that precision and power vary only slightly with the number of response categories and with the ratio of same-pairs to different-pairs (as long as this ratio is close to 1). This work will allow us to assume four response categories, as experimentally suggested by Fu & Rousseau (2011), and an equal number of same-pairs and different-pairs, as is commonly used in practice. Finally, in what follows, we assume that the Wilcoxon rank sum test is used for hypothesis testing - this test statistic has the advantage that no model fitting is necessary to perform the test, and it is a natural candidate for the simulations that we will consider.
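To make the testing procedure concrete, the following is a small sketch in base R of a single simulated DOD experiment analyzed with the one-sided Wilcoxon rank sum test; the threshold and δ values are illustrative choices of ours, not values from the paper.

# Simulate DOD ratings: absolute perceptual differences are N(0, 2) for
# same-pairs and N(delta, 2) for different-pairs, categorized by thresholds
simulate_dod <- function(n_same, n_diff, delta, tau) {
  breaks <- c(0, tau, Inf)
  r_same <- cut(abs(rnorm(n_same, 0,     sqrt(2))), breaks, labels = FALSE)
  r_diff <- cut(abs(rnorm(n_diff, delta, sqrt(2))), breaks, labels = FALSE)
  list(same = r_same, diff = r_diff)
}

set.seed(123)
ratings <- simulate_dod(n_same = 50, n_diff = 50, delta = 1,
                        tau = c(0.5, 1.1, 1.9))
# One-sided test: different-pairs tend to receive higher ratings than same-pairs
wilcox.test(ratings$diff, ratings$same, alternative = "greater")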

a. Optimal Criteria for Decision Thresholds


In choosing decision thresholds, we considered three possible optimality criteria:

1. Minimize the standard error of d-prime
2. Maximize the likelihood ratio (LR) statistic
3. Provide an equal probability that observations fall in each of the response categories when averaged over same-pairs and different-pairs

Note that minimizing the standard error of d-prime is equivalent to maximizing the B-value (cf. Bi, Ennis, & O'Mahony (1997)).

To find optimal decision thresholds, we note that the threshold parameters completely specify the limiting (multinomial) distribution of the data. All three optimality criteria may thus be evaluated according to these limiting frequencies and subsequently optimized over the decision thresholds to yield the optimal thresholds for a given value of δ. As a technical point, we note that the LR criterion depends not only on the value of δ under the alternative, but also on the value of δ under the null hypothesis. The other two criteria depend only on the value of δ under the alternative. In fact, the LR criterion is asymptotically equivalent to the minimum standard error criterion as δ0 approaches δA, since then the LR statistic reduces to the incremental curvature of the profile likelihood function, i.e. the squared standard error of d-prime.
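As a concrete example of this approach, the sketch below (base R; the function name is ours) computes thresholds under the third and simplest criterion: for a given δ, each of the J categories is made equally likely when same-pairs and different-pairs are averaged, which reduces to solving the averaged cumulative probability for j/J by one-dimensional root-finding.

# Equi-probability thresholds: solve mean cumulative probability = j/J for tau_j
equiprob_thresholds <- function(delta, J = 4) {
  avg_cum <- function(tau) {
    g_same <- 2 * pnorm(tau / sqrt(2)) - 1
    g_diff <- pnorm((tau - delta) / sqrt(2)) - pnorm((-tau - delta) / sqrt(2))
    (g_same + g_diff) / 2
  }
  sapply(seq_len(J - 1) / J, function(target)
    uniroot(function(tau) avg_cum(tau) - target, c(1e-8, 20))$root)
}

equiprob_thresholds(delta = 1, J = 4)   # three thresholds for four categories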

Figures 2a and 2b illustrate the effects of the decision thresholds on the standard error of d-prime and on the LR test statistic, respectively, as a function of δ for n = 100 and when four response categories are used¹. Figure 2c shows the actual optimal thresholds used to produce the results in Figures 2a and 2b. These figures show that the three methods lead to very similar standard errors of d-prime and LR statistics even though the decision boundaries are very different. In other words, the performance of the DOD protocol is remarkably constant over a large space of values for the decision parameters.

INSERT FIGURES 2A, 2B, AND 2C ABOUT HERE

b. Effect of the Number of Response Categories

To investigate the effect of the number of response categories, we computed the standard error of d-prime and the size of the LR statistic while varying the number of response categories from 2 to 18. We computed optimal decision thresholds according to each of the three criteria mentioned above, and assumed δ = 1 and a sample size of 100 for both same-pairs and different-pairs. Figure 3a illustrates the effect of the number of response categories on the standard error of d-prime, while Figure 3b shows the effect of the number of response categories on the LR statistic.

¹ We consider the possibility of using a different number of response categories momentarily.


INSERT FIGURES 3A AND 3B ABOUT HERE

The difference between using 4 and 9 response categories corresponds roughly to a difference in δ of 0.02. Thus there is no significant gain in power or precision of d-prime from increasing the number of response categories beyond four.

c. Effect of the Ratio of Same-Pairs to Different-Pairs

To study the effect of the ratio of same-pairs to different-pairs, we computed the standard error of d-prime and the size of the LR statistic (δ = 0 under the null hypothesis) while varying the ratio from 0.1 to 0.9. For a total of 200 evaluation pairs and with δ = 1, Figures 4a and 4b show the effect of the ratio on the standard error of d-prime and on the LR statistic using the three criteria for choosing optimal decision thresholds. The standard error of d-prime is minimized by using a slightly smaller proportion of same-pairs than different-pairs. Conversely, the LR statistic is maximized using a slightly higher proportion of different-pairs than same-pairs. In both cases, however, the optimal ratio is very close to 1. Since a ratio of 1 corresponds to a balanced experimental design, which is independently desirable, we assume throughout the remainder of this article that an equal number of same and different pairs are presented.

INSERT FIGURES 4A AND 4B ABOUT HERE

d. Summary of Assumptions

In the next section, we proceed to compare the DOD test to the Tetrad test with respect to power and precision of measurement. Based on the analyses presented in this section, we henceforth assume that optimal decision boundaries and four response categories are used for the DOD test, and that an equal number of same and different pairs are presented. Thus, given a sensory effect size δ and a sample size N, a realization of a DOD experiment may then be generated by simulating from the relevant multinomial distributions.

III. Comparison of Tetrad and DOD Tests

We now compare the Tetrad and DOD tests with respect to precision of measurement and power.

1. Precision

With the bounds selected for the DOD, it is now straightforward to compare the Tetrad and DOD with respect to precision of measurement. As a measure of precision, we use the expected width of the 95% profile likelihood confidence interval, as illustrated in Figure 5.

INSERT FIGURE 5 ABOUT HERE


From this figure, we see that the Tetrad test is more precise than the DOD test for δ values less than approximately 2.7, assuming there is no additional noise from the additional stimuli.

2. Power

To estimate the power of the DOD test, we ran a series of simulations. Specifically, we simulated 100,000 experiments for 25 equally spaced values of δ between 0 and 3 inclusive. For each simulation, we generated ratings for the control and test pairs using decision thresholds that maximize the LR statistic, which provides an approximate highest-power scenario for the DOD test. A one-sided Wilcoxon rank sum test was applied to each simulated data set. If the p-value from this test was less than 0.05, the experiment was counted as significant - the power of the DOD test for the given δ value was thus estimated as the proportion of significant experiments. Figure 6 shows the power of the DOD test compared with the power of the Tetrad test when N = 30 (Ennis & Jesionka, 2011).

INSERT FIGURE 6 ABOUT HERE
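A compact version of this simulation, reusing the rating scheme sketched earlier, looks as follows; the threshold values and the smaller number of replicates are illustrative choices of ours rather than the paper's settings.

# Estimate DOD power at a given delta: simulate many experiments and count
# how often the one-sided Wilcoxon rank sum test rejects at the 5% level
dod_power <- function(delta, n = 30, tau = c(0.5, 1.1, 1.9), n_sim = 2000) {
  breaks <- c(0, tau, Inf)
  reject <- replicate(n_sim, {
    r_same <- cut(abs(rnorm(n, 0,     sqrt(2))), breaks, labels = FALSE)
    r_diff <- cut(abs(rnorm(n, delta, sqrt(2))), breaks, labels = FALSE)
    suppressWarnings(
      wilcox.test(r_diff, r_same, alternative = "greater")$p.value) < 0.05
  })
  mean(reject)   # estimated power
}

set.seed(456)
dod_power(delta = 1.5, n = 30)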

a. Generalizing to Other Sample Sizes

In Figure 6 we saw that the Tetrad test theoretically has higher power than the DOD test for a sample size of N = 30. To investigate whether this result generalizes to other sample sizes, we computed the sample size needed to achieve at least 80% power using the exact binomial test for the Tetrad test as a function of δ. For these combinations of δ and sample size we then computed the power of the Degree of Difference test. For comparison, we also computed the power of the Triangle test for these settings. The results of this comparison are shown in Figure 7.

INSERT FIGURE 7 ABOUT HERE
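The Tetrad side of this computation can be sketched as follows in base R; the function names are ours, and the proportion correct is estimated by Monte Carlo as in the sketch of Section II.1 rather than by numerical integration. For each δ, the sketch finds the smallest N at which the exact binomial test, with guessing probability 1/3 and α = 0.05, reaches 80% power.

# Proportion correct for the Tetrad test (Monte Carlo, as in the earlier sketch)
tetrad_pc <- function(delta, n_sim = 2e5) {
  a <- matrix(rnorm(2 * n_sim, 0),     ncol = 2)
  b <- matrix(rnorm(2 * n_sim, delta), ncol = 2)
  mean(pmax(a[, 1], a[, 2]) < pmin(b[, 1], b[, 2]) |
       pmin(a[, 1], a[, 2]) > pmax(b[, 1], b[, 2]))
}

# Smallest N with at least 80% power for the exact binomial Tetrad test
tetrad_n_for_power <- function(delta, power = 0.8, alpha = 0.05, n_max = 2000) {
  pc <- tetrad_pc(delta)
  for (n in 3:n_max) {
    crit <- qbinom(1 - alpha, n, 1/3) + 1                     # critical count
    if (pbinom(crit - 1, n, pc, lower.tail = FALSE) >= power) return(n)
  }
  NA
}

set.seed(789)
tetrad_n_for_power(delta = 1)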

Figure 7 shows that at no value of δ will the power of the DOD test approach the power of the Tetrad test if the sample size is such that 80% power is achieved in the Tetrad test. This figure also shows that the power of the DOD test is remarkably similar to the power of the Triangle test. Finally, note that this figure is as favorable as possible to the DOD test, as the decision thresholds were chosen so as to maximize the LR statistic.

b. Considering the Effect of Perceptual Noise

As the last piece of our power comparison we note that, to this point, we have assumed no additional noise from the two additional stimuli that the Tetrad test requires to elicit a response. Since the Tetrad test involves four evaluations in a single trial, sensory fatigue and memory requirements could lower the relative ability of the Tetrad test to detect differences (Ennis, 2012). To assess the relationship between the Tetrad and DOD tests in practice, we now consider how much additional noise there can be before the theoretical power advantage of the Tetrad test demonstrated above is lost.


For N = 30 and δ values equally spaced between 0 and 3 in a DOD test, Figure 8 provides the corresponding δ values in a Tetrad test that would give an equal level of power. Such a curve is called an isopower curve since it is the set of pairs of values, (δ_DOD, δ_Tetrad), that yield equal power in the two tests. In particular, in every case, the Tetrad test requires a smaller δ value than the DOD test to yield the same power.

INSERT FIGURE 8 ABOUT HERE

As in Ennis (2012), we can use isopower curves to estimate the maximal amount of additional perceptual noise that the Tetrad test can withstand before losing its theoretical power advantage. Since δ is a signal-to-noise ratio, we can allow for potentially different noise values σ_Tetrad and σ_DOD for the two tests - these noise values will be related by the equation (Ennis, 2012):

σ_Tetrad / σ_DOD = δ_DOD / δ_Tetrad.    (1)

Since Equation (1) represents the ratio of perceptual noise that provides equal power from the two tests, a ratio greater than 1 on the left side of Equation (1) corresponds to a case in which the Tetrad test can withstand some additional perceptual noise without losing its theoretical power advantage. For example, if this ratio is 1.3, the Tetrad test can withstand a 30% increase in perceptual noise before losing its power advantage over the DOD test in practice.
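As a small numerical illustration, the tolerated extra noise follows directly from Equation (1); the pair of values below is made up for illustration and is not read from Figure 8.

# Given an isopower pair (delta_DOD, delta_Tetrad), Equation (1) gives the
# factor by which Tetrad noise may grow before the power advantage is lost
delta_dod    <- 1.00    # hypothetical DOD effect size on the isopower curve
delta_tetrad <- 0.75    # hypothetical matching Tetrad effect size
noise_ratio  <- delta_dod / delta_tetrad
extra_noise_pct <- 100 * (noise_ratio - 1)   # ~33% additional noise tolerated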

Based on an analysis of Equation (1), Figure 9 shows the additional perceptual noise the Tetrad test can tolerate before losing its power advantage over the DOD test, for N = 100 and 0.5 ≤ δ_DOD ≤ 1.75.

INSERT FIGURE 9 ABOUT HERE

Figure 9 shows that as long as the additional perceptual noise in the Tetrad test does not exceed 30%, the Tetrad test will continue to enjoy a power advantage over the DOD test for the range of δ values shown. Since consumer-relevant δ values typically fall in this range (cf. Ishii, Kawaguchi, O'Mahony, & Rousseau, 2007), the Tetrad test has the potential to be more powerful than the DOD test for business-relevant sensory differences. To determine whether or not this last result holds in practice is an experimental question - see Ennis (2012) for more detail on how to conduct comparative experiments involving the Tetrad test.

IV. Conclusion

In this paper we have addressed a longstanding need to compare the Tetrad and Degree of Difference (DOD) tests from a theoretical perspective. The summary of this comparison is that the Tetrad test is both more precise and more powerful than the DOD test, when we assume there is no additional noise from the two additional stimuli in the Tetrad test and that the respondents use optimal scale boundaries in the DOD test. In addition, using Thurstonian theory, we estimated that the Tetrad test will be more powerful than the DOD test for difference testing within the range of business-relevant sensory differences, as long as the additional noise associated with testing four samples does not exceed 30%. Whether or not this latter condition is met can only be determined directly by comparative testing - the benefit of the present paper is to show that such comparative tests are worth conducting.

Combining the results of this paper with previous research comparing the Tetrad test with the Triangle and Two-Out-of-Five tests, we are closer to offering clear recommendations for which difference test to use in which setting. Specifically, if the nature of the difference is unknown and the only tests under consideration are the DOD, Tetrad, Triangle, and Two-Out-of-Five tests, one should use the Tetrad test as long as the samples are not fatiguing and it is possible to test four samples at once. Otherwise one should use the DOD test. Despite the cleanliness of this recommendation, however, it is incomplete - there are other experimental strategies, such as warm-up (cf. Angulo, Lee, & O'Mahony, 2007; Mata-Garcia, Angulo, & O'Mahony, 2007; O'Mahony, Thieme, & Goldstein, 1988) and the constant-reference Duo-Trio test (cf. Hautus et al., 2011; Kim & Lee, 2012; Lee, Van Hout, & Hautus, 2007; van Hout et al., 2011), that increase the number of correct answers in difference tests and hence yield higher operational power. Thus further research is needed before the power of sensory difference testing is fully understood in practice. As for precision, it is not clear how to estimate sensory effect sizes when the decision rules are not known, and in these last examples operational power can be achieved when subjects change their decision rules over the course of repeated trials. Thus, for now, it seems that the Tetrad test is the preferred sensory difference testing method for estimating the magnitude of unspecified sensory differences.

Acknowledgement

The authors thank Daniel Ennis and Benoit Rousseau for helpful guidance on this topic.

References

Angulo, O., Lee, H.-S., & O'Mahony, M. (2007). Sensory difference tests: Overdispersion and warm-up. Food Quality and Preference, 18(2), 190-195.
Bi, J., Ennis, D. M., & O'Mahony, M. (1997). How to estimate and use the variance of d' from difference tests. Journal of Sensory Studies, 12, 87-104.
Bi, J., & O'Mahony, M. (2013). Variance of d' for the Tetrad test and comparisons with other forced-choice methods. Journal of Sensory Studies, 28(2), 91-101.
Brockhoff, P., & Christensen, R. (2010). Thurstonian models for sensory discrimination tests as generalized linear models. Food Quality and Preference, 21(3), 330-338.
Christensen, R. H. B. (2011). Statistical methodology for sensory discrimination tests and its implementation in sensR. Retrieved from http://cran.r-project.org
Christensen, R. H. B., & Brockhoff, P. B. (2009). Estimation and inference in the same-different test. Food Quality and Preference, 20(7), 514-524.
Delwiche, J., & O'Mahony, M. (1996). Flavour discrimination - An extension of Thurstonian paradoxes to the tetrad method. Food Quality and Preference, 7, 1-5.
Dessirier, J., & O'Mahony, M. (1998). Comparison of d' values for the 2-AFC (paired comparison) and 3-AFC discrimination methods: Thurstonian models, sequential sensitivity analysis and power. Food Quality and Preference, 10(1), 1-8.
Ennis, D. M. (1990). Relative power of difference testing methods in sensory evaluation. Food Technology, 44, 114, 116, 117.
Ennis, D. M. (1993). The power of sensory discrimination methods. Journal of Sensory Studies, 8(4), 353-370.
Ennis, D. M., Rousseau, B., & Ennis, J. M. (2013). IFPrograms User Manual. Retrieved from www.ifpress.com
Ennis, J. M., Rousseau, B., & Ennis, D. M. (2014). Sensory difference tests as measurement instruments: A review of recent advances. Journal of Sensory Studies, 29(2), 89-102.
Ennis, J. M. (2012). Guiding the switch from Triangle testing to Tetrad testing. Journal of Sensory Studies, 27(4), 223-231.
Ennis, J. M. (2013). A Thurstonian analysis of the two-out-of-five test. Journal of Sensory Studies, 28(4), 297-310.
Ennis, J. M., & Christensen, R. (2014). Precision of measurement in Tetrad testing. Food Quality and Preference, 32(A), 98-106.
Ennis, J. M., Ennis, D. M., Yip, D., & O'Mahony, M. (1998). Thurstonian models for variants of the method of tetrads. British Journal of Mathematical and Statistical Psychology, 51, 205-215.
Ennis, J. M., & Jesionka, V. (2011). The power of sensory discrimination methods revisited. Journal of Sensory Studies, 26(5), 371-382.
Frijters, J. (1979). The paradox of discriminatory nondiscriminators resolved. Chemical Senses, 4, 355-358.
Fu, Y., & Rousseau, B. (2011). Effect of the number of rating categories in the Degree of Difference methodology. In 2011 Pangborn Sensory Science Symposium.
Garcia, K., Ennis, J. M., & Prinyawiwatkul, W. (2012). A large-scale experimental comparison of the Tetrad and Triangle tests in children. Journal of Sensory Studies, 27(4), 217-222.
Hautus, M., Shepherd, D., & Peng, M. (2011). Decision strategies for the two-alternative forced choice reminder paradigm. Attention, Perception, & Psychophysics, 73(3), 729-737.
Ishii, R., Kawaguchi, H., O'Mahony, M., & Rousseau, B. (2007). Relating consumer and trained panels' discriminative sensitivities using vanilla flavored ice cream as a medium. Food Quality and Preference, 18(1), 89-96.
Ishii, R., O'Mahony, M., & Rousseau, B. (2014). Triangle and tetrad protocols: Small sensory differences, resampling and consumer relevance. Food Quality and Preference, 31, 49-55.
Kendall, S., Stuart, A., & Ord, J. (1987). Kendall's advanced theory of statistics, Vol. 1 (2nd ed.). New York: Oxford University Press.
Kim, M., & Lee, H. (2012). Investigation of operationally more powerful duo-trio test protocols: Effects of different reference schemes. Food Quality and Preference, 25(2), 183-191.
Lau, S., O'Mahony, M., & Rousseau, B. (2004). Are three-sample tasks less sensitive than two-sample tasks? Memory effects in the testing of taste discrimination. Perception & Psychophysics, 66(3), 464-474.
Lawless, H., & Heymann, H. (2010). Sensory evaluation of food: Principles and practices. New York, NY: Springer.
Lee, H., & O'Mahony, M. (2007). Difference test sensitivity: Cognitive contrast effects. Journal of Sensory Studies, 22(1), 17-33.
Lee, H., Van Hout, D., & Hautus, M. (2007). Comparison of performance in the A-Not A, 2-AFC, and same-different tests for the flavor discrimination of margarines: The effect of cognitive decision strategies. Food Quality and Preference, 18(6), 920-928.
Masuoka, S., Hatjopoulos, D., & O'Mahony, M. (1995). Beer bitterness detection: Testing Thurstonian and Sequential Sensitivity Analysis models for triad and tetrad methods. Journal of Sensory Studies, 10(3), 295-306.
Mata-Garcia, M., Angulo, O., & O'Mahony, M. (2007). On warm-up. Journal of Sensory Studies, 22(2), 187-193.
Meilgaard, M., Civille, G., & Carr, B. (2007). Sensory evaluation techniques. Boca Raton, FL: Taylor & Francis.
O'Mahony, M., Thieme, U., & Goldstein, L. (1988). The warm-up effect as a means of increasing the discriminability of sensory difference tests. Journal of Food Science, 53, 1848-1850.
O'Mahony, M. (2013). The Tetrad test: Looking forward, looking back. Journal of Sensory Studies, 28(4), 259-263.
Pawitan, Y. (2001). In all likelihood: Statistical modelling and inference using likelihood. Oxford, UK: Oxford University Press.
Peltier, C., Brockhoff, P. B., Visalli, M., & Schlich, P. (2014). The MAM-CAP table: A new tool for monitoring panel performances. Food Quality and Preference, 32(A), 24-27.
Rousseau, B., & Ennis, J. M. (2013). Importance of correct instructions in the Tetrad test. Journal of Sensory Studies, 28(4), 264-269.
Rousseau, B., & O'Mahony, M. (1997). Sensory difference tests: Thurstonian and SSA predictions for vanilla flavored yogurts. Journal of Sensory Studies, 12(2), 127-146.
Rousseau, B., Rogeaux, M., & O'Mahony, M. (1999). Mustard discrimination by same-different and triangle tests: Aspects of irritation, memory and t criteria. Food Quality and Preference, 10(3), 173-184.
Rousseau, B., Stroh, S., & O'Mahony, M. (2002). Investigating more powerful discrimination tests with consumers: Effects of memory and response bias. Food Quality and Preference, 13(1), 39-45.
Stillman, J., & Irwin, R. (1995). Advantages of the same-different method over the triangular method for the measurement of taste discrimination. Journal of Sensory Studies, 10(3), 261-272.
Stone, H., Bleibaum, R., & Thomas, H. A. (2012). Sensory evaluation practices (4th ed.). New York: Academic Press.
Thurstone, L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273-286.
van Hout, D., Hautus, M. J., & Lee, H.-S. (2011). Investigation of test performance over repeated sessions using signal detection theory: Comparison of three nonattribute-specified difference tests 2-AFCR, A-Not A and 2-AFC. Journal of Sensory Studies, 26(5), 311-321.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80-83.
Young, T., Pecore, S., Stoer, N., Hulting, F., Holschuh, N., & Case, F. (2008). Incorporating test and control product variability in degree of difference tests. Food Quality and Preference, 19(8), 734-736.


List of Figure Captions

Figure 1. Thurstonian model for the DOD test. The distributions are of perceptual differences within the control or test pairs.

Figure 2a. Standard error of d' as a function of δ when optimal decision thresholds are used, according to three different criteria.

Figure 2b. The square root of the LR statistic as a function of δ when optimal decision thresholds are used, according to three different criteria.

Figure 2c. Optimal decision thresholds, according to three different criteria, as a function of δ when four response categories are used.

Figure 3a. Standard error of d' as a function of the number of categories when optimal decision thresholds are used, according to three different criteria, when δ = 1 and N = 100.

Figure 3b. The square root of the LR statistic as a function of the number of categories when optimal decision thresholds are used, according to three different criteria, when δ = 1 and N = 100.

Figure 4a. Standard error of d' as a function of the proportion of same-pairs when optimal decision thresholds are used, according to three different criteria, when δ = 1 and 200 pairs are presented in total.

Figure 4b. The square root of the LR statistic as a function of the proportion of same-pairs when optimal decision thresholds are used, according to three different criteria, when δ = 1 and 200 pairs are presented in total.

Figure 5. Expected width of 95% likelihood confidence intervals for the Tetrad and DOD tests, where N = 100 and optimal decision thresholds for the DOD test are used.

Figure 6. Power comparison of the Tetrad and DOD tests, where N = 30 and optimal decision thresholds for the DOD test are used.

Figure 7. Power comparison of the Tetrad, DOD, and Triangle tests, when the sample size is chosen as a function of δ to give at least 80% power in the Tetrad test.

Figure 8. Isopower curve relating δ values in a Degree of Difference (DOD) test (δ_DOD) to the corresponding δ values in a Tetrad test (δ_Tetrad) that yield equal power for N = 30.

Figure 9. Percent of additional perceptual noise in the Tetrad test that would result in equal power with the DOD test when N = 100.



Appendix - The Thurstonian Model for the DOD test

For completeness, we include the Thurstonian model for the DOD test. This model is not original to this paper, and has been implemented in software since 1993 (Ennis et al., 2013). In this model, we assume two products with normal perceptual distributions:

X_control ~ N(0, 1),   X_test ~ N(δ, 1).

The perceptual differences within same-pairs and different-pairs are then distributed as

D_same ~ N(0, 2),   D_diff ~ N(δ, 2),

where the variance of 2 reflects the difference of two unit-variance percepts. Let j = 1, ..., J, with J ≥ 2, index the response categories from "same" (j = 1) to "different" (j = J), and let τ_j, for j = 1, ..., J − 1, denote the decision thresholds. For a same-pair, the cumulative probability that a rating Y falls in or below the j'th response category is

γ_j^same = P(Y_same ≤ j) = 2Φ(τ_j/√2) − 1,

where Φ denotes the standard normal cumulative distribution function. Similarly, the cumulative probability that a rating of a different-pair falls in or below the j'th response category is

γ_j^diff = P(Y_diff ≤ j) = Φ((τ_j − δ)/√2) − Φ((−τ_j − δ)/√2).

The probability that a rating falls in each of the categories is then

π_j^same = γ_j^same − γ_{j−1}^same,   π_j^diff = γ_j^diff − γ_{j−1}^diff,

where we define γ_0^same = γ_0^diff = 0 and γ_J^same = γ_J^diff = 1. The log-likelihood function may then be written as

ℓ(τ, δ; x) = Σ_{j=1}^{J} [ x_j^same log π_j^same + x_j^diff log π_j^diff ],

where x_j^same and x_j^diff are the observed frequencies of ratings in category j for same-pairs and different-pairs, respectively.

[Figures 1-9 appear here. The legends in Figures 2-4 distinguish the three threshold criteria (equi-probability, minimum standard error, maximum LR), and Figure 7 additionally displays the Tetrad sample sizes along its upper axis. See the List of Figure Captions for details.]

