
Science and Justice xxx (2014) xxx–xxx

Contents lists available at ScienceDirect

Science and Justice

journal homepage: www.elsevier.com/locate/scijus

Professional commentary

Experimental results of fingerprint comparison validity and reliability: A review and critical analysis

Ralph Norman Haber ⁎, Lyn Haber
Human Factors Consultants, 313 Ridge View Drive, Swall Meadows, CA 93514, USA

⁎ Corresponding author. Tel.: +1 760 387 2458; fax: +1 760 387 2459. E-mail addresses: [email protected] (R.N. Haber), [email protected] (L. Haber).

1355-0306/$ – see front matter © 2013 Forensic Science Society. Published by Elsevier Ireland Ltd. All rights reserved. http://dx.doi.org/10.1016/j.scijus.2013.08.007


Article info

Article history:
Received 14 January 2013
Received in revised form 8 June 2013
Accepted 16 August 2013
Available online xxxx

Keywords:
Fingerprints
Analysis–Comparison–Evaluation (ACE) method
Accuracy
Reliability
Experimental results
Error rates

Abstract

Our purpose in this article is to determine whether the results of the published experiments on the accuracy and reliability of fingerprint comparison can be generalized to fingerprint laboratory casework, and/or to document the error rate of the Analysis–Comparison–Evaluation (ACE) method. We review the existing 13 published experiments on fingerprint comparison accuracy and reliability. These studies comprise the entire corpus of experimental research published on the accuracy of fingerprint comparisons since criminal courts first admitted forensic fingerprint evidence about 120 years ago. We start with the two studies by Ulery, Hicklin, Buscaglia and Roberts (2011, 2012), because they are recent, large, designed specifically to provide estimates of the accuracy and reliability of fingerprint comparisons, and to respond to the criticisms cited in the National Academy of Sciences Report (2009).

Following the two Ulery et al. studies, we review and evaluate the other eleven experiments, considering problems that are unique to each. We then evaluate the 13 experiments for the problems common to all or most of them, especially with respect to the generalizability of their results to laboratory casework.

Overall, we conclude that the experimental designs employed deviated from casework procedures in critical ways that preclude generalization of the results to casework. The experiments asked examiner-subjects to carry out their comparisons using different responses from those employed in casework; the experiments presented the comparisons in formats that differed from casework; the experiments enlisted highly trained examiners as experimental subjects rather than subjects drawn randomly from among all fingerprint examiners; the experiments did not use fingerprint test items known to be comparable in type and especially in difficulty to those encountered in casework; and the experiments did not require examiners to use the ACE method, nor was that method defined, controlled, or tested in these experiments.

Until there is significant progress in defining and measuring the difficulty of fingerprint test materials, and until the steps to be followed in the ACE method are defined and measurable, we conclude that new experiments patterned on these existing experiments cannot inform the fingerprint profession or the courts about casework accuracy and errors.

© 2013 Forensic Science Society. Published by Elsevier Ireland Ltd. All rights reserved.

Contents

1. Introduction
   1.1. The entire corpus of experiments measuring fingerprint comparison accuracy and reliability
2. Measures
   2.1. Accuracy of conclusions
   2.2. Correct discrimination between same-source and different-source pairs
   2.3. Appropriate and inappropriate conclusion rates
   2.4. Reliability of conclusions
   2.5. Reliability of examiners
   2.6. Reliability as consistency within the examiner
3. The Ulery et al. [1,2] experiments
   3.1. The Ulery et al. experiment
   3.2. Accuracy of conclusions
      3.2.1. Correct identification conclusions



      3.2.2. Erroneous identification conclusions
      3.2.3. Correct exclusion conclusions
      3.2.4. Correct discrimination between same-source and different-source pairs
      3.2.5. No value (inappropriate) conclusions
      3.2.6. Inconclusive (inappropriate) conclusions
      3.2.7. Appropriate conclusions
   3.3. Reliability of conclusions
      3.3.1. Consensus on value conclusions
      3.3.2. Consensus on identification conclusions
      3.3.3. Consensus on exclusion conclusions
   3.4. Reliability of examiners
   3.5. The Ulery et al. experiment
   3.6. Reliability: Percent of repeated pairs receiving the same conclusions from the same examiner
4. Design problems in Ulery et al. [1,2] experiments
   4.1. The "value-only-for-exclusion" conclusion
   4.2. Random assignment of pairs to subjects
   4.3. Repetition of latent and exemplar prints
   4.4. Imbalance in results of same-source and different-source pairs
   4.5. Duration between retests in Ulery
   4.6. Sampling of the examiners retested in Ulery et al.
   4.7. Other more general problems that also affected the Ulery et al. [1,2] experiments
   4.8. General conclusions regarding the Ulery et al. [1,2] experiments
5. Other experiments that did not manipulate biasing information
   5.1. Langenburg, Champod and Genessay
      5.1.1. Accuracy of conclusions
      5.1.2. Correct discrimination between same- and different-source pairs
      5.1.3. Reliability: consensus of conclusions
      5.1.4. Reliability: consensus among examiners
      5.1.5. Specific problems with Langenburg et al.
   5.2. Evett and Williams
      5.2.1. Accuracy of conclusions
      5.2.2. Correct discrimination between same-source and different-source pairs
      5.2.3. Reliability among conclusions
      5.2.4. Reliability among examiners
      5.2.5. Specific problems with Evett and Williams
   5.3. Langenburg
      5.3.1. Accuracy of conclusions
      5.3.2. Correct discrimination
      5.3.3. Consensus of conclusions
      5.3.4. Consensus among examiners
      5.3.5. Specific problems in Langenburg
   5.4. Meagher
      5.4.1. Accuracy of conclusions
      5.4.2. Reliability of conclusions
      5.4.3. Specific problems with Meagher
   5.5. Summary of results of these five experiments
      5.5.1. Correct conclusions
      5.5.2. Erroneous conclusions
6. The final seven experiments
   6.1. Wertheim, Langenburg and Moenssens
   6.2. Gutowski
      6.2.1. Accuracy of conclusions
      6.2.2. Specific problems with Gutowski
   6.3. Tangen et al.
      6.3.1. Results
      6.3.2. Specific problems with Tangen et al.
   6.4. Summary of conclusions about the final three experiments
7. Common problems of generalizing these accuracy and error rates results to casework
   7.1. Extreme variability of results across experiments
   7.2. Lack of statistical tests of significance
   7.3. Measurement of the difficulty of fingerprints
   7.4. Ceiling effects
   7.5. Distribution of conclusions in experiments
   7.6. Non-random sampling of examiners in the experiments
   7.7. Non-adherence to casework procedures: pairing single fingerprints
   7.8. Idealized working conditions in the experiments
   7.9. Knowledge of being tested
   7.10. Absence of AFIS-produced exemplars in the experiments
   7.11. Contrasting biases in experiments and in casework
   7.12. Summary of generalization limitations
8. The use of proficiency test results to estimate error rates
9. Assessment of the ACE method by these 13 experiments



10. Societal implications of these experimental results
11. Conclusions
Acknowledgments
References

1. Introduction

Fingerprint comparison evidence has been used in court for 100 years. Only in the past two decades has evidence been sought that such comparisons are accurate and valid. We examine the research results concerning the accuracy and reliability of fingerprint comparison conclusions.

1.1. The entire corpus of experiments measuring fingerprint comparison accuracy and reliability

Table 1 lists all of the published studies in the past 120 years, through mid-2013, which meet the following criteria: the subjects are practicing fingerprint examiners; the examiners are shown a number of fingerprint cases, each composed of an unknown latent print to be compared to one or more exemplar prints; and they are asked to conclude whether the latent and one of the exemplar prints came from the same source or from two different sources. All 13 studies listed in Table 1 meet these criteria. We could find no others. They are listed in the order discussed in this review.

As a check, we compared the contents of Table 1 to the 77-item annotated reference list by the Scientific Working Group on Friction Ridge Analysis, Study and Technology (SWGFAST) [14] containing experiments and treatises related to fingerprints. The SWGFAST list included four of the experiments listed in Table 1 [1,3,5,10], and listed no additional experiments meeting the selection criteria.

Ground truth (whether the latent and exemplar fingerprints were known to come from the same or from two different sources) is known for 9 of the 13 experiments in Table 1. In the remaining four [4,6–8], concurrence among expert examiners was used as a measure of ground truth. Knowledge of ground truth permits every definitive conclusion to be scored for accuracy in determining whether the pairs of prints came from the same source or from two different sources. Because the experimental designs differed from experiment to experiment, as did scoring procedures and which data were reported, it is not possible simply to compare each result from each experiment. All of the experiments provided at least one measure of the accuracy of conclusions, most provided a measure of the reliability of conclusions, and a few provided a measure of the reliability of examiners.

Table 1
The 13 articles published from 1890 to 2013 that provide objective evidence of the accuracy and/or reliability of fingerprint examination conclusions. The articles are listed in the order in which they are discussed in this article.

Ulery, Hicklin, Buscaglia and Roberts [1]
Ulery, Hicklin, Buscaglia and Roberts [2]
Langenburg, Champod and Genessay [3]
Evett and Williams [4]
Langenburg [5]
Meagher [6]
Dror, Charlton and Peron [7]
Dror and Charlton [8]
Hall and Player [9]
Langenburg, Champod and P. Wertheim [10]
Wertheim, Langenburg and Moenssens [11]
Gutowski [12]
Tangen, Thompson and McCarthy [13]


2. Measures

The following scoring measures were extracted from the experiments, either because they were reported by the authors or because they were computable from the data provided.

2.1. Accuracy of conclusions

The percentage of times the conclusions of examiners matched ground truth was used as the measure of the accuracy of conclusions. With respect to same-source pairs, correct identifications and erroneous exclusions can be tabulated. With respect to different-source pairs, correct exclusions and erroneous identifications can be tabulated.

2.2. Correct discrimination between same-source and different-source pairs

We computed for each experiment the percentage of correct conclusions (combining correct identifications and correct exclusions). We used this measure as a more general index of accuracy. None of the experiments reported this percentage in their respective reports.

2.3. Appropriate and inappropriate conclusion rates

As partially defined in SWGFAST [16], we refer to the correct definitive conclusions of exclusion and identification as "appropriate," because they reflect a conclusion that matches the ground-truth knowledge of the true source of each pair. Conclusions of no value and inconclusive can be described as "inappropriate," because they fail to match ground truth. None of the experiments in the corpus reported their results in this way. We describe this classification because it provides another measure of the accuracy of the conclusions reached by the examiners in these experiments.

2.4. Reliability of conclusions

Reliability was most frequently measured as the percentage of test items that received the same conclusion from all examiners.

2.5. Reliability of examiners

Another measure of reliability, less often reported, was the percentage of examiners who agreed with one another on the responses they gave: the reliability of examiners.

2.6. Reliability as consistency within the examiner

A few studies used a test–retest design, in which the same examiner, without his awareness, repeated comparisons of the same latent–exemplar pairs at a later time. Consistency within the examiner reflects the percentage of times the examiner reached the same conclusion.
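
To make the measures above concrete, the sketch below tallies them from a set of scored conclusions. It is our own illustration, not code or data from any of the reviewed experiments; the counts are invented solely to show the arithmetic.

```python
# Illustrative only: invented conclusion counts for pairs of known ground truth.
# The measures follow the definitions in Sections 2.1-2.3.

def pct(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole

same_source = {"identification": 40, "exclusion": 10, "inconclusive": 30, "no value": 20}
diff_source = {"identification": 1,  "exclusion": 70, "inconclusive": 19, "no value": 10}

n_same, n_diff = sum(same_source.values()), sum(diff_source.values())

# 2.1 Accuracy of conclusions: conclusions that match ground truth.
correct_identifications = same_source["identification"]
correct_exclusions = diff_source["exclusion"]
erroneous_identifications = diff_source["identification"]
erroneous_exclusions = same_source["exclusion"]

# 2.2 Correct discrimination: correct definitive conclusions over all pairs.
discrimination = pct(correct_identifications + correct_exclusions, n_same + n_diff)

# 2.3 Appropriate vs. inappropriate conclusion rates: no-value and inconclusive
# conclusions fail to match ground truth and are therefore "inappropriate".
inappropriate = pct(same_source["inconclusive"] + same_source["no value"]
                    + diff_source["inconclusive"] + diff_source["no value"],
                    n_same + n_diff)

print(f"discrimination {discrimination:.0f}%, inappropriate {inappropriate:.0f}%")
```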

3. The Ulery et al. [1,2] experiments

3.1. The Ulery et al. [1] experiment

Ulery et al. [1] is the only accuracy experiment cited in a Federal Court decision to admit fingerprint comparison evidence (USA v. Love [15]), which grants it greater authority than the other experiments in Table 1.

The authors tested 169 highly trained, highly experienced latent print examiners. Nearly all were certified as exceptionally skilled and proficient, either by the International Association for Identification (IAI), the FBI, or the laboratory in which they worked. Each examiner was individually presented about 100 latent–exemplar pairs, for a total of 17,121 trials. Overall, for 70% of the pairs, the latent and exemplar prints were from the same source, and the remaining 30% were from different sources. For each examiner, the 100 pairs were selected randomly from a pool of 744 pairs created by the experimenters, so that no two examiners compared the same 100 pairs. Each examiner was sent a disk with his own set of 100 trials. The examiners carried out the experiment on their own computers and returned their conclusions to the authors when finished.

For each trial, a latent print was first presented on the screen without an exemplar, and the examiner was asked to decide whether the latent contained sufficient quality and quantity of information for identification, just for exclusion, or for neither (i.e., no value). If the latent print was judged of no value, that conclusion was recorded, the paired exemplar never appeared, that trial ended for that latent print, and a new latent print appeared. If the latent was judged to be of value for either identification or exclusion, its paired exemplar then appeared alongside, and the examiner, after carrying out an examination, had to conclude whether the pair was an identification, an exclusion, or inconclusive. When the examiner's conclusion was entered, a new latent appeared.

3.2. Accuracy of conclusions

3.2.1. Correct identification conclusions

Setting aside latent prints judged of no value (so that no pairing occurred), for the same-source pairs for which the correct response was identification, 45% were correctly identified; the remaining 55% were missed identifications. These missed identifications included 13% that were erroneously excluded and 42% that were inconclusive.

3.2.2. Erroneous identification conclusions

When examiners did conclude identification, they were correct 99.9% of the time. Only six identification conclusions were made to the different-source pairs, each made by a different examiner. This gives an erroneous identification rate of 0.1%.

3.2.3. Correct exclusion conclusions

Setting aside latent prints judged of no value, for the different-source pairs, 79% were correctly excluded, while the remaining 21% were missed exclusions. The missed exclusions were all inconclusive except for the six erroneous identifications. When examiners concluded exclusion, they were correct 87% of the time. The erroneous exclusion rate (for same-source pairs) was 13%.

3.2.4. Correct discrimination between same-source and different-source pairs

Combining the 3707 correct identifications and the 3949 correct exclusions, examiners correctly discriminated between same-source and different-source pairs for 58% of the pairs they compared, and failed to designate the correct source for the remaining 42% of those pairs.

3.2.5. No value (inappropriate) conclusions

The results showed that overall 23% of the latent prints (3947 out of 17,121 presentations) were judged of no value and were not compared. If the randomly chosen same- and different-source pairings had been of equal difficulty, the percent of those latent prints rejected as being of no value would be equivalent for the same- and different-source pairs. However, the results showed that the latent prints that were to have been used in the same-source pairs received seven times as many no-value conclusions (3389, or 86%) as did the latent prints that were to be used in the different-source pairs (558, or 14%).

3.2.6. Inconclusive (inappropriate) conclusions

The examiners judged 37% of the pairs as inconclusive (4907 out of the 13,174 pairs that were compared). These inconclusive conclusions were not distributed evenly between the same- and different-source pairs. The same-source pairs received four times as many inconclusive conclusions (3875, or 80%) as did the different-source pairs (1032, or 20%).

Combining the no-value and inconclusive conclusions (a total of 8854), 52% of the total trials (17,121) received inappropriate conclusions.

3.2.7. Appropriate conclusions

The remaining 8267 trials (or 48% of the 17,121 trials) received definitive conclusions (identification or exclusion). However, 617 of these were erroneous (611 erroneous exclusions and 6 erroneous identifications). When these 617 conclusions are removed, then 7650, or 45% of the total trials, received appropriate and correct conclusions. This result is not reported in Ulery et al. [1].

The appropriate and correct conclusions were not distributed evenly between the same- and different-source pairs. The same-source pairs received only 32% appropriate and correct conclusions (3703 out of 11,578). The different-source pairs received 71% appropriate and correct conclusions (3647 out of 5,543), more than twice as many as did the same-source pairs.
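
The inappropriate and appropriate rates reported in Sections 3.2.5 through 3.2.7 reduce to the following arithmetic over the trial counts quoted in the text; the code is our own restatement, not the authors'.

```python
# Tally of appropriate vs. inappropriate conclusions in Ulery et al. [1],
# using the counts quoted in the text above.
total_trials = 17121
no_value = 3947
inconclusive = 4907
definitive = total_trials - no_value - inconclusive   # 8,267 identifications + exclusions
erroneous = 611 + 6                                   # erroneous exclusions + erroneous identifications

inappropriate = 100.0 * (no_value + inconclusive) / total_trials            # ~52%
appropriate_and_correct = 100.0 * (definitive - erroneous) / total_trials   # ~45%
print(f"inappropriate {inappropriate:.0f}%, appropriate and correct {appropriate_and_correct:.0f}%")
```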

3.3. Reliability of conclusions

Since each examiner received a different selection of pairs, none of the pairs were presented 169 times, which would have allowed the conclusions to be analyzed for consensus. Each pair was viewed by an average of 39 examiners (23% of the 169). Hence, the numbers that follow are based on much lower frequencies than 169.

3.3.1. Consensus on value conclusions

Only 43% of the value conclusions for the latent prints were unanimous: 15% of the latent prints were unanimously concluded to be of no value for comparison, and 28% were unanimously concluded to be of value. On the remaining 57% of the pairs, the examiners differed from one another on their value conclusion.

3.3.2. Consensus on identification conclusions

Only 15% of the same-source pairs were identified by all of the examiners. Of the remaining 85%, 46% of the pairs received unanimous conclusions that were inappropriate or erroneous (exclusion). The remaining 39% of the same-source pairs were inconsistently judged: the examiners differed among themselves.

3.3.3. Consensus on exclusion conclusions

Seventy-five percent of the different-source pairs were correctly excluded by all examiners. Of the remainder, 5% were unanimously judged inconclusive (an inappropriate conclusion), and 20% received differing conclusions across the examiners.
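
The consensus measure used in these three subsections is simply the share of pairs on which every examiner who viewed the pair gave the same conclusion. A minimal sketch, with made-up data of our own:

```python
# Hypothetical data: conclusions[pair] lists the conclusions of the examiners
# who happened to view that pair (about 39 of the 169, on average, in Ulery et al. [1]).
conclusions = {
    "pair A": ["identification"] * 5,
    "pair B": ["exclusion", "inconclusive", "exclusion"],
    "pair C": ["no value"] * 4,
}

unanimous = sum(1 for verdicts in conclusions.values() if len(set(verdicts)) == 1)
consensus_rate = 100.0 * unanimous / len(conclusions)
print(f"{consensus_rate:.0f}% of pairs received a unanimous conclusion")   # 67% here
```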

3.4. Reliability of examiners

These data were not reported, but we used the authors' Fig. 7 for estimates. The average identification rate for the same-source pairs was 45%. The best subject identified 65% of the same-source pairs presented, and the poorest identified only 20%.

On the different-source pairs, the average correct exclusion rate was 79%. The best examiner excluded virtually every different-source pair, and the poorest excluded only 40% of the different-source pairs.


3.5. The Ulery et al. [2] experiment

The same authors reported the results of a second presentation of some of the same pairs to a subset of the same examiners who had been used in Ulery et al. [1]. This was strictly a reliability-of-conclusions study: how many times would the same conclusion be given when a trial was repeated a second time at some later date.

The elapsed time between the original testing and the repetition testing averaged seven months. Seventy-two of the original 169 examiners were retested: they received 25 re-presentations of pairs they had compared previously, 16 from the same-source pairs and 9 from different-source pairs. The particular 25 pairs differed for each of the 72 examiners. If an examiner had originally made an erroneous identification and/or an erroneous exclusion, those were re-presented. The rest of the 25 pairs were selected randomly from the remaining original 100 pairs shown to each examiner. The authors were not able to measure the reliability of examiners' consensus among one another, because each examiner was given different pairs in the repeatability testing.

3.6. Reliability: Percent of repeated pairs receiving the same conclusions from the same examiner

Overall, about 90% of the repeated test items received the same response on their second presentation. On the 16 same-source pairs, 89% of the original identification conclusions were repeated, and 11% were changed, most to inconclusive and a few to no-value. On the different-source pairs, 90% of the exclusions were repeated, and 10% were changed, mostly to inconclusive and a few to no-value. None of the few erroneous identifications were repeated, and no further erroneous identifications were made.

The Ulery et al. [2] result is that 10% of the conclusions were inconsistent within the same examiners. While Ulery et al. do not comment on this magnitude, if this result were applied to casework, it would suggest that one out of ten conclusions of a fingerprint examination would be different if the comparison was repeated at a later date by the same examiner.
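
The repeatability measure in this subsection is computed within each examiner: the share of re-presented pairs that drew the same conclusion on both presentations. A sketch with invented conclusions (ours, not the authors' data):

```python
# first[pair] and second[pair] hold one examiner's conclusions on the two presentations.
first  = {"p1": "identification", "p2": "exclusion", "p3": "inconclusive", "p4": "identification"}
second = {"p1": "identification", "p2": "inconclusive", "p3": "inconclusive", "p4": "identification"}

repeated = sum(1 for pair in first if first[pair] == second[pair])
repeatability = 100.0 * repeated / len(first)
print(f"repeatability {repeatability:.0f}%")   # 75% for these invented conclusions
```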

4. Design problems in Ulery et al. [1,2] experiments

We highlight six design or analysis issues that pertain exclusively to the two Ulery et al. experiments. Each of these reduces the usefulness of their conclusions and their generality. We consider, in Section 7 below, design and analysis issues that we found in nearly all of the experiments in Table 1.

4.1. The “value-only-for-exclusion” conclusion

Three issues are of concern about this conclusion.

First, the examiners were asked on a post-experiment questionnaire if they used the conclusion "of-value-only-for-exclusion" in their normal casework. Only 17% said yes. The remaining 83% of the examiners may have interpreted this unfamiliar conclusion in a variety of different ways, making its application inconsistent across examiners. As a consequence, the results pertaining to this and the other value assessments of the latent prints cannot be interpreted.

Second, while the experiment allowed examiners to make three levels of value conclusions, the experimental design confounded two of them. When examiners judged a latent to be of-value-only-for-exclusion, they were still allowed to compare it and then offer conclusions inconsistent with the "only" in the conclusion: they were allowed to conclude identification or inconclusive. The value judgment of "of value only for exclusion" was then re-recorded as "of value." The subsequent identification judgments, in particular, seem contrary to the definition of "value-only-for-exclusion."

Third, the of-value-only-for-exclusion conclusion was made to 3122 (18%) of the 17,121 latent prints. After the exemplar appeared, only 500 of those 3122 exemplar prints (16%) were actually excluded as the source. Nearly all of the rest (2622) were judged inconclusive (84%), and 1% were identified. All trials labeled as of-value-only-for-exclusion that were not excluded were combined with value-for-comparison when scored. The better and more useful scoring and experimental design for an of-value-only-for-exclusion judgment is that if the examiner reaches any conclusion other than exclusion after seeing the exemplar, the value conclusion is re-scored as no-value and a new trial begins. This was not done on the 2622 trials in which the examiner failed to exclude the exemplar after having concluded that the latent was useful only for exclusion. These 2622 trials were not tagged in the results, so they could not be eliminated from further analyses. Their inclusion confounds the further results reported.

Based on these concerns, the "of-value-only-for-exclusion" conclusion is confounded with the "value-for-identification" conclusion, making both value judgments difficult to interpret.

4.2. Random assignment of pairs to subjects

The design called for each subject to be shown about 100 pairs of fingerprints. The best design would have been to present the same 100 pairs to each subject. This would have permitted standard statistical tests on the manipulated variables among the 100 pairs. The authors did not do this.

Instead, each subject's 100 pairs were drawn randomly from the corpus of latent and exemplar prints comprising the 744 pairs that were constructed by the authors. As a result, the content of the trials differed from subject to subject. For example, while 30% of the 744 pairs were created as different-source pairs, the 169 subjects received from 26% to 53% different-source pairs among their 100 trials, a two-to-one variation due to the random sampling from the pool. A comparable range occurred for the same-source pairs. The number of instances of the same-source versus different-source pairs varied from subject to subject, and the particular pairs differed from subject to subject. This variation cannot be captured in data analyses, and inflates the error variance in their experiment.

This problem recurred in Ulery et al. [2] for the re-presentation of a selected set of pairs. Each examiner received a different set of pairs. This prevented distinguishing between inconsistent judgments resulting from the difficulty of the latent–exemplar pairings and inconsistent judgments resulting from inconsistency among different individual examiners. Since the purpose of the Ulery et al. [2] experiment was to measure reliability, the confounding of test-item reliability and examiner reliability limits the usefulness of the results. The authors do not discuss this limitation.

The authors did comment more than once that because the content of trials varied from subject to subject, many important analyses could not be done. Most seriously, this flaw precluded most statistical significance tests of the variables in their experiment. This design issue would have been avoided if the same pairs had been presented to all of the examiners.

4.3. Repetition of latent and exemplar prints

While the authors do not describe how many of the same exemplar prints were presented more than once to an examiner, repetition had to have occurred. For example, the authors created the 744 pairs from a total of 356 latent prints and 484 exemplar prints. To make 744 pairs, each latent print had to have been used about twice on average (and some could have been used more often, depending on the (unreported) efforts of the authors to limit repetition). Similarly, many of the exemplar prints had to have been used more than once.

This repetition of some of the same prints creates an uncontrolled variable of familiarity. Familiarity is one of the most potent variables known to increase accuracy in perceptual tasks (Haber and Hershenson [17]). An unknown number of the prints used in this study potentially benefitted from multiple presentations. A more appropriate estimate of the error rate would have avoided all repetitions, or else the repeated latent and exemplar prints should have been removed from the data set or analyzed separately.

4.4. Imbalance in results of same-source and different-source pairs

Examiners are heavily penalized by their profession for erroneous identifications, but not for erroneous exclusions. Examiners are explicitly praised for caution in making identifications, and are trained to conclude identification only when they are absolutely certain. As a result, more inconclusive responses and fewer correct definitive responses would be expected for same-source pairs. This expected result was found.

The examiners correctly excluded 79% of the different-source pairs, but correctly identified less than half (45%) of the same-source pairs.

The examiners made appropriate and correct conclusions more often for the different-source pairs (71%) than for the same-source pairs (24%).

The examiners judged only one-fourth as many of the different-source pairs inconclusive (20%) as they did the same-source pairs (80%).

The examiners made fewer erroneous identifications among the different-source pairs (0.1%) than erroneous exclusions among the same-source pairs (13%).

There is a second interpretation of these four findings: the different-source pairs were easier to compare than the same-source pairs, so poorer performance would be expected on the more difficult same-source pairs. There is one finding that can only be interpreted as a difference in difficulty. The examiners made only one-seventh as many no-value conclusions among the latent prints intended for the different-source pairs (14%) compared to the latent prints intended for the same-source pairs (86%). Since the no-value conclusions were made before the latent was paired with an exemplar, the latent prints in the pool for different-source pairs must have been significantly easier. Because only the different-source pairs can be used to estimate the erroneous identification rate, if the different-source pairs were easier than the same-source pairs in the Ulery et al. [1] experiment, the low erroneous identification rate of 0.1% would have been higher had the difficulty been equivalent.

The authors stated that their selection procedures were specifically designed to make the different-source pairs more difficult to compare than the same-source pairs. However, these results suggest that this manipulation failed, and the different-source pairs were actually easier.

The difficulty issue is also present in the Ulery et al. [2] experiment. The authors reported that repeatability scores in Ulery et al. [2] were lower for the pairs judged more difficult by the examiners in the 2011 experiment. However, the authors do not report the repeatability scores separately for the easy versus difficult pairs (data they had available), nor between same- and different-source pairs (which the authors could have used in these analyses). Since the authors had some data from the initial Ulery et al. [1] experiment that they used to estimate difficulty, these results could have been used to refine the scoring of the Ulery et al. [2] experiment.

4.5. Duration between retests in Ulery [2]

The authors reported that the test–retest interval averaged only seven months: for some examiners it was shorter. The examiners were not asked if they were aware of the repetition (e.g., were any of the pairs familiar?), and whether they remembered their earlier comparison and conclusion. Some recognition seems likely, since the design of the Ulery et al. [2] experiment and its procedures were identical to the earlier one. To avoid the possibility that memory could have helped preserve the same conclusions, and elevated the repeatability scores, the time interval should have been treated as an independent variable.


4.6. Sampling of the examiners retested in Ulery et al. [2]

The authors omitted to report how the 72 retested examiners were selected from among the original 169 examiners. If the Ulery et al. [2] examiners were disproportionately sampled from the better examiners in the Ulery et al. [1] experiment (e.g., those from the FBI laboratories), the test–retest accuracy results might be higher than they would have been otherwise.

4.7. Other more general problems that also affected the Ulery et al. [1,2] experiments

After we consider the remaining experiments, we note the common problems that flaw most of them (see Section 7 below). Nearly all of these also apply to the two Ulery et al. [1,2] experiments. These include non-random sampling of the examiners who served as subjects, lack of significance tests of differences, idealized working conditions, and the use of single prints to compare rather than comparing each latent to ten-print cards.

4.8. General conclusions regarding the Ulery et al. [1,2] experiments

The uncontrolled difference in difficulty between the same- and different-source pairs suggests that the low erroneous identification rate found was due to easy different-source pairings, and would have been higher had comparable pairings been used. Similarly, the non-random sampling of the examiners who served as subjects suggests that the erroneous identification rate would have been higher among average examiners. Further, in spite of the authors' intentions to respond to the National Academy of Sciences [18] critiques, these results do not provide evidence of the validity or reliability of the ACE method, since that method explicitly was not assessed in these studies. Finally, the results show high levels of unreliability: examiners did not always agree with their own previous conclusions in Ulery et al. [2], and they often disagreed with each other in Ulery et al. [1]. The reliability results suggest that the outcome of a particular comparison depends more on which examiner is assigned to the case than on the physical characteristics of the stimulus print to be compared.

5. Other experiments that did not manipulate biasing information

We next consider four additional studies listed in Table 1. These experiments range greatly in the number of examiner-subjects, the number of comparisons made, the ratio of same- to different-source pairs, and the conclusions allowed. Even so, they provide some data on accuracy and error rates.

5.1. Langenburg, Champod and Genessay [3]

The authors asked 176 skilled examiners to compare the same 12 pairs of fingerprints (a total of 2112 trials), of which seven were same-source pairs and five different-source pairs. The no-value conclusion was not allowed. (Note: the purpose of the experiment was to evaluate several aids or techniques that might improve accuracy or reliability. To do this, the authors divided the 176 examiners into six groups: one control in which no aids were used, and five additional groups in which one or more aids were used. Since only one of the aids produced a significant improvement, and the absolute magnitude of that effect was small, we report the data from all 176 examiners combined.)

5.1.1. Accuracy of conclusions

Of the seven same-source pairs, 63% were correctly identified. For the remaining 37%, 31% were inconclusive and 6% erroneous exclusions (the erroneous exclusion rate). Of the five different-source pairs, 80% were correctly excluded. For the remaining 20%, 17% were inconclusive and 3% were erroneous identifications.


5.1.2. Correct discrimination between same- and different-source pairs

Combining the accuracy of identifying the same-source pairs (63%) and the accuracy of excluding the different-source pairs (80%), the overall correct discrimination rate was 70%.
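
The 70% figure is the pair-weighted combination of the two accuracy rates; the check below is our own, using only the numbers quoted above.

```python
# Weighted combination of per-source accuracy for Langenburg et al. [3].
same_pairs, same_correct = 7, 0.63    # seven same-source pairs, 63% correctly identified
diff_pairs, diff_correct = 5, 0.80    # five different-source pairs, 80% correctly excluded

overall = (same_pairs * same_correct + diff_pairs * diff_correct) / (same_pairs + diff_pairs)
print(f"{overall:.0%}")               # approximately 70%
```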

5.1.3. Reliability: consensus of conclusions

None of the 12 pairs received the same conclusion from all of the examiners. Four of the pairs received two different conclusions (either correct or inconclusive), and eight of the 12 pairs received all three allowed conclusions (correct, erroneous and inconclusive).

5.1.4. Reliability: consensus among examiners

The authors did not report the reliability of examiners in their agreement with one another, but we extracted this information from their appendix. Using just the control group of 24 examiners, for the same-source pairs, agreement was low: no examiner identified all seven, half identified six or five of the seven, and the remaining half identified four or fewer of the seven. Agreement was somewhat higher on the different-source pairs: half of the examiners excluded all five pairs, with the other half excluding fewer. Another analysis showed that only two examiners shared the same pattern of conclusions across the 12 pairs, with the remaining 22 examiners having patterns that did not match any other examiner.

5.1.5. Specific problems with Langenburg et al. [3]

While this experiment reported the second-highest correct conclusion rate among the experiments (70%), it showed the highest erroneous identification rate (3%) among the 13 experiments. The authors comment that they expected the high erroneous identification rate because they worked hard to make the different-source pairs more difficult. However, the experiment produced no evidence that the different-source pairs were more difficult: in fact, they received a higher percentage of correct conclusions (80%) than did the same-source pairs (63%). So the relatively high erroneous identification rate cannot be attributed to more difficult different-source pairs as compared to same-source pairs.

The authors created different-source pairs that had been ranked by AFIS as highly similar, and assumed this was a valid procedure to select difficult comparisons. The data suggest this procedure failed. Only one other experiment in the corpus selected pairs from an AFIS output (Tangen et al. [13]), so the results cannot be combined or compared with the other experimental results in the corpus.

5.2. Evett and Williams [4]

Evett and Williams published the first experiment that reported data on the accuracy and reliability of fingerprint comparisons. Their primary purpose was to determine the usefulness of the standard then in use in the United Kingdom to conclude identification, which required 16 or more points of agreement between the unknown latent print and the known exemplar print.

The subject-examiners were 130 of the most highly skilled examiners in the United Kingdom and Wales. Each was asked to compare the same ten latent–exemplar pairs. Nine of the ten pairs were from the same source, and one was from different sources. All of the latent prints were considered to be of value. (Ground truth was not known absolutely: the nine same-source pairs were taken from real Scotland Yard cases, and the authors used a consensus of skilled examiners to assign their source. Similarly, a consensus agreed that all pairs were of value.) At the time of this study, examiners in the United Kingdom used four conclusions, which approximately, but not exactly, corresponded to the four conclusions recommended by SWGFAST [16]: "insufficient detail for an opinion" (no value); "not identical" (exclusion); "probable identification," defined as a likely match but one that lacked 16 points of agreement, which is much stronger than the SWGFAST category (inconclusive); and "full identification," containing 16 or more points of agreement (identification). The examiners were asked to report their conclusion for each pair, and for those pairs judged identification or probable identification, they were asked to indicate the number of features they found in agreement between the latent and exemplar in that pair.

5.2.1. Accuracy of conclusions

Fifty-nine percent of the same-source pairs were correctly identified and 41% were missed identifications. On the single different-source pair, 50% of the examiners correctly excluded the exemplar, and the remaining 50% made a missed exclusion. No erroneous identifications were made to the single different-source pair, but 8% of the examiners excluded one of the same-source prints (erroneous exclusions).

5.2.2. Correct discrimination between same-source and different-source pairs

The combined accuracy of both correct identifications of the nine same-source pairs (59%) and correct exclusions of the single different-source pair (50%) was 57%.

5.2.3. Reliability among conclusions

The authors reported the range of conclusions for only four of the nine same-source pairs. For every one, more than one conclusion was offered. One pair drew two different conclusions (identification and probable identification) of the possible four; the remaining three pairs each drew three of the four possible conclusions. One pair was judged not identical (exclusion) by 8% of the examiners, but probable identification by 54% of the others. On the single different-source pair, only about half of the examiners' conclusions were exclusion, with the remaining half concluding insufficient information (no value). These data show that for each of the latent–exemplar pairs, these examiners produced multiple conclusions.

A second measure of the reliability of conclusions assesses the amount of consensus for each pair. For one of the four pairs for which full identification was the correct conclusion, 97% of the examiners concluded identification. The other pairs had much lower consensus of conclusions, ranging down to less than half of the examiners for any single conclusion.

In this experiment, a third measure of reliability was available: examiner concurrence on the number of points in agreement when the conclusion was full or probable identification. For every pair, the number of matching points reported by the examiners varied dramatically. For the most extreme pair, the number of points in agreement ranged from a low of 13 points for one examiner to a high of 54 points for another examiner.

5.2.4. Reliability among examiners

The authors reported that on the seven same-source pairs that the experts selecting the pairs had ranked as full identifications, only 28% of the examiners agreed among themselves that all seven were identifications; another 23% agreed that six of the seven were identifications. The remaining 50% of the examiners failed to identify two or more of the same-source pairs. One examiner identified only one of the seven same-source pairs. Hence, only one of the full-identification pairs received some agreement, and that pertained to only 28% of the examiners.

5.2.5. Specific problems with Evett and Williams [4]

This experiment is often quoted as an estimate of high accuracy and low error rates (e.g., Langenburg [5]), but its results do not support that assessment. For example, it has only one different-source pair, so these examiners had little chance to make an erroneous identification. Barely half of the conclusions were correct, and low reliability among examiners and among pairs was the common result. The authors report detailed analyses for only four of the ten pairs, and offer no information on the representativeness of those four.

Table 2
The findings from five experiments [1,3–6], listed in the order described. The results are in percentages: correct identifications, correct exclusions, correct conclusions, missed identifications, missed exclusions, erroneous identifications and erroneous exclusions. Meagher [6] did not provide complete data.

Exp #   Correct Ident.   Correct Excl.   Correct Concl.   Missed Ident.   Missed Excl.   Erron. Ident.   Erron. Excl.
1       45               79              62               55              21             0.1             13
2       63               80              70               37              2              3               6
3       91               21              78               9               79             1               1
4       59               50              57               41              50             0               8
5       77               –               –                23              –              –               –

The experiments are numbered as follows: 1. Ulery et al. [1]. 2. Langenburg et al. [3]. 3. Langenburg [5]. 4. Evett and Williams [4]. 5. Meagher [6].

5.3. Langenburg [5]

Six experienced examiners from the same laboratory as the experimenter compared 49 same-source pairs (294 trials) and 11 different-source pairs (66 trials) and were asked for their conclusions. The no-value conclusion was not allowed.

5.3.1. Accuracy of conclusions

For the 294 same-source trials, 91% were identified, 8% were inconclusive, and 1% were erroneous exclusions. For the 66 different-source trials, only 21% were excluded, 78% were inconclusive, and 1% were identified (the erroneous identification rate).

5.3.2. Correct discrimination

The correct discrimination between same- and different-source pairs was 78%. This is high even though only 21% of the different-source pairs were correctly excluded. The latter result would have had more impact on the correct discrimination rate if the experiment had used more different-source pairs.
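As a check on how this pooled figure arises (our own arithmetic from the trial counts reported above, not a calculation presented by Langenburg [5]), the correct discrimination rate is the trial-weighted average of the two accuracy rates:

\[
\frac{294 \times 0.91 \;+\; 66 \times 0.21}{294 + 66} \;\approx\; \frac{268 + 14}{360} \;\approx\; 0.78
\]

Because same-source trials outnumber different-source trials by nearly 4.5 to 1, the pooled figure is dominated by same-source performance, which is why the 21% exclusion rate barely registers in the 78% discrimination rate.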

5.3.3. Consensus of conclusions

All examiners identified 38 of the 49 same-source pairs (78%), and 11 pairs (22%) received two or more conclusions. All the examiners excluded 7 of the 11 different-source pairs (64%), and 4 pairs (36%) received two or more conclusions.

5.3.4. Consensus among examiners

On 43 of the 60 pairs, the examiners were unanimous in giving the same conclusion (72%); on the remaining 17 pairs (28%), the examiners were not consistent.

5.3.5. Specific problems in Langenburg [5]

The examiners were selected from a single laboratory, without information about their relative training and experience compared to a wider sample of laboratories. The different-source pairs produced a dramatically different pattern of results from those found in every other experiment, including others by the same experimenter (Langenburg et al. [10] and Langenburg et al. [3]). For these reasons, the results cannot be combined with results from other experiments, nor can the results be interpreted. It is impossible to determine whether the deviant results are due to the particular different-source test pairs presented, to the particular examiners, or to some other variable.

5.4. Meagher [6]

This was not a report of a peer-reviewed experiment, but the results of a survey by the FBI presented in a Daubert hearing held prior to the trial of USA v. Brian Mitchell [19]. Ground truth was not known. The FBI sent Mr. Mitchell's ten-print card and the two latent prints lifted from the getaway car (attributed by the FBI to Mr. Mitchell) to 50 fingerprint crime laboratories in the U.S., asking the best examiner in each laboratory to compare the two latent prints to Mr. Mitchell's exemplars and provide their conclusions.

5.4.1. Accuracy of conclusions

Thirty-nine of the laboratories responded, and 30 reported they identified the latent prints to Mr. Mitchell's exemplars (the FBI's conclusion). However, nine (23%) of the laboratories concluded they could not identify Mr. Mitchell from these latent prints. Since the true source of these pairs is unknown, accuracy of conclusions is either 77% or 23%.


5.4.2. Reliability of conclusions

A quarter of the examiners (23%) reached a different conclusion from the other 77%.

5.4.3. Specific problems with Meagher [6]

At the trial, Mr. Meagher also described the letter he sent with the request. In it, he framed the request as a check on the accuracy of the FBI examiners. The examiners were told the FBI's identification conclusion. This could have acted as a strong bias to agree.

5.5. Summary of results of these five experiments

Both the accuracy and the reliability results among these five experiments are highly variable between experiments and, in some cases, even within an experiment. These results are shown in Table 2. (We excluded Ulery et al. [2] because they did not report accuracy results.)

5.5.1. Correct conclusions

Across the five experiments, correct identification of the same-source pairs ranged from a high of 91% (Langenburg [5]) to a low of 45% (Ulery et al. [1]). Correct exclusion of the different-source pairs ranged from 79% (Ulery et al. [1]) to 21% (Langenburg [5]). Discrimination between same- and different-source pairs ranged from a high of 78% (Langenburg [5]) to a low of 57% (Evett and Williams [4]).

5.5.2. Erroneous conclusions

The erroneous identification rate ranged from a low of 0.1% (Ulery et al. [1]) to a high of 3% (Langenburg et al. [3]). The erroneous exclusion rate ranged from a low of 1% (Langenburg [5]) to a high of 13% (Ulery et al. [1]). The missed identification rate ranged from a low of 9% (Langenburg [5]) to a high of 55% (Ulery et al. [1]).

Table 2 shows that the pattern of results across experiments was extremely inconsistent. The most extreme contrast was between Langenburg [5] and Ulery et al. [1]: Langenburg [5] reported the highest accuracy for same-source pairs (91%) and the lowest accuracy for different-source pairs (21%). In contrast, Ulery et al. [1] reported the opposite pattern: the lowest identification accuracy for same-source pairs (45%), but relatively high exclusion accuracy for the different-source pairs (79%).

6. The final seven experiments

Table 1 contains four experiments (7, 8, 9, and 10) that were designed to show the effects of bias on accuracy and reliability: would an examiner change a previous conclusion if given new information suggesting that a different conclusion was the correct one? Because each of these experiments asked examiners to offer conclusions following examinations, they met our criteria for selection of experiments for inclusion in Table 1. However, their bias manipulations created unusual conditions with respect to measurement of accuracy and reliability, such that these experiments are not comparable to the ones already considered. Therefore, we do not consider them further in the context of this article.

The remaining three studies also met our criteria for inclusion in Table 1. However, design flaws prevent them from being compared to the initial set of five we described, and each study's flaws highlight different problems in designing accuracy and error rate experiments. Each of these three studies also reported low erroneous identification rates, and their results have been used as evidence of accuracy (e.g., Langenburg [5], Langenburg et al. [10]). As we show, such conclusions are inappropriate and misleading.

6.1. Wertheim, Langenburg and Moenssens [11]

These authors conducted six training sessions of a week's duration, including about 100 fingerprint examiners of varying fingerprint experience and training. The authors combined the six sessions into a single data set. Because the fingerprint comparisons required during the training courses were designed to improve comparison skill, and not to estimate accuracy and reliability of examinations, the results cannot be used for estimates of accuracy or reliability.

For example: examiners knew that every latent print had its source among the exemplars presented; the difficulty level of the required comparisons was adjusted to each examiner's training and skill; examiners could ask for help, which narrowed the search among the exemplars to a single "suspect"; examiners could decide which latents to compare, and could leave the remainder blank; the missed identifications (blanks left on the answer sheet) were not counted, so the frequency of missed identifications could not be scored; trainees were allowed to skip cases that they found very difficult, thereby eliminating cases that would be more likely to produce erroneous identifications; examiners were not permitted to conclude no value or inconclusive; and no scoring of reliability was provided. This training study provides no basis for generalizing accuracy results beyond this particular training environment. A more detailed critique of this study is published (Haber and Haber [21]), along with responses from the authors.

We have retained this study in the corpus because these results are frequently and inappropriately quoted by fingerprint examiners in court or in print as showing a high accuracy and low error rate.

6.2. Gutowski [12]

Gutowski reported data from the published results of six years of world-wide latent fingerprint comparison proficiency tests offered through the Collaborative Testing Services (CTS) [22]. The results were from 2000 to 2005, with a total of 30,642 test-takers. The CTS tests included 87% same-source pairs and 13% different-source pairs, embedded in 10 to 12 cases in each test. The examiners were restricted by CTS instructions to only two conclusions: "identification" (by indicating the finger and name of the perpetrator) and "not identified."

6.2.1. Accuracy of conclusions

Of the 26,486 same-source pairs, the examiners identified 26,142, a correct identification rate of just under 99%, and concluded not identified for the remainder.

Of the 4156 different-source pairs, 4056 (97%) were not identified and 100 (2%) were identified; the latter is the erroneous identification rate. Because some of these 4056 not-identified conclusions may be inconclusive rather than exclusions, a correct exclusion rate cannot be computed. No reliability scoring was reported.
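These percentages follow directly from the reported counts (our own arithmetic, not a calculation presented by Gutowski [12]):

\[
\frac{26{,}142}{26{,}486} \approx 0.987, \qquad \frac{4{,}056}{4{,}156} \approx 0.976, \qquad \frac{100}{4{,}156} \approx 0.024
\]

That is, roughly 98.7% correct identifications, 97.6% not-identified responses to different-source pairs, and an erroneous identification rate of about 2.4%, in line with the rounded figures quoted above.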

6.2.2. Specific problems with Gutowski [12]

One problem is the sampling of examiners. While it is unknown to what degree the examiners taking the CTS tests are typical of the profession as a whole, Haber and Haber [23] suggested that these test-takers are likely to be among the best in the profession, and that a large percentage of the same examiners re-take the tests each year. Another problem is that the difficulty level of the test items is unknown, so that results from year to year cannot be compared or combined. A further problem is that the extremely high scores create a ceiling effect from which no conclusions or correlations can be drawn. (The very high scores have continued from 2006 to the present.) Two further problems are created by the atypical responses required on this test, as we discuss in Section 8 below.

We included Gutowski [12] in the corpus only because it is often quoted as a source of evidence for very low erroneous identification rates (e.g., Langenburg [10], Langenburg et al. [3], Reznieck et al. [25]). This conclusion is unwarranted.

6.3. Tangen et al. [13]

This study, like the two Ulery studies, was refereed and then published in a mainstream scientific journal. The authors are research scientists rather than forensic practitioners. The authors explicitly state that their goal was not to extend the results from these lab-based experiments to the "real world" of the accuracy of practicing examiners performing casework. "Generalizability in this context [the experiment] refers to the extent to which the difference between expert and novice performance is 'real', not the extent to which the laboratory setting resembles the everyday operations of a fingerprint bureau." (Thompson [46], p. 7.)

The authors asked 37 "naïve" undergraduate students (as a control group) and 37 skilled examiners from police stations across Australia to compare the same set of 36 latent–exemplar pairs, 12 of which were same-source pairs. Of the 24 different-source pairs, 12 were constructed so that the paired different-source exemplar resembled the latent (rated as hard different-source pairs), and 12 were constructed from randomly chosen different sources (rated as easier different-source pairs). In order to control for variability in the quality and quantity of information in the latent prints, each latent was randomly assigned to one of three exemplar types (match, similar distractor, non-similar distractor) for each participant.

6.3.1. Results

The practicing examiners were virtually perfect in their responses for both same- and different-source pairs: 92.1% and 100% correct, respectively. No differences were found between the difficult and easy different-source pairs: 99.3% and 100%, respectively. No reliability results were reported. The novices' performance was poorer, particularly on the similar-distractor pairs: 74.5% correct on the matched pairs, 77% on the non-similar non-matching pairs, and 44.8% on the similar-distractor pairs. The authors concluded that trained fingerprint examiners were experts as compared to untrained undergraduates.

6.3.2. Specific problems with Tangen et al. [13]

We consider eight design problems.

First, the authors required examiners to use a rating scale of 1 to 12, in which ratings of 1–6 were scored as identifications, with 1 very certain and 6 quite uncertain, and ratings of 7–12 were scored as exclusions, with 7 very uncertain and 12 very certain. The authors stated that they used the "new" response categories in order to contrast accuracy with confidence, and to apply a signal detection model to the results [26]. For undisclosed reasons, the authors did not report an analysis of confidence, and no information was reported on the relationship between accuracy and confidence for fingerprint comparison results.

Second, because the authors used novel response categories instead of the standard SWGFAST [16] categories of no value, exclusion, identification, or inconclusive [15], their results cannot be combined or compared with the rest of the corpus of experiments.


Third, without announcing it to the subjects, the authors recorded an identification conclusion whenever an examiner selected any number from 1 through 6, and an exclusion conclusion for any number from 7 through 12, even though subjects did not label their conclusions as identification or exclusion. Most examiners today are taught or required to conclude an identification only if they are absolutely certain they have reached the correct conclusion and are willing to testify to it in court. Tangen et al. scored conclusions of identification made with less than certainty, and some with minimum confidence. Thompson [46, p. 10] justifies this decision as a way to assess accuracy while eliminating individual differences in willingness to say "match" or "no match" with less than perfect confidence. This decision precludes generalizability to casework. In casework, examiners probably would report responses 2 through 11 as inconclusive, not as identification or exclusion. As a consequence, accuracy scores may be artificially inflated.

Fourth, Tangen et al. did not permit a conclusion of inconclusive, so there is no way to ascertain which conclusion the examiners would have given in typical casework. In many of the experiments reviewed here, the frequency of inconclusive judgments was quite high. Tangen et al.'s omission of inconclusive conclusions further prevents comparison of these results to the other experiments.

Fifth, virtually no errors were made in the experiment. As with several of the other experiments (e.g., Wertheim et al. [11], Langenburg et al. [10], Langenburg [5], Gutowski [12]), the ceiling effect of scores close to 100% prevents any differences in the pattern of results from being detected. Statistically, such ceiling effects render the data unreliable (see Section 7.4 below). Any attempt to correlate these results with training, proficiency, or the impact of any experimental variable (e.g., difficulty) can only produce correlations of near-zero magnitude. Tangen et al.'s finding of a tiny erroneous identification rate cannot be generalized to casework or to any other context because of the difference in response categories. While the authors state that this was not their intent, their results are nonetheless reported as error rate findings.

Sixth, the difficult non-match pairs were created by inputting a known-source latent to the Australian National AFIS and selecting the most highly ranked non-matching exemplar produced by the computer search. Examiners frequently report that computer searches may produce top candidates that to the human eye obviously do not match. The authors do not mention whether such exemplars were eliminated from the study.

Seventh, the authors divided the pairs into easy and difficult, but the results did not support the differentiation of difficulty.

Eighth, the latent prints and their matches in the experiment were taken from the Forensic Information Biometric Repository, as were the random non-matching prints. The latents were believed to vary systematically in quality. The only control exercised to ensure that all were of value for comparison (that is, that each contained sufficient information to make an identification) was "to ask several experts about the sufficiency of information in several prints to see whether they agree with each other and themselves on repeated examinations (i.e., between and within participant reliability)" (Thompson [46, p. 11]). These data are not reported.

6.4. Summary of conclusions about the final three experiments

The results of these three experiments do not contribute information about accuracy or error rates from fingerprint examinations. They should not be quoted in this context.

7. Common problems of generalizing these accuracy and error rate results to casework

We have reviewed particular problems in each of these experiments in generalizing their results to casework. Those problems limit the usefulness of individual experiments. In addition, these experiments share almost a dozen common problems that reduce or preclude generalization of their results collectively to casework performance. We start with five limitations that stem from poor reliability of the results.

7.1. Extreme variability of results across experiments

Each column in Table 2 above illustrates the variability in the results across these experiments. The result of one experiment does not predict the result of another.

In addition, most of these experiments demonstrated that the amount of agreement between examiners in their conclusions was low. Except for conditions where the results reach a ceiling at close to 100%, the examiners rarely reached a unanimous conclusion for the pairs they compared. The low reliability in the experimental results precludes inference of performance levels in casework: one examiner's conclusions do not match another's, and the performance of one examiner does not predict the performance of another. In the only within-subject analysis among the first five experiments, Ulery et al. [2] found that 10% of the conclusions reached by examiners for mated pairs changed when the same pairs were compared a second time. These within-subject results suggest that the performance of an individual examiner at a given time predicts with only 90% accuracy the same examiner's performance at a later time. For the non-mated pairs, predictability drops to 86%.

7.2. Lack of statistical tests of significance

Only two of the experiments [7,8] applied statistical tests to determine whether their results were significantly different from chance or from each other (see also Rosenthal and Dror [20]). In traditional peer-reviewed scientific journals, reports of empirical research without evidence of significance testing would normally not be considered for publication. Only two of these 13 experiments [10,13] were published in rigorously vetted scientific journals. It is alarming that these experiments were published without any statistical testing. Statistical testing of the reliability of results needs to be included in the experimental design process. The experiments with the most conspicuous absence of planning for statistical significance testing are the two Ulery et al. [1,2] studies.

Statistical tests provide several kinds of assurance. As one example, such tests document the extent to which the results differ significantly from chance, a test that cannot be performed when the results cluster at perfect performance. Statistical tests are also used to show that the scores actually test what they are intended to test. For example, if a test is intended to document proficiency, do the results correlate with other measures of proficiency? As a different example, does an examiner who demonstrates higher comparison accuracy on a proficiency test find more features in a latent than an examiner with poorer accuracy performance?

The absence of statistical tests of the experimental results reported here is bad science.
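To make the point concrete, the following is a minimal sketch, using entirely hypothetical counts and our own choice of method rather than anything drawn from the reviewed studies, of one kind of statistical treatment that could routinely accompany a reported error rate: a confidence interval conveying how much uncertainty surrounds an observed erroneous identification rate.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion k/n
    (z = 1.96 gives an approximate 95% interval)."""
    p_hat = k / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical counts, for illustration only: 5 erroneous identifications
# observed in 1,000 different-source comparisons.
low, high = wilson_interval(k=5, n=1000)
print(f"observed rate = {5 / 1000:.2%}, approximate 95% CI = ({low:.2%}, {high:.2%})")
```

Reporting an interval of this kind, rather than a bare percentage, would at least make explicit how strongly any estimated error rate depends on the small number of different-source comparisons most of these experiments included.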

7.3. Measurement of the difficulty of fingerprints

In each of these experiments, the authors attempted to select latent and exemplar prints at a particular level of difficulty, but most failed to present evidence of the accuracy of the selection procedures they used. The most common selection procedure relied on the concurrence among so-called expert examiners to select prints, but that procedure is not validated, and the results reported in these experiments suggest that the procedure is not valid. Today, there is no validated method, judgment or metric to evaluate the difficulty of a latent print, either in casework or in these experiments. This absence precludes combining or comparing results across studies. It is equally unknown whether the experimental and casework latent prints are similar in difficulty, and similar over the same ranges of difficulty. Experimental results cannot be generalized to casework without documentation that the types and difficulties of the latent prints are comparable in both. The same is true for the difficulty of exemplar prints.

The difficulty of comparing a latent print to an exemplar print depends not only on the information quality of the latent print (unknown) and of the exemplar print (unknown), but also on the similarities and discrepancies between the two prints when they are paired together (also unknown). These experiments used several criteria to control for similarity between the two prints in each pair: most common were consensus judgments of expert examiners. Three experiments used AFIS similarity ratings to select similar pairings (Langenburg et al. [3], Tangen et al. [13] and Ulery et al. [1]). No demonstrations were offered that either of these procedures produced pairs that predictably varied in comparison difficulty.

To the contrary: the results of some of the experiments themselves provide evidence that the attempt to manipulate difficulty failed. For example, no differences between supposedly easy and difficult prints were found [3,13], and the differences between same- vs. different-source pairs in Ulery et al. [1] were opposite to the authors' intent.

At present, there is no validated metric that specifies the difficulty of a latent–exemplar pairing, or the difficulty of comparing a latent to a ten-print exemplar. Until these measurements are validated, the difficulty of the pairs in an experiment cannot be compared to those in casework. Equally important, none of the results of these experiments can be compared to each other.

7.4. Ceiling effects

Two of the Langenburg experiments [5,10] and the three rejected experiments [11–13] report identification and/or exclusion accuracy results that cluster between 90% and 100%. Statistically, when a data set of responses is bunched at the top and contains nearly identical, near-perfect scores for each subject, the restricted variation among subjects lowers the reliability of the findings, usually to close to zero. It also calls into question the use of these results to test hypotheses or differences. For example, when everyone on a proficiency test effectively achieves the same very high score, the results of the test cannot be used to document proficiency or anything else. An examiner who achieves a 99% score cannot be shown to be more proficient than one who has a 95% score.
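A small simulation illustrates the statistical point. This is our own sketch under assumed parameters (a simple logistic relation between an examiner's underlying skill and the chance of a correct conclusion), not a model of any of the reviewed experiments.

```python
import math
import random
import statistics

random.seed(0)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

n_examiners, n_items = 200, 40
skill = [random.gauss(0.0, 1.0) for _ in range(n_examiners)]  # latent examiner skill

def simulate_scores(offset):
    """Percent-correct score per examiner; a large offset makes every item
    so easy that scores bunch near 100% (a ceiling)."""
    scores = []
    for s in skill:
        p_correct = 1.0 / (1.0 + math.exp(-(s + offset)))
        n_correct = sum(random.random() < p_correct for _ in range(n_items))
        scores.append(n_correct / n_items)
    return scores

easy = simulate_scores(offset=5.0)   # ceiling: nearly everyone scores near 100%
hard = simulate_scores(offset=0.0)   # scores spread across the range

print(f"easy test: mean score {statistics.fmean(easy):.2f}, "
      f"correlation with skill {pearson_r(skill, easy):.2f}")
print(f"hard test: mean score {statistics.fmean(hard):.2f}, "
      f"correlation with skill {pearson_r(skill, hard):.2f}")
```

With the same underlying spread of skill, the easy (ceiling) condition yields a markedly weaker correlation between true skill and observed score than the harder condition, which is why near-perfect group scores cannot be used to rank or validate individual examiners.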

Results like those reported by [5,10–13] admit three competing explanations. The first is that the test items were all too easy for the subjects being tested, so they all received near-perfect scores. The second is that the subjects in those experiments were selected (deliberately or by chance) to be among the most highly skilled in the profession; the subjects were not a random sample of the profession. The third is that the test items are representative, the subject selection is representative of the entire profession, and the very high scores truly reflect the accuracy of the entire profession. Without measures of the difficulty of latent and exemplar fingerprints, or their pairing, the first alternative cannot be ruled out, and it seems the most likely explanation. Without measures of the representativeness of the subjects relative to the population of fingerprint examiners as a whole, the second alternative cannot be ruled out, though evidence about subject selection for nearly all of these experiments showed non-random sampling biased toward high skill. The third alternative can be refuted by the results of the other experiments in Table 1: in spite of the design flaws, many examiners made mistaken conclusions.

The third alternative is also contrary to traditional evaluations of examiner skill: those with more training are better than those with less training; those with more supervision are better than those with less supervision; those with more experience are better than those with less experience; those who are certified are better than those who are not certified; and those with high laboratory supervisor ratings are better than those with lower ratings. Thus, some examiners are better than other examiners, and a reliable and valid test should reflect those individual differences.


7.5. Distribution of conclusions in experiments

The relative proportions of the different conclusions reached by examiners in casework (value or no-value for the latent, and exclusion, identification or inconclusive conclusions for each comparison) have not been empirically measured. Haber and Haber [23] estimated these distributions based on interviews with fingerprint examiners and laboratory supervisors, plus reviews of transcripts of testimony offered in criminal trials in which examiners gave their personal estimates of distributions of conclusions (e.g., Gische [24]). The no-value conclusion was estimated to occur for between 50% and 75% of all latent prints brought to an examiner. When comparisons are made to the remaining latent prints, virtually all are exclusions. Inconclusive judgments are rare in casework, and identification conclusions are even rarer: estimated to be less than 1% of all conclusions reached in casework.

The distribution of conclusions in each of the experiments differed vastly from these casework estimates. Only two of the experiments permitted no-value judgments [1,4], and, where allowed, the prevalence of no-value conclusions was half of casework estimates. The same imbalance occurred for identification conclusions: in contrast to less than 1% overall in casework, the proportion of same-source pairs in the experiments ranged from 100% [9] to 33% [13], with a concomitantly large number of identification conclusions.

This mismatch between casework and experimental design has two consequences. If the purpose of these experiments, even in part, is to estimate the erroneous identification rate, then the experiments should contain a substantial number of different-source pairs, because only different-source pairs can provide an opportunity to make erroneous identifications. These experiments included relatively few different-source pairs: the overall average was only about a quarter of the pairs. Yet the erroneous identification rate is singled out in these experiments as their most important finding, and the erroneous identification rates of Ulery et al. [1] have been used to justify the introduction of fingerprint evidence in court (USA v. Love [15]). If the purpose of the experiments is to estimate erroneous identification rates, the prevalence of same-source over different-source pairs is bad science and a biased experimental design. Given the small percentage of different-source pairs and the lack of their comparability to casework, generalizing the results of erroneous identification rates from these experiments to casework is unjustified.

The prevalence of same-source pairs over different-source pairs has a further consequence: the introduction of the unknown effect of examiner expectations when participating in these experiments. Research on decision theory (e.g., Swets et al. [26]) has shown that subjects in decision experiments adjust the distribution of their responses to the distribution of the stimuli they experience. Such expectancy effects were not mentioned and were not measured in any of the experiments, so there is no way to determine their effects on accuracy and reliability.

We have described elsewhere experimental designs that mimic the casework distribution directly, while employing double-blind procedures (e.g., Haber and Haber [23]; Haber and Haber [45]). If experimental test items and proficiency test items are inserted into the regular flow of casework in a laboratory, in a way that makes them indistinguishable from normal casework items, then the examiners will not know which items are from casework and which are experimental or proficiency items. The latter can be scored separately from the casework, so they provide an unbiased assessment of accuracy and reliability.
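A minimal sketch of that design, in code, might look like the following; the item structure, identifiers and the simplified scoring rule (which ignores inconclusive and no-value responses) are our own illustrative assumptions, not a published protocol.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    item_id: str
    ground_truth: Optional[str]  # "same source" / "different source" for test items, None for casework

def build_blind_queue(casework_items, test_items, seed=2024):
    """Interleave ground-truth test items with routine casework so that the
    examiner cannot tell which is which; only the administrator holds the key."""
    queue = list(casework_items) + list(test_items)
    random.Random(seed).shuffle(queue)
    answer_key = {item.item_id: item.ground_truth for item in test_items}
    return queue, answer_key

def score_test_items(conclusions, answer_key):
    """Score only the inserted test items; routine casework is left untouched."""
    results = {}
    for item_id, truth in answer_key.items():
        conclusion = conclusions.get(item_id)
        results[item_id] = (
            (truth == "same source" and conclusion == "identification")
            or (truth == "different source" and conclusion == "exclusion")
        )
    return results

# Example with hypothetical items: two casework items and one seeded test item.
casework = [Item("case-001", None), Item("case-002", None)]
tests = [Item("qc-117", "different source")]
queue, key = build_blind_queue(casework, tests)
print(score_test_items({"qc-117": "exclusion"}, key))  # -> {'qc-117': True}
```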

The remaining six design flaws artificially elevate the accuracy and/or reduce the errors in the experiments as compared to casework.

7.6. Non-random sampling of examiners in the experiments

None of these authors attempted to choose randomly from all examiners in the fingerprint profession. The descriptions of the examiners who participated (especially Ulery et al. [1,2]) suggest that they were substantially above average in skill and experience. To the extent that high examiner ability produces greater performance accuracy, comparison accuracy reported in these experiments would be higher than if the examiners were selected randomly. For this reason, the results cannot be generalized to the fingerprint profession as a whole, or to any single examiner, including a fingerprint examiner who testifies in a particular court case.

7.7. Non-adherence to casework procedures: pairing single fingerprints

In casework, the minimum-sized case consists of a single latent print and ten exemplar prints from a single suspect. Larger cases contain multiple latent prints and/or multiple suspects' exemplar prints, requiring ten comparisons for each latent print of value. However, the authors of these 13 experiments (excluding [11]) simplified the task so that the latent and exemplar prints were presented as a single pair. This made the comparison singular: the examiner did not have to search among exemplars, and did not have to evaluate the similarities between multiple exemplars and the latent; there was only one exemplar.

No study has yet contrasted the paired one-to-one design used in these experiments with the one-to-many design that characterizes fingerprint comparison casework. If the one-to-one design is shown to be easier than casework procedures, these experimental results would reflect more accurate performance than casework. This seems likely. Until clear evidence is accumulated that the one-to-one procedure provides comparable results to the one-to-many procedure used in casework, the absolute level of results (e.g., the erroneous identification rate) cannot be applied to casework.

7.8. Idealized working conditions in the experiments

Testing conditions in these experiments were far more conducive to accurate performance than conditions in typical casework. Examiners took these tests on their own time, without deadlines, supervision, interruptions, or distractions. The distractions, interruptions and time pressures common to casework have been widely demonstrated to decrease performance accuracy (Dror [27]). The better working conditions in the experiments, compared to casework, suggest that accuracy scores were inflated to an unknown extent, which prevents generalization of the results of these experiments to casework.

7.9. Knowledge of being tested

Substantial research has shown that subjects who know they are being tested perform better than when the tests are not announced and cannot be differentiated from routine work (e.g., Koppl et al. [28]). This testing variable increases performance accuracy in the experiments compared to casework. The examiners in each of these experiments knew they were being tested.

7.10. Absence of AFIS-produced exemplars in the experiments

In casework, based on estimates obtained from examiners, supervisors and testimony in court, at least 50% of cases do not have a suspect, and require AFIS to provide exemplars of potential suspects (Haber and Haber [23]). The candidate exemplars from an AFIS search are, by design, likely to be more similar to each other than exemplars from suspects developed through police investigation. Therefore, when AFIS provides the exemplars, the discrimination between the same-source exemplar and the other exemplars is more difficult. Further, Dror and Mnookin [29] and Dror et al. [30] showed that when AFIS was used to produce exemplars, more erroneous identifications and erroneous exclusions were made, as compared to comparisons of exemplars from police-developed suspects.

None of the experiments presented AFIS-produced candidate exemplars to be compared to the latent prints. (Langenburg et al. [3], Tangen et al. [13] and Ulery et al. [1] used AFIS to select prints for use in their experiments, but their subjects never saw more than one AFIS candidate in the experiments.) Consequently, casework examinations are, overall, more difficult than the comparisons in these experiments, and the accuracy and reliability results of these experiments are inflated compared to casework.

7.11. Contrasting biases in experiments and in casework

When an examiner, working for the criminal justice system, is committed to catching and identifying perpetrators (Kassin, Dror and Kukucka [31]), that commitment can produce a bias in decision processes so that identification conclusions are more important and valued than other conclusions (Risinger et al. [32], Dror and Cole [33], Mnookin et al. [34], the Office of the Inspector General's investigation of the FBI's Mayfield erroneous identification (OIG) [35], National Academy of Sciences [18, pp. 222–224]). One outcome can be that the examiner will attend more to confirmatory evidence to support an identification, and will be less sensitive to negative evidence that might point to other conclusions. Confirmation biases in examiners performing casework, arising from their own commitments, from pressure for positive outcomes, or from outside biasing information, are more likely to increase erroneous identifications in casework.

In contrast, in the context of the experiments, a confirmation bias was less likely to occur. The absence of a confirmation bias would reduce erroneous conclusions, including erroneous identifications, inflating the accuracy results from the experiments as compared to casework.

A second bias is present in casework that was absent in the experiments. Because nearly all casework is carried out in police, sheriff, and justice departments, information bias is prevalent. To the extent that information bias produces more erroneous conclusions, accuracy in casework performance is lower than found in these experiments, in which the sources of bias are more limited.

7.12. Summary of generalization limitations

We conclude that these experimental results cannot be applied to casework performance: the experiments either use procedures or methods that differ from casework, or they systematically overestimate accuracy rates and/or underestimate error rates.

8. The use of proficiency test results to estimate error rates

The Collaborative Testing Services (CTS) [22] is the largest manufacturer and distributor of fingerprint proficiency tests. CTS publishes the results of its tests twice annually on its website, based on several thousand test-takers per year. The identification rate has averaged well above 95% correct over the last decade. Gutowski [12] found that the six years from 2000 to 2005 averaged about 99% correct.

Several fingerprint examiners have proposed in print that the results of a proficiency test, such as the CTS proficiency test, be used to provide an estimate of the erroneous identification rate for fingerprint examinations (e.g., Reznieck et al. [25]). They note that the test items have known ground truth, so responses can be scored as correct or erroneous. They also note that many examiners take this test regularly, providing a history of performance (e.g., Gutowski [12]). Koehler [36] and Haber and Haber [23] have objected to this use of proficiency test results. We list several problems with the CTS test, any one of which renders CTS results unusable as a measure of error rates.

First, CTS does not report measures of reliability, and there is no way to estimate reliability from other measures.

Second, tests on which virtually all test-takers receive a near-perfect or perfect score are statistically useless as predictors of proficiency.

Third, CTS explicitly warns users that the test results should not be used to estimate error rates or any other measure of performance. This warning is ignored by those who refer to these tests as a measure of error rates.

Fourth, the CTS proficiency tests allow only one of the four SWGFAST [16] permitted conclusions: identification. The test permits a conclusion of "not identified," which is not approved by SWGFAST. In addition to the atypical conclusions, the "not identified" response is ambiguous: does the examiner mean exclusion or inconclusive? There is no way of telling.

Fifth, the proficiency test producers do not describe how test items, both latent and exemplar fingerprints, are manufactured, selected, or paired; and they do not describe how the difficulty level of the latent prints, the exemplar prints, or their pairings is assessed.

Sixth, CTS administers and reports results of the tests without knowledge of the distribution of people who take the tests, their employment, their training, or other measures of their skill levels.

Seventh, although the fingerprint profession provides a definition of an examiner who is trained to competence (SWGFAST [37]), no numeric standard exists, such as a numeric passing score for each area of proficiency. Until the profession objectively defines the required test score, or its equivalent, for a proficient fingerprint examiner, no meaningful proficiency test can be constructed.

For these reasons, error rate data based on the CTS proficiency test cannot be applied to casework.

Many of these proficiency test criticisms can be mitigated through proper test construction and psychometric evaluation, using presently available tools. However, the absence of measures of the difficulty of a latent, of an exemplar, and of their pairing is currently an insurmountable hurdle. There are several ongoing attempts to develop measures of the difficulty of individual prints (e.g., Hicklin et al. [38], Mnookin et al. [34]), but these have not yet been tested or validated. The lack of these quantitative measures prohibits the development of a proficiency test with interpretable results.

9. Assessment of the ACE method by these 13 experiments

None of these 13 experiments required examiners to use the ACE method, and none provided evidence that the examiners did use ACE. These comparison experiments assessed only comparison conclusions. Conclusions are outcomes of the application of some defined or undefined method: they are not the method itself. Research experiments that do not control the method used are called "black box" studies, because they provide no information about what is inside the box. When different examiners use different methods, varying outcomes are expected for the same comparisons. That is the overall result found in these experiments.

Ultimately, black box studies, because they do not require or assess the method used by each examiner, are of no benefit to the fingerprint profession.

In the first Daubert pretrial hearing on the admissibility of fingerprint evidence into court (USA v. Mitchell [19]), Meagher [39] testified that the fingerprint profession uses the ACE method exclusively. It is THE method used to perform fingerprint examinations. This claim is still repeated in authoritative publications (e.g., McRoberts [42]). If the fingerprint profession claims that all fingerprint examiners use the ACE method on which they base their testimony, experiments must assess the ACE method (Haber and Haber [40]; see also Mnookin [41], Mnookin et al. [34]). At present, none do.

None of the 13 experiments and none of the current proficiency tests defines the ACE method for their examiners to use. None of them describes the details of that method; none instructs the examiners to use the ACE method; and, most importantly, none of them documents the examiners' use of the ACE method with bench notes or other means. If all examiners use only one method, called ACE, then there should be documentation of the method and of its use. Since the corpus of experiments reviewed here comprises the only ones published, we conclude that the accuracy and the reliability of the ACE method have never been tested, and are therefore unknown.

SWGFAST [16] describes the major components of one version of ACE, and acknowledges that different examiners perform many of the components or steps differently. A number of authors have commented that ACE is not a single unified method, but is more a framework of similar but not identical steps (Triplett [43]; Cole [44]; the National Academy of Sciences [18]; the Office of the Inspector General [35]). For example, Triplett [43] describes this variation in several different contexts, and concludes that there is no single ACE. Variation means that there is no single method to test for its accuracy and reliability.

The claim that all examiners use the ACE comparison method became necessary in part because the Daubert criteria (as well as most Frye criteria) for the admission of fingerprint evidence in federal and state courts focus on methodology, not on the accuracy of conclusions. To meet these criteria, evidence is required to show that the method used to produce fingerprint conclusions is accepted by both practitioners and scientists, that the method's accuracy and reliability are testable, that the method has been tested, that it has a published error rate (presumably one that is low), and that it has evidence of reliability (presumably high) (Haber and Haber [40]). The experimental results described in this article, which are the entire corpus of results, are irrelevant to the criteria demanded by Daubert and Frye.

It is a serious design omission that this corpus of experiments provides no information about the accuracy of the ACE method. To repair it, the fingerprint profession needs to prepare a complete manual of the ACE method, pretest it, have the profession (i.e., the IAI) approve it, teach the manual to all fingerprint examiners, and finally work out an assessment procedure to document that the "approved" method has been used, and used correctly, by examiners when they do casework or when they are tested in a research or proficiency context. At present, the fingerprint profession has not met any of these requirements.

10. Societal implications of these experimental results

Our legal system weighs as unacceptable outcomes both the threat that an innocent person will be convicted and the risk that a guilty person will go unprosecuted or be released, though the former is considered far more heinous. On the assumption that these experiments reflect real life, and that every same-source pair is from a "guilty" person and every different-source pair is from an "innocent" person, their results suggest that many guilty persons go free and many innocent persons are convicted or remain at risk. Ignoring the caveats already mentioned about design flaws in these 13 experiments, the numbers reported by each experiment illustrate serious problems about the role of fingerprint examinations in the criminal justice system. Guilty persons remain at large through either an erroneous exclusion or a missed identification. Innocent persons are convicted or at risk through an erroneous identification or a missed exclusion.

The erroneous exclusion rate (concluding exclusion to a same-source "guilty" pair) ranged from a low of 1% (Langenburg [5]) to a high of 13% (Ulery et al. [1]).

The missed identification rate (concluding inconclusive or exclusion to a same-source "guilty" pair) ranged from a low of 9% (Langenburg [5]) to a high of 55% (Ulery et al. [1]). If these data could be generalized to casework, they would indicate that a very large number of "guilty" perpetrators remain at large to commit further crimes. It is for the profession and the courts to determine what constitutes an acceptable missed identification rate, a standard that neither body has established.

Erroneous identifications (concluding identification to a different-source "innocent" pair) ranged from virtually zero (Ulery et al. [1]) to 3% (Langenburg et al. [3]). A 3% rate means that, on average, three of every 100 comparisons involving an innocent (different-source) suspect would result in an identification. Is that too high? It is for the profession and for the courts to determine what constitutes an acceptable erroneous identification rate, a standard that neither body has established.

The missed exclusion rate (concluding inconclusive or identification to a different-source pair) ranged from a low of 7% (Langenburg et al. [3]) to a high of 79% (Langenburg [5]). Failure to exclude an innocent person may leave that person as a suspect.

The inconclusive conclusion rate (concluding inconclusive to either a same- or a different-source pair) is very large in many of these experiments. In Ulery et al. [1], inconclusive judgments reached 37%, and were heavily concentrated on the same-source "guilty" pairs. This means perpetrators were not identified and innocent people were not excluded.

If the results from these experiments were generalizable to casework, fingerprint comparison evidence would leave guilty perpetrators free, and would leave innocent persons under threat of further prosecution.

We do not believe that the results of these experiments can be generalized to casework. However, the large numbers of inappropriate conclusions suggest that fingerprint evidence and fingerprint comparison methodology need careful scrutiny.

11. Conclusions

In the several sections above, we have been critical of the experimental designs, the procedures, the analyses, and the interpretations of these 13 experiments. We have concluded that they have flaws or inadequacies that prevent their results from being applied, individually or collectively, to casework; that they do not provide acceptable estimates of error rates even in the context of the single experiment itself; and that they do not give any evidence of the accuracy or reliability of ACE. Many of these conclusions are supported by analyses within single experiments, especially those concerning design flaws, as well as by the variability in results across experiments. We have also concluded that no generalization to casework can be made from the corpus of experiments considered together. Many of the results reveal considerable variability between and within examiners. This unreliability in conclusions seems likely to be a true finding, given the absence of a defined method. Carefully designed experimentation is required to document the presence and extent of this variability.

Putting all of these concerns aside, don't these experiments at least suggest very low erroneous identification rates for fingerprint examiners? Our answer is sharply NO. Not one of these 13 experiments can justify an estimate of the erroneous identification rate in fingerprint comparison casework, and certainly not the low rates reported in their results.

Even though the experiments were published over a 17-year span, not one of them was designed as a replication of an earlier one, and none can be argued to be even a partial replication. The great variability of results (as in Table 2) suggests that our knowledge of fingerprint accuracy and reliability has not been advanced.

Three of the problems noted throughout our critiques demand valid solutions before useful research can be performed to document the accuracy of fingerprint comparisons: creating validated measures of latent print difficulty, of exemplar print difficulty, and of the difficulty of comparing print to print; being able to match test item difficulty to the range of casework difficulty; and providing accuracy and reliability evidence for the method (e.g., ACE) used by the examiners on the test items. Until solutions to these problems are found and validated, further experiments of the kinds described here cannot provide estimates of either casework accuracy or the validity of the ACE method.

Acknowledgments

We would like to thank Michele Triplett (a fingerprint examiner) and Simon Cole (a research scientist). We especially thank an anonymous reviewer, who provided conceptual and specific suggestions that re-shaped this article. This article was prepared without any funds from individuals, granting agencies or the United States government. We presented many of the results of this article during a Frye hearing (Illinois v. Robert Morris) in May 2012, in which we had been retained as defense experts, and Glenn Langenburg had been retained as the expert for the State. Both authors of this article have testified for the defense in a number of Daubert and Frye hearings, as well as in criminal trials in which fingerprint evidence was at issue. Some of the income to our consulting firm is earned as consultants or expert witnesses in criminal trials involving fingerprint evidence. We have never been asked by the United States government nor by any state for consultation or testimony for the prosecution in a criminal trial in which fingerprint evidence was at issue.

References

[1] B.T. Ulery, R.A. Hicklin, J. Buscaglia, M.A. Roberts, (2011), Accuracy and reliabil-ity of forensic latent fingerprint decisions, www.pnas.org/cgi/doi/10.1073/pnas.1018707108.

[2] B.T. Ulery, R.A.Hicklin, J. Buscaglia,M.A. Roberts, (2012), Repeatability and reproducibil-ity, of decisions by latent print examiners, www.plosone.org/article/info:doi/10.1371/journalpone 0032800.

[3] G. Langenburg, C. Champod, T. Genessay, Informing the judgments of fingerprintanalysts using quality metric and statistical assessment tools, Forensic Sci. Int. (2012),http://dx.doi.org/10.1016/j.forsciint.2011.12.017.

[4] Z.W. Evett, R.L. Williams, Review of the 16 point fingerprint standard in England andWales, Forensic Science International 46 (1996) 49–73.

[5] G. Langenburg, A performance study of the ACE-V process: a pilot study to mea-sure the accuracy, precision, reproducibility, repeatability and bias ability ofconclusions resulting from the ACE-V process, J. Forensic Identification 59(2009) 219–257.

[6] S.B. Meagher, Report of the Federal Bureau of Identification's Mitchell Survey, US v.Mitchell, 365 F, 1998, (3d. Circuit).

[7] I.E. Dror, D. Charlton, A. Peron, Contextual information renders experts vulner-able to making erroneous identifications, Forensic Science International 56(2006) 74–78.

[8] I.E. Dror, D. Charlton,Why expertsmakemistakes, J. Forensic Identification 56 (2006)600–616.

[9] L.J. Hall, L. Player, Will the introduction of an emotional context affect fingerprintanalysis and decision-making, Forensic Sci. Int. 181 (2008) 36–39.

[10] G. Langenburg, C. Champod, P. Wertheim, Testing for potential contextual biaseffects during the verification stage of the ACE-V methodology when conductingfingerprint comparisons, J. Forensic Sci. 54 (2009) 571–582.

[11] K. Wertheim, G. Langenburg, A.A. Moenssens, A report of latent print examineraccuracy during training exercises, J. Forensic Identification 56 (2006) 55–92.

[12] S. Gutowski, Error Rates in Fingerprint Examinations: The View in 2006, Forensic Science Bulletin, Autumn 2006.

[13] J.M. Tangen, M.B. Thompson, D.J. McCarthy, Identifying fingerprint expertise, Psychol. Sci. 22 (2011) 995–997.

[14] SWGFAST, SWGFAST Response to the Research, Development, Testing & Evaluation Inter-Agency Working Group of the National Science and Technology Council, Committee on Science, Subcommittee on Forensic Science, 2011.

[15] US v. Danny Love, Daubert Hearing, S.D. Cal. (2011), No. 10 cp 2418-MMM (US v. Love, 2011 SL 2173644, S.D. California).

[16] SWGFAST, Standards for the Documentation of Analysis, Comparison, Evaluation and Verification (Latent), 2010.

[17] R.N. Haber, M. Hershenson, The Psychology of Visual Perception, Second Edition, Holt, Rinehart & Winston, New York, 1980.

[18] National Academy of Sciences, Strengthening Forensic Science in the United States: A Path Forward, National Academies Press, Washington DC, 2009.

[19] US v. Byron Mitchell, 365 F (3d Cir., 1990).

[20] I. Dror, R. Rosenthal, Meta-analytically quantifying the reliability and biasability of fingerprint experts' decision making, J. Forensic Sci. 53 (2008) 900–903.

[21] L. Haber, R. Haber, Letter to the editor regarding: a report of latent print examiner accuracy during training exercises, J. Forensic Identification 56 (2006) 493–499.

[22] Collaborative Testing Services (CTS), www.collaborativetesting.com.

[23] L. Haber, R.N. Haber, Challenges to Fingerprints, Lawyers and Judges Publishing Co., Tucson, 2009.

[24] S. Gische, Testimony in Frye Hearing, District of Columbia v. Faison, May 2010.

[25] M. Reznieck, R. Ruth, D.W. Schilens, ACE-V and the scientific method, J. Forensic Identification 60 (2010) 87–103.

[26] J.A. Swets, R.M. Dawes, J. Monahan, Psychology can improve diagnostic decisions, Psychol. Sci. Public Interest 1 (2000) 1–26.

[27] I.E. Dror, On proper research and understanding of the interplay between bias and decision outcomes, Forensic Sci. Int. 191 (2009) 17–18.

[28] R. Koppl, R. Kurzban, L. Kobilinsky, Epistemics for forensics, Episteme 5 (2008) 141–159.

[29] I.E. Dror, J. Mnookin, The use of technology in human expert domains: challenges and risks arising from the use of automated fingerprint identification systems in forensic science, Law Probab. Risk 9 (2010) 47–67.

[30] I.E. Dror, K. Wertheim, P. Fraser-Mackenzie, Walajtys, The impact of human–technology cooperation and distributed cognition in forensic science: biasing effect of AFIS contextual information on human experts, J. Forensic Sci. 57 (2012) 343–352.

[31] S.M. Kassin, I.E. Dror, J. Kukucka, The forensic confirmation bias: problems, perspectives and proposed solutions, J. Appl. Res. Mem. Cogn. (2013) (in press).

[32] D.M. Risinger, M.J. Saks, W.C. Thompson, R. Rosenthal, The Daubert/Kumho implications of observer effects in forensic science: problems of expectation and suggestion, Calif. Law Rev. 90 (2002) 1–56.

[33] I.E. Dror, S.A. Cole, The vision in “blind” justice: expert perception, judgment and visual cognition in forensic pattern recognition, Psychon. Bull. Rev. 17 (2010) 161–167.

[34] J.L. Mnookin, et al., The need for a research culture in the forensic sciences, UCLA Law Rev. 58 (2011) 725–775.

[35] Office of the Inspector General of the Department of Justice (OIG), Investigation of the FBI's Erroneous Identification of Mayfield, US Department of Justice, 2006.

[36] J.J. Koehler, Fingerprint error rates and proficiency tests: what they are and why they matter, Hast. Law J. 59 (2008) 1077–1110.

[37] SWGFAST, Training to Competence for Latent Fingerprint Examiners, Ver 2.1, 2002,pp. 8–22.

[38] R.A. Hicklin, J. Buscaglia, M.A. Roberts, S.B. Meagher, W. Fellner, J.J. Burge, M. Monoco, D.I. Vera, L.R. Pantzer, C.C. Yeung, T.K. Unnikumaran, Latent print quality: a survey of examiners, J. Forensic Identification 61 (2011) 385–441.

[39] S.B. Meagher, Testimony in US v. Byron Mitchell, No. 96–4071, 1999 (ED PA, Feb 2000).

[40] L. Haber, R.N. Haber, Scientific validation of fingerprint evidence under Daubert, Law Probab. Risk (2007), http://dx.doi.org/10.1093/lpr.bgm024.

[41] J.L. Mnookin, Of black boxes, instruments and experts: testing the validity of forensic sciences, Episteme 5 (2008) 343–357.

[42] The Fingerprint Sourcebook, A. McRoberts (Ed.), National Institute of Justice, US Department of Justice, Washington, 2011 (www.nij.gov).

[43] M. Triplett, Is ACE-V a process or a method? IDentification News 42 (2012) 6–7.

[44] S.A. Cole, Suspect Identities: A History of Fingerprinting and Criminal Identification, Harvard University Press, Cambridge, 2001.

[45] L. Haber, R.N. Haber, Error rates for human latent fingerprint examiners, in: N. Ratha, R. Bolle (Eds.), Automatic Fingerprint Recognition, Springer Verlag, New York, 2004, pp. 339–360.

[46] M.B. Thompson, Fingerprints: Defending Expertise in Fingerprint Identification, unpublished, 2011, available from www.mbthompson.com/fingerprints.
