An NSF Industry/University Cooperative Research Center (IUCRC) focusing on Biometrics
Biometric System Statistical Design and Evaluation

Funded Projects

Participants

Area Lead: Stephanie Schuckers
Other Members: Stephanie Schuckers

Keywords

confidence intervals, test size, resampling, random sampling

Area Description

Confidence Intervals and Test Size Requirements for Biometric Testing *
Test performance is an important facet of any biometric identification system. Because it is, in most cases, impossible to test every subject in a target population, statements about the entire population must be made from data collected from a sample of that population. Such statements are inherently statistical. Specifically, the questions of interest are usually how the system will perform for the entire population and how certain that assessment is. The two most pressing statistical issues for system performance are test size determination and confidence interval creation. Because we can estimate confidence intervals only for extremely limited and inefficient test protocols, we cannot generally predict required test size. The situation is not totally grim, however, because we do have a clear understanding of what we don’t know and what must be done to solve this problem. We are proposing a research program to answer the outstanding theoretical and empirical questions, and thereby develop a method of confidence interval and test size estimation for biometric device evaluation using more efficient protocols.

Confidence interval computation is covered in first-year statistics texts only for the case of independent events from a single, stationary sample pool. In the case of an identification system, the result of each trial is either a success or a failure. If we assume independence of the trials, such events are considered “Bernoulli Trials” and confidence intervals are estimated from the cumulative binomial equation or a transformed Gaussian equivalent. Some researchers have used this model for biometric test design, tautologically showing results with hypothetical data. Empirically based studies, however, have shown that this model does not generally agree with results obtained through resampling or through random sampling of large experimentally derived databases. We understand now that biometric tests are not generally “Bernoulli Trials” because events are not independent, are not drawn from a single sample population and may involve an outcome-dependent stopping rule. Consequently, confidence intervals cannot usually be predicted using binomial models or Gaussian equivalents. The proposed international “Biometric Testing Best Practices” document makes no specific recommendation regarding test size for exactly these reasons.
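As a point of reference, the textbook calculation that applies only when trials really are Bernoulli trials can be sketched as follows. This is a minimal illustration, not a method taken from the proposal: the function names are hypothetical, the exact interval is found by bisection on the cumulative binomial, and the plain normal approximation is assumed here as the "Gaussian equivalent" because the text does not say which transform is meant.

```python
import math
from statistics import NormalDist


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def _bisect_decreasing(g, target, iterations=100):
    """Find p in [0, 1] with g(p) == target, where g decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if g(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_binomial_interval(k, n, alpha=0.05):
    """Two-sided (1 - alpha) interval from the cumulative binomial.

    Valid only if the n trials are independent with one common error rate.
    """
    lower = 0.0 if k == 0 else _bisect_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1.0 - alpha / 2.0)
    upper = 1.0 if k == n else _bisect_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2.0)
    return lower, upper


def gaussian_interval(k, n, alpha=0.05):
    """Normal-approximation interval, clipped to [0, 1] (assumed variant)."""
    p_hat = k / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


if __name__ == "__main__":
    # Hypothetical test: 3 errors observed in 300 trials.
    print(exact_binomial_interval(3, 300))
    print(gaussian_interval(3, 300))
```

Both functions take the number of errors and the number of trials. Neither accounts for correlated comparisons, varying per-user error rates, or outcome-dependent stopping, which is exactly why they are not generally applicable to biometric tests.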

The standard method for expressing the technical performance of a biometric device for a specific population in a specific application is the “Receiver Operating Characteristic” (ROC) curve or the closely related “Detection Error Tradeoff” (DET) curve. The ordinate of either curve is the false non-match rate and the abscissa is the false match rate. Each point on the curve is calculated by integrating the “genuine” and “impostor” score distributions between zero and some threshold, t. Confidence intervals for the false match and false non-match rates differ at each threshold, t, and must be computed independently. Traditionally, both have been found through a summation of the binomial distribution under the assumption that each comparison represents a “Bernoulli trial”, defined as an independent event from a single sample population. However, comparisons of biometric measures are not Bernoulli trials, and the binomial model is therefore not applicable, if: 1) trials are not independent; 2) the error probability varies across the population; or 3) the stopping rule is determined by event outcome. If cross-comparisons (all samples compared to all templates except the matching one) are used to establish the false match rate, the comparisons will not be independent. The varying error probability across the population (“goats” with high false non-match errors and “lambs” with high false match errors) similarly invalidates the cumulative binomial equation as a basis for uncertainty bounds when single users are sampled (or tested) more than once. In operational tests, users stop after successful use and continue after non-successful use, biasing outcomes in favor of non-successful trials and, again, invalidating the binomial model. Confidence intervals for ROC (or DET) curves are further complicated by the issue of multiple comparisons, which makes confidence level calculations difficult.
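The way a single point on such a curve is obtained from empirical scores can be sketched as below. This is an illustration under stated assumptions: scores are taken to be distances (smaller means a closer match), so a comparison is declared a match when its score is at or below the threshold t; the function names and the sample scores are hypothetical.

```python
def error_rates(genuine_scores, impostor_scores, t):
    """Return (false match rate, false non-match rate) at threshold t.

    Assumes distance scores: a comparison matches when score <= t.
    """
    # Genuine comparisons that fall above the threshold are false non-matches.
    fnmr = sum(s > t for s in genuine_scores) / len(genuine_scores)
    # Impostor comparisons that fall at or below the threshold are false matches.
    fmr = sum(s <= t for s in impostor_scores) / len(impostor_scores)
    return fmr, fnmr


def det_points(genuine_scores, impostor_scores):
    """Sweep every observed score as a threshold: (t, FMR, FNMR) triples."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    return [(t, *error_rates(genuine_scores, impostor_scores, t))
            for t in thresholds]


if __name__ == "__main__":
    # Hypothetical distance scores, for illustration only.
    genuine = [0.1, 0.2, 0.3, 0.6]
    impostor = [0.4, 0.5, 0.7, 0.9]
    for t, fmr, fnmr in det_points(genuine, impostor):
        print(f"t={t:.1f}  FMR={fmr:.2f}  FNMR={fnmr:.2f}")
```

Each threshold yields its own pair of error rates, so an uncertainty statement about the whole curve involves many simultaneous intervals rather than one.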

Recent work has shown that confidence intervals can be computed from the binomial model under some limited and inefficient conditions. If each user is allowed a single trial, uses are independent and can be considered as coming from a single population with error rate, p, which is the sum of the “goat” error rate weighted by the percentage of goats and the “sheep” error rate weighted by the percentage of sheep. Confidence intervals on the false non-match error rate can be computed implicitly using the number of tests, N, and the number of errors, K. Further, if each sample is compared to a single, randomly chosen, non-self (“impostor”) template, the impostor comparisons are Bernoulli trials and the confidence intervals on the false match error rate can also be computed. Required test size can then be calculated to achieve a specified level of confidence.
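Under this one-trial-per-user protocol, the weighted population error rate and one simple test size calculation can be sketched as follows. The numbers are hypothetical, the function names are illustrative, and the zero-error criterion is only one of several ways to turn a confidence requirement into a test size; it is chosen here because it has a closed form.

```python
import math


def population_error_rate(goat_fraction, goat_rate, sheep_rate):
    """Error rate p of the pooled population: class rates weighted by class share."""
    return goat_fraction * goat_rate + (1.0 - goat_fraction) * sheep_rate


def zero_error_test_size(p0, alpha=0.05):
    """Smallest N such that N independent error-free trials support the claim
    'error rate < p0' with confidence 1 - alpha.

    Solves (1 - p0) ** N <= alpha; valid only for Bernoulli trials.
    """
    return math.ceil(math.log(alpha) / math.log(1.0 - p0))


if __name__ == "__main__":
    # Hypothetical population: 10% goats at a 20% error rate, sheep at 1%.
    print(population_error_rate(0.10, 0.20, 0.01))
    # Users needed, one trial each, to claim an error rate below 1% at 95% confidence.
    print(zero_error_test_size(0.01))
```

Because each user contributes only one trial, N here is a count of subjects, which is what makes the protocol expensive.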

This test protocol, however, is inefficient. Experimental subjects are difficult and expensive to obtain, so why not collect multiple uses from each individual user? Multiple uses by each individual raise the problem of non-uniform error probabilities across the events, because individuals are known to differ in their per-trial error probabilities. Solution methods for the revised confidence interval equation that results are not widely known, although several papers considering this problem have been received by the authors through the federal government’s NSASAG program. With the goal of developing equations for confidence interval and test size estimation for both false match and false non-match rates under efficient test conditions, we are proposing an ambitious research program to do the following:
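The cost of ignoring non-uniform error probabilities can be illustrated with a short variance calculation. This is a sketch under assumptions that are not in the text: users are drawn at random from a two-class (goat and sheep) population, every user makes the same number of attempts, and attempts by one user are independent given that user's error rate. All names and numbers are hypothetical.

```python
def naive_binomial_variance(mean_rate, total_attempts):
    """Variance of the observed error rate if every attempt were a Bernoulli trial."""
    return mean_rate * (1.0 - mean_rate) / total_attempts


def clustered_variance(goat_fraction, goat_rate, sheep_rate, users, attempts_per_user):
    """Variance of the observed error rate when each sampled user makes several attempts.

    Law of total variance: within-user binomial noise shrinks with the total
    number of attempts, but between-user spread shrinks only with the number of users.
    """
    sheep_fraction = 1.0 - goat_fraction
    within = (goat_fraction * goat_rate * (1.0 - goat_rate)
              + sheep_fraction * sheep_rate * (1.0 - sheep_rate))
    between = goat_fraction * sheep_fraction * (goat_rate - sheep_rate) ** 2
    return within / (users * attempts_per_user) + between / users


if __name__ == "__main__":
    # Hypothetical test: 100 users, 10 attempts each, 10% goats.
    mean_rate = 0.10 * 0.20 + 0.90 * 0.01
    naive = naive_binomial_variance(mean_rate, 100 * 10)
    actual = clustered_variance(0.10, 0.20, 0.01, users=100, attempts_per_user=10)
    print(naive, actual, actual / naive)
```

With one attempt per user the two variances coincide; with repeated attempts the binomial figure understates the uncertainty, so intervals built from it are too narrow.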

1) Determine the effect, if any, of sample averaging in template creation.

2) Determine the effect, if any, when multiple samples from single individuals are allowed.

3) Verify theoretical findings against experimentally derived data using resampling and/or random sampling.

4) Collect extensive experimental data for several devices allowing multiple attempts in multiple sessions from each user.

5) Determine general “rule of thumb” goat and sheep score distributions from experimental data across all tested devices.

6) Determine appropriate general confidence interval model for independent “genuine” tests under conditions of multiple, discrete error rates.

7) Find inversion methods for above equations for test size and confidence interval estimation.

8) Test equations against experimental data using resampling and/or random sampling from a large database.

9) Find empirical estimates of operational data bias resulting from “try-until-successful” stopping rule.
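The resampling checks mentioned in items 3 and 8 could take many forms. One common option, shown here purely as an assumed example and not as the method of the proposal, is a percentile bootstrap that resamples whole users so that each user's attempts stay together. The function name, the data, and the fixed seed are hypothetical.

```python
import random


def subject_bootstrap_interval(errors_per_user, attempts_per_user,
                               alpha=0.05, n_boot=2000, seed=0):
    """Percentile bootstrap interval for the error rate, resampling users.

    Each replicate draws as many users as the original test, with replacement,
    and pools their errors and attempts.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    users = list(zip(errors_per_user, attempts_per_user))
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(users) for _ in users]
        total_errors = sum(e for e, _ in sample)
        total_attempts = sum(a for _, a in sample)
        estimates.append(total_errors / total_attempts)
    estimates.sort()
    lower_index = max(0, int((alpha / 2.0) * n_boot))
    upper_index = min(n_boot - 1, int((1.0 - alpha / 2.0) * n_boot) - 1)
    return estimates[lower_index], estimates[upper_index]


if __name__ == "__main__":
    # Hypothetical data: 20 users, 10 attempts each, two error-prone users.
    errors = [0] * 18 + [3, 4]
    attempts = [10] * 20
    print(subject_bootstrap_interval(errors, attempts))
```

Comparing an interval of this kind with the one predicted by a candidate equation, on the same experimental database, is one way such a verification could be carried out.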



* for text with full references, see the section under the same title in the CITeR Planning Grant Proposal.


Copyright © 2009 CITeR