Tools, Technologies and Training for Healthcare Laboratories

Estimating Clinical Agreement for a Qualitative Test: A Web Calculator for 2x2 Contingency Table

How do you validate a qualitative test? Here's an introduction to a tiny little tool you might find useful for virus assay validation.

James O. Westgard, Patricia E. Garrett, Paul Schilling
April 2020

In the midst of the COVID-19 pandemic, it may be helpful to provide medical laboratories with some support for validating qualitative tests. Larger laboratories performing rRT-PCR tests for the virus and serology tests for antibodies to SARS-CoV-2 likely have access to statistics programs that follow FDA and CLSI recommendations for method validation. Smaller laboratories are less likely to have those programs, yet will still need to perform some minimum validation or verification studies as simpler or more automated methods become available.

To be clear, there are two new tests: (1) tests for the SARS-CoV-2 virus itself, and (2) tests for antibodies to the virus. Each of these markers is now being analyzed by multiple methods that are being approved at a rapid rate by the FDA under the Emergency Use Authorization (EUA). The methods for the virus are mostly, but not all, based on PCR, and the methods for the antibodies essentially all fall into the category of serology tests. PCR and other nucleic acid or molecular methods are generally performed in one section of the laboratory, according to where the instruments for those technologies are already established. Serology tests are usually performed in another section of the laboratory. With the introduction of simple lateral flow tests, testing will also be performed in point-of-care settings. All of these tests are qualitative tests, meaning that they have one medical decision point (cutoff) to classify the result as positive or negative.

Laboratories approved by CLIA for performing moderate and high complexity tests are eligible to implement manufacturers’ tests that have been approved under Emergency Use Authorization (EUA). Validation studies must still be performed, plus positive and negative QC samples should be analyzed with each analytical run of patient samples [1].

Clinical Agreement Study

For validation, FDA recommends a “clinical agreement study”, as well as Limit of Detection (LoD) and cross-reactivity studies. We focus on clinical agreement here, which usually involves comparing the results from two different methods. FDA states that “contrived clinical specimens” may be used, which means it is acceptable to spike samples with a (preferably inactive) high level control material. The FDA recommendation is for 30 reactive (20 low reactive at 1 to 2 times the LoD and 10 higher that span the testing range) and 30 non-reactive specimens. The FDA also requires that the first 5 positive and first 5 negative real patient results be confirmed by a previously authorized EUA method.

Guidance for comparing a new “candidate” test method with an available “comparative” test method is provided in the CLSI EP12-A2 document [2]. We expect most laboratories will have access to test specimens that have already been analyzed by another laboratory in their region, perhaps a reference laboratory that is used by their hospital network or a larger high complexity testing laboratory that is used for sendouts. Alternatively, a laboratory may analyze specimens from documented infected patients and specimens from non-infected patients.

Data Analysis by 2x2 Contingency Table

Given a comparison study where the candidate and comparative test results are classified as positive or negative, those results can be summarized as follows:

a = Number of results where both tests are positive;
b = Number of results where the candidate method is positive, but the comparative is negative;
c = Number of results where the candidate method is negative, but the comparative is positive;
d = Number of results where both methods are negative.

These results can then be summarized in a 2x2 contingency table, which is sometimes called a “truth table.”

                              Comparative Method
Candidate Method (Test)   Positive   Negative   Total
Positive                  a          b          a + b
Negative                  c          d          c + d
Total                     a + c      b + d      n
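As a minimal sketch of how the four cells might be tallied from raw paired results, the Python below counts a, b, c, and d from two lists of positive/negative calls. The function name tally_2x2 and the six specimen results are hypothetical and serve only as an illustration; they are not part of the web calculator.

```python
from collections import Counter

def tally_2x2(candidate, comparative):
    """Count the four cells of the 2x2 table from paired results.

    Each argument is a sequence of booleans (True = positive), with one
    entry per specimen, listed in the same specimen order.
    """
    if len(candidate) != len(comparative):
        raise ValueError("both methods must report a result for every specimen")
    counts = Counter(zip(candidate, comparative))
    a = counts[(True, True)]    # both methods positive
    b = counts[(True, False)]   # candidate positive, comparative negative
    c = counts[(False, True)]   # candidate negative, comparative positive
    d = counts[(False, False)]  # both methods negative
    return a, b, c, d

# Hypothetical results for six specimens (illustration only).
candidate = [True, True, False, False, True, False]
comparative = [True, True, True, False, False, False]
a, b, c, d = tally_2x2(candidate, comparative)
print(a, b, c, d)  # prints: 2 1 1 2
```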

Calculation of Performance Characteristics

This tabulation provides the basis for calculating the Percent Positive Agreement (PPA), Percent Negative Agreement (PNA), and the Percent Overall Agreement (POA), as follows:

PPA = [a/(a+c)]*100

PNA = [d/(b+d)]*100

POA = [(a+d)/n]*100

PPA and PNA should ideally be 100%, which would occur if the figures for b and c were zero. Lower values represent less ideal performance, thus these estimates for PPA and PNA may be useful for judging the acceptability of a candidate method. POA is less useful because it may be high even when PPA or PNA may be low. [Note: if the comparative method were a “gold standard” for diagnostic classification, then PPA would be considered the “diagnostic sensitivity”, PNA would be “diagnostic specificity” of the candidate method, and POA is sometimes called “efficiency”.]
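The three formulas can be sketched in a few lines of Python. The function name agreement is an assumption made for illustration. The first set of counts is the EP12-A2 example used in this lesson; the second set is hypothetical and was chosen only to show how POA can look good while PPA is poor.

```python
def agreement(a, b, c, d):
    """Return (PPA, PNA, POA) in percent from the four cell counts."""
    n = a + b + c + d
    ppa = 100.0 * a / (a + c)    # agreement among comparative positives
    pna = 100.0 * d / (b + d)    # agreement among comparative negatives
    poa = 100.0 * (a + d) / n    # agreement over all specimens
    return ppa, pna, poa

# Counts from the EP12-A2 example discussed in this lesson.
example = agreement(285, 15, 14, 222)
print("PPA {:.1f}%  PNA {:.1f}%  POA {:.1f}%".format(*example))
# prints: PPA 95.3%  PNA 93.7%  POA 94.6%

# Hypothetical counts with few positives: half of them are missed,
# yet overall agreement still looks high.
skewed = agreement(5, 0, 5, 90)
print("PPA {:.1f}%  PNA {:.1f}%  POA {:.1f}%".format(*skewed))
# prints: PPA 50.0%  PNA 100.0%  POA 95.0%
```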

To interpret these calculated characteristics, it should be helpful to know their approximate confidence intervals, i.e., the reliability of these numbers. See the appendix for how these confidence limits are calculated.

                              Comparative Method
Candidate Method (Test)   Positive   Negative   Total
Positive                  285        15         300
Negative                  14         222        236
Total                     299        237        536

                              95% Confidence Intervals
Summary statistics        Percent    Lo Limit   Hi Limit
Positive Agreement (PPA)  95.3%      92.3%      97.2%
Negative Agreement (PNA)  93.7%      89.8%      96.1%
Overall Agreement (POA)   94.6%      92.3%      96.2%

For the example shown here [which is taken from EP12-A2, pages 30-31], the PPA is estimated as 95.3%, with an approximate 95% confidence interval of 92.3% to 97.2%. PNA is estimated as 93.7%, with an approximate 95% confidence interval of 89.8% to 96.1%. POA is calculated as 94.6%, with an approximate 95% confidence interval of 92.3% to 96.2%, but this characteristic is less useful and need not be considered in judging acceptability.

Note that for low numbers of specimens, the confidence limits are expected to be wide. For example, for 5 positives and 5 negatives, no false positives or false negatives, the lower limits will be about 57%; for 10 positives and 10 negatives, the lower limits are about 72%; for 30, about 89%; for 40, about 91%; for 50, about 93%. All of these limits are for comparisons that are a perfect match. The smaller the number of samples tested, the deeper the drop in confidence with even one mismatch (false positive or false negative). [See Table A1 in EP12-A2, page 35.] This illustrates the reason FDA recommends accumulating a minimum of 30 positive and 30 negative results to achieve minimally reliable estimates.
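The effect of sample size can be checked with a short Python sketch that applies the appendix formulas (with the same rounded constants, 3.84 and 7.68) to a perfect match and to a single mismatch. The function name lower_limit is an assumption made for illustration, not part of the calculator.

```python
import math

def lower_limit(agree, disagree):
    """Lower 95% confidence limit (percent) for agree out of agree + disagree."""
    n = agree + disagree
    q1 = 2 * agree + 3.84
    q2 = 1.96 * math.sqrt(3.84 + 4 * agree * disagree / n)
    q3 = 2 * n + 7.68
    return 100.0 * (q1 - q2) / q3

for n in (5, 10, 30, 40, 50):
    perfect = lower_limit(n, 0)        # all n results agree
    one_miss = lower_limit(n - 1, 1)   # one false positive or false negative
    print(f"n={n:2d}  perfect match: {perfect:4.1f}%  one mismatch: {one_miss:4.1f}%")
```

Rounded to whole percents, the perfect-match column gives 57, 72, 89, 91, and 93, matching the figures quoted above, and the one-mismatch column shows how much further the lower limit falls when n is small.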

Westgard QC 2x2 Contingency Calculator

This calculator requires the user to enter 4 numbers corresponding to a (true positives), b (false positives), c (false negatives), and d (true negatives) in the contingency table. Then click the “Calculate” button to get the summary statistics for Positive Agreement (PPA), Negative Agreement (PNA), and Overall Agreement (POA), along with their lower and upper 95% confidence limits. To see the example discussed in this lesson, click “Load Example Data.” Print the page to provide documentation of your results.

[An even more detailed lesson, by Dr. Paulo Pereira, on qualitative testing validation can be found here.]


  1. US Dept Health and Human Services. FDA Policy for Diagnostic Tests for Coronavirus Disease-2019 during Public Health Emergency: Immediately in Effect Guidance for Clinical Laboratories, Commercial Manufacturers, and Food and Drug Administration Staff. March 16, 2020.
  2.  Garrett PE, Lasky FD, Meier KL. CLSI EP12-A2. User Protocol for Evaluation of Qualitative Test Performance. Clinical and Laboratory Standards Institute, 940 West Valley Road, Suite 1400, Wayne, PA, 2008.

About Patricia E. Garrett, Ph.D.

Pat Garrett earned a Ph.D. in organic chemistry in 1970, and in 1978, after five postdocs, began her clinical lab career directing labs in Boston area hospitals.  In 1988, she started work in quality control and standards for infectious disease diagnostics for a tiny startup, and helped that company grow through 24 years and two acquisitions.  From 2013 to now, she has worked as a consultant and as Principal Investigator for an NIH grant held by another small Boston company. Her passion is helping clinical labs and diagnostics manufacturers ‘get it right’.

Appendix: Calculation of Confidence Limits

These calculations are not difficult, but a bit messy. They are described in two stages, first calculating some quantities (Qi) from the table, then calculating the upper and lower confidence limits from these Qs. [Described on pages 23-25 of CLSI EP12-A2.]

Calculations for confidence limits for PPA

Q1 = 2a + 3.84

Q2 = 1.96*[3.84 + 4ac/(a+c)]^(1/2)

Q3 = 2(a+c) + 7.68

PPA lo limit = 100*(Q1-Q2)/Q3

PPA hi limit = 100*(Q1+Q2)/Q3

Next, similar calculations for PNA.

Q4 = 2d + 3.84

Q5 = 1.96*[3.84 + 4bd/(b+d)]^(1/2)

Q6 = 2(b+d) + 7.68

PNA lo limit = 100*(Q4-Q5)/Q6

PNA hi limit = 100*(Q4+Q5)/Q6

Finally, similar calculations for POA.

Q7 = 2(a+d) + 3.84

Q8 = 1.96*[3.84 + 4(a+d)(b+c)/n]^(1/2)

Q9 = 2n + 7.68

POA lo limit = 100*(Q7-Q8)/Q9

POA hi limit = 100*(Q7+Q8)/Q9
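All three sets of limits follow the same pattern: the number of agreeing results x and the relevant total take the place of a and (a+c), d and (b+d), or (a+d) and n. These expressions have the form of the Wilson score interval with z = 1.96, where 3.84 is 1.96 squared, rounded. A minimal Python sketch under that reading is shown below; the function name score_limits is an assumption made for illustration, and this is not the calculator's actual source code.

```python
import math

def score_limits(x, total):
    """Return (lo, hi) 95% limits in percent for x agreements out of total.

    Mirrors the Q1/Q2/Q3 pattern above, using the same rounded constants.
    """
    q1 = 2 * x + 3.84
    q2 = 1.96 * math.sqrt(3.84 + 4 * x * (total - x) / total)
    q3 = 2 * total + 7.68
    return 100 * (q1 - q2) / q3, 100 * (q1 + q2) / q3

# Counts from the EP12-A2 example used in this lesson.
a, b, c, d = 285, 15, 14, 222
n = a + b + c + d

for label, x, total in (("PPA", a, a + c), ("PNA", d, b + d), ("POA", a + d, n)):
    lo, hi = score_limits(x, total)
    print(f"{label}: {100 * x / total:.1f}% ({lo:.1f}% to {hi:.1f}%)")
# prints:
# PPA: 95.3% (92.3% to 97.2%)
# PNA: 93.7% (89.8% to 96.1%)
# POA: 94.6% (92.3% to 96.2%)
```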