Part II: Touchstone Test Methodology

November 2004
with Sten Westgard, MS
There are hundreds of laboratory tests available today, so any attempt to provide a comprehensive characterization of the quality of current laboratory tests is a huge undertaking and well beyond our resources. Our approach, therefore, is to identify “touchstone tests” that can be considered representative of the quality of laboratory tests today. Depending on the findings for these touchstone tests, the extent of the studies that are needed should become evident.

These touchstone tests should be important in the practice of medicine today based on their widespread use in diagnosis, their importance in treatment, and their possible impact on the cost of medical care. They should be widely available in a variety of laboratory and testing sites. Performance data should be available from proficiency testing surveys and peer-comparison programs to objectively characterize the quality of different methods and different laboratory settings. There should be some agreed upon standard or requirement for quality from regulatory requirements for proficiency testing, clinical practice guidelines, or expert groups.

Touchstone Tests

Tests that have been identified as exemplars for evidence-based medicine are good candidates for touchstone tests. Kling and Hess, in their discussion of the relationship between laboratory tests and clinical outcome, identify cholesterol, prothrombin time (PT), troponin, and prostate specific antigen (PSA) as examples of laboratory tests that are important for diagnosis and management [1]. St. John and Price identify glycated hemoglobin as a surrogate for long-term outcome of diabetic patients [2]. Glucose, of course, is the test with the most immediate impact on the daily management of diabetes. We also add calcium to this list because of a recent report that suggests calibration errors may cost $60M to $199M for this test alone [3].

Even this short list is not without problems and limitations when it comes to assessing the quality of laboratory testing today! There are no readily identifiable U.S. quality requirements for glycated hemoglobin, troponin, and PSA. Methods for PT and troponin are NOT sufficiently well standardized to allow comparison of test values from method to method and laboratory to laboratory. Of these tests, we can most readily and objectively assess the quality of cholesterol, glucose, and calcium tests. Those will be the starting point for our investigation here. Each of the others (PT, troponin, glycated hemoglobin, and PSA) presents special problems that require a more complicated assessment methodology, so we shall have to be more cautious in our analysis and guarded in our conclusions.

Cholesterol. National guidelines for cholesterol testing go back almost 20 years. The National Cholesterol Education Program (NCEP) was initiated by the National Institutes of Health in 1987 [4]. Furthermore, widespread public awareness of problems with Pap smears and cholesterol testing was one of the triggers for the CLIA-88 regulations. Wiebe provides a good review of the evolution of cholesterol measurements and the impact of the Lipid Research Clinics and their efforts to standardize cholesterol testing [5], which contributed to the well-established standards for cholesterol testing today.

  • CLIA sets a criterion for acceptable performance as 10% of the target value (TV) in proficiency testing surveys [6].
  • NCEP sets the maximum specifications for method CV at 3.0% and method bias at 3.0% [7].
  • NCEP sets a clinical decision interval of 20% based on a desirable cholesterol level of 200 mg/dL or less and an undesirable level of 240 mg/dL or greater, i.e., 40 mg/dL at a level of 200 mg/dL [7].

Glucose. One of the oldest laboratory tests, glucose is also one of the most widely performed tests in clinical laboratories, point-of-care sites, and home testing. It has taken on new importance with new treatment guidelines from the American Diabetes Association (ADA) and the U.S. Department of Health and Human Services (HHS) [8]. These new diagnostic guidelines identify individuals with a fasting glucose in the 110-126 mg/dL range as having impaired fasting glucose and individuals with values >126 mg/dL (if confirmed) as diabetic.

  • The CLIA criterion for acceptable performance in proficiency testing is given as 6 mg/dL or 10%, whichever is greater. At a level of 126 mg/dL, the 10% criterion would apply. At a level less than 60 mg/dL, the 6 mg/dL criterion would apply.
  • ADA/HHS sets a clinical decision criterion of 14.5% at a decision level of 110 mg/dL (16 mg/dL divided by 110 mg/dL).
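
The "whichever is greater" rule in the CLIA glucose criterion can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names `clia_glucose_limit` and `passes_pt` are hypothetical and are not part of any regulation or PT provider's software.

```python
def clia_glucose_limit(target_mg_dl: float) -> float:
    """Allowable deviation from the target value, in mg/dL.

    Applies the rule quoted above: 6 mg/dL or 10% of the target,
    whichever is greater.
    """
    return max(6.0, 0.10 * target_mg_dl)


def passes_pt(result_mg_dl: float, target_mg_dl: float) -> bool:
    """True if a single PT result falls within the allowable deviation."""
    return abs(result_mg_dl - target_mg_dl) <= clia_glucose_limit(target_mg_dl)


# Below 60 mg/dL the fixed 6 mg/dL limit governs; above it, the 10% limit.
for target in (50.0, 60.0, 126.0):
    print(f"target {target:5.1f} mg/dL -> limit {clia_glucose_limit(target):.1f} mg/dL")
```

At the 126 mg/dL decision level the limit works out to 12.6 mg/dL, so a result of 110 mg/dL on a specimen with that target value would fall outside the acceptable range.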

Calcium. In the 1970s, calcium was selected as a model analyte for a demonstration of the role of reference materials and reference methods in achieving accuracy in clinical laboratory measurements [9]. The National Bureau of Standards (NBS), now known as the National Institute of Standards and Technology (NIST), sponsored the development of reference materials that were then made available to the public. NBS also sponsored the development of an isotope dilution mass spectrometry method for definitive measurement of calcium in clinical specimens. That method was then made available to organizations such as the College of American Pathologists to provide official reference values for proficiency studies of field methods [9].

  • The CLIA criterion for acceptable performance in proficiency testing is 1.0 mg/dL.
  • A recent NIST/Mayo Clinic study estimated that method bias due to calibration errors in the range of 0.1 mg/dL to 0.5 mg/dL has a potential cost of $60M to $199M [3]. This suggests that the CLIA PT criterion may not be stringent enough for some clinical applications.

Methodology for Proficiency Testing (PT) Data

We start this assessment of test quality by looking at PT data. Laboratories are required under CLIA to participate in PT three times per year. Each PT event consists of five different specimens. Laboratories are required to perform the testing in the same manner as patient testing, which usually means making a single measurement on each specimen. From these PT data, we can make three different estimates of the quality of laboratory tests.

National Test Quality (NTQ). To assess the quality of the PT group as a whole, any difference from laboratory to laboratory and method to method, whether due to systematic error (inaccuracy) or random error (imprecision), is included in the assessment. You can picture the data as a histogram, where each laboratory is represented by one point. The location (mean) and distribution (SD, CV) of these test results describe the test performance for the whole group. By defining limits for a correct test result, or conversely, the allowable total error (TEa), the overall quality of that test can be assessed. This measure of quality relates to getting the same answer for the test in all laboratories, so it lumps any method-to-method and laboratory-to-laboratory differences into the assessment. Sigma is estimated as TEa/SD or TEa/CV, where both parameters are expressed in the same units, either concentration units (TEa/SD) or percentage units (TEa/CV).

Local Method Quality (LMQ). It is also possible to assess the quality for each method or instrument group, disregarding any bias vs other methods. This corresponds to the quality that would be observed locally when physicians are served by a single laboratory method or a healthcare organization with harmonized methodology. Sigma is estimated as TEa/CV, where CV is estimated for the specific method group.

National Method Quality (NMQ). Finally, it is also possible to assess the quality of different method groups taking into account the bias observed for that method group versus the whole group. Sigma is estimated as (TEa – Bias)/CV, where Bias is estimated as the difference between the method mean and the overall mean, and CV is estimated for the specific method group.
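
As a rough illustration, the three estimates can be written as small Python functions. The function names are hypothetical. The numbers are the "All Methods" and Abbott Aeroset rows of Table 1 later in this article, with TEa set to the CLIA cholesterol criterion of 10%. Expressing bias as a percentage of the overall group mean is an assumption, chosen because it reproduces the tabulated values.

```python
def percent_cv(sd: float, mean: float) -> float:
    """Coefficient of variation in percent."""
    return 100.0 * sd / mean


def sigma_without_bias(tea_pct: float, cv_pct: float) -> float:
    """Sigma = TEa / CV (used for both NTQ and LMQ)."""
    return tea_pct / cv_pct


def sigma_with_bias(tea_pct: float, bias_pct: float, cv_pct: float) -> float:
    """Sigma = (TEa - Bias) / CV (used for NMQ)."""
    return (tea_pct - abs(bias_pct)) / cv_pct


TEA_PCT = 10.0                           # CLIA criterion for cholesterol
group_mean, group_sd = 208.1, 5.09       # "All Methods" row, mg/dL
method_mean, method_sd = 209.2, 2.85     # Abbott Aeroset row, mg/dL

# NTQ: whole-group CV, no separate bias term
ntq = sigma_without_bias(TEA_PCT, percent_cv(group_sd, group_mean))

# LMQ: CV of one method group, bias vs other methods disregarded
method_cv = percent_cv(method_sd, method_mean)
lmq = sigma_without_bias(TEA_PCT, method_cv)

# NMQ: same CV, but the bias vs the overall mean is subtracted from TEa
# (assumption: bias is taken as a percentage of the overall group mean)
bias_pct = 100.0 * abs(method_mean - group_mean) / group_mean
nmq = sigma_with_bias(TEA_PCT, bias_pct, method_cv)

print(f"NTQ {ntq:.2f}  LMQ {lmq:.2f}  NMQ {nmq:.2f}")
# prints: NTQ 4.09  LMQ 7.34  NMQ 6.95
```

The only difference between LMQ and NMQ is the bias term, so a method group can never score higher on NMQ than on LMQ.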

CMS Approved Proficiency Testing Providers

Currently, there are 14 approved PT providers:

  • Accutest, Inc.
  • American Association of Bioanalysts (AAB)
  • American Academy of Family Physicians (AAFP)
  • American Proficiency Institute (API)
  • California Thoracic Society
  • College of American Pathologists (CAP)
  • Medical Laboratory Evaluation (MLE)
  • Commonwealth of Pennsylvania
  • Idaho Bureau of Laboratories
  • New Jersey Department of Health
  • Puerto Rico Department of Health
  • State of Maryland
  • State of New York (NY)
  • Wisconsin State Laboratory of Hygiene

These PT providers are not all alike: they serve different types of laboratories and different geographical regions of the country.

  • AAFP focuses on family physician office laboratories and provides PT and quality assurance programs for members.
  • MLE is part of an alliance with the American College of Physicians, which is an organization of internal medicine physicians, rather than family physicians. They focus on physician office laboratories, but these might also be quite large laboratories in large group practices.
  • AAB identifies itself as the “voice of the community clinical laboratory,” which indicates an orientation to small laboratories and physician office laboratories.
  • API bills itself as one of the largest proficiency testing services, with over 12,000 clients. API collaborates with the American Society for Clinical Pathology (ASCP) to provide educational and testing services.
  • CAP is the accreditor of most large clinical laboratories in hospitals and has one of the largest and most extensive PT services.

The order of these programs – AAFP, MLE, AAB, API, to CAP – represents both the size of the PT programs from small to large and the size of the laboratories they serve, also from small to large. Use of the PT survey results from all of these programs should provide some assessment of the quality of testing available in different laboratory settings.

State health departments will usually have a narrower geographic focus (CA, PA, ID, NJ, PR, MD, NY, WI), though the WI program is also available nationwide and the NY program includes many national reference laboratories.

Example Data Analysis and Assessment

To illustrate our “touchstone test methodology,” we will use a dataset from the NY program as an example. The NY program is thought by many to be the most demanding of all regulatory programs. It serves approximately 400 laboratories, most of which employ high volume automated analytical systems.

Table 1. Cholesterol PT data from New York state survey with all method groups.

| Instrument | Labs | Mean (mg/dL) | SD (mg/dL) | CV (%) | Sigma wo/Bias | Bias (%) | Sigma w/Bias |
|---|---|---|---|---|---|---|---|
| All Methods | 371 | 208.1 | 5.09 | 2.45 | 4.09 | NA | NA |
| Abbott Aeroset | 11 | 209.2 | 2.85 | 1.36 | 7.34 | 0.53 | 6.95 |
| Beckman Coulter CX | 46 | 208.4 | 3.51 | 1.68 | 5.94 | 0.14 | 5.85 |
| Beckman Coulter LX-20 | 32 | 207.8 | 5.00 | 2.41 | 4.16 | 0.14 | 4.10 |
| Bayer ADVIA 1650 | 15 | 207.5 | 4.84 | 2.33 | 4.29 | 0.29 | 4.16 |
| Bayer Express | 3 | 212.8 | 1.54 | 0.72 | 13.82 | 2.26 | 10.70 |
| Dade Behring Dimension | 66 | 207.6 | 4.93 | 2.37 | 4.21 | 0.24 | 4.11 |
| Hitachi 717 | 7 | 212.4 | 5.83 | 2.74 | 3.64 | 2.07 | 2.89 |
| Hitachi 747 | 11 | 206.3 | 3.90 | 1.89 | 5.29 | 0.86 | 4.83 |
| Hitachi 911 | 5 | 212.6 | 8.37 | 3.94 | 2.54 | 2.16 | 1.99 |
| Hitachi 917 | 8 | 207.9 | 8.22 | 3.95 | 2.53 | 0.10 | 2.50 |
| Hitachi MODULAR | 25 | 210.7 | 3.83 | 1.82 | 5.50 | 1.25 | 4.81 |
| Johnson & Johnson Vitros | 64 | 206.9 | 4.65 | 2.25 | 4.45 | 0.58 | 4.19 |
| Olympus AU400/600/640/2700/5400 | 30 | 204.1 | 4.69 | 2.30 | 4.35 | 1.92 | 3.52 |
| Olympus AU5000/5200 | 6 | 207.6 | 3.81 | 1.84 | 5.45 | 0.24 | 5.32 |
| Roche Cobas INTEGRA | 10 | 206.2 | 5.71 | 2.77 | 3.61 | 0.91 | 3.28 |
| Roche Cobas MIRA | 10 | 213.3 | 4.15 | 1.95 | 5.14 | 2.50 | 3.86 |
| Alfa Wasserman ACE | 4 | 205.9 | 1.88 | 0.91 | 10.95 | 1.06 | 9.79 |
| Weighted average performance figures | | | | 2.19 | 4.83 | 0.69 | 4.48 |

Local method quality is represented by the Sigma metrics for each method subgroup in the “Sigma wo/Bias” column. These estimates of quality depend only on the within-subgroup or within-method SD or CV, so they will often be higher than the Sigma observed for the whole group. Note, however, that some of these subgroups are very small, e.g., 3 instruments for Bayer Express, 7 for Hitachi 717, etc. In many PT surveys, subgroups smaller than 10 are not considered to provide reliable estimates for means and CVs. For example, the Alfa Wasserman ACE with N=4 has a Sigma-metric of 10.95 in this survey. This compares with Sigmas of 3.38 (N=29, AAFP survey), 3.88 (N=77, AAB survey), and 3.57 (N=229, API survey). Thus, Sigma estimates for small subgroups may not be reliable.

To properly reflect the size of each subgroup in the overall performance figures, the performance figures shown at the bottom of the table are weighted averages (where the weights are the ratio of the subgroup size to the total number of the labs). Note also that the total number of labs calculated from the subgroups does not necessarily agree with the "all methods" number of labs, most likely due to the effects of the outlier criteria on the overall group as well as the subgroups. In this example, the sum of the labs identified in the subgroups is 353 in contrast to 371 being included in the overall group statistics.
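
The weighting described above can be checked with a short Python sketch. The rows are transcribed from Table 1; the helper `weighted_average` and the tuple layout are illustrative choices, not something taken from the NY survey report.

```python
# (instrument, labs, CV %, Sigma wo/Bias, Bias %, Sigma w/Bias), from Table 1
ROWS = [
    ("Abbott Aeroset", 11, 1.36, 7.34, 0.53, 6.95),
    ("Beckman Coulter CX", 46, 1.68, 5.94, 0.14, 5.85),
    ("Beckman Coulter LX-20", 32, 2.41, 4.16, 0.14, 4.10),
    ("Bayer ADVIA 1650", 15, 2.33, 4.29, 0.29, 4.16),
    ("Bayer Express", 3, 0.72, 13.82, 2.26, 10.70),
    ("Dade Behring Dimension", 66, 2.37, 4.21, 0.24, 4.11),
    ("Hitachi 717", 7, 2.74, 3.64, 2.07, 2.89),
    ("Hitachi 747", 11, 1.89, 5.29, 0.86, 4.83),
    ("Hitachi 911", 5, 3.94, 2.54, 2.16, 1.99),
    ("Hitachi 917", 8, 3.95, 2.53, 0.10, 2.50),
    ("Hitachi MODULAR", 25, 1.82, 5.50, 1.25, 4.81),
    ("Johnson & Johnson Vitros", 64, 2.25, 4.45, 0.58, 4.19),
    ("Olympus AU400/600/640/2700/5400", 30, 2.30, 4.35, 1.92, 3.52),
    ("Olympus AU5000/5200", 6, 1.84, 5.45, 0.24, 5.32),
    ("Roche Cobas INTEGRA", 10, 2.77, 3.61, 0.91, 3.28),
    ("Roche Cobas MIRA", 10, 1.95, 5.14, 2.50, 3.86),
    ("Alfa Wasserman ACE", 4, 0.91, 10.95, 1.06, 9.79),
]


def weighted_average(values, weights):
    """Average of values, each weighted by its subgroup size."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


labs = [row[1] for row in ROWS]
total_labs = sum(labs)  # 353 subgroup labs, vs 371 in the "All Methods" row

avg_cv = weighted_average([row[2] for row in ROWS], labs)
avg_sigma_wo_bias = weighted_average([row[3] for row in ROWS], labs)
avg_bias = weighted_average([row[4] for row in ROWS], labs)
avg_sigma_w_bias = weighted_average([row[5] for row in ROWS], labs)

print(total_labs)
print(f"{avg_cv:.2f} {avg_sigma_wo_bias:.2f} {avg_bias:.2f} {avg_sigma_w_bias:.2f}")
```

Because the weights are subgroup sizes, the large subgroups (Dade Behring Dimension, Vitros, Beckman Coulter CX) dominate the averages, and the very high Sigmas of the small Bayer Express and Alfa Wasserman ACE subgroups have little influence.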

The weighted average for the column "Sigma without Bias", 4.83, reflects the most probable estimate of method quality throughout the country when method bias is not considered. This assumes that local reference values and local decision levels and cutoffs would be employed in the interpretation of test results, thus minimizing and hopefully eliminating any effects of method bias.

National method quality is represented by the Sigma-metrics in the last column, “Sigma w/Bias”. In this calculation, the bias vs the group mean has been determined for each method subgroup. For example, for the Abbott Aeroset subgroup, the observed mean of 209.2 mg/dL is compared with the overall group mean of 208.1 mg/dL, which shows a bias of 1.1 mg/dL or 0.53% (1.1/208.1). This bias is subtracted from the allowable total error of 10%, and the result is divided by the method CV of 1.36%, i.e., (10 - 0.53)/1.36, which gives a Sigma-metric of 6.95 for the subgroup when unrounded values are used. The weighted average of all method subgroups is given at the bottom of this column, which is seen to be 4.48.

In summary, the quality of cholesterol testing, according to the NY survey, is somewhere between 4.09 and 4.83 Sigma. The 4.09 Sigma figure represents the quality that would be seen for a patient who moves about the country and is tested in several different laboratories. The 4.83 figure represents the quality for a patient who is treated locally within a single healthcare organization that has harmonized or standardized methods. A figure of 4.48 may be more representative of the quality observed for a patient who experiences a variety of laboratories whose methods are not harmonized or standardized.

What do these metrics mean?

Six Sigma quality management provides benchmarks for the quality of products and processes. A Sigma of 6 represents “world class quality” and is a universal goal for the quality of products and processes. A Sigma of 3 represents the minimum level of quality needed to put a product or process into production or routine service.

An important point in laboratory testing is that the amount of QC needed depends on the Sigma performance observed for the analytical method. As described elsewhere on this website, methods that provide 6 Sigma performance can be easily monitored with only 1 or 2 control measurements, whereas methods with 5 Sigma performance require 2 or 3 control measurements and methods having 4 Sigma performance require 4 or more control measurements, if laboratories are to guarantee the desired quality is actually being achieved. Methods that provide only 3 Sigma performance, or less, do not provide the necessary quality and are not controllable by current laboratory QC practices. Methods with Sigma performance less than 5 are not suitable for application in settings where the operators have minimum laboratory skills and perform only daily QC.
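
The guidance in the preceding paragraph can be expressed as a simple lookup, sketched below in Python. The function name `qc_guidance` is hypothetical. Treating each stated Sigma level as the lower edge of a band (e.g., 4.0 up to 5.0) is an assumption: the text gives figures only at 6, 5, 4, and 3 Sigma, and says nothing about methods between 3 and 4 Sigma.

```python
def qc_guidance(sigma: float) -> str:
    """Map a Sigma-metric to the QC guidance quoted in the text.

    Assumption: each stated Sigma level is the lower edge of a band.
    """
    if sigma >= 6.0:
        return "1 or 2 control measurements"
    if sigma >= 5.0:
        return "2 or 3 control measurements"
    if sigma >= 4.0:
        return "4 or more control measurements"
    if sigma > 3.0:
        return "no figure given in the article for this range"
    return "not controllable by current laboratory QC practices"


# The three NY cholesterol estimates discussed above
for label, sigma in (("NTQ", 4.09), ("NMQ", 4.48), ("LMQ", 4.83)):
    print(f"{label} {sigma:.2f} Sigma: {qc_guidance(sigma)}")
```

Under this reading, all three NY cholesterol estimates fall in the band that calls for 4 or more control measurements.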

If laboratory tests do provide world class quality, then current laboratory quality management practices can be considered to be optimal, with no need for further improvements. Such a finding might provide justification for reducing the current levels of QC, as proposed by CMS in the new “equivalent QC” options. If QC is to be reduced from daily to weekly or even monthly, we need to be utilizing high quality methods having 6 Sigma performance or better.


  1. Kling E, Jess HR. The relationship between Test and Outcome. Chapter 3 in Evidence-Based Laboratory Medicine, Price CP, Christenson RH, eds. Washington DC, AACC Press, 2003, pp 39-55.
  2. St. John A, Price CP. Measures of Outcome. Chapter 4 in Evidence-Based Laboratory Medicine, Price CP and Christenson RH, eds. Washington DC, AACC Press, 2003, pp 55-74.
  3. Downer K. How much does test calibration error cost? NIST report suggests $60-$199M for calcium testing alone. Clin Lab News 2004;30(August).
  4. Lenfant C. A new challenge for America: the National Cholesterol Education Program. Circulation 1986;73:855-6.
  5. Wiebe DA, Westgard JO. Cholesterol – a model system to relate medical needs with analytical performance. Clin Chem 1993;39:1504-1513.
  6. Revisions of the Laboratory Regulations for the Medicare, Medicaid, and Clinical Laboratories Improvement Act of 1967 Programs: final rule with comment period. U.S. Dept. of Health and Human Services, Fed Reg, March 14, 1990;55:9538-610.
  7. National Cholesterol Education Program Laboratory Standardization Panel. Current status of blood cholesterol measurements in clinical laboratories in the United States. Clin Chem 1988;34:193-201.
  8. Sainato D. A new attack on the diabetes epidemic. Clin Lab News 2002;38(June):1-5.
  9. Cali JP, Mandel J, Moore L, Young DS. A reference method for the determination of calcium in serum. NBS Special Publication 260-236, 1972.
  10. Gilbert RK. The accuracy of calcium analysis in the United States. Am J Clin Pathol 1975;63:974-983.



James O. Westgard, PhD, is a professor of pathology and laboratory medicine at the University of Wisconsin Medical School, Madison. He also is president of Westgard QC, Inc., (Madison, Wis.) which provides tools, technology, and training for laboratory quality management.