Mini-Review of Milan Meeting

For those unable to attend the 2014 Milan Meeting: the 1st EFLM Strategic Conference on Defining analytical performance goals 15 years after the Stockholm Conference, here's a brief mini-review (with links) to the presentations.

The Milan Mini-Review: Deep Thinking on Quality Specifications

By Sten Westgard, MS
December 5th, 2014

What follows is an abbreviated review of the Milan 2014 1st EFLM Strategic Conference on Defining analytical performance goals 15 years after the Stockholm Conference. The full presentations are available online. The journal Clinical Chemistry and Laboratory Medicine will produce a special issue with papers from all the presenters in 2015.

For those impatient, however, I offer a mini-review. This review is, of course, highly biased by my personal recollection and my particular interests. I don’t pretend this is a comprehensive or in-depth audit of all the slides and key points. These are the highlights as I see them.

The distinguished co-chairs of the meeting, Dr. Sverre Sandberg and Dr. Mauro Panteghini, started off the day and set the tone for the conference by noting that an extraordinary number of participants were present, more than 200 from 38 different countries. They proceeded to give an outline of the goals to achieve at the meeting – to revisit and revise the 1999 Stockholm Consensus hierarchy. Then there was an admission that the conclusions had already been reached in June and that the meeting consensus had already been drafted in August, by the scientific committee. This was, Dr. Sandberg declared, how things are done these days.
Dr. Sandberg then concluded the introduction by invoking some famous quotations, particularly by George Box: “All models are wrong, but some are useful.” A similar quotation: “The best models are not necessarily the most useful models.” [Aside – Dr. Box was the founder and longtime chairman of the Department of Statistics at the University of Wisconsin and was noted for his practical assessments and solutions to real-world problems.]

In other words, the meeting organizers had already set the path for the consensus: there will be change to the Stockholm hierarchy, but the driving principle behind the changes should be to find the “least wrong” and “most useful” models. Whichever model could satisfy both goals should be the model that laboratories should adopt.

Next, Dr. Callum Fraser gave not only a quick historic review of the 1999 Stockholm Conference, where he served as a co-chair, but also a general survey of the history of laboratory quality specifications (from Tonks to Barnett to Cotlove, Harris and Williams, to the Odense group, and finally to the Fraser-Hyltoft-Petersen-Libeer-Ricos work on biologic goals, which was incorporated into the Stockholm hierarchy and came to be the dominant set of quality specifications in use). He also noted recent work by Dr. Haeckel proposing a new set of 5 quality classes, and another recent publication by Dr. George Klee proposing 6 approaches to quality specifications.

Session 1: Performance criteria based on clinical needs

The first in-depth session tackled the issue of defining performance criteria based on clinical needs. Dr. Andrea Horvath of Australia began this session by discussing outcome-related studies. She prefaced her discussion with an important note: that few tests have a definitive role in managing a medical condition, and it is only those tests with singular and well-defined purposes which can be usefully characterized with outcome studies (for example, hs-Troponin and HbA1c have specific diagnosis cutoffs). Once that hurdle is passed, then the clinicians and experts must be surveyed – in a way that allows investigation of the impact of analytical performance on their medical decisions and patient management. The studies may express the impacts as misclassification rates, which in turn may depend on another model of clinical outcomes, incorporating assumptions drawn about the disease prognosis, treatment benefits, harms, etc. In other words, this is complex stuff. Complex but, Dr. Horvath noted, not impossible.

Next, Dr. Per Hyltoft-Petersen, another one of the original Stockholm organizers, expanded the discussion to cover the use of simulation studies as a way to model the probability of clinical outcomes and the impact of analytical performance upon those outcomes. Most of his discussion is based on a study published in Clin Chim Acta:

Reprint of “Influence of analytical bias and imprecision on the number of false positive results using Guideline-Driven Medical Decision Limits” Per Hyltoft Petersen, George G Klee, Clinica Chimica Acta 432 (2014) 127-134.

The most useful summary of Dr. Hyltoft-Petersen's presentation is to quote the conclusion of his paper:

“Analytical quality specifications for bias and imprecision should be given separately, and only combined in relation to specific biological and clinical situations.

“It is possible to estimate the percentages of false positive results, both for the diagnosis of diabetes based on HbA1c and risk of coronary heart disease in adults based on serum cholesterol.

“The effect of poor analytical performance is reduced considerably by the criterion of two independent samplings measured above the decision point. Hereby, the effect of imprecision is greatly reduced. While the effect of bias is moderately reduced, it is still significant with approximately a doubling of FP for +2% bias. Because the use of a sharp decision limit makes the diagnostic decisions strongly sensitive to bias, the analytical specifications for bias should be very strict and probably set to no more than ±1%.

“The current application of sharp decision limits does not consider biological variation and their sensitivity towards disease.

“A probability function as an alternative to the sharp decision limit is proposed based on the biology or the relative ordering frequency for specific tests, e.g. parathyroid hormone tests ordered following a serum calcium concentration. Such function curves can also be used to determine the incremental change in test ordering as a function of the measurement bias for the target decision analyte.”

Call it the “Getting the Right Result the Second Time” approach. If labs can double their testing volume and eliminate single cutoffs in clinical guidelines, this approach will be very useful to them.
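To make the quoted effect concrete, here is a minimal sketch. It is not the model from the paper: it assumes purely Gaussian analytical error, ignores biological variation between the two samplings, and uses hypothetical function names and example numbers. It computes the chance that one true value is reported above a sharp decision limit, once and on two independent measurements.

```python
import math


def prob_above_limit(true_value, limit, bias_pct, cv_pct):
    """Probability that one measurement of true_value reads above limit.

    Assumes Gaussian analytical error; bias and CV are given in percent.
    """
    mean = true_value * (1.0 + bias_pct / 100.0)
    sd = mean * cv_pct / 100.0
    if sd == 0.0:
        return 1.0 if mean > limit else 0.0
    z = (limit - mean) / sd
    # Upper-tail area of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def prob_two_above_limit(true_value, limit, bias_pct, cv_pct):
    """Probability that two independent measurements both read above limit."""
    p = prob_above_limit(true_value, limit, bias_pct, cv_pct)
    return p * p


# Hypothetical example: a true HbA1c of 6.3% against a 6.5% decision limit.
for bias in (0.0, 2.0):
    single = prob_above_limit(6.3, 6.5, bias, cv_pct=2.0)
    double = prob_two_above_limit(6.3, 6.5, bias, cv_pct=2.0)
    print(f"bias {bias:+.0f}%: single {single:.3f}, two samplings {double:.3f}")
```

In this simplified setting, requiring two results above the limit shrinks the contribution of imprecision, while a positive bias raises both probabilities, which is in line with the direction of the quoted conclusion.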

Next, Dr. Sandberg presented a discussion of performance criteria derived from the surveying of clinician opinions. This is one of those virtuous cycles: if we know how the clinicians use the test, we can set specifications for quality, to ensure that the assay performance supports the use of the test by clinicians. The real trick is in teasing out the clinician opinions, either through probing their interpretation of case studies, or by asking them to define critical differences in test results. The hard reality is that there is sometimes a large “inter-clinician” variability, not only from clinician to clinician but also country to country. Further, clinicians may suppose that the quality of the assay is far better than it actually is. Again, this is a method that is only applicable to analytes that play a main role in specific clinical situations.

Session 2: Performance criteria based on biological variation

Dr. Carmen Ricos gave a history of the biologic variation database, which currently encompasses 358 analytes. True consensus has been reached that this is a monumentally important work, and that typically the allowable imprecision has been defined as ½ of the estimated within-subject biologic variation. It has often been declared that the Ricos goals are too stringent, particularly when compared with state of the art goals (for example, CLIA), but Dr. Ricos shared EQA data from the SEQC (Spanish EQA program) showing that nearly 90% of 888 participants were able to achieve the Total Allowable Error targets in 2013. Dr. Ricos did note some of the weaknesses of the database, such as the fact that over 200 of the analytes are estimated by only 1 study, and only 27 analytes have more than 10 studies used to make the estimate. She also acknowledged that the desirable specifications are not universally applicable – analytes like sodium, albumin, chloride and HbA1c have performance specifications that no current assay can achieve, while analytes like CRP, triglycerides, and urea may have performance specifications that are too permissive because of high biological variation.

Dr. Anna Carobene took up the next discussion of the biologic variation database, pointing out more flaws. A troubling trend is that fewer papers have been published on biologic goals in recent years, even as the number of new analytes has been increasing. Only 25% of the 247 reference papers in the database are from the last 14 years. This means that estimates of biologic variation may be based on old data and old assays. Further, it is not generally known if all the papers within the database correctly followed all the recommended protocols to produce useful estimates. Without more information about data in the studies, confidence intervals around the estimates of biologic variation cannot be constructed, and without knowing the confidence intervals around the biologic variation estimates, it’s difficult to know if the database estimates are reliable. Looking at the papers used to generate the HbA1c estimate of biologic variation, for example, Dr. Carobene noted that 2 of the 8 papers were not conducted on healthy patients and only 2 were conducted within the last 4 years. Dr. Carobene next pointed out that any time an estimate of CV was larger than 33.3%, it was likely an indication that the distribution of the values was not Gaussian, and therefore the usual statistical handling was not appropriate (logarithmic transformation is probably necessary). Despite the questions regarding the reliability of many of the database estimates, Dr. Carobene still asserted the importance of the database’s existence. She revealed that a new 7-hospital study is underway to collect new data on biologic variation, funded by Becton Dickinson and SIBioC, the Italian Society of Clinical Biochemists.

Dr. Bill Bartlett followed Dr. Carobene with a theoretical discussion about evaluation of the quality of studies of biologic variation. Now the head of a new EFLM working group on Biologic Variation, Dr. Bartlett will in fact be developing a critical appraisal checklist for papers on biologic variation. He rightly pointed out that the same rigor that is applied to reference interval studies needs to be applied to biologic variation studies, since they have become so important in deriving quality specifications. This means standardizing the methods of collection, the number of subjects, the populations studied, the state of health, number and type of samples, the sample analysis, the data analysis, and the data sets that are produced. Ultimately, there should be a “data archetype” which identifies a minimum set of attributes that will qualify the study to be effectively transmitted and used in the biologic variation database. The data archetype and the critical appraisal checklist that Dr. Bartlett develops will be applied both prospectively and retroactively. That is, the current biologic variation database will be reviewed and unsuitable studies will be eliminated – so likely an interim update will be generated that will significantly change the specifications and may in fact reduce the number of analytes that are covered in the database. Then, going forward, all new studies will have to pass the critical appraisal checklist if they want to be included in the database, and must conform to the “data archetype” so that the data can be effectively and safely transmitted across healthcare systems.

During the discussion, Dr. Carobene noted that it is likely that many of the current estimates of biologic variation will be reduced once all the checklists and archetypes are applied. Thus, the current estimates of allowable imprecision, bias, and total error are all probably too large. However, the organizers of the meeting did not call for the immediate cessation of the use of the current specifications. Instead, it is hoped that the current “Ricos goals”, while not perfect, are good enough to work with until the better estimates can be established.

Session 3: Performance criteria based on state of the art or laid down by regulation

The first presentation was to be given by Dr. Haeckel, but he could not attend, so his colleague, Dr. Thomas Streichert, presented instead. Dr. Streichert introduced a raft of new statistics, starting with the empiric biologic variation, CVE, which is a derivation of the reference interval and may be used as a surrogate for within-subject biological variation. This assumes of course that the laboratory has conducted a reference interval study for all of its analytes on all of its pertinent populations (at least in the US, that’s a huge assumption to make). Taking into account that true normal distributions are unlikely, Haeckel recommends assuming a logarithmic distribution and calculating CVE*, an empirical (biological) coefficient of variation. The CVE* correlates with the CVI from the biologic variation database. From that you could derive pCVA, the permissible analytical coefficient of variation, which would then also become your permissible standard uncertainty. Permissible bias, pB, would be calculated as a fraction (0.7) of pCVA. Permissible expanded uncertainty, pU%, therefore, using a combination of imprecision and bias, is 1.96 times the square root of the sum of the squares of pCVA and pB. Another pU could be calculated for EQA programs, pUEQAS%, which would be a 95% interval around the pU.
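The last two steps of that chain are simple arithmetic, and a short sketch may help. The function names and the example pCVA of 3.0% are hypothetical, and pCVA is taken as an input because this review does not give the formula that derives it from CVE*.

```python
import math


def permissible_bias(p_cva, fraction=0.7):
    """pB as a fixed fraction (0.7 in the talk) of pCVA, in percent."""
    return fraction * p_cva


def permissible_expanded_uncertainty(p_cva, fraction=0.7, coverage=1.96):
    """pU% = 1.96 * sqrt(pCVA^2 + pB^2), all values in percent."""
    p_b = permissible_bias(p_cva, fraction)
    return coverage * math.sqrt(p_cva ** 2 + p_b ** 2)


# Hypothetical input: a permissible analytical CV of 3.0%.
p_cva = 3.0
print(f"pB  = {permissible_bias(p_cva):.2f}%")
print(f"pU% = {permissible_expanded_uncertainty(p_cva):.2f}%")
```

Because pB is a fixed fraction of pCVA, pU% is always the same multiple of pCVA (about 2.39 with the factors above).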

Dr. Orth next discussed the Rilibak specifically, and regulation-driven performance criteria in general. I suppose CLIA has given a bad name to regulation-driven quality specifications, but Rilibak has recently proven itself superior in at least one way: it actually evolves. While CLIA has not changed its numbers, Rilibak just updated their specifications in mid-2014. Dr. Orth also noted that a new policy of Health Technology Assessment has been put in place, requiring new methods to prove they provide gains in efficiency and improvements in health. Laudable as this goal is, it has in effect frozen the marketplace, since many companies are not willing to perform the additional work and studies to gain that level of clearance.

Thus, the first day ended, and after the discussion and summary, participants were finally given a copy of the Draft Consensus that the committee had pre-written for the meeting. The second day was more of a potpourri collection, since the heavy discussion of the models was complete.

Day 2

Session 4: Performance criteria in different situations

Dr. Schimmel began by reviewing the traceability context and the importance of designing reference materials to a common standard. Even the concepts of ISO 17511 are changing, as the standard takes into account the fact that the measurands might be different along the chain; even if that is true, they should still be known.

Dr. Panteghini took up the uncertainty baton, and expanded the “Temple of Laboratory Standardization” to 7 pillars: Reference Methods, Reference Materials, Accredited Reference Laboratories, Traceable Reference Intervals and Decision Limits, Appropriately organized analytical quality control, and Targets for uncertainty and error of measurement (fitness for purpose). Most of us have seen the image of the Traceability chain, with measurement uncertainty ever increasing as you pass down towards the routine sample result. Dr. Panteghini noted that the measurement uncertainty budget should encompass the uncertainty of references, the uncertainty of system calibration, the system imprecision, and individual laboratory performance (an IQC safety margin). If we recall that Expanded uncertainty U = 2*u, that means we’re trying to fit a lot of uncertainties into a small sum. Dr. Panteghini showed that this means less than 33% of the U budget can be consumed by the uncertainty of references, and about 50% of the budget will be consumed by the manufacturer’s calibration and value transfer level. The rest is available for the system imprecision and individual laboratory performance. For this system to work properly, manufacturers will need to take more responsibility and ensure the traceability of the combination of platform, reagents, calibrators, and control materials. Indeed, they should report the combined (Expanded) uncertainty “associated with their calibrations when used in conjunction with other components of the analytical system (platform and reagents).” This is more than what manufacturers are currently providing as traceability information – typically they only provide the name of the higher order reference material or procedure to which the calibration is traced.

Dr. Ferruccio Ceriotti took the last leg of the uncertainty relay. His lecture described how to transfer all of these uncertainty calculations onto the IQC chart. Using CLSI C24-A3 as his template, he reworked the IQC implementation process so it was completely uncertain. Using an overall CV (a weighted mean of CV from for example 6 months of data), a laboratory can estimate its uncertainty related to random variability. The systematic component of the uncertainty (Bias) can be estimated by using the calibrator as a surrogate (use the uncertainty of the value assigned to the calibrator). These two components are squared, combined and then a square root is taken. This is now the combined uncertainty, u. Again, the expanded uncertainty, U = 2*u. From this Dr. Ceriotti then defines upper and lower specification limits (USL and LSL, which could come from Ricos goals, for example) such that the actual measurement must fall between USL-U and LSL+U. In this manner, no typical statistical control rules are used: no 1s, 2s, or 3s, no “Westgard Rules” etc. It’s all uncertainty.
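The calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not Dr. Ceriotti's implementation: the function names are hypothetical, all uncertainties are expressed as percentages of the target value, and the specification limits are taken as a symmetric percentage around the target.

```python
import math


def combined_uncertainty(cv_pct, u_cal_pct):
    """Combined standard uncertainty u: random and calibrator parts in quadrature."""
    return math.sqrt(cv_pct ** 2 + u_cal_pct ** 2)


def acceptance_limits(target, spec_pct, cv_pct, u_cal_pct, k=2.0):
    """Return (LSL + U, USL - U) in the units of target, with U = k * u."""
    expanded_pct = k * combined_uncertainty(cv_pct, u_cal_pct)
    expanded_abs = target * expanded_pct / 100.0
    lsl = target * (1.0 - spec_pct / 100.0)
    usl = target * (1.0 + spec_pct / 100.0)
    lower, upper = lsl + expanded_abs, usl - expanded_abs
    if lower > upper:
        raise ValueError("expanded uncertainty exceeds the specification")
    return lower, upper


# Hypothetical control: target 100, specification of +/-10%,
# overall CV 2.0%, calibrator uncertainty 1.5% (so u = 2.5%, U = 5%).
low, high = acceptance_limits(100.0, 10.0, 2.0, 1.5)
print(f"accept control results between {low:.1f} and {high:.1f}")
```

One consequence of this design is visible in the `ValueError` branch: when U is larger than the specification itself, no acceptance window is left at all.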

Session 5: Performance criteria for extra-analytical phases in the total testing process

Dr. Graham Jones brought the conference to a new topic: how to set criteria for EQA (PT) programs. One of the most important statements he made, and I’m paraphrasing here, is that “any single result in an EQA scheme is inherently judged by Total Error limits.” It follows that quality standards set by EQA programs are implicitly goals based on Total Error. With more measurements, it is indeed possible to tease out separate estimates of bias and imprecision, but the global system of EQA and PT is truly grounded in Total Error. This was not to deny that individual specifications for CV and Bias were useful in other quality management applications. Dr. Jones noted that there was an embarrassing variation in the performance criteria being used by EQAs around the world, partly driven by different perceptions of the purpose of EQA. Is EQA really only a technique to exclude the worst labs? Is it to set an expected standard that most can pass, or is it supposed to set an aspirational standard that some will pass, but others won’t, unless better methods are introduced? The severity of failure penalties also explains why some standards are tighter than others. If CLIA’s penalty for failure is de-registration and non-payment, this is high-stakes testing. Dr. Jones stated that his organization, the RCPA (Royal College of Pathologists of Australasia), had made a concerted effort to set specifications (allowable limits of performance) that were from higher levels of the Stockholm Hierarchy.

Dr. Wytze Oosterhuis began his talk by quoting from his 2011 letter to the editor in Clinical Chemistry, Gross Overestimation of Total Allowable Error Based on Biological Variation (Clin Chem 2011; 57: 1334). In this letter he notes that Fraser and Hyltoft Petersen, when they published their biologically-derived quality goal specification, defined the total error as the sum of 1.65 * 0.5 * CVI and 0.25 * the square root of the sum of CVI squared and CVG squared. However, that bias specification traces back to Gowans et al and assumed that the specifications for CV and Bias were maximums (that is, if Bias was at its maximum, SD had to be zero, and vice versa). Dr. Oosterhuis presented an example where for CK, the Ricos goal is set at 30.3%, but by his calculation, it should be at most 18.9%. As with Dr. Carobene, the implications of this presentation are that the Total Error as calculated from Petersen-Fraser’s model will be too large, and that the appropriate maximum allowable bias and maximum allowable imprecision are much smaller.
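For readers who want to see where a figure like 30.3% comes from, here is a minimal sketch of the conventional calculation being criticized. The function names are hypothetical, and the CK inputs (CVI 22.8%, CVG 40.0%) are assumed from my recollection of the biologic variation database rather than taken from the presentation; they are used only because they reproduce the quoted goal. The sketch does not attempt Dr. Oosterhuis's alternative calculation.

```python
import math


def desirable_imprecision(cvi):
    """Allowable imprecision: half the within-subject variation (percent)."""
    return 0.5 * cvi


def desirable_bias(cvi, cvg):
    """Allowable bias: 0.25 * sqrt(CVI^2 + CVG^2) (percent)."""
    return 0.25 * math.sqrt(cvi ** 2 + cvg ** 2)


def total_allowable_error(cvi, cvg, z=1.65):
    """Conventional TEa: z * allowable imprecision + allowable bias (percent)."""
    return z * desirable_imprecision(cvi) + desirable_bias(cvi, cvg)


# Assumed CK inputs (not from the talk): CVI = 22.8%, CVG = 40.0%.
cvi, cvg = 22.8, 40.0
print(f"allowable CV   = {desirable_imprecision(cvi):.1f}%")
print(f"allowable bias = {desirable_bias(cvi, cvg):.1f}%")
print(f"TEa            = {total_allowable_error(cvi, cvg):.1f}%")
```

The objection summarized above is visible in the structure of `total_allowable_error`: it adds the two maximum allowances together, even though each was originally derived on the assumption that the other component was zero.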

Switching gears again, Dr. Gunnar Nordin talked about setting performance criteria for “qualitative test” procedures. Most of the conference had been concentrating on assays that generate a numerical, quantifiable result. What about tests that are “qualitative” or “semi-quantitative?” Dr. Nordin began with a question, “What is a ‘Qualitative’ Test?” The answer can be further categorized into two types of tests: tests that use an ordinal scale (all types of grading, including negative/positive), and tests that use a nominal scale (classification of disease, “yes/no”). Nominal tests are like blood typing; ordinal scale tests are like pregnancy tests. Ordinal measurements have no units, and it is not clear how to establish “traceability”, estimate precision, and express uncertainty for the results. Dr. Nordin suggested defining the c5, c50, and c95 quantities, with the manufacturer declaring the c50 value and describing metrological traceability. [See CLSI EP12 guideline on User Verification of Qualitative Test Performance for information on these quantities.]
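As a rough illustration of those quantities: c5, c50, and c95 are generally understood as the concentrations at which 5%, 50%, and 95% of replicate results are positive. The sketch below estimates them by straight-line interpolation between tested concentrations. The data, the function name, and the choice of linear interpolation are all assumptions for illustration; this is not the procedure from the talk or from the guideline.

```python
def concentration_at_fraction(concentrations, positive_fractions, target):
    """Interpolate the concentration at which `target` of results are positive.

    Assumes concentrations are sorted ascending and the positive fractions
    do not decrease as concentration rises.
    """
    points = list(zip(concentrations, positive_fractions))
    for (c0, f0), (c1, f1) in zip(points, points[1:]):
        if f0 <= target <= f1 and f1 > f0:
            return c0 + (target - f0) * (c1 - c0) / (f1 - f0)
    raise ValueError("target fraction is not bracketed by the data")


# Hypothetical replicate study: fraction of positive calls at each level.
levels = [10.0, 20.0, 30.0, 40.0, 50.0]
fractions = [0.00, 0.10, 0.50, 0.90, 1.00]

for name, target in (("c5", 0.05), ("c50", 0.50), ("c95", 0.95)):
    value = concentration_at_fraction(levels, fractions, target)
    print(f"{name} is roughly {value:.1f}")
```

The distance between c5 and c95 gives a feel for how wide the zone of inconsistent results is around the cutoff, which is the nearest thing to a precision estimate for this kind of test.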

Finally, we reached the last session, which covered the remaining parts of the Total Testing Process or Brain-to-Brain Loop.

Dr. Mario Plebani, a pre-eminent expert on sources of errors in the Total Testing Process, discussed his latest project, which is to establish and monitor Quality Indicators in the pre-analytical and post-analytical phase. A hierarchy of criteria has not yet been defined for the pre-analytical phase, and the quality performance specifications are still in development. However, progress has recently been made to standardize the reporting metrics: errors will be expressed as percentages, parts per million (ppm), and/or as Six Sigma metrics. The IFCC Working Group on Laboratory Errors and Patient Safety (WG-LEPS) has built a universal set of quality indicators, with approximately 28 in the pre-analytical phase, 6 in the analytical phase, and 11 in the post-analytical phase. Examples of pre-analytical quality indicators include Pre-OutpTN, which monitors the number of outpatient requests with erroneous data entry. After establishing these monitors, Dr. Plebani proposes to set three levels: optimum, desirable, and minimum, similar to the Fraser specifications for biologic variation. For example, through an 80-lab survey of the monitors, Dr. Plebani identified the 25th percentile, median, and 75th percentile error rates for misidentification. The optimal goal for misidentification errors is obviously 0. But realistically, it is unlikely labs can achieve that. It is, however, desirable if they could perform in the top 25th percentile, or a rate of 0.010. They should be able to at least achieve a rate of 0.040, which is therefore the minimum quality specification. The full pre-analytical quality indicator list and specifications are expected in a forthcoming, highly anticipated paper.
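The three reporting metrics and the three-level grading can be sketched as follows. The function names are hypothetical. The 1.5 shift in the sigma calculation is the widely used Six Sigma convention and is an assumption here, since the talk did not specify how the metric would be computed. The grading thresholds are the 0.010 and 0.040 quoted above, applied in whatever unit the rate is reported in.

```python
from statistics import NormalDist


def as_percent(errors, total):
    """Error rate as a percentage."""
    return 100.0 * errors / total


def as_ppm(errors, total):
    """Error rate as parts per million."""
    return 1_000_000.0 * errors / total


def sigma_metric(errors, total, shift=1.5):
    """Sigma metric from a defect rate, using the conventional 1.5 shift."""
    p = errors / total
    if not 0.0 < p < 1.0:
        raise ValueError("defect rate must be strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - p) + shift


def grade(rate, desirable=0.010, minimum=0.040):
    """Grade an error rate against the optimum/desirable/minimum levels."""
    if rate == 0:
        return "optimum"
    if rate <= desirable:
        return "desirable"
    if rate <= minimum:
        return "minimum"
    return "below minimum"


# Hypothetical count: 34 misidentified requests in 10,000,000.
print(f"{as_percent(34, 10_000_000):.5f}%")
print(f"{as_ppm(34, 10_000_000):.1f} ppm")
print(f"sigma metric about {sigma_metric(34, 10_000_000):.1f}")
print(grade(0.030))
```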

The last presentation was by Dr. Ken Sikaris about the post-analytical phase of testing and performance criteria. The main concern of the post-analytical phase is test interpretation, and Dr. Sikaris presented many troubling examples of how variable the interpretation can be. He referenced a paper showing how the Stockholm criteria can be applied to Reference Intervals and Decision Limits, as well as a paper showing how Australia harmonized its reference intervals. Clearly there is a lot of room for improvement in standardizing how laboratory professionals report their results.

Summing up: Where do we go from here?

Finally, at the end of the day, the conference organizers opened the discussion up on the Draft Consensus Statement, as well as the Five Working Groups they had decided to establish.

  • Task Group 1 is to allocate different tests to different models.
  • Task Group 2 is to harmonize the quality specifications for EQA analytes.
  • Task Group 3 is to debate the utility of Total Error and possibly to amend the calculation of Biological Total Allowable Errors.
  • Task Group 4 is to define the quality specifications for the pre-analytical and post-analytical phases of testing.
  • Task Group 5 is to examine the biological variation database and cull the unacceptable studies from the current database, while developing a checklist for assuring that future studies meet the appropriate criteria to be included in the database.

These groups are expected to work over the next two years and produce multiple deliverables, mainly in the form of papers to be published. While the organizers declared that the working groups were open to participants, there were no confirmations of participation given during the meeting. Whether or not you will be included in any working group will depend upon a later decision of the conference organizers and/or the scientific committee.

So that’s the mini-review. Remember the slides are available on the EFLM website and CCLM will publish papers by each of the presenters on their topics in a special issue (coming in 2015).

The road ahead looks like it will have many new models, and importantly, you may have a choice of which quality specification is appropriate for your test. Of all the models introduced and discussed, only the Total Error model seems to have incurred the disfavor of the organizers. That is the one model they are specifically criticizing and attempting either to modify or eliminate. As long as that model isn't the one you use in your QC, EQA, PT, method validation, instrument choice, etc., there isn't anything to worry about.