
Good Data Wanted - Bad Data Should Not Apply (The STARD Initiative)

In 2003, the premier scientific journals announced a new initiative called STARD (Standards for Reporting Diagnostic Accuracy). What does STARD mean for you? And what does it say about the data being reported in journals today? Dr. Westgard investigates. (Preview)

The STARD Initiative

An updated version of this essay appears in the Nothing but the Truth about Quality book.
January 2003

Some people tire of my preaching about the need for correct test results and the importance of quality control. Yet everyone accepts that good data is a prerequisite for good management. Somehow, though, many people think they will know it when they see it, i.e., that they can tell bad data from good data.

What does "good data" look like?

What is it about bad data that makes it look good? What gives bad data the appearance that we can trust the results and conclusions based on it? Basically, it has the appearance of exactness because it has been printed or published in some form. Just printing a test result on a computer report makes the number look reliable. Somehow, through the act of being published, printed, and/or cited enough times, a kind of magic takes place: whatever its origins, the bad data transmutes into good data. I've cited some examples in earlier essays about such magic acts:

  • The magic of numbers on the distribution of errors among the pre-analytic, analytic, and post-analytic parts of the total testing process, which were originally contained in an abstract, subsequently cited in peer-reviewed papers, and then cited from those papers (rather than from the original abstract) in later peer-reviewed papers. See Errors in reasoning about laboratory errors.
  • The magic of recommendations for method performance and quality control in national guidelines for medical treatment, such as the recent ADA/HSS guidelines for glucose testing for diagnosis of diabetes. See Cooking the books: Is it happening with laboratory error budgets?
  • The magic of specifications for method performance and quality control as published under the guise of evidence-based medicine, even though there is no scientific basis or evidence for those recommendations. See Why not evidence-based method specifications.
  • The magic of numbers on the performance of new diagnostic tests, such as high-sensitivity CRP, based on epidemiological studies that consider large groups of patients but don't consider the known biologic variability of individual patients. See Quintiles and quality. See also the January 2003 issue of Clinical Chemistry for some interesting discussion of this issue [1,2].

STARD initiative

Other papers in the January 2003 issue of Clinical Chemistry bring new recognition of the problem with published numbers. There are three papers [3-5] on "Standards for Reporting Diagnostic Accuracy," or STARD as it is called. A total of 22 pages are devoted to STARD, which indicates this must be a pretty important subject.

The STARD initiative is an attempt to provide more uniform guidelines for the publication of studies on "diagnostic accuracy" in scientific journals. Diagnostic accuracy has to do with the initial validation of the medical usefulness of new diagnostic technology, including new laboratory tests. This phase of validation involves testing patients who are known to have a certain disease, as well as patients who are known NOT to have that disease, in order to assess the ability of the test to classify patients correctly. Typically such studies report the diagnostic sensitivity and specificity, which can be presented graphically in the form of receiver operating characteristic (ROC) curves. The results of these studies then become the basis of further "data mining" and "meta-analysis" publications, which pool the data from previously published studies, giving higher numbers of test subjects and generally better and better numbers for diagnostic performance.
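
To make those measures concrete, here is a minimal Python sketch, using invented numbers rather than data from any published study, of how diagnostic sensitivity and specificity are tallied from patients of known disease status, and how sweeping the decision cutoff traces out the points of an ROC curve:

```python
# Minimal sketch: diagnostic sensitivity, specificity, and ROC points.
# All values below are invented for illustration only; they are not
# taken from any published study.

def sensitivity_specificity(diseased, healthy, cutoff):
    """Classify results >= cutoff as positive and tally the 2x2 table."""
    tp = sum(1 for x in diseased if x >= cutoff)   # true positives
    fn = len(diseased) - tp                        # false negatives
    tn = sum(1 for x in healthy if x < cutoff)     # true negatives
    fp = len(healthy) - tn                         # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical results for patients known to have the disease and
# patients known NOT to have it.
diseased = [8.1, 7.4, 9.0, 6.8, 7.9, 8.5, 6.2, 7.7]
healthy = [5.0, 6.1, 4.8, 5.5, 6.4, 5.2, 4.9, 5.8]

# Each cutoff yields one ROC point (1 - specificity, sensitivity);
# sweeping the cutoff traces the ROC curve.
for cutoff in [5.0, 5.5, 6.0, 6.5, 7.0]:
    se, sp = sensitivity_specificity(diseased, healthy, cutoff)
    print(f"cutoff {cutoff:.1f}: sensitivity {se:.2f}, "
          f"specificity {sp:.2f}, ROC point ({1 - sp:.2f}, {se:.2f})")
```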

According to David Bruns, editor of Clinical Chemistry, STARD has been endorsed by several other major journals and is published in BMJ, Annals of Internal Medicine, Radiology, and AJCP, among others, with more on the way. There is also a recent editorial in JAMA endorsing STARD, available at http://jama.ama-assn.org/issues/v289n1/ffull/jed20079.html

One ominous conclusion you can reach from this initiative: there must be an awful lot of bad data out there if a group of journals finds it necessary to cooperatively issue these standards.

STARD checklist

The first article in the series of three papers is a literature survey of protocols and recommendations on diagnostic research methodology [3]. The results were summarized as follows:

"The search of the published guidelines on diagnostic research yielded 33 previously published checklists, from which we extracted a list of 75 potential items. The consensus meeting shortened the list to 25 items, using evidence on bias whenever available…"

The second article details the list of the 25 items in the checklist [4]. Of interest here are some of the items that have to do with the reliability of the measurement data:

"Item 8. Describe the technical specifications of material and methods involved including how and when measurements were taken, and/or cite references for index tests and reference standards.

"Item 9. Describe the definition of and rationale for the units, cutoffs and/or categories of the results of the index tests and the reference standard.

"Item 10. Describe the number, training and expertise of the persons executing and reading the index tests and the reference standards.

"Item 11. Describe whether or not the readers of the index tests and reference standard were blind (masked) to the results of the other test and describe any other clinical information available to the readers.

"Item 12. Describe methods for calculating or comparing measures of diagnostic accuracy, and the statistical methods used to quantify uncertainty (e.g., 95% confidence intervals).

"Item 13. Describe methods for calculating test reproducibility, if done." (emphasis added here)

The only item that deals directly with the analytical performance of the test is Item 13, which addresses "reproducibility, if done." The only recommendation seems to be that reproducibility, if done, should be reported, though later on there is some discussion suggesting that this estimate of reproducibility, if done, would be useful to "show the reader the range of likely values around an estimate of diagnostic accuracy."
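
As one illustration of what Item 12 asks for, here is a short Python sketch, using an invented study size and the Wilson score interval as one common choice of method, of how a 95% confidence interval shows the reader the range of likely values around an estimated sensitivity:

```python
# Minimal sketch of Item 12: quantify the uncertainty of a diagnostic
# accuracy estimate with a 95% confidence interval. The Wilson score
# interval is one common choice; the counts below are invented.
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical study: 90 of 100 diseased patients test positive.
tp, n_diseased = 90, 100
lo, hi = wilson_ci(tp, n_diseased)
print(f"sensitivity {tp / n_diseased:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# With only 100 diseased subjects, a "90% sensitive" claim actually
# spans roughly 0.83 to 0.94, which is the range of likely values
# that the reader should be shown.
```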

 

We invite you to read the rest of this article.

This complete article and many more essays are available in the Nothing but the Truth about Quality manual. You can also download the Table of Contents and additional chapters here.