Outcomes Research

CHAPTER 5 Outcomes Research

The time when physicians chose treatment based solely on their personal notions of what was best is past. This era, although chronologically recent, is now conceptually distant. In a health care environment altered by abundant information on the Internet and continual oversight by managed care organizations, patients and insurers are now active participants in selecting treatment. Personal notions (“expert opinion”) are replaced by objective evidence. And the physician’s sense of what is best is being supplemented by patients’ perspectives on outcomes after treatment.

Outcomes research (clinical epidemiology) is the scientific study of treatment effectiveness. The word effectiveness is a critical one, because it pertains to the success of treatment in populations found in actual practice in the real world, as opposed to treatment success in the controlled populations of randomized clinical trials in academic settings (“efficacy”).1,2 Success of treatment can be measured using survival, costs, and physiologic measures, but frequently health-related quality of life (QOL) is a primary consideration.

Therefore, to gain scientific insight into these types of outcomes in the observational (nonrandomized) setting, outcomes researchers need to be fluent with methodologic techniques that are borrowed from a variety of disciplines, including epidemiology, biostatistics, economics, management science, and psychometrics. A full description of the techniques in clinical epidemiology3 is beyond the scope of this chapter. This chapter provides the basic concepts in effectiveness research and a sense of the breadth and capacity of outcomes research and clinical epidemiology.


In 1900, Dr. Ernest Codman proposed to study what he termed the “end results” of therapy at the Massachusetts General Hospital.4 He asked his fellow surgeons to report the success and failure of each operation and developed a classification scheme by which failures could be further detailed. Over the next 2 decades, his attempts to introduce systematic study of surgical end results were scorned by the medical establishment, and his prescient efforts to study surgical outcomes gradually faded.

Over the next 50 years, the medical community accepted the randomized clinical trial (RCT) as the dominant method for evaluating treatment.5 By the 1960s, the authority of the RCT was rarely questioned.6 However, a landmark 1973 publication by Wennberg and Gittelsohn spurred a sudden re-evaluation of the value of observational (nonrandomized) data. These authors documented significant geographic variation in rates of surgery.7 Tonsillectomy rates in 13 Vermont regions varied from 13 to 151 per 10,000 persons, even though there was no variation in the prevalence of tonsillitis. Even in cities with similar demographics and similar access to health care (Boston and New Haven, Conn.), rates of surgical procedures varied 10-fold. These findings raised the question of whether the higher rates of surgery represented better care or unnecessary surgery.

Researchers at the Rand Corporation sought to evaluate the appropriateness of surgical procedures. Supplementing relatively sparse data in the literature about treatment effectiveness with expert opinion conferences, these investigators argued that rates of inappropriate surgery were high.8 However, utilization rates did not correlate with rates of inappropriateness, and therefore did not explain all of the variation in surgical rates.9,10 To some, this suggested that the practice of medicine was anecdotal and inadequately scientific.11 In 1988, a seminal editorial by physicians from the Health Care Financing Administration argued that a fundamental change toward study of treatment effectiveness was necessary.12 These events subsequently led Congress to establish the Agency for Health Care Policy and Research in 1989 (since renamed the Agency for Healthcare Research and Quality, or AHRQ), which was charged with “systematically studying the relationships between health care and its outcomes.”

In the past decade, outcomes research and the AHRQ has become integral to understanding treatment effectiveness and establishing health policy. Randomized trials cannot be used to answer all clinical questions, and outcomes research techniques can be used to gain considerable insights from observational data (including data from large administrative databases). With current attention on evidence-based medicine and quality of care, a basic familiarity with outcomes research is more important than ever.

Key Terms and Concepts

The fundamentals of clinical epidemiology are best understood by thinking about an episode of treatment: a patient presents at baseline with an index condition, receives treatment for that condition, and then experiences a response to treatment. Assessment of baseline state, treatment, and outcomes are all subject to bias. We begin with a brief review of bias and confounding.

Assessment of Baseline

Most physicians are aware of the confounding influences of age, gender, ethnicity, and race. However, accurate baseline assessment also means that investigators should carefully define the disease under study, account for disease severity, and consider other important variables such as comorbidity.

Disease Severity

The severity of disease strongly influences response to treatment. This reality is second nature for oncologists, who use tumor-node-metastasis stage to select treatment and interpret survival outcomes. It is intuitively clear that the more severe the disease, the more difficult it will be (on average) to restore function. Yet this concept has not been fully integrated into the study and practice of common otolaryngologic diseases such as sinusitis and hearing loss.

Recent progress has been made in sinusitis. Kennedy identified prognostic factors for successful outcomes in patients with sinusitis and has encouraged the development of staging systems.16 Several staging systems have been proposed, but most systems rely primarily on radiographic appearance.1720 Clinical measures of disease severity (symptoms, findings) are not typically included. Although the Lund-Mackay staging system is reproducible,21 often radiographic staging systems have correlated poorly with clinical disease.2226 As such, the Zinreich method was created as a modification of the Lund-Mackay system, adding assessment of osteomeatal obstruction.27 Alternatively, the Harvard staging system has been reproducible21 and may predict response to treatment.28 Scoring systems have also been developed for specific disorders such as acute fungal rhinosinusitis,29 and clinical scoring systems based on endoscopic evaluation have likewise been developed.30 The development and validation of reliable staging systems for other common disorders, as well as the integration of these systems into patient care, is a pressing challenge in otolaryngology.


Comorbidity refers to the presence of concomitant disease unrelated to the “index disease” (the disease under consideration), which may affect the diagnosis, treatment, and prognosis for the patient.3133 Documentation of comorbidity is important, because the failure to identify comorbid conditions such as liver failure may result in inaccurately attributing poor outcomes to the index disease being studied.34 This baseline variable is most commonly considered in oncology, because most models of comorbidity have been developed to predict survival.32,35 The Adult Comorbidity Evaluation 27 (ACE-27) is a validated instrument for evaluating comorbidity in cancer patients and has shown the prognostic significance of comorbidity in a cancer population.36,37 Because of its impact on costs, utilization, and QOL, comorbidity should be incorporated in studies of nononcologic diseases as well.

Assessment of Outcomes

Fundamentals of Study Design

A variety of study designs are used to gain insight into treatment effectiveness. Each has advantages and disadvantages. The principal tradeoff is complexity versus rigor, because rigorous evidence demands greater effort. An understanding of the fundamental differences in study design can help interpret the quality of evidence, which has been formalized by the evidence-based medicine (EBM) movement. EBM is the “conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients.”41 EBM is discussed in detail elsewhere in this textbook. The following paragraphs summarize the major categories of study designs, with reference to the EBM hierarchy of levels of evidence (Table 5-1).41,42

Other Study Designs

There are numerous other important study designs in outcomes research, but a detailed discussion of these techniques is beyond the scope of this chapter. The most common approaches include decision analyses,44,45 cost-identification and cost-effectiveness studies,4648 secondary analyses of administrative databases,4951 and meta-analyses.52,53 Critiques of these techniques are referenced for completeness.

Grading of Evidence-Based Medicine Recommendations

EBM uses the levels of evidence described earlier to grade treatment recommendations (Table 5-2).54 The presence of high-quality RCTs allows treatment recommendations for a particular intervention to be ranked as grade A. If no RCTs are available, but there is level 2 or 3 evidence (observational study with a control group or a case-control study), then the treatment recommendations are ranked as grade B. The presence of only a case series would result in a grade C recommendation. If only expert opinion is available, then the recommendation for the index treatment is considered grade D.

Table 5-2 Grade of Recommendation and Level of Evidence

Grade of Recommendation Level of Evidence
A 1
B 2 or 3
C 4
D 5

Measurement of Clinical Outcomes

Clinical studies have traditionally used outcomes such as mortality and morbidity, or other “hard” laboratory or physiologic endpoints,55 such as blood pressure, white cell counts, or radiographs. This practice has persisted despite evidence that interobserver variability of accepted “hard” outcomes such as chest x-ray findings and histologic reports are distressingly high.56 In addition, clinicians rely on “soft” data, such as pain relief or symptomatic improvement to determine whether patients are responding to treatment. But because it has been difficult to quantify these variables, these outcomes have until recently been largely ignored.

Psychometric Validation

An important contribution of outcomes research has been the development of questionnaires to quantify these “soft” constructs, such as symptoms, satisfaction, and QOL. Under the Classical Test Theory, a rigorous psychometric validation process is typically followed to create these questionnaires (more often termed scales, or instruments). These scales can then be administered to patients to produce a numeric score. The validation process is introduced herein; a more complete description can be found elsewhere.5759 The three major steps in the process are the establishment of reliability, validity, and responsiveness; in addition, increasing consideration is also given to burden.

Responsiveness. A responsive scale detects clinically important change.62 For instance, a scale may distinguish a moderately hearing impaired individual from a deaf individual (the scale is “valid”), but can it detect a different score if the individual’s hearing improves mildly after surgery? Alternatively, the minimum improvement in score that represents a clinically important change might be provided.63,64

More recently, item response theory (IRT) has been used to create and evaluate self-reported instruments. A full discussion of IRT is beyond the scope of this chapter. In brief, Item Response Theory uses mathematic models to draw conclusions based on the relationships between patient characteristics (latent traits) and patient responses to items on a questionnaire. A critical limitation is that IRT assumes that only one domain is measured by the scale. This may not fit assumptions for multidimensional QOL scales. However, if this assumption is valid, IRT-tested scales have several advantages. IRT allows for the contribution of each test item to be considered individually, thereby allowing the selection of a few test items that most precisely measure a continuum of a characteristic. In other words, because each test item is scaled to a different portion of the characteristic being tested, the number of questions can be reduced.6568 Therefore, IRT lends itself easily to adaptive computerized testing, allowing for significantly diminished testing time and reduced test burden.65 In the future, IRT will likely be the basis to more and more new questionnaires evaluating outcomes, including QOL.

Jun 5, 2016 | Posted by in OTOLARYNGOLOGY | Comments Off on Outcomes Research

Full access? Get Clinical Tree

Get Clinical Tree app for offline access