Distinction of patient-reported health status information into prediction and outcome variables
Within each group there are several variables of potential interest which can be further differentiated. Figure 7.2 shows an adaptation for tinnitus patients of a well-established theoretical framework for the assessment of patient-reported outcomes in chronic conditions published by Wilson and Cleary (1995).
There is a large body of literature investigating potential determinants of the development of tinnitus and the perception of tinnitus symptoms. Within a clinical practice setting, we believe it is important to limit the assessment to a reasonable number of constructs that are important either for monitoring treatment success or as potential predictors of treatment success that may also be treatment targets, such as perceived stress, self-management skills, or the mental health status of the patient.
On the outcome side of the measurement model, it seems important to distinguish between the assessment of tinnitus symptoms, e.g., noise sensation and hearing loss, and the impact of both symptoms on the patient, their health-related quality of life, or even their global quality of life perception. Whereas the assessment of the symptoms should be descriptive, the assessment of the disease impact on the patient’s life inherently includes the patient’s appraisal and coping processes. Both are important treatment targets, in particular in situations where the primary symptoms cannot be reduced. Although 10–15% of the population suffer from tinnitus (Henry et al. 2005), just 1% report a serious impact on their quality of life (Axelsson and Ringdahl 1989), which demonstrates the relevance of distinguishing conceptually different treatment outcomes in clinical settings.
7.2 Overview of Commonly Used Assessments
7.2.1 Patient-Reported Outcomes
There are several disease-specific outcome measurement tools, which are frequently used for scientific research questions. However, psychometric assessments used in clinical practice settings must meet much higher standards for their measurement properties. Whereas in clinical research, larger sample sizes can compensate for measurement error, this is not possible in clinical practice with a sample size of n = 1. In addition, instruments need to provide a suitable, real-time report with reference values to enable clinical decision-making. As will be discussed further below, the ideal situation would be to have different tools for different purposes that still measure the same construct on the same scale, so that scores are comparable between different settings. To date, this goal has not been achieved. Thus, at this time we need to choose among different tools and apply the one most feasible for clinical practice settings.
There are two established tools measuring primarily the tinnitus severity: Tinnitus Severity Index and Tinnitus Severity Questionnaire. Both instruments are discussed in Chap. 8.
Disease Impact
Five commonly used tools provide an assessment of the impact of tinnitus, with some using composite scores, including the assessment of tinnitus symptoms. All are discussed in detail in Chap. 8.
Health-Related Quality of Life
There is a plethora of different tools assessing generic self-reported health status. Within the scope of this chapter, we would only like to mention the Short Form 36 Health Survey (SF-36®), which is the most widely used of all psychometric instruments and has been applied in over 10,000 published studies.
The SF-36 (Bullinger 1995) is an interdisciplinary instrument for recording patients’ health-related quality of life and is already regarded as an important survey instrument in tinnitus studies (Muluk 2009). A total of eight dimensions are assessed, which can be grouped into the areas of “physical health” and “mental health”: “physical functioning,” “physical role functioning,” “bodily pain,” “general health perception,” “vitality,” “social functioning,” “emotional role functioning,” and “mental well-being.” Scores are obtained by adding up the marked responses per subscale. A computer program is available for scoring. To facilitate interpretation, all scales are usually expressed on a norm-based scale with a mean of 50 and a standard deviation of 10, standardized by means of representative population data. A value of 50 corresponds to the mean of the general population. A higher value corresponds to better health. The internal consistency of the subscales is between r = 0.57 and r = 0.94.
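The standardization described above can be illustrated with a minimal sketch. It assumes that the 50/10 scale denotes a linear transformation in which the population mean is mapped to 50 and one population standard deviation corresponds to 10 points. The function name and the reference values are hypothetical and are not taken from the official SF-36 scoring software or norm tables.

```python
def to_norm_based_score(raw_score: float, population_mean: float, population_sd: float) -> float:
    """Rescale a raw subscale score so that the reference population has mean 50 and SD 10."""
    if population_sd <= 0:
        raise ValueError("population_sd must be positive")
    z = (raw_score - population_mean) / population_sd  # distance from the population mean in SD units
    return 50.0 + 10.0 * z


# Hypothetical reference values for one subscale (NOT real SF-36 norms).
example = to_norm_based_score(raw_score=60.0, population_mean=70.0, population_sd=20.0)
print(example)  # 45.0, i.e., half a standard deviation below the population mean
```

On such a scale, a patient's score can be read directly against the general population: 50 is average, and each 10 points correspond to one standard deviation.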
With the SF-12, a short form is available that allows the physical and mental component scores to be measured. The SF-8 short form is not suitable for clinical use but rather for large epidemiological studies.
Global Quality of Life
Global quality of life is an emergent and idiosyncratic construct, i.e., it is perceived as more than a compilation of different subcomponents. Thus, overall quality of life is typically measured with a single item, which entails considerable measurement error. However, in particular situations, large discrepancies between the HRQL assessment and the global quality of life score may add valuable information for the clinical encounter.
One of the theoretically best grounded global quality of life assessments is the Anamnestic Comparative Self-Assessment (ACSA). The ACSA is a visual analog scale (VAS), but unlike similar tools, both extremes are anchored to individual reference points (the best and worst times in the patient’s life). We believe avoiding normative anchoring (e.g., comparing to others) is the most consistent approach to assessing this construct. Although it is a highly abstract measurement, the ACSA has been shown to be effective for the assessment of tinnitus patients (Kamalski et al. 2010; Mazurek et al. 2006). It has been shown that in many cases, tinnitus patients experience a considerable limitation in their quality of life. Global quality of life is measured by means of an open question, such as “What is your current quality of life compared to the most beautiful and the worst time in your life?”. Factors such as social desirability or rehabilitation, which result from comparative processes, are circumvented because of the individual reference in the test. The coding is numeric. Validity was examined on the basis of a pilot study with cancer patients (Bernheim 1986). The scale achieves high interrater and retest reliability (Ledure et al. 1981).
There are several parameters which have been shown (or are supposed) to influence the development of tinnitus symptoms or the success of tinnitus treatment. Within the scope of this chapter, only a few domains can be mentioned, namely those that are comparatively easy to assess with patient self-assessments or with standardized clinical interviews.
Many studies have shown a close correlation between the subjective tinnitus exposure and psychological comorbidity (Hiller and Goebel 1999; Langguth et al. 2007; Weber et al. 2008). Affective disorders such as depression and dysthymia, somatization, anxiety, panic, and obsessive-compulsive disorder are both comorbidities and operative factors in the rehabilitation or habituation process (Andersson and Westin 2008; Stobik et al. 2005; Hesse 2008; Schaaf and Gieler 2010).
The use of self-report measures for case identification, severity assessment, and treatment monitoring of depression has been advocated by a growing number of practice guidelines for different chronic conditions. The benefit of self-report questionnaires is that they may help to identify mental comorbidities in busy practice settings without much effort for the healthcare provider. In general, such screening tools are useful for mental health disorders with a higher prevalence.
There are dozens of well-validated, self-report questionnaires available for depression screening in clinical care. They differ from each other with respect to their theoretical background, or content, but—unfortunately—also with respect to their screening results (Thombs et al. 2008; Cameron et al. 2008; Kendrick et al. 2009). Whereas some instruments favor measurement precision, or range, others focus on respondent burden (Kroenke 2001; Kroenke et al. 2003). For example, the updated National Clinical Practice guideline (2009) of the National Institute for Health and Clinical Excellence (NICE) recommends the use of two questions about depressed mood and anhedonia in the past month for case finding of depression in primary care patients and patients with physical illnesses (National Collaborating Centre for Mental Health 2010a, b; Whooley et al. 1997; Mitchell and Coyne 2007).
One of the instruments with the most favorable screening properties is the PHQ-9. It was developed to capture the nine key aspects of depressive disorders as defined in the DSM-V or ICD-10 classification system with one item each. The PHQ-9 self-rating has recently also been recommended for depression measurement by the International Consortium for Health Outcomes Measurement (www.ichom.org).
Another commonly used tool is the Center for Epidemiologic Studies Depression Scale (CES-D; in Germany, ADS). Unfortunately, several different versions of this scale are in use. However, all of them show favorable psychometric characteristics. Unlike the PHQ-9, the CES-D was constructed to measure a more purely defined depression construct. Thus, items assessing physiological symptoms of depression, e.g., loss of appetite, which are included in the PHQ-9, are missing. The long version of the ADS (Hautzinger and Bailer 1993; Fuhr et al. 2016) contains 20 items. The total score across all responses can vary between 0 and 60 points and serves as a summary measure of depressive symptoms. An increased ADS score (>23 points) indicates a depressive disorder. Responses are given on a four-step response scale (Fig. 7.2). The ADS has already been used in the treatment evaluation of chronic tinnitus to determine the presence of depressive symptoms (Seydel et al. 2010). The ADS is a valid and reliable measuring instrument, whose long form has an internal consistency of r = 0.89. To our knowledge, longer instruments that have also been used frequently, like the BDI (Titov et al. 2011), have not shown a clear added benefit for use with tinnitus patients over the instruments just mentioned.
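The sum scoring of the ADS long form described above can be sketched as follows. The sketch assumes that each of the 20 items is coded 0 to 3 (consistent with a four-step response scale and a 0 to 60 total) and that any reverse-keyed items have already been recoded before scoring. The function name is hypothetical and does not reproduce the official scoring manual.

```python
ADS_CUTOFF = 23  # per the text, totals above 23 points are considered increased


def score_ads_long_form(item_values):
    """Sum 20 item values (each 0-3, already recoded) and flag totals above the cutoff."""
    if len(item_values) != 20:
        raise ValueError("expected 20 item values")
    if any(v not in (0, 1, 2, 3) for v in item_values):
        raise ValueError("each item value must be 0, 1, 2, or 3")
    total = sum(item_values)
    return total, total > ADS_CUTOFF


# Hypothetical response pattern: twelve items answered 1, eight items answered 2.
responses = [1] * 12 + [2] * 8
total, elevated = score_ads_long_form(responses)
print(total, elevated)  # 28 True
```

A flagged score is a screening result only and would, as discussed in this section, still need to be confirmed by a clinical assessment.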
The state of the art for the assessment of present mental health conditions is the structured diagnostic interview, such as the Composite International Diagnostic Interview (CIDI) (Wittchen 1994). Such interviews require a trained interviewer, and thus clinical practice settings will only rarely have the resources available to use standardized diagnostic interviews for all their patients. However, they are very useful tools to confirm a screening result from a patient self-reported assessment, like the ones mentioned above. We have used the CIDI and identified depressive disorders in 37%, anxiety disorders in 32%, and somatoform disorders in 27% of our tinnitus patients (Zirke et al. 2013).
Patients with severe (decompensated) tinnitus-induced distress had significantly more affective and anxiety disorders than patients with compensated tinnitus.
Perceived Stress
One of the most important self-assessments, next to the assessment of the tinnitus symptoms themselves, is the assessment of perceived stress. Several instruments have been shown to be effective in patients with tinnitus. The most commonly used instruments are presented and discussed in Chap. 9.
Although theoretically “stress” has to be seen as a determinant for the development or perception of tinnitus symptoms, clearly, it is also one of the treatment targets. Thus, monitoring of perceived stress scores also helps to determine the efficiency of our treatment attempts in clinical practice. Figure 7.3 shows that within a sample of 192 patients receiving a structured, 7-day inpatient treatment, stress levels and the depressive symptoms decrease in parallel to tinnitus-induced distress.
Example of treatment-related changes in tinnitus patients measured by self-rating instruments for tinnitus (a) (TQ) (Goebel and Hiller 1994), stress (b) (PSQ) (Fliege et al. 2005), and depression (c) (ADS) (Hautzinger and Bailer 1993). Filled squares – patients with disturbing (non-compensated) tinnitus, open circles – patients with non-disturbing (compensated) tinnitus; T0 – study onset, T1 – 7 days after therapy onset, T3 – 3 months after therapy, T4 – 12 months after therapy
There are a number of personality traits which interfere with the mental health status of the patient, as well as with their ability to cope with the occurrence or persistence of tinnitus. Among them, we find that the assessment of the patients’ self-efficacy expectation and optimism may add important information to the constructs mentioned before. However, if respondent burden is a limiting factor, those additional assessments may be omitted.
One simple tool to assess the self-efficacy expectation and coping resources is the SWOP. The SWOP measures self-efficacy, optimism, and pessimism as independent scales and represents a further development of the SWO (Scholler et al. 1999). It comprises a total of nine items, of which five items record the “self-efficacy” of a person and two items each record “optimism” and “pessimism.” Scoring is carried out by calculating mean values for the individual scales. Self-efficacy and optimism are meaningful parameters for the evaluation of therapy. In various clinical studies, the self-efficacy expectation with regard to health behavior and pain management has been demonstrated to be a significant influencing factor (Buckelew et al. 1994; Litt et al. 1993). An investigation by Sirois et al. (2006) showed that coping with and adjusting to tinnitus could be achieved more readily, even under rather severe tinnitus distress, if the patient had a higher self-efficacy expectation. The test quality criteria prove to be satisfactory.
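The scale-wise mean scoring of the SWOP can be sketched as follows. The assignment of item positions to the three scales is a hypothetical placeholder, since the published item order is not given here, and the response values in the example are purely illustrative.

```python
from statistics import mean

# Hypothetical assignment of the nine items (positions 0-8) to the three scales;
# the published item order of the SWOP may differ.
SCALES = {
    "self_efficacy": [0, 1, 2, 3, 4],  # five items
    "optimism": [5, 6],                # two items
    "pessimism": [7, 8],               # two items
}


def score_swop(responses):
    """Return the mean response per scale for a list of nine item responses."""
    if len(responses) != 9:
        raise ValueError("expected nine item responses")
    return {scale: mean(responses[i] for i in items) for scale, items in SCALES.items()}


# Illustrative responses only.
scores = score_swop([3, 4, 3, 2, 3, 4, 3, 1, 2])
print(scores)
```

Because the three scales are scored independently, optimism and pessimism are reported as separate values rather than being collapsed into one bipolar score.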
A related construct which we find informative is the assessment of the sense of coherence as one potential stress buffer.
The SOC-L9 is based on Antonovsky’s model of salutogenesis (1993, 1995), which focuses on factors that keep people healthy despite stress and strain. It describes a relationship between health, stress, and coping. The SOC-L9 is a short form with nine items (Antonovsky 1995; Scholler et al. 1999) and measures the sense of coherence, which is defined as a global orientation that reflects the extent to which a person has a pervasive, enduring, yet dynamic feeling of confidence that their own inner and outer environment is predictable and that it is very likely that things will develop as well as one might reasonably expect.
The measured items can be grouped into the three scales of “comprehensibility” (two items), “manageability” (three items), and “meaningfulness” (four items), which are considered subcomponents of the sense of coherence (Schneider et al. 2004). A raw sum score is calculated across all items, which can be compared with the mean value of a standard sample. The questionnaire has a sufficiently high degree of reliability and validity. In a statistical evaluation of the SOC-L9 on a representative sample of the German population (n = 2005), Schumacher et al. (2000) determined item discrimination coefficients between r = 0.56 and r = 0.68, which correspond exactly to the calculated values. Internal consistency (Cronbach’s α = 0.87) can be considered good. The norm values obtained by Schumacher et al. (2000) from a representative German survey are similar to the results of Hannover et al. (2004). Investigations by Söderman et al. (2001, 2002) reported that the SOC had an influence on HADS results in patients with Ménière’s disease and tinnitus and proved to be a relevant influencing factor of psychosocial dimensions. There was a high correlation between the SOC and quality of life. This questionnaire is particularly useful when choosing the individual design of therapeutic measures.
7.3 The Next Generation of Instruments
Clinical assessments of PROs typically call for short, yet very precise measurements. Thus, the psychometric requirements for instruments used for this purpose are high. Among these, measurement precision is of crucial importance, because small changes over time must be interpreted in the presence of a vast array of sources of variance that can hardly be controlled under real-life conditions.
Today, many validated outcome tools are available (McDowell et al. 2004) and allow for increasingly specific assessment of a range of domains related to the health and well-being of tinnitus patients. However, the use of these tools has important limitations. One is that none of the tools mentioned above and in the following chapters were developed for clinical practice. Thus, the most precise and comprehensive questionnaires are rather lengthy and complex, leading to a level of respondent burden that hampers their use in clinical routines.
An additional major limitation has been that results from different questionnaires are difficult to compare, even when two similar instruments are used to assess the same outcomes. The situation is as if body temperatures assessed in different settings were not comparable with one another but were dependent on the particular thermometer used (Ware 1993, 2008). To make the measurement of psychological constructs more similar to biomedical ones, a standardized, efficient approach for a variety of applications including clinical practice and clinical trial research needs to be developed, so that results can be compared across conditions, therapies, trials, and patients.
7.3.1 Common Metric
The use of the so-called item response theory (IRT) for the development of PRO tools provides a solution to many of the limitations of existing instruments. IRT methods were developed more than four decades ago (Lord 1965; Rasch 1966), and numerous attempts have been made to exploit their potential (Bech et al. 1978; Fisher 1993). Today, IRT-based tests are well established in the educational field (Haley et al. 2004; Anonyms 1988) but have just been widely introduced into health care during the past decade (Ware et al. 2003; Cella and Chang 2000; Ware et al. 2000; Bjorner et al. 2003a).
Like factor analysis, IRT models assume that the measured construct is a latent variable, referred to as the IRT score, theta, or θ, which cannot be observed directly but can be estimated from responses to different items measuring the construct. An IRT item bank consists of items measuring the same construct and a mathematical description of the items’ measurement properties (Bjorner et al. 2003b). The IRT model (Martin et al. 2007; Fischl and Fisher 2007) describes the probability of choosing each response on a questionnaire item as a function of theta (Embretson 2000, 2006). One important distinction between IRT methods and classical test theory methods is that theta can be estimated from the responses to any subset of items in the bank (Bjorner et al. 2003b). Accordingly, researchers or clinicians can select the items that are most relevant for a given group or an individual patient and score the responses on one common metric that is independent of the choice of items. If the item bank contains items from established questionnaires, the scores of these questionnaires can be predicted from estimates of theta even if the questionnaires themselves have not been used. Thus, comparisons of results from different questionnaires are expected to be facilitated by the introduction of comprehensive IRT item banks (Bjorner et al. 2003c). A few such common metrics are already available for key health constructs, like depression, and some websites can help to report theta scores or corresponding scores of similar tools (www.common-metric.org, www.prosettastone.org). Figure 7.4 shows an easy-to-use lookup table.
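The idea that theta can be estimated from any subset of items in a bank can be illustrated with a small sketch. For simplicity it assumes a two-parameter logistic model for yes/no items, whereas PRO item banks often use models for several ordered response categories. The item parameters are invented for illustration, and theta is estimated as the posterior mean under a standard normal prior evaluated on a grid (one of several possible estimators).

```python
import math

# Hypothetical item bank: item id -> (discrimination a, difficulty b). Values are invented.
ITEM_BANK = {
    "item_1": (1.8, -1.0),
    "item_2": (1.5, -0.3),
    "item_3": (2.0, 0.4),
    "item_4": (1.2, 1.1),
    "item_5": (1.6, 1.8),
}


def prob_endorse(theta, a, b):
    """Two-parameter logistic model: probability of endorsing an item given theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_theta(responses):
    """Posterior mean of theta on a grid; responses maps item id -> 0 or 1."""
    grid = [-4.0 + 0.1 * k for k in range(81)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized standard normal prior
        for item_id, answer in responses.items():
            a, b = ITEM_BANK[item_id]
            p = prob_endorse(theta, a, b)
            w *= p if answer == 1 else 1.0 - p  # likelihood of the observed answer
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total


# Two different item subsets are scored on the same theta metric.
theta_a = estimate_theta({"item_1": 1, "item_2": 1, "item_3": 0})
theta_b = estimate_theta({"item_2": 1, "item_4": 0, "item_5": 0})
print(round(theta_a, 2), round(theta_b, 2))
```

Both estimates are expressed on the same scale although the two respondents answered different items, which is the property that makes a common metric across questionnaires possible.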
7.3.2 Individually Tailored Tests
This new generation of PRO tools also promises to provide very short but still reliable assessments (Haley et al. 2004; Ware et al. 2003; Embretson 2000, 2006; Bjorner et al. 2003c; Revicki and Cella 1997). The goal of the so-called computerized adaptive test (CAT) is to select and administer only the most informative items from an IRT item bank for every individual patient according to her or his estimated theta value. After each item has been administered, an IRT score is reestimated to choose and apply the next best suited item for the current score estimate. By omitting irrelevant, uninformative items, higher measurement precision is achieved, while at the same time, respondent burden can be controlled (Cella and Chang 2000; Hambleton 2006; Hays et al. 2000). CATs generally use two different ways to end the assessment (“stopping rules”): the CAT either stops after a predefined measurement precision (confidence interval) has been achieved or after a predefined total number of items have been administered.
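The adaptive procedure described above can be sketched as a short loop. The sketch assumes a two-parameter logistic model for yes/no items, an invented item bank, item selection by maximum Fisher information at the current theta estimate, and the posterior standard deviation as the precision measure. All of these are illustrative choices rather than the specification of any particular CAT system.

```python
import math

# Hypothetical item bank: item id -> (discrimination a, difficulty b). Values are invented.
ITEM_BANK = {
    "item_1": (1.8, -1.0),
    "item_2": (1.5, -0.3),
    "item_3": (2.0, 0.4),
    "item_4": (1.2, 1.1),
    "item_5": (1.6, 1.8),
}
GRID = [-4.0 + 0.1 * k for k in range(81)]


def prob_endorse(theta, a, b):
    """Two-parameter logistic model: probability of endorsing an item given theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def posterior_mean_sd(responses):
    """Posterior mean and SD of theta under a standard normal prior, evaluated on a grid."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)
        for item_id, answer in responses.items():
            a, b = ITEM_BANK[item_id]
            p = prob_endorse(theta, a, b)
            w *= p if answer == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def item_information(theta, a, b):
    """Fisher information of a two-parameter logistic item at theta."""
    p = prob_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


def run_cat(answer_fn, max_items=4, target_sd=0.4):
    """Administer items until the precision target or the item limit is reached."""
    responses = {}
    theta, sd = posterior_mean_sd(responses)
    while len(responses) < max_items and sd > target_sd:  # the two stopping rules
        remaining = [i for i in ITEM_BANK if i not in responses]
        if not remaining:
            break
        # Choose the unused item that is most informative at the current estimate.
        next_item = max(remaining, key=lambda i: item_information(theta, *ITEM_BANK[i]))
        responses[next_item] = answer_fn(next_item)
        theta, sd = posterior_mean_sd(responses)  # re-estimate after every answer
    return theta, sd, list(responses)


# Simulated respondent who endorses only items with difficulty below 0.5.
theta, sd, administered = run_cat(lambda item_id: 1 if ITEM_BANK[item_id][1] < 0.5 else 0)
print(administered, round(theta, 2), round(sd, 2))
```

Lowering `target_sd` favors precision at the cost of more items, while lowering `max_items` caps respondent burden, which mirrors the trade-off between the two stopping rules.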
7.3.3 Assessment Across Different Settings
Whereas clinical researchers can limit their research questions to a specific setting, clinical practice applications need to be embedded within the healthcare delivery system to be useful. To connect different assessment points, one could imagine, for instance, that future patients might be asked to assess their most relevant symptoms regularly at home, e.g., on a smartphone (Fig. 7.5). If self-reported health status declines, a case manager could be automatically alerted to call the patients for further evaluation and may send them to the doctor’s office. At the office, a second test might be used to confirm the home assessment. If the patients are transferred to other settings, more comprehensive health assessments also could be applied. Each setting calls for the use of different tools. At the patients’ home, practicality and low respondent burden will be the priorities, and lower measurement precision generally will be acceptable, whereas in a clinical setting, more comprehensive tests will be favored. Ideally, all instruments used in this chain of healthcare delivery would be scalable on the same metric as described above.
Once the decision has been made to collect one or more PROs in clinical practice, and the intent and application of that collection have been considered, the attention needs to turn to the logistics of collecting them, so that they indeed serve the purpose that was intended. The logistical issues that require particularly careful consideration and implementation are (1) the methodology of collecting PROs itself and (2) the support that will be needed for the collection of data in clinical practice.
7.4.1 Mode of Assessment
There are several methods of administering standardized questionnaires to collect PROs in clinical practice. The traditional methods are face-to-face interviews and paper-and-pencil completion of questionnaires. More commonly, computerized methodologies are used, as they provide data entry, administration, analysis, and printout of data in real time, making assessment results immediately available for the clinical encounter. Computerized technologies include traditional personal computers (PCs) (which may be situated, e.g., in the clinic waiting room), mobile tablet PCs, handheld computers, smartphones, or personal digital assistants (PDAs) given to patients for one-time use or to keep for the duration of the evaluation (e.g., over an entire treatment or follow-up period). A variation of the computer method is the use of SmartPen technology. Instead of typing or pressing buttons, a pen-like device is used to check boxes on an individually printed paper questionnaire, which uniquely identifies the patient as well as the assessment. From the users’ perspective, this assessment mode resembles traditional paper-and-pencil methods while retaining the advantages of computerized data entry.
Whereas the technologies mentioned above are primarily used for assessment within a clinical practice setting, a new field of PRO assessment for individual case management is emerging with the assessment and monitoring of patients’ subjective health status at home; this requires different technical solutions. The telephone is still the most common way to collect PRO data at the patients’ home, either via traditional interview (i.e., a person posing questions to the patient over the telephone) or via automated telephone interviews, using interactive voice recognition (IVR) or pressing a number to select answers to questions (Fig. 7.6).
Generally, most studies comparing paper-and-pencil and computerized administration modes (PDA, online, pen, tablet, touch screen) suggest psychometric equivalence between the two modes of administration (Bettinville et al. 2005; Folk et al. 2006; Norman et al. 2010; Webb et al. 1999; Heuser and Geissner 1998; Velikova et al. 1999; Cook et al. 2004; Schaeren et al. 2005; Bliven et al. 2001; Kleinman et al. 2001; Ryan et al. 2002; Saleh et al. 2002; Wilson et al. 2002), but some studies also report differences (Beebe et al. 2006; DeAngelis 2000). The literature on mode effects between paper-and-pencil and telephone administration is more heterogeneous. Some studies suggest no mode effects (Duncan et al. 2005; Hepner et al. 2005; de Vries et al. 2005); others report and account for them (Powers et al. 2005; Beebe et al. 2005; Kraus and Augustin 2001). Literature on mode effects using IVR technology is rare, probably due to the novelty of IVR; one large-scale study reports IVR mode effects (Rodriguez et al. 2006) and suggests adjusting for them. A significant limitation is that many studies were underpowered to detect small but clinically meaningful differences.