Charlotte Eye Ear Nose & Throat Associates, Charlotte, NC, USA
Abstract
The most important purpose of 4-aminoquinoline retinopathy (4AQR) screening, detecting overdosing, requires no ancillary testing. However, a subsidiary purpose, detecting the rare occurrence of 4AQR among properly dosed patients, requires ancillary testing, because clinical examination is insensitive. In Bayesian reasoning, the sensitivity and specificity of an ancillary test interact with the estimated prior probability of disease to determine the estimated posterior probability of the disease. For a condition of low prior probability a single test is rarely dispositive for management. Instead, the common role for each test is to adjust the clinician’s estimate of the probability of 4AQR before applying another test, until a level of certainty is achieved indicating that the 4-aminoquinoline (4AQ) will be stopped or continued. Testing with 10-2 visual fields, multifocal electroretinography, spectral domain optical coherence tomography (SD-OCT), and fundus autofluorescence is useful for this purpose when applied selectively rather than universally. SD-OCT is especially helpful because it has the lowest test variability, the highest specificity, and similar sensitivity to the other tests.
Abbreviations
4AQs
4-Aminoquinolines (chloroquine and hydroxychloroquine)
4AQR
4-Aminoquinoline retinopathy
asb
Apostilb
C
Chloroquine
COR
Coefficient of repeatability
COV
Coefficient of variation
dB
Decibel
DTL
Dawson–Trick–Litzkow electrodes
ERG
Electroretinogram
FA
Fluorescein angiography
FDP
Frequency doubling perimetry
HC
Hydroxychloroquine
ICC
Intraclass correlation coefficient
ISCEV
International Society for Clinical Electrophysiology and Vision
MD
Mean defect
mfERG
Multifocal electroretinography
nm
Nanometer
RNFL
Retinal nerve fiber layer
SAP
Standard automated perimetry
S
Standard deviation of a sample of a normally distributed variable
SD-OCT
Spectral domain optical coherence tomography
SITA
Swedish Interactive Treatment Algorithm
TD-OCT
Time domain optical coherence tomography
X̄
Mean of a sample of a normally distributed variable
The most important function of screening patients taking hydroxychloroquine or chloroquine is to detect overdosing, which can be corrected in almost all cases. Of secondary importance is detecting retinopathy that may occur despite appropriate dosing. This function is less important because there is no guarantee that retinopathy, if detected, can be reversed (see Chap. 6). Nevertheless, the probability of reversing toxicity is higher if the condition is detected earlier.
Unfortunately, the signs of 4-aminoquinoline retinopathy (4AQR) detectable on routine clinical examination occur late in its progression. Visual acuity is not a useful variable in screening for antimalarial retinopathy if it is normal. One review reported sensitivity and specificity of a visual acuity criterion for detection of 4AQR of 22 % and 77 %, respectively [1]. There are many patients with advanced 4AQR with dense paracentral scotomas, yet normal visual acuity (see Chap. 6) [2]. Neither is funduscopy sensitive for 4AQR. The bull’s-eye lesion of 4AQR spares the fovea until late, and earlier signs are nebulous. Detecting the earliest funduscopic change of 4AQR is dependent on the examiner’s skill and experience, which varies [3–5].
For these reasons, ancillary testing is important [6]. Threshold perimetry (e.g., 10-2 visual field testing [10-2 VF]), multifocal electroretinography (mfERG), spectral domain optical coherence tomography (SD-OCT), and fundus autofluorescence imaging (FAF) can reveal abnormalities of macular function and structure. These abnormalities influence the frequency of follow-up when inconclusive, and lead to cessation of the drug when conclusive. Familiarity with ancillary tests and their limitations helps in the optimal management of patients taking 4AQs.
Many ancillary tests have been proposed as useful for detecting 4AQR but have been eventually discarded, generally for one main reason and several lesser ones [7]. The most important reason is that the results of testing are too variable [6, 8–11]. Among currently popular tests, the mfERG has this drawback (Fig. 8.1). In 4AQR it takes a large change in the value of the measured variable to discern a disease-induced change. This makes serial tests over time difficult to interpret [7].
Fig. 8.1
Multifocal electroretinograms (mfERGs) on three consecutive visits in a 64-year-old woman with rheumatoid arthritis who was placed on hydroxychloroquine in 2007. She was 67 in. tall and weighed 185 lb. She had always taken 400 mg/day. Her best corrected visual acuity was 20/25 in both eyes secondary to early nuclear sclerotic cataracts. Her maculas at baseline were normal. She had no renal or liver disease. (a) Yearly 10-2 visual field testing with a III, red test object was normal for six consecutive years (this example was from 15 August 2013). When the American Academy of Ophthalmology revised guidelines were published, mfERG testing was begun. (b) mfERGs for three consecutive years shown in the retinal view (the left record in each study is from the right eye of the patient). Because of reductions in amplitudes noted in 2013, the question arose whether she had toxicity and needed to be taken off hydroxychloroquine. For example, note that the R1 amplitude for the right eye was 27.6 nV/mm² on 15 August 2011 and 31.3 nV/mm² on 20 August 2012 (red circled cells). Compare this to the value of 16.2 nV/mm² (red circled cell) in the study from 19 August 2013, a decrease of 41 %. Similar behavior is shown in the topographic surface map with a decreased amplitude shown in the last study (yellow arrows). The screening ophthalmologist worried that hydroxychloroquine retinopathy was responsible for the decreases. However, this change probably represents measurement variability. The coefficient of repeatability (COR) for R1 is 60 %. The 41 % decrease in this case lies well within this COR.
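The comparison made in the caption of Fig. 8.1 can be sketched in a few lines. The amplitudes and the COR are the values quoted in the caption; the function and variable names are illustrative only, and the simple rule "a change smaller than the COR is treated as possible noise" is the reading taken here.

```python
# Values quoted in the caption of Fig. 8.1 (right eye, ring R1, nV/mm^2).
baseline_amplitude = 27.6   # 15 August 2011
followup_amplitude = 16.2   # 19 August 2013
cor_percent = 60.0          # coefficient of repeatability quoted for R1


def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


decrease = percent_decrease(baseline_amplitude, followup_amplitude)

# Under the reading used here, only a change larger than the COR
# would be taken as exceeding measurement variability.
exceeds_cor = decrease > cor_percent

print(f"decrease = {decrease:.1f} %, exceeds COR: {exceeds_cor}")
```

With these inputs the decrease is about 41 %, which is below the 60 % COR, so the change cannot be distinguished from test variability.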
Among lesser reasons for discarding ancillary tests, the first is that the test may be too sensitive [12]. For example, in a study of contrast sensitivity testing in patients taking hydroxychloroquine, 44.4 % of patients taking 200 mg/day for 1–9 years had abnormalities on contrast sensitivity testing [13]. A test that suggests an abnormality in such a high percentage of cases when the preponderance of other evidence suggests no clinically important retinopathy is not useful.
An ancillary test may show large overlap between normal subjects and patients with retinopathy, which makes it difficult to discriminate cases. This is a problem with red Amsler grid testing, color vision testing, electrooculography (EOG), and mfERG [3, 5, 6, 14, 15].
On the other hand, a test may be too insensitive. Loss of foveal reflex, fluorescein angiography, home Amsler grid testing, and global electroretinography (ERG) are examples of such tests [6, 16–19]. Many patients with reproducible 10-2 ring scotomas have crisp foveal reflexes, normal Amsler grids, and normal global ERGs rendering these tests of little use for screening purposes [6].
Some tests, such as macular photostress testing, are not standardized [20]. Performance statistics obtained by one investigator mean little to others because the test is done differently by different clinicians.
Other tests are not specific for the condition of interest. For example, 10 of 758 consecutive patients taking hydroxychloroquine and screened for retinopathy had positive red Amsler grid testing, but none of the 10 had retinopathy [21]. Every positive result was a false positive, an obviously unacceptable performance.
Lastly, some tests are not reimbursed by health care payers. Examples include contrast sensitivity testing and macular photostress testing. Lack of reimbursement is as strong a disincentive for test adoption as reimbursement is an incentive. For one or more of these reasons, all of the following tests have been embraced and subsequently discarded: color fundus photography, fluorescein angiography, dark adaptometry, global electroretinography, electrooculography, macular photostress testing, Amsler grid testing, tangent screen testing, and color vision testing [6, 7, 22–24].
Interpretation of ancillary testing is typically not scientific, always involves assumptions, and is often controversial. It is important to identify the underlying assumptions and definitions of retinopathy used by the interpreter to be able to compare interpretations. Easterbrook co-wrote the American Academy of Ophthalmology (AAO) guidelines for 4AQ screening in 2002, when Amsler grid testing was advocated [25]. No contradictory evidence regarding the performance characteristics of the Amsler grid was published between 2002 and 2011, when the guidelines were rewritten by a different set of authors without Easterbrook [26]. The new guidelines dropped the Amsler grid recommendation. The facts had not changed, but the interpretation had.
Only standard automated perimetry (SAP) has withstood 30 years of use as an ancillary test. Although widely adopted by clinicians, its performance has not been quantitatively evaluated in patients taking 4-aminoquinolines (4AQs). This test is covered in detail in Sect. 8.4. In the past decade mfERG, SD-OCT, and FAF have been introduced and lauded [26]. They have been praised as more objective than 10-2 VF, but the claim is dubious for mfERG and FAF [26]. Although in the United States one or more of them is currently recommended for universal application where available [26], their ultimate niche is more likely to be selective [27].
Ancillary testing is expensive. Because financial resources for health care are scarce, ophthalmologists need to judge whether a test adds sufficient value to the care of a patient to make obtaining it worthwhile. There can be a conflict of interest in fee-for-service systems of health care. An ophthalmologist or optometrist can profit by ordering more ancillary tests depending on the practice setting. Therefore, the topic is not only important to discuss, but also potentially inflammatory [28].
This chapter covers general principles of diagnostic testing and applies those concepts to their use in screening for 4AQR. Commonly used abbreviations in this chapter are collected in “Abbreviations” for reference. Each term will be first used in its full form, along with its abbreviation.
8.1 Defining Normal
For any ancillary test, it is necessary to decide what is considered abnormal. For a normally distributed variable, a common choice is that the value being measured lies more than two standard deviations from the mean value of a control group of clinically normal persons [7, 9, 29]. For a non-normally distributed variable, it may be that the value lies beyond the 5th or 95th or 99th percentile of a control group, or outside the range of the control group [23, 30]. The definition chosen will affect the frequency with which measurements are labeled abnormal.
These definitions are probabilistic and simply state where the given patient lies relative to a normative database. They do not make a diagnosis of retinopathy [31]. The clinician needs to understand that they imply that a certain percentage of normal patients will be classified as abnormal by definition. For example, suppose that an abnormal test result is defined as one in which the value lies beyond the 95th percentile of a group of controls not taking the drug. If the prevalence of 4AQR among those taking the drug is 1 %, and one applies the ancillary test to a random sample of 100 persons taking the drug, then on average six tests will be abnormal. Of these six, on average only one will have 4AQR. The other five persons labeled by the test as abnormal will in fact be normal, but happen to have measurements by the ancillary test that put them in a range defined to be abnormal.
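The arithmetic behind this example can be sketched as follows. The sketch assumes, as the example implicitly does, that the test flags every person who truly has 4AQR (sensitivity of 100 %); the function name and parameters are illustrative.

```python
def expected_abnormal(n, prevalence, percentile_cutoff, sensitivity=1.0):
    """Expected true and false positives when 'abnormal' means a value
    beyond `percentile_cutoff` of a control group not taking the drug.

    Assumption: healthy persons taking the drug are distributed like the
    controls, so (100 - cutoff) % of them fall in the abnormal range.
    """
    diseased = n * prevalence
    healthy = n - diseased
    false_positive_fraction = 1.0 - percentile_cutoff / 100.0
    true_positives = sensitivity * diseased
    false_positives = false_positive_fraction * healthy
    return true_positives, false_positives


tp, fp = expected_abnormal(n=100, prevalence=0.01, percentile_cutoff=95)
total = tp + fp

print(f"true positives  = {tp:.2f}")
print(f"false positives = {fp:.2f}")
print(f"total abnormal  = {total:.2f} (about {round(total)})")
```

One true positive plus roughly five false positives gives the six abnormal results described above.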
The Normal Distribution
Many of the measurements made in ancillary testing for 4AQR either follow or are assumed to follow a normal distribution given by the formula
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$
where μ is the mean of the distribution, and σ is the standard deviation of the distribution.
Often a raw measurement is converted to a Z-score, defined by Z = (X − X̄)/S. In this case X is the raw measurement, X̄ is the mean of the sample measurements, and S is the standard deviation (SD) of the sample measurements. The Z-score shows how far a measurement is from the average value, expressed in units of standard deviations. The intervals within one, two, and three SDs of the mean encompass 68, 95, and 99.7 % of the measurements for the sample, respectively. An example of the use of Z-scores arises in considering height, which determines the ideal body weight (IBW), an important concept in studies concerning 4AQR (see Chap. 7).
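A minimal sketch of the Z-score and of the coverage figures quoted above, using only the Python standard library. The numbers passed to `z_score` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def z_score(x, sample_mean, sample_sd):
    """Distance of x from the sample mean in units of the sample SD."""
    return (x - sample_mean) / sample_sd


def coverage_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2.0))


# Hypothetical measurement: 12.0 against a sample mean of 10.0 and SD of 1.0.
example_z = z_score(12.0, 10.0, 1.0)

coverages = {k: 100.0 * coverage_within(k) for k in (1, 2, 3)}

print(f"Z = {example_z:.1f}")
for k, pct in coverages.items():
    print(f"within {k} SD: {pct:.1f} %")
```

The computed coverages (about 68.3, 95.4, and 99.7 %) are the unrounded counterparts of the familiar 68, 95, and 99.7 % figures.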
How Not to Define an Abnormal Ancillary Test
Legitimate methods for defining an abnormal ancillary test have been discussed. In all cases, the test subject’s value is compared to the distribution of values from normal subjects. Unfortunately, examples not to emulate have been published; studying one of these is instructive. For example, in a study of SD-OCT in 4AQR, the authors assert that the thickness of the outer nuclear layer is the most sensitive SD-OCT measurement to follow in detecting 4AQR [32]. As evidence they show a figure in which a scan from an eye with retinopathy is juxtaposed to two similar scans from eyes of normal subjects. They compare the thickness of the outer nuclear layer, delineated freehand, in two scans from a single patient with 4AQR to the mean ± SD values from seven normal eyes. The thickness of the outer nuclear layer for one of the scans was 46.7 μm compared to 59.8 ± 7.1 μm for the normal subjects’ scans [32]. The patient’s value was 13.1 μm less than the mean value of the normals or 13.1 μm/7.1 μm = 1.84 standard deviations less than the sample mean of the normal subjects. Such an occurrence will happen on average 5.2 % of the time even in normal eyes.
There are several problems with the authors’ approach:
They don’t define how their freehand method of delineating the outer nuclear layer was done. What were the landmarks? How reproducible was the method? No one can replicate the finding without these details.
The sample of normals is small (seven).
An unconventional cut-point for “abnormal” was used. A typical cut-point would be 2 standard deviations, not 1.84 standard deviations.
The authors cherry-pick individual scans from the patient said to have 4AQR to compare to the normal controls. One wonders what a comparison of all the scans in the raster would show. The authors reported that throughout the raster the thickness of the ONL of the affected patient was reduced compared to the normal subjects, but the size of the reductions was omitted. The reductions may have been inconsequential.
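The cut-point objection in the list above can be checked directly from the published thickness values. The sketch below only recomputes the Z-score and compares it with the conventional two-SD criterion; the variable names are illustrative.

```python
# Outer nuclear layer thickness values quoted from the study (micrometers).
patient_onl_um = 46.7
normal_mean_um = 59.8
normal_sd_um = 7.1          # SD of the seven normal eyes

# Z-score of the patient's scan relative to the normal sample.
z = (patient_onl_um - normal_mean_um) / normal_sd_um

# Conventional criterion: abnormal if more than 2 SDs from the mean.
conventional_cut = 2.0
is_abnormal = abs(z) > conventional_cut

print(f"Z = {z:.3f}; abnormal by the 2 SD criterion: {is_abnormal}")
```

The patient's value lies a little more than 1.8 SDs below the mean of the normal eyes, so it would not be called abnormal under the usual two-SD definition.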
8.2 Principles Common to Ancillary Tests Used in Screening for 4-Aminoquinoline Retinopathy
To use an ancillary test for drug toxicity effectively, several pieces of information must be known:
The values of the test in a normal population
The values in the diseased population not taking the drug—for example, patients with rheumatoid arthritis (RA) and systemic lupus erythematosus (SLE)
The intraindividual variability of the test
A baseline value for the test in the patient before the drug is begun
The performance characteristics of the test: sensitivity and specificity
In the case of chloroquine and hydroxychloroquine, the baseline test is frequently taken after the drug has already been started, which is less than satisfactory [34]. A tacit assumption has been made that the drug, in the absence of toxicity, has not changed the value of the test in the patient. This assumption may not be true. For example, for the mfERG, some investigators have claimed that taking hydroxychloroquine does change the values of the test measurements [35]. They distinguish this effect from true toxicity, by which they mean irreversible changes [36]. This also appears to be the case with the macular photostress test and the visual evoked potential (VEP) [20, 35–37].
Many studies on ancillary testing use normal subjects as a control group [38, 39]. This is flawed because the possibility exists that the diseases for which these drugs are used produce changes in the ancillary test independent of using the drug [9, 38]. For example, electrooculogram (EOG) values are affected by rheumatologic diseases [8, 9]. Intraindividual variability in normal subjects and patients with rheumatoid arthritis differs. The coefficient of variation (COV) is 10 % in the first group, but 15 % in the second [9]. Thus, a 30 % decrease in the Arden ratio would be necessary before one could conclude with 95 % confidence that the decrease represented a real change in the test as a result of taking a 4AQ and was not noise in the measurement (see Sect. 8.3) [9]. On the other hand, retinal nerve fiber layer (RNFL) thickness did not differ between patients with rheumatologic disease not taking aminoquinolines and normal subjects [40]. In the absence of a proper control group, evidence of a dose–response relationship between 4AQs and the test variable of interest can be helpful in increasing the probability that a purported effect is real [38].
The value of a diagnostic test is determined by its sensitivity, specificity, reproducibility, and cost. The first two characteristics, known as the performance characteristics of the test, are defined by referring to a 2 × 2 table that displays the true health status of the patient compared to the status as defined by the test (Fig. 8.2) [33, p. 891]. The usefulness of an ancillary test depends on the prevalence of the disease (see Chap. 5) as well as the sensitivity and specificity of the test. If a disease is rare, then there will be a number of patients who test positive for the disease but do not have it (false positives). The concepts of positive predictive value (PPV) and negative predictive value (NPV) capture the importance of disease prevalence in ancillary testing. The definitions of these terms and some corollaries follow.
Sensitivity: The proportion of truly diseased patients deemed so by the test. In the symbols of Fig. 8.2, sensitivity equals a/(a + c). Sensitivity is most important in screening for disease, because a clinician does not want to say mistakenly that a diseased patient is healthy. Therefore, high sensitivity in a test is desirable [41].
Fig. 8.2
2 × 2 table used in the definition of sensitivity and specificity
Specificity: The proportion of truly nondiseased patients deemed so by the test. In the symbols of Fig. 8.2, specificity equals d/(b + d). Specificity is most important in making a decision about beginning or stopping treatment, because a clinician does not want to risk side effects caused by treatment, or risk losing the benefit of effective treatment, based on an erroneously positive test. Therefore, high specificity in a test is desirable [41].
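Positive predictive value (PPV): The probability that a patient has the disease in question if the ancillary test is positive; the equation for calculating this probability is
$$\mathrm{PPV} = \frac{S_n \times P}{S_n \times P + (1 - S_p)(1 - P)}$$
where $S_n$ = sensitivity, $S_p$ = specificity, and $P$ = prevalence.
Negative predictive value (NPV): The probability that a patient does not have the disease in question if the ancillary test is negative; the equation for calculating this probability is
$$\mathrm{NPV} = \frac{S_p (1 - P)}{S_p (1 - P) + (1 - S_n) P}$$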
Likelihood ratio: The ratio of the probability that a particular test result would occur in a patient with the disease to the probability that the result would occur in a patient without the disease [31]. Likelihood ratios come in two varieties, positive and negative. The positive likelihood ratio is defined as sensitivity/(1 − specificity). For example, the sensitivity and specificity of 10-2 VF testing in patients with 4AQR have been reported to be 85.7 % and 92.5 %, respectively [42]. The positive likelihood ratio is therefore 0.857/0.075, or 11.4. In other words, an abnormal 10-2 VF is 11.4 times more likely in a patient with 4AQR than in a patient taking a 4AQ without retinopathy [31].
Odds: The ratio of the probability of having the disease to the probability of not having the disease. The probability of having the disease is usually estimated by the prevalence of the disease. In some cases, other information allows a more refined estimate of the probability of having the disease than the prevalence by Bayesian inference (see Sect. 8.4). The prevalence of hydroxychloroquine retinopathy in nonoverdosed patients after 6 years of therapy has been reported to be 0.5 % [43]. Therefore, the odds is 0.005/0.995, or approximately 0.005. As exemplified in this case, for small values of prevalence, the odds is approximately the prevalence. The odds is typically figured before and after an ancillary test. The pretest odds is based on the prevalence. The posttest odds is based on the pretest odds modified by the results of the ancillary test. Specifically, after a positive test result, posttest odds equals pretest odds times the positive likelihood ratio [31]. If the odds of having the disease is known, the probability of having the disease can be calculated as Odds/(Odds + 1) [31].
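The likelihood ratio and odds definitions above chain together into a single calculation. The following sketch combines the two worked values from the text (10-2 VF sensitivity and specificity, and a 0.5 % pretest prevalence); pairing them in one example is an illustration made here, and the function names are not from any library.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)


def odds_from_probability(p):
    """Odds = p / (1 - p)."""
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Probability = odds / (odds + 1)."""
    return odds / (odds + 1.0)


# Values quoted in the text.
lr_pos = positive_likelihood_ratio(0.857, 0.925)   # 10-2 VF testing
pretest_odds = odds_from_probability(0.005)        # 0.5 % prevalence

# After a positive test, posttest odds = pretest odds x LR+.
posttest_odds = pretest_odds * lr_pos
posttest_probability = probability_from_odds(posttest_odds)

print(f"LR+                  = {lr_pos:.1f}")
print(f"pretest odds         = {pretest_odds:.4f}")
print(f"posttest odds        = {posttest_odds:.4f}")
print(f"posttest probability = {100 * posttest_probability:.1f} %")
```

Under these inputs a single abnormal 10-2 VF raises the estimated probability of retinopathy from 0.5 % to a little over 5 %, which illustrates why one positive test seldom settles management when the prior probability is low.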
Defining the performance characteristics of an ancillary test implies that there is a gold standard against which the test can be compared, but in many cases there is no such standard. Instead, the gold standard may be the consensus of a panel of graders using some other method of assessment or a composite action such as discontinuation of therapy based on the totality of clinical evidence [21, 44]. Often, the gold standard is unreproducible. For example, the gold standard for mild retinopathy in some studies was presence of mild macular pigmentary changes alone, which will be dependent on the judgment of the examining clinician [34, 45]. In other cases the gold standard is another ancillary test, most often SAP [46–48]. The problem with this approach is that the new test has no chance of exceeding the test used as the gold standard in relative sensitivity or specificity. The best that the new test can do is match the performance of the ancillary test chosen as the gold standard. For example, Adam and colleagues used the 10-2 VF as the gold standard for testing mfERG [47]. They found mfERG to have a sensitivity of 89 %. Because of the way the study was set up, 10-2 VF testing had 100 % sensitivity. Therefore, the study design biased a comparison of relative sensitivity in favor of SAP [47]. Despite the inelegance of real life, the assumption of a gold standard for diagnosis is useful in understanding the underlying concept.
Is There a Gold Standard for 4AQR?
The literature is inconsistent on the issue of a gold standard for defining 4AQR (Table 8.1). Various gold standards are used, and in some cases none is defined. If there is no gold standard, then one cannot specify performance characteristics of an ancillary test of interest. This is a variation on the problem of defining 4AQR (see Chap. 4). Gold standards based on 10-2 VF interpretation alone are difficult to sustain because of the variation in interpretation of SAP by different clinicians.
Table 8.1
Published gold standards for 4-aminoquinoline retinopathy (4AQR)
Study | Gold standard |
---|---|
Farrell [16] | 10-2 VF is abnormal and compatible fundus abnormalities (unspecified) are present |
Adam [47] | Two of the three following are present: (1) 10-2 visual field defects, (2) compatible fundus abnormalities (unspecified), (3) compatible SD-OCT abnormalities (unspecified) |
(study not identified) | 10-2 VF is abnormal |
Fleck [45] | Compatible fundus pigmentary abnormalities on ophthalmoscopy |
(studies not identified) | Based on the totality of the clinical evidence (e.g., for Michaelides, this is the history, examination, 10-2 VF, global ERG, mfERG, SD-OCT, and FA; for Grierson this is history, examination, and red Amsler grid testing) |
(study not identified) | None defined for early 4AQR |
Of the performance characteristics of an ancillary test, the sensitivity is more important when the damage incurred by missing the presence of the disease is high. Specificity receives more emphasis when the costs of intervention are high. In the case of 4AQR, the cost of missing the diagnosis is some degree of visual loss. The cost of intervention—either reduction of dosing or cessation of drug—is reactivation of quiescent autoimmune disease.
Sensitivity and specificity are characteristics of a test, but to apply these statistics in clinical practice, one must know the prevalence of the disease in the population being studied. As covered in Chap. 5, there are no reliable data on prevalence of 4AQR, thus an assumption of prevalence must be made based on the available crude estimates. One way to handle this situation, which this chapter employs, is to assume a range of plausible prevalences based on the imperfect data and conduct a sensitivity analysis over that range of values.
An Example of Calculating PPVs and NPVs and How They Are Used by the Clinician
The ability of treating rheumatologists to detect early maculopathy, before bull’s-eye changes develop, has been proposed as a screening test for 4AQR [45]. The gold standard in this case was the findings of a single ophthalmologist [45]. The sensitivity and specificity of the rheumatologists for detecting maculopathy were 80.0 % and 89.3 %, respectively [45]. For a plausible range of prevalences of 4AQR, the calculated PPVs and NPVs are shown (Table 8.2). A recurring theme in the use of ancillary testing for 4AQR is illustrated—the dominating effect of prevalence on the PPVs and NPVs and the relative unimportance of small differences in performance characteristics of the ancillary test.
Table 8.2
Example of a spreadsheet calculation of positive and negative predictive values from sensitivity, specificity, and assumed prevalences
Assumed prevalence (%) | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) |
---|---|---|---|---|
0.1 | 80.0 | 89.3 | 0.7 | 100 |
1 | 80.0 | 89.3 | 7 | 99.8 |
3 | 80.0 | 89.3 | 18.8 | 99.3 |
5 | 80.0 | 89.3 | 28.2 | 98.8 |
Regardless of the result of this single ancillary test, it would be unlikely for a clinician to stop the prescribed 4AQ. At the most, a positive test would increase the clinician’s estimate of the probability of having 4AQR from 5 to 28.2 %—not enough to stop the medication. Instead, 28.2 % would become the new prior probability in a subsequent step of Bayesian inference. It is likely that another ancillary test would be applied to the patient. This test would have its own sensitivity and specificity, and a new PPV and NPV would be generated, which might be sufficient to change management.
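The spreadsheet calculation behind Table 8.2 can be reproduced with a short script. The sensitivity, specificity, and prevalences are those given in the table; the standard prevalence-based expressions for PPV and NPV are used, and the function names are illustrative.

```python
def ppv(sensitivity, specificity, prevalence):
    """Probability of disease given a positive test."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Probability of no disease given a negative test."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


SENS, SPEC = 0.800, 0.893            # rheumatologists' examination [45]
assumed_prevalences_pct = (0.1, 1, 3, 5)

print("prevalence % | PPV % | NPV %")
for prev_pct in assumed_prevalences_pct:
    p = prev_pct / 100.0
    print(f"{prev_pct:>12} | {100 * ppv(SENS, SPEC, p):5.1f} | "
          f"{100 * npv(SENS, SPEC, p):5.1f}")
```

Changing `SENS` or `SPEC` by a few percentage points moves the PPV far less than moving the assumed prevalence from 0.1 % to 5 % does, which is the point the table makes.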
When the definition of an abnormal test depends on a cut-point of a continuous variable, changes in the cut-point will change the sensitivity and specificity of the test. Receiver operating characteristic (ROC) curve analysis may then identify the cut-point that gives the best balance of sensitivity and specificity. However, not all authors go to the trouble to do this. For example, Lyons chose to maximize specificity of the mfERG in discriminating hydroxychloroquine retinopathy by defining an abnormal R1/R2 as lying above the 99th percentile for the normal population [52]. Had a lower percentile been chosen as the cut-point, the specificity would have been lowered but the sensitivity increased.
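The trade-off between sensitivity and specificity as the cut-point moves can be illustrated with two hypothetical normal distributions. The means and SDs below are invented for illustration and are not mfERG data from any study; `statistics.NormalDist` is part of the Python standard library (version 3.8 or later).

```python
from statistics import NormalDist

# Hypothetical distributions of a test variable in arbitrary units.
# These parameters are assumptions for illustration, not study data.
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=2.0, sigma=1.0)


def performance_at_percentile(pct):
    """Sensitivity and specificity when 'abnormal' means a value above
    the given percentile of the healthy distribution."""
    cut = healthy.inv_cdf(pct / 100.0)
    specificity = healthy.cdf(cut)           # equals pct/100 by construction
    sensitivity = 1.0 - diseased.cdf(cut)    # diseased values above the cut
    return sensitivity, specificity


sens_95, spec_95 = performance_at_percentile(95)
sens_99, spec_99 = performance_at_percentile(99)

print(f"95th percentile: sensitivity {sens_95:.2f}, specificity {spec_95:.2f}")
print(f"99th percentile: sensitivity {sens_99:.2f}, specificity {spec_99:.2f}")
```

Raising the cut-point from the 95th to the 99th percentile buys specificity at the cost of sensitivity; sweeping `pct` over a range of values traces out the ROC curve for these assumed distributions.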
The Problems of Judging Usefulness of an Ancillary Test
To judge whether a test is useful in detecting retinopathy, it is necessary to apply it to patients who clearly do and others who clearly do not have retinopathy. If this rule is not followed, one can decide erroneously that a test is not useful. For example, Fleck and colleagues assessed the value of SAP with red targets using the Friedmann Mark 1 visual field analyzer against patients who were taking 4AQs but had no retinopathy [45]. There were some differences in the proportions of minor macular pigmentary abnormalities between the group of patients taking and not taking 4AQs, but the macular pigmentary changes could not be classified as severe enough to diagnose 4AQ retinopathy. When they found no difference in the proportion of scotomas within 10 deg of fixation between groups, they concluded that SAP and surveillance by ophthalmologists might not be required. The conclusion does not follow, however, because they did not test their screening program against any patient with definite retinopathy.
It is also necessary to apply a test to a sufficient number of patients who have 4AQ retinopathy to be able to reach a valid conclusion. An illustration of the pitfall here was the series of six patients with hydroxychloroquine retinopathy reported by Bienfang [4]. All six had color vision abnormalities. This led him to conclude that the test had high sensitivity [4]. Yet the result was probably an artifact of a small sample as other studies have not been able to replicate his observation [26].
An Erroneous Calculation of PPV and NPV in the Literature of 4AQR
Vu and colleagues have published PPV and NPV for various color vision tests in detecting chloroquine retinopathy that appear to be erroneous [53]. Consider the evidence: They report the results for a number of color vision tests including Ishihara, SPP-2, D-15, Dsat-15, CU, and AO-HRR tests. Only the results reported for the SPP-2 are reviewed here, but the same analysis applies to all the other results in the paper. They calculate PPV and NPV for the SPP-2 as 90 % and 91.7 %, respectively, which would make this test valuable to a clinician if true.
The authors do not disclose the prevalence of chloroquine retinopathy that they used in their calculation, but it can be determined by working backward. Rearranging the formula for PPV shows that
$$\text{Prevalence} = \frac{\mathrm{PPV} - \mathrm{PPV} \times S_p}{S_n - \mathrm{PPV} \times S_n - \mathrm{PPV} \times S_p + \mathrm{PPV}}$$
where $S_n$ = sensitivity and $S_p$ = specificity. They report $S_n$ and $S_p$ for SPP-2 as 93.3 % and 88 %, respectively [53]. Therefore,
$$\text{Prevalence} = \frac{0.90 - 0.90 \times 0.88}{0.933 - 0.90 \times 0.933 - 0.90 \times 0.88 + 0.90} = \frac{0.108}{0.2013} \approx 0.537$$
The prevalence of chloroquine retinopathy is not 53.7 % (see Chap. 5). Using more realistic values, one can calculate low values for PPV for SPP-2 and all the other color vision tests. These values correspond with the clinical observation that color vision testing, when abnormal, is poorly predictive that chloroquine retinopathy is present.
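The back-calculation can be checked numerically. The sketch below evaluates the rearranged formula for the reported SPP-2 values and then recomputes the PPV at a 1 % prevalence; the 1 % figure is an assumption taken from the plausible range used in Table 8.2, and the function names are illustrative.

```python
def implied_prevalence(ppv, sensitivity, specificity):
    """Prevalence that would produce the given PPV, obtained by
    rearranging the prevalence-based formula for PPV."""
    numerator = ppv - ppv * specificity
    denominator = sensitivity - ppv * sensitivity - ppv * specificity + ppv
    return numerator / denominator


def ppv_from(sensitivity, specificity, prevalence):
    """PPV computed forward from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Values reported for the SPP-2 test [53].
SENS, SPEC, REPORTED_PPV = 0.933, 0.88, 0.90

prevalence_used = implied_prevalence(REPORTED_PPV, SENS, SPEC)

# Assumed, more realistic prevalence of 1 % (an illustrative choice).
realistic_ppv = ppv_from(SENS, SPEC, 0.01)

print(f"implied prevalence    = {100 * prevalence_used:.1f} %")
print(f"PPV at 1 % prevalence = {100 * realistic_ppv:.1f} %")
```

Feeding the implied prevalence back into `ppv_from` returns the reported 90 %, which confirms that the rearrangement is consistent with the forward formula.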
A high NPV means that a test rarely misclassifies a person with 4AQR as unaffected, but it provides no information about the risk of misclassifying a healthy person as having 4AQR. A high PPV increases the probability that the person has the disease. A low PPV means that many of the positive tests will be false positives, and that a more reliable follow-up test is needed. A test with a high PPV rarely misclassifies a person without 4AQR as having 4AQR but does not bear on the tendency to misclassify a person with 4AQR as healthy.
The PPV of a diagnostic test should not be used to evaluate the efficacy of the test when the pretest likelihood of the presence of disease is low (e.g., 10 % or less). Instead, in such a case the NPV is the appropriate concept to invoke. A high NPV confirms the entering clinical impression that disease presence is unlikely. The appropriate use of PPV is in a situation in which the pretest probability of presence of the disease is high (e.g., 90 % or greater). In this case, a high PPV supports the clinical impression that the disease is present. For cases with intermediate pretest probabilities, a combination of NPV and PPV can substantially improve clinical decision-making. For example, if the pretest probability of disease presence is 80 %, then a PPV of 90 % can improve one’s confidence that disease is truly present.
Because the estimated prevalence of 4AQR is less than 10 % (see Chap. 5), the main use of a positive ancillary test is to raise the prior probability from a low value (the estimated prevalence) to a higher posterior probability which will become the prior probability for another ancillary test. If this next test is also positive, the resulting PPV may be sufficient to change management (i.e., stop the 4AQ). It would be rare for management to change based on the results of a single ancillary test.
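The stepwise use of posterior probabilities as new priors can be sketched as a loop. The example chains the two tests whose performance figures appear earlier in the chapter (the rheumatologists' examination and 10-2 VF testing); chaining these particular tests, and treating their results as conditionally independent given disease status, are assumptions made here for illustration.

```python
def posterior_after_positive(prior, sensitivity, specificity):
    """Probability of disease after a positive test (Bayes' rule)."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


# (label, sensitivity, specificity) as quoted in the chapter.
positive_tests = [
    ("clinical examination [45]", 0.800, 0.893),
    ("10-2 VF [42]", 0.857, 0.925),
]

probability = 0.05            # assumed starting prevalence of 5 %
history = [probability]

# Each posterior becomes the prior for the next test.
# Assumption: test errors are independent given true disease status.
for label, sens, spec in positive_tests:
    probability = posterior_after_positive(probability, sens, spec)
    history.append(probability)
    print(f"after positive {label}: {100 * probability:.1f} %")
```

Under these assumptions the first positive result moves the estimate from 5 % to about 28 %, and a second positive result moves it to roughly 82 %, a level at which a change in management becomes reasonable to consider. If the tests' errors are correlated, the second step would add less than this sketch suggests.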
Because chloroquine and especially hydroxychloroquine retinopathy are not common, often the statistics for sensitivity and specificity are based on small numbers of patients. For example, in one report assessing color vision testing, statistics were based on four cases of advanced retinopathy [34]. One way to determine the reliability of a published statistic for sensitivity or specificity is to calculate the marginal error assuming that an additional case were added to the data and that the test misclassified the case. How would the sensitivity or specificity statistic be affected? An example of such a calculation has been published [54]. The desired outcome is that the marginal sensitivity and specificity would not be much affected by misclassification of the next patient. In reports with small sample sizes, marginal sensitivity and specificity would be influenced, making the reported estimates of sensitivity and specificity suspect.
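One plausible reading of the marginal calculation is sketched below; the published example [54] may differ in detail. The function name is hypothetical. The assumption that all four cases in the small series were detected, and the larger comparison series of 100 cases, are made up for illustration.

```python
def marginal_sensitivity(true_positives, cases):
    """Sensitivity as reported, and after adding one more diseased case that the test misses."""
    reported = true_positives / cases
    marginal = true_positives / (cases + 1)
    return reported, marginal

# Four cases, as in the small report mentioned above; that all four were
# detected is an assumption made only for this illustration.
small = marginal_sensitivity(4, 4)
# A hypothetical larger series with the same reported sensitivity.
large = marginal_sensitivity(100, 100)

print(small)   # reported vs. marginal sensitivity, small series
print(large)   # reported vs. marginal sensitivity, large series
```

In the small series, one additional misclassified case drops the sensitivity from 100 % to 80 %. In the larger series the same event changes the estimate by about one percentage point.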
What Does It Mean When There Is No Gold Standard?
In some studies of ancillary tests, no attempt is made to define a gold standard and no action is taken as a result of applying the ancillary testing. The only aim appears to be to establish what proportions of patients have abnormalities in one or another of several tests. For example, Xiayun and colleagues compared 10-2 VFs, mfERGs, and RNFL thickness measurements with a scanning laser polarimeter in patients with rheumatoid arthritis (RA) taking chloroquine, patients not taking chloroquine, and normal subjects [40]. No patient was taken off chloroquine, leading the reader to infer that retinopathy was not diagnosed in any patient. The authors reported that mfERGs were abnormal in 42 of 60 (70 %), RNFL measurements were abnormal in 40 of 60 (66.7 %), and 10-2 VFs were abnormal in 2 of 60 (3.3 %).
The percentages are dependent on the definitions chosen for abnormality. In this case, the authors defined an abnormal mfERG as having a reduced N1 or P1 amplitude in either ring R1 or R2 [40]. They defined an abnormal 10-2 VF as one with a paracentral point having less than a 1 % chance of being normal without regard to whether the abnormality persisted on repeat field testing [40]. The definition of an abnormal RNFL measurement was not given, but may have been compared to a proprietary table of normal values provided by the instrument manufacturer. Had the authors required R1/R2 to be greater than 2.6 (another common definition for mfERG abnormality [52]) for an mfERG to be labeled as abnormal, the percentage of abnormal results might have been lower than the 70 % reported. In the absence of a gold standard that implies a clinical action, it is hard to know what to do with the information reported. Similarly, Almony and colleagues never define a gold standard, reducing the clinical impact of their report [17].
The clinician would often like to know how much value is gained by performing an additional test, because the use of multiple ancillary tests is common in screening for 4AQR. This depends on the number of additional cases turned up by adding the additional test. A statistical rule termed the Rule of Three is used in analyses of these situations. It holds that if no additional patients out of n tested are detected with 4AQR by adding a test, then one can have 95 % confidence that the true rate of additional detection of 4AQR were the test to be applied universally to the population is no more than 3/n [55, 56].
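The Rule of Three can be checked against the exact bound it approximates. In this minimal sketch the function names and the sample sizes are hypothetical. The exact bound is the rate p at which the probability of observing zero events in n independent trials, (1 − p)^n, falls to 5 %.

```python
def rule_of_three_upper_bound(n):
    """Approximate 95 % upper bound on the event rate when 0 events are seen in n trials."""
    return 3.0 / n

def exact_upper_bound(n, confidence=0.95):
    """Rate p solving (1 - p)**n = 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

for n in (30, 100, 300):   # hypothetical numbers of patients tested
    approx = rule_of_three_upper_bound(n)
    exact = exact_upper_bound(n)
    print(f"n={n}: rule of three {approx:.4f}, exact {exact:.4f}")
```

The approximation is slightly conservative: for each sample size shown it gives a bound a little above the exact value.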
8.3 Reproducibility of Ancillary Tests
All of the instruments used in ancillary testing for detecting 4AQR are subject to measurement error. There are many sources of variation in measurements including intra-observer test–retest variability, inter-observer variability, variability across different machines, short-term variation, and long-term variation. The clinician needs to know the size of the measurement error in order to properly interpret a change in a measurement. Reproducibility gives the clinician an idea of how trustworthy the test result is [57]. If a test has poor reproducibility, then one is not confident in making the diagnosis of 4AQR based on that test alone [23]. For example, mfERG is poorly reproducible, which is a source of controversy over how useful this test could be in screening for 4AQR (see Sect. 8.6). Other examples include fundus photography, tangent screen testing of central visual field using a red test object, and Amsler grid testing [10, 11, 58].
To motivate a way of approaching the problem of reproducibility, consider the mfERG as used to assess macular function in patients taking 4AQs. To take one index of interest, R1/R2, the clinician wishes to know how big a change in the R1/R2 ratio is required to consider that a change in the patient has truly occurred and not simply natural variation. The true value of a variable is not known. The measurements are attempts to estimate the true value. The best estimate of the true value is the mean of multiple measurements. Therefore, the study of the variability involves a study of the errors of the measurements from the mean values. For example, if two measurements S1 and S2 are made, the best estimate of the true value is (S1 + S2)/2. The estimated errors of the individual measurements would be
S1 − (S1 + S2)/2 = (S1 − S2)/2.
S2 − (S1 + S2)/2 = (S2 − S1)/2.
This points to the usefulness of plotting the difference in values of a pair of measurements against the average of the values, which is the basis of the Bland–Altman analysis [57]. From the Bland–Altman analysis comes the coefficient of repeatability (COR), which allows one to state how large a change in a measurement must be (1.96 × COR) for one to be 95 % confident that the change is real and not measurement variability [59, 60, p. 236]. Any difference between two measurements that exceeds this value can be regarded, with 95 % confidence, as a real change [57].
For example, in the measurement of ring-averaged mfERG amplitudes in normal volunteers from one study, the COR ranged from 17.4 to 30.3 % in rings R1 to R5 [61]. If we suppose that a baseline R1 amplitude is 50 nV/deg², then we can be 95 % confident that a measurement of less than 32.9 nV/deg² at a follow-up visit represents a true diminution of the patient’s R1 amplitude because the greater than 17.1 nV/deg² decrement in voltage exceeds 1.96 × 17.4 % of the baseline measurement.
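The arithmetic in this example can be reproduced with a short sketch. The function name is hypothetical, and the calculation follows the 1.96 × COR rule exactly as stated in the text.

```python
def minimum_real_change(baseline, cor_fraction, z=1.96):
    """Smallest change treated as real at 95 % confidence, using the text's 1.96 x COR rule."""
    return z * cor_fraction * baseline

baseline_r1 = 50.0   # nV/deg^2, the supposed baseline amplitude from the text
cor = 0.174          # 17.4 %, the lowest COR reported in the cited study

change = minimum_real_change(baseline_r1, cor)
print(round(change, 1))                 # required decrement, nV/deg^2
print(round(baseline_r1 - change, 1))   # follow-up value below which the change is taken as real
```

Substituting the upper end of the reported COR range (30.3 %) would require a much larger decrement before a change could be called real.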
Test equipment improves continuously. Therefore, reproducibility depends on technology. For example, reproducibility with SD-OCT is better than with time domain OCT (TD-OCT) [62, 63]. Repeatability cannot be assessed by simply performing measurements on a sample of patients or eyes at times one and two and comparing the means and standard deviations of the groups at the two times, as has been done [64]. These statistics may be comparable, but this says nothing about the reproducibility, which is assessed by comparing measurements in each patient at times one and two [57].
Which Measure of Reproducibility Is Best?
There are other methods of assessing reproducibility besides the COR. Some studies use the COV defined as the standard deviation divided by the mean value of a set of repeated measurements usually expressed as a percentage [65]. When the COV is used, another statistic is often used—the smallest measurable change, defined as the measurement times the COV. Table 8.3 lists reproducibility data for various OCT machines for macular thickness using this conceptual framework.
Table 8.3
Reproducibility of macular thickness measurements with optical coherence tomography instruments
OCT instrument | CSMT (μm) | COV (%) | Smallest measurable change (μm) |
---|---|---|---|
Spectralis | 289 | 0.46 | 1 |
OCT SLO | 244 | 2.23 | 5 |
RTVue | 247 | 2.77 | 7 |
Stratus | 212 | 3.33 | 7 |
Cirrus | 277 | 3.09 | 9 |
Copernicus | 249 | 3.50 | 9 |
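The last column of Table 8.3 follows from the first two columns by the definition given above (measurement times COV). A minimal sketch that recomputes it from the tabulated values, with a hypothetical function name:

```python
# (instrument, CSMT in micrometers, COV in percent), copied from Table 8.3
table_8_3 = [
    ("Spectralis", 289, 0.46),
    ("OCT SLO", 244, 2.23),
    ("RTVue", 247, 2.77),
    ("Stratus", 212, 3.33),
    ("Cirrus", 277, 3.09),
    ("Copernicus", 249, 3.50),
]

def smallest_measurable_change(measurement, cov_percent):
    """Measurement multiplied by the COV expressed as a fraction."""
    return measurement * cov_percent / 100.0

for name, csmt, cov in table_8_3:
    change = smallest_measurable_change(csmt, cov)
    print(f"{name}: {change:.2f} um (rounded: {round(change)})")
```

Rounded to the nearest micrometer, the computed values match those listed in the table.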
Yet another method of assessing reproducibility is to calculate the intraclass correlation coefficient (ICC). This is a statistic that scales the test–retest variability (the undesirable variability) by the true variability of the quantity being measured across the subjects in the sample. The ICC has values ranging from 0 to 1. The closer the value is to 1, the more reproducible the measurement is. A typical ordinal scale for ICCs follows:
Slight reproducibility—ICC between 0 and 0.2
Fair reproducibility—ICC between 0.21 and 0.4
Moderate reproducibility—ICC between 0.41 and 0.6
Substantial reproducibility—ICC between 0.61 and 0.8
Almost perfect reproducibility—ICC between 0.81 and 1.0 [65]
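The following sketch computes an ICC from test–retest data and maps it onto the ordinal scale above. The source does not say which form of the ICC the cited studies used. The one-way random-effects form is an assumption made here, and the thickness values are hypothetical.

```python
def icc_oneway(data):
    """One-way random-effects ICC; `data` is a list of subjects, each with k repeated measurements."""
    n = len(data)
    k = len(data[0])
    grand_mean = sum(sum(row) for row in data) / (n * k)
    subject_means = [sum(row) / k for row in data]
    # Between-subject and within-subject mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(data, subject_means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def reproducibility_label(icc):
    """Ordinal label from the scale in the text; values at or below 0.2 are called slight."""
    for upper, name in [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"),
                        (0.8, "substantial"), (1.0, "almost perfect")]:
        if icc <= upper:
            return name
    raise ValueError("ICC cannot exceed 1")

# Hypothetical macular thickness (um) measured twice in four eyes.
thickness = [[250, 252], [270, 269], [300, 303], [280, 281]]
icc = icc_oneway(thickness)
print(round(icc, 3), reproducibility_label(icc))
```

Because the eyes in this made-up sample differ from one another far more than the two measurements within each eye differ, the ICC is close to 1.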
Reassuringly, different measures of reproducibility tend to vary together. Altemir and colleagues presented reproducibility data on SD-OCT measurements of RNFL and macular thickness as assessed by both COV and ICC. As Fig. 8.3 shows, the two measures were correlated (r² = 0.7216) [65].
Fig. 8.3
Graph of intraclass correlation coefficient versus coefficient of variation in spectral domain optical coherence tomography measurements of macular thickness and retinal nerve fiber layer (RNFL) thickness. Data from Altemir [65]
All three methods of analyzing reproducibility are described in the literature covering ancillary testing in ophthalmology. There seems to be no consensus about which form of reproducibility analysis is superior, and attaining familiarity with each method is worth the effort required [66].
For some ancillary tests, the greatest challenge is distinguishing long-term fluctuation (LTF) from progressive change due to toxicity. This is the case with 10-2 VF testing and mfERG, but notably does not apply to SD-OCT [67, p. 86]. SD-OCT has a privileged position as the most reproducible ancillary test and the one with the smallest LTF.
8.4 Establishing a Prior Probability and Bayesian Inference
To properly use an ancillary test one should have in mind a prior probability of the patient’s having 4AQR. This will depend on multiple characteristics of the patient and a knowledge of the epidemiologic facts contained in Chap. 5. In the absence of other information, the best estimate of the prior probability is the prevalence of 4AQR among patients similar to the one under review.
Bayesian inference refers to modification of the clinician’s estimate of the probability of having a disease based on the addition of further information. Thus, for example, if a patient is a 6 ft 4 in. male taking hydroxychloroquine 400 mg/day for 1 year, our initial estimate of the probability that he has hydroxychloroquine retinopathy is vanishingly low—perhaps 0.001 % or less. In such a circumstance, an initial 10-2 visual field with paracentral loci of increased threshold would be discounted given the entire clinical picture. The significance of such a field would be different if a female patient happened to be 5 ft tall and was in her 20th year of therapy on the same daily dose.
It is useful to formally state Bayes’ theorem. If H is the hypothesis to which the evidence E refers, then the formula expressing Bayes’ theorem is:
$$P(H/E) = \frac{P(E/H)\,P(H)}{P(E)}$$
where
P(H) is the prior probability that H is true before we have the evidence E.
P(H/E) is the probability that H is true after we know E.
P(E/H) is the probability that the evidence E occurs when H is true.
P(E) is the marginal likelihood of E, which is the probability of having that piece of evidence in a random person drawn from the population [68].
In the example we posited, the marginal probability is the probability that a random person taking a 10-2 VF would have paracentral loci of increased threshold.
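Bayes’ theorem can be tied to the sensitivity and specificity discussed earlier in the chapter. As a sketch, let H be the presence of 4AQR and E a positive ancillary test, so that P(E/H) is the sensitivity. The symbol $\bar{H}$, denoting absence of 4AQR, is notation introduced here for illustration. By the law of total probability the marginal likelihood expands as follows:

```latex
P(E) = P(E/H)\,P(H) + P(E/\bar{H})\,\bigl(1 - P(H)\bigr),
\qquad
P(H/E) = \frac{P(E/H)\,P(H)}{P(E/H)\,P(H) + P(E/\bar{H})\,\bigl(1 - P(H)\bigr)}
```

Here $P(E/\bar{H})$ equals 1 minus the specificity, so P(H/E) is the positive predictive value at the stated prior. When P(H) is very small, the second term of the denominator dominates, and P(H/E) stays small even for a test with good sensitivity and specificity.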
The importance of Bayes’ theorem for the clinician is that it refocuses attention away from the characteristics of the ancillary test and puts more emphasis on the probability of disease presence before the test is applied. In this way awareness of Bayes’ theorem leads the clinician to screen those at higher risk and provides a rational basis for relative neglect of those at low risk, because in those patients extensive testing would represent an unwise expenditure of money [69, 70].
8.5 Static Automated Perimetry
Perimetry encompasses many methods, including tangent screen perimetry, kinetic perimetry with a Goldmann perimeter, and SAP, which is the standard method of testing the central visual field in screening for 4AQR. Advantages over the older methods include its standardization and greater sensitivity to visual field loss [71]. Testing protocols have been developed that allow gathering of data in an acceptable amount of time in ways less dependent on operator expertise, a significant variable in Goldmann perimetry [67, p. 89]. In the United States, and perhaps the world, the Humphrey visual field analyzer is the most commonly used instrument [67, p. 86]. The examples included in this book largely come from this instrument.
Static perimetry measures differential light sensitivity to a stimulus of varying intensity against a background of constant luminance. Luminance is the amount of light reflected or emitted from a surface. The unit of luminance is the apostilb (asb). One apostilb is the luminance of a perfectly diffusing surface that is emitting or reflecting 1 lumen/m²; other conversions are one apostilb (asb) = 0.3183 cd/m² = 0.1 millilambert [67, p. 90]. The maximal stimulus luminance of the Humphrey perimeter is 10,000 apostilbs. Lower luminances are achieved by interposing neutral density filters of increasing strength between the light bulb and the perimeter. The resulting stimulus intensity is measured in decibels (dB), where 10 dB is equivalent to 1 log unit of stimulus luminance, and 1 log unit is equivalent to a tenfold change in stimulus intensity.
Stimulus intensity and retinal sensitivity are related concepts that have inverted scales relative to each other. The highest stimulus intensity, 10,000 asb, corresponds to a retinal sensitivity of 0 dB. Failure to respond to this stimulus at a location is termed an absolute scotoma. The minimal stimulus intensity is 0.1 asb, which corresponds to a retinal sensitivity of 50 dB. An increase in retinal sensitivity of 10 dB corresponds to a reduction by a factor of 10 in the stimulus intensity that is seen. Thus, a retinal sensitivity of 20 dB corresponds to recognition of a stimulus of intensity 100 asb: 10,000 asb divided by 10 (for the first 10 dB) is 1,000 asb, which divided by 10 again (for the second 10 dB) is 100 asb. Therefore, in the parlance of visual field interpretation, a threshold with a higher number of decibels implies that the retina is more sensitive at that locus. Likewise, a threshold with a lower number of decibels implies less sensitivity at that location. Figure 8.4 shows the conversion scale of thresholds and gray scale shadings for the Humphrey visual field analyzer.
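The relation between the decibel scale and stimulus luminance can be written as a pair of conversion functions. This is a minimal sketch with hypothetical function names, using the Humphrey maximum of 10,000 asb stated in the text as the 0 dB reference.

```python
import math

HUMPHREY_MAX_ASB = 10_000.0   # maximal stimulus luminance stated in the text

def db_to_asb(sensitivity_db, max_asb=HUMPHREY_MAX_ASB):
    """Stimulus luminance (asb) of the dimmest stimulus seen at a given sensitivity (dB)."""
    return max_asb / 10 ** (sensitivity_db / 10)

def asb_to_db(stimulus_asb, max_asb=HUMPHREY_MAX_ASB):
    """Sensitivity (dB) corresponding to a stimulus of the given luminance (asb)."""
    return 10 * math.log10(max_asb / stimulus_asb)

for db in (0, 10, 20, 50):
    print(db, "dB ->", db_to_asb(db), "asb")
```

Passing a different `max_asb` shows why the same decibel value denotes a different luminance on an instrument with a different maximal stimulus.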
Fig. 8.4
Gray scale symbols used in the Humphrey 10-2 visual field plot. The symbol used in the gray scale visual field appears on the top row. The second row shows the stimulus brightness that corresponds to the gray scale symbol above it. The third row shows the strength of the neutral density filter interposed between the stimulus and the patient that corresponds to the stimulus strength and gray scale symbol listed above it
The three major commercial static perimetry instruments are the Humphrey Visual Field Analyzer, the Octopus perimeter, and the Rodenstock perimeter. Table 8.4 lists some differences in these instruments. The luminance values attached to the decibel scale are not standardized across different machines because the maximal luminance for the different machines is not the same. The maximal luminance is brighter on the Humphrey Visual Field Analyzer than on the Octopus perimeter. A 20 dB stimulus on the Humphrey machine is the same brightness as a 10 dB stimulus on the Octopus machine.
Table 8.4
Comparison of specifications of three static automated perimeters
Variable | Humphrey visual field analyzer | Octopus 900 perimeter | Rodenstock peristat |
---|---|---|---|
Background luminance | 31.5 asb | 4 asb | 3.1 asb |
Stimulus size | III (4 mm²) or V | I, II, III, IV, V | III, V |
Stimulus color | White or red | White | Blue, green |
Radius of tested VF (deg) | 10, 24, or 30 depending on program | 30 | 25 |
Number of tested points | 68, 54, and 76 for the 10-2, 24-2, and 30-2 | 59 | 78 |
Minimal stimulus intensity | 0.1 asb | 0.1 asb | 0.3 asb |
Maximal stimulus intensity | 10,000 asb | 6,000 asb | 10,000 asb |
Step size of increasing stimulus intensity | 3–4 dB steps | 2–10 dB steps | 2 dB steps |
The measured threshold at a given location using the Humphrey visual field analyzer is determined using an initial stimulus methodology followed by a pointwise bracketing methodology. The initial stimulus values are chosen by determining thresholds at four points in the visual field, one per quadrant, with each one 9 deg from the horizontal and vertical meridians. The initial stimulus is set at 25 dB and then stimulus intensities are changed in 4 dB steps until the threshold from seeing-to-nonseeing or nonseeing-to-seeing is crossed. Then the direction of change is reversed and the steps are reduced in size to 2 dB until the threshold is recrossed. The last-seen stimulus luminance is then recorded as the threshold for the location. From the thresholds at these four locations, initial stimulus strengths are determined by an algorithm based on the correlation of stimulus strengths at different locations in normals. The same bracketing strategy, termed the 4-2 strategy, is then applied to each location tested in a random order to reduce the probability of anticipation by the patient.
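The 4-2 bracketing strategy at a single location can be sketched as follows. This is an illustration under simplifying assumptions and not the instrument’s actual implementation. The function name, the 0 to 50 dB limits, and the deterministic simulated patient (who sees every stimulus at or below a fixed dB level) are all hypothetical. Real responses are probabilistic.

```python
def staircase_4_2(sees, start_db=25, lo_db=0, hi_db=50):
    """Bracket a threshold with 4 dB steps, then 2 dB steps after the first crossing.

    `sees(db)` returns True if the stimulus at that dB level is seen.
    Higher dB means a dimmer stimulus. Returns the last-seen level,
    or None if no stimulus was ever seen.
    """
    level = start_db
    seen = sees(level)
    last_seen = level if seen else None

    for step in (4, 2):
        start_response = seen
        # Seen: move dimmer (higher dB). Not seen: move brighter (lower dB).
        direction = 1 if seen else -1
        while seen == start_response:
            nxt = max(lo_db, min(hi_db, level + direction * step))
            if nxt == level:      # ran out of range before the threshold was crossed
                return last_seen
            level = nxt
            seen = sees(level)
            if seen:
                last_seen = level
    return last_seen

# Simulated patient whose true sensitivity at this location is 30 dB.
print(staircase_4_2(lambda db: db <= 30))
```

For the simulated patient the presentations are 25, 29, and 33 dB in 4 dB steps, then 31 and 29 dB in 2 dB steps, so the recorded threshold is 29 dB.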
Threshold sensitivity at a retinal location cannot be precisely measured because it is a probabilistic concept. Instead, it is estimated by the strategy of bracketing. The threshold is the dimmest target identified 50 % of the time at a given location. When the frequency of seeing curve is steep, the estimate for the threshold is more reliable. When it is shallow, the estimate is less reliable [67, p. 91]. Threshold deviation refers to the difference between the patient’s threshold sensitivity at a particular location and the age-matched normal retinal sensitivity for that location. A list of several terms used in displays of SAP, with their definitions, follows.
False positive—The instrument provides a sound cue with a subthreshold or no stimulus; if the patient responds, a false positive is recorded. In the SITA strategy, false positive implies the response occurred within the response time for the patient. A false positive rate greater than 33 % implies low reliability of testing [40, 49, 69, 71, 72, p. 102]. Others have used a stricter criterion, such as requiring a false positive rate less than 25 or 15 % to be considered reliable (Fig. 8.5) [39].
Fig. 8.5
Example of an unreliable 10-2 visual field (VF). This is the 10-2 VF from the right eye of a 79-year-old woman taking 200 mg/day of hydroxychloroquine for unspecified arthritis. There are many flags that this test is unreliable. The fixation losses were 5 of 18 trials (28 %) (blue arrow). The false negative responses were 15 % (green arrow). The false positive responses were 15 % (orange arrow). Therefore the accuracy of the various indicators of abnormality, such as the depressed mean defect (brown arrow), pattern standard deviation (PSD) (red arrow), and paracentral scotomas (green-circled and red-circled locations), must all be interpreted with skepticism
False negative—The instrument registers the threshold at a locus and then returns to test the same locus with a stimulus 9 dB brighter than the threshold stimulus determined previously. Failure of the patient to respond to this stimulus is a false negative. Fatigue is a common source of false negative responses. A false negative rate greater than 20–33 % implies low reliability of testing [40, 49, 67, 71, 72, p. 102]. Others have used a stricter criterion, such as requiring a false negative rate less than 15 % to be considered reliable (Fig. 8.5) [39].
Fixation losses—Maintenance of fixation can be tested by periodically retesting the location of the physiologic blind spot, which has dimensions 5 × 7 deg (Heijl-Krakau method), or by using a video monitor to detect pupil movement. A fixation loss rate greater than 20 % is considered significant [40, 67, 71, 72, p. 102]. Others have used a stricter criterion, such as requiring a fixation loss rate less than 15 % to be considered reliable [39]. Some have used a looser criterion, such as requiring that fixation losses must be less than 33 % (Fig. 8.5) [49].
Sensitivity—A threshold expressed in decibels. A higher number indicates that the retina has a lower threshold for seeing, or that the retina is more sensitive, or that the retina sees a dimmer light.
Deviation plot—A map of sensitivity versus location. The numbers shown are the sensitivities in dB in some displays or a symbol in others that expresses the probability of measuring the observed sensitivity compared to age-matched normal subjects. A positive deviation implies that the retina was more sensitive at the given location than normal. A negative deviation means that the retina was less sensitive at the given location than normal.
Defect depth plot—A map of the amplitude of the deviations relative to the average age-adjusted sensitivities by location. A positive defect depth implies that a scotoma exists at the location. A negative defect depth implies that the retina at that location is more sensitive than normal. Deviations within 4 dB of expected are displayed on a defect depth plot as normal.
Total deviation plot—A plot in which the number appearing at each point is the difference in the light sensitivity for the patient compared to an age-matched normal subject. The numbers represent the stimulus expressed in decibels. The larger a negative number, the more abnormal and less sensitive the retinal sensitivity is at the given point.
Pattern deviation plot—A plot related to the total deviation plot in which the seventh largest deviation in the total deviation plot is subtracted from the deviation at each point [71]. The effect is to remove generalized depression of the visual field and reveal localized depressions (scotomas).
Pattern standard deviation (PSD)—A location weighted standard deviation of the threshold values that quantitates the variation of the hill of vision. In the vernacular of Octopus visual fields, the same concept is captured by the term “loss variance.”
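The reliability indices defined in this list (false positives, false negatives, and fixation losses) lend themselves to a simple check. In the sketch below the function name is hypothetical and the cutoffs are parameters, because the cited authors use different criteria. Treating a rate exactly at the cutoff as unreliable is an assumption made here. The example values are those reported for the field in Fig. 8.5.

```python
def reliability_flags(fixation_losses, fixation_trials, false_pos_pct, false_neg_pct,
                      fix_limit=20.0, fp_limit=33.0, fn_limit=33.0):
    """Return the reliability indices that reach or exceed their cutoffs (in percent)."""
    fix_pct = 100.0 * fixation_losses / fixation_trials
    flags = []
    if fix_pct >= fix_limit:
        flags.append("fixation losses")
    if false_pos_pct >= fp_limit:
        flags.append("false positives")
    if false_neg_pct >= fn_limit:
        flags.append("false negatives")
    return flags

# Values from Fig. 8.5: 5 of 18 fixation losses, 15 % false positives, 15 % false negatives.
print(reliability_flags(5, 18, 15, 15))                                          # conventional cutoffs
print(reliability_flags(5, 18, 15, 15, fix_limit=15, fp_limit=15, fn_limit=15))  # stricter 15 % cutoffs
```

With the conventional cutoffs only the fixation losses are flagged. With the stricter 15 % criteria all three indices are flagged, which shows how much the verdict depends on the criterion chosen.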
In SAP, the stimulus spot size is indicated by a roman numeral. The 10-2 VF stimulus typically has a spot size III, which subtends 0.43 deg of visual field, small enough to detect small scotomas, yet large enough to be unaffected by refractive error [73, p. 8]. The stimulus duration is 0.2 s, which is shorter than the latency time of 0.25 s for voluntary eye movements, a necessary condition for preventing a saccade toward the stimulus [73, p. 10]. The stimulus may be a red or a white light in the 10-2 VF. Results of visual field testing using the red light tend to be more sensitive and less specific than when using the white light [74, 75].
SAP is a subjective psychophysical test that depends on the cooperation, effort, and mental status of the patient [67, p. 91]. The effort involved in a visual field test is substantial. For a 30-2 visual field, the patient is presented with approximately 550 stimuli, which takes on average 15 min per eye. Although fatigue is less of a problem with the 10-2 VF, which takes from 3 to 7 min per eye (Fig. 8.6), the problem of inattention is not trivial, especially among older patients. However, the subjectivity is not confined to the patient. The perimetrist must monitor and coach the patient on attention and fixation. Some perimetrists do this more successfully than others, and the performance of a given perimetrist will vary over time.
Fig. 8.6
This 10-2 visual field (VF) was done with the III, red light using the FASTPAC protocol. The field is normal, and illustrates several points to consider in interpreting such VFs. The test took 6 min, 47 s (red arrow). The gray scale commonly shows dark areas particularly near the edge of the visual field that are not worrisome, because the thresholds at the involved locations, as shown from the defect depth display, all lie within 4 dB of the expected thresholds for age-matched normal subjects. In the defect depth display, there is only one location with an abnormal threshold (red-circled locus). At this location the threshold was 16 dB (green-circled locus), which can be seen to be a higher threshold (reduced sensitivity) compared to its neighboring loci, which range from 19 to 23 dB. Note the inexact congruence of the gray scale and defect depth display. For example, on the gray scale display there appears to be a relative scotoma at the green arrow, but the defect depth display shows that the threshold at this location lies within 4 dB of normal as represented by the 0 at the location (blue-circled location)
The terms for visual field programs using the Humphrey Field Analyzer have the format “program X − Y”. The X refers to the radius of visual field tested relative to fixation. Thus, a 10-2 visual field tests the field from fixation out to 10 deg from fixation. The Y implies that the test points lie on either side of the horizontal and vertical axes, not on the axes. The other possible value for Y is 1, in which case the test points lie on the axes. This latter protocol is not used [76].
The 10-2 VF test is the preferred program to use in screening for 4AQR [40, 72]. It tests 68 points at 2 deg intervals from fixation outward to 10 deg, which is the same area tested by the Amsler grid and is the region where the earliest scotomas of 4AQR appear [74]. The normal control value for retinal sensitivity at each point of the 10-2 VF is age-matched [6]. There are no published results of 10-2 VF testing with either white or red programs in patients taking 4AQs, although they were promised as an outcome of the prospective, multicenter North American Plaquenil Study, which apparently collapsed [6].
Although 10-2 VF testing is the most commonly used form of SAP, others have been used, including the Friedmann visual field analyzer with red targets and the Humphrey 24-2, 30-2, and macular visual field programs [23, 45, 77–80]. The Friedmann visual field analyzer tests 14 points within 10 deg of fixation compared to 68 test points for the 10-2 VF. The 24-2 and 30-2 programs extend testing further radially and suffer from the disadvantage that they minimize attention to the affected paracentral visual field [78, 79]. The macular visual field program tests 16 points in the central 5 deg of visual field at 2 deg intervals.
Commonly used variations of 10-2 VF protocols are the Swedish Interactive Threshold Algorithm (SITA) protocol with a white III target, the FASTPAC protocol with the red III target, or the FASTPAC protocol with the white I target [72, 75]. The literature often depicts threshold graytone visual field displays for the red and white target protocols, but only shows pattern deviation plots for the SITA protocol for white III targets [75]. In the SITA protocol symbols are shown with the probability of having a defect of the recorded size relative to an age-matched normal population (Fig. 8.4). SITA is a program that determines whether to recheck thresholds at more points based on the results of selected rechecks at a small sample of points. SITA-standard is a program that is stricter in its requirements for reproducibility. SITA-FAST has looser criteria. SITA-standard takes approximately 50 % as long, and SITA-FAST approximately 20 % as long as older pre-SITA programs. In the FASTPAC protocol, the bracketing strategy for determining the retinal threshold is modified. The stimulus intensity is adjusted in 3 dB increments until the threshold is crossed once. This saves time compared to the 4-2 strategy. In normal or near normal fields, the test time is reduced approximately 40 %. The FASTPAC protocol with the red III target has a defect depth display. In this display, the more positive the defect depth the denser the scotoma (Fig. 8.6).
In following patients taking 4AQs, the clinician looks for changes in the 10-2 VFs over time. As with all ancillary tests, discriminating fluctuation in measurements from true changes reflecting retinopathy is important [67, p. 86]. Many have complained that 10-2 VFs are often variable, inconsistent, and difficult to confirm [30]. Fluctuation not associated with retinopathy has been subcategorized into short-term fluctuation (STF) and LTF. STF refers to variation in threshold during the course of a single visual field examination. The Humphrey visual field analyzer measures the threshold twice at 10 loci and displays the standard deviation of the repeated threshold determinations. Normally STF is less than 5 dB for the 10-2 VF [81]. STF greater than 5 dB suggests poor reliability [67, p. 93]. STF increases at the borders of scotomas and in patients who are inconsistent [82]. The variability of 10-2 VFs in some patients implies that SD-OCT or other ancillary tests may be more reliable and needed for making screening decisions (Fig. 8.7) [83]. It is rare to stop a patient from taking a 4AQ based on a single abnormal 10-2 VF, especially if adjusted dosing is appropriate.
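A short-term fluctuation value of the kind described above can be sketched from duplicate threshold measurements. The exact formula used by the instrument is not given in the text. Pooling the duplicates as a within-location standard deviation, the square root of the sum of squared differences divided by 2n, is an assumption made here. The function name and threshold pairs are hypothetical.

```python
import math

def short_term_fluctuation(pairs):
    """Pooled standard deviation (dB) of duplicate threshold measurements."""
    n = len(pairs)
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * n))

# Hypothetical duplicate thresholds (dB) at 10 loci.
pairs = [(30, 31), (29, 27), (31, 31), (28, 30), (32, 31),
         (30, 29), (27, 29), (31, 30), (29, 29), (30, 32)]

stf = short_term_fluctuation(pairs)
verdict = "poor reliability suspected" if stf > 5 else "within the usual range"
print(round(stf, 2), verdict)
```

For these made-up data the value is 1 dB, well under the 5 dB level that the text associates with poor reliability.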
Fig. 8.7
10-2 visual fields and spectral domain optical coherence tomography (SD-OCT) of a 65-year-old female taking 400 mg/day of hydroxychloroquine since 2000 for arthritis. She was 67 in. tall, weighed 250 lb, and had no renal or liver disease. Her cumulative dose of hydroxychloroquine was 1,752 g. Her 10-2 VFs showed a number of scotomas that were not reproducible over time. Because she was on a nontoxic dose of hydroxychloroquine, and had a normal SD-OCT bilaterally, the medication was continued and the dosage not reduced. The cumulative dose of hydroxychloroquine placed her in a risk group indicating a need for yearly screening according to American Academy of Ophthalmology guidelines, but in the presence of nontoxic daily dosing, the risk was still extremely low. (a) Serial 10-2 visual fields and an SD-OCT of the left eye. The locations circled in red show scotomatous points that vanish from one test to the next. The new scotomatous points that appear in the field of 18 February 2013 (blue-circled area) are not credible given the past history and the presence of a normal SD-OCT. (b) Serial visual fields and an SD-OCT of the right eye. The locations circled in red show scotomatous points that vanish over time. The new scotomatous points that appear in the field of 18 February 2013 (blue-circled area) are not credible given the past history and the presence of a normal SD-OCT
LTF refers to variation between tests occurring over time (not within a single test) and does not include learning curve effects.
Two types of LTF are recognized: homogeneous and heterogeneous. Homogeneous LTF refers to variation over time throughout the visual field. Heterogeneous LTF refers to incongruous variation at different locations. LTF increases as the initial sensitivity of a location decreases and as distance of a location from the fovea increases [67, p. 94]. LTF limits the clinician’s ability to detect subtle changes caused by 4AQR. In glaucoma, a change in sensitivity greater than 3–4 dB can indicate early glaucomatous damage. For example, Hoskins and colleagues studied how much change in sensitivity was necessary between a first and second visual field to predict that a third visual field would be decreased compared to the first field [84]. This analysis was based on 30-2 visual fields obtained in patients with glaucoma and minimal (mean sensitivity for the studied region was greater than 25 dB) or moderate (mean sensitivity for the studied region was 25 dB or less) visual field damage. In patients with minimal field damage a 4.7–5.6 dB change in mean sensitivity was required to have 95 % confidence that the negative trend would be confirmed in the third visual field. In patients with moderate visual field damage a 5.5–7.2 dB change in mean sensitivity was necessary for 95 % confidence [84]. Caution is necessary in extrapolating these results to patients taking 4AQs and tested with 10-2 VFs. In 10-2 VFs, estimates of LTF have not been published.
Many patient-related variables account for the common experience that some patients are unable to cooperate and provide reliable test results. Uncorrected refractive error can reduce sensitivity to a stimulus. A value attributed to this effect is 1.26 dB per diopter of uncorrected refractive error. Media opacification, most commonly from cataract, can reduce sensitivity. Miosis reduces sensitivity, becoming more of a problem when the pupillary diameter is less than 2.5 mm. Sensitivity to the visual stimulus is age-dependent, generally declining with increasing age. For the central visual field tested in chloroquine and hydroxychloroquine screening, a reduction of 0.5 dB per decade of age can be expected. There is a learning curve in visual field testing that affects results (Fig. 8.8). Fatigue, psychological factors, and the clarity and uptake of pretest instructions can influence the results of testing.
Fig. 8.8
This 70-year-old woman with systemic lupus erythematosus (SLE) had been taking 200 mg/day of hydroxychloroquine for 7 years. She was 66 in. tall and weighed 135 lb. Her visual fields showed improvement over the years, presumably as she became more accustomed to testing. Note that on serial fields the locations of high thresholds vanish (blue arrows for left eye, red and green arrows for right eye)
Visual field interpretation involves analysis of global indices and of local abnormalities. The definitions of the important global indices follow.
Mean deviation (MD)—A location-weighted mean of the values of the total deviation plot. It provides an overall index of the height of the hill of vision and is insensitive to localized scotomas. It is a good index for judging the size of diffuse loss of sensitivity as can be caused by cataract. Negative values mean subnormal overall sensitivity. The equation for mean deviation is
$$\mathrm{MD} = \left[\frac{1}{m}\sum_{i=1}^{m}\frac{x_i - z_i}{S_{1i}^{2}}\right] \Bigg/ \left[\frac{1}{m}\sum_{i=1}^{m}\frac{1}{S_{1i}^{2}}\right]$$
where $x_i$ is the measured threshold of test location $i$, $z_i$ is the normal reference threshold at location $i$, $S_{1i}^{2}$ is the variance of the normal field measurement at location $i$, and $m$ is the number of tested locations excluding the blind spot. For the 30-2 visual field, m = 76 (19 points per quadrant). For the 24-2 visual field, m = 56. For the 10-2 visual field, m = 68. Mean deviation of SAP correlates with mfERG R1 ring amplitude in patients taking hydroxychloroquine [72].
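The weighting in the mean deviation can be made concrete with a short numerical sketch. It assumes the usual variance-weighted form, in which each total deviation (measured minus normal threshold) is weighted by the reciprocal of the normal variance at that location. The function name and the four-location data are hypothetical and serve only to illustrate the arithmetic; they are not values from a perimeter's normative database.

```python
from typing import Sequence


def mean_deviation(measured: Sequence[float],
                   normal: Sequence[float],
                   variance: Sequence[float]) -> float:
    """Variance-weighted mean of the total deviations, in dB.

    measured: patient thresholds x_i
    normal:   age-corrected normal thresholds z_i
    variance: normal-population variances S_1i^2 at each location
    """
    if not measured or not (len(measured) == len(normal) == len(variance)):
        raise ValueError("inputs must be non-empty and of equal length")
    weights = [1.0 / v for v in variance]
    weighted_sum = sum(w * (x - z) for w, x, z in zip(weights, measured, normal))
    # The 1/m factors in numerator and denominator cancel.
    return weighted_sum / sum(weights)


# Hypothetical four-location field (a real 10-2 field has 68 locations).
measured = [30.0, 28.0, 31.0, 25.0]
normal = [32.0, 31.0, 32.0, 31.0]
variance = [1.0, 2.0, 1.0, 4.0]
md = mean_deviation(measured, normal, variance)
print(f"MD = {md:.2f} dB")
```

Note that the deepest defect (6 dB at the fourth location) contributes least per decibel because that location has the largest normal variance, which is why MD is a poor index of localized scotomas.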
Pattern standard deviation (PSD)—A statistic that represents the unevenness of the hill of vision. This is an index of localized loss of sensitivity. It is calculated as the location-weighted standard deviation of all threshold values. It is insensitive to overall height of the hill of vision and is sensitive to localized scotomas.
Corrected pattern standard deviation (CPSD)—A statistic based on PSD but with a correction based on the STF.
SAP can be done with a white or a red test object. When of equal size, a white object is seen more easily than a red test object [67, p. 31]. Therefore 10-2 VF testing using the III, red test object is more sensitive but less specific than testing with a III, white test object [46, 74]. The sensitivity and specificity of the 10-2 VF using the III, red test object were 91.3 % and 57.8 %, respectively. For testing with the 10-2 VF using the III, white test object the sensitivity and specificity were 78 % and 84 %, respectively [74]. Clinicians disagree as to which test is preferable for screening, with approximately equal numbers favoring the red and the white targets [75, 85]. To manage the problem of false positives, some have labeled scotomas in 10-2 VF testing as significant only if they are reproducible on repeat testing 2 months later [78].
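One way to compare the red and white test objects on a common scale is through likelihood ratios, which combine sensitivity and specificity into the factor by which a result multiplies the pretest odds of disease. The sketch below applies the standard definitions to the figures quoted above from reference [74]. The function name is hypothetical, and the calculation is an illustration rather than an analysis reported in that study.

```python
from typing import Tuple


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float]:
    """Return (LR+, LR-) for a test with the given sensitivity and specificity."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


# Values quoted in the text for the size III red and white test objects.
red = likelihood_ratios(0.913, 0.578)
white = likelihood_ratios(0.78, 0.84)

for name, (lr_pos, lr_neg) in (("red", red), ("white", white)):
    print(f"{name:5s}  LR+ = {lr_pos:.2f}  LR- = {lr_neg:.2f}")
```

On these figures an abnormal white-target field raises the odds of retinopathy more than an abnormal red-target field, whereas a normal red-target field lowers them more, which matches the description of red as the more sensitive and white as the more specific test.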
In the 10-2 VF, the difference displays (total deviation plot and PSD plot) identify threshold values that deviate more than 4 dB from those of a sample of normal subjects [40]. The defect depth display of the same information shows the size of the scotoma at a point relative to the mean value for normal subjects at that point. A comparison of the two kinds of displays is shown in Fig. 8.9. Many observers have noted that 4AQR is easier to detect on the pattern-deviation plot than on the gray scale display (Fig. 8.10) [75, 78]. Less well recognized is that the gray scale display is more sensitive than the defect display if one uses the 10-2 VF with red III test objects (Fig. 8.11).
Fig. 8.9
Two data displays of the same information after 10-2 visual field testing. (a) The defect depth display shows the size of the difference between the patient’s visual threshold for a given locus and the mean value for normal subjects at that locus. Only values that exceed 4 dB are shown, as departures less than or equal to 4 dB are considered to be within normal limits. In the defect display a positive number means that the patient has a subnormal retinal sensitivity or an abnormally high threshold compared to normals. A negative number means that the patient was more sensitive than normal subjects at the location. In the context of 4AQR screening, negative numbers are ignored. Sometimes they imply a trigger-happy patient who may show a high number of false positives. (b) The same information is shown using the threshold display, total deviation display, and pattern deviation display with probability symbols. The defect depths located at the red-circled and blue-circled locations both are seen in less than 1 % of normal subjects and thus both get the darkest shading in the PSD display (green-circled symbol) even though the values are different at the two locations (13 dB for the red-circled location and 5 dB for the blue-circled location)
Fig. 8.10
10-2 visual fields (10-2 VFs) of a 37-year-old woman with SLE who had taken hydroxychloroquine 400 mg/day for 8 years. She had no renal or liver disease, but did have sickle cell anemia. Her height was 5 ft 6 in., and her actual body weight had been as low as 132 lb, which was less than her ideal body weight (IBW). Her daily dose adjusted for the lesser of ABW and IBW was 6.7 mg/kg. Her cumulative dose was 1,168 g. The 10-2 VFs show that a paracentral scotoma may be better visualized on the pattern deviation plot (red arrow) than the gray scale plot (blue arrow). This is not always the case. In the left eye, the paracentral scotoma is equally apparent on either display (green and orange arrows)
Fig. 8.11
Serial visual fields (VFs) and multifocal electroretinograms (mfERGs) of the right eye of a 69-year-old woman with Wegener’s granulomatosis treated with 400 mg/day of hydroxychloroquine for 17 years. She received a cumulative dose of 2,450 g. She was 63 in. tall, weighed 147 lb, and had an IBW of 135 lb. Her adjusted daily dose based on IBW was 6.52 mg/kg/day. (a) Gray scale display of the 10-2 VF using a red test object of size III. The first indication of retinopathy was on the field of 16 January 2008 when a right paracentral scotoma had developed. By 28 January 2009 there was a clear indication of a paracentral scotoma, which atypically began inferior to fixation (red arrow). Retinopathy should have been recognized and the drug stopped. Instead, the visual field was interpreted as normal and hydroxychloroquine continued for an additional 38 months before retinopathy was recognized on 28 March 2012 and the drug was stopped. Note another practice fraught with pitfalls—the mixing of different types of visual fields. The visual field of 28 March 2012 (blue arrow) was a 24-2 visual field, which is not easily compared to the preceding 10-2 visual fields. Although the drug was stopped on 28 March 2012, the retinopathy progressed with worsening of the ring scotoma on 10-2 VF testing (green arrow). (b) Defect depth displays of serial 10-2 VFs using a III, red test object showing the insensitivity of this display relative to that of the gray scale display. In (a), the retinopathy is easily discernible on the visual field of 28 January 2009, but the defect depth display does not depict the abnormality (red arrow). Instead the retinopathy is not discernible on this display until 28 March 2011. The PSD display of the 24-2 VF on 28 March 2012 shows that hydroxychloroquine retinopathy appears as a central scotoma rather than a ring scotoma as seen in the 10-2 VFs (blue arrow). The defect depths increase in amplitude from 28 March 2011 until 25 October 2012. Negative defect depths are ignored when interpreting 10-2 VFs obtained for 4AQR screening. Only positive defect depths correspond to depressed retinal sensitivity. (c) Serial mfERGs of the right eye of the patient. At the time the drug was stopped the R1/R2 ratio was 4.00 (red-ringed value), but at the follow-up mfERG of 25 October 2012 the R1/R2 ratio had decreased to normal. It reflected the progression of retinopathy centrally with progressive loss of response density in the central circular zone, R1 (orange-arrowed peak in the topographic response density plot)
Interpretation of computerized visual fields in the context of screening for 4AQR is particularly difficult because one wishes to detect early field loss, which is the loss hardest to differentiate from normal physiologic variation [86]. The size of physiologic variability in visual field threshold increases with increasing eccentricity from fixation [81, 86]. Therefore, locus-invariant rules, e.g., that a deviation from the normal threshold greater than 4 dB anywhere in the 10-2 VF is abnormal, do not reflect the complexity of normal threshold variability [86].
There are no consistent criteria for judging an SAP abnormality in 4AQR [30]. Unlike the situation in glaucoma care, there are no longitudinal programs for following 10-2 visual fields, and the sophistication of visual field interpretation is rudimentary. Lyons understates, “It has been difficult to develop clear criteria for abnormality” [30]. Marmor states that “any points of parafoveal loss should be taken seriously; initiate retesting (or testing with the alternative color target) and, if consistent, initiate corroborative testing with objective modalities such as SD-OCT or mfERG” [75]. However, we do not know how often this leads to unnecessary retesting.
Different clinicians have different rules for declaring an abnormality in a 10-2 VF and a change in a visual field (Table 8.5). It is worth recalling that in any 10-2 VF with 68 test points, one can expect 0.05 × 68 = 3.4 points (that is, three or four points, on average) to be labeled with the P < 0.05 probability symbol [71]. Taking Marmor’s dictum literally is therefore likely to lead to many unnecessary follow-up tests. Therefore, a more judicious evaluation of parafoveal loss is advisable. For example, two or three adjacent scotomatous points and scotomas that are unchanging over consecutive tests deserve more weight than a single scotomatous point in a decision to retest.
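The multiple-comparisons arithmetic behind this caution can be sketched in a few lines. The example treats each of the 68 test points as an independent trial with a 5 % chance of being flagged in a normal eye. That independence is a simplifying assumption made here for illustration (neighboring points in a real field are correlated), and the function names are hypothetical.

```python
from math import comb


def expected_flagged(n_points: int, alpha: float) -> float:
    """Expected number of points flagged at level alpha in a normal field."""
    return n_points * alpha


def prob_at_least(k: int, n_points: int, alpha: float) -> float:
    """Binomial probability that at least k of n_points are flagged.

    Assumes independent points, which real visual fields only approximate.
    """
    below_k = sum(comb(n_points, j) * alpha**j * (1.0 - alpha)**(n_points - j)
                  for j in range(k))
    return 1.0 - below_k


N_POINTS, ALPHA = 68, 0.05  # 10-2 VF, P < 0.05 probability symbol
print(f"expected flagged points: {expected_flagged(N_POINTS, ALPHA):.1f}")
for k in (1, 3, 5):
    print(f"P(at least {k} flagged) = {prob_at_least(k, N_POINTS, ALPHA):.3f}")
```

Under these assumptions nearly every normal 10-2 VF contains at least one flagged point, which is why adjacency and reproducibility carry more weight than any single scotomatous point.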
Table 8.5
Criteria for declaring a new scotoma and a change in a scotoma on 10-2 visual field testing
Study | Instrument | Program | Criterion for a new scotoma | Criteria for a scotoma change |
---|---|---|---|---|
Fleck [45] | Friedmann Visual Field Analyzer, Mark 1 | Red target | Failure to see any of the 14 points within 10 deg of fixation in a repeatable manner | NG |
Johnson [87] | Humphrey Visual Field Analyzer | 10-2 VF, red target | Two or more adjacent points of 5 dB loss each or one point of 10 dB loss | NG |
Xiaoyun [40] | Humphrey Visual Field Analyzer | 10-2 VF, white target | Threshold for a point has <1 % chance of being normal | NG |
Mititelu [88] | Humphrey Visual Field Analyzer | 10-2 VF, white target | NG | NG |
Mavrikakis [43] | Rodenstock | Central 25 deg, white target | Presence of two or more adjacent points of 0.8–1.2 log units increased threshold | 1. For a single point scotoma, an increase in threshold of ≥1.4 log units 2. For a scotoma of area ≥2 adjacent points, an increase in threshold of ≥0.8 log units |
Missner [23] | Octopus 2000 30 deg field or Oculus Twinfield version 1.78 | Central 30 deg, white target | No objective standard. Subjective grading by perimetrist: 0 = normal; 1 = mild sensitivity reduction of central field; 2 = relative pericentral scotoma; 3 = absolute pericentral scotoma | NG |
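Objective criteria such as the one attributed to Johnson [87] in Table 8.5 (two or more adjacent points of 5 dB loss each, or one point of 10 dB loss) lend themselves to a mechanical check. The sketch below is one possible reading of that rule, not the authors' implementation. It assumes that "loss" means a defect depth at or above the stated value, and that "adjacent" includes diagonal neighbors on the test grid. The function name and grid layout are hypothetical.

```python
from typing import List, Optional

# Defect depths in dB laid out on the test grid; None marks untested positions.
Grid = List[List[Optional[float]]]


def meets_johnson_criterion(defect_db: Grid,
                            adjacent_db: float = 5.0,
                            single_db: float = 10.0) -> bool:
    """True if one point has >= single_db loss, or two adjacent points
    (including diagonals, by assumption) each have >= adjacent_db loss."""
    rows = len(defect_db)
    for r, row in enumerate(defect_db):
        for c, value in enumerate(row):
            if value is None:
                continue
            if value >= single_db:
                return True
            if value < adjacent_db:
                continue
            # Checking only "forward" neighbors visits every adjacent pair once.
            for dr, dc in ((0, 1), (1, -1), (1, 0), (1, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < len(defect_db[rr]):
                    neighbor = defect_db[rr][cc]
                    if neighbor is not None and neighbor >= adjacent_db:
                        return True
    return False


# Hypothetical 3 x 3 patch of a field with two diagonally adjacent 6 dB defects.
patch: Grid = [[0.0, 6.0, 0.0],
               [0.0, 0.0, 6.0],
               [None, 0.0, 0.0]]
print(meets_johnson_criterion(patch))
```

A stricter reading that counts only horizontal and vertical neighbors would drop the two diagonal offsets from the inner loop.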
The MD and PSD indices provided with the 10-2 VF printout have not been useful in distinguishing patients taking 4AQs from healthy control subjects (Table 8.6) [89]. Comparison of these variables between patients taking 4AQs with and without retinopathy has not been reported, but comparisons across studies suggest that these indices may discriminate the categories, at least for cases of advanced retinopathy.
Table 8.6
Mean deviation and pattern standard deviation in normal subjects and patients taking 4-aminoquinolines
Study | Group | N (patients) | MD (dB) | PSD (dB) |
---|---|---|---|---|
Tanga [39] | Healthy controls | 36 | −1.27 ± 0.89 | 1.04 ± 0.16 |
Tanga [39] | Patients taking HC for <36 months | 26 | −1.58 ± 1.23 | 1.09 ± 0.22 |
Tanga [39] | Patients taking HC for >36 months | 22 | −2.00 ± 1.39 | 1.26 ± 0.42 |
Xiaoyun [40] | Patients with RA taking C | 60 | −1.38 ± 1.29 | 2.02 ± 1.85 |
Xiaoyun [40] | Patients with RA not taking C | 30 | −1.37 ± 1.33 | 2.13 ± 1.91 |
Xiaoyun [40] | Normal subjects | 100 | −1.40 ± 1.35 | 2.09 ± 1.88 |
Lai [72] | Patients taking HC | 13 | −1.31 ± 1.35 | 1.98 ± 2.00 |
Mititelu [88] | Patients with HC retinopathy | 7a | −7.76 ± 4.43 | 7.46 ± 3.44 |
The pattern of scotomas in 4AQR has the shape of a complete or incomplete annulus in 87 % of cases and scattered islands of relative scotomas in 13 % of cases [78]. The eccentricity of the annular scotoma in 4AQR has varied across reports. Depending on the paper, it has been said to occur typically from 2 to 3 deg, 2 to 6 deg, 2 to 8 deg, or 4 to 9 deg from fixation [5, 18, 75, 78, 90, 91]. The different statements likely arise from the different techniques used to test the visual field. The 4–9-deg band arose from tangent screen testing with a red object [5], whereas the 2–6-deg band arose from the 10-2 VF. Many cases of more advanced retinopathy produce scotomas that extend into the mid and far periphery; therefore the location of the scotoma depends on the stage of the retinopathy (see Chap. 6) [92].
Visual field defects often occur without symptoms or fundus changes [8]. The earliest 10-2 visual field changes tend to be superior paracentral scotomas (Fig. 8.12). Occasional exceptions are seen in which a paracentral scotoma develops first inferior to fixation (Fig. 8.11) [78, 93]. As retinopathy progresses, the density of the scotoma increases superiorly until a complete ring scotoma and eventually a central scotoma develops [38, 78].
Fig. 8.12
Images showing that paracentral scotomas on 10-2 visual field testing usually begin superiorly. The patient was a 68-year-old female with SLE who had taken hydroxychloroquine 400 mg/day from 1995 to 2011. She was 5 ft 11 in. tall and weighed 183 lb. Her daily dose adjusted for IBW was 5.1 mg/kg. Her cumulative dose was 2,336 g. An inferior paracentral depigmented arc of RPE atrophy is seen on the fundus photograph (black arrow). A hypoautofluorescent arc corresponds to this lesion on fundus autofluorescence (FAF) imaging (red arrow)
The normal variability of static perimetric threshold values has been determined using the 30-2 program of the Humphrey Visual Field Analyzer, but not for the 10-2 program. Nevertheless, there is some information from the 30-2 VF data that can be applied to the interpretation of 10-2 VFs. The mean foveal threshold for a 50-year-old subject is 38 dB. In 95 normal subjects, the mean parafoveal threshold for points less than 6 deg from the fovea was 32.48 ± SD 0.26 dB [81]. The interindividual variation of the foveal threshold was 1.7 dB. The best point estimate for the interindividual variation of the parafoveal threshold for points less than 6 deg from fixation was 1.83 dB [81]. Mavrikakis and colleagues have suggested, without showing data from patients taking 4AQs, that short-term variation does not exceed 0.5 log units (5 dB) for the 10-2 VF. Therefore, they argue that a change of 0.5 log units (5 dB) in a scotomatous point exceeds measurement variability and raises suspicion of further damage [43].
Some patients have greater than average variability on 10-2 VF testing. In such patients, SD-OCT and mfERG become more important. In a study comparing 39 patients taking 4AQs with a matched control group of 16 patients not taking these drugs, there were no differences in the number of repeatably non-seen points within the central 10 deg using a red target [45].
The inter-test variation within a single individual for tests separated by 2 months (what was called LTF above) was 2.1 dB for the fovea in one study [81]. For the parafoveal points less than 6 deg from fixation, the intraindividual variation was 1.9 ± SD 0.26 dB [81]. There was an age-dependent decline in the perimetric thresholds at each point of the visual field [81]. For the fovea, the threshold value decreased on average by 0.6 dB per decade of age. For parafoveal points less than 6 deg from fixation, the threshold value decreased on average by 0.52 ± SD 0.03 dB per decade of age [81].
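The age effect can be turned into a rough expected value for a given patient. The sketch below anchors the foveal threshold at the 38 dB quoted earlier for a 50-year-old subject and applies the 0.6 dB per decade decline linearly. Linearity across the whole adult age range is an assumption made here for illustration, and the function name is hypothetical.

```python
def expected_threshold_db(age_years: float,
                          reference_db: float = 38.0,
                          reference_age: float = 50.0,
                          decline_per_decade: float = 0.6) -> float:
    """Age-expected threshold, assuming a linear decline from a reference age."""
    decades_from_reference = (age_years - reference_age) / 10.0
    return reference_db - decline_per_decade * decades_from_reference


# Foveal threshold expected at 70 years under the linear assumption.
print(f"{expected_threshold_db(70):.1f} dB")

# Parafoveal decline of 0.52 dB per decade over the same 20 years.
parafoveal_drop = 38.0 - expected_threshold_db(70, decline_per_decade=0.52)
print(f"{parafoveal_drop:.2f} dB lower than at age 50")
```

A 20-year age difference therefore accounts for roughly 1 dB of threshold change, which is smaller than the intraindividual inter-test variation of about 2 dB reported in the same study.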
Many patients who take 4AQs are in an age group in which concomitant morbidity with glaucoma and epiretinal membranes can occur. These comorbidities can affect the 10-2 VF, making interpretation for 4AQR more difficult (Fig. 8.13). Occasionally the visual field changes accompanying 4AQR have been misinterpreted as glaucoma [80]. In such cases, the macular examination and the mfERG can be useful, as glaucoma does not typically affect them.
Fig. 8.13
The effect of an epiretinal membrane on the 10-2 visual field (10-2 VF). The patient was a 64-year-old woman with mixed connective tissue disease. She had taken hydroxychloroquine from 1993 until 2013 on a regimen of alternating day 400 mg and 200 mg dosing. She was 5 ft 3 in. tall and weighed 162 lb. Her adjusted daily dose based on IBW was 4.9 mg/kg. Her cumulative dose was 2,190 g. (a) The spectral domain optical coherence tomogram shows an epiretinal membrane distorting the inner retinal contour. The 10-2 VF of the right eye is normal, but the left eye has a paracentral scotoma (red arrow). (b) The multifocal electroretinogram is normal and symmetric bilaterally. In cases of macular comorbidity, the use of several ancillary testing modalities can dissect the possible effect of the 4-aminoquinoline from the effects of the second condition
Misinterpreting the 10-2 Visual Field
Misinterpretation of 10-2 VFs is common in clinical practice (Fig. 8.14) [78, 79]. In failure analyses, clinicians have been found to overlook characteristic visual field defects of 4AQR in 20–67 % of instances, leading to delay in diagnosis and potentially worsening the prognosis of the patient [78, 79]. The misinterpretation occurs more commonly with 24-2, 30-2, and 40-2 visual fields than with 10-2 visual fields [78, 79, 94]. The 10-2 visual field obtained with the III, red test object yields deeper and broader scotomas with smaller zones of central sparing than the 10-2 VF with the III, white test object [75, 78].
Fig. 8.14
Misinterpreted 10-2 visual fields in a 61-year-old woman taking 400 mg/day of hydroxychloroquine for rheumatoid arthritis. The ophthalmologist screening this patient read all of these fields as normal, but there are suspicious, reproducible paracentral scotomas (red arrows). These are in the zone typical of 4-aminoquinoline retinopathy (4AQR) (2–8 deg), have started superiorly (typical), and are symmetric (typical). The patient needs risk-factor assessment and a secondary ancillary test done to pursue the suspicion of toxicity. Unfortunately, none of the following data had been obtained: height, weight, date of starting therapy, or renal or liver status; nor had spectral domain optical coherence tomography, multifocal electroretinography (mfERG), or FAF, although all of these tests were available in the practice setting
The most common errors made in interpreting SAP for 4AQR screening are:
Failure to recognize patterns of 4AQR on 10-2, 24-2, or 30-2 displays
Not looking at pattern deviation plots but focusing instead on the gray scale plot in 10-2 VFs performed with a III, white test object
Changing back and forth from 10-2 to 30-2 or 24-2 visual fields (Fig. 8.11)
Although the 10-2 VF is the preferred program to use in screening for 4AQR, many clinicians continue to use the 24-2 or 30-2 programs especially in cases where the patient has a concomitant disease such as glaucoma for which the 24-2 or 30-2 programs are more suitable [78, 83, 95]. In one clinic, the proportion of patients studied with the 10-2 VF was 79 %. Twenty-one percent of patients were tested with the 24-2 or 30-2 programs [78]. When the latter are used, 4AQR manifests as central rather than paracentral scotomas because of the compressed display of the visual field [78]. It is important not to switch back and forth from one program to another, because the ability to longitudinally compare visual fields over time is lost (Fig. 8.11) [78]. Authors differ in their preferences regarding the target color. Both red and white are acceptable choices. The main point is to be consistent unless one is specifically seeking a more sensitive (red test object) or specific (white test object) follow-up test because of a suspicion in need of confirmation or refutation [75, 78, 96].
In addition to SAP, blue-yellow perimetry has been proposed as a more sensitive ancillary test, but has not been adopted [97]. Frequency doubling perimetry (FDP) has been used in screening for 4AQR in an attempt to selectively isolate the function of low-redundancy magnocellular ganglion cells, which have been hypothesized to be a population of cells damaged earlier than others in the course of 4AQR [39]. Although the MD of FDP was reduced to a statistically significant extent in patients taking hydroxychloroquine compared to healthy controls, there was no clinical advantage to this modality compared to 10-2 VF testing with a white test object, which had a similar performance relative to controls. FDP has not been adopted as a standard screening test for 4AQR.
The most recent variation of threshold perimetry is microperimetry, in which there is an autotracking feature that approaches the objective of reproducibly placing the stimulus at a particular place in the fundus [98]. Preferential hyperacuity perimetry (PHP) measures visual acuity in the central 14 deg of the retina. The correlation of scotomas by PHP and 10-2 VF testing is variable. The sensitivity and specificity are unknown [51]. A customized form of SAP using red and blue test objects found that chloroquine caused elevated thresholds to red stimuli in a cumulative dose-dependent manner both at fixation and 5 deg eccentric to fixation. Perimetric thresholds were elevated in all patients with cumulative doses greater than 100 g. After cessation of the drug, the thresholds returned to normal over the course of 1 year [99]. None of these variations in SAP has been adopted clinically.
The sensitivity and specificity of 10-2 VF testing for 4AQR have not been well defined, primarily because the test itself is often the gold standard for making the diagnosis. Some evidence of sensitivity and specificity can be inferred from reports not specifically calculating these statistics. In one study of patients taking hydroxychloroquine without retinopathy, 10 % had abnormal 10-2 VFs compared to 11.4 % of rheumatology patient controls not taking 4AQs. This suggests an upper bound on specificity of 10-2 VF testing of 90 %. Patients with hydroxychloroquine retinopathy defined by fundus changes had a 37.5 % prevalence of abnormal 10-2 VFs, suggesting a low sensitivity. However, the definition of an abnormal 10-2 VF was not given, making these inferences tenuous [20]. Another study reported on 39 patients taking 5.5–6.5 mg/kg/day of hydroxychloroquine for a mean of 1.5 years. No paracentral scotomas were observed on 10-2 VF testing using a red stimulus, suggesting a specificity higher than 90 % [100].
Easterbrook and Trope measured the sensitivity and specificity of 10-2 VF testing with red and white test objects against a gold standard defined as an abnormal Amsler grid verified by abnormal Tübinger perimetry in patients taking chloroquine [74]. They reported that 10-2 VF testing with a red test object was 91.3 % sensitive and 57.8 % specific. With 10-2 VF testing using a white test object the sensitivity and specificity were 78 % and 84 %, respectively. Only 19 of the 69 eyes in the study had no retinopathy, weakening the strength of the specificity statistics.
Browning and Lee measured the sensitivity and specificity of 10-2 VF testing in 121 patients taking 4AQs (predominantly hydroxychloroquine). In this study, fields done with red and white test objects were pooled. The gold standard was that the 4AQ was discontinued by the prescribing physician based on the totality of the evidence. Sensitivity and specificity were 85.7 % and 92.5 %, respectively [42]. Only 14 of the 121 eyes in this study had 4AQR, weakening the strength of the sensitivity statistic.
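The effect of a small number of diseased cases on the precision of a sensitivity estimate can be illustrated with a confidence interval. The sketch below uses the Wilson score interval, a common choice for small samples that is swapped in here for illustration and is not a method reported in the study. The counts of 12 of 14 and 99 of 107 are inferred from the quoted percentages and totals, not taken from reference [42], and the function name is hypothetical.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95 %)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denominator = 1.0 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    half_width = z * sqrt(p * (1.0 - p) / n + z**2 / (4 * n**2)) / denominator
    return center - half_width, center + half_width


# Counts inferred from 85.7 % of 14 and 92.5 % of 107 (assumed, not reported).
sens_low, sens_high = wilson_interval(12, 14)
spec_low, spec_high = wilson_interval(99, 107)
print(f"sensitivity 95% CI: {sens_low:.2f} to {sens_high:.2f}")
print(f"specificity 95% CI: {spec_low:.2f} to {spec_high:.2f}")
```

With only 14 affected cases the interval for sensitivity spans roughly 0.60 to 0.96, far wider than the interval for specificity, which is the sense in which the sensitivity statistic is weakened.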
Table 8.7 shows the PPVs and NPVs for a plausible range of prevalences of 4AQR that a clinician might encounter. The NPVs are extremely high and the PPVs are rather low, regardless of the prevalence assumed. In these circumstances, a normal 10-2 VF is useful for confirming the absence of 4AQR. The most that a single abnormal 10-2 VF with a suggestive paracentral scotoma can do is raise the suspicion of 4AQR (increase the posttest probability of 4AQR compared to the pretest probability). By itself, this single test is not dispositive, and should not, by itself, lead to cessation of the 4AQ. With a revised posterior probability of 4AQR, another test should be applied, and if it is also positive, then the second stage posterior probability may indeed be high enough to warrant cessation of the drug.
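The relationship among prevalence, sensitivity, specificity, and the predictive values follows from Bayes' theorem and can be sketched as below. The sensitivity and specificity are those of Browning and Lee quoted above. The prevalences and the second test's sensitivity and specificity are hypothetical values chosen for illustration, and the sequential update assumes that the two tests are conditionally independent given disease status. The function name is also hypothetical.

```python
from typing import Tuple


def predictive_values(prevalence: float,
                      sensitivity: float,
                      specificity: float) -> Tuple[float, float]:
    """Return (PPV, NPV) from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


SENS_VF, SPEC_VF = 0.857, 0.925  # 10-2 VF figures quoted in the text

for prevalence in (0.001, 0.01, 0.05):  # hypothetical prevalences
    ppv, npv = predictive_values(prevalence, SENS_VF, SPEC_VF)
    print(f"prevalence {prevalence:.3f}: PPV = {ppv:.3f}, NPV = {npv:.4f}")

# Sequential testing: the posterior after a positive 10-2 VF becomes the prior
# for a second test. The second test's 0.80 / 0.95 values are illustrative only.
ppv_first, _ = predictive_values(0.01, SENS_VF, SPEC_VF)
ppv_second, _ = predictive_values(ppv_first, 0.80, 0.95)
print(f"after two positive tests: {ppv_second:.2f}")
```

At an assumed prevalence of 1 %, a single abnormal field leaves the probability of 4AQR near 10 %, whereas a second concordant positive test raises it above one half under these assumptions.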
Table 8.7
Positive and negative predictive values for 10-2 visual field testing over a plausible range of assumed prevalences