The accepted TFOS DEWS II definition (2017) of dry eye includes symptoms as a required element of a diagnosis of dry eye disease, as did the 2007 definition. However, symptoms alone cannot make a diagnosis of dry eye disease. Standardization is important for diagnosis, and hence the 2017 TFOS DEWS II diagnostic methodology report reviewed the many available questionnaires and recommended the use of the Ocular Surface Disease Index (OSDI) or the Dry Eye Questionnaire (DEQ-5). “The consensus view of the committee was to use the OSDI due to its strong establishment in the field or the DEQ-5 due to its short length and discriminative ability”. In addition, the committee noted that the continuous nature of visual analogue scales makes them attractive for clinical trials compared to discrete Likert-based question ratings.
Design Considerations
The Food and Drug Administration (FDA) of the United States published a report in 2009 highlighting the critical features for the development of Patient Reported Outcome (PRO) measures. Many dry eye questionnaires were developed before this report was available, but it still gives a useful benchmark as to how robust these questionnaires are.
The conceptual framework explicitly defines the concepts measured by the instrument and how the items (questions) interrelate to give the scores produced by a PRO instrument. The instrument may contain several subconcepts that contribute to the overall measurement. Hence, for dry eye it is common to assess severity and frequency of several dry eye type symptoms that combine (usually just summed, with none attributed a greater weight than others if they are presumed to independently contribute) to an overall score.
Content validity is evidence that the instrument measures what it is intended to measure, i.e., it has been validated to differentiate different severities of a certain medical condition in a population of interest.
Other considerations include:

- Number of items/respondent burden
  - The longer the questionnaire, the greater the burden on respondents and the more likely it is that questions will be missed or not given due attention, affecting the responses.
- Administration mode
  - Contact with a clinician when completing a questionnaire (such as asking the questions by telephone or face to face) can artificially reduce the symptoms/difficulty a patient reports. However, completing it independently could be more burdensome if vision is impaired or the eyes are uncomfortable.
- Response options
  - Free text is useful to collect additional information or for respondents to clarify their approach to scoring a question, but this is more qualitative.
  - Likert/rating scales allow respondents to identify their intensity of feeling by selecting a number from a range between two or more “anchor” descriptors. They rely on the separation between each number being equal, but the scale is not continuous. A larger range allows more sensitive increments, but may not improve the accuracy of recording by respondents. There should be a wide enough range and suitable “anchor” descriptors to ensure that ceiling/floor effects do not occur (when a normal distribution is skewed because too many responses fall at the top or bottom of the scale, respectively). An even number of response options forces the respondent away from a neutral response if using an agree/disagree style question.
  - Visual analogue scales are a line with anchor descriptors at each end, on which respondents identify their intensity of feeling by placing a mark, which can then be measured as a proportion of the complete line length (Fig. 1.1). This has the advantage of being a continuous scale (no limited number of options), and in rating contact lens handling this approach has been shown to be more repeatable and responsive than a Likert scale.
- Instructions
  - The instructions to the respondent on how to complete the questionnaire, including what to do when a question is not relevant to their situation, must be clear. Of particular importance is the period over which they are reflecting, which generally ranges from the past month to their feelings at the time of completion. A short recall period is susceptible to variability due to environmental conditions and disease fluctuations, whereas a long recall period is susceptible to memory limitations.
- Format
  - There is limited research on whether the formatting (layout) of the questions impacts completion. While a comparison of digital versus paper-based questionnaires has not yet been conducted in eye care, a metaanalysis of electronic pain-related data capture methods demonstrated that digitally completed PRO questionnaires are comparable with conventional paper methods in terms of score equivalence, data completeness, ease, efficiency, and acceptability.
- Language
  - The language should generally be at about an 11- to 12-year-old reading level (grade 6 in the USA).
- Translation or cultural adaptation availability
  - It is important that a respondent can understand the language in the questions and the intended meaning. Translation to another language should be done by a native dual-language speaker. The questionnaire should then be back-translated by a second native speaker and rechecked to ensure the question meaning has been maintained.
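The proportional scoring of a visual analogue scale described above can be sketched as follows (a minimal illustration; the 100 mm line length and the function name are assumptions, not part of any published instrument):

```python
def vas_score(mark_mm, line_mm=100.0):
    """Convert a mark on a visual analogue scale to a 0-100 score,
    expressed as a proportion of the complete line length."""
    if not 0 <= mark_mm <= line_mm:
        raise ValueError("mark must lie on the line")
    return 100.0 * mark_mm / line_mm

# A mark 63 mm along a 100 mm line scores 63.0
```

Because the mark can fall anywhere on the line, the resulting score is continuous rather than restricted to a fixed set of response options.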
How Should the Questions Be Selected?
The initial PRO questions are usually generated from a literature review of previously reported symptoms/difficulties and the items used by previous questionnaires. Focus groups or interviews with patients, clinicians, and family members can also be used. These should continue until no new relevant or important information emerges (saturation). It is debatable whether a dry eye questionnaire should include items on health-related quality of life, as while the impact on the patient is important, it is difficult to differentiate activities affected by dry eye from those affected by other health issues. Another relevant issue that is generally not covered is the burden of treatment strategies. These elements are best covered, if relevant, by separate questionnaires, so each one remains unidimensional. The selected questions should be reviewed for relevance (do they reflect the experience of someone with the disease/condition) and coverage (does the questionnaire cover all the likely related experiences) by the target population (including a broad demographic and range of severity) and by experts in the field. These individuals perform the initial assessment of clarity (cognitive interviews to check their understanding of what the questions mean) and readability. The questions should be relevant to most patients, as “non-applicable” options cause problems for scoring.
How Should the Questionnaire Be Validated?
Following question selection and refinement, the questionnaire prototype should be trialled in a target population with a wide range of the “condition” being tested. Additional tests hypothesized to assess the same concept, or to stratify this population by the severity of the “condition”, should be conducted to test the construct validity of the questionnaire prototype, either as an association/correlation (discriminant and convergent validity) or as the ability to statistically separate the identified severity levels (group validity). This might include an overall single question about how a patient feels. If there is an accepted gold standard for the same concept, the extent to which this correlates with the questionnaire is the “criterion validity.”
The choice of population induces a level of bias as the inclusion and exclusion criteria used to choose the test participants will affect the apparent effectiveness of the questionnaire. The questionnaire prototype is also tested a second time in a subgroup to assess its repeatability. If this is too soon after the initial testing, then it is likely to be influenced by patient recall. If the repeat completion is too long after the initial testing, variation in the “condition” with factors such as the environment can make the questionnaire seem unreliable.
Statistical Analysis
Item Reduction and Scale Optimization
Once the data have been collected, statistical analysis is applied to assess whether the questionnaire assesses a single trait, whether the items it uses are relevant to the majority of the population, whether the items discriminate between individuals, and whether the items are reliable. Classical Test Theory was based upon the assumption that the amount of an attribute is characterized by the “raw” questionnaire score. However, it has been shown that raw scores derived from ordinal data cannot be used as an accurate measurement of an attribute; it also cannot be assumed that an attribute is normally distributed within a population, nor that all tasks in question are of equal difficulty. The more recent Item Response Theory compares individuals to an independent standard rather than to each other. It uses a mathematical model that describes the relationship between the level of a latent trait for a particular person and the probability of that person selecting a particular response to an item. Rasch Analysis is a form of Item Response Theory based on Poisson models and the principles that the score indicates the severity (order), that the raw score of a questionnaire can be used for the measurement of an attribute (additivity), and that only a single attribute is measured by the questionnaire (unidimensionality). Questionnaires that have been developed using Rasch Analysis will therefore be independent of the sample used to obtain the initial responses, allowing subsequent use to measure the attribute in any population without variation of the psychometric properties. It should be noted that a variety of statistical information is produced in Rasch Analysis, but it is up to the developers to select the most appropriate outputs for item reduction.
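The item response relationship described above can be illustrated with the simplest dichotomous Rasch model, in which the probability of endorsing an item depends only on the difference between person ability and item difficulty, both expressed on a shared logit scale (a minimal sketch; the function name and example values are assumptions, not from any published questionnaire):

```python
import math

def rasch_probability(theta, difficulty):
    """Dichotomous Rasch model: probability of endorsing an item,
    a logistic function of ability (theta) minus item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# When ability equals difficulty, the endorsement probability is exactly 0.5;
# a more able person is always more likely to endorse a given item.
```

This separability of person and item parameters is what allows Rasch-scaled questionnaires to behave consistently across samples.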
All response scales must be scored in the same direction, i.e., a larger number should reflect an increasing amount of the condition, or vice versa. The statistics indicate whether items are too predictable or too random. Unused or rarely used response options are removed from the scale or combined with adjacent response options. Once the response-scale function has been optimized, the procedure for item reduction involves item fit statistics, item targeting, frequency of endorsement, and tests of normality (skew and kurtosis) which have specific requirements that need to be met in order to indicate conformance with the Rasch model. If the questionnaire does not meet all of the criteria, the item that fails the most criteria is eliminated from the questionnaire, and all of the statistics and criteria are recalculated and reassessed. This is repeated in an iterative process until all of the remaining items meet all of the criteria or until the removal of an item causes the separation index to fall below a value of 2, which indicates a loss in questionnaire precision.
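Aligning all response scales to a common direction, as described above, is a simple arithmetic flip (a hedged sketch; the helper name is hypothetical):

```python
def reverse_score(response, scale_min, scale_max):
    """Flip a response so every item runs in the same direction,
    e.g. on a 0-4 scale a rating of 1 becomes 3."""
    return scale_max + scale_min - response

# Reversed items are aligned like this before any summing or Rasch scaling
```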
Psychometric Properties of the Final Questionnaire
Once the questionnaire has been finalized through statistical refinement of the questions and the response options, its ability to detect change (equally sensitive to gains and losses in the measurement concept and to change across the entire range expected for the population of interest) should be determined. This would typically involve use prior to and after a treatment strategy known to be effective. The ability of an instrument to detect change in a certain population demographic influences the sample size for evaluating the effectiveness of treatments in clinical trials.
The psychometric properties of a questionnaire refer to its reliability and validity :
- reliability is defined as the extent to which measurements are repeatable, stable, and free from error; it is usually expressed as a ratio of the variability in observed questionnaire scores to the total variability including error, generating a coefficient between 0 (unreliable) and 1 (perfectly reliable).
  - internal consistency — a measure of the interrelationship between items in a questionnaire. Options include item-total correlations (the observed scores for each item correlated in turn with the total questionnaire score excluding that particular item) or “Cronbach's alpha” (effectively the average of the correlations obtained from all possible ways of splitting the items into two halves). Acceptable values are typically considered to be >0.70 (lower values suggest the PRO is not measuring a single trait, i.e., it is multidimensional) and <0.90 (greater values suggest items are redundant).
  - test–retest reliability — the ability of the questionnaire to produce repeatable responses when completed after a time interval. The Intraclass Correlation Coefficient measures concordance (agreement) rather than just correlation (association), which can be strong even if the scores are systematically raised or depressed on the retest. A value of ≥0.8 is typically desired for good questionnaire test–retest reliability, although a value of at least 0.6 is considered acceptable.
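The internal-consistency statistic described above can be sketched directly from its variance-based definition (a hedged illustration; the function name and toy data are assumptions, not drawn from any published questionnaire):

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha from a respondents-by-items score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(item_scores[0])  # number of items

    def variance(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Perfectly interrelated items give alpha = 1.0; weakly related items give less
```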
- validity is a measure of how well the questionnaire measures what it is supposed to measure (although a perfect correlation with another measure is not expected, as this would indicate the questionnaire is redundant). There are five specific areas that together encompass the meaning of validity. Two are relevant to the development stage:
  - face validity — whether the questionnaire seems, to a person with the condition/disease or to an expert, to be asking appropriate questions.
  - content validity — a judgment on whether the coverage and content of the items are appropriate, in terms of being applicable to all people within the intended target population.

The other three assessments of validity are typically made after the questionnaire has been statistically analyzed:

- construct validity — an assessment of whether questionnaire scores are related to other variables or attributes as would be expected in theory. The process typically consists of a two-phased approach:
  - Representational validity requires comparison and correlation between the questionnaire scores and other similar measures of the attribute of interest (convergent validity), or comparison with measures known not to tap the attribute of interest (divergent or discriminant validity), confirming that the questionnaire does not measure what it is not supposed to measure.
  - Elaborative validity confirms the need for the existence of the questionnaire, by showing that it can be used in some way, most often to monitor change (“discriminative validity”).
- criterion validity — similar to elaborative validity, but a demonstration that the questionnaire can discriminate between groups of people. This usually involves the use of a Receiver Operating Characteristic (ROC) curve (a plot of the sensitivity of the questionnaire against 100 minus its specificity; Fig. 1.2). Sensitivity is the true-positive rate, i.e., the proportion of people with the condition correctly categorized by the questionnaire, while specificity is the true-negative rate, i.e., the proportion of “normal” people correctly identified as “normal.” The area under the ROC curve is an index of discriminative ability: an area of 0.5 (the diagonal line) indicates that a test has no discriminative ability, and the closer the curve is to the top left corner of the plot, the greater the area and therefore the discriminative ability (an area of 1.0 indicating perfect discrimination). This relies on there being an independent gold standard for comparison, which is usually not the case.
- factorial validity — confirms the number of attributes (subscales) the questionnaire measures. Factor analysis or principal component analysis allows the number of variables or factors in a questionnaire to be identified, as well as concurrently describing the proportion of the variation that each accounts for.
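The area under the ROC curve used for criterion validity can be computed without plotting, as the probability that a randomly chosen affected respondent outscores a randomly chosen normal one (a minimal sketch; the function name and data are illustrative assumptions):

```python
def roc_auc(scores_disease, scores_normal):
    """Area under the ROC curve by pairwise comparison: the probability
    that a diseased respondent scores higher than a normal respondent,
    with ties counting half (equivalent to the Mann-Whitney U statistic)."""
    wins = 0.0
    for d in scores_disease:
        for n in scores_normal:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(scores_disease) * len(scores_normal))

# Perfect separation gives 1.0; identical score distributions give 0.5
```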
Dry Eye Questionnaires
The development of the common dry eye questionnaires is presented in Table 1.1, along with the design elements and analysis applied to refine each questionnaire. Only those questionnaires that are well established, with attempts at detailed psychometric evaluation, have been summarized. For example, questionnaires such as the Women's Health Study questionnaire (a single question on dry eye symptom frequency, plus a check of whether the individual has ever been diagnosed with dry eye, with no psychometric design evaluation) were excluded.
In the table, Items through Participants describe design elements, and Robust Psychometric Testing through Sensitivity Testing describe the refinement analysis.

| Name/Authors | Items | Recall Period | Dry Eye Sufferers | Literature Review | Saturation of Questions | Scale Type | Participants | Robust Psychometric Testing | Repeatability Period | Sensitivity Testing | Scoring |
|---|---|---|---|---|---|---|---|---|---|---|---|
| McMonnies and Ho 1986 | 12 | Not stated | x | x | x | Variable Likerts | 68 | x | x | Rheumatological dry eye patients | Arbitrary weighting of question items; 0–33 or 0–45 |
| SANDE (Schaumberg et al. 2007) | 2 | 2 months | x | x | x | VAS | 26–52 | x | ICC 1–2 days ∼0.6–0.8; 2 months ∼0.40 | None at time | VAS scores multiplied and square-rooted; 0–100 |
| Ocular Comfort Index | 12 | 1 week | Unknown number | Stated but no details | x | 7-point rating | 150–452 | ✓ | 14 ± 7 days, n = 100 | OSDI n = 337; ocular lubricants n = 150 | Rasch scaled; 0–100 |
| Subjective Evaluation of Symptom of Dryness | 3 | Not stated | x | x | x | 5-point Likert | 97 | x | x | x | Categorized or added; 0–12 |
| Standard Patient Evaluation of Eye Dryness Questionnaire (SPEED) | 8 | Present, 72 h, and 3 weeks | x | x | x | 4–5-point Likert | 100 | x | x | Lid wiper presence | Summed for 4 frequency (0–4) and severity (0–5) symptoms; 0–28 |
| Dry Eye Questionnaire (5 item) | 5 | Typical day | x | x | x | 5-point (frequency) or 6-point (severity) Likert | 50 | x | x | Control versus dry eye; non-Sjögren's versus Sjögren's patients | Added; 0–22 |
| Ocular Surface Disease Index (OSDI) | 12 | Last week | Over 400 | x | x | 5-point Likert | 139 | x (internal consistency calculated) | 2 weeks, n = 76 | 109 dry eye versus 30 normals | Added, multiplied by 25, and divided by number of questions answered; 0–100 |
| Revised Ocular Surface Disease Index (OSDI-6) | 6 | Past month | Based on original | x | x | 5-point Likert | 264 | ✓ | 1 day, n = 50 | 264 dry eye versus normals | Added; 0–24 |
| Texas Eye Research and Technology Center Dry Eye Questionnaire (TERTC-DEQ) | 28 | Past week and past month | “Extensive focus groups” | Stated but no details | x | 5-point rating | 89 | x (internal consistency calculated) | 8–12 weeks, n = 13 | 37 dry eye versus 52 normals | Scored 0–94 |
| University of North Carolina Dry Eye Management Scale (UNC DEMS) | 1 | Past week | Stated, but not accessible | Stated, but not accessible | x | 10-point Likert | 66 | x | 1 week, n = 56 | 46 dry eye versus 20 normals | Measured 0–100 |
| Impact of Dry Eye in Everyday Life (IDEEL) | 57 | Past 2 weeks | 6 focus groups, n = 45 | x | ✓ | Mainly 4- or 5-point Likert | 210 | ✓ | 2 weeks, n = 210 | 162 dry eye versus 48 normals | 0–100 for each of three dimensions |
| Dry Eye-Related Quality-of-Life Score (DEQS) | 15 | Past week | 20 | From 3 previous questionnaires | x | 5-point (frequency) or 4-point (severity) Likert | 142 | ✓ (although only on 24 of original 45 items) | 2 weeks | 203 dry eye versus 21 normals, plus punctal plug treatment n = 10 | Added, multiplied by 25, and divided by number of questions answered; 0–100 |
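As an illustration of the OSDI scoring rule listed above (answered items summed, multiplied by 25, and divided by the number of questions answered), a minimal sketch with a hypothetical function name, where unanswered items are passed as None:

```python
def osdi_score(responses):
    """OSDI total from 12 items each rated 0-4 (None = not answered).
    Score = (sum of answered items * 25) / number answered, giving 0-100."""
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("no items answered")
    return sum(answered) * 25 / len(answered)

# Dividing by the number answered keeps the 0-100 range
# even when some questions are skipped as not applicable
```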