Keywords
Perceptual evaluation of voice; Voice evaluation; Perceptual features of children’s voices; Children’s voice disorders; Voice disorders
Introduction
A voice disorder, by definition, exists when the voice of an individual differs from the voices of others of similar age, gender, geographic location, and cultural group in terms of pitch, quality, or loudness [1]. As such, the clinician’s ear is the gold standard for identifying, quantifying, and describing a voice disorder. Clinicians use perceptual evaluation not only in the initial and subsequent evaluations but also as an ongoing assessment of the effectiveness of therapy throughout the therapeutic process. Perceptual ratings are by nature subjective and dependent on the culture, location, age, and gender of the speaker as well as the listener.
There are additional challenges when applying perceptual ratings to children, as children may not be as consistent in their productions or as willing to participate in a task as adults, and listeners may lack a clear sense of what is “normal” in children’s voices. Perceptual ratings of voice can be completed based on sustained phonation, repeated sentences, reading, and connected speech. When the stimuli are standardized, this allows for more consistent ratings across speakers, raters, and serial visits. Rating systems in use include descriptors such as “mild, moderate, and severe,” equal appearing interval scales, visual analog scales, direct magnitude estimation, or sort and rate systems, to name a few. The rating systems most commonly used in clinical evaluation of voice disorders are the grade, roughness, breathiness, asthenia, and strain (GRBAS) scale [2] and the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) [3, 4].
Perceptual Features of Children’s Voices
When you listen to a child’s voice, you can tell that it is a child and not an adult speaking. Why is this? There are perceptual differences between children’s and adults’ voices. The most obvious is pitch, with mean speaking fundamental frequency declining with age in both boys and girls [5]. However, pitch is not the only factor that makes a child sound like a child. As discussed in Chap. 6, children have shorter vocal folds than adults, resulting in higher pitch [6]. They have incomplete differentiation of the vocal fold layered structure, although the impact this has on perceptual characteristics of voice is not well understood [7]. The larynx is also positioned more superiorly, resulting in a shorter resonating chamber and different formant frequencies than in adults, which should also result in perceptual differences. Time-based perturbation measures (jitter and shimmer) vary with age and are higher in children than in adults [8]. Children use a higher percentage of their vital capacity during speech and have higher tracheal pressures during speech than adults [9], although it is not clear how this translates to auditory-perceptual characteristics in the voice. Lopes et al. [10] found a correlation between shimmer and listener ratings of breathiness, roughness, and overall grade of severity of dysphonia. Additionally, normal glottic configuration in children features a posterior gap [11], which may be conjectured to produce a breathier voice quality even in children with normal voices.
Limitations of the Perceptual Evaluation
As important as perceptual evaluation is, it has limitations, including variability in the child’s voice, inconsistency or unreliability of perceptual ratings of any kind, and differences in definition of aspects of voice and in definitions of severity. Throughout history, hundreds of words have been used to describe voice: “nasal, hoarse, squeaky, creaky, harsh, rough, breathy, airy, rich, resonant,” to name only a few [2, 3, 12, 13]. These have been quantified in different ways, which can be more or less precise, “a little rough” or “9/10 in harshness” or “4/5 loudness.” Without shared scales, terminology, and understandings of what is normal and how to quantify severity, these terms are as useful as describing a color by saying “that bluish color that is kind of like green and yellow too.” As such, perceptual ratings are necessary but troublesome when attempting to make them useful and reliable across clinics, patients, clinicians, and disorders. For perceptual ratings to be reliable across clinicians, institutions, and patients, there are several assumptions that must be accepted. These assumptions apply to all perceptual ratings of speech and are summarized by Kent [14]. First, we must have shared vocabulary and definitions of vocal characteristics such as “hoarseness,” “breathiness,” “roughness,” “strain,” and other labels. Second, we have to use the same descriptors and scale values. Third, we need to be able to reliably isolate perceptual features, and fourth, the differences in ratings between judges need to be smaller than the differences needed to quantify severity of disorder or change in status [14]. Unfortunately, these assumptions are not always true. Until relatively recently, there was a lack of consistency in the terminology used to describe disordered voice, although the adoption of the CAPE-V in clinical settings lays out a consistent scale and terminology [15]. 
Studies have shown that perceptual features of voice are not reliably isolated by clinicians – for example, judgements of pitch have been found to be influenced by roughness [16]. Perceptual evaluation of voice quality can be influenced by articulatory context, visual stimuli, and even information about the medical diagnosis [14, 16–20]. It is not clear if this is more challenging in children than adults. We do not have any clear definitions of what constitutes a significant difference to quantify severity of disorder or to validate a change in status.
Few studies have examined inter- or intra-rater reliability in evaluation of pediatric voices. Kelchner et al. [21] found moderate to strong inter-rater agreement in rating of overall severity, roughness, and breathiness using the CAPE-V, and strong intra-rater reliability, but poor inter-rater reliability in ratings of strain in a population of children status post-laryngotracheal reconstruction [22].
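Kent’s fourth assumption, that differences between judges must be smaller than the difference one is trying to detect, can be made concrete with a small calculation. The following Python sketch is only an illustration: the function names and the ratings are hypothetical, the scores are invented 0–100 values for four recordings, and the mean absolute difference is a deliberately simple stand-in for the formal reliability statistics used in published studies.

```python
from statistics import mean


def mean_absolute_difference(rater_a, rater_b):
    """Average size of the disagreement between two raters on the same samples."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same, non-empty set of samples")
    return mean(abs(a - b) for a, b in zip(rater_a, rater_b))


def change_exceeds_disagreement(before, after, rater_a, rater_b):
    """True if a change in score is larger than the typical rater disagreement."""
    return abs(after - before) > mean_absolute_difference(rater_a, rater_b)


# Hypothetical overall-severity ratings (0-100) of four recordings by two raters.
rater_a = [20, 45, 60, 72]
rater_b = [28, 40, 70, 65]

disagreement = mean_absolute_difference(rater_a, rater_b)

# A 5-point change is smaller than the typical disagreement; a 20-point change is not.
small_change = change_exceeds_disagreement(50, 45, rater_a, rater_b)
large_change = change_exceeds_disagreement(50, 30, rater_a, rater_b)
```

With these invented numbers, a change smaller than the raters’ typical disagreement could not be distinguished from rater variability alone, which is the practical meaning of the fourth assumption.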
Listener training, use of anchors, and using a rank and sort method of rating voices have been shown to improve inter-rater reliability. Rater training with anchors was found to increase inter-rater reliability in evaluation of dysphonic voices, and training with synthesized anchors was more effective than training with natural voices [23]. Listener training can certainly be done in the clinical setting, but anchors and more involved methods of rank and sort rating are typically more feasible in research.
Clinically, perceptual evaluation of children can be challenging because they may not be as consistent in their productions as adults, they may have difficulty either reading or producing standard stimuli, it may be challenging to get a representative sample of connected speech, and they may simply choose not to do as they are asked. It is ideal to have a set of stimuli that is consistent across patients and across evaluations with the same patient. If this cannot be accomplished, we recommend attempting to get at least a representative sample of conversational speech, or speech in play.
Perceptual Characteristics of Voice
Historically, a wide variety of terms have been used in perceptual analysis of voice. When looking at methods of perceptual evaluation in the literature, several terms tend to be the most frequently used and easiest to define and in most cases can be partially linked with a physical or acoustic correlate [3, 18]. However, descriptors of vocal quality are multidimensional in nature, and it is extremely difficult, if not impossible, to completely isolate individual parameters of voice quality. In spite of these challenges, perceptual evaluation remains a cornerstone of voice evaluation.
Two of the most basic parameters used in perceptual voice analysis are pitch and loudness. Pitch refers to the perceived highness or lowness of the voice. It is the perceptual correlate of fundamental frequency as measured in Hertz, which as discussed earlier varies across age and gender. Loudness, on the other hand, is the perceptual correlate of sound intensity, measured in decibels, and is typically expressed as a range from too soft to appropriate to too loud. Three other parameters – roughness, breathiness, and strain – make up the other most commonly used terms in perceptual voice analysis. Roughness refers to the degree to which the voice is smooth/clear versus gravelly. In general terms, it correlates with the periodicity versus irregularity of vocal fold vibration. Breathiness refers to the degree to which excess airflow or “hiss” is detected in a person’s speaking voice and roughly correlates with the degree of glottal competence versus incompetence; in other words, it reflects whether extra air is “leaking” during voice production. Strain relates to the perception of increased muscle effort or “pushing” associated with voice production. It can be thought of as the perceptual correlate of pressed phonation, or hyperadduction of the vocal folds.
A variety of additional descriptors can also be used to help describe a person’s voice. Most commonly, these include asthenia (weakness), glottal fry (low pitch pulsations of voicing or “creaking” – can be perceptually acceptable in certain age/gender/culture groups), tremor (regular oscillations in pitch), and diplophonia (perception of two pitches being produced simultaneously). Other descriptors can also be used, such as presence of pitch breaks or “cracks,” aphonic breaks or any periods of aphonia or near-total aphonia, and descriptions regarding vocal register (chest/modal register versus head voice or falsetto). These are not always reported with a measurable rating scale but may also aid the overall description of a person’s voice as perceived by the clinician.
As a final note, statements regarding oral/nasal resonance balance are sometimes included in an overall perceptual description of a person’s speech, but these are phenomena of the resonance system and not voice, and thus a detailed discussion of this is beyond the scope of this chapter. Briefly, hypernasality refers to excess nasal resonance (relating typically to velopharyngeal incompetence), and hyponasality refers to too little nasal resonance (resulting in a person sounding “stuffy” or congested). Similarly, articulation and language skills are of course separate from voice quality. In a pediatric population, however, even if the primary focus is on voice, these are important parameters to consider as part of an overall evaluation and may warrant further formal testing procedures if they appear to be problematic.
Standard Methods of Perceptual Evaluation of Voice in Children
There have been a wide variety of rating tools used by clinicians and described in the literature to help formalize and standardize the perceptual analysis. The two most commonly used, particularly in the United States, are the GRBAS and the Consensus Auditory-Perceptual Evaluation of Voice, or CAPE-V [2–4]. The GRBAS was first described by Hirano in 1981 [2] and is an acronym for five parameters to be rated: grade, roughness, breathiness, asthenia, and strain. For each of these parameters, the clinician rates the patient’s voice on an equal appearing interval scale from 0 to 3, 0 being normal, 1 mild, 2 moderate, and 3 severe. This scale is easy to use and relatively quick but can limit the ability to reflect change over time. For example, if a patient is rated as a 3 or “severe” for any of the parameters, but becomes worse at some point, there is no way to reflect this within the 4-point scale. Additionally, there are no standardized stimuli for administration of this scale, and perceptual ratings can change based on context, length of utterance, vowel, and whether the speaker is sustaining vowels, reading, repeating, or speaking spontaneously [14, 15, 19, 20].
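The structure of the GRBAS scale, and the ceiling effect just described, can be illustrated with a short Python sketch. The class and field names are hypothetical and the ratings are invented for illustration; only the five parameters and the 0–3 labels come from the description above.

```python
from dataclasses import dataclass

# Labels for the equal appearing interval scale described in the text.
GRBAS_LABELS = {0: "normal", 1: "mild", 2: "moderate", 3: "severe"}


@dataclass(frozen=True)
class GrbasRating:
    """One clinician's GRBAS rating of one voice sample (hypothetical record type)."""
    grade: int
    roughness: int
    breathiness: int
    asthenia: int
    strain: int

    def __post_init__(self):
        # Reject any value outside the 0-3 scale.
        for name, value in vars(self).items():
            if value not in GRBAS_LABELS:
                raise ValueError(f"{name} must be 0, 1, 2, or 3, got {value!r}")

    def describe(self):
        """Map each numeric rating to its verbal label."""
        return {name: GRBAS_LABELS[value] for name, value in vars(self).items()}


# Invented example: grade is already at the top of the scale at the first visit.
first_visit = GrbasRating(grade=3, roughness=3, breathiness=2, asthenia=1, strain=3)

# If the voice later sounds worse, the highest available grade is still 3,
# so the two visits receive the same grade and the decline is not captured.
later_visit = GrbasRating(grade=3, roughness=3, breathiness=3, asthenia=1, strain=3)
grade_changed = later_visit.grade != first_visit.grade
```

Here `grade_changed` is false even though the later voice is assumed to be perceptually worse, which is the limitation noted for patients already rated as severe.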
To provide a standardized way of evaluating voices perceptually, work on the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) began in 2002 with a consensus meeting of speech-language pathologists, speech scientists, experts in psychoacoustics, and experts in perception [3]. After extensive discussion and development work, the CAPE-V instrument was developed. The full form is available for download for clinical use through the American Speech-Language-Hearing Association website [24]. Rather than an ordinal scale, it implements a visual analog scale, with 100 mm lines for each parameter to be rated. Clinicians make a hash mark on the line to indicate their judgment, and a ruler is used to measure in mm where this mark falls from 0 to 100, with 0 indicating normal and higher values indicating a more severely disordered voice. The parameters to be rated include overall severity, roughness, breathiness, strain, pitch, and loudness. Several blank lines are also included so that other parameters can be rated if desired (e.g., tremor). The clinician can also indicate with each parameter whether it is consistent or inconsistent and note whether resonance is normal or not. While the results are reported as a number out of 100, there are also general visual guidelines indicating where mild, moderate, and severe fall on the scale.
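As a rough illustration of how such visual analog ratings might be recorded electronically, the following Python sketch stores one measured mark per parameter and reports it as a number out of 100. It is not part of the CAPE-V itself: the type and function names are hypothetical, the example values are invented, and no severity cut points are encoded because the text gives only general visual guidelines.

```python
from dataclasses import dataclass

# The six parameters the text lists for the form.
STANDARD_PARAMETERS = (
    "overall severity", "roughness", "breathiness", "strain", "pitch", "loudness",
)
LINE_LENGTH_MM = 100.0


@dataclass(frozen=True)
class VasMark:
    """One hash mark on a 100 mm line (hypothetical record type)."""
    distance_mm: float       # measured with a ruler from the "normal" (0) end
    consistent: bool = True  # whether the feature was consistently present

    def __post_init__(self):
        if not 0.0 <= self.distance_mm <= LINE_LENGTH_MM:
            raise ValueError(f"mark must lie between 0 and 100 mm, got {self.distance_mm}")

    def report(self) -> str:
        label = "consistent" if self.consistent else "inconsistent"
        return f"{round(self.distance_mm)}/100 ({label})"


def summarize_form(marks: dict) -> dict:
    """Check that all six standard parameters are rated, then report every mark.

    Extra keys (the form's blank lines, e.g. tremor) are allowed.
    """
    missing = [p for p in STANDARD_PARAMETERS if p not in marks]
    if missing:
        raise ValueError(f"missing ratings for: {', '.join(missing)}")
    return {name: mark.report() for name, mark in marks.items()}


# Invented example ratings, including one optional parameter.
example = {
    "overall severity": VasMark(34.0),
    "roughness": VasMark(41.0),
    "breathiness": VasMark(18.0),
    "strain": VasMark(27.0, consistent=False),
    "pitch": VasMark(5.0),
    "loudness": VasMark(0.0),
    "tremor": VasMark(12.0, consistent=False),
}
summary = summarize_form(example)
```

In this sketch `summary["overall severity"]` is the string `34/100 (consistent)`; keeping the raw millimeter value makes it possible to compare ratings across serial visits.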
While the GRBAS has no specific tasks to be completed upon which to base judgments, the CAPE-V has three tasks for the patient to complete – sustained vowels (/ɑ/ and /i/, 3–5 s each), six phonetically distinct sentences, and a sample of spontaneous speech in response to “Tell me about your voice problem.” The CAPE-V is more detailed and takes longer to administer and score. At least in part because it uses 100-point scales, it is thought to be more responsive to small changes in voice [25]. Both the CAPE-V and GRBAS have been found to be reliable and valid measures of perceptual voice quality [15, 25], but as noted earlier in this chapter, it is important to consider the factors that affect reliability and validity of any perceptual rating tool and to control these with standardized procedures, training, and anchors when needed.
We have discussed the importance of standardizing procedures and minimizing variability as much as possible to aid reliable perceptual assessment, but when these rating scales are applied to a pediatric population, this can be challenging, and modifications at times need to be made. It can be difficult for young children to sustain vowels for more than a few seconds at a time. Providing models and using child-friendly explanations and visual cueing can be helpful, as can encouraging them with competition. (“See if you can make your voice go all the way to the end of the screen!”). Some of these techniques are described more thoroughly in the acoustic and aerodynamic assessment chapters.
For sentence-level stimuli, ideally an older child could read the same standardized sentences an adult would produce. If the child is not a fluent reader, however, he or she may need to repeat each sentence after the clinician. While the standardized sentences used with adults are ideal, their linguistic complexity may be too difficult for younger children even when provided with a model to repeat. For these children, we have developed a list of similar, simpler sentences that aim to preserve the phonetic makeup of the original sentences. These include “Harry has a hat,” “We were away,” “We eat eggs,” “Mama made muffins,” and “Pet the puppy.” These are typically simple enough for even young 3–4-year-old children to repeat successfully. These modified sentences were not developed by the consensus committee and are not an official part of the CAPE-V. As with any standardized instrument, if used in a non-standardized way, this should be noted and taken into account.
Eliciting conversational speech samples can have its own challenges. While we know young children with dysphonia often have more awareness about their voice problem than they are given credit for, they may have difficulty answering a prompt such as “Tell me about your voice problem.” We have typically chosen to elicit speech with a more child-friendly prompt such as “Tell me about your favorite vacation” or “Tell me about your favorite movie.” Using visually interesting stimuli such as the Cookie Theft Picture [26] or the updated Cookie Theft Picture [27] is another way to help elicit additional speech samples. Sometimes despite a clinician’s best efforts, however, a young child may be very reluctant to engage in any sort of conversation given the unfamiliar and sometimes anxiety-provoking setting. Sometimes the speech one is able to elicit may not be particularly representative of a child’s typical conversational voice, particularly in terms of loudness. In these cases, engaging the help of parents or caregivers in getting the child talking more naturally, even a little, can provide useful output upon which to base perceptual judgments. When judgments are based on a very limited speech sample, it is necessary to note this in one’s documentation.
Emerging and Evolving Practices
Emerging methods and technologies in perceptual voice evaluation are focused on improving inter- and intra-rater reliability, isolating perceptual features, and providing complementary information that may improve the accuracy or reliability of evaluation. For example, providing a spectrogram of the voice in conjunction with the recording has been demonstrated to increase inter-rater reliability [28]. Synthesized voices have been used to better isolate individual vocal parameters, to better understand both the auditory-perceptual characteristics and their acoustic correlates, and to increase reliability in ratings [29, 30]. The use of synthesized voices can provide anchors for varying severity and different parameters of voice and allow clinicians and researchers to isolate the salient characteristics. Currently, acoustic, aerodynamic, and perceptual assessment techniques are complementary, but as voice recognition and analysis techniques continue to develop, we may see more overlap and a greater ability to objectively quantify what we hear perceptually.