Acoustic Analysis of Voice

In the previous chapter, we described the initial stages of the voice evaluation process, in which the perceptual description of the patient’s voice holds a necessary and essential place as a voice evaluation measure for both the clinician and the patient. However, we also noted key limitations of auditory-perceptual evaluation of voice, including inter-clinician differences and biases that may result in problems of scale validity and reliability. Auditory perceptions may also be difficult to characterize and may not be as credible as numerical test procedures. In addition, since one of the fundamental decisions made in any diagnostic is one of “normal/typical” versus “abnormal/atypical,” we must recognize that perceptual judgments alone do not allow for objective comparison with normative groups. 1 In making these diagnostic comparisons, it is common to compare our current patient to the average performance and average deviation of a target sample. Unfortunately, perceptions cannot be compared with measurable norms in any valid manner.


These issues may be addressed by incorporating instrumental measures into our assessment procedures. Instrumental measures are typically obtained using electronic or computer-based equipment. 2 Because the environment in which speech–language pathologists practice evolves continually and rapidly, ever greater sophistication is demanded of the clinician in evaluation methods and procedures, and this includes the expanding use of instrumental procedures. In addition, the demands of effective health care delivery and reimbursement have required the voice clinician to quantify patient characteristics via instrumental measures, both in diagnosis and through the course of therapy. Instrumental measures can also help guide and support overall clinical judgments and allow for the comparison of vocal performance to appropriate normative data. If clinical experience, expertise, and perceptual judgments form the foundation of diagnostic hypotheses, then instrumental measures are a key factor in the acceptance or rejection of these hypotheses. 1


4.2.1 Rationale for Acoustic Methods


There are many forms of instrumental measures that may be used to describe the voice signal and underlying vocal function. However, it is our view that the choice of acoustic analysis methods presents several distinct advantages for the voice clinician 1:




  1. Clinician experience and familiarity: Since all certified speech–language pathologists must have acquired knowledge and skills in basic speech science and acoustic methods as part of their academic training, key acoustic analysis concepts and measures (e.g., period, frequency) will be relatively familiar to the clinicians using them.



  2. Noninvasive: Because acoustic methods are noninvasive, they may be used with ease, comfort, and familiarity with all patients by any clinician.



  3. Readily available and relatively low cost: Key components necessary for high-quality computerized acoustic analysis are readily available to most clinicians in their own desktop or laptop computers. In addition, high-quality acoustic analysis software (both freely available and commercial) is widely available. The addition of a good-quality microphone and possible preamplifier make a high-quality acoustic analysis setup available for a fraction of the cost of other instrumental voice analysis methods.



  4. Correspondence with the underlying physiology of voice disorders: Because the acoustic signal is determined, in part, by movements of the vocal folds, “there is a great deal of correspondence between the physiology and acoustics, and much can be inferred about the physiology based on acoustic analysis.” 3 It must be noted that the relationships between phonatory physiology and acoustics are certainly not perfect. The voice signal “is a complex product of the nonlinear interaction between aerodynamic and biomechanical properties of the voice production system.” 4 Because this interaction is nonlinear, accurate predictions regarding underlying phonatory physiology cannot always be made on the basis of the acoustic signal alone. However, when acoustic analysis results are placed within the context of a complete voice diagnostic protocol, very powerful inferences may be made.



  5. Good applicability to future therapy: Acoustic methods lend themselves well to both diagnostic procedures and treatment methods. It has been our experience that most patients, even relatively young children, are able to easily understand (in a simple, but effective manner) many of the measures displayed in voice analysis programs (e.g., jitter values “should go down”; F0 values “should go up”; displayed F0 contours should flatten or become more variable, depending on the context). In this way, acoustic methods provide a valuable link between the voice diagnostic and voice therapy.



  6. Wide body of literature: Acoustic analysis methods have an extensive history of use with a wide range of voice-disordered populations. This provides the clinician with a vast body of literature that may be accessed to aid in the interpretation of diagnostic findings.


It is important to recognize that, while acoustic measures have been used as effective indices of dysphonia severity and even voice quality type (i.e., breathy, hoarse, rough voices), acoustic methods have been largely ineffective in specifying disorder type (e.g., paralysis vs. mass lesion vs. functional disorder). Since the perceptual and acoustic characteristics of functional versus organic voice disorders are often quite similar in nature, it is additional key evaluation components (e.g., case history; laryngeal visualization) of the voice diagnostic that often provide important information by which differential diagnosis may be achieved.


4.3 Necessary Requirements for High-Quality Audio Recordings of the Voice


In this section, we will describe valuable acoustic measures that will be used to supplement our auditory-perceptual impressions of the voice and to provide key objective measures that we may use to both validate and guide our diagnostic impressions. We will see that for each of the previously described perceptual impressions of the voice signal such as pitch, loudness, and quality, there are corresponding objective acoustic correlates that may be readily measured from the recorded voice sound wave. However, prior to exploring some of these acoustic voice analysis methods, we will review some of the necessary requirements for high-quality audio recordings of the voice, including




  • Choice of software.



  • Good-quality microphone.



  • Analog-to-digital (A-to-D) recording interface.



  • Speakers or headphones for signal playback.


4.3.1 Voice Analysis Software


There are several very good programs available for voice analysis (both free and commercial). Examples of free recording and analysis programs include Praat 5 and SpeechTool. 6 Examples of commercial programs designed for voice analysis include Computerized Speech Lab and Multi-Speech 7 and lingWaves 8 (see ▶ Fig. 4.1 for representative screenshots from these programs).



Fig. 4.1 Example screenshots of sound wave display and associated fundamental frequency contour in (a) Praat, (b) SpeechTool, (c) Multi-Speech, and (d) lingWaves.


Potential users of these programs should be aware that there are pros and cons to the use of free versus commercial software. In the case of free software, the pro is obvious: the cost of putting together a high-quality voice recording and analysis system using free software (e.g., Praat) can be relatively low (certainly < $500). However, the primary con is that support for installation, maintenance, and use of the software and associated hardware is minimal and will primarily depend on the troubleshooting skills of the user. In contrast, commercial systems are often fully customer-supported via live service and/or phone and internet support. In addition, it is generally in the interest of commercial entities to offer consistent upgrades to software, to provide highly compatible software and hardware packages, and to organize and support presentations and workshops that demonstrate effective product use. Of course, the costs of a commercial hardware/software system can be relatively high (though still considerably less than the costs of aerodynamic or laryngeal imaging systems also used in voice evaluation). Clinicians will have to decide which option is best for them based on their experience, daily needs, and time available for the necessary “learning curve” required to use whatever software/hardware system they choose or are provided for acoustic analysis of voice.


4.3.2 Microphone and Preamplifier


A microphone transduces the analog (i.e., the continuously variable signal as it occurs in the natural world) acoustic sound wave into an electrical audio waveform. This waveform will then be captured for analysis and playback. Microphones may be handheld (▶ Fig. 4.2) or head-mounted (▶ Fig. 4.3), with the head-mounted microphone preferable since a consistent mouth-to-microphone distance and positioning may be easily maintained.



Fig. 4.2 Examples of a (a) microphone preamplifier (Yamaha AUDIOGRAM3 Computer Recording interface), (b) a handheld dynamic microphone (Shure SM58), and (c) an XLR microphone cable.



Fig. 4.3 Examples of (a) a condenser headset microphone (AKG C-520) and (b) a dynamic headset microphone (Shure SM10A).


While a fixed mouth-to-microphone distance is not essential for consistent measures of vocal frequency, a consistent distance is essential when attempting to make measures that correlate with vocal loudness. For a headset microphone, a position of approximately 4 to 10 cm from the lips at a 45-degree angle 9 is suggested to obtain a high-quality voice signal with limited background noise and with avoidance of excessive noise from production of plosive sounds. 10 Microphones may be omnidirectional (receiving signals from all directions with similar sensitivity) or unidirectional (highly sensitive in a single direction). The microphone should also have a relatively flat frequency response (i.e., variation of < 2 dB) across the spectral frequency range of the voice, with particular focus on flat response between approximately 50 and 8,000 Hz, and should have a dynamic range wide enough to capture both the quietest and the loudest voice productions.


The electrical signals produced by microphones have extremely small amplitudes and generally should be preamplified. A preamplifier (▶ Fig. 4.2) is an electronic device that amplifies a weak signal. With lower cost microphones that plug directly into the computer (via USB or a 1/8-inch [3.5 mm] mini-jack), the preamplifier is built into the computer audio interface. With higher quality microphones, an external preamplifier will be necessary to accept the microphone XLR plug (often a circular connector with three pins; ▶ Fig. 4.2). In addition, the preamplifier may also provide power (referred to as phantom power) necessary for certain high sensitivity microphones called condenser microphones. The preamplifier will generally have a gain control that may be adjusted so that the loudest phonations are not distorted and clipped and quiet phonations are raised above any background noise. Many current computer audio interface preamplifiers provide an output that connects either to the audio input of your computer interface (which typically receives a 1/8-inch [3.5 mm] mini-jack) or to the USB port of your computer via a USB cable. For a summary of microphone and preamplifier characteristics, see the studies by Svec and Granqvist and Patel et al. 9, 11


4.3.3 Digital Recording Basics


Once we have acquired our microphone and (if necessary) external preamplification hardware, we are almost ready to record our patient’s voice signal into the computer. When a voice/speech signal is recorded for computer analysis, it must be converted into a data format that the computer can manipulate. This procedure is known as analog-to-digital (A-to-D) conversion. In this process, the sound wave is transformed into a series of numbers (i.e., digits) which represent the fluctuating amplitude of the signal over time. The number of times the amplitude of the sound wave is captured per second is referred to as the sampling rate.


Sampling Rate


When a sound is “sampled,” we have a process analogous to taking a series of snapshots of some variable activity (e.g., someone running from one point to another); the more snapshots or samples we have over time, the more accurate the reproduction of the variable activity being observed. Sampling rate refers to the number of samples (i.e., pieces of data) that will be captured per second (in the case of sound waves, we will capture multiple measures of the changing amplitude of the sound wave per second). For sound recordings and reproduction, many thousands of samples are required per second if we are to obtain an accurate reproduction of the human voice sound wave. Therefore, prior to recording, the user should check the options or preferences of their recording software to select the sampling rate and the appropriate channel for recording (a single-mono channel is generally all that is required for voice recordings) (▶ Fig. 4.4).
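The snapshot analogy can be made concrete with a little arithmetic. The following is a minimal sketch in Python (the function names are illustrative, not taken from any voice analysis program) showing how many samples a recording contains and how many samples describe a single cycle of vibration at two sampling rates.

```python
def samples_in_recording(sampling_rate_hz, duration_s):
    """Total number of amplitude samples captured in a recording."""
    return int(round(sampling_rate_hz * duration_s))


def samples_per_cycle(sampling_rate_hz, f0_hz):
    """Number of samples that describe one cycle of vibration."""
    return sampling_rate_hz / f0_hz


# A 5-second speech sample recorded at 44,100 samples per second
print(samples_in_recording(44_100, 5))      # 220500

# One cycle of a 110.25 Hz voice at two different sampling rates
print(samples_per_cycle(44_100, 110.25))    # 400.0
print(samples_per_cycle(11_025, 110.25))    # 100.0
```

With four times as many samples per cycle, the higher rate traces the shape of each cycle in much finer detail.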



Fig. 4.4 Example of the sound recording application in Praat. The required number of channels (e.g., single-mono vs. dual-stereo) and the sampling rate/frequency should be selected prior to recording.


The appropriate sampling rate for a particular signal is determined by the Nyquist Theorem, which states that the sampling rate should be at least two times the highest frequency of interest. Since the normal hearing mechanism is expected to be sensitive to frequencies as high as approximately 20,000 Hz (i.e., 20 kHz), a sampling rate of approximately 2 × 20 kHz (40 kHz) or higher should capture the key acoustic characteristics of the voice (e.g., fundamental frequency [F0], harmonics of the F0, vocal tract resonances [formants], speech noise as observed in consonant productions, and noise as produced in disordered voice such as breathiness). 12 Sampling rates should be selected based on an understanding of the range of frequencies of interest in the signal being analyzed. While certain speech characteristics (e.g., vocal tract resonances/formants, the vocal F0 and lower harmonics) may be effectively analyzed using lower sampling rates such as 11 to 22 kHz, valid analysis of higher frequency spectral energy and/or rapid changes and perturbations in the voice signal will require higher sampling rates. Most computer external recording interfaces and internal sound cards are capable of recording the speech/voice wave at approximately 44 thousand samples per second (44.1 kHz/44,100 Hz) or higher. The 44.1 kHz sampling rate provides very accurate reproduction of the fluctuations of the sound waveform in time (▶ Fig. 4.5) and also provides a recording similar in quality to the sound reproduced on most music compact discs.
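The Nyquist rule can be sketched in a few lines of Python. The second function is an addition for illustration rather than something described above: it shows the standard result that a pure tone above half the sampling rate is "folded" down to a lower apparent frequency (aliasing) when no anti-aliasing filter is applied. Recording hardware normally applies such a filter, so this is a simplified view, and the function names are hypothetical.

```python
def minimum_sampling_rate(highest_frequency_hz):
    """Nyquist criterion: sample at least twice the highest frequency of interest."""
    return 2.0 * highest_frequency_hz


def apparent_frequency(true_frequency_hz, sampling_rate_hz):
    """Frequency at which a pure tone appears after sampling.

    Tones above half the sampling rate fold back (alias) to a lower frequency.
    Assumes no anti-aliasing filter, which real interfaces normally include.
    """
    half_rate = sampling_rate_hz / 2.0
    folded = true_frequency_hz % sampling_rate_hz
    return folded if folded <= half_rate else sampling_rate_hz - folded


print(minimum_sampling_rate(20_000))        # 40000.0
print(apparent_frequency(8_000, 44_100))    # 8000 (captured correctly)
print(apparent_frequency(8_000, 11_025))    # 3025 (misrepresented)
```

In this sketch, energy at 8,000 Hz is represented faithfully at 44.1 kHz but not at approximately 11 kHz, which is why the choice of rate should follow from the frequencies of interest.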



Fig. 4.5 Portion of a voice sound wave sampled at (a) 44,100 Hz (44.1 kHz) vs. (b) 11,025 Hz (≈11 kHz). While the general profile of the waveforms is similar, the digital representation of the sound wave is much more accurate when recorded at the higher sampling rate in (a) versus the lower sampling rate in (b).


Quantization


While sampling rate is used to control time resolution, amplitude resolution is controlled via quantization. When an analog signal is quantized, the continuous amplitude variations are converted to discrete values or increments. The number of possible amplitude values that can be measured is determined by the number of bits of resolution. Bits of resolution are calculated using a base number of 2: $n$ bits provide $2^{n}$ possible amplitude levels. As an example, $2^{8}$ would give 256 levels of possible amplitude. Current A-to-D conversion methods generally use at least 16 bits of resolution ($2^{16}$), providing 65,536 possible amplitude levels (typically stored as signed values from −32,768 to +32,767).
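The relationship between bits and amplitude levels can be checked directly. The sketch below assumes signed integer storage of samples (the convention used by 16-bit audio files); the function names are illustrative only.

```python
def amplitude_levels(bits):
    """Number of distinct amplitude values available with the given bit depth."""
    return 2 ** bits


def signed_range(bits):
    """Lowest and highest sample values when levels are stored as signed integers."""
    half = 2 ** (bits - 1)
    return -half, half - 1


def quantize(amplitude, bits):
    """Map an amplitude between -1.0 and 1.0 to the nearest available integer step."""
    lowest, highest = signed_range(bits)
    step = round(amplitude * highest)
    return max(lowest, min(highest, step))


print(amplitude_levels(8))     # 256
print(amplitude_levels(16))    # 65536
print(signed_range(16))        # (-32768, 32767)
print(quantize(1.0, 16))       # 32767
print(quantize(0.25, 8))       # 32
```

The last line shows the cost of a low bit depth: with 8 bits, every amplitude near 0.25 of full scale is rounded to the same coarse step.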


Audio Format


There are many different formats in which to save your recording. We want to make sure that we select an uncompressed audio file format. Though convenient for simply listening to a recorded sound wave, .mp3 and similar compressed formats should be avoided, since they often reduce the size of the saved data file by removing certain frequencies within the sound wave that may be of interest during acoustic analysis of the voice. The most common noncompressed format that will allow you to analyze and play back your recording using a wide variety of voice analysis programs is the .wav format.
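As an illustration of these settings, the following sketch uses Python's standard `wave` module to save a synthetic test tone as an uncompressed, mono, 16-bit, 44.1 kHz .wav file and then reads the file header back. The tone, the file name `voice_demo.wav`, and the function names are assumptions made for the example; a clinical recording would of course come from the microphone rather than from a formula.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLING_RATE = 44_100   # samples per second
BITS = 16                # bits of resolution per sample


def write_test_tone(path, frequency_hz=110.0, duration_s=1.0, peak=0.4):
    """Save a sine tone as an uncompressed mono .wav file."""
    n_samples = int(SAMPLING_RATE * duration_s)
    max_int = 2 ** (BITS - 1) - 1
    frames = bytearray()
    for n in range(n_samples):
        sample = peak * math.sin(2 * math.pi * frequency_hz * n / SAMPLING_RATE)
        frames += struct.pack("<h", int(round(sample * max_int)))  # 16-bit little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)              # single (mono) channel
        wav_file.setsampwidth(BITS // 8)      # bytes per sample
        wav_file.setframerate(SAMPLING_RATE)
        wav_file.writeframes(bytes(frames))
    return n_samples


def describe_wav(path):
    """Return (channels, bits per sample, sampling rate, number of samples)."""
    with wave.open(path, "rb") as wav_file:
        return (wav_file.getnchannels(), wav_file.getsampwidth() * 8,
                wav_file.getframerate(), wav_file.getnframes())


demo_path = os.path.join(tempfile.gettempdir(), "voice_demo.wav")
write_test_tone(demo_path)
print(describe_wav(demo_path))   # (1, 16, 44100, 44100)
```

Checking the header of an existing file in this way is a quick means of confirming that a recording was saved with the intended channel count, bit depth, and sampling rate.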


Appropriate Recording Quality and Signal Amplitude


In any type of speech/voice analysis, high-quality recordings are essential. Many of the acoustic analysis methods that are used to describe normal versus disordered voice production are essentially methods that quantify the degree of disturbance or perturbation in the voice signal. Poorly recorded signals that contain significant background noise or are distorted by poor recording levels can result in invalid measurements. During recording, some programs such as Praat will provide a colored volume unit (VU) meter to indicate the signal amplitude of the recording. With many colored VU meters, recorded signals should peak in the high green/low yellow region; they should neither reach the red region nor be so weak that the VU meter barely moves. When signals exceed the available amplitude range and go “into the red,” the recorded signal becomes “clipped” (i.e., the tops and/or bottoms of the waveform are clipped off); in digital recording, this tends to result in an obvious harsh distortion to the signal on playback. Unfortunately, to avoid clipping, many users record signals at minimal recording levels that use only a small fraction of the available amplitude range. This results in signals that have a very poor signal-to-noise ratio (SNR; i.e., the noise inherent in any recording begins to compete with the actual signal), which is also a type of signal distortion that may result in invalid acoustic analysis results. As shown in ▶ Fig. 4.6, the recorded signal should fill the middle one-third to half of the available amplitude scale, resulting in a strong SNR and good representation of the amplitude variations of the signal while still leaving available dynamic range (i.e., “headroom”) to capture occasional peaks in the recorded signal without clipping.
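A recording level check of this kind can be sketched as follows. The sketch assumes 16-bit samples and interprets the "middle one-third to half" guidance as a target for the peak sample value relative to full scale; that interpretation, the thresholds, and the verdict labels are assumptions made for illustration, not a published criterion.

```python
FULL_SCALE = 32_767   # largest positive value of a 16-bit sample


def recording_level_report(samples, low=1 / 3, high=0.5):
    """Summarize peak level and likely clipping for a list of 16-bit samples.

    `low` and `high` are the assumed target limits for the peak value,
    expressed as a fraction of the available amplitude range.
    """
    peak_fraction = max(abs(s) for s in samples) / FULL_SCALE
    # Samples sitting at the limit of the range are treated as likely clipping.
    clipped = sum(1 for s in samples if abs(s) >= FULL_SCALE)
    if clipped:
        verdict = "clipped"
    elif peak_fraction < low:
        verdict = "too low"
    elif peak_fraction > high:
        verdict = "little headroom"
    else:
        verdict = "ok"
    return peak_fraction, clipped, verdict


print(recording_level_report([0, 1_000, -1_200])[2])              # too low
print(recording_level_report([0, 12_000, -14_000])[2])            # ok
print(recording_level_report([0, 32_767, -32_768, 32_767])[2])    # clipped
```

A check like this does not replace watching the VU meter during recording, but it can flag files that should be re-recorded before any acoustic measures are computed.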



Fig. 4.6 Recordings of the sentence “We were away a year ago” at three different recording levels. In sound wave A, the recording level was too low, resulting in a signal that has a weak signal-to-noise ratio and poor representation of the waveform in the amplitude domain. In sound wave C, the recording level was too high, resulting in distortion of the sound wave by “clipping” (the tops and bottoms of the sound wave have been cut off since the recording amplitude has exceeded the range of the allowable amplitude resolution). In sound wave B, the recording level is just right—the recorded sound wave fills the approximate middle one third to half (≈ 33–50%) of the allowable amplitude range resulting in an excellent representation of the sound wave without clipping.


4.4 Key Acoustic Measurements Used in the Analysis of Voice


Now that we have reviewed some of the terminology and processes involved in capturing a high-quality recording of our patient’s voice, we will discuss several key acoustic measurements that provide objective correlates to the perceptual attributes of vocal pitch, loudness, and quality. Many of these measures have recently been identified and described as essential in forming a minimal set of acoustic measures used in the instrumental assessment of voice. 11


4.4.1 Measures of Vocal Frequency


The categorization of typical or disordered voice in terms of pitch (a psychoacoustic scale that allows for the ordering of sounds from “low” to “high”) is an essential part of the conventional voice diagnostic. 13 An objective and measurable correlate of pitch is frequency (Hz). The fundamental frequency (F0) generally appears as the lowest harmonic frequency in the voice signal and may also be observed in the spectrum of the voice signal as the frequency spacing between the harmonics (▶ Fig. 4.7).



Fig. 4.7 Spectrum of a portion of a highly periodic voice waveform. The first significant peak in the spectrum is often the fundamental frequency (F0—a.k.a. the first harmonic). In this example, the F0 ≈ 110 Hz and higher frequency harmonics that occur at integer multiples (i.e., whole number multiples) of the F0 are also observed. Therefore, the frequency spacing between the harmonics is also equal to the F0.


How do we measure the F0 of the voice signal? We know that, during the vibratory cycle, the adducted vocal folds are put into oscillation/vibration by the expiratory airstream, and the pitch of the sound produced is related to the number of vocal fold oscillations/cycles of vibration per unit time. The term frequency (typically measured in Hertz [Hz]) refers specifically to the number of cycles of vibration per second. While the fundamental frequency (F0) of the voice may be identified from the frequency location of the lowest harmonic peak in the spectrum, as well as via the frequency spacing between the spectral harmonics, vocal F0 is typically calculated from the sound wave by estimating the period (i.e., the time it takes to complete a cycle of vibration) of the cycles of vibration being analyzed (▶ Fig. 4.8).



Fig. 4.8 A portion of a highly periodic voice waveform. Four cycles of vibration are shown. The time it takes to complete a cycle of vibration is referred to as the period (P). The inverse of the period (in seconds) is the frequency (i.e., frequency [Hz] = 1/(Period (s))). Successive estimates of the period and frequency can be averaged to estimate the mean fundamental frequency (F0). In this example, the mean period is 0.00907 s and the mean F0 is 110.25 Hz.


While it is possible to measure the period (and convert to frequency) for each visible cycle of vibration in a sound wave, we typically use computer programs to do these calculations in a much more efficient manner. Estimates of the vocal period may be computed from the digitized voice signal using various methods such as peak picking (measurement of the period of the cycle as the time interval between two consecutive peak amplitudes), zero-crossings (measurement of the period of the cycle as the time interval between two consecutive zero-crossing points, often defined as the amplitude closest to zero immediately preceding the peak amplitude of the cycle), waveform matching (determination of the period by computing the point at which the mean squared error between two adjacent cycles is minimized), and autocorrelation (determination of the period by identifying the repeating pattern of vibration via correlation). Once the period of a cycle (or average period within a specified duration of the sound wave) is identified, the frequency may be easily computed from the reciprocal of the period (in seconds).


(i.e., $F_0 = 1/P$, where $F_0$ is the frequency in Hz and $P$ is the period in seconds)
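The period-to-frequency conversion and one of the period estimation methods listed above (autocorrelation) can be sketched in Python. This is a bare-bones illustration on a synthetic, perfectly periodic signal; it is not the algorithm used by Praat or any other program, which add refinements such as windowing, interpolation, and voicing decisions. The function names, the search limits of 60 to 600 Hz, and the test signal are all assumptions made for the example.

```python
import math


def frequency_from_period(period_s):
    """Frequency (Hz) is the reciprocal of the period (s)."""
    return 1.0 / period_s


def estimate_f0_autocorrelation(samples, sampling_rate, f0_min=60.0, f0_max=600.0):
    """Estimate F0 (Hz) from the time shift (lag) at which the signal best matches itself."""
    min_lag = int(sampling_rate / f0_max)   # shortest period considered, in samples
    max_lag = int(sampling_rate / f0_min)   # longest period considered, in samples
    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        overlap = len(samples) - lag
        score = sum(samples[i] * samples[i + lag] for i in range(overlap)) / overlap
        if score > best_score:
            best_lag, best_score = lag, score
    return sampling_rate / best_lag


# Mean period reported in Fig. 4.8
print(round(frequency_from_period(0.00907), 2))   # 110.25

# Synthetic "voice": a 110.25 Hz fundamental plus two weaker harmonics
fs = 11_025
f0 = 110.25
tone = [
    math.sin(2 * math.pi * f0 * n / fs)
    + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
    + 0.25 * math.sin(2 * math.pi * 3 * f0 * n / fs)
    for n in range(1100)
]
print(estimate_f0_autocorrelation(tone, fs))      # 110.25
```

Because the estimate is restricted to whole-sample lags, its precision depends on the sampling rate; disordered voices with irregular cycles are considerably harder to track than this idealized signal.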


As you will see, changes and/or differences in vocal pitch and frequency are associated with factors such as typical aging, sex, and body type. In addition, changes in pitch and frequency are also frequently reported as characteristics of disordered voice. Measures of vocal frequency are particularly important to obtain because the complexity of the disordered voice signal may lead to inaccurate judgments of pitch. Because the fundamental frequency of the voice is a direct result of changes in factors such as (1) elasticity or tissue stress, (2) effective length, and (3) (to a degree) the effective vibratory mass of the vocal folds, speech–language pathologists may gain key insights into the function of the phonatory mechanism from their evaluation. 14 As a result, fundamental frequency measures, and in particular measures of the mean fundamental frequency, are the most frequently reported measures obtained in voice evaluation protocols. The following minimal set of vocal frequency measures has been demonstrated to be effective in the description of both typical and atypical/disordered voice production. 11


Mean Speaking Fundamental Frequency (Mean F0)


The mean fundamental frequency (mean F0) during speech production (often referred to as the mean speaking F0 or speaking fundamental frequency [SFF]) is the average of the fundamental frequency estimates across the entire acoustic signal or portion of signal being analyzed and is generally reported in Hertz (Hz). This measurement approximates the perception of habitual pitch. 15 Because mean F0/SFF is a useful correlate of vocal pitch, it can be used to objectively document the appropriateness of pitch level for a patient’s age, sex, race, etc.
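The averaging step can be sketched as follows. The sketch assumes that an analysis program has already produced a series of F0 estimates and that unvoiced or silent frames are marked with 0 or `None`; that convention, the example values, and the function names are assumptions. The semitone conversion is included only because variability is sometimes reported in semitones (ST) rather than Hz.

```python
import math
import statistics


def mean_speaking_f0(f0_track_hz):
    """Mean and standard deviation (Hz) of F0 across voiced frames only."""
    voiced = [f for f in f0_track_hz if f and f > 0]
    return statistics.mean(voiced), statistics.stdev(voiced)


def hz_to_semitones(frequency_hz, reference_hz):
    """Distance of a frequency from a reference, in semitones (12 per octave)."""
    return 12.0 * math.log2(frequency_hz / reference_hz)


# Hypothetical F0 estimates (Hz); 0 marks unvoiced or silent frames
track = [0, 118.0, 122.0, 120.0, 0, 116.0, 124.0, 0]
mean_f0, sd_f0 = mean_speaking_f0(track)
print(mean_f0)                         # 120.0
print(round(sd_f0, 2))                 # 3.16
print(hz_to_semitones(220.0, 110.0))   # 12.0
```

Excluding unvoiced frames matters: including the zeros in this example would pull the mean far below any frequency the speaker actually produced.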


It is best to make the measurement of mean F0/SFF from a sample of reading or spontaneous speech (i.e., a continuous speech sample), with commonly used samples including portions of “the Rainbow Passage” 16 or CAPE-V sentences. 17 Zraick et al have indicated that it is best to make this measurement from a continuous speech sample of at least 5 seconds in duration (e.g., second and third sentences of the Rainbow Passage; combined CAPE-V sentence productions). 18 As always, procedures should be standardized so that the same procedures/instructions are provided both inter-subject and intra-subject (e.g., if used for tracking progress).


It has been recommended that, when using a reading sample, we ask our patient to read an entire passage (e.g., the entire first paragraph of “the Rainbow Passage”), but we will record and analyze an embedded or central portion of the passage (e.g., second or second and third sentences of the passage). Use of an embedded portion of a larger passage helps retain the naturalness of the patient’s speaking style while avoiding possible initial or final sentence effects, and also provides a sample of speech production that tends to correlate well with longer productions while keeping storage requirements for digitized speech samples at a manageable level. 19 The reading of a standard passage is generally useable for most patients. However, when working with patients who may not be able to read effectively (e.g., very young children, those with linguistic deficits, or patients with visual deficits), a sample may be elicited by means of sentence repetition, counting, or picture description tasks.


Based on a review of literature and the clinical experience of these authors, the following general expectations for mean F0/SFF in nondysphonic individuals are suggested:




  • Infants: 400 to 600 Hz (both males and females).



  • Children: 250 to 300 Hz (both males and females).



  • Adults: Males, 100 to 150 Hz; females, 180 to 230 Hz.



  • Senescent adults: Possible increases in the mean F0 of the male voice; possible decreases in the mean F0 of the postmenopausal female voice.
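For illustration, the general expectations listed above can be placed in a small lookup so that a measured mean F0 can be compared with them. The group labels and function name are hypothetical, and the ranges are the broad expectations from the list, not diagnostic cutoffs; senescent adults are omitted because no numeric range is given for them.

```python
# General expectations for mean F0/SFF (Hz), taken from the list above
EXPECTED_MEAN_F0_HZ = {
    "infant": (400.0, 600.0),
    "child": (250.0, 300.0),
    "adult male": (100.0, 150.0),
    "adult female": (180.0, 230.0),
}


def compare_to_expected(mean_f0_hz, group):
    """Describe where a measured mean F0 falls relative to the expected range."""
    low, high = EXPECTED_MEAN_F0_HZ[group]
    if mean_f0_hz < low:
        return "below expected range"
    if mean_f0_hz > high:
        return "above expected range"
    return "within expected range"


print(compare_to_expected(122.5, "adult male"))     # within expected range
print(compare_to_expected(160.0, "adult female"))   # below expected range
print(compare_to_expected(320.0, "child"))          # above expected range
```

A result outside the range is a prompt for closer examination alongside the case history and perceptual findings, not a diagnosis in itself.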


▶ Table 4.1 provides representative examples of mean F0/SFF obtained from various literature sources.

Table 4.1 Examples of mean F0/SFF from various literature sources

| Author | Gender | No. of subjects | Age | Mean and SD | Range |
|---|---|---|---|---|---|
| Fitch 20 | M | 100 | M = 19:6 | 116.65 (1.05 T) | 85.0–155.0 |
| | F | 100 | M = 19:5 | 217.00 (0.85 T) | 165.0–255.0 |
| Hollien and Jackson 21, a | M | 157 | 17:9–25:8 (M = 20:3) | 123.3 Hz | 90.5–165.2 |
| Hollien and Shipp 22 | M | 25 | 20–29 | 119.5 | N/A |
| | M | 25 | 30–39 | 112.2 | N/A |
| | M | 25 | 40–49 | 107.1 | N/A |
| | M | 25 | 50–59 | 118.4 | N/A |
| | M | 25 | 60–69 | 112.2 | N/A |
| | M | 25 | 70–79 | 132.1 | N/A |
| | M | 25 | 80–89 | 146.3 | N/A |
| McGlone and McGlone 23 | F | 10 | 7:6–8:6 | 275.8 (0.6 T) | N/A |
| Horii 24 | M | 65 | 26–79 (M = 54.1 y) | 112.5 (17.3) | 84–151 |
| Honjo and Isshiki 25 | M | 20 | 69–85 | 162.0 (30.7) | N/A |
| | F | 20 | 69–85 | 165.0 (32.5) | N/A |
| Hudson and Holbrook 26, b | MB | 100 | 18–29 | 110.15 (16.21) | 81.95–158.50 |
| | FB | 100 | 18–29 | 193.10 (18.58) | 139.05–266.10 |
| Murry and Doherty 27 | M | 5 | 55–71 (M = 63.8) | 122.9 | 104.0–137.7 |
| Stoicheff 28 | F | 21 | 20–29 (M = 24.6) | 224.3 | 192.2–275.4 |
| | F | 18 | 30–39 (M = 35.4) | 213.3 | 181.0–240.6 |
| | F | 21 | 40–49 (M = 46.4) | 220.8 | 189.8–272.9 |
| | F | 17 | 50–59 (M = 54.4) | 199.3 | 176.4–241.2 |
| | F | 15 | 60–69 (M = 65.8) | 199.7 | 142.8–234.9 |
| | F | 19 | 70+ (M = 75.4) | 202.2 | 170.0–248.6 |
| Bennett 29, c | M | 15 | M = 8:2 | 234.0 (19.76) | 204.0–270.0 |
| | F | 10 | M = 8:2 | 235.0 (12.31) | 221.0–258.0 |
| | M | 15 | M = 9:2 | 226.0 (16.42) | 198.0–263.0 |
| | F | 10 | M = 9:2 | 222.0 (8.25) | 209.0–236.0 |
| | M | 15 | M = 10:2 | 224.0 (14.68) | 208.0–259.0 |
| | F | 10 | M = 10:2 | 228.0 (9.37) | 215.0–239.0 |
| | M | 15 | M = 11:2 | 216.0 (15.04) | 195.0–259.0 |
| | F | 10 | M = 11:2 | 221.0 (13.43) | 200.0–244.0 |
| Ramig and Ringel 30, d | MY,G | 8 | 26–35 (M = 29.5) | 121.93 (1.91 ST) | N/A |
| | MY,P | 8 | 25–38 (M = 32.3) | 127.30 (2.61 ST) | N/A |
| | MM,G | 8 | 46–56 (M = 53.0) | 118.36 (3.02 ST) | N/A |
| | MM,P | 8 | 42–59 (M = 52.6) | 122.85 (2.17 ST) | N/A |
| | MO,G | 8 | 62–75 (M = 67.5) | 125.98 (2.96 ST) | N/A |
| | MO,P | 8 | 64–74 (M = 69.1) | 132.89 (2.43 ST) | N/A |
| Horii 19 | M | 18 | 10–12 | 226.5 (20.5) | 192.1–268.5 |
| | F | 18 | 10–12 | 237.5 (15.9) | 198.1–271.1 |
| Moran and Gilbert 31 | M | 2 | 24–29 | 152.0 | 137.0–167.0 |
| | F | 3 | 24–29 | 244.0 | 220.0–278.0 |
| Pedersen et al 32 | M | 19 | 8.7–12.9 | 273.0 | N/A |
| | M | 15 | 13.0–15.9 | 184.0 | N/A |
| | M | 14 | 16.0–19.5 | 125.0 | N/A |
| Kent et al 33 | F | 19 | 65–80 (M = 71.8) | 194.0 (6.0) | N/A |
| Shipp et al 34, e | MY | 10 | 21–35 (M = 25.3) | 120.67 (10.87) | 103.54–139.08 |
| | MM | 10 | 46–71 (M = 57.7) | 106.22 (2.27) | 91.0–131.24 |
| | MO | 10 | 77–90 (M = 83.7) | 149.23 (19.97) | 116.39–187.57 |
| Awan and Mueller 35 | FY | 9 | M = 21.18 (1.06) | 207.67 (16.38) | 186.0–230.0 |
| | FE | 9 | M = 101.7 (2.40) | 176.92 (22.61) | 135.35–210.33 |
| Awan 36, f | M | 10 | 18–30 y | 123.00 (12.54) | 102.0–137.0 |
| | F | 10 | 18–30 y | 206.60 (14.99) | 186.0–230.0 |
| Morris et al 37, g | M | 18 | 20–35 | 125.8 (11.1) | N/A |
| | M | 14 | 40–55 | 117.2 (9.4) | N/A |
| | M | 18 | >65 | 130.1 (16.7) | N/A |
| Murry et al 38, h | MY | 9 | 20–35 | 137.0 (0.9 ST) | N/A |
| | MO | 6 | 59–73 | 139.0 (6.9 ST) | N/A |
| | FY | 10 | 20–35 | 195.0 (1.2 ST) | N/A |
| | FO | 7 | 59–73 | 170.0 (2.3 ST) | N/A |
| Awan and Mueller 39, i | MW | 15 | 5:1–6:3 | 240.07 (15.89) | 211.89–263.06 |
| | FW | 20 | 5:1–6:1 | 243.35 (22.17) | 195.20–291.10 |
| | MB | 18 | 5:0–6:0 | 241.31 (18.05) | 204.94–274.35 |
| | FB | 17 | 5:1–6:0 | 231.48 (14.99) | 208.08–261.73 |
| | MH | 16 | 5:1–6:0 | 248.99 (20.18) | 219.16–287.51 |
| | FH | 19 | 5:1–5:11 | 248.04 (14.45) | 217.56–274.03 |
| Morris 40, j | MW | 15 | M = 8.4 y (0.4) | 213.0 (15.0) | N/A |
| | | 15 | M = 9.4 y (0.3) | 219.0 (18.0) | N/A |
| | | 15 | M = 10.5 y (0.2) | 220.0 (21.0) | N/A |
| | MB | 15 | M = 8.3 y (0.4) | 230.0 (22.0) | N/A |
| | | 15 | M = 9.5 y (0.2) | 217.0 (39.0) | N/A |
| | | 15 | M = 10.6 y (0.2) | 204.0 (37.0) | N/A |
| Awan 41 | F | 10 | 18–30 (mean = 23.80 y) | 200.85 (15.70) | N/A |
| | F | 10 | 40–49 (mean = 43.40 y) | 175.37 (11.18) | N/A |
| | F | 10 | 50–59 (mean = 54.80 y) | 167.66 (22.23) | N/A |
| | F | 10 | 60–69 (mean = 65.20 y) | 151.16 (18.30) | N/A |
| | F | 10 | 70–79 (mean = 72.30 y) | 156.08 (20.19) | N/A |
| Izadi et al 42 | M | 100 | 18–45 (mean = 29.2 y) | 122.48 (13.18) | N/A |
| | F | 100 | 18–45 (mean = 31.6 y) | 183.12 (26.5) | N/A |


Goy et al 43


M


55


18–28 (mean =19.4)


128 (21)


N/A



M


51


65–86 (mean =73.3)


127 (27)


N/A



F


104


18–27 (mean =18.9)


251 (28)


N/A



F


82


63–82 (mean =71.1)


211 (42)


N/A


Gelfer and Denor 44


M & F


18


6 (mean = 6.33 y)


240.5 (34.0)


N/A




22


7 (mean = 7.27 y)


252.9 (26.6)


N/A




23


8 (mean = 8.29 y)


239.7 (29.5)


N/A




63


6–8


244.8 (30.0)


N/A


Cox and Selent 45


M


10


20–29 (mean = 22.7 y)


121.48 (14.87)


N/A



M


5


30–39 (mean = 34.6 y)


128.38 (12.07)


N/A



M


6


40–49 (mean = 42.5 y)


117.16 (6.03)


N/A



M


9


50–59 (mean = 55.11 y)


114.25 (14.63)


N/A



M


5


60–69 (mean = 62.6 y)


112.95 (6.28)


N/A


Abbreviations: M, mean; N/A, not available; SFF, speaking fundamental frequency; T, tones; ST, semitones; y, years.


Notes:


a. Data from extemporaneous speech only.


b. Data from Black (B) subjects.


c. Longitudinal study – same male and female subjects studied over a 3 year period.


d. Young (Y), Middle age (M), and Old age (O) subjects in good (G) and poor (P) condition. Data from reading of a standard passage.


e. Young (Y), Middle-aged (M), and Old (O) subjects.


f. Data from nonsingers only.


g. Data from nonsingers only.


h. Young (Y) and older (O) subjects. Data from a standard reading passage averaged over three times daily on three different days.


i. Data from White (W), Black (B), and Hispanic (H) children.


j. Data from White (MW) and Black (MB) subjects (spontaneous speech data only).


Fundamental Frequency Standard Deviation, Pitch Sigma, and F0 Coefficient of Variation (F0 CV/vF0)


The F0 standard deviation (F0 SD, reported in Hz) is a measure of average F0 variability. It is a useful correlate of the expected variations in pitch level during speech production, as well as of the ability to produce a steady-pitch sustained vowel. When measured from a continuous speech sample, the F0 SD is useful in documenting intonation capability (i.e., purposeful variations in pitch used to express linguistic intent). When measured in a steady-pitch sustained vowel context, the F0 SD is a valuable measure of pitch stability or instability. In this sustained vowel context, the F0 SD has been referred to as a measure of long-term instability, in which variations in frequency occur more slowly than the glottal vibration itself. 46


Although F0 SD may be reported in Hz, it is necessary to normalize the average variation for different mean F0s (e.g., adult males vs. females). One such method is to convert the F0 SD into pitch sigma, in which the average variation in Hz is converted to semitones. The conversion is accomplished using the following formula:


$n = 12 \log_{2}(f_2 / f_1)$ or, equivalently, $n \approx 39.86 \log_{10}(f_2 / f_1)$,


where n is the number of semitones between the two frequency values, and f2 is a higher frequency value and f1 is a lower frequency value.
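As a minimal sketch of the semitone conversion, the following Python functions compute the number of semitones between a lower frequency `f1_hz` and a higher frequency `f2_hz`, assuming the standard relation of 12 semitones per octave (a doubling of frequency). The function names and the example frequencies are illustrative assumptions; the text does not specify which two frequencies should be entered when converting an F0 SD into pitch sigma, so that step is not shown.

```python
import math


def semitones_between(f1_hz, f2_hz):
    """Semitones from the lower frequency f1 up to the higher frequency f2.

    Assumes the standard equal-tempered relation: 12 semitones per octave.
    """
    return 12.0 * math.log2(f2_hz / f1_hz)


def semitones_between_log10(f1_hz, f2_hz):
    """Same interval using a base-10 logarithm (39.86 is approximately 12 / log10(2))."""
    return 39.86 * math.log10(f2_hz / f1_hz)


# Illustrative values only: 100 Hz to 200 Hz is one octave.
octave = semitones_between(100.0, 200.0)
octave_log10 = semitones_between_log10(100.0, 200.0)
print(octave, round(octave_log10, 2))
```

Both forms give approximately 12 semitones for an octave; the base-10 version differs very slightly because 39.86 is a rounded constant.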


While pitch sigma may be useful for describing the perceived pitch variations purposefully used in speech production or singing, it is our view that the use of a perceptual semitone scale to represent the stability of pitch/F0 production in sustained vowel contexts is less useful. In addition, the computation of semitones is relatively cumbersome. Instead, we prefer the use of an F0 coefficient of variation (CV), in which the term coefficient of variation refers to the reporting of the average variation as a percentage of the mean value. By converting to a percentage of the mean, i.e.,


$\text{F0 CV} = (\text{F0 SD} / \text{mean F0}) \times 100$,


we “normalize” F0 variability for the comparison of different voice types. As an example, it is often observed that, during continuous speech, the F0 SD is minimally ≈ 10% of the mean speaking F0. 180 Therefore, for a male speaking with an F0 of 120 Hz, we would expect an F0 SD of approximately 12 Hz; for a child with an F0 of 250 Hz, the F0 SD would be approximately 25 Hz. In this example, the child clearly has substantially greater absolute F0 variability (in Hz) than the adult male (25 vs. 12 Hz). However, when converted to a CV, we see that both speakers produce the same relative variability (10% of the mean). An F0 CV (a.k.a. vF0) markedly less than 10% during speech production may be consistent with the perception of monopitch.
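The arithmetic in this example can be sketched in a few lines of Python. The function name `f0_cv_percent` is an illustrative assumption; the two speakers are the adult male and child described above.

```python
def f0_cv_percent(f0_sd_hz, mean_f0_hz):
    """F0 coefficient of variation: the F0 SD expressed as a percentage of the mean F0."""
    return 100.0 * f0_sd_hz / mean_f0_hz


# Values taken from the example in the text.
adult_male_cv = f0_cv_percent(12.0, 120.0)
child_cv = f0_cv_percent(25.0, 250.0)
print(adult_male_cv, child_cv)  # 10.0 10.0
```

Although the child's F0 SD in Hz is roughly double that of the adult male, both CVs come out at 10%, which is the sense in which the CV normalizes variability across voice types.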


In contrast to expectations in speech production, if the patient is able to produce and control a stable, steady pitch during sustained vowel production, we will expect that the F0 SD will be quite small in relation to the mean F0, and the F0 CV will tend to be less than 1% of the mean F0. Increased F0 CV/vF0 in sustained vowels may reflect pitch instability, but may also reflect increased levels of noise and the tendency toward aperiodic voice production. This is because the F0 estimates used in the computation of F0 SD and F0 CV are derived from estimates of cycle periods (remember, $F_0 = 1/T$, where $T$ is the period of one cycle). If the ability to identify highly repetitive, cyclic patterns in the voice signal is disturbed, the F0 computations will also be disturbed, and increased variation in the estimates of F0 will occur. Therefore, an increased F0 SD and F0 CV in sustained vowel production may also be used as a correlate of quality disturbance (e.g., breathiness; roughness) as well as a correlate of pitch stability/instability.


The sustained vowel sample(s) we elicit for the measurement of F0 SD and F0 CV will also be used for several other acoustic measures specifically related to vocal quality. However, if we simply ask the subject or patient to hold out a vowel (say “ahhhh”), in the majority of cases the patient will produce the vowel at a substantially higher pitch level than his or her habitual speaking pitch. Instead, we would like this sustained vowel to have some reasonable similarity to the patient’s habitual voice characteristics during typical speech. The method that this book proposes is the “1, 2, 3” method of sustained vowel elicitation, similar to that of Murry. 47 In this method, we ask our patient to “chant” the numbers “1, 2, 3,” followed by a sustained vowel /ɑ/ (a demonstration by the clinician is useful). Instructions may be as follows:


“I want you to chant the numbers ‘one, two, three’ followed by the vowel /a/ at a comfortable pitch and loudness, something like this”:


→→→


“One, two, three, ahhhhhhhhhhhhhhhh” (These words have a horizontal arrow over them to imply a flat intonation pattern; the vowel /a/ should be sustained for 3 to 5 seconds).


Sample 4.1 provides an audio example of a sustained vowel elicited using the “1, 2, 3” method and ▶ Fig. 4.9 shows the sound wave of the recorded sample and the associated F0 contour. Note that the level of the F0 contour for the sustained vowel is very similar to F0 produced during the chanted “1, 2, 3.”





Fig. 4.9 Recorded sample of the sustained vowel /a/ elicited using the “1, 2, 3” method (upper window) and associated F0 contour (lower window). Note that the F0 level during the sustained vowel production is very similar to the F0 produced during the chanted “1, 2, 3.” Elicitation of a sustained vowel with this method tends to prevent atypically high F0 productions.


Demonstrate the method and have your patient repeat it once without recording to make sure that he or she understands what to do. If you are satisfied that the patient understands the procedure, start your recording and have the patient repeat the sample three times. These three trials can all be captured in the same recording and saved for later analysis. Once your sample has been acquired, you will measure F0 SD (as well as subsequent measures of jitter, shimmer, and HNR) on a central portion of the vowel (at least the central 1 second). If all three trials are similar in perceptual characteristics, measures from the second trial will suffice for clinical purposes. If the voice quality deviation is intermittent and, perhaps, only affects one of the vowel samples, it is suggested that you compute your measurements on both a disrupted and a nondisrupted sample to reflect the intermittent disturbance in your data.
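Selecting the central portion of the vowel can be sketched as follows. This is only an illustration, assuming that the clinician has already identified the onset and offset times of the sustained vowel within the recording; the function name, the 1-second default, and the example times are hypothetical.

```python
def central_window(onset_s, offset_s, window_s=1.0):
    """Return (start, end) times of a window centered on the middle of the vowel.

    onset_s and offset_s are the start and end of the sustained vowel, in seconds.
    """
    duration = offset_s - onset_s
    if duration < window_s:
        raise ValueError("vowel is shorter than the requested analysis window")
    midpoint = onset_s + duration / 2.0
    return midpoint - window_s / 2.0, midpoint + window_s / 2.0


# Hypothetical vowel lasting from 0.5 s to 4.5 s in the recording.
start_s, end_s = central_window(0.5, 4.5)
print(start_s, end_s)  # 2.0 3.0
```

Centering the window avoids the onset and offset of the vowel, where F0 is more likely to rise or fall.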


The following are suggested general expectations for F0 SD/F0 CV in nondysphonic individuals:




  • Infants and adolescents: Increased F0 SD and F0 CV in sustained vowel production may be expected in infants and during puberty.



  • Adults: F0 CV is minimally ≈ 10% of the mean F0 or higher in continuous speech, and pitch sigma is generally observed to be ≈ 2 to 4 STs. In sustained vowel production, we often observe F0 CV to be ≈ 1% of the mean F0 or less, and substantially less than 1 ST for pitch sigma.



  • Senescent adults: Possible increases in F0 SD, F0 CV, and pitch sigma during both sustained vowel and continuous speech productions.


▶ Table 4.2 provides representative examples of F0 SD obtained from various literature sources.

Table 4.2 Examples of F0 standard deviation (SD) from various literature sources as reported in terms of F0 coefficient of variation (vF0; F0 CV) and pitch sigma (F0 SD reported in semitones, ST).

| Author | Gender | No. of subjects | Age | Mean and SD | Range |
|---|---|---|---|---|---|
| Hollien and Jackson 21, a | M | 157 | 17:9–25:8 (M = 20:3) | 1.6 T | 0.5–2.5 |
| Horii 24 | M | 65 | 26–79 (M = 54.1) | 2.41 ST (0.48) | 1.46–3.54 |
| Murry and Doherty 27, b | M | 5 | 55–71 (M = 63.8) | 1.88 ST | 1.0–3.2 |
| Stoicheff 28 | F | 21 | 20–29 (M = 24.6) | 3.78 ST | N/A |
| | F | 18 | 30–39 (M = 35.4) | 3.92 ST | N/A |
| | F | 21 | 40–49 (M = 46.4) | 4.00 ST | N/A |
| | F | 17 | 50–59 (M = 54.4) | 4.33 ST | N/A |
| | F | 15 | 60–69 (M = 65.8) | 4.25 ST | N/A |
| | F | 19 | 70+ (M = 75.4) | 4.70 ST | N/A |
| Horii 48, c | M | 12 | 24–40 | 0.27 ST (0.09) | 0.14–0.47 |
| Linville and Fisher 49, V | F | 25 | 25–35 | 1.47 Hz (0.39) | 0.84–2.39 |
| | | 25 | 45–55 | 1.68 Hz (0.43) | 1.08–2.69 |
| | | 25 | 70–80 | 2.52 Hz (1.49) | 1.06–8.05 |
| Linville 50, d, V | F | 22 | 18–22 (M = 20.32; S.D. = 0.95) | 0.11 ST (0.04) | 0.05–0.33 |
| Linville et al 51, V | F | 20 | 67–86 (M = 76.0; S.D. = 6.09) | 0.34 ST (0.19) | 0.10–0.74 |
| Orlikoff 52, e | MY | 6 | 26–33 (M = 30.0) | 0.96% | |
| | ME | 6 | 68–80 (M = 73.3) | 2.19% | |
| Shipp et al 34, f | MY | 10 | 21–35 (M = 25.3) | 1.76 ST (0.26) | 1.34–2.30 |
| | MM | 10 | 46–71 (M = 57.7) | 2.25 ST (0.45) | 1.61–3.08 |
| | MO | 10 | 77–90 (M = 83.7) | 2.60 ST (0.60) | 2.08–3.85 |
| Wolfe et al 53 | M & F | 20 | 18–30 | 1.86 (1.89) | 0.33–7.52 |
| Awan and Mueller 39, g | MW | 15 | 5:1–6:3 | 4.38 ST (1.78) | 2.18–9.36 |
| | FW | 20 | 5:1–6:1 | 5.59 ST (1.81) | 2.92–8.98 |
| | MB | 18 | 5:0–6:0 | 5.26 ST (1.44) | 2.85–8.28 |
| | FB | 17 | 5:1–6:0 | 5.03 ST (2.04) | 2.64–10.71 |
| | MH | 16 | 5:1–6:0 | 5.39 ST (2.59) | 2.75–11.08 |
| | FH | 19 | 5:1–5:11 | 4.64 ST (1.67) | 2.53–9.64 |
| Morris 40, h | MW | 15 | M = 8.4 y (0.4) | 2.5 ST (0.9) | N/A |
| | | 15 | M = 9.4 y (0.3) | 1.9 ST (0.7) | N/A |
| | | 15 | M = 10.5 y (0.2) | 2.3 ST (0.8) | N/A |
| | MB | 15 | M = 8.3 y (0.4) | 2.1 ST (0.7) | N/A |
| | | 15 | M = 9.5 y (0.2) | 2.5 ST (0.9) | N/A |
| | | 15 | M = 10.6 y (0.2) | 3.2 ST (0.7) | N/A |
| Awan 1, i | M | 20 | 18–30 | 0.29 ST (0.09) | 0.17–0.54 |
| | F | 20 | 18–30 | 0.22 ST (0.07) | 0.13–0.45 |
| Awan and Scarpino 54, R | M | 10 | 18–30 | 16.65% | N/A |
| | W | 10 | 18–30 | 15.60% | N/A |
| | C | 10 | 5–9 | 13.01% | N/A |
| Awan 41, j | F | 10 | 18–30 (Mean = 23.80 y) | 2.79 ST (1.07) | N/A |
| | F | 10 | 40–49 (Mean = 43.40 y) | 3.00 ST (1.19) | N/A |
| | F | 10 | 50–59 (Mean = 54.80 y) | 3.65 ST (1.13) | N/A |
| | F | 10 | 60–69 (Mean = 65.20 y) | 4.77 ST (1.21) | N/A |
| | F | 10 | 70–79 (Mean = 72.30 y) | 4.90 ST (0.72) | N/A |
| | M & F | 10 | 5–6 | 0.29 ST (0.06) | 0.20–0.37 |
| Hema et al 55, V | M | 30 | 18–25 | 0.98% | |
| | F | 30 | 18–25 | 0.97% | N/A |
| Petrović-Lazić et al 56 | F | 21 | 21–61 (Mean = 47.57) | 1.12% (0.44) | N/A |
| Aithal et al 57 | M | 24 | 20–30 | 1.04% (0.50) | 0.53–2.45 |
| | F | 24 | 20–30 | 0.84% (0.31) | 0.40–1.64 |
| Gelfer and Denor 44 | M & F | 18 | 6 (Mean = 6.33 y) | 1.96 ST (0.77) | N/A |
| | | 22 | 7 (Mean = 7.27 y) | 2.24 ST (1.03) | N/A |
| | | 23 | 8 (Mean = 8.29 y) | 1.96 ST (0.59) | N/A |
| | | 63 | 6–8 | 2.06 ST (0.82) | N/A |


Notes: M, mean; R, rainbow; V, vowel.


a. Data from extemporaneous speech only.


b. Speech – Converted from tones.


c. Vowel – Data reported from modal register phonations and reported in semitones


d. Data from the vowel /a/ only.


e. Vowel – Data from healthy young men (MY) and healthy elderly men (ME).


f. Speech – Young (Y), Middle-aged (M), and Old (O) subjects.


g. Data from White (W), Black (B), and Hispanic (H) children.


h. Speech – Data from White (MW) and Black (MB) subjects (spontaneous speech data only).


i. Data for the vowel /a/ only.


j. Speech.


4.4.2 Examples of Mean F0, F0 Standard Deviation, and F0 Coefficient of Variation Measurements Using Praat


The following examples were analyzed using Praat. The clinician may have recorded a new voice signal (in Praat, select New | Record Mono Sound) or may be analyzing a previously recorded voice sample (in Praat, select Open | Read from File). The recorded or opened sound wave will be placed in the “Objects” list; selecting the “View & Edit” button provides a view of the sound wave and the ability to play back the recorded sound. Selecting View | Show Analyses provides checkboxes; for these examples, select “Show Pitch” and “Show Pulses.”


The F0 contour is computed automatically using the default settings in Praat once “View & Edit” is selected. However, selecting Pitch | Pitch Settings will allow the user to select the analysis method (the autocorrelation method is recommended for F0 analysis) and the pitch range (please note that the use of the term “pitch” as per Praat is inaccurate and should be “frequency”). It is recommended that the pitch analysis range be set so that the low and high limits of the analysis are approximately 100 to 150 Hz below and above the expected mean F0 for the speaker. As an example, a pitch range of 100 to 300 Hz for typical young adult female speakers will tend to place the F0 contour within the middle of the lower analysis window, with variations within the F0 contour nicely detailed. If the user selects a pitch range that is too large (e.g., 75–1,000 Hz), the F0 contour will appear highly compressed, with very little detail regarding fluctuations in the contour, and the estimates of F0 will be prone to error. If the user selects a pitch range that is too narrow (e.g., 175–225 Hz for a typical adult female), errors in F0 estimation will occur: F0s above or below the selected range may not be captured, and F0s above the higher limit may be measured at half of the actual F0 or otherwise misestimated. It is important that the user of voice analysis software becomes familiar with key analysis parameters and always compares what is seen in the F0 contour and measured with his or her perception of the recorded voice sample and knowledge regarding expectations for measured vocal parameters such as mean F0.
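To illustrate why the analysis range matters, the following Python sketch applies a deliberately simplified autocorrelation search to a synthetic 200 Hz signal. It is not Praat's algorithm (which includes additional normalization and candidate selection steps); the function name, sampling rate, signal, and range values are all assumptions chosen for illustration.

```python
import math


def estimate_f0_autocorr(samples, sample_rate_hz, floor_hz, ceiling_hz):
    """Pick the lag with the largest autocorrelation inside the allowed F0 range."""
    min_lag = int(sample_rate_hz / ceiling_hz)  # shortest period considered
    max_lag = int(sample_rate_hz / floor_hz)    # longest period considered
    window = len(samples) - max_lag             # same number of products for every lag
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate_hz / best_lag


# Synthetic "voice": a 200 Hz fundamental plus a weaker second harmonic (0.5 s at 8 kHz).
SAMPLE_RATE_HZ = 8000
TRUE_F0_HZ = 200.0
signal = [
    math.sin(2 * math.pi * TRUE_F0_HZ * n / SAMPLE_RATE_HZ)
    + 0.5 * math.sin(2 * math.pi * 2 * TRUE_F0_HZ * n / SAMPLE_RATE_HZ)
    for n in range(4000)
]

# A range that brackets the true F0 versus one that excludes it.
suitable = estimate_f0_autocorr(signal, SAMPLE_RATE_HZ, 120.0, 300.0)
too_narrow = estimate_f0_autocorr(signal, SAMPLE_RATE_HZ, 220.0, 300.0)
print(round(suitable, 1), round(too_narrow, 1))
```

With the range that brackets the true F0, the search recovers 200 Hz; with the range that excludes it, the search can only return a value inside the range, so the estimate is wrong even though the signal is perfectly periodic. The floor of 120 Hz was chosen so that the double-period lag (corresponding to 100 Hz) is not a candidate, since this simplified search has no mechanism for preferring the shorter of two equally good periods.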


To obtain various measurements of F0 (e.g., mean, SD), the user will select Pulses | Show Pulses, use the mouse pointer to select the portion of the sound wave they want to analyze, and then select Pulses | Voice Report to view analysis statistics. The “pulses” are markers that identify the boundaries of each detected cycle of vibration (i.e., where a repetitive pattern starts and ends/repeats). ▶ Fig. 4.10 shows a magnified view of the syllable “rain” in the word “rainbow.” The pulses can be observed marking each cycle of vibration in the sound wave. Estimates of period are computed from these pulses and then converted to estimates of frequency.
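The conversion from pulses to periods to frequency estimates can be sketched as below. This is a minimal illustration of the idea described above, not a reproduction of Praat's internal computation; the function name, the `max_period_s` cutoff used to skip unvoiced gaps, and the pulse times are hypothetical.

```python
import statistics


def f0_stats_from_pulses(pulse_times_s, max_period_s=0.02):
    """Return (mean F0 in Hz, F0 SD in Hz, F0 CV in %) from glottal pulse times.

    Intervals longer than max_period_s are treated as gaps between voiced
    stretches (an assumption of this sketch) and are ignored.
    """
    periods = [
        later - earlier
        for earlier, later in zip(pulse_times_s, pulse_times_s[1:])
        if later - earlier <= max_period_s
    ]
    f0_values = [1.0 / period for period in periods]  # frequency = 1 / period
    mean_f0 = statistics.fmean(f0_values)
    f0_sd = statistics.stdev(f0_values)  # sample standard deviation
    return mean_f0, f0_sd, 100.0 * f0_sd / mean_f0


# Hypothetical pulse times: two voiced stretches at 200 Hz separated by a pause.
pulses = [0.000, 0.005, 0.010, 0.015, 0.500, 0.505, 0.510]
mean_f0, f0_sd, f0_cv = f0_stats_from_pulses(pulses)
print(round(mean_f0, 2), round(f0_sd, 4), round(f0_cv, 4))
```

If pulses are misplaced, as can happen in noisy or aperiodic voice, the period estimates vary more and the F0 SD and F0 CV increase accordingly.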





Fig. 4.10 Magnified view of the syllable “rain” in the word “rainbow.” Each cycle of vibration in the sound wave (upper window) is marked with a blue line (“pulse”) indicating the boundaries of each cycle. The time between pulses is the period, and the estimates of period are converted to estimates of frequency (lower window). When using Praat, pulses must be turned on to compute many statistical measures of the sound wave such as the mean F0.


▶ Fig. 4.11 shows the sound wave and F0 contour for a typical young adult female (age 22 years) speaking the second sentence of the Rainbow Passage (play Sample 4.2 to hear the voice sample). The voice sample should be perceived as appropriate in pitch for a typical young adult female with expected intonation (i.e., pitch variations are not excessive and are not highly limited as in flattened intonation and monopitch). The intonation is viewed as fluctuations in the F0 contour, with stressed syllables observed to have marked increases in F0 (e.g., the syllable “rain”; the word “white”) and typical falling pitch and F0 at the end of a noninterrogative sentence. In this example, the mean F0 = 193.11 Hz, well within the expected range of 180 to 230 Hz for a young adult female. In addition, her F0 SD = 24.11 Hz, resulting in a F0 CV = 12.45%, reflecting typical F0 variation in speech that is often observed to be greater than 10% of the mean F0.





Fig. 4.11 Typical young adult female speaking the second sentence of the Rainbow Passage. The mean F0 = 193.11 Hz and the F0 standard deviation = 24.11 Hz (F0 coefficient of variation = 12.45%). Praat analysis parameters: pitch range = 100 to 300 Hz; autocorrelation; voicing threshold = 0.60.


Sample 4.3 and ▶ Fig. 4.12 present the sound wave and F0 contour for the second sentence of the Rainbow Passage produced by a typical adult male (25 years). The mean F0 of 122.54 Hz is very close to the midpoint of the typical expected range of 100 to 150 Hz. Again, intonation patterns are typical, with appropriate F0 variation confirmed by the measured F0 CV = 14.67%.





Fig. 4.12 Typical adult male speaking the second sentence of the Rainbow Passage. The mean F0 = 122.54 Hz and the F0 standard deviation = 17.98 Hz (F0 coefficient of variation = 14.67%). Praat analysis parameters: pitch range = 75 to 250 Hz; autocorrelation; voicing threshold = 0.60.


▶ Fig. 4.13 and Sample 4.4 as well as ▶ Fig. 4.14 and Sample 4.5 show the speech samples of a typical 5-year-old female and male, respectively. As expected, both child speakers have substantially higher mean F0s than either of the previous adult speakers (mean F0 = 249.17 Hz and 271.55 Hz). Note that, at this age, males and females often have very similar mean F0s, and a male may very well have a higher mean F0 than a female of the same age (as in this case). Though the female speaker has a somewhat smaller F0 CV than our general expectation (8.10%), her intonation is perceived as typical and not flattened, and may reflect a slightly limited range of F0 variation or may simply reflect her reading style. The F0 CV for the 5-year-old male speaker is at an expected level (12.65%).





Fig. 4.13 Typical 5-year-old female speaking the second sentence of the Rainbow Passage. The mean F0 = 249.17 Hz and the F0 standard deviation = 20.11 Hz (F0 coefficient of variation = 8.10%). Praat analysis parameters: pitch range = 150 to 350 Hz; autocorrelation; voicing threshold = 0.60.





Fig. 4.14 Typical 5-year-old male speaking the second sentence of the Rainbow Passage. The mean F0 = 271.55 Hz and the F0 standard deviation = 34.36 Hz (F0 coefficient of variation = 12.65%). Praat analysis parameters: pitch range = 150 to 400 Hz; autocorrelation; voicing threshold = 0.60.


Let us now look and listen to some dysphonic speech samples and examine some possible effects on mean F0, F0 SD, and F0 CV. ▶ Fig. 4.15 and Sample 4.6 represent a speech sample from a 45-year-old female who was diagnosed with unilateral vocal fold paresis post-thyroidectomy. Her voice was perceived as being initiated with audible breathing (inhalatory stridor) and having a weak, breathy quality and monopitch. The flattened F0 contour is evident, and the F0 CV = 4.44% (mean F0 = 218.05 Hz; F0 SD = 9.69 Hz) provides a measured correlate of the perception of monopitch.





Fig. 4.15 Adult female (45 years old) with unilateral vocal fold paresis producing the second sentence of the Rainbow Passage. The mean F0 = 218.05 Hz and the F0 standard deviation = 9.69 Hz (F0 coefficient of variation (CV) = 4.44%). The F0 contour is relatively flat consistent with the perception of monopitch and the measurement of a substantially reduced F0 CV (substantially < 10% minimum expected variation). Praat analysis parameters: pitch range = 100 to 300 Hz; autocorrelation; voicing threshold = 0.60.


In ▶ Fig. 4.16 and Sample 4.7, the voice of a 59-year-old female chronic smoker with Reinke’s edema is analyzed and heard. Her voice was perceived as mildly diplophonic with audible breathing, and particularly low in pitch. Her mean F0 was measured at 114.55 Hz (F0 SD = 14.99 Hz; F0 CV = 13.08%). While mean F0 may lower in some postmenopausal females, this F0 is substantially reduced secondary to increased vocal fold mass and is lower than the mean F0 of many adult male speakers.





Fig. 4.16 Adult female (59 years old) chronic smoker with Reinke’s edema producing the second sentence of the Rainbow Passage. The mean F0 = 114.55 Hz and the F0 standard deviation = 14.99 Hz (F0 coefficient of variation = 13.08%). Praat analysis parameters: pitch range = 50 to 200 Hz; autocorrelation; voicing threshold = 0.60.


Measures of F0 SD and F0 CV obtained from sustained vowel samples are useful measures of vocal stability. As observed in ▶ Fig. 4.17 and Sample 4.8, a steady pitch and loudness sustained vowel produced by a nondysphonic speaker should be produced with very little average variability and a CV that is typically less than 1% of the mean F0 (mean F0 = 189.68 Hz; F0 SD = 1.12 Hz; F0 CV = 0.59%). In contrast, ▶ Fig. 4.18 and Sample 4.9 show the sustained vowel sample of an adult female with a strained, effortful voice quality. The F0 instability observed in the F0 contour is reflected in an increased F0 SD and an F0 CV above our expected threshold of less than 1% (F0 CV = 1.40%). The patient produces a considerable increase in vocal pitch and F0 at the initiation of the vowel, followed by a mild tremor.





Fig. 4.17 Typical voice young adult female (24 years old) producing the sustained vowel /a/ (“ahhh”). Measures from a central portion of the vowel production showed a mean F0 = 189.68 Hz and the F0 standard deviation = 1.12 Hz (F0 coefficient of variation = 0.59%). Praat analysis parameters: pitch range = 100 to 300 Hz; autocorrelation; voicing threshold = 0.60.





Fig. 4.18 Adult female (32 years old) with hyperfunctional, strained voice producing the sustained vowel /a/ (“ahhh”). Measures from a central portion of the vowel production showed a mean F0 = 213.28 Hz and the F0 standard deviation = 2.98 Hz (F0 coefficient of variation = 1.40%). Praat analysis parameters: pitch range = 150 to 350 Hz; autocorrelation; voicing threshold = 0.60.


An increase in the F0 CV may also be observed in dysphonic voice production even though the perceived vocal pitch is relatively steady. ▶ Fig. 4.19 and Sample 4.10 demonstrate the sustained vowel sample and F0 contour for a patient with a moderate breathy voice production. Again, we can see instability in the F0 contour of the sustained vowel which coincides with an increased F0 CV (1.06%). In this case, the increased F0 CV is primarily reflecting the fact that, with increased additive noise in the voice signal during breathiness, the periodicity of the voiced signal becomes disturbed, resulting in increased variation in measurements of period and F0 during the vowel production.





Fig. 4.19 Adult female (29 years) with breathy voice secondary to vocal nodules producing the sustained vowel /a/ (“ahhh”). Measures from a central portion of the vowel production showed a mean F0 = 216.67 Hz and the F0 standard deviation = 2.31 Hz (F0 coefficient of variation = 1.06%). Praat analysis parameters: pitch range = 150 to 350 Hz; autocorrelation; voicing threshold = 0.60.


A more extreme example of combined vocal tremor and dysphonic voice quality is observed in ▶ Fig. 4.20 and Sample 4.11. This voice signal was produced by a 47-year-old woman presenting with adductor spasmodic dysphonia (ADSD). We see that tremor produces an obvious “wavelike” rhythmic variation in the F0 contour. As may be expected, the F0 CV for this example shows an extreme deviation from our expected less than 1% threshold (F0 CV = 9.05%). In addition, two intermittent periods of roughness result in extreme variations in the F0 contour due to frequency jumps and episodes of noise often observed with this form of dysphonic quality. 58





Fig. 4.20 Adult female (47 years old) with severe tremor and intermittent roughness secondary to adductor spasmodic dysphonia producing the sustained vowel /a/ (“ahhh”). Measures from a central portion of the vowel production showed a mean F0 = 242.96 Hz and the F0 standard deviation = 21.98 Hz (F0 coefficient of variation = 9.05%). Praat analysis parameters: pitch range = 150 to 350 Hz; autocorrelation; voicing threshold = 0.60.


Total Phonational Frequency Range


Another measure used to document F0 variation is the Total Phonational Frequency (F0) Range. This entails an assessment of the range from the lowest pitch and frequency in modal register to the highest pitch and frequency in falsetto. Total phonational range provides an important index of laryngeal health and is often one of the first parameters of vocal capability affected in disordered voice. 15 This measure may be obtained either by using a pitch glide on a sustained sound (e.g., /a/) or in a stepwise fashion. 59 The lowest and highest phonational frequencies should be sustainable (1–2 seconds in duration) and repeatable within the total phonatory range. Because intra- and inter-subject variability on maximum performance tasks may be large and affected by factors such as practice, motivation, or instructions, it is recommended that at least three trials be conducted for both highest and lowest F0 productions. 60 The total phonational frequency range would then be computed from the highest F0 and the lowest F0 productions observed out of the elicited trials.
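A minimal sketch of this computation is shown below, assuming that the F0 of each sustained trial has already been measured. The function name and trial values are hypothetical, and reporting the range in semitones (using 12 semitones per octave) is an addition of this sketch rather than a requirement stated above.

```python
import math


def total_phonational_range(lowest_trials_hz, highest_trials_hz):
    """Return (lowest F0, highest F0, range in Hz, range in semitones).

    The lowest value across the low-pitch trials and the highest value across
    the high-pitch trials define the range.
    """
    lowest_hz = min(lowest_trials_hz)
    highest_hz = max(highest_trials_hz)
    range_hz = highest_hz - lowest_hz
    range_st = 12.0 * math.log2(highest_hz / lowest_hz)
    return lowest_hz, highest_hz, range_hz, range_st


# Hypothetical F0 values (Hz) from three low-pitch and three high-pitch trials.
low_trials = [108.0, 100.0, 104.0]
high_trials = [760.0, 800.0, 784.0]
print(total_phonational_range(low_trials, high_trials))  # (100.0, 800.0, 700.0, 36.0)
```

Expressing the range in semitones as well as Hz makes it easier to compare speakers whose ranges start from very different lowest frequencies.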


When assessing the total phonational pitch/frequency range, it is important for the clinician to be aware of the difference between the “physiological” range of phonation and the “musical” range. 16, 61, 62, 63, 64, 65, 66 In assessing physiological range, no constraints are placed on quality, pitch, loudness, or duration of the phonation, while the “musical” range entails “controlled” phonations, in which the patient must (1) sustain both the lowest and highest frequency for a minimum of 1 to 2 seconds, (2) maintain a relatively steady intensity and frequency level, and (3) produce a “quality” phonation (i.e., no pitch or phonation breaks; no excessive breathiness, harshness, or hoarseness). We suggest that, at the very least, range measures should be repeatable and sustainable, and that extreme vocal attempts (particularly at high pitch levels) be avoided. Case contends that measures of vocal capability are best evaluated during conditions of less than maximum effort, because it may be argued that even an inefficient larynx can produce voice when enough effort is applied. 15 In a similar manner, it may be argued that even inefficient vocal mechanisms may produce extensive vocal ranges if allowed to produce physiological voice, whereas the actual usable range of vocal pitches and frequencies would be much smaller if evaluated under the conditions of “controlled” phonation. Therefore, we suggest that the total phonational frequency range be assessed in terms of musical range rather than physiological range.


Within the phonational frequency range, the highest phonational frequency has been viewed as having particular importance in voice assessment. Wuyts et al stated that, when extra mass is evenly distributed along the true vocal fold(s), the higher vibratory rates become dampened. 67 The result is a decrease in the upper reaches of the phonational frequency range. Highest phonational F0 may also be obtained and utilized as part of the Dysphonia Severity Index (DSI; see discussion on multivariate analysis).


To record the total (musical) phonational range:




  1. Use a headset or handheld microphone. Because high pitch levels can be quite loud, it can be useful to hold the microphone by hand for this task and to move the microphone somewhat away from the mouth for the higher pitch productions. Alternatively, the output level of the preamplifier can also be turned down for these productions so as to avoid distorting any recordings.



  2. To record the patient’s minimum pitch level, provide the following instructions to your patient: “I am going to ask you to hold the sound “ah” (/a/) at several different notes or pitches. Starting at a comfortable pitch level, I would like you to go down in steps to the lowest note you can hold out without your voice breaking or cracking. It will be similar to singing down a scale, such as…” (provide an example for your patient here).



  3. When the patient gets to the lowest sustainable pitch, have them repeat it at least three times so that (1) you can be sure that it is of reasonable quality (do not include vocal fry phonation), (2) you can be sure that it is repeatable, and (3) you can have the opportunity to cue them to lower productions if you believe that they have not truly reached their minimum pitch limit. When you are confident the patient has reached his or her lower pitch limit, record a brief sample on the computer (1–2 seconds).



  4. To record the patient’s maximum pitch level, provide the following instructions to your patient: “Starting at a comfortable pitch level, I would like you to go up in steps to the highest note you can hold without your voice breaking or cracking, including falsetto voice—falsetto is a high, thin, reedy voice such as… (provide example). It will be similar to singing up a scale, such as…” (provide an example for your patient here).



  5. When the patient gets to his or her highest sustainable pitch, have him or her repeat it at least three times. When you are confident that the patient has reached the highest pitch limit, record a brief sample on the computer (1–2 seconds).



  6. To compute the total phonational range in Hz, simply subtract the lowest pitch/frequency level from the highest. To convert the total range in Hz to semitones (STs), consult a chart of musical note/frequency equivalents (▶ Table 4.3) and count the number of semitones between the lowest and highest frequency level (you will probably have to “round” the low- and high-frequency levels to the nearest semitone).

Table 4.3 Table of musical note and frequency equivalents

| Note | Frequency (Hz) | Note | Frequency (Hz) | Note | Frequency (Hz) | Note | Frequency (Hz) | Note | Frequency (Hz) | Note | Frequency (Hz) | Note | Frequency (Hz) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C0 | 16.35 | C1 | 32.7 | C2 | 65.41 | C3 | 130.81 | C4 | 261.63 | C5 | 523.25 | C6 | 1046.5 |
| C#0/Db0 | 17.32 | C#1/Db1 | 34.65 | C#2/Db2 | 69.3 | C#3/Db3 | 138.59 | C#4/Db4 | 277.18 | C#5/Db5 | 554.37 | C#6/Db6 | 1108.73 |
| D0 | 18.35 | D1 | 36.71 | D2 | 73.42 | D3 | 146.83 | D4 | 293.66 | D5 | 587.33 | D6 | 1174.66 |
| D#0/Eb0 | 19.45 | D#1/Eb1 | 38.89 | D#2/Eb2 | 77.78 | D#3/Eb3 | 155.56 | D#4/Eb4 | 311.13 | D#5/Eb5 | 622.25 | D#6/Eb6 | 1244.51 |
| E0 | 20.6 | E1 | 41.2 | E2 | 82.41 | E3 | 164.81 | E4 | 329.63 | E5 | 659.25 | E6 | 1318.51 |
| F0 | 21.83 | F1 | 43.65 | F2 | 87.31 | F3 | 174.61 | F4 | 349.23 | F5 | 698.46 | F6 | 1396.91 |
| F#0/Gb0 | 23.12 | F#1/Gb1 | 46.25 | F#2/Gb2 | 92.5 | F#3/Gb3 | 185 | F#4/Gb4 | 369.99 | F#5/Gb5 | 739.99 | F#6/Gb6 | 1479.98 |
| G0 | 24.5 | G1 | 49 | G2 | 98 | G3 | 196 | G4 | 392 | G5 | 783.99 | G6 | 1567.98 |
| G#0/Ab0 | 25.96 | G#1/Ab1 | 51.91 | G#2/Ab2 | 103.83 | G#3/Ab3 | 207.65 | G#4/Ab4 | 415.3 | G#5/Ab5 | 830.61 | G#6/Ab6 | 1661.22 |
| A0 | 27.5 | A1 | 55 | A2 | 110 | A3 | 220 | A4 | 440 | A5 | 880 | A6 | 1760 |
| A#0/Bb0 | 29.14 | A#1/Bb1 | 58.27 | A#2/Bb2 | 116.54 | A#3/Bb3 | 233.08 | A#4/Bb4 | 466.16 | A#5/Bb5 | 932.33 | A#6/Bb6 | 1864.66 |
| B0 | 30.87 | B1 | 61.74 | B2 | 123.47 | B3 | 246.94 | B4 | 493.88 | B5 | 987.77 | B6 | 1975.53 |


Alternatively, the clinician may use the aforementioned formulas by which the number of semitones (n) between the highest frequency (f2) and the lowest frequency (f1) may be calculated. 68
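
As a worked illustration, the conversion can be scripted. The sketch below assumes the standard equal-tempered relation n = 12 × log2(f2/f1), which is equivalent to approximately 39.86 × log10(f2/f1), and the A4 = 440 Hz tuning used in ▶ Table 4.3. The function names are hypothetical and are not part of any particular analysis program.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def semitones_between(f1_hz, f2_hz):
    """Number of semitones from the lower frequency f1 to the higher frequency f2."""
    return 12.0 * math.log2(f2_hz / f1_hz)


def nearest_note(freq_hz, a4_hz=440.0):
    """Name of the equal-tempered note closest to freq_hz (assumes A4 = 440 Hz)."""
    # Semitone index on the MIDI-style scale, where A4 is note number 69.
    note_number = round(69 + 12.0 * math.log2(freq_hz / a4_hz))
    octave = note_number // 12 - 1
    return f"{NOTE_NAMES[note_number % 12]}{octave}"


# Lowest and highest F0 values reported for the pitch glide in Fig. 4.21.
low_hz, high_hz = 112.6, 559.2
print(f"{semitones_between(low_hz, high_hz):.2f} ST")
print(nearest_note(low_hz), "to", nearest_note(high_hz))
```

Because the constant 39.86 is itself rounded, results from the two forms of the formula may differ in the second decimal place. Such differences are far smaller than the trial-to-trial variability of the task.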


Suggested expectations regarding the total phonational range in nondysphonic individuals are as follows:




  • Infants and children: Phonational range of approximately 1 to 2 octaves (12–24 semitones).



  • Adults: Phonational range of approximately 2 to 3 octaves (24 to 36 semitones). Adults without singing training or without singing experience may be limited to approximately 20 to 24 semitones.



  • Senescent adults: Possible decreases in phonational range with increased age.


▶ Table 4.4 provides representative examples of total phonational range in semitones obtained from various literature sources.

Table 4.4 Examples of total phonational range from various literature sources

| Author | Gender | No. of subjects | Age | Mean and SD | Range |
|---|---|---|---|---|---|
| Hollien and Jackson 21 | M | 157 | 17:9–25:8 (M = 20:3) | 19.4 T | 14.5–27.0 |
| Ramig and Ringel 30, a | MY,G | 8 | 26–35 (M = 29.5) | 32.20 ST (8.77) | N/A |
| | MY,P | 8 | 25–38 (M = 32.3) | 26.65 ST (7.10) | N/A |
| | MM,G | 8 | 46–56 (M = 53.0) | 28.29 ST (8.74) | N/A |
| | MM,P | 8 | 42–59 (M = 52.6) | 26.84 ST (3.57) | N/A |
| | MO,G | 8 | 62–75 (M = 67.5) | 31.37 ST (4.38) | N/A |
| | MO,P | 8 | 64–74 (M = 69.1) | 24.30 ST (7.12) | N/A |
| Pedersen et al 32 | M | 19 | 8.7–12.9 | 34.4 ST | N/A |
| | M | 15 | 13.0–15.9 | 37.5 ST | N/A |
| | M | 14 | 16.0–19.5 | 41.4 ST | N/A |
| Linville 69 | F | 24 | 25–35 | 33.13 ST (3.43) | 28–40 |
| | F | 20 | 45–55 | 34.00 ST (3.22) | 27–38 |
| | F | 23 | 70–80 | 28.96 ST (4.13) | 19–35 |
| Linville et al 51 | F | 20 | 67–86 (M = 76.0; S.D. = 6.09) | 36.4 ST (4.1) | 29.0–44.0 |
| Awan 61 | M | 10 | 18–30 | 29.10 ST (5.02) | N/A |
| | F | 10 | | 25.80 ST (4.87) | N/A |
| Morris et al 37 | M | 18 | 20–35 | 36.6 ST (4.2) | N/A |
| | M | 14 | 40–55 | 36.9 ST (5.0) | N/A |
| | M | 18 | >65 | 29.7 ST (5.2) | N/A |
| Ma et al 70 | F | 35 | 22–52 (M = 36.03) | 40.39 ST (3.73) | N/A |
| Šiupšinskiene et al 71 | F | 76 | M = 38.5 | 30.8 ST (4.3) | N/A |
| Siupsinskiene and Lycke 72 | M | 38 | 18–67 (M = 33.7) | 34.2 ST (3.2) | 27.0–41.0 |
| | F | 89 | | 29.5 ST (3.3) | 22.2–36.7 |
| Hallin et al 73 | M | 30 | 21–50 (M = 30.0) | 40.6 ST (4.41) | 33.0–51.0 |

Notes:

a. Young (Y), middle age (M), and old age (O) subjects in good (G) and poor (P) condition. Data are from sustained vowel /a/.


4.4.3 Examples of Total Phonational Range Measurements Using Praat


Two examples are provided to illustrate the recording and measurement of total phonational range. In ▶ Fig. 4.21 and Sample 4.12, we see the recorded sound wave and F0 contour for an adult male providing a glide from the lowest phonational pitch and F0 in modal register to the highest phonational pitch and F0 in falsetto register. Note that the analysis pitch range has been greatly expanded (50–800 Hz) compared to the more focused analysis ranges we have used for continuous speech and vowel samples in previous examples. In this example, the total phonational range in Hz is 446.6 Hz (559.2–112.6 Hz), corresponding to 27.74 STs and an approximate range from A2 (110 Hz) to C#5 (554 Hz). The transition from modal to falsetto register is indicated in ▶ Fig. 4.21 with a vertical arrow. It can clearly be heard in Sample 4.12 and is marked by a sudden drop in signal amplitude and a variation in the F0 contour.





Fig. 4.21 Typical adult male producing a pitch glide from the lowest to the highest phonation pitch and F0. The total phonational range in Hz is 446.6 Hz (559.2–112.6 Hz), corresponding to 27.74 STs and an approximate range from A2 (110 Hz) to C#5 (554 Hz). The transition from modal to falsetto register is indicated with a vertical arrow corresponding to a drop in signal amplitude in the sound wave and a variation in the F0 contour. Praat analysis parameters: pitch range = 50 to 800 Hz; autocorrelation; voicing threshold = 0.60.


▶ Fig. 4.22 and Sample 4.13 provide another example of the recording and measurement of the total phonational range, this time elicited by having the subject go down in steps to the lowest pitch that can be sustained, followed by steps going up to the highest sustainable pitch. The lowest and highest sustainable pitch and F0 productions were recorded and have been combined into one sound file. In this example, the total phonational F0 range is 381.4 Hz (493.9–112.5 Hz), corresponding to 25.61 STs and an approximate range from A2 (110 Hz) to B4 (493.9 Hz).





Fig. 4.22 Recordings of the lowest phonation pitch and F0 in modal register and the highest phonational pitch and F0 in falsetto register from a typical adult male combined into a single sound file. The total phonational F0 range is 381.4 Hz (493.9–112.5 Hz), corresponding to 25.61 STs and approximate range from A2 (110 Hz) to B4 (493.9 Hz). Praat analysis parameters: pitch range = 50 to 800 Hz; autocorrelation; voicing threshold = 0.60.


4.4.4 Measures of Vocal F0 in Normal/Typical Voice Subjects


Vocal F0 during Infancy and Childhood


Since we expect variations in habitual pitch as a function of factors such as the typical aging process and sex differences, we should also expect these same factors to produce variations in mean F0. The infant voice has been reported to generally have expected F0s in the range of approximately 400 to 600 Hz, with no observed significant difference between sexes in the F0. 74, 75 A possible tendency for lower F0s in infants with increased body size has been reported. 76 Pitch and F0 level gradually lower in both sexes during infancy and through the childhood years, with rapid changes in F0 occurring during the first 4 months and between 1 and 3 years of age.


Increased F0 variability during vowel-like productions (e.g., cries) in the infant voice has been reported and may reflect a low level of neuromuscular coordination and limited ability to control phonatory parameters such as vocal fold tension. 3 Studies have demonstrated that pitch and F0 range also develop from infancy through childhood, with the singing range expanding from one to two octaves between the ages of 6 and 11 years. 77, 78


Vocal F0 during Puberty


F0 generally becomes distinguishable by sex at about 11 years of age and certainly by 13 years of age. There is a tendency for F0 variability during vowel productions to decrease with age from infancy to puberty, reflecting increased neuromuscular control during phonation. 79 In females, the fundamental frequency lowers approximately 1 octave from birth to puberty; in males, F0 lowers approximately 2 octaves. However, because females may mature faster than males, they may enter adolescence with slightly lower speaking F0s. 80 The growth changes in the larynx that occur during puberty (referred to as mutation, in which the neck lengthens and the larynx descends and grows in size) are primarily responsible for the observed changes in vocal F0 in both sexes, with complete mutation occurring within 3 to 6 months. 81 The more drastic pubertal changes in vocal F0 observed in males are due to increased laryngeal size and increased vocal fold mass and length, 82 but are also consistent with the development of secondary sex characteristics. As an example, Pedersen et al reported on F0 measures in relation to pubertal development in 48 normal males in three groups (8.7–12.9, 13–15.9, and 16–19.5 years). 32 As expected, mean speaking F0 dropped considerably between the three groups (273 vs. 184 vs. 125 Hz). In addition, speaking F0 correlated quite strongly with height (r = –0.82), pubic hair stage (r = –0.87), testis volume (r = –0.78), and total testosterone (r = –0.73). The aforementioned laryngeal changes experienced during puberty are also associated with a period of increased pitch instability/variability. Boltezar et al reported that the F0 CV from sustained vowel /a/ productions was greater in both pubescent males and females than in adult voice, though the pubescent female voice was observed to be somewhat more stable than that of pubescent males. 83 The authors attributed this F0 instability to the inconsistency between the gradually developing nervous control and the relatively rapidly growing peripheral speech mechanism. As males and females move through puberty and into adulthood, the range of usable vocal pitches will extend to approximately 2 to 3 octaves.
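
To relate the group means reported by Pedersen et al to the octave figures quoted above, the drops can be expressed on a logarithmic scale. The following is a minimal sketch. The 500 Hz infant value is an assumed midpoint of the 400 to 600 Hz range cited earlier and is used purely for illustration, and the function name is hypothetical.

```python
import math


def octaves_between(f_high_hz, f_low_hz):
    """Distance in octaves between two frequencies (one octave = a 2:1 ratio)."""
    return math.log2(f_high_hz / f_low_hz)


# Mean speaking F0 (Hz) for the three age groups reported by Pedersen et al.
group_means_hz = [273.0, 184.0, 125.0]

for younger, older in zip(group_means_hz, group_means_hz[1:]):
    drop_st = 12.0 * octaves_between(younger, older)
    print(f"{younger:.0f} -> {older:.0f} Hz: {drop_st:.1f} ST lower")

# Assumed infant F0 of 500 Hz (midpoint of the cited 400-600 Hz range).
print(f"{octaves_between(500.0, 125.0):.1f} octaves from infancy to late adolescence")
```

Under these assumptions, the drop across the three pubertal groups is a little over one octave, and the drop from the assumed infant value to the oldest group is two octaves, in line with the approximation given for males.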


Vocal F0 during Adulthood and Senescence


Possible changes in voice characteristics, including pitch and frequency, have been documented as males and females progress through the lifespan toward senescence (a state of advanced aging in which a gradual deterioration in bodily function(s) may occur due to biological changes). As females move through adolescence into adulthood, mean F0 tends to be relatively stable in the vicinity of 180 to 230 Hz until the onset of menopause. Due to factors such as hormonal changes and edema, 25, 84 it has been observed that the pitch and F0 of the female voice may lower peri- and postmenopause. 35, 80, 82, 85 Awan reported that an 18- to 30-year-old group of females had significantly higher mean F0 than females in their 40s, 50s, 60s, and 70s. 41 In addition, significant differences in mean speaking F0 were observed between the 40- to 49- and 50- to 59-year-old groups and the 60- to 69- and 70- to 79-year-old groups. A significant inverse correlation (r = –0.69, p < .001) between mean speaking F0 and age in the female subjects studied was also reported. 41


The male vocal pitch and mean F0 appear to remain relatively stable post-adolescence and throughout adulthood with expected F0s in the 100 to 150 Hz vicinity. However, there have been studies that indicate that some men may experience a rise in the pitch and F0 of the voice during senescence. 22, 25, 34, 86, 87 It has been hypothesized that pitch and F0 elevation in the elderly male voice may be due to hormonal changes such as decreased secretion of testosterone resulting in a reduction in muscular tissues. 34, 82, 88 ▶ Fig. 4.23 incorporates data from various studies to illustrate possible changes in mean F0 with aging for males and females across the lifespan. 19, 22, 32, 35, 41, 86, 89 As in our previous discussion of vocal pitch, substantial variation in vocal F0 with aging should be expected. However, it is interesting that males and females at both extremes of the lifespan may have a tendency to produce voices of similar pitch and F0. 1





Fig. 4.23 Prospective changes in mean F0 across the lifespan. Male versus female differentiation in mean F0 occurs during puberty. Greatest separation between the sexes occurs postpuberty until the late 40s/early 50s. A tendency for a lowering of female F0 and an increase in male F0 may result in decreasing differentiation in the mean F0 of the sexes in older age.


In addition to possible changes in mean F0, evidence exists that increased variability of vocal pitch and F0 during both continuous speech and sustained vowel production may occur in senescence. In some senescent speakers, a wider range and increased variability of F0 may be observed during intonation in continuous speech. 35, 90 In addition, increased pitch and F0 variability resulting in a “shaky” or even tremulous sustained vowel production has been reported in elderly versus younger speakers. 52, 91, 92 These possible changes in vocal F0 variability may be due to decrements in sensory feedback, decreased speed and accuracy of motor control, neurochemical changes in the basal ganglia, and structural and physiological changes in the laryngeal mechanism. 93, 94 Declines in pitch/frequency range have also been observed in senescent persons, with reductions in usable pitch and F0 range and particular limitations in the production of higher pitch and F0 levels. 69, 95, 96


Possible Effects of Race on Vocal F0


The evidence regarding a possible effect of race on speaking F0 has been equivocal. A number of studies have reported significantly lower speaking F0s for African American speakers compared to that for Caucasians of comparable age. 26, 97, 98 The possibility of lower F0s in African American speakers was attributed to possible anatomical differences between racial groups such as increased size of laryngeal structures. 98, 99, 100, 101, 102 In contrast to those studies that have speculated on a lower speaking F0 for African American speakers, several studies have reported no significant difference in the speaking F0 of African American versus Caucasian subjects. 103, 104 Awan and Mueller examined the speaking F0 characteristics of Caucasian, African American, and Hispanic kindergartners and reported mean speaking F0 to be significantly lower in African American versus Hispanic children (236.4 vs. 248.5 Hz) but not different from Caucasian children (236.4 vs. 241.7 Hz). 39


Caution should be exercised when applying the normative speaking F0 data collected solely from one racial group to decisions regarding the speaking F0 of other racial groups. 1 Regardless of whether possible anatomical differences in the speech/voice mechanisms of different races exist or not, there may be significant linguistic and societal differences (e.g., differences in pragmatic and interaction styles) between racial and/or dialect groups that influence speaking F0 characteristics. 1,​ 104


Effects of Vocal Training on Vocal F0


Vocal training may result in increased vocal capability and capacity. Persons with vocal training have been reported to learn and use different mechanisms of laryngeal control that may result in an expansion of the phonational F0 range and, particularly, the ability to produce high F0s. 105 As an example, a mode of phonation used by trained singers at high frequencies has been described in which only portions of the vibrational length would be used during vocal fold vibration, analogous to shortening the length of a vibrating string to produce a higher pitch tone than the open string by pressing a finger firmly at some intermediate point along the string’s length. 106,​ 107 In addition, trained singers may use isometric contractions of the laryngeal musculature to produce wide ranges of acoustic output by varying the tension of the vocal fold independent of changes in vocal fold lengthening that are primarily responsible for vocal fold tension in untrained subjects. 88


It has been speculated that, unlike untrained vocalists, the trained singer may be able to maintain an acceptable degree of vocal quality and control at even the limits of their vocal capacity and thereby achieve similarity between the “musical” range of phonation and the “physiological” range. 1,​ 61,​ 62,​ 66 In contrast, the “untrained” vocalist often produces a “musical” range of controlled phonation that is distinctly smaller than the “physiological range” that may be available in relatively uncontrolled activities (i.e., no intent to produce or sustain particular target pitch levels) such as screaming, crying, shouting, etc.


4.4.5 Measures of Vocal F0 in Voice-Disordered Persons


When considering measures of F0, the possible effects of age, sex, and race must be considered in both normal and voice-disordered persons. With that in mind, the following is a discussion of several disorder types that have been reported to present with changes in the expected F0 of the voice. Because frequency is a direct correlate of perceived vocal pitch, expectations of changes in vocal frequency will be consistent with expected changes in vocal pitch for various disorders reported in a previous chapter.


4.4.6 Mass Lesions and Distributed Tissue Change


Lesions (e.g., nodules, polyps, and tumors) or added mass (e.g., edema) that develop on or within the vocal fold tissue may have variable effects on vocal pitch and F0 characteristics. 108 In some cases, mean F0 may be lowered because of increased vocal fold mass as in cases of edema, benign growths, and erythema. 108, 109, 110, 111 In contrast, vocal pitch and F0 may increase in these same aforementioned conditions due to increased vocal fold stiffness, or may not be substantially affected at all by the presence of laryngeal tissue changes. 108, 112, 113, 114


The presence of vocal fold lesions or distributed tissue change and associated increased mass and stiffness of the vocal folds may alter the control of vocal fold vibration and therefore result in (1) increased pitch variability during sustained vowel productions and (2) monopitch, limited intonation voice production during continuous speech. 110 Decreased pitch and F0 range may be observed because the presence of lesion(s) or distributed tissue change results in increased stiffness of the vocal fold(s) and a restricted range of lengthening/tensing motion. 108,​ 109


Smoking is a form of vocal abuse that is associated with the development of mass lesions and distributed tissue change. Several studies have reported changes in vocal F0 in smokers, including a decrease in mean F0, an increase in F0 variability in sustained vowel productions, and limitations in total phonational F0 range. 84, 115, 116, 117, 118, 119, 120


Neurological Disorders


Because neurological disorders may affect motor control in a variety of ways, the possible effects of neurological disorders on vocal F0 are highly varied. Increased mean F0 has been reported for patients with Parkinson’s disease who experience rigidity of the laryngeal musculature. 121, 122 In contrast, both increased and decreased mean F0 have been reported in certain subjects with amyotrophic lateral sclerosis (ALS), 33 and one report found no significant differences in mean F0 between groups of patients with hypo- and hyperkinetic dysarthrias and normal controls. 123 Flaccidity of the vocal fold musculature may result in increased effective mass resulting in a reduction in mean F0.


While mean F0 may be variably affected in dysarthric states, F0 variability during speaking has been consistently reported to be reduced (monopitch and flattened F0 contours) in various conditions such as ataxia, right hemisphere lesions, and particularly in patients with Parkinson’s disease. 124, 125, 126, 127, 128, 129 Increased F0 SD in sustained vowel productions has also been reported in Huntington’s and cerebellar ataxic groups. 123, 124


Muscle Tension Dysphonia/Hyperfunctional Voice


While excessive tension within the vocal musculature may be secondary to neuromotor dysfunction, increased tension in the laryngeal musculature is also observed in many functional voice disorders. In functional cases, the increased tension may occur as a result of factors such as emotional stress, fatigue, use of inappropriate pitch level, or compensatory activity developed in conjunction with organic disturbance (e.g., laryngitis). 1 Muscle tension dysphonia (MTD) is a term used to describe a form of vocal hyperfunction characterized by a presumed overactivation and dysregulation of muscles in and around the laryngeal region in the absence of organic or neurologic laryngeal disorders. 130 When excessive muscular activity occurs in those muscles that affect the lengthening and tensing aspects of vocal fold function, the result may be effort, possible strain, and an increase in pitch and F0 level. 130,​ 131


Although increased pitch is most commonly associated with excessive tension, decreased pitch has also been reported in this condition 1,​ 132,​ 133 and may be due to increased tension localized to the thyroarytenoid/vocalis muscle(s) and/or anterior-posterior squeezing of the larynx resulting in vocal fold compression and lower pitch and F0. 133


Puberphonia/Mutational Falsetto


As previously stated, a substantial lowering of the pitch and F0 of the voice is generally observed as males move from childhood to adulthood via puberty. In puberphonia/mutational falsetto, the male subject retains a childlike, higher pitched voice after puberty, resulting in a voice and age mismatch and pitch abnormality (see previous description). Luchsinger and Arnold classified disturbed mutation in boys into three clinical forms: (1) delayed mutation, (2) prolonged mutation, and (3) incomplete mutation. In all three forms, the voice is characterized by high pitch and F0, highly variable pitch and F0 production with frequent pitch breaks, and chronic hoarseness. 65 Several studies have described a reduction in mean F0 as a key outcome of voice therapy for these patients. 134,​ 135,​ 136


Psychological Disorders


Pitch and F0 variability has been reported to be substantially reduced in speech contexts (monopitch, flattened intonation) consistent with flat affect in cases of both schizophrenia and severe clinical depression. 110,​ 137,​ 138,​ 139


Deafness/Hearing Impairment


Patients with congenital or early childhood hearing losses of more than mild severity may be observed to have poor control over vocal F0 characteristics. Increased mean F0 in speech contexts has been reported for deaf/severely hearing impaired speakers. 19,​ 140,​ 141 In addition, reduced SD of F0 19,​ 142 and highly variable speaking F0 141 (flattened F0 contours as well as extreme variations in F0 during speech intonation) have also been reported.


▶ Table 4.5 provides a summary of general expectations for possible changes in vocal F0 characteristics in the aforementioned disorders.

Table 4.5 A summary of general expectations for vocal F0 characteristics in various disorder types

| Disorder | Possible effect on vocal F0 | Comments |
|---|---|---|
| Mass lesions and distributed tissue change | Variable | Decreased mean F0 with conditions resulting in increased vocal fold mass; increased mean F0 with increased vocal fold stiffness. Increased mass and increased stiffness may counteract to result in minimal change in mean F0. Increased F0 variability in sustained voicing; reduced F0 variability in speech (consistent with monopitch). Decreased F0 range. |
| Neurological disorders | Variable | Increased mean F0 with hypertonia/rigidity (e.g., spasticity; hypokinesias such as Parkinsonism); possibly decreased in flaccidity. Increased F0 variability in sustained vowel productions; reduced F0 variability in speech (consistent with monopitch; flattened F0 contours). |
| Muscle tension dysphonias (MTDs)/hyperfunctional voice | Variable | Increased mean F0 with increased vocal fold tension and stiffness; mean F0 may be decreased if tension is focal to the TA/vocalis muscle or results in A-P laryngeal squeezing. |
| Puberphonia/Mutational falsetto | Increased mean F0; increased F0 variability in sustained voicing and in speech (consistent with pitch breaks) | Retains childlike, high mean F0 even though physical development is consistent with transition to adulthood. |
| Psychological disorders | Reduced F0 variability in speech intonation | Flattened intonation consistent with flat affect in cases of both schizophrenia and severe clinical depression. |
| Deafness/Hearing impairment | Increased mean F0; F0 variability in speech may be reduced (consistent with monopitch) or excessive | Patients with congenital or early childhood hearing losses of more than mild severity may be observed to have poor control over vocal F0 characteristics. |


4.4.7 Limitations in the Measurement of F0


Regardless of the computer algorithm that is used for the computation of mean F0, certain situations will tend to result in miscalculations in estimating vocal frequency. 143 For measures of F0 to be valid, the acoustic analysis program must be able to identify the fundamental period of the signal. The rapid pitch changes observed during intonation patterns, the effects of voiced-to-voiceless or voiceless-to-voiced transitions, noise within the signal, and (as in various forms of dysphonia) disturbance to the periodicity of voice are all situations in which difficulties in frequency tracking may occur.


Program parameters that dictate how estimates of vocal F0 are calculated must also be monitored closely by the program user. As an example, most F0 analysis programs such as those that utilize autocorrelation or waveform matching require the user to provide a frequency range within which the program will search for F0 estimates. For the analysis of the continuous speech of a typical adult male, a range of 80 to 300 Hz would comfortably encompass the mean F0 of most typical adult males, while also providing an adequate range within which the frequency changes observed within the intonation patterns of the speaker can be detected. If the range is too wide (e.g., 80–1,000 Hz), the program may produce numerous high-frequency errors in estimating F0. In contrast, if the range is too limited (e.g., 80–200 Hz), the algorithm may be restricted in its ability to identify F0 variations observed during normal intonation patterns. In addition, the use of pitch smoothing algorithms will be expected to have an effect on measures of F0 range and SD (e.g., increased smoothing will reduce measures of F0 range and average variation for speech samples). The user of a particular program should closely review instructions on how to set these parameters for optimal F0 processing.
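
The effect of an overly narrow search range can be illustrated with a toy estimator. The sketch below is a deliberately simplified autocorrelation peak picker applied to a synthetic three-harmonic tone. It is not the algorithm used by Praat or any other program, and the function names, the 8,000 Hz sampling rate, and the 250 Hz test tone (standing in for an intonation peak) are all assumptions made for illustration.

```python
import math


def synth_voice(f0_hz, sample_rate=8000, n_samples=400):
    """Synthesize a steady tone with three harmonics as a stand-in for a voiced frame."""
    harmonics = ((1, 1.0), (2, 0.5), (3, 0.25))  # (harmonic number, amplitude)
    return [
        sum(amp * math.sin(2 * math.pi * f0_hz * h * i / sample_rate)
            for h, amp in harmonics)
        for i in range(n_samples)
    ]


def estimate_f0(frame, sample_rate, floor_hz, ceiling_hz):
    """Return the F0 whose lag has the largest autocorrelation within the search range."""
    min_lag = math.ceil(sample_rate / ceiling_hz)   # shortest period searched
    max_lag = math.floor(sample_rate / floor_hz)    # longest period searched
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


frame = synth_voice(250.0)
print(estimate_f0(frame, 8000, 80, 300))  # search range includes the true F0
print(estimate_f0(frame, 8000, 80, 200))  # ceiling below the true F0
```

With the 80 to 300 Hz range, the estimator recovers the 250 Hz tone. With the 80 to 200 Hz range, the true period lies outside the search window, so the estimator settles on the next best candidate one octave lower (125 Hz). Real programs add voicing decisions, candidate weighting, and path smoothing, so their behavior will differ in detail.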


Different software programs with different F0 analysis algorithms have been reported to have a high degree of correspondence and strong interprogram correlations in estimates of mean speaking F0, regardless of sex or age of the subject group producing the samples. 54 These results are similar to those reported by Bielamowicz et al for F0 estimations obtained from sustained vowels. However, poorer correspondence and weaker interprogram correlations have been reported for measures of F0 SD. 54,​ 144


4.4.8 Vocal Sound Level (a.k.a. Vocal Intensity)


Sound is the perception of pressure changes in the medium (typically, air). 68 The perceptual attribute that corresponds to the magnitude of the pressure changes is loudness, and the magnitude of these pressure changes is reflected in the amplitude of the wave. The sound intensity level (SIL) of a signal refers to the power of the signal (proportional to the square of the pressure), while the sound pressure level (SPL) represents a comparison of the sound pressure of an observed signal to a reference sound pressure. Because both SIL and SPL give similar results in decibels (dB), it has been suggested that we simply refer to this measurement of sound power/pressure as sound level (SL), though the term vocal intensity is ubiquitous in the voice literature. 145
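The reason SIL and SPL agree in decibels can be shown with a short sketch. The reference values used here (20 µPa for pressure and 10⁻¹² W/m² for intensity) are the conventional references for sound in air and are not stated in the text above; the function names are illustrative.

```python
import math

P_REF = 20e-6   # conventional reference sound pressure in air, pascals
I_REF = 1e-12   # conventional reference sound intensity, watts per square meter


def spl_db(pressure_pa):
    """Sound pressure level: 20 * log10 of the pressure ratio."""
    return 20 * math.log10(pressure_pa / P_REF)


def sil_db(intensity_w_m2):
    """Sound intensity level: 10 * log10 of the intensity (power) ratio."""
    return 10 * math.log10(intensity_w_m2 / I_REF)


# Because intensity is proportional to pressure squared, a tenfold rise in
# pressure is a hundredfold rise in intensity, and both add the same 20 dB.
pressure_gain = spl_db(10 * P_REF) - spl_db(P_REF)
intensity_gain = sil_db(100 * I_REF) - sil_db(I_REF)
```

The factor of 20 in the pressure formula (versus 10 in the intensity formula) is what absorbs the squaring, so the two scales move together.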


The evaluation of vocal loudness and sound level/intensity provides us with valuable information about the coordination between phonatory and respiratory mechanisms since vocal loudness and intensity relate not only to the degree of respiratory force but also to the amplitude of vocal fold vibration. 146 Specific to phonatory function, Colton and Casper indicated that vocal loudness and intensity are affected by (1) glottal resistance to the expiratory airstream and (2) the rate of airflow change at the moment of closure. 3


Considerations in Measuring Vocal Intensity


Mouth to Microphone Distance

The measurement of vocal intensity (a.k.a. sound level) is somewhat more complicated than recording the voice for measures of frequency, as the measurement of intensity is greatly affected by the distance of the microphone from the mouth (i.e., mouth-to-microphone distance). While the frequency of the vocal signal will be constant regardless of whether the microphone is placed close to or relatively distant from the mouth, intensity is affected by the inverse square law, which states that, in a free field (i.e., one without reflections), the intensity of sound drops by 6 dB for each doubling of distance from the source. The fact that sound intensity decreases with distance from the source should be intuitive in that sound waves spread out in all directions like radiating spheres and, as the radius of the spheres gets larger, the amount of acoustic energy in the disturbance gets distributed over the expanding surface of the sphere. Therefore, the mouth-to-microphone distance substantially affects the intensity of the sound being measured and must be kept as constant as possible if accurate readings of vocal intensity are to be made.
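The inverse square law can be expressed as a level change of 20·log10 of the distance ratio. The sketch below assumes an ideal point source in a free field, which the mouth only approximates (especially at very short distances); the function names and the 70 dB example value are illustrative.

```python
import math


def level_change_db(from_distance_cm, to_distance_cm):
    """Change in sound level (dB) when moving the microphone from one
    distance to another, assuming a free-field point source."""
    return -20 * math.log10(to_distance_cm / from_distance_cm)


def level_at_distance(level_db, from_distance_cm, to_distance_cm):
    """Predicted level at a new distance, given a level at a known distance."""
    return level_db + level_change_db(from_distance_cm, to_distance_cm)


doubling = level_change_db(30, 60)          # about -6 dB per doubling
headset_to_30cm = level_change_db(4, 30)    # about -17.5 dB from 4 cm to 30 cm
example = level_at_distance(70, 30, 60)     # a 70 dB reading at 30 cm, moved to 60 cm
```

This is why a level reported at 4 cm cannot be compared directly with one reported at 30 cm: under these idealized assumptions the two differ by roughly 17.5 dB for the same voice.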


A mouth-to-microphone distance of 30 cm (approximately 12 in) has been suggested 147 for handheld or desktop mounted microphones versus approximately 4 cm from the lips at a 45-degree angle for headset microphones. 10,​ 11 In practice, the literature on vocal sound level/intensity shows a wide variety of mouth-to-microphone distances being reported. Therefore, it is essential that we know the mouth-to-microphone distance that was used in examples of reported clinical or research data. Headset microphones are strongly recommended because not only do they help considerably in maintaining a consistent mouth-to-microphone distance, but they also provide improved signal-to-noise ratios (due to the short distance from the lips).


Sound Level Meter

While consistent mouth-to-microphone distances help standardize our recordings, the distance by itself does not tell us the actual sound level/intensity of the subject’s voice. To actually measure vocal sound level/intensity in dB, we will need a sound level meter (SLM). A basic SLM consists of a microphone, an amplifier, a frequency weighting circuit, and a meter (analog or digital) calibrated in decibels. 148 While high-quality SLMs (i.e., Class 1 and Class 2 SLMs which comply with IEC or ANSI standards and provide an accuracy of ±1.5 and ±2 dB, respectively, within the frequency range of 100–1,000 Hz) can be quite expensive, a multitude of low cost SLMs are available (▶ Fig. 4.24) and may be acceptable for clinical purposes. 149,​ 150





Fig. 4.24 Examples of low-cost sound level meters. (a) Pyle PSPL25 sound level meter (https://www.pyleaudio.com/sku/PSPL25/Sound-Level-Meter-with-A-and-C-Frequency-Weighting); (b) Analog Sound Level Meter by Radio Shack (Model 33-2050).


Regardless of the SLM being used, it is important to recognize the type of frequency weighting being used, as these circuits can have a substantial effect on the sound level measurement provided by the SLM. Two commonly used frequency weighting circuits are A-weighting and C-weighting. The C-weighting circuit (a linear weighted circuit) is preferred for measures of vocal intensity because it (1) measures uniformly over the frequency range (up to approximately 10 kHz) and (2) does not discriminate against low frequencies such as those often found in the F0s of speech and most singing. In contrast, the A-weighted circuit reduces the influence of low-frequency ambient noise on sound level measurements by attenuating the low-frequency range (i.e., < 500 Hz), resulting in substantially reduced intensity measurements for lower frequency voice productions such as those in the vicinity of the vocal F0 and other modal register phonations in most subjects. 151
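To give a sense of the size of the weighting effect, the sketch below evaluates the standard analytic A- and C-weighting curves (as defined for sound level meters in IEC 61672) at a few frequencies. These are the nominal curves; the circuit in any particular SLM may deviate within its tolerance class. Function names are illustrative.

```python
import math


def a_weighting_db(f_hz):
    """Nominal A-weighting gain (dB) at frequency f_hz."""
    f2 = f_hz ** 2
    r = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20 * math.log10(r) + 2.00


def c_weighting_db(f_hz):
    """Nominal C-weighting gain (dB) at frequency f_hz."""
    f2 = f_hz ** 2
    r = (12194.0 ** 2 * f2) / ((f2 + 20.6 ** 2) * (f2 + 12194.0 ** 2))
    return 20 * math.log10(r) + 0.06


# Compare the two curves near a typical adult male F0 and at 1 kHz.
for f in (100, 200, 1000):
    print(f"{f:>5} Hz  A: {a_weighting_db(f):6.1f} dB   C: {c_weighting_db(f):6.1f} dB")
```

At 100 Hz, a frequency in the region of many adult male F0s, the A curve attenuates by roughly 19 dB while the C curve attenuates by well under 1 dB, which is consistent with the preference for C-weighting described above.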


Although low-cost or free SLM apps are also available for iOS (e.g., iPhone) and Android-based (e.g., Samsung Galaxy) smartphones, these apps do not guarantee sufficient accuracy. 152 In addition, regardless of the quality of the app itself, the characteristics of the microphone or the sound acquisition hardware built into smartphones are often unknown. Because of the potential for poor reliability across various mobile devices, operating systems, and apps, smartphones should be considered unclassified SLMs, and the sound level/intensity measures obtained from them should be used only for clinical approximations. 149


Recording Environment

The environment in which the recording takes place can also have a substantial effect on our measures of sound level/intensity. Ideally, recordings would take place in a soundproofed or treated environment. For measures of sound level, a room should be selected in which the ambient noise level is at least 10 dB lower than the expected lowest sound level/intensity phonation (optimally, <38 dB(C) for measurements at 30 cm distance or <48 dB(C) for measurements with omnidirectional head-mounted microphones) and reverberations (i.e., echoes) from reflective surfaces are minimized. 153
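The 10 dB criterion above amounts to a simple comparison, sketched below. The function name and the example levels are hypothetical; the 10 dB margin comes from the text.

```python
def ambient_noise_acceptable(ambient_db, lowest_phonation_db, margin_db=10.0):
    """True if the room's ambient noise level is at least `margin_db` below
    the quietest phonation expected from the speaker (same weighting and
    same microphone position for both values)."""
    return ambient_db <= lowest_phonation_db - margin_db


# Hypothetical example: quietest expected phonation of 50 dB(C) at 30 cm.
quiet_room = ambient_noise_acceptable(ambient_db=36, lowest_phonation_db=50)
noisy_room = ambient_noise_acceptable(ambient_db=45, lowest_phonation_db=50)
```

Both values should be measured with the same frequency weighting and at the same microphone position, since the ambient limits quoted above differ for 30 cm and head-mounted placements.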


4.4.9 Measurement of Modal Vocal Intensity (dB)


The SLM can be used to directly give us a measure of the vocal sound level via its built-in microphone or to calibrate the handheld or headset microphone that we are using for all of our recordings (including previously mentioned recordings for measures of vocal frequency). The modal vocal sound level/intensity may be obtained from the same continuous speech sample(s) (i.e., portions of the Rainbow Passage—second sentence or second and third sentences; CAPE-V sentences) as used in vocal frequency analyses. Here are a number of methods by which the sound level/intensity of the voice may be measured:




  1. The SLM can provide a real-time (i.e., almost “instantaneous”) estimate of the modal sound level/intensity (i.e., the most frequently occurring sound level/intensity). Turn on the SLM and place it (e.g., on a tripod) or hold it at a 30 cm (∼1 ft) mouth-to-microphone distance. Set the range selector to 70 dB. At this setting, the meter will respond to intensity levels between ≈ 60 and 80 dB (70 dB is indicated when the meter points to 0). The 70-dB setting is suggested because conversational speech levels generally range from 65 to 70 dB when measured at a 30 cm (1 ft) mouth-to-microphone distance. Select C-weighting and slow response. As the patient is reading, closely watch the decibel meter for the most frequently occurring intensity level. The meter will be in a constant state of fluctuation. However, the use of the slow-response setting should allow the meter to “linger” somewhat at the displayed intensity levels, making it easier to identify the most commonly/consistently occurring intensity value (i.e., the modal intensity level). For an approximation of a mean vocal intensity level, record observations of the intensity of speech approximately every 3 seconds during the reading of a standard passage (e.g., the first paragraph of “the Rainbow Passage”). The list of intensity observations can be reviewed for the modal intensity level or averaged to give an approximate mean intensity level (▶ Appendix 4.1). The clinician should be able to record approximately 9 to 10 intensity observations for a patient reading the first paragraph of “the Rainbow Passage” at a normal speaking rate.




  2. The headset microphone may be precalibrated using procedures consistent with Winholtz and Titze 10 and Asplund 154 to indirectly obtain intensity values by converting the microphone signal to a decibel (dB) level. Specifically, to precalibrate the headset microphone, a subject is asked to sustain a vowel /a/ at a comfortable pitch and loudness, with a headset microphone placed a fixed distance away from the mouth (e.g., 4 cm) and an SLM placed 30 cm/1 ft away from the mouth (▶ Fig. 4.25). 154 The headset microphone signal can be recorded into a sound acquisition program (e.g., Praat, Audacity) while the SLM is simultaneously monitored for the modal sound level (in dB). The mean amplitude (e.g., root mean squared [RMS] amplitude) of the recorded signal corresponding to the known SL/intensity level at 30 cm is obtained. For future recordings, the amplitude of the subject’s voice production can now be compared to the reference signal amplitude (from 30 cm/1 ft) and be converted to decibels using a standard decibel formula (e.g., $\mathrm{dB} = 20 \log_{10}(P_M / P_R)$, where $P_M$ is equivalent to the RMS amplitude of the signal being measured and $P_R$ is equivalent to the RMS amplitude of the reference signal).
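The two methods above can be sketched in a few lines. This is an illustration rather than the cited authors' procedure: the meter readings and sample values are hypothetical, the function names are invented, and adding the decibel difference to the SLM reading taken during calibration is an assumption about how the reference level is applied.

```python
import math
import statistics


def summarize_readings(readings_db):
    """Method 1: modal and approximate mean level from a list of SLM
    readings noted about every 3 seconds during passage reading."""
    return statistics.mode(readings_db), statistics.mean(readings_db)


def rms(samples):
    """Root mean squared amplitude of a recorded signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def calibrated_level_db(rms_measured, rms_reference, reference_level_db):
    """Method 2: convert a headset-microphone RMS amplitude to dB by
    comparing it with the RMS amplitude recorded during calibration
    (assumption: the result is expressed relative to the SLM reading
    taken at 30 cm during that calibration)."""
    return reference_level_db + 20 * math.log10(rms_measured / rms_reference)


# Method 1, hypothetical readings from one paragraph of a reading passage.
readings = [68, 70, 69, 70, 71, 70, 67, 70, 69]
modal_db, mean_db = summarize_readings(readings)

# Method 2, hypothetical calibration: the SLM read 70 dB at 30 cm while the
# headset signal had an RMS amplitude of 0.1 (arbitrary recording units).
later_level = calibrated_level_db(rms_measured=0.2,
                                  rms_reference=0.1,
                                  reference_level_db=70)
```

Note that the calibration only holds while the headset position, microphone gain, and recording settings remain exactly as they were when the reference was recorded.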





Fig. 4.25 Example procedure by which a headset microphone (e.g., 4 cm mouth-to-microphone distance) may be calibrated in reference to a sound level meter placed at a standard 30 cm mouth-to-microphone distance (see text for full description). The calibration “tone” may be a steady sustained vowel production or a tone (e.g., 150 Hz triangular wave) produced via a speaker.

Feb 25, 2020 | Posted by in OTOLARYNGOLOGY | Comments Off on Acoustic Analysis of Voice
