Acoustic Theory of Vowel Production
Speech scientists and speech-language pathologists are indebted to Gunnar Fant, a Swedish speech scientist, and Kenneth Stevens, a Canadian speech scientist who spent his career at the Massachusetts Institute of Technology in Cambridge, Massachusetts, for the development of the theoretical basis of speech acoustics. Fant worked on the theory in the 1940s and 1950s, and published his classic book, titled Acoustic Theory of Speech Production, in 1960. Stevens began publishing his work in the 1950s and in 1998 published a text titled Acoustic Phonetics. Previously, two Japanese scientists (Chiba & Kajiyama, 1941) developed a mathematical theory of vocal tract acoustics similar to Fant’s, but this work was essentially unknown in Western hemisphere countries until well after World War II. Fant, Stevens, and several other scientists continued to develop and refine the theory in the 1950s, 1960s, and 1970s. Indeed, the theoretical development continues today. In particular, a text published by Flanagan (1972) and more recent work (e.g., Lammert & Narayanan, 2015; Story, 2005; Story & Bunton, 2017) are excellent sources for the advancement of speech acoustic theory. Much of the information in this and the following chapter is drawn from these sources.
The acoustic theory of vowel production can be stated in very broad terms, as follows: for vowel production, the vocal tract resonates like a tube closed at one end, and shapes an input signal generated by the vibrating vocal folds. The two major concepts suggested in this broad statement of the theory—the resonance patterns of a tube closed at one end, and the shaping of an input by a resonator—are covered in Chapter 7. For this chapter, the broad statement of the theory refers only to vowel production. The theory is most precise for the case of vowels, primarily because its mathematical basis works best for the resonant frequencies of vowels (compared with many consonants). The theory also addresses consonant acoustics, which is covered in Chapter 9. To explore the acoustic theory of vowel production in greater depth, this chapter addresses the following set of questions:
1. What is the precise nature of the input signal generated by the vibrating vocal folds?
2. Why should the vocal tract be conceptualized as a tube closed at one end (compared with open at both ends)?
3. How are the acoustic properties of the vocal tract determined?
4. How does the vocal tract shape the input signal?
5. What happens to the resonant frequencies of the vocal tract when the tube is constricted at a given location?
6. How is the acoustic theory of vowel production confirmed?
Professor Gunnar Fant (1919–2009) was a famous speech acoustician and is widely regarded as one of the fathers of speech acoustics. Fant was born in Sweden in 1919 and spent most of his career in the Department of Speech, Hearing, and Music at the Royal Institute of Technology (KTH) in Stockholm. Fant founded this department in 1951 as the Speech Transmission Laboratory, after spending 1949 and 1950 at the Massachusetts Institute of Technology working with another giant in the field, Professor Kenneth Stevens (1924–2013). Over the years, Fant’s department generated a wealth of valuable research, all of which was reported in a famous, recurring publication called the KTH Speech Transmission Laboratory Quarterly Progress Report. Fant’s 1960 book, Acoustic Theory of Speech Production, is one of a very few knowledge touchstones for the serious speech scientist. Professor Stevens, a second intellectual father of the field, published the magnificent 1998 text Acoustic Phonetics, which can be considered a successor to, and an enlargement of, Fant’s classic book. The student who reads and comprehends both texts can claim to be well informed about the many aspects of speech acoustics.
WHAT IS THE PRECISE NATURE OF THE INPUT SIGNAL GENERATED BY THE VIBRATING VOCAL FOLDS?
The periodic vibration of the vocal folds provides the input signal for vowels to the vocal tract resonator. This periodic vibration is referred to as the source for vowel acoustics. As discussed in Chapter 7, any signal can be studied in both the time and frequency domains. Much of what follows is a condensation of work done by Fant (1979, 1982, 1986) as refinement of the theory first published in 1960.
The time-domain characteristics of the signal produced by vocal fold vibration are complex. The larynx, a structure of cartilage, membrane, ligament, and muscle (Chapter 3), is not easily accessible for direct measurement of vocal fold behavior. When a microphone is placed directly in front of a speaker’s lips while he or she phonates a vowel, the recorded acoustic event will reflect the combination of source (vocal fold) and resonator (vocal tract) acoustics. There is no simple way to look at the waveform (time domain) of a vowel recorded in this way (such as that shown in Figure 7–8B) and identify the time or frequency characteristics due only to vocal fold vibration. Some other approach must be found to separate the waveform of a recorded vowel into the parts contributed by (a) the vibrating vocal folds and (b) the resonating vocal tract.
One of the earliest attempts to understand the details of vocal fold vibration was described by Farnsworth (1940), who took high-speed motion pictures of the vibrating vocal folds by filming images of the folds as reflected in a laryngeal mirror. When played back in slow motion, these films allowed Farnsworth to view, on a frame-by-frame basis, movements of the vocal folds and the changing configuration of the glottis (the space between the vocal folds) throughout individual cycles of vocal fold vibration. A sequence of images of one vocal fold cycle, like those examined by Farnsworth but collected with a contemporary device, is shown in Figure 8–1. The cycle begins with the vocal folds fully approximated (image 1). The folds separate gradually to a maximum width of the glottis (image 5), then begin to move back to the midline until they are once again fully approximated (images 6–10). Scientists examined images such as these and for each frame of a vocal fold cycle measured the width and length of the glottis. These measurements allowed them to derive the glottal area, based on the width and length measures, on an image-by-image basis. Glottal area as a function of time was plotted for a complete cycle of vocal fold vibration.
A typical glottal area function, symbolized as Ag, is shown in Figure 8–2A for two cycles of vocal fold vibration. The baseline in this plot represents full approximation of the vocal folds (i.e., Ag ~0), and upward movement of the function indicates increasingly greater separation of the vocal folds (i.e., increasing Ag). One cycle of vocal fold vibration is defined as the interval between successive separations of the vocal folds, as marked in Figure 8–2. Note that the vocal folds are fully approximated for a substantial portion of each cycle (nearly 40% of each cycle). The moment immediately before the vocal folds separate has been chosen arbitrarily as the initiation of each cycle. The duration of these cycles of vocal fold vibration may range from as little as 1 ms or less (for some opera or pop singers who can produce extremely high-pitched notes) to the more typical 5 ms (adult women, as shown in Figure 8–2) or 8 ms (adult men).
The Ag function shown in Figure 8–2 is not an acoustic signal, but rather reflects a pattern of vibration that produces an acoustic signal. How does one obtain the acoustic signal associated with vocal fold vibration—separate from the influence of the cavities in the vocal tract—and how is this signal related to the Ag function just described?
Imagine that it were possible to suspend, immediately above the vocal folds, a device that measures the magnitude of airflow streaming through the glottis as a function of time. When the vocal folds separate during a vibratory cycle (e.g., during phonation of a vowel), airflow through the glottis is expected because speech is produced with tracheal pressures greater than those in front of the lips (i.e., Patm), and air always flows from regions of higher pressure to regions of lower pressure (Chapter 3). Intuitively, the magnitude of this airflow should be zero when the vocal folds are fully approximated, and maximum when the vocal folds are widest apart (when Ag is the largest). In other words, the airflow coming through the glottis should increase as Ag increases, and decrease as Ag decreases. A plot of the magnitude of airflow coming through the glottis as a function of time (Figure 8–2B) looks a lot like the Ag function shown in Figure 8–2A. The time-domain plot of airflow through the glottis is a glottal flow function, symbolized as g. Because g reflects movement of air molecules, which results in sound pressure waves (see Chapter 7), g is the proper signal to study as the acoustics of the source (vocal fold vibration) in vowel production.
Figure 8–1. Successive images from one complete cycle of vocal fold vibration recorded via videostroboscopy (see Chapter 6). The cycle begins in the upper left frame (image 1) with the vocal folds approximated. The folds begin to separate in image 2 and reach maximum separation in image 5. Closing of the vocal folds takes place in images 6 through 10. Images provided courtesy of KayPENTAX, Montvale, NJ. Reproduced with permission.
The g function shown in Figure 8–2B looks very much like the Ag function (the detailed differences between the two types of waveforms will not be discussed in this text). Nevertheless, as noted above, actual measurement of the flow coming through the glottis in a phonating human is extremely difficult, if not impossible. Instead, scientists obtain g signals using an indirect approach.
Imagine a situation in which the input signal is the vibration of the vocal folds and the filter is a resonance curve associated with a specific shape of the vocal tract. From the discussion of tube resonators in Chapter 7, and introductory comments concerning the vocal tract resonating like a tube closed at one end, multiple peaks are expected in the resonance curve. The input signal plus a resonance curve for the vocal tract tube resonator are shown in the upper part of Figure 8–3. When the input signal—here labeled “glottal source signal”—is applied to the vocal tract filter, the resulting waveform is the output labeled “speech signal” (Figure 8–3, upper right panel). That speech signal represents the blending of the input and filter characteristics. Because of this blending, the output signal cannot reveal the exact characteristics of the input (source) signal unless the output signal is separated into its source and filter parts.
Figure 8–2. A. Glottal area function (Ag) for two cycles of vocal fold vibration. Upward displacements represent increasingly larger glottal areas, which are proportional to glottal widths measured from images like those in Figure 8–1. B. Glottal airflow function (g) obtained by inverse filtering (see text). Upward displacement indicates increasing magnitude of airflow passing through the glottis.
A technique to recover the source signal from the blended signal is called “inverse filtering,” as demonstrated in the lower part of Figure 8–3. On the left is the speech signal, the same one as in the upper right-hand panel of the figure. This is the “blended” signal reflecting the influence of both the source and vocal tract filter. The blended signal serves as input to a resonance curve that is “flipped” like a mirror image of the vocal tract filter shown in the top of the figure. In the flipped, or inverse, filter, there are valleys at the precise locations of the peaks in the upper filter function. If the inverse filter is constructed correctly, when the “blended” input signal is run through it, the resonances are subtracted from the signal, and what remains at the output of the filter is the glottal source signal. This is shown in the lower right panel as “recovered glottal source signal.” In essence, the sequence of the bottom panels reverses that of the top panels, with the special adjustment of flipping the filter function.
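The logic of Figure 8–3 can be sketched in a few lines of code. The example below is only an illustration under strong assumptions: the vocal tract is reduced to a single resonance (500 Hz, with an assumed 80-Hz bandwidth), the sampling rate and the toy source waveform are invented for the example, and the filter is assumed to be known exactly. In real inverse filtering the vocal tract filter is not known in advance and must be estimated.

```python
import math

SAMPLE_RATE_HZ = 8000  # assumed sampling rate


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate_hz=SAMPLE_RATE_HZ):
    """Feedback coefficients of a two-pole digital resonator (one spectral peak)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    theta = 2.0 * math.pi * freq_hz / sample_rate_hz
    return 2.0 * r * math.cos(theta), -r * r


def apply_filter(source, a1, a2):
    """Forward filter: blends the source with a resonance peak."""
    out = []
    for n, x in enumerate(source):
        y1 = out[n - 1] if n >= 1 else 0.0
        y2 = out[n - 2] if n >= 2 else 0.0
        out.append(x + a1 * y1 + a2 * y2)
    return out


def apply_inverse_filter(speech, a1, a2):
    """Inverse filter: a valley exactly where the forward filter has its peak."""
    out = []
    for n, y in enumerate(speech):
        y1 = speech[n - 1] if n >= 1 else 0.0
        y2 = speech[n - 2] if n >= 2 else 0.0
        out.append(y - a1 * y1 - a2 * y2)
    return out


# Toy glottal source: four 5-ms cycles, each with a slow rise, a faster fall,
# and a closed interval (hypothetical shape, for illustration only).
one_cycle = [i / 16 for i in range(16)] + [1 - i / 8 for i in range(8)] + [0.0] * 16
glottal_source = one_cycle * 4

a1, a2 = resonator_coeffs(500.0, 80.0)
speech_signal = apply_filter(glottal_source, a1, a2)          # top row of Figure 8-3
recovered_source = apply_inverse_filter(speech_signal, a1, a2)  # bottom row

max_error = max(abs(r - s) for r, s in zip(recovered_source, glottal_source))
print("largest difference between recovered and original source:", max_error)
```

Because the inverse filter here is built from the very coefficients used to create the speech signal, the recovery is essentially perfect. That is the idealized case; the difficulty of real inverse filtering lies in constructing the inverse filter correctly from the speech signal alone.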
Technical details of inverse filtering are not important here, and the technique is more complicated (and often trickier) than implied by the straightforward logic of Figure 8–3. For current purposes, the ability to recover a glottal source signal—such as the g signal—is the central issue. There are three important features of the g signal shown in Figure 8–2B.
First, as noted above for the Ag signal, the g signal is periodic, meaning that its characteristic shape repeats over time. The rate at which it repeats over time is the fundamental frequency (F0) of vocal fold vibration, or how many times per second the vocal folds go through complete cycles of vibration. In adult women, a typical F0 is around 190 to 200 Hz, in men around 115 to 125 Hz, and in 5-year-old children around 250 to 300 Hz (Kent, 1997; Lee, Potamianos, & Narayanan, 1999). As shown in Figure 8–2 the period (T) of this time-domain signal is easily measured, and the inverse of the period is the F0 (see Chapter 7 for a discussion of f = 1/T).
The second important feature of the g signal produced by most people in ordinary conversation is the shape of the opening and closing portions of each cycle. The slope of the opening phase is shallower than the slope of the closing phase, making each cycle appear as if it is “leaning to the right.” This shape feature is seen clearly in the g signal of Figure 8–2B. The steepness of the closing phase is important because it reflects how rapidly the vocal folds come together at the end of each cycle. The more rapidly the vocal folds come together, the steeper the closing part of the g signal. This has great importance to the frequency-domain characteristics of the source, as discussed in the next section.
The third important feature is that the g signal shows some portions where the vocal folds are apart (i.e., where airflow is coming through the glottis), and some portions where the vocal folds are approximated. The ratio of open time to closed time for each cycle, which in normal voices is typically around 1.5:1 (i.e., the vocal folds are open approximately 60% of each cycle and closed approximately 40%), may be an important determinant of how much of the source signal is periodic and how much is aperiodic. This is important when considering the physiological and acoustical basis of pathological voice quality.
Figure 8–2 shows that the g signal does not have a sinusoidal shape, but is periodic. Material covered in Chapter 7 suggests that the signal is a complex periodic waveform. Determination of the frequency components of a complex periodic waveform requires analysis of source acoustics in the frequency domain.
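The claim that a nonsinusoidal periodic waveform contains energy only at whole-number multiples of its repetition rate can be checked numerically. The sketch below builds a hypothetical g-like waveform (a slow opening, a faster closing, and a closed interval, repeating every 5 ms) and computes Fourier components at a few frequencies. The sampling rate and the exact waveform shape are assumptions made for the example, not measured data.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8000  # assumed sampling rate

# One 5-ms cycle (40 samples): opening slope shallower than closing slope.
ONE_CYCLE = (
    [i / 16 for i in range(16)]        # opening phase, 2 ms
    + [1 - i / 8 for i in range(8)]    # closing phase, 1 ms (steeper)
    + [0.0] * 16                       # closed phase, 2 ms
)
SIGNAL = ONE_CYCLE * 8                 # 8 cycles = 40 ms, repetition rate 200 Hz


def dft_magnitude(signal, freq_hz, sample_rate_hz=SAMPLE_RATE_HZ):
    """Magnitude of the discrete Fourier component of signal at freq_hz."""
    total = sum(
        x * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate_hz)
        for n, x in enumerate(signal)
    )
    return abs(total) / len(signal)


# 200, 400, and 600 Hz are multiples of the repetition rate; 100 and 300 Hz are not.
for freq in (100, 200, 300, 400, 600):
    print(freq, "Hz:", round(dft_magnitude(SIGNAL, freq), 4))
```

With these assumed values, the components at 100 and 300 Hz are zero to within rounding error, whereas the component at 200 Hz (the repetition rate) is clearly present.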
Figure 8–3. Schematic representation of steps in the inverse filtering process. Top row of panels: Glottal source signal serves as input to a multipeaked filter function associated with a vocal tract configuration and results in an output which is the speech signal recorded at the lips. This speech signal is a blend of the source signal and the filter characteristics. The three-panel sequence summarizes the source-filter theory of vowel acoustics. Bottom row of panels: The output signal shown in the top right waveform now serves as the input to an “inverse filter,” which is the “flipped” or mirror image of the filter function in the middle of the top row of panels. The inverse filter inverts the peaks shown in the top filter function and subtracts the resonances from the input signal. The result of the process is the recovered glottal source signal shown in the bottom right panel. Figure provided courtesy of James Hillenbrand, PhD, Western Michigan University, Kalamazoo, Michigan. Reproduced with permission.
The g signal shown in Figure 8–4A can be submitted to Fourier analysis to identify the frequency components contributing to this source waveform. A typical spectrum resulting from Fourier analysis is shown in Figure 8–4B. The important features of this spectrum are: (a) the series of frequency components at consecutive-integer multiples of the lowest-frequency component; and (b) the relative amplitudes of the frequency components that decrease systematically as frequency increases. This is the glottal source spectrum.
The lowest frequency of the glottal source spectrum is the fundamental frequency (F0), which corresponds to the rate of vibration of the vocal folds. The F0 is also called the first harmonic (H1) of the source spectrum. The other frequency components in the glottal source spectrum are whole-number multiples of the F0. There is a component at two times the F0 (the second harmonic, H2), three times the F0 (the third harmonic, H3), four times the F0 (the fourth harmonic, H4), and so on. In theory the number of harmonics in the glottal source spectrum is infinite, but the progressive reduction in relative amplitude with increasing frequency greatly limits the significance of very high frequency harmonics.
The reduction of energy (relative amplitude) in the harmonic components of the glottal source spectrum as frequency increases is clearly seen in Figure 8–4B. Moving from left (lower frequency) to right (higher frequency) on the x-axis, the vertical lines showing the amplitude of the components become progressively shorter. This energy reduction is systematic, with the relative amplitude decreasing 6 to 12 dB for each octave increase in frequency. This gives the typical glottal spectrum a distinctly “tilted” appearance. As discussed below, changes in the mode of vocal fold vibration affect the extent to which the glottal spectrum is “tilted.”
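The two features of the glottal source spectrum, harmonics at whole-number multiples of the F0 and a systematic decrease in relative amplitude per octave, can be expressed as a short calculation. The sketch below assumes an idealized spectrum in which the level falls along a straight line (a constant number of dB per octave) measured relative to the first harmonic; real glottal spectra only approximate this. The function names are illustrative.

```python
import math


def harmonic_frequencies(f0_hz, count):
    """Frequencies of the first `count` harmonics (H1 is the F0 itself)."""
    return [f0_hz * n for n in range(1, count + 1)]


def relative_level_db(harmonic_number, tilt_db_per_octave):
    """Level of harmonic n relative to H1, assuming a constant dB-per-octave tilt.

    Harmonic n lies log2(n) octaves above H1.
    """
    return -tilt_db_per_octave * math.log2(harmonic_number)


f0 = 200  # Hz, an F0 in the range given for adult women
for n, freq in enumerate(harmonic_frequencies(f0, 8), start=1):
    shallow = relative_level_db(n, 6)    # less tilted spectrum
    steep = relative_level_db(n, 12)     # more tilted spectrum
    print(f"H{n}: {freq} Hz  {shallow:6.1f} dB (6 dB/oct)  {steep:6.1f} dB (12 dB/oct)")
```

For example, H2 is one octave above H1 and H4 is two octaves above H1, so with a tilt of 12 dB per octave H4 is 24 dB below H1, whereas with a tilt of 6 dB per octave it is only 12 dB below.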
Figure 8–4. Time-domain (A) and frequency-domain (B) representations of acoustics resulting from vocal fold vibration. The time domain is represented by the glottal airflow waveform (g). The frequency domain is represented by the glottal spectrum resulting from vocal fold vibration and indicates F0, the fundamental frequency (first harmonic), and H2, H3, and H4 (the second, third, and fourth harmonics, respectively). The harmonics above H4 (arbitrarily chosen as the last-labeled harmonic on the graph) continue to be whole-number multiples of the F0.
In the summary of the time-domain characteristics of the glottal source, three major characteristics were identified, including: (a) the periodic nature of the waveform, (b) the shape of the waveform, and (c) the ratio of open to closed time. Discussion turns now to how each of these features affects the glottal source spectrum.
The Periodic Nature of the Waveform
The g waveform repeats over time, is not sinusoidal, and is, therefore, a complex periodic event. The repetition of the g waveform is not perfectly periodic, but rather has very small variations in the periods of successive glottal cycles. Vocal fold vibration is therefore referred to as quasi-periodic. Throughout this discussion, the term “period” refers to the average period—the small, period-to-period variations are not considered further here. The period of the glottal waveform depends on the rate of vibration of the vocal folds, which varies according to a number of factors, including sex and age (see Chapter 3). Figure 8–5 shows two g waveforms (left part of figure) having different periods, and their associated glottal spectra (right part of figure). Note that both waveforms show the g over a 40-ms interval. The top waveform has a period of 8 ms (typical of many adult males), the inverse of which is an F0 of 125 Hz. This 125 Hz F0 is shown as the lowest-frequency component in the glottal spectrum to the right of the waveform. The bottom waveform has a period of 5 ms (typical of many adult females), the inverse of which is an F0 of 200 Hz. This F0 is shown as the lowest-frequency component in the corresponding glottal spectrum. The harmonics of the female glottal spectrum (bottom) are more widely separated than the harmonics of the male glottal spectrum (top). This follows from the fact that the glottal spectrum consists of a consecutive integer series of harmonics: higher F0s yield greater spacing between successive harmonics compared with lower F0s. The glottal spectra of speakers with low F0s are more densely packed with harmonics compared with the glottal spectra of speakers with high F0s. This difference between the glottal spectra of low versus high F0s explains, in part, why spectrographic analysis of vowels produced by adult males tends to be easier than spectrographic analysis of vowels produced by adult females and children (see Chapter 10).
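The arithmetic behind Figure 8–5 is simple enough to verify directly. The sketch below converts the two periods to F0 values, counts the cycles that fit into the 40-ms interval, and counts the harmonics that fall at or below an upper frequency limit. The 4000-Hz limit is an arbitrary choice made for the example, and the function names are illustrative.

```python
def f0_from_period_ms(period_ms):
    """F0 in Hz is the inverse of the period (f = 1/T), with T given in ms."""
    return 1000.0 / period_ms


def cycles_in_window(period_ms, window_ms):
    """Number of glottal cycles that fit in a time window."""
    return window_ms / period_ms


def harmonics_up_to(f0_hz, max_hz):
    """Number of harmonics at or below max_hz."""
    return int(max_hz // f0_hz)


for label, period_ms in (("top waveform", 8), ("bottom waveform", 5)):
    f0 = f0_from_period_ms(period_ms)
    print(
        label,
        "| F0 =", f0, "Hz",
        "| cycles in 40 ms =", cycles_in_window(period_ms, 40),
        "| harmonics up to 4000 Hz =", harmonics_up_to(f0, 4000),
    )
```

Under these assumptions the 125-Hz source places 32 harmonics at or below 4000 Hz and the 200-Hz source only 20, which is what is meant by saying that the spectrum of a low-F0 voice is more densely packed with harmonics.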
Figure 8–5. Two g waveforms having different periods, and their associated glottal spectra. Top: period = 8 ms, F0 = 125 Hz. Bottom: period = 5 ms, F0 = 200 Hz. Both waveforms show g over a 40-ms interval. Note the greater number of cycles within this interval for the F0 = 200-Hz waveform (lower waveform), compared with the F0 = 125-Hz waveform (upper). Note wider spacing of harmonics in the glottal spectrum where F0 = 200 Hz (bottom) compared with F0 = 125 Hz (top).
The Shape of the Waveform
The opening and closing parts of the g waveform create a shape that appears to be “leaning to the right,” and the steepness of the closing slope reflects how rapidly the vocal folds return to the midline for each cycle. There is a systematic relationship between this closing slope and the “tilt” of the glottal spectrum: the steeper the closing slope in the g waveform (the faster the vocal folds return to the midline on each cycle), the less tilted the glottal spectrum. This relationship is exemplified in Figure 8–6, where the g waveform on the left has a steeper closing slope than the g waveform on the right (see arrows indicating slopes on the closing part of the waveforms). CP in Figure 8–6 is the closed phase of the glottal cycle, or the portion of each cycle when the vocal folds are fully approximated, whereas OP stands for open phase. Note the spectra associated with these two glottal waveforms. The glottal spectrum for the waveform with the relatively steep closing slope shows progressive reduction in relative amplitude with increasing frequency, but not nearly as dramatically as the glottal spectrum for the waveform with the shallower closing slope. The dashed line connecting the tops of the vertical lines in the two spectra shows the rapid reduction in energy across frequency when the closing slope in the time domain is shallow (right-hand part of figure) compared with when it is steep (left-hand part of figure). The spectrum with the dramatic reduction in harmonic energy is more tilted than the spectrum with the more gradual reduction in energy. In theory, a glottal spectrum with “no tilt” would be one in which the relative amplitudes of all harmonic components were equal (the dashed line connecting the tops of the vertical lines would be strictly horizontal), and a glottal spectrum with “infinite tilt” would be one in which there was energy at the first harmonic (F0), but at no other frequencies.
Figure 8–6. Schematic illustration of the speed of vocal fold closing and tilt of the glottal spectrum. The two upper panels show g waveforms, one with relatively rapid closing of the vocal folds (left), and one with relatively slow closing of the vocal folds (right). Relative speeds of closure are portrayed by the slope of the arrows alongside the closing phase of the functions. The glottal spectra immediately below the two waveforms demonstrate how speed of closure affects the tilt of the spectrum. CP = closed phase (time). OP = open phase (time).
Another way to express the concept of tilt of the glottal spectrum is to have a reference value of 9 dB per octave (midway in the range of 6–12 dB per octave given above) as the typical reduction in harmonic amplitude across frequency. The g waveforms with very steep closing slopes (e.g., Figure 8–6, left panel) have a smaller dB change per octave (<9 dB per octave), and those with very shallow closing slopes have a larger dB change per octave (>9 dB per octave).
These concepts are important in understanding the physiological and acoustical bases of hyperfunctional and hypofunctional voice disorders. In hyperfunctional voice disorders, the vocal folds move together too rapidly and forcefully on each closing phase of vocal fold vibration, resulting in a glottal spectrum with less than normal tilt, or too much energy in the higher-frequency harmonics. Listeners interpret the voice quality associated with reduced tilt of the glottal spectrum as abnormal, sometimes using the term pressed voice to describe what they hear. A pressed voice sounds overly effortful or strained. In hypofunctional voice disorders, the vocal folds move together more slowly and less forcefully, the result being a highly tilted glottal spectrum because there is so little energy in the higher-frequency harmonics. The quality of a voice with greater than normal tilt is often heard by listeners as weak, breathy, and thin.
The quasi-periodic nature of vocal fold vibration is largely a result of subtle aeromechanical imperfections. Vocal fold vibration is driven by aerodynamic forces and sustained by mechanical ones. These forces do not repeat themselves across successive cycles with absolute precision. If the forces do not repeat themselves exactly, the thing they are forcing—vibration of the vocal folds—does not either. In early versions of talking computers (speech synthesizers), scientists used a perfectly periodic, complex tone to simulate the source for vowels. Listeners didn’t like it. It sounded mechanical, robotic, unfriendly. The solution was to take this complex periodic waveform and introduce into it a small amount of “jitter,” or very minimal variation in the cycle-to-cycle period. Listeners found this much more pleasing. More human, you could say.
The Ratio of Open Time to Closed Time
For each cycle in a typical g waveform, the vocal folds are apart about 60% of the time, and approximated about 40% of the time. The two waveforms in Figure 8–6 show that a waveform with a shallower (slower) closing phase is also likely to have more open time throughout a complete cycle. Similarly, a waveform with a steeper (faster) closing phase is likely to have less open time and, therefore, a longer closed phase throughout a complete cycle. Because the speed of closing is often correlated with the open time or closed time (greater speed, less open time; less speed, more open time), less open time is generally associated with a less tilted glottal spectrum, and more open time with a more tilted glottal spectrum. The closing speed and ratio of open phase to closed phase (OP/CP) are somewhat redundant descriptions of g waveforms (and spectral characteristics).
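The relationship between the percentage of open time and the ratio of open phase to closed phase (OP/CP) can be made explicit with a small calculation. The sketch below assumes a 5-ms cycle split into 3 ms open and 2 ms closed (the typical 60%/40% division) and, for contrast, a hypothetical cycle with more open time. The function names and the second set of durations are illustrative, not taken from measurements.

```python
def open_fraction(open_ms, closed_ms):
    """Proportion of the cycle during which the vocal folds are apart."""
    return open_ms / (open_ms + closed_ms)


def op_cp_ratio(open_ms, closed_ms):
    """Ratio of open-phase duration to closed-phase duration (OP/CP)."""
    return open_ms / closed_ms


typical = (3.0, 2.0)      # 5-ms cycle: 3 ms open, 2 ms closed
more_open = (4.0, 1.0)    # hypothetical cycle with slower closing, more open time

for label, (op, cp) in (("typical", typical), ("more open time", more_open)):
    print(label, "| open fraction =", open_fraction(op, cp), "| OP/CP =", op_cp_ratio(op, cp))
```

A cycle that is open 60% of the time therefore has an OP/CP ratio of 1.5, and the ratio grows quickly as the closed phase shrinks.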
Nature of the Input Signal: A Summary
To answer the first question posed at the outset of this chapter, the input signal generated by the vibrating vocal folds is a complex periodic waveform whose spectrum consists of a consecutive integer series of harmonics at whole-number multiples of the F0. The harmonics in the glottal spectrum systematically decrease in relative amplitude with increasing frequency. These harmonics serve as input to the vocal tract resonator, which shapes that input according to its resonant characteristics. Consideration turns now to the vocal tract resonator.
WHY SHOULD THE VOCAL TRACT BE CONCEPTUALIZED AS A TUBE CLOSED AT ONE END?
The vocal tract is an acoustic resonator. In Chapter 7, two general classes of acoustic resonators, Helmholtz and tube, are described. For the present discussion, accept on faith that the vocal tract is a tube resonator (the bend in the vocal tract is not acoustically relevant). It is easy to provide the proof of the tube-resonance characteristics of the vocal tract, as shown in a later section of this chapter. One obvious proof is the relationship between tube length and tube resonances: shorter tubes have higher resonant frequencies than longer tubes. This is consistent with the higher resonant frequencies of children’s relatively short vocal tracts compared with the lower resonant frequencies of the longer vocal tracts of men and women. For the same reason, women have higher resonant frequencies than men. Comparison of vowel resonant frequencies for men, women, and children is presented in Chapter 11.
If the vocal tract is regarded as a tube resonator, the question must be asked, “Does it resonate as a tube open at both ends, or closed at one end?” The answer to this question requires a brief reconsideration of the vibrating vocal folds, and how this vibration influences the acoustic output of the vocal tract.
Figure 8–7 shows two signals in the time domain, collected synchronously during phonation of a vowel. The upper signal is g, discussed above at some length. The upward-pointing arrows mark the instant in time at which the vocal folds snap together during each cycle of vibration. At these instants, when the airflow through the glottis is suddenly blocked by closure of the airway, the air immediately above the vocal folds becomes compressed and initiates a pressure wave through the vocal tract. Now, consider the situation just described. At the vocal fold boundary of the vocal tract, there is, for an instant, no airflow and the air molecules become compressed, whereas at the open, oral boundary of the vocal tract air molecules move freely between the lips. This appears very much like the aeromechanical conditions found in a resonating tube closed at one end, where pressure is maximum at the closed end and flow is maximum at the open end (see Chapter 7). Each time the vocal folds snap together, a pressure wave is set up in the vocal tract and obeys the rules of resonance in a tube closed at one end. Another way to say this is that the vocal tract resonances are excited each time the vibrating vocal folds snap together. Because the excitation occurs when the folds approximate and the oral end of the vocal tract is open for vowel production, the vocal tract resonates like a tube closed at one end.
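For a uniform tube closed at one end, the quarter-wavelength rule from Chapter 7 gives the resonant frequencies as odd multiples of c/4L, where c is the speed of sound and L is the tube length. The sketch below applies that rule to tubes of different lengths. The speed of sound (35,000 cm/s) and the tube lengths are assumed values chosen for illustration; a real vocal tract is not a uniform tube, so these numbers are only a first approximation.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed round value for warm, moist air


def closed_tube_resonances(length_cm, count=3, c=SPEED_OF_SOUND_CM_PER_S):
    """First `count` resonances (Hz) of a uniform tube closed at one end.

    Quarter-wavelength rule: F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, count + 1)]


# Hypothetical tube lengths: a longer (adult-like) and a shorter tube.
for length_cm in (17.5, 14.0):
    print(length_cm, "cm:", closed_tube_resonances(length_cm))
```

With these assumptions a 17.5-cm tube has resonances at 500, 1500, and 2500 Hz, and shortening the tube to 14 cm raises every resonance (625, 1875, and 3125 Hz), in line with the relationship between tube length and resonant frequency noted earlier.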
Figure 8–7. Schematic waveforms showing excitation of a vocal tract resonance (bottom waveform) by the vibrating vocal folds (top waveform). The top waveform, g, shows four periods, with the instant of closing for each cycle marked by an upward-pointing arrow. The waveform of the vocal tract response shows a damped vibration at a resonant frequency of 500 Hz. This vibration is initiated each time the vocal folds snap shut, as indicated by the downward-pointing arrows. The amplitude of the damped vocal tract resonance decays over time. The damped resonator vibration from one vocal fold excitation overlaps with each new excitation of the resonance. The overlap of damped resonator vibration from each excitation causes the resonator response to sum, consistent with the superposition principle described in Chapter 7. All vocal tract resonances, according to the quarter-wavelength rule, are excited by the closing of the vocal folds, but only a single resonance waveform (for 500 Hz) is shown for the sake of clarity.
In Chapter 7, a model of resonance was described in which a hammer tapped a resonator, thus exciting the resonant frequency (Helmholtz resonator) or frequencies (tube resonator). Imagine the hammer being controlled by a periodic motor, rotating it toward the resonator and striking it, then pulling back away, rotating it back toward the resonator and striking it again, and so forth. The continuous motion of the hammer back and forth, toward and away from the resonator, is important to producing the excitation of the resonator, but the actual instant of excitation of the resonator corresponds to the point in time when the hammer strikes the resonator. The analogy to excitation of the vocal tract resonances is fairly direct. The motion of the vibrating vocal folds (the swing of the hammer) is important to the resonance of the vocal tract, but the actual excitation of the resonances occurs only at the instant in time of vocal fold approximation.1
The Response of the Vocal Tract to Excitation
What does it mean to say that the vocal tract resonances are excited each time the vibrating vocal folds snap together? The answer is found in the bottom trace of Figure 8–7, where a waveform is initiated each time the vocal folds snap together. For purposes of simplification, a waveform for only a single resonant frequency is shown in the bottom signal of Figure 8–7, but waveforms are initiated for each resonant frequency of the tube. Think of the vocal tract response waveform shown in Figure 8–7 as corresponding to the first resonance of the tube, with a frequency of 500 Hz. The period of that waveform is 2 ms (500 Hz = 1/T, so T = 0.002 s, or 2 ms). Note how the resonance waveform responding to the first excitation is initiated with relatively great amplitude, which declines over each successive cycle until the vibration dies out completely (red waveform). In the example of Figure 8–7, note also that the resonance is re-excited before the previous waveform dies out completely (compare the amplitudes of the vocal tract response from excitation 1 [red waveform] and excitation 2 [black, dashed-line waveform]). The large, 500-Hz vocal tract vibration at excitation 2 overlaps the small (decaying) vibration from excitation 1. Similar “re-excitations” are shown at excitations 3 (blue waveform) and 4 (green waveform). If Figure 8–7 showed the waveforms of all the excited and re-excited resonances, the vocal tract response signal would be visually too “busy” to illustrate the main point of this discussion, which is that the vocal tract responds to excitation with damped oscillations at each of its resonant frequencies. The oscillations are damped because there is energy loss in the vocal tract due to the factors discussed in Chapter 7 (friction, absorption, and radiation).
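As a rough numerical illustration of the overlapping damped oscillations just described, the following Python sketch sums the response of a single 500-Hz resonance to four successive excitations. The fundamental frequency (125 Hz), the 60-Hz bandwidth, the exponential decay envelope, and all function names are assumptions chosen for illustration; they are not taken from Figure 8–7.

```python
import math

def damped_response(t, f_res=500.0, bandwidth=60.0):
    """Damped oscillation started by one excitation at t = 0 (seconds).

    ASSUMPTION: the amplitude decays as exp(-pi * bandwidth * t), a common
    way to relate the decay rate of a resonance to its bandwidth.
    """
    if t < 0:
        return 0.0  # nothing happens before the folds snap shut
    return math.exp(-math.pi * bandwidth * t) * math.sin(2 * math.pi * f_res * t)

def vocal_tract_response(t, f0=125.0, n_excitations=4, f_res=500.0, bandwidth=60.0):
    """Superpose the damped responses launched at each vocal fold closure."""
    period = 1.0 / f0  # time between successive closures (hypothetical f0)
    return sum(damped_response(t - k * period, f_res, bandwidth)
               for k in range(n_excitations))

# Period of the 500-Hz resonance, in milliseconds (T = 1/f).
resonance_period_ms = 1000.0 / 500.0
print(resonance_period_ms)  # 2.0
```

Shortly after the second excitation, the total response is the new, large oscillation plus the decayed remainder of the first one, which is the summation that the superposition principle predicts.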
Of Beer Bottles and Vocal Tracts
When the vocal tract is excited by the sudden “snapping shut” of the vocal folds, the source excitation has the form of a glottal spectrum. A series of such excitations, such as the 190 or so per second expected for an adult female, gives the excitation spectrum its “nice” form of discrete harmonics. This harmonic spectrum is shaped by the vocal tract filter. A historical footnote in speech acoustics was the idea that the excitation of the vocal tract was like the edge tones described in the Chapter 7 sidetrack titled “Beer and Flutes.” In this view, the resonant chambers of the vocal tract are excited by the individual puffs of air coming through the vocal folds during each cycle of vibration. The air puffs “force” air in the vocal tract into resonance, much like blowing across the opening of a beer bottle forces the air inside the bottle to produce a tone. This was called the “inharmonic theory” of vocal tract acoustics; it is not correct. The correct view is called the “harmonic theory,” for obvious reasons. The vocal tract shapes an acoustic source spectrum according to its resonant properties, rather than having its resonant frequencies “forced” into vibration by an aerodynamic event such as the air puff.
To this point, emphasis has been placed on the time-domain characteristics of the source signal and the response of the vocal tract. The focus now turns to the question of how the vocal tract resonances shape the input signal to produce an acoustic output at the lips. Stated more simply: “What is the acoustic basis of the events known as vowel sounds?” The best approach to this problem is to consider the source signal and vocal tract resonances in frequency-domain terms.
How Are the Acoustic Properties of the Vocal Tract Determined?
As discussed in Chapter 7, acoustic resonators can be described in the frequency domain by a resonance curve (see Figure 7–18). The peak of the resonance curve defines the resonant frequency of the resonator, and the width of the curve between the 3-dB-down points—the bandwidth—provides an index of the amount of energy loss in the vibration. Assume, for this discussion, a vocal tract shape associated with schwa (/ə/), a shape very much like a tube having uniform cross-sectional area from the glottis to the lips. This shape is like that of the straight tubes considered in Chapter 7, for which there are no constrictions, or narrowings, along the length of the tube. With knowledge of the length of the tube and its one closed end, the quarter-wavelength rule can be applied to obtain the multiple peaks of the resonance curve—that is, the resonant frequencies of the tube. The shapes of the resonance curves (determined by the bandwidths) are also important, so some additional calculations are necessary to arrive at a full resonance curve for the vocal tract tube.
If the mathematical tools are available to determine the bandwidths of the multiple resonances, the resonance curve for a vocal tract tube 15 cm in length and shaped for the vowel schwa looks something like the one shown in Figure 8–8. As expected, the lowest (first) resonant frequency is at c/4l = 560 Hz (where c = 33,600 cm/s and l = 15 cm), the second resonance at 1680 Hz (3 × 560), and the third at 2800 Hz (5 × 560). Although the vocal tract tube has, like any other tube, an infinite number of resonances, only the first three are shown for the sake of clarity (as well as for other reasons that will become apparent as this discussion proceeds).
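The quarter-wavelength arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name and its parameters are illustrative choices, and the speed of sound is the 33,600 cm/s value used in the text.

```python
def quarter_wavelength_resonances(length_cm, n_resonances=3, c_cm_per_s=33600.0):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) * c / (4 * l)."""
    return [(2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
            for n in range(1, n_resonances + 1)]

print(quarter_wavelength_resonances(15.0))  # [560.0, 1680.0, 2800.0]
```

The same function shows how tube length scales the whole pattern: a uniform tube of 17.5 cm yields 480, 1440, and 2400 Hz.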
Figure 8–8. Resonance curve for a vocal tract tube in the shape of the schwa /ə/ and having a length of 15 cm. The resonant frequencies along the curve are computed by the quarter-wavelength rule, and the bandwidth of each resonance is assumed to be 60 Hz. Only the first three resonances of the tube are shown.
The bandwidths are indicated for each peak of the resonance curve by the range of frequencies between the 3-dB-down points. For the present discussion, the bandwidths for each of the three resonances have been set to 60 Hz.
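One way to sketch a resonance curve from a list of resonant frequencies and bandwidths is to treat each resonance as a simple two-pole resonator and multiply the gains together. The Python sketch below takes that approach. It is an assumed, simplified model for illustration, not necessarily the calculation behind Figure 8–8, and the function and variable names are hypothetical.

```python
import math

def resonance_curve_db(freq_hz, formants_hz, bandwidths_hz):
    """Gain (in dB) of a cascade of two-pole resonators at one frequency.

    ASSUMPTION: each resonance is modeled by a complex-conjugate pole pair
    whose real part is set by the bandwidth (-pi * B) and whose imaginary
    part is set by the resonant frequency (2 * pi * F).
    """
    s = complex(0.0, 2 * math.pi * freq_hz)
    gain = 1.0
    for f_n, b_n in zip(formants_hz, bandwidths_hz):
        pole = complex(-math.pi * b_n, 2 * math.pi * f_n)
        gain *= abs(pole) ** 2 / (abs(s - pole) * abs(s - pole.conjugate()))
    return 20 * math.log10(gain)

formants = [560.0, 1680.0, 2800.0]   # quarter-wavelength values for a 15-cm tube
bandwidths = [60.0, 60.0, 60.0]      # the "on faith" bandwidths used in the text

# Sample the curve every 10 Hz from 0 to 3500 Hz.
curve = [(f, resonance_curve_db(f, formants, bandwidths)) for f in range(0, 3501, 10)]
```

In this model the gain of an isolated resonance falls by roughly 3 dB at half a bandwidth (30 Hz) on either side of its peak, which matches the definition of bandwidth as the distance between the 3-dB-down points.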
The example in Figure 8–8 was generated using simple principles established in Chapter 7 (the quarter-wavelength rule), as well as an “on faith” assumption about bandwidths. Fant (1960) needed a more comprehensive theory, however, because for most vowels the vocal tract tube does not have a uniform cross-sectional area from the glottis to the lips. Rather, vocal tract configurations typically involve constrictions along the path from glottis to lips, some of which are relatively tight.
Fant approached the problem of drawing the resonance curves for different vocal tract shapes in the following way. He took sagittal x-ray pictures of an adult male speaker’s productions of a variety of Russian and Swedish vowels. The soft and hard tissues of the speaker’s vocal tract (tongue, lips, hard palate, velum, part of the pharynx) were coated with barium paste, which allowed easier identification of the outlines of structures in the developed film. Figure 8–9 shows a magnetic resonance image (MRI) of a speaker producing a high back rounded vowel, with the boundaries of the air tube outlined and the air tube itself slightly shaded (note the constriction produced by the tongue, as well as by the rounded lips). The outlined, shaded tube includes boundaries defined by the walls of the larynx superior to the vocal folds, the pharynx, hard palate, velum, tongue, lips, and other surfaces. Although Figure 8–9 is an image type much more advanced than the standard x-ray images used by Fant, his approach can be explained just as effectively using the MRI example. Fant used the x-rays he obtained for many different vowels to outline the varied vocal tract shapes. When the structures are outlined in this way, the column of air extending from the glottis to the lips can be conceptualized as a tube of varying cross-sectional area.
Figure 8–9. Sagittal magnetic resonance image (MRI) of a male speaker producing a vocal tract configuration for the vowel /u/. Boundaries of the vocal tract are outlined to show the shape of the vocal tract tube and how its dimensions vary from glottis to lips. Background image obtained from the Audiovisual-to-articulatory SPeech Inversion (ASPI) project. Retrieved September 2, 2012, from http://aspi.loria.fr. Reproduced with permission.
The vocal tract length of the speaker studied by Fant (1960) was approximately 17.5 cm. Fant plotted the varying cross-sectional area of the vocal tract by estimating the area of the air tube at 0.5-cm increments from glottis to lips. This “sectioning off” of the vocal tract is shown in Figure 8–9 by the sequence of straight lines drawn through the vocal tract tube. If a line is imagined running straight forward from the glottis to the lips, along the long axis of the vocal tract, each of the straight lines seen in Figure 8–9 intersects this long axis line at a right angle. The length of the intersecting line within the vocal tract (between the red outlines in Figure 8–9) defines the “size” of the vocal tract tube at that location. The distance between adjacent lines defines a small section, or “tubelette,” within the vocal tract (Story, 2005), for which width measurements can be made. For example, the tubelettes are quite narrow toward the back of the oral cavity, where the tongue is raised toward the boundary of the hard and soft palates. The tubelettes are much wider toward the front of the vocal tract. Fant sectioned his vocal tract images into 35 “pieces” (2 measurements per cm of vocal tract, 2 × 17.5 = 35). The width of each of these section lines was measured and entered into a simple formula to compute the area for that slice of the vocal tract. What emerged from this exercise was an area function of the vocal tract.
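The sectioning arithmetic can be sketched in Python. The text does not give the formula Fant used to convert a measured width into an area, so the circular cross-section used below is purely an assumption for illustration, as are the function names and the sample widths.

```python
import math

def section_count(tract_length_cm, step_cm=0.5):
    """Number of sections when the tract is sliced at fixed increments."""
    return round(tract_length_cm / step_cm)

def area_from_width(width_cm):
    """Convert a sagittal width to a cross-sectional area.

    ASSUMPTION: the cross-section is treated as a circle whose diameter is
    the measured width. This is NOT Fant's formula, which the text does not
    specify; it only illustrates the width-to-area step.
    """
    return math.pi * (width_cm / 2.0) ** 2

print(section_count(17.5))  # 35

# Hypothetical widths (cm) for three neighboring slices, glottis side first.
hypothetical_widths_cm = [1.0, 2.0, 0.5]
area_function = [area_from_width(w) for w in hypothetical_widths_cm]
```

Listing the areas in order from glottis to lips, as in `area_function`, gives a discrete area function of the kind described next.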
Area Function of the Vocal Tract
An area function of the vocal tract is a plot of cross-sectional area as a function of distance along the vocal tract from glottis to lips. This distance is described by the succession of the measurement “slices” shown in Figure 8–9. Figure 8–10 shows an area function for the vowel /i/. Area, in cm², is plotted on the y-axis and section (slice) number (i.e., distance) is plotted on the x-axis. The low section numbers are near the glottis (i.e., section 1 is immediately above the glottis), and the measurement of section areas moves toward the lips from left to right. Each section number has an area value, so the function is actually a sequence of discrete points. For illustration purposes, the discrete points have been connected and the area function is represented in Figure 8–10 as a continuous line. This is justified because the measurements of successive slices were sufficiently close (in 0.5-cm increments) to minimize the likelihood of major changes in cross-sectional vocal tract area between the measurement steps.
The area function in Figure 8–10 shows relatively large cross-sectional areas in the lower and upper pharyngeal regions (the left side of the x-axis), with relatively smaller areas toward the front of the vocal tract. Tight constrictions—that is, small cross-sectional areas—are present between sections 22 and 27. The entire area function is intuitively consistent with phonetic descriptions of the vowel /i/ as a high-front vowel, for which the major constriction is in the front of the vocal tract.2 Area functions were measured by Fant for many different vowels.
Area functions are the link between the configuration of the vocal tract tube, as shaped by the oral and pharyngeal structures, and the resonant frequencies of that tube. Fant (1960) developed a mathematical technique for estimating the tube resonances from the area function. The specifics of the mathematical technique are not covered here, but the conceptual link between the area function and estimation of vocal tract resonant frequencies is straightforward, and makes use of information developed in Chapter 7.
Figure 8–10. Area function for the vowel /i/ plotted from data reported by Fant (1960, p. 115). Vocal tract section number (from 1 to 35, with sections extending from the glottis to the lips) is plotted on the x-axis. Cross-sectional area (in cm²) is plotted on the y-axis.
Chapter 7 described the role of mass and compliance in determining the resonant frequency of an acoustical resonator. Imagine that the air contained within the boundaries of any two adjacent measurement points in the vocal tract (that is, within a tubelette corresponding to the small air column between two measurement lines) has certain mass and compliance properties. If these properties are specified for each of the 35 sections of air, nearly all information relevant to the resonant frequencies of the vocal tract is available. Fant’s mathematical theory allowed him to estimate mass and compliance properties from the area measurement for each section, or tubelette. Based on the mass and compliance estimates from all 35 sections, the theory produced an estimate of the resonant pattern for the entire vocal tract. When mathematical information concerning energy loss factors was included, Fant was able to draw the complete resonance curve (resonant frequencies and bandwidths as in Figure 8–8) for a given vocal tract configuration.
The conceptual basis of Fant’s (1960) theory is, therefore, fairly simple. If the mass and compliance characteristics of the vocal tract tube can be determined, the resonance curve for the tube can be constructed. Because the vocal tract resonates like a tube, there are multiple peaks (i.e., resonances) along the resonance curve, each with its own bandwidth.
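The step from area function to resonant frequencies can be illustrated with a standard textbook calculation in which each tubelette is treated as a short, lossless tube and the sections are chained from glottis to lips. The Python sketch below is not Fant's actual procedure, whose specifics are not covered here. It assumes no energy loss, an ideal open end at the lips, and hypothetical area values; all names are illustrative.

```python
import math

SPEED_OF_SOUND_CM_PER_S = 33600.0  # value used in the text

def glottal_to_lip_flow_ratio(freq_hz, areas_cm2, section_len_cm):
    """Return U_glottis / U_lips for a chain of lossless tubelettes.

    ASSUMPTIONS: no energy loss and zero acoustic pressure at the lips.
    The ratio crosses zero at each resonant frequency of the tube.
    """
    kl = 2.0 * math.pi * freq_hz * section_len_cm / SPEED_OF_SOUND_CM_PER_S
    cos_kl, sin_kl = math.cos(kl), math.sin(kl)
    c_term, d_term = 0.0, 1.0  # bottom row of the identity chain matrix
    for area in areas_cm2:  # glottis first, lips last
        z = 1.0 / area  # relative impedance; a common scale factor cancels
        c_term, d_term = (c_term * cos_kl + d_term * sin_kl / z,
                          d_term * cos_kl - c_term * z * sin_kl)
    return d_term

def tube_resonances(areas_cm2, section_len_cm, f_max_hz=3500.0, step_hz=5.0):
    """Locate resonances by scanning for sign changes, then bisecting."""
    found = []
    f_lo = 1.0
    d_lo = glottal_to_lip_flow_ratio(f_lo, areas_cm2, section_len_cm)
    while f_lo < f_max_hz:
        f_hi = f_lo + step_hz
        d_hi = glottal_to_lip_flow_ratio(f_hi, areas_cm2, section_len_cm)
        if d_lo * d_hi < 0:
            a, b, d_a = f_lo, f_hi, d_lo
            for _ in range(40):  # bisection narrows the bracket
                mid = 0.5 * (a + b)
                d_mid = glottal_to_lip_flow_ratio(mid, areas_cm2, section_len_cm)
                if d_a * d_mid <= 0:
                    b = mid
                else:
                    a, d_a = mid, d_mid
            found.append(0.5 * (a + b))
        f_lo, d_lo = f_hi, d_hi
    return found

# Uniform 15-cm tube (30 sections of 0.5 cm): reproduces the quarter-wavelength rule.
uniform = tube_resonances([3.0] * 30, 0.5)  # approximately [560, 1680, 2800]

# Hypothetical two-part shape: wide back cavity, narrow front cavity.
constricted = tube_resonances([8.0] * 15 + [1.0] * 15, 0.5)
```

For the uniform tube, the area value itself does not matter; only the length sets the resonances. For the hypothetical constricted shape, the first resonance falls below and the second rises above the uniform-tube values, which shows how the area function, and not length alone, determines the resonance pattern.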
At this point in the development of Fant’s (1960) theory, it is important to recognize that the vocal tract resonance curve is computed from the measured area function. The resonant peaks are determined mathematically, rather than measured by analyzing the spectrum of a produced vowel. For this reason the computed resonance curve is called a theoretical spectrum, or a filter function. This filter function shows where the resonances for a given vocal tract configuration should be. The term filter function is particularly interesting, because it implies that the vocal tract acts like a filter, allowing energy to pass only at certain frequencies. It also implies that the filter function shows the frequencies at which energy does not pass. The regions of the spectrum where energy passes through, of course, are those regions at and in the immediate vicinity of the resonant peaks. The next section discusses the way in which the source spectrum and filter function (theoretical spectrum) are combined to produce an output spectrum, that is, a measured spectrum such as that of a phonated vowel. This discussion shows why the theoretical spectrum (the filter function) is not always precisely the same as the output spectrum (the spectrum of a produced vowel).
HOW DOES THE VOCAL TRACT SHAPE THE INPUT SIGNAL? (HOW IS THE SOURCE SPECTRUM COMBINED WITH THE THEORETICAL VOCAL TRACT SPECTRUM TO PRODUCE A VOCAL TRACT OUTPUT?)
The two questions heading this section are variants of the same problem, which is to determine how the acoustic characteristics of the source and vocal tract combine to produce a vocal tract output for a vowel. The frequency-domain representation of the vocal tract output is called an output spectrum. This is the spectrum measured, with appropriate instruments, from an actual vowel produced by a talker.
Figure 8–11 presents a simple graphic answer to the question posed above. The input (source) spectrum, as described in a preceding section, is shown at the left of the figure. The filter function is shown in the middle of the figure as a resonance curve with three peaks, corresponding to the first three resonances of a vocal tract tube in an /i/ shape. Note the multiplication sign between the input spectrum and filter function. To determine the output of the vocal tract (the right panel in Figure 8–11), the energy in the input spectrum is multiplied by the energy in the filter function.3 To be a little more explicit about what this process of multiplication means, the axes of both the input (source) spectrum and the filter function are the same—frequency on the x-axis, relative amplitude on the y-axis (i.e., they are both spectra). The input spectrum shows the relative amplitude of discrete frequency components, and the filter function shows the frequencies at which energy applied to it will be “amplified” (the resonances) or “de-emphasized” (the valleys of the resonance curve, between the resonances). The multiplication described here is simply another way to understand the shaping of the input (source) spectrum by the filter function. The harmonics in the input (source) spectrum are shaped by the form of the filter function. At resonant frequencies of the filter function, energy in the input (source) spectrum is strongly multiplied, and appears in the output spectrum as prominent energy components. At valleys in the filter function, energy in the input (source) spectrum is multiplied weakly or not at all, and does not appear (or appears only weakly) in the output spectrum.
Figure 8–11. Schematic representation of how the source or input spectrum and filter function are combined to produce a vocal tract output. The left panel depicts the input spectrum, the middle panel the filter function for a vocal tract in an /i/ configuration, and the right panel the output spectrum. The output spectrum shows which harmonics are and are not emphasized by combining the input and filter functions. The peaks in the output spectrum are formants (F1, F2, F3) representing the first three resonant frequencies of the vocal tract.
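The multiplication of source spectrum and filter function can be sketched numerically in Python. Everything specific in this sketch is an assumption made for illustration: the fundamental frequency (100 Hz), a source whose harmonic amplitudes fall by about 12 dB per octave, the /i/-like resonant frequencies (300, 2300, and 3000 Hz), the 60-Hz bandwidths, and the two-pole resonator model of the filter. None of these values is read from Figure 8–11.

```python
import math

def filter_gain(freq_hz, formants_hz, bandwidth_hz=60.0):
    """Linear gain of an assumed cascade of two-pole resonators."""
    s = complex(0.0, 2 * math.pi * freq_hz)
    gain = 1.0
    for f_n in formants_hz:
        pole = complex(-math.pi * bandwidth_hz, 2 * math.pi * f_n)
        gain *= abs(pole) ** 2 / (abs(s - pole) * abs(s - pole.conjugate()))
    return gain

def source_amplitude(harmonic_number):
    """ASSUMPTION: harmonic amplitude falls as 1/k^2 (about -12 dB per octave)."""
    return 1.0 / harmonic_number ** 2

def output_spectrum(f0_hz, formants_hz, f_max_hz=3500.0):
    """Multiply each source harmonic by the filter gain at its frequency."""
    rows = []
    k = 1
    while k * f0_hz <= f_max_hz:
        freq = k * f0_hz
        rows.append((freq, source_amplitude(k) * filter_gain(freq, formants_hz)))
        k += 1
    return rows

hypothetical_i_formants = [300.0, 2300.0, 3000.0]
spectrum = output_spectrum(100.0, hypothetical_i_formants)
for freq, amp in spectrum:
    print(f"{freq:6.0f} Hz  {20 * math.log10(amp):7.1f} dB")
```

In the printed output, harmonics that fall at or near an assumed resonance stand out above their neighbors, even though the source amplitudes decline steadily with frequency. Harmonics that fall in the valleys between resonances are comparatively weak.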