This chapter provides a selective summary of acoustic phonetics data. The presentation is selective because the research literature is voluminous. The large quantity of data available reflects the many uses of acoustic phonetics research—the discipline of acoustic phonetics serves many masters. Acoustic phonetics data are used by linguists who wish to enhance their phonetic description of a language; by speech communication specialists who are interested in developing high-quality speech synthesis and recognition systems; by scientists who wish to test theories of speech production and prefer to use the speech acoustic approach to the interpretation of articulatory patterns rather than data obtained using one of the more invasive and time-consuming physiological approaches (such as x-ray or electromagnetic tracking of articulatory motions); by speech perception scientists who want to understand the acoustic cues used in the identification of vowels and consonants by normal hearers and persons with hearing loss and prosthetic hearing devices; and by speech-language pathologists who want information concerning a client’s speech production behaviors—information that can be documented quantitatively and in some cases may be too subtle or transient to be captured by auditory analysis.
The chapter is concerned primarily with the acoustic characteristics of speech sound segments—the vowel and consonant segments of a language. A brief discussion of the acoustic characteristics of suprasegmentals (prosody) is also presented. The chapter does not provide coverage of the rich literature on the acoustic characteristics of voice production (phonation). Good sources for this information are Baken and Orlikoff (1999) and Titze (2000).
The upper part of Figure 11–1 shows a spectrogram of two vowels, /ӕ/ and /i/, spoken by a 57-year-old healthy male in the disyllable frames /əꞌhӕd/ and /əꞌhid/. The /əꞌhVd/ frame (where V = vowel) is famous in speech science, originally used by Peterson and Barney (1952) in their landmark study of vowel formant frequencies produced by men, women, and children. Peterson and Barney wanted to measure formant frequencies of vowels under minimal influence from the surrounding phonetic context—that is, with little or no coarticulatory effects. The most apparent way to get this kind of “pure” information on vowel articulation, and the resulting formant frequencies, is to have speakers produce isolated, sustained vowels. However, when speakers phonate sustained vowels they tend to sing, rather than speak, the vowels. Peterson and Barney designed a more natural speech task in which the unstressed schwa preceded a stressed syllable initiated by the glottal fricative /h/. The reasoning was that a segment whose initial articulation required primarily laryngeal gestures had minimal influence on the vocal tract gestures required for a following vowel. The /d/ at the end of the syllable was necessary to provide a “natural” ending to the syllable (in English, at least), as well as accommodating the production of lax vowels such as /ɪ/, /ε/, and /ʊ/, which in English do not occur in open syllables (with very few exceptions, such as “yeah”).
The /d/ at the end of the syllable may have some influence on the vowel articulation. But Peterson and Barney (1952) made their formant frequency measurements at a location where the formants were assumed to be sufficiently distant from the /d/ to minimize its influence on the formant frequency estimates. Conveniently, this measurement location also seemed to capture the “target” location of the vowel: the point in time at which formant frequencies were thought to coincide with the articulatory configuration aimed at by the speaker when trying to produce the best possible version of the vowel. The inclusion of the schwa as the first syllable provided some control over the prosodic pattern of the disyllable, placing stress on the /hVd/ syllable.
Figure 11–1. Spectrograms of two vowels (top part of figure), /ӕ/ and /i/, spoken by a 57-year-old healthy male in the disyllable frames /ə’hӕd/ and /ə’hid/. Bottom part of the figure shows LPC spectra for the two vowels, measured at the temporal middle of the vowels. The colored lines (red = F1; green = F2; blue = F3) connect the middle of the formant bands on the spectrograms to the peaks in the LPC spectra.
Forty-three years following the publication of Peterson and Barney’s (1952) classic work, Hillenbrand, Getty, Clark, and Wheeler (1995) published a replication of the study using updated analysis methods. Like Peterson and Barney, Hillenbrand et al. studied men (n = 45), women (n = 48), and children aged 10 to 12 years (n = 46, both girls and boys) producing the 12 monophthong vowels of English (/i, ɪ, e, ε, ӕ, ɑ, ɔ, o, ʊ, u, ʌ, ɝ/) in an /hVd/ frame. Hillenbrand et al. measured the first four formants (F1–F4) at their most stable point, in much the same way as shown in Figure 11–1 (except that F4 is not shown). The heavy vertical line through the spectrogram in Figure 11–1 shows the point in time at which the formant frequencies were measured (in these cases, roughly in the middle of the vowel duration). Formant frequencies were estimated using linear predictive coding (LPC) analysis, discussed in Chapter 10. The LPC spectra for the measurement point shown on the spectrograms are provided in the lower half of Figure 11–1. Lines point from the middle of the formant bands to the corresponding peaks in the LPC spectra. For these vowels, values of the first three formant frequencies for /ӕ/ are approximately 700, 1750, and 2450 Hz, and for /i/ are roughly 290, 2200, and 2950 Hz.
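The idea of reading formant frequencies from an LPC analysis can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by Hillenbrand et al. (1995). It builds a synthetic all-pole signal with resonances at the approximate /ӕ/ values quoted above (700, 1750, and 2450 Hz), fits an LPC model with the autocorrelation method, and converts the roots of the predictor polynomial to frequencies. The sampling rate, the bandwidths, the model order, and all function names are assumptions made for the example. Real speech would normally also need pre-emphasis, windowing, and a higher model order.

```python
import numpy as np

def synthetic_vowel(formants, bandwidths, fs, n_samples):
    """Impulse response of an all-pole filter with the given resonances (Hz)."""
    poles = []
    for f, b in zip(formants, bandwidths):
        radius = np.exp(-np.pi * b / fs)        # pole radius set by bandwidth
        angle = 2 * np.pi * f / fs              # pole angle set by frequency
        poles += [radius * np.exp(1j * angle), radius * np.exp(-1j * angle)]
    a = np.real(np.poly(poles))                 # denominator polynomial [1, a1, ...]
    x = np.zeros(n_samples)
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0            # unit impulse as excitation
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * x[n - k]
        x[n] = acc
    return x

def lpc_autocorrelation(frame, order):
    """Predictor polynomial [1, a1, ..., ap] from the autocorrelation method."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]  # lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1 : order + 1])
    return np.concatenate(([1.0], a))

def formants_from_lpc(a, fs):
    """Frequencies (Hz) of the complex roots in the upper half plane, sorted."""
    roots = [z for z in np.roots(a) if z.imag > 0]
    return sorted(float(np.angle(z)) * fs / (2 * np.pi) for z in roots)

FS = 10000                                      # assumed sampling rate (Hz)
TRUE_FORMANTS = [700.0, 1750.0, 2450.0]         # approximate /ae/ values from the text
ASSUMED_BANDWIDTHS = [80.0, 100.0, 120.0]       # illustrative values only

signal = synthetic_vowel(TRUE_FORMANTS, ASSUMED_BANDWIDTHS, FS, 2048)
estimated = formants_from_lpc(lpc_autocorrelation(signal, order=6), FS)
print([round(f) for f in estimated])
```

Because the synthetic signal is exactly all-pole and the model order matches the number of poles, the estimates land very close to the true resonances. With recorded speech the match between LPC peaks and formants is only approximate and depends on the analysis settings.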
Figure 11–2 shows Hillenbrand et al.’s (1995, p. 3104) formant frequency data in the form of an F1-F2 plot (see Peterson & Barney, 1952, Figure 8, p. 182). Each phonetic symbol represents an F1-F2 coordinate for a given speaker’s production of that vowel. Two of the vowels (/e/ and /o/) are not plotted to reduce crowding of the data points. The ellipses drawn around each vowel category enclose roughly 95% of all the observable points for that vowel.
These data show how a single vowel can be associated with a wide range of F1-F2 values. For example, the vowel /i/ shows points ranging between (approximately) 300 and 500 Hz on the F1 axis and 2100 and 3400 Hz on the F2 axis. The ellipse enclosing the /i/ points is oriented upward and leaning slightly to the right. Points in the lower part of the ellipse are almost certainly from men, points in the middle from women, and points at the upper part and to the right from children. This follows from material presented in Chapters 7 and 8 on resonance patterns of tubes of different lengths, and age- and sex-related differences in vocal tract length: shorter vocal tracts have higher resonant frequencies than longer ones. The same general summary can be given for almost any vowel in this plot, even though the degree of variation and orientation of the ellipses vary from vowel to vowel.
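The link between vocal tract length and resonant frequency can be made concrete with the quarter-wavelength formula for a uniform tube closed at one end and open at the other, F_n = (2n − 1)c / 4L. The sketch below is only an idealization (a uniform tube approximates a schwa-like configuration, not /i/), and the speed of sound and the two tract lengths are assumed values chosen for illustration.

```python
# Resonances of a uniform tube closed at the glottis and open at the lips:
# F_n = (2n - 1) * c / (4 * L)
SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed approximate value

def tube_resonances(length_cm, n_formants=3, c=SPEED_OF_SOUND_CM_PER_S):
    """Return the first n_formants resonant frequencies (Hz) of the tube."""
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, n_formants + 1)]

# HYPOTHETICAL vocal tract lengths, for illustration only
for label, length_cm in [("longer tract (17.5 cm)", 17.5), ("shorter tract (14.0 cm)", 14.0)]:
    print(label, tube_resonances(length_cm))
```

Every resonance of the shorter tube is higher than the corresponding resonance of the longer tube by the same ratio (17.5/14 = 1.25 here), which is the pattern behind the upward and rightward drift of points from men to women to children.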
Despite the wide variation across speakers in formant frequencies for a given vowel, Hillenbrand et al. (1995) replicated Peterson and Barney’s (1952) finding that the vowel intended by a speaker was almost always perceived correctly—consistent with the speaker’s intention—by listeners. Somehow listeners heard the same vowel category even when confronted with a wide variety of formant patterns.
Figure 11–2. F1-F2 plot of American English vowels from Hillenbrand et al. (1995, p. 3104). Each phonetic symbol represents an F1-F2 coordinate for a given speaker’s production of that vowel. Two vowels (/e/ and /o/) are not plotted to reduce crowding of the data points. The ellipses drawn around each vowel category enclose roughly 95% of all the observable points for that vowel. Reproduced with permission from Hillenbrand, J., Getty, L., Clark, M., and Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97, 3099–3111.
Figure 11–2 also shows that the formant frequency patterns of one vowel often overlap with those of another vowel. There is a small region of overlap between the ellipses of /i/ and /ɪ/, and a larger region of overlap between /ӕ/ and /ε/. In the lower left-hand part of the plot, roughly around F1 = 525 Hz and F2 = 1400 Hz, there is a three-way overlap of the vowels /u/, /ʊ/, and /ɝ/. In many cases, these areas of overlap are for vowels produced by speakers with vocal tracts of different lengths. For example, a good portion of the overlap between /ӕ/ and /ε/ seems to come from adult male /ӕ/ values with adult female or child /ε/ values. But there are many cases where the overlap is not so clearly explained by differing vocal tract lengths. Overlapping formant frequencies for different vowels produced by different speakers raise the question of how we hear the same formant frequencies produced by different speakers as different vowels. Hillenbrand et al.’s (1995) data, like those of Peterson and Barney (1952), demonstrate that knowing a pattern of formant frequencies does not necessarily provide sufficient information to identify the intended (spoken) vowel category. At the least, the identity of the speaker, including age, sex, and almost certainly dialect (Clopper, Pisoni, & de Jong, 2006), must be known to link a specific formant pattern to a vowel category.
Figure 11–3 plots a summary of a subset of the data reported by Hillenbrand et al. (1995). The subset includes averaged F1-F2 data from men, women, and children, for American English corner vowels that were “well identified” by a panel of listeners. Corner vowels define the limits of vowel articulation, with /i/ the highest and most front vowel, /ӕ/ the lowest and most front, /ɑ/ the lowest and most back, and /u/ the highest and most back. If these are the most extreme articulatory configurations for vowels in English, the formant frequencies should be at the most extreme coordinates in F1-F2 space. By including only data from “well-identified” vowels, the plotted data can be regarded as excellent exemplars of these vowel categories. When average F1-F2 coordinates for each of the four corner vowels are connected by a line for each of the three speaker groups, three vowel quadrilaterals are formed. In an F1-F2 plot, the area enclosed by such a quadrilateral is called the acoustic vowel space.
Figure 11–3. Vowel spaces constructed from corner vowels (/i/, /ӕ/, /ɑ/, /u/) of American English for men (filled circles connected by solid lines), women (unfilled circles connected by dashed lines), and children aged 10 to 12 years (lightly shaded diamonds connected by solid lines). These data are replotted from Hillenbrand et al. (1995, Table V., p. 3103) and are averages of vowels that were well identified by a crew of listeners.
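The area of a vowel quadrilateral can be computed directly from the four corner-vowel coordinates with the shoelace formula for polygon area. The sketch below assumes the corners are listed in order around the quadrilateral (/i/, /ӕ/, /ɑ/, /u/). The /i/ and /ӕ/ values are the approximate single-speaker values quoted earlier for Figure 11–1; the /ɑ/ and /u/ values are hypothetical placeholders, not data from Hillenbrand et al. (1995).

```python
def polygon_area(points):
    """Shoelace formula; points are (x, y) pairs listed in order around the polygon."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# (F2, F1) in Hz, ordered around the quadrilateral.
corner_vowels = {
    "i": (2200.0, 290.0),   # approximate values quoted for Figure 11-1
    "ae": (1750.0, 700.0),  # approximate values quoted for Figure 11-1
    "a": (1200.0, 750.0),   # HYPOTHETICAL placeholder
    "u": (1000.0, 350.0),   # HYPOTHETICAL placeholder
}

vowel_space_area = polygon_area([corner_vowels[v] for v in ("i", "ae", "a", "u")])
print(vowel_space_area, "Hz^2")
```

The result is in squared hertz, so areas are comparable only when the same frequency scale and the same set of corner vowels are used.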
Students sometimes have difficulty embracing the gray areas in science; black and white answers are more comfortable. The overlap of vowel formant frequencies shown in Figure 11–2, coupled with listeners’ apparent ease in identifying vowels in the overlapped areas, seems to be one of those hard-to-embrace gray areas: how do listeners do it? One possible answer is that the relationship of vowel acoustics to vowel identification is both more complex and simpler than suggested by the scatterplot in Figure 11–2. Both Peterson and Barney (1952) and Hillenbrand et al. (1995) constructed scatterplots based on F1 and F2 measured at a single point in time during vowel production. In identification experiments, however, listeners heard the whole vowel. What if changes in formant frequency throughout the vowel duration are important for vowel identification? If this is the case, plots such as Figure 11–2 underrepresent the mapping between vowel acoustics and vowel identification (see Hillenbrand & Nearey, 1999; Jacewicz & Fox, 2012, 2017; Nearey & Assmann, 1986 for just this argument). The scientific problem is more complex because there is more to the acoustics of vowels than formant frequencies measured at a single point in time, and less complex because plots such as those in Figure 11–2 may miss the point. Think gray: it’s so much more fun than thinking black and white.
Three general characteristics of these F1-F2 plots are noteworthy. First, the vowel quadrilaterals for men, women, and children move from the lower left to upper right part of the graph, respectively. This is because the vocal tract becomes progressively shorter across these three speaker groups. Second, the area of the vowel quadrilaterals (the acoustic vowel space) appears to be larger for children, compared with women, and larger for women compared with men. This probably has little to do with articulatory differences between the groups (although there may be some sex-specific articulatory effects, see Simpson, 2001), but rather is another consequence of the different-sized vocal tracts. Exactly the same articulatory configurations for the corner vowels in a shorter compared with longer vocal tract generate not only higher formant frequencies but also greater distances between F1-F2 points for the different vowels. The larger vowel space for children compared with men does not mean the children use more extreme articulatory configurations for vowels. Third, the acoustic vowel quadrilateral for one group of speakers cannot be perfectly fit to the quadrilateral for a different group by moving it to the new location (up and to the right) and uniformly expanding or shrinking it to achieve an exact match. Imagine the quadrilateral for men in Figure 11–3 moved into the position of the quadrilateral for women, followed by a uniform expansion of the male space to “fit” the female space. This attempt to scale the male quadrilateral to the female quadrilateral fails because the magnitudes of the sex-related vowel differences are not the same for each vowel. For example, note the relative closeness in F1-F2 space of the male and female /u/ compared with the other three vowels. A similar inability to scale the vowels of one group to another is seen in Figure 11–4, which is an F1-F2 plot for the lax vowels (/ɪ/, /ʊ/, /ε/), based on data reported by Hillenbrand et al. (1995).
The group differences in these vowel triangles are very much like the ones shown for the corner vowel quadrilaterals (Figure 11–3), including vowel-specific differences across groups. Note especially the small difference between males and females for /ʊ/ compared with the much larger male-female differences for /ɪ/ and /ε/. The lax-vowel triangle for men cannot be scaled in a simple way to obtain the triangle for women.
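The "move and uniformly scale" thought experiment can be written as a small least-squares fit: shift one group's vowel polygon so the centroids coincide, choose the single scale factor that best matches the other group's polygon, and inspect what is left over for each vowel. The sketch below is one way to set this up, assuming (F1, F2) pairs in hertz. The group means in it are hypothetical values invented to mimic a small male-female difference for /u/; they are not the Hillenbrand et al. (1995) data.

```python
import math

def fit_shift_and_scale(src, dst):
    """Least-squares fit of dst ~ scale * (src - src_centroid) + dst_centroid.

    Returns the scale factor and the per-vowel residual distance (Hz).
    """
    names = list(src)
    n = len(names)
    src_c = [sum(src[v][k] for v in names) / n for k in (0, 1)]
    dst_c = [sum(dst[v][k] for v in names) / n for k in (0, 1)]
    num = sum((src[v][k] - src_c[k]) * (dst[v][k] - dst_c[k]) for v in names for k in (0, 1))
    den = sum((src[v][k] - src_c[k]) ** 2 for v in names for k in (0, 1))
    scale = num / den
    residuals = {}
    for v in names:
        fitted = [scale * (src[v][k] - src_c[k]) + dst_c[k] for k in (0, 1)]
        residuals[v] = math.hypot(fitted[0] - dst[v][0], fitted[1] - dst[v][1])
    return scale, residuals

# HYPOTHETICAL group means as (F1, F2) in Hz, for illustration only
men = {"i": (300.0, 2300.0), "ae": (600.0, 1900.0), "a": (750.0, 1300.0), "u": (350.0, 1000.0)}
women = {"i": (350.0, 2750.0), "ae": (700.0, 2300.0), "a": (900.0, 1500.0), "u": (400.0, 1050.0)}

scale, residuals = fit_shift_and_scale(men, women)
print(round(scale, 3), {v: round(r) for v, r in residuals.items()})
```

If a shift plus uniform scaling were sufficient, every residual would be zero. Nonzero residuals that differ from vowel to vowel are the numerical counterpart of the statement that the group differences are vowel specific.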
Professor Emeritus James Hillenbrand has made the vowel formant frequencies reported in Hillenbrand et al. (1995) available for public use. Interested readers are directed to homepages.wmich.edu/~hillenbr/voweldata.html
Figure 11–4. Vowel spaces constructed from three American English lax vowels (/ɪ/, /ʊ/, /ε/) for men (filled circles connected by solid lines), women (unfilled circles connected by dashed lines), and children aged 10 to 12 (lightly shaded diamonds connected by solid lines). These data are replotted from Hillenbrand et al. (1995, Table V., p. 3103) and are averages of vowels that were well identified by a crew of listeners.
The acoustic vowel space for corner vowels has potential clinical application as an index of speech motor integrity, the effects of speech therapy, and comparison of vowel articulation in different languages. Evidence from studies of speakers with dysarthria (speech disorders resulting from neurological disease; see Liu, Tsao, & Kuhl, 2005; Turner, Tjaden, & Weismer, 1995; Weismer, Jeng, Laures, Kent, & Kent, 2001), glossectomy (removal of tongue tissue, usually because of a cancerous tumor; see Whitehill, Ciocca, Chan, & Samman, 2006), cochlear implants (Chuang, Yang, Chi, Weismer, & Wang, 2012), and even neurologically normal speakers (Bradlow, Torretta, & Pisoni, 1996) suggests that the size of the acoustic vowel space is modestly correlated with independent measurements of speech intelligibility or perceptual measures of articulatory precision (Fletcher, McAuliffe, Lansford, & Liss, 2017). The size of the vowel space is smaller in persons with dysarthria (Lee, Littlejohn, & Simmons, 2017) and can be made to expand and contract with different speaking styles such as “clear speech” versus “casual speech” (Lam & Tjaden, 2013). Computation of the acoustic vowel space is often based on the assumption that the area enclosed by the formant frequencies of corner vowels in an F1 versus F2 (or F2 vs. F3) plot serves as a “proxy” for articulatory behavior (Berisha, Sandoval, Utianski, Liss, & Spanias, 2014). The idea is that larger vowel space areas reflect greater articulatory distinctions between the corner vowels, and perhaps between vowels on the “interior” of the corner vowels as well (e.g., lax vowels of English such as /ɪ/ in “bit,” /ε/ in “bet,” and /ʊ/ in “book”). Over the past fifteen years there have been many publications on the acoustic vowel space, applied to many different problems, some of which are listed above.
The correlation between the area of the acoustic vowel space and speech intelligibility (or perceptual measures of speech precision) can be interpreted in at least two ways. First, because the acoustic vowel space for corner vowels is constructed from the most extreme articulatory positions for vowels (i.e., most high and front vowel /i/, most high and back vowel /u/, and so forth), the size (area) of the space may be an index of articulatory mobility—the acoustic “proxy” mentioned above. Larger vowel space areas are thought to reflect greater differences between vocal tract configurations for the corner vowels. In speakers with motor speech disorders, as one example, larger vowel spaces are thought to reflect greater articulatory flexibility compared with smaller vowel spaces. The size of the vowel space area may indeed serve as a global index of speech motor control. In this view, one cannot claim that specific vowels in the acoustic vowel space contribute directly to speech intelligibility; the vowel space measure is a general measure of the overall severity of a speech motor control deficit. A corollary to this is that the inference from the vowel space to the degree of articulatory flexibility is connected to speech intelligibility in a general way: intelligibility increases with increases in vowel space area.
A popular research goal has been to find a reliable connection between the speech acoustic output of the vocal tract and the intelligibility of utterances. This makes sense because the treatment of many speech disorders has a central goal of making a speaker more intelligible. To meet this overall goal, treatment efforts are often aimed at improving phonetic accuracy, under the assumption that speech intelligibility is better or worse as phonetic accuracy for sound classes (vowels, fricatives, and so forth) improves or declines. This very reasonable assumption is, unfortunately, not completely true, and in fact a good deal less true than you might think (Weismer, 2008). In an effort to identify other perceptual consequences of speech disorders, researchers have developed scales for articulatory precision, naturalness, acceptability, bizarreness (we kid you not), normalcy, and severity. There are other terms, too, but let’s not go on. Whether or not the terms are variants of one phenomenon—speech intelligibility—is a topic for future research.
A second interpretation is that larger vowel space areas increase the acoustic difference between closely related vowels, such as /i/ versus /ɪ/ or /u/ versus /ʊ/. In other words, acoustic contrasts for vowels other than the corner vowels are more sharply defined within larger vowel spaces. In this interpretation, a smaller vowel space is not simply a sign of overall speech motor control problems (as in the first interpretation) but is an important, independent component of a speech intelligibility deficit. More sharply defined contrasts among vowels allow better vowel identification, and therefore contribute to increased speech intelligibility (Liu, Tsao, & Kuhl, 2005). This may also explain why children with cochlear implants, whose auditory input is not as rich as that of children with normal hearing, produce smaller vowel spaces than normal-hearing children (Chuang et al., 2012).
There is a critical distinction between the first and second interpretations of the size of the vowel space. The second interpretation predicts an improvement in speech intelligibility for a client who, as part of a management program, learns to produce corner vowels with more extreme positions and, therefore, expands the acoustic vowel space. If the vowels are an independent component that contributes to speech intelligibility, larger vowel spaces resulting from management will be associated with more sharply defined vowel contrasts, for all vowels of a language. The better vowel contrasts result in improved speech intelligibility (Kim, Hasegawa-Johnson, & Perlman, 2011; Lansford & Liss, 2014; Liu et al., 2005). The first interpretation, that a smaller vowel space merely reflects some limitation on articulatory flexibility, does not necessarily require that the smaller vowel space be associated with reduced intelligibility. In fact, a given speaker produces vowel spaces of very different size depending on the speaking style (formal versus casual) and even the kind of speech material produced (e.g., vowels in isolated words versus vowels in an extended reading passage) without any significant loss of speech intelligibility (Kuo & Weismer, 2016; Picheny, Durlach, & Braida, 1986).
In a 1980 paper delivered to the 100th meeting of the Acoustical Society of America in Los Angeles, the famous phonetician/phonologist John Ohala (now Professor Emeritus of Linguistics at UC Berkeley) proposed an acoustic explanation for smiling. By pulling the corners of the mouth back and against the teeth, Ohala argued, the vocal tract is effectively shortened. Shorter vocal tracts mean higher formant frequencies, and higher formant frequencies are associated with smaller people. Smaller people, such as children, are generally not viewed as a physical threat. Speaking while smiling—think game-show host—sends a signal that says, “I’m small, I’m not a threat, I’m friendly, like me, don’t hurt me.” In evolutionary terms, vocalizations while smiling eventually were dispensed with, and the soundless smile was enough to send a signal of friendliness.
It is not clear which of these interpretations is correct, or even if they should be considered opposing viewpoints. Both views may be correct to some degree. A set of data relevant to interpretation of vowel space area shows that among normal, young adult males, the area may vary dramatically with change in speaking material (Kuo & Weismer, 2016). Figure 11–5 shows vowel space areas adapted from Kuo and Weismer for a single speaker in two conditions: (1) speaking clearly in an /hVd/ frame (where V = vowel; unfilled boxes), and (2) vowels extracted from a reading passage (filled circles). This plot shows F1 on the y-axis and F2 on the x-axis, corresponding to tongue height on the ordinate (higher tongue height as values move down the ordinate, hence the higher F1 values with lower tongue height) and tongue advancement on the abscissa (greater advancement as values move to the right on the abscissa). Eight monophthongs are plotted for each of the two conditions, with the corner vowels identified by phonetic symbols. The other vowels can be inferred from their position along the tongue height and advancement dimensions. The vowel space area in the citation speech condition (/əꞌhVd/) is substantially greater than the area in the reading condition. Does the normal speaker’s reduction of vowel space area in the reading condition reflect a loss of speech motor control relative to the clear speech condition? The answer is obviously no. Seven of the 10 speakers studied by Kuo and Weismer (2016) had an extensive reduction pattern like the speaker shown in Figure 11–5, to varying degrees; the remaining 3 speakers had smaller reductions in the same direction as the 7 speakers. The vowel spaces from the reading condition are comparable to those reported for speakers with amyotrophic lateral sclerosis (Turner & Tjaden, 2000) and for speakers with Parkinson’s disease, among others (Whitfield & Goberman, 2014).
The similarity between these “normal” data and data reported for speakers with speech motor control deficits resulting from neurological disease is not meant to deny the potential of vowel space area as a noninvasive, clinical index of the integrity of speech motor control and its effect on speech intelligibility. Rather, it points to the need for caution in the interpretation of vowel space area as an index of speech motor control.
Figure 11–5. F1-F2 plot of one speaker’s corner vowel productions in two speaking conditions. The conditions are citation speech in an /əꞌhVd/ context (unfilled boxes), and passage reading (filled circles). F1 is on the ordinate, F2 on the abscissa, to display the formants as “within” the vocal tract. Tongue height increases upward on the ordinate and tongue advancement increases to the right on the abscissa.
In recent years there has been enormous interest in comparative acoustic phonetics. Comparative acoustic phonetics has two main, interrelated branches. One concerns the acoustic characteristics of similar speech sounds in two or more languages or in two or more dialects of the same language. The other branch concerns the effect of native language (or dialect) phonetics (usually abbreviated as L1) on the acoustic characteristics of speech sounds in a second language (or dialect) (L2). The relationship between these two branches is simple in concept, but complex in practice. Both branches are relevant to preclinical speech science. The population of the United States includes many people who speak English as an L2. Dialect variation within the United States is not subtle, and a person’s dialect may be a core component of identity and culture. A significant proportion of individuals who seek the services of a speech-language pathologist and whose language or dialect does not match the clinician’s presents an interesting diagnostic and treatment problem. The problem may involve the diagnosis and treatment of a developmental speech sound disorder, the influence of disease on speech production, or a desire among healthy speakers for accent modification or reduction.
Acoustic characteristics of vowels have been a major focus for both branches of inquiry. Figure 11–6 shows F1-F2 patterns for “shared” vowels, measured roughly at the temporal midpoint of the vowels, produced by adult male speakers in four languages: Madrid Spanish; American English as spoken in Ithaca, New York (presumably corresponding to the “Inland North” dialect; see Labov, 1991); modern Greek (primarily from Athens); and modern Hebrew as spoken in Israel. These four languages share the vowels /i/, /e/, /o/, and /u/. A fifth vowel, /a/, slightly advanced from American English /ɑ/, is shared by Spanish, Greek, and Hebrew. A “shared” vowel is one that phoneticians transcribe with the same symbol across languages. In Figure 11–6, the F1-F2 values for shared vowels are enclosed by ellipses drawn by eye. Shared vowels in Hebrew and English (for example) have different F1-F2 values; the magnitude of these differences varies across vowels. The Hebrew-English differences for the front vowels /i/ and /e/ are more dramatic than those for the back vowels /u/ and /o/. Similar comparisons for different language pairs suggest the same conclusion: the use of the same phonetic symbol does not mean the sound has the same acoustic characteristics in different languages. Data such as these may explain why an L2 speaker’s production of a vowel shared by L1 and L2 (e.g., an Israeli speaker’s production of the vowel /i/ when speaking American English) can still be detected as accented by a native speaker of the L2 (Bradlow, 1995).
Figure 11–6. F1-F2 plot of shared vowels from four different languages. Spanish, American English, modern Greek, and Hebrew all have the vowels /i/, /e/, /o/, and /u/ in their phonetic inventories. In addition, Spanish, modern Greek, and Hebrew share the vowel /a/. Spanish data are shown by filled circles, American English data by unfilled circles, Greek data by lightly shaded diamonds, and Hebrew data by filled diamonds. The sources for the data are Bradlow (1995; Spanish and English), Jongman, Fourakis, and Sereno (1989; Greek), and Most, Amir, and Tobin (2000; Hebrew). All data are from adult male speakers.
The cross-language comparison presented in Figure 11–6 is a simplistic one, even though it makes a valid point. The comparison is simplistic because “shared” vowels across different languages may vary in more ways than the F1-F2 values measured at a single point in time. The vowels may also differ (or be similar) by the higher (F3, F4) formant frequencies, the formant transitions going into and out of the so-called vowel steady-states, the overall vowel duration, the relation of the vowel duration to the duration of adjacent syllables, and other factors. The acoustic comparison of shared vowel sounds in different languages is potentially very complex.
Despite this complexity, a simple F1-F2 comparison of L2 vowels and their corresponding native vowels (the L1, e.g., Americans producing English) reveals a good deal about the influence of L1 on an L2 vowel system (i.e., the vowel productions of speakers learning an L2). Chen, Robb, Gilbert, and Lerman (2001) studied the formant frequencies of American English vowels produced by native speakers of Mandarin, the primary language of Taiwan and many parts of mainland China. The Mandarin vowel system includes six vowels, /i/, /e/, /u/, /o/, /a/, and /y/ (similar to a lip rounded /i/), the first four of which are also found in American English. The American English lax vowels /ɪ/, /ε/, and /ʊ/ are “new” for the Mandarin speaker learning English (just as /y/ would be a new vowel for the American speaker learning to produce Mandarin). Figure 11–7 shows F1-F2 data from the Chen et al. study, specifically for the American English vowels /i/, /e/, /ɪ/, /u/, /ʊ/, and /ʌ/ produced by Taiwanese adult females whose native language is Mandarin (filled circles) and by native female speakers of American English (unfilled circles). Each plotted point is labeled with the phonetic symbol matching the American English vowel intended by the Mandarin speakers.
Figure 11–7. F1-F2 data for American English vowels /i/, /ɪ/, /e/, /u/, /ʊ/, and /ʌ/, spoken by adult female speakers from Taiwan whose native language is Mandarin (filled circles) and adult female speakers whose native language is American English (unfilled circles). Arrows project from each phonetic symbol to the points representing a specific, average F1-F2 coordinate for vowels produced by both groups of speakers. Data replotted from Chen et al. (2001).
These data suggest several important conclusions concerning the way in which the vowel pairs [i]-[ɪ] and [u]-[ʊ] were produced by the two groups of speakers. Native speakers of English produced these vowel pairs with a fair degree of separation in F1-F2 space, as expected for vowels with categorical (that is, phonemic) status. English [i] and [ɪ] differ both in the F1 and F2 dimensions, the differences implying a somewhat more open (higher F1) and slightly more posterior tongue position (lower F2) for [ɪ]. English [u] and [ʊ] are separated only minimally along the F2 dimension, but differ by nearly 100 Hz on the F1 dimension, suggesting a more open vocal tract for the latter vowel. In contrast, the F1-F2 points for the [i]-[ɪ] and [u]-[ʊ] English vowel pairs produced by Taiwanese speakers were closer together, differing by small amounts along the F2 dimension. In the case of [u]-[ʊ], the Taiwanese points in F1-F2 space are so close to each other it appears the speakers treated the vowels as members of a single vowel category. Note also the large separation in F1-F2 space between [ɪ] and [e] for the native speakers, but the small separation between the same two vowels for the Mandarin speakers.
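The separation between two vowels in F1-F2 space can be summarized as a Euclidean distance between their average coordinates. The sketch below shows the computation for the [i]-[ɪ] and [u]-[ʊ] pairs in two speaker groups. All of the coordinates are hypothetical values invented to resemble the pattern described above; they are not the values reported by Chen et al. (2001). Distance in hertz is also only one possible metric, chosen here for simplicity.

```python
import math

def pair_distance(v1, v2):
    """Euclidean distance (Hz) between two (F1, F2) points."""
    return math.hypot(v1[0] - v2[0], v1[1] - v2[1])

# HYPOTHETICAL group means as (F1, F2) in Hz; "I" stands for the lax high
# front vowel and "U" for the lax high back vowel.
native = {"i": (310.0, 2750.0), "I": (480.0, 2350.0), "u": (380.0, 1200.0), "U": (480.0, 1250.0)}
learner = {"i": (350.0, 2650.0), "I": (390.0, 2550.0), "u": (400.0, 1150.0), "U": (410.0, 1180.0)}

for tense, lax in [("i", "I"), ("u", "U")]:
    d_native = pair_distance(native[tense], native[lax])
    d_learner = pair_distance(learner[tense], learner[lax])
    print(tense, lax, round(d_native), round(d_learner))
```

A much smaller distance for the learner group than for the native group is the numerical signature of two vowels being produced as if they belonged to a single category.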
There are competing theories for the L1-L2 data in Figure 11–7. The Mandarin speakers produced a “new” vowel ([ɪ] in one case, [ʊ] in the other) as if it were a member of one of the “shared” vowels ([e] and [u]). Because the formant frequencies for the “shared” vowels are not identical across the two languages (see above), it may be more accurate to say the Mandarin speakers took “shared”-“new” vowel pairs and treated them as one category. The F1-F2 values for both the “shared” and “new” vowels had intermediate locations between the Americans’ well-separated F1-F2 points for the two vowels. Think of it as a phonetic compromise when the skill of producing two nearby, but separate, vowels is not yet available to a speaker. The influence of the L1 vowel system on the L2 is to draw “new” vowels toward one of the shared vowels.
The formant frequencies plotted in Figures 11–3 to 11–8 are averages across speakers. Figure 11–2, from Hillenbrand et al. (1995), presents a more realistic picture of variability in formant frequencies for a given vowel, but this presentation shows only across-speaker variability. Within a speaker, vowel formant frequencies vary with a number of factors. These factors include—but may not be limited to—speaking rate, syllable stress, speaking style, and phonetic context.
Traditionally, the effects of different factors on vowel formant frequencies have been referenced to a speaking condition in which the vowel is produced in a hypothetically “pure” form, as described above. As already mentioned, in their original study of vowel formant frequencies, Peterson and Barney (1952) designed the /əꞌhVd/ frame as a speech production event similar to real speech but largely free of many of the influencing factors noted above. Stevens and House (1963), in their classic paper on phonetic context effects on vowel formant frequencies, demonstrated for three phonetically sophisticated speakers (i.e., speech scientists) the lack of any difference in F1 and F2 for isolated vowels and vowels spoken in the /hVd/ frame. This result, as well as other data reviewed by Stevens and House, suggested that formant frequencies measured at the midpoint of a vowel in the /hVd/ frame are representative of vowels articulated under minimal influence from factors such as context, rate, and so forth. Because of this, vowels measured in the /hVd/ frame are often referred to as null context vowels.
When null context vowels are plotted in F1-F2 space together with the same vowels produced in varied phonetic contexts, at different rates, and in different speaking styles, an interesting pattern emerges. Figure 11–8 shows an F1-F2 plot for the corner vowels of American English (/i/, /ӕ/, /ɑ/, /u/). The formant frequencies in this plot were derived for male speakers from several different sources in the literature. Two sets of “null context” data are plotted, one from Peterson and Barney (1952; filled circles connected by solid lines), the other from Hillenbrand et al. (1995; open circles connected by dashed lines). The decision to include F1-F2 data for null context vowels from two different data sets underscores the potential variability in these kinds of measurements. The two sets of null context data are most different for the low vowels /ӕ/ and especially /ɑ/. The speakers in the two studies were from the same geographical region (Michigan, with a few speakers from other areas), but the recordings are separated in time by approximately 45 years. A likely explanation for the difference in the low vowel formant frequencies is changing patterns of vowel pronunciation over the second half of the twentieth century.
How do formant frequencies deviate from null context values when they are produced under different speaking conditions? The data shown in Figure 11–8 include F1-F2 data for the corner vowels spoken at a fast rate, but with syllable stress (Fourakis, 1991; filled triangles), in conversational-style production of sentences (Picheny et al., 1986; unfilled triangles), in a “clear-speech” production style of sentences (Picheny et al., 1986; filled diamonds), and in a /bVb/ context (Hillenbrand, Clark, & Nearey, 2001; lightly shaded diamonds).
Figure 11–8. F1-F2 plot showing two vowel spaces for the “null context” corner vowels, plus corner-vowel data from studies in which the vowels were produced in other speaking conditions. Null context data are from tabled means published by Peterson and Barney (1952) and from careful pencil-and-ruler estimates of figures shown in Hillenbrand et al. (1995). Data from Fourakis (1991) are from tabled formant frequencies for fast-speech, stressed vowels averaged across various phonetic contexts. Values plotted from Picheny, Durlach, and Braida (1986) are for vowels extracted from sentence productions in conversational (unfilled triangles) and clear-style speech (filled diamonds) and were estimated from their published figures by the pencil-and-ruler technique. The same estimation technique was used to obtain the /bVb/ data from Hillenbrand et al. (2001). The red circle in the middle of the plot is the F1-F2 pattern expected from a male vocal tract with uniform cross-sectional area from glottis to lips (the expected vocal tract shape for schwa). All plotted data points are from male speakers.
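The size of a vowel space such as the quadrilaterals in Figure 11–8 can be expressed as a single number, the area enclosed by the corner vowels in the F1-F2 plane. The following is a minimal sketch of that calculation using the shoelace formula for polygon area. The function and variable names are hypothetical, and the formant values are the male means as commonly reproduced from Peterson and Barney (1952); treat them as illustrative and check the original table before relying on them.

```python
# Sketch: area of the corner-vowel quadrilateral in F1-F2 space.
# The formant values below are illustrative (male means as commonly
# reproduced from Peterson and Barney, 1952); verify before reuse.


def vowel_space_area(corners):
    """Return the polygon area in Hz^2 (shoelace formula).

    `corners` is a list of (F1, F2) pairs in Hz, listed in order
    around the perimeter of the vowel space.
    """
    total = 0.0
    n = len(corners)
    for k in range(n):
        f1_a, f2_a = corners[k]
        f1_b, f2_b = corners[(k + 1) % n]  # wrap around to close the polygon
        total += f1_a * f2_b - f1_b * f2_a
    return abs(total) / 2.0


NULL_CONTEXT_MALE = [
    (270, 2290),  # /i/
    (660, 1720),  # /ae/
    (730, 1090),  # /a/
    (300, 870),   # /u/
]

print(vowel_space_area(NULL_CONTEXT_MALE))
```

A smaller area for the same speaker group in another speaking condition would correspond to the points falling “inside” the null context quadrilateral. The area alone does not show the direction in which any single vowel has moved.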
With certain exceptions (the Fourakis points for /ɑ/ and /i/), especially when the Hillenbrand et al. (1995) null context vowel space is used as a reference, vowels spoken in any of the other conditions tend to have F1-F2 points that are “inside” the null context vowel space. More specifically, the F1-F2 points for the different conditions move away from the null context coordinates in the direction of a point roughly in the center of the quadrilaterals, indicated by the red circle. This point plots F1 = 500 Hz, F2 = 1500 Hz, the first two formant frequencies associated with the “neutral vowel” configuration of an adult male vocal tract, that is, a vocal tract with uniform cross-sectional area from the glottis to the lips. As discussed in Chapter 8, this is the vocal tract configuration most closely associated with schwa (/ə/).
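The 500 Hz and 1500 Hz values follow from the standard quarter-wavelength model of a uniform tube closed at one end (the glottis) and open at the other (the lips), in which the resonances fall at odd multiples of c/4L. The sketch below works through that arithmetic. The tube length of 17.5 cm and the speed of sound of 35,000 cm/s are conventional textbook assumptions rather than values stated in this section, and the function name is hypothetical.

```python
# Sketch: resonances of a uniform tube closed at the glottis and open
# at the lips, F_n = (2n - 1) * c / (4 * L).
# Assumed values (not from this section): L = 17.5 cm, c = 35,000 cm/s.


def uniform_tube_formants(length_cm, n_formants=3, speed_of_sound_cm_s=35000.0):
    """Return the first `n_formants` resonance frequencies in Hz."""
    if length_cm <= 0:
        raise ValueError("tube length must be positive")
    return [
        (2 * n - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
        for n in range(1, n_formants + 1)
    ]


print(uniform_tube_formants(17.5))  # first three resonances of the assumed adult male tube
```

Under these assumptions the first two resonances are 500 Hz and 1500 Hz, matching the neutral point in Figure 11–8. Because the frequencies are inversely proportional to tube length, a shorter tube yields proportionally higher values for every formant.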
One way to interpret the patterns seen in Figure 11–8 is to regard the null context formant frequencies (and underlying vocal tract configuration) for a given vowel as idealized targets. In this view, described explicitly by Björn Lindblom (1963, 1990), the speaker always aims for the idealized target, but misses it in connected speech because the articulators do not have sufficient time to produce the target before transitioning to production of the following sound. For example, the target vocal tract shape (the area function) for the vowel /i/ has a relatively tight constriction in the front of the vocal tract, and a wide opening in the pharyngeal region. This vocal tract shape is a significant deviation from the straight-tube configuration of schwa and fits the description of /i/ as a high-front vowel. In Lindblom’s view, when a vowel such as /i/ is produced in a condition other than the null context, the idealized target is missed in a specific way, namely, by producing a vocal tract configuration (and resulting formant frequencies) that reflects a lesser deviation from the schwa configuration. It is as if all vowels are viewed as deviated vocal tract shapes (and formant frequencies) from the straight-tube configuration of schwa. Under optimal conditions, when the target is achieved, these deviations in vocal tract shape are maximal. In connected speech, however, the deviations from the schwa configuration are not as dramatic. By not producing the most extreme configuration associated with the sound, the speech mechanism has more time to produce a sequence of sounds in an efficient and intelligible way. An /i/ in connected speech is still a high-front vowel, but not quite as high and front as in the null context.
Lindblom (1963) called this phenomenon “articulatory undershoot.” Undershoot, in his opinion, is a result of phonetic context (relative to the null context), increased speaking rate, reduced stress, and casual speaking style, but all these different causes are likely to be explained by a single mechanism. Simply put, the shorter the vowel duration, the greater the undershoot. Relative to the duration of a null context vowel, phonetic context (e.g., a vowel surrounded by two obstruents), increased speaking rate, reduced stress, and a casual speaking style are all associated with shorter vowel durations. In the language of phonetics, vowels experience greater reduction as vowel duration decreases, regardless of the condition that produces the shorter duration. Although this is not a universally accepted interpretation of undershoot, it explains a good deal of variation in formant frequencies for a given vowel produced by a specific speaker. Factors other than vowel duration are likely, in some cases, to have an independent effect on formant frequencies. For example, in Figure 11–8, the vowels /ɑ/ and /i/ from Fourakis (1991) do not fit the duration explanation of undershoot because they fall outside the null context quadrilaterals, even though the “fast condition” vowel durations were relatively short (as reported by Fourakis, 1991, Table III, p. 1821). Because these vowels were stressed, an independent effect of stress on vowel formant frequencies must be considered a possibility.
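One simple way to put a number on undershoot is to ask how much closer an observed F1-F2 point lies to the neutral (schwa) point than the null context point for the same vowel. The sketch below does this with Euclidean distance in the F1-F2 plane. The index is an illustrative measure introduced here, not a formula from Lindblom or from the studies cited above, and the formant values in the example are hypothetical.

```python
import math

# F1, F2 (Hz) of the neutral configuration discussed in the text.
NEUTRAL = (500.0, 1500.0)


def distance(p, q):
    """Euclidean distance in Hz between two (F1, F2) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def undershoot_index(null_context, observed, neutral=NEUTRAL):
    """Proportional reduction in distance from the neutral point.

    0   -> observed vowel is as far from neutral as the null context vowel
    > 0 -> observed vowel is closer to neutral (undershoot)
    < 0 -> observed vowel is farther from neutral than the null context vowel
    """
    d_null = distance(null_context, neutral)
    if d_null == 0:
        raise ValueError("null context vowel coincides with the neutral point")
    return 1.0 - distance(observed, neutral) / d_null


# Hypothetical /i/ values, for illustration only (not measured data).
null_i = (300.0, 2300.0)
fast_i = (400.0, 1900.0)
print(undershoot_index(null_i, fast_i))
```

A negative index corresponds to points that fall outside the null context quadrilateral, as described above for the Fourakis (1991) /ɑ/ and /i/ data. Because the index uses raw Hz on both axes, F2 differences tend to dominate. An auditory frequency scale could be substituted without changing the logic.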
Kuo and Weismer (2016) varied speech materials so that American English vowels occurred in simple, single-syllable utterances, words in sentences, and words in long reading passages. The Lindblom-inspired logic of this manipulation was that as speech material changed from formal, simple syllables (close to the “null” vowel) to more “connected” and conversational utterances (more casual), vowel durations would shorten, resulting in an increased amount of undershoot of formant frequencies. Data were obtained from 10 adult males producing American English vowels embedded in the varied speech materials. As predicted from Lindblom’s theory, the undershoot of “target” vowel formant frequencies increased as the speech material became more “casual.” The degree and patterns of undershoot, however, depended on speaker and vowel. Some speakers were dramatic undershooters, some less so. In addition, the extent of undershoot across the speech materials was not the same for all vowels, and the vowel-specific patterns varied across speakers. An important lesson from the Kuo and Weismer experiment, as well as other experiments (e.g., Johnson, Ladefoged, & Lindau, 1993), is that almost any “pattern” in acoustic or articulatory phonetics, for any sound segment, is likely to have a broad range of variability when a sufficient number of speakers is studied. The across-speaker variability may be so dramatic as to challenge the identification of a well-defined acoustic and/or articulatory pattern for a given sound segment. The reader is encouraged to keep this in mind as the acoustic characteristics of speech sounds are reviewed in this chapter.
The take-home messages from this discussion of vowel formant frequencies and their variability across and within speakers are as follows. First, sex and age have a dramatic effect on the formant frequencies for a given vowel because these variables are closely associated with differences in vocal tract size and length. In general, the longer and larger the human vocal tract, the lower the formant frequencies for all vowels. This explains why, for a given vowel, there is such a large range of formant frequencies across the population (see Figure 11–2). Second, even when vocal tract length/size factors are held constant, “target” formant frequencies for a given vowel may vary for several reasons. One reason may reflect the inherent constraints on a phonetic symbol system. Even though a vowel is transcribed as /u/ in several different languages, the F-pattern associated with productions of this vowel category may be substantially different (see Figure 11–6). The same conclusion can be made about the same vowel produced by speakers of different dialects of the same language (Clopper, Pisoni, & de Jong, 2005). Additional reasons for variation with constant vocal tract length include the effects of phonetic context, syllable stress, speaking rate, and speaking style.
The implication of the variation in “target” formant frequencies for a particular vowel is that if one is asked the question, “What are the formant frequencies for the vowel /u/ (or any other vowel)?” an answer cannot be supplied without additional information on the speaker, the nature of the speech material, the language being spoken, the style of the speech (formal versus casual), and so forth. Even with answers to all these questions, a definitive, precise answer is probably not feasible. It is more likely that a definitive, precise answer is not necessary because vowels may be perceived by focusing on relations among the formant frequencies, rather than absolute values of individual formants. In addition, it is likely that the formant frequencies measured near the temporal middle of vowels—the so-called target values—are only part of the information critical in distinguishing among the vowels of a language (see next section).
The discussion above presented a “slice-in-time” view of vowel formant frequencies but mentioned the possible importance of formant frequency change across the duration of a vowel. When single-slice formant frequencies are supplemented with information on formant frequency change throughout the vowel nucleus, identification/classification accuracy increases, sometimes substantially (Assmann, Nearey, & Hogan, 1982; Hillenbrand & Nearey, 1999; Jacewicz & Fox, 2012). Both single-slice formant frequencies and formant movement throughout a vowel nucleus make important contributions to vowel identity.
The complexity of the mapping from articulatory to acoustic phonetics comes as no surprise to anyone who has studied tongue, lip, and jaw motions during speech production for even the simplest consonant-vowel-consonant (CVC) syllable. Figure 11–9 shows these motions for the vowel [ɪ] in the word [sɪp] produced by a young adult female. These data were collected with the x-ray microbeam instrument, which tracked the motions of very small gold pellets attached to the tongue, jaw, and lips (see Chapter 6). In Figure 11–9 the lips are to the right and the outline of the hard palate is seen in the upper part of the x-y coordinate system. The motions of two lip pellets (UL = upper lip; LL = lower lip), two mandible (jaw) pellets (MM = mandible at molars; MI = mandible at incisors), and four tongue pellets (T1–T4 arranged front to back roughly from tip [T1] to dorsum [T4]) are shown for the entire duration of the vowel [ɪ]. The shaded portion on the waveform in the lower part of the figure corresponds to the duration of the motions shown in the upper part of the figure. Arrows pointing up to the waveform indicate the onset and offset of the vowel, hence the beginning and end of the displayed pellet motions. The vowel duration is 115 ms. The direction of the tongue pellet motions throughout the vowel is shown by arrows with dashed lines. The final position of each pellet, at the last glottal pulse of the vowel (the operationally defined end of the vowel) before lip closure for [p], is marked by a small circle at the end of the motion track. All tongue pellets move down throughout the vowel, with the exception of the small upward motions in T2 and T3 at the beginning of the vowel. Throughout the syllable, the mandible moves up very slightly and the lips come together, as would be expected when a vowel is followed by a labial consonant. 
Although this display does not show the motions as a function of time (it is a spatial display of events that unfold over time, with no time scale other than the knowledge that the tracks cover a time interval of 115 ms), the motions are more or less continuous throughout the vowel nucleus and do not have obvious steady-state portions where the motion “freezes.” This is especially so for the tongue pellets, for which the downward motion is smooth and continuous from the beginning to the end of the vowel.
The continuous motions of the articulators for the short-duration vowel [ɪ] are consistent with changing formant frequencies throughout the vowel nucleus. The change in tongue position over time is associated with a change in the vocal tract area function over time, and it is the area function that determines formant frequencies. Based on these motions and the resulting formant transitions, it is not surprising that portions of the vowel in addition to the “slice-in-time” target measurement contribute to vowel identification. One future area of research is to generate better descriptions of these simple vowel motions, and to understand the role of such motions in vowel identification. This is an important area of research because of the significant contribution of vowel articulation to speech intelligibility deficits in the speech of individuals who are hearing impaired (Metz, Samar, Schiavetti, Sitler, & Whitehead, 1985) or who have dysarthria (Weismer & Martin, 1992), among other disorders.
Figure 11–9. Tongue (T, in four locations), mandible (M, at the level of the molars, M, and incisors, I), and upper and lower lips (UL, LL) pellet motions throughout the vowel are shown by dashed arrows, and the final pellet positions at the end of the vowel are marked by circles at the end of the motion tracks. Data are shown in the x-ray microbeam coordinate system (Westbury, 1994), with the x-axis defined by a plate held between the teeth and the y-axis by a line perpendicular to the x-axis and running through the maxillary incisors. The interval of the speech waveform for which the motions are shown is highlighted in the lower part of the figure by the shaded box on the waveform.
Articulatory and Acoustic Phonetics
When speech scientists use the term “articulatory phonetics,” they have in mind the positions and movements of the articulators, as well as the resulting configuration of the vocal tract, as they relate to speech sound production. The term “acoustic phonetics” describes the relations between the acoustic signal (resulting from those positions, movements, and configurations) and speech sounds. Many scientists have studied the relations between articulatory and acoustic phonetics, and found them to be fantastically complex. Why is this so? There are many reasons, but here are two prominent ones: first, exactly the same acoustic phonetic effect can be produced by very different articulatory maneuvers. For example, the low F2 of /u/ can be produced by rounding the lips, backing the tongue, or lowering the larynx. And second, certain parts of the vocal tract—the pharynx, for instance—are exceedingly difficult to monitor during speech production, yet play a very important role in the speech acoustic signal. Scientists who study the relations between articulatory and acoustic data often use advanced mathematical and experimental techniques to determine just how an articulatory phonetic event “maps on” to an acoustic phonetic event.
Vowel durations have been studied extensively because of the potential for application of the data to speech synthesis, machine recognition of speech, and description and possibly diagnosis of speech disorders in which timing disturbances are present. What follows is a brief discussion of the major variables known to affect vowel durations.
An “intrinsic” vowel duration derives from the articulation of the vowel segment itself, as opposed to an external influence (as described more fully below). The easiest way to understand this is to imagine a fixed syllable such as a CVC frame, with vowel durations measured for all vowels inserted into the “V” slot. Figure 11–10 shows three sets of vowel duration values from a CVC frame, as reported by Hillenbrand et al. (2001). In this experiment the Cs included /p, t, k, b, d, g, h, w, r, l/, in all combinations (consonants such as /h/ and /w/ were restricted to initial position). Vowel duration in milliseconds (y-axis) is presented for each vowel (x-axis), averaged across all consonant contexts (filled circles, solid lines), across vowels surrounded only by voiceless Cs (unfilled circles, dotted lines), and only by voiced Cs (lightly filled diamonds, solid lines). The pattern of durations across vowels is essentially the same for these three contexts. Because the contexts stay constant for any one of the three curves, any differences in vowel duration must be a property of the vowels themselves, which is precisely what is meant by an “intrinsic” property. For each of the curves, low vowels such as /ӕ/ and /ɑ/ have greater duration than high vowels such as /i/, /ɪ/, /ʊ/, and /u/. The differences between low and high vowel durations typically are on the order of 50 to 60 ms, a very large difference in the world of speech timing.
The explanation for the intrinsic difference in the duration of low versus high vowels has sometimes been based on the greater articulatory distance required for the consonant-to-vowel-to-consonant path when the vowel is low compared with high. According to this idea, if one vowel requires articulators to travel greater distances than another vowel, it will take more time. The jaw travels a greater distance for the opening required for low versus high vowels, possibly explaining the intrinsic duration difference in low versus high vowels. This may explain part of the duration difference between low and high vowels, but doesn’t account for all of the 50 to 60 ms difference.
Figure 11–10. Vowel durations in fixed CVC frames for eight vowels in American English. Data are shown for environments in which C = voiceless (unfilled circles), C = voiced (lightly shaded diamonds), and for all C environments combined (filled circles). Data replotted from Hillenbrand et al. (2001).
The data in Figure 11–10 show another intrinsic vowel duration difference, between tense and lax vowels. In any one of the three consonant contexts, tense vowels are longer than their lax vowel “partners” (compare the durations of the /i/-/ɪ/ and /u/-/ʊ/ pairs for any of the three curves). The duration difference between tense and lax vowels has a wide range (from about 22 to 65 ms in Figure 11–10, depending on the consonant context), but always favors tense vowels when the consonant environment is held constant. This consistent difference between tense and lax vowel durations is not easy to explain, and may be related to the spectral similarity of tense-lax pairs and the resulting need to distinguish them by duration.4
Listeners are sensitive to the high-low and tense-lax intrinsic differences in vowel duration. When high-quality speech synthesizers are programmed, for example, the differences just described are built into the algorithms to generate natural-sounding speech.
Many extrinsic factors influence vowel duration. What follows is a brief discussion of a few of these influences. Readers interested in comprehensive surveys of how and why vowel duration varies in speech production can consult House (1961), Klatt (1976), Umeda (1975), the series of papers by Crystal and House (1982, 1988a, 1988b, 1988c), and Van Santen (1992).
Consonant Voicing. Vowels are typically longer when surrounded by voiced compared with voiceless consonants. This effect is seen in Figure 11–10 by comparing the “C voiceless” curve (unfilled circles, dashed lines) with the “C voiced” curve (lightly shaded diamonds, solid lines). The effect varies from vowel to vowel, but a reasonable generalization from the Hillenbrand et al. (2001) data is that vowels surrounded by voiced consonants are about 100 ms longer than vowels surrounded by voiceless consonants. The voicing of both the initial and final C in the CVC frame contributes to changes in vowel duration, but the largest influence is the voicing status of the final consonant of the syllable. If the syllable frame is marked as C1VC2, a voiced C1 will lengthen a vowel by between 25 and 50 ms compared with a voiceless C1, whereas a voiced C2 will lengthen a vowel by 50 to 90 ms relative to a voiceless C2. The magnitude of these effects is lessened, perhaps greatly so, in more natural speaking conditions (Crystal & House, 1988a, 1988b).
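The arithmetic behind these voicing effects can be laid out in a short sketch. It takes the C1 and C2 ranges quoted above and combines them, under the simplifying assumption (not stated in the source) that the two effects add independently. The function names and the 150 ms baseline in the example are hypothetical.

```python
# Sketch of the C1VC2 voicing arithmetic described in the text.
# Assumption (not from the source): the C1 and C2 effects simply add.

C1_VOICED_GAIN_MS = (25, 50)  # range quoted in the text for a voiced C1
C2_VOICED_GAIN_MS = (50, 90)  # range quoted in the text for a voiced C2


def voicing_gain_range(c1_voiced, c2_voiced):
    """Return (low, high) lengthening in ms relative to a voiceless-voiceless frame."""
    low = high = 0
    if c1_voiced:
        low += C1_VOICED_GAIN_MS[0]
        high += C1_VOICED_GAIN_MS[1]
    if c2_voiced:
        low += C2_VOICED_GAIN_MS[0]
        high += C2_VOICED_GAIN_MS[1]
    return low, high


def duration_range(voiceless_frame_ms, c1_voiced, c2_voiced):
    """Return the (low, high) vowel duration in ms for the given voicing pattern."""
    low, high = voicing_gain_range(c1_voiced, c2_voiced)
    return voiceless_frame_ms + low, voiceless_frame_ms + high


# Hypothetical baseline: 150 ms for a vowel between two voiceless consonants.
print(voicing_gain_range(True, True))
print(duration_range(150, False, True))
```

With both consonants voiced, the combined range under this additive assumption is 75 to 140 ms, which brackets the generalization of about 100 ms. As the text notes, these magnitudes apply to fixed CVC frames and are likely smaller in more natural speaking conditions.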
Stress. Lexical stress is a characteristic of multisyllabic words in many languages, the best known examples in English being noun-verb contrasts such as “rebel-rebel” (/ꞌrεbl/-/rəꞌbεl/) and “contract-contract” (/ꞌkɑntrӕkt/-/kənꞌtrӕkt/). Many other multisyllabic words have alternating patterns of stressed and unstressed syllables, as in “California” (/kӕləꞌfↄrnjə/), where the first and third syllables have greater stress than the second and fourth syllables. When single syllables are stressed for emphasis or contrast (“Bob stopped by earlier”; “Did you say Barb stopped by?” “No, Bob stopped by”), the vowel in the emphasized syllable has greater duration than the original, lexically stressed production. All other things being equal, vowels in lexically or emphatically stressed syllables have greater duration than vowels in unstressed or normally stressed syllables (Fourakis, 1991). The magnitude of the duration difference between stressed and unstressed syllables, or between contrastively stressed and “normally stressed” syllables, is variable across speakers (Howell, 1993; Weismer & Ingrisano, 1979).
Speaking Rate. Vowel duration varies over a large range when speakers change their speaking rate. Slow rates result in longer vowel durations, fast rates in shorter vowel durations. Speaking rate also varies widely across speakers. Some speakers have naturally slow rates, some fast. Speakers who have habitually slow speaking rates have longer vowel durations than speakers with habitually fast rates (Tsao & Weismer, 1997).
Utterance Position. The same vowel has variable duration depending on its location within an utterance. If the duration of /i/ in the word “beets” is measured in the sentence, “The beets are in the garden” versus “The garden contains no beets,” the /i/ is about 30 to 40 ms longer in the second sentence. This effect is referred to as phrase-final or utterance-final lengthening (see Klatt, 1975). The degree of lengthening depends on the “depth” of the grammatical boundary. A major syntactic boundary yields more vowel lengthening compared with a “shallower” boundary. An extreme example is the greater lengthening at a truly end-of-utterance boundary (when the speaker is finished talking)—compared with a syntactic boundary between two consecutive phrases.
Speaking Style. Over the past quarter-century, since Picheny, Durlach, and Braida (1985, 1986) introduced the notion of “clear speech” as a phenomenon worthy of experimental attention, research on the acoustics and perception of speaking style has been popular (for reviews see Calandruccio, Van Engen, Dhar, & Bradlow, 2010 and Smiljanić & Bradlow, 2011). Clear versus casual speech styles are potentially relevant to such diverse considerations as speaking to someone with a hearing impairment, to someone whose native language is different from the language being spoken, and to someone listening to a native language who is not a fully effective processor of spoken language input (such as infants or toddlers, or persons with intellectual challenges, or computers programmed to recognize speech). A clear speech style is thought to enhance acoustic contrasts that are useful to a listener or machine trying to decode and identify segmental components of the incoming signal.
When speakers produce “clear speech,” they typically slow their speaking rate to produce longer vowel durations, and expand their vowel space. Whether clear speech exaggerates duration contrasts between vowels is, however, unclear. For example, in American English, vowel duration is not a critical contrastive characteristic (phoneme categories are not contrasted strictly by vowel duration), which may explain why clear speech does not clearly exaggerate the duration distinction of vowel pairs such as /ʌ/-/ɑ/ and /ɪ/-/i/, which vary in duration (the first member of each pair is typically shorter than the second member; see DeMerit, 1997) and perhaps in spectrum (formant frequencies; see footnote 4). On the other hand, there is evidence of greater lengthening of tense compared with lax vowels in clear English speech (Picheny et al., 1986), even though tense and lax vowels also have different formant frequencies. Croatian, a language in which each of five vowels may be either long or short, appears to have a tendency for clear speech to emphasize the duration difference (and thus the contrast between the long and short versions of the vowel; see Smiljanić & Bradlow, 2008). However, clear speech in Finnish, another language in which there are short and long vowels, does not seem to exaggerate the long-short vowel duration difference relative to its conversational speech difference (Granlund, Hazan, & Baker, 2012). A careful reading of the literature suggests a lot of speaker-to-speaker variability in the acoustics of speaking clearly. The question remains open of whether or not phonetic contrasts are “improved” by clear speech, if the improvements are seen for all important contrasts (e.g., between vowels, between fricatives), and if clear speech contrasts give the listener a perceptual benefit. Tuomainen, Hazan, and Romeo (2016) provide an interesting discussion of the clear speech literature.
The effects of clear speech reviewed above are becoming part of the fundamental knowledge base for speech-language clinicians, as speaking clearly is increasingly used as an approach to modifying articulatory impairment in a number of clinical populations (see, for example, Park, Theodoros, Finch, & Cardell, 2016; Lam, Tjaden, & Wilding, 2012; Whitfield & Goberman, 2017).
Imagine a large sample of talkers—say, 100 people—each of whom reads a passage from which speaking rates (in syllables per second) are measured acoustically. If the talkers were chosen randomly, you would find a huge range of “typical” speaking rates, from very slow talkers, to talkers of average rate, to very fast talkers. These experimental measurements would conform to the everyday observation that some people speak very slowly, some very rapidly. Now imagine that you chose the slowest and fastest talkers in this sample and asked them to produce the passage as fast as possible. If all talkers produced the passage at the same, maximally fast speaking rate, regardless of their “typical” rate, that would indicate that the very slow or fast “typical” rates were a kind of conscious choice on the part of individual talkers. However, if the slow talkers couldn’t speak as fast as the fast talkers, that would suggest that speaking rates reflect some basic neurological “wiring” that determines the “typical” rate. This experiment was performed by Tsao and Weismer (1997), who found that the maximal speaking rates of slow talkers were, in fact, significantly less than the maximal rates of fast talkers. It seems we are not all wired the same for typical speaking rate, and probably a bunch of other stuff, as well.
American English has five or six diphthongs, depending on the dialect of the speaker and which authority is describing the sounds. In some dialects some or all of the six diphthongs are not always diphthongized. The six diphthongs include /ɑɪ/ (“guys”), /ↄɪ/ (“boys”), /ɑʊ/ (“doubt”), /eɪ/ (“bays”), /oʊ/ (“goes”), and /ju/ (“beauty”). /ju/ is not considered a diphthong in many phonetics textbooks, but has properties similar to the other diphthongs. Spectrograms of the first five are shown in Figures 11–11 and 11–12. Diphthongs have not been studied as extensively as vowels, possibly because the former have sometimes been considered as sequences of the latter. The symbols used to represent diphthongs, after all, are combinations of two vowels. Is the diphthong /ↄɪ/, for example, an /ↄ/ connected to /ɪ/ by a relatively rapid change in vocal tract configuration? In Figure 11–11, at least for /ↄɪ/ and /ɑɪ/, the spectrographic data can be studied to address this question. Figure 11–12 shows spectrograms of the diphthongs /eɪ/ and /oʊ/, discussed below.
Figure 11–11. Spectrograms of the American English diphthongs /ↄɪ/, /ɑɪ/, and /ɑʊ/, spoken in the words “boys,” “guys,” and “doubt,” respectively. LPC tracks are shown in red for F1, F2, and F3. Speaker is a 57-year-old healthy male.
Figure 11–12. Spectrograms of the American English diphthongs /eɪ/ and /oʊ/, spoken in the words “bays” and “goes.” LPC tracks are shown in red for F1, F2, and F3. Speaker is a 57-year-old healthy male.
LPC tracks for F1-F3 are shown as red dashed lines throughout the vocalic nuclei. An LPC formant track is a sequence, over time, of estimated formant frequencies based on LPC analysis. In the case of the tracks in Figure 11–11 a formant frequency is estimated at each glottal pulse throughout the diphthong. For /ↄɪ/ there is a large, rising F2 transition preceded and followed by intervals of nearly unchanging formant frequencies—these are the so-called steady states mentioned earlier. For this /ↄɪ/, the steady state preceding the large F2 transition is of greater duration than the steady state following it, the latter being very brief and perhaps only visible in F2. For the /ɑɪ/ in Figure 11–11, there is also a large F2 transition preceded and followed by steady states. The formant tracks for /ɑɪ/ are somewhat more complicated than the ones for /ↄɪ/ because of the influence of the initial /g/, which causes the initial falling (decreasing frequency) transition in F2. The steady state is the brief interval following this initial downward transition, immediately before the sharp rising transition in F2. The steady state following the transition is, as in the case of /ↄɪ/, most evident in F2.
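The idea of an LPC formant track, a sequence of formant estimates over time, can be illustrated with a minimal sketch. It is not the analysis used to produce the figures: the autocorrelation method, the non-overlapping frames, the analysis order of 4, and the toy all-pole test signal (`synthetic_frame`) are all assumptions made for illustration. Analysis of real speech typically adds pre-emphasis, windowing, and a higher analysis order.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC; returns [1, a1, ..., a_order]."""
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    # Normal (Yule-Walker) equations: R a = -r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1 : order + 1])
    return np.concatenate(([1.0], a))

def formants_from_lpc(a, fs):
    """Convert the roots of the LPC polynomial to resonance frequencies in Hz."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]  # keep one root per conjugate pair
    return np.sort(np.angle(roots) * fs / (2 * np.pi))

def formant_track(signal, fs, frame_len, order):
    """One set of formant estimates per non-overlapping frame."""
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    return [formants_from_lpc(lpc_coefficients(signal[s : s + frame_len], order), fs)
            for s in starts]

def synthetic_frame(formants_hz, fs, n, bandwidth_hz=60.0):
    """Toy input: impulse response of an all-pole filter with the given resonances."""
    radius = np.exp(-np.pi * bandwidth_hz / fs)
    poles = [radius * np.exp(2j * np.pi * f / fs) for f in formants_hz]
    a = np.real(np.poly(poles + [np.conj(p) for p in poles]))
    y = np.zeros(n)
    for i in range(n):
        acc = 1.0 if i == 0 else 0.0
        for k in range(1, min(i, len(a) - 1) + 1):
            acc -= a[k] * y[i - k]
        y[i] = acc
    return y

# Hypothetical two-frame signal: the second resonance moves from 1000 to 1800 Hz.
fs = 10_000
signal = np.concatenate([synthetic_frame([500, 1000], fs, 2000),
                         synthetic_frame([500, 1800], fs, 2000)])
track = formant_track(signal, fs, frame_len=2000, order=4)
for i, formants in enumerate(track):
    print(i, np.round(formants))
```

With this idealized input, each frame's estimates fall very close to the resonances used to build it. Natural speech is far less tidy, which is one reason LPC tracks are usually checked against the spectrogram.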
The spectrographic data are relevant to the question of whether diphthongs are two vowels connected by a rapid change in vocal tract configuration. The research logic is simple. Measure the formant frequencies at the steady states, and the hypothesis of two connected vowels is confirmed (in part) if they are similar to the formant frequencies of the vowels indicated by the transcription. For example, are the formant frequencies for the first steady state in /ↄɪ/ like those measured for the vowel /ↄ/, and the formant frequencies for the second steady state like those for the vowel /ɪ/?
The answer seems to be no. Studies by Holbrook and Fairbanks (1962), Gay (1968), and Lee, Potamianos, and Narayanan (2014; see their Figure 7) do not support the idea of diphthongs as sequences of two vowels, for several reasons. Holbrook and Fairbanks had 20 male speakers produce each of the diphthongs in an /hVd/ frame at the end of a short sentence. They made spectrographic measurements of formant frequencies at the first and last glottal pulses of the diphthongs, as well as at three additional points roughly equidistant between the initial and final points. The formant frequencies of each diphthong are represented by these five measurement points throughout the duration of the vocalic nucleus. These data are summarized in Figure 11–13, which shows averages of the five measured F1-F2 points throughout each diphthong. The direction of the arrow next to each phonetic symbol indicates the direction of the five plotted points from beginning to end of each diphthong. For example, the points for /ↄɪ/ (filled triangles) are indicated by a bent arrow pointing up. The first measurement point, at the first glottal pulse, is located at F1 ~550 Hz and F2 ~800 Hz, and the final measurement point, at the last glottal pulse, is F1 ~500 Hz and F2 ~1900 Hz. The other diphthong paths in F1-F2 space can be interpreted in the same way. Also plotted in Figure 11–13 are the F1-F2 values for the vowels /ɪ/, /ʊ/, /ↄ/, and /ɑ/ reported for adult males by Hillenbrand et al. (1995) for the same /hVd/ frame used by Holbrook and Fairbanks. The phonetic symbol for each vowel identifies its average location in F1-F2 space. The oval enclosing the symbol has no meaning other than to set off the vowel locations from the diphthong points. These vowel points were chosen because the diphthong symbols include them either as end points (/ɪ/ being the end symbol for /ↄɪ/, /ɑɪ/, and /eɪ/; /ʊ/ the end symbol for /oʊ/ and /ɑʊ/) or as starting points (/ɑ/ the start for /ɑʊ/ and /ɑɪ/; /ↄ/ the start for /ↄɪ/).
Compare the F1-F2 points for the vowel /ɪ/ with the end points for the diphthongs /ↄɪ/, /ɑɪ/, and /eɪ/. The end points for /ↄɪ/ and /ɑɪ/ are distant from the formant frequencies for monophthong /ɪ/. Although the end point for /eɪ/ is relatively closer to monophthong /ɪ/, it is substantially different in the F2 dimension. A similar analysis seems to apply to the comparison of /oʊ/ and /ɑʊ/ to the plotted point for /ʊ/. The F1-F2 starting points of /ɑɪ/ and /ɑʊ/ are a good match for monophthong /ɑ/, especially for /ɑʊ/ (filled boxes) and less so for /ɑɪ/; the plotted point for /ↄ/ is very far from the start point for /ↄɪ/.
Holbrook and Fairbanks (1962) concluded that the trajectory of diphthongs in F1-F2 space did not begin and end in well-defined vowel areas. They noted, however, that a careful examination of diphthong paths suggests little or no overlap between the five diphthongs shown in Figure 11–13. A similar conclusion was reached by Lee et al. (2014). When the starting and ending frequencies plus the direction of the F1-F2 change are considered, the five diphthongs separate nicely.
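The combination of start point, end point, and direction of change can be made concrete with a small sketch. It uses only the approximate /ↄɪ/ start and end values quoted earlier from Figure 11–13; the function name `describe_path` and the straight-line "length" summary are illustrative assumptions, not measures taken from Holbrook and Fairbanks (1962).

```python
import math

def direction(delta):
    """Label the sign of a frequency change."""
    if delta > 0:
        return "rising"
    if delta < 0:
        return "falling"
    return "flat"

def describe_path(start, end):
    """Summarize a diphthong path between two (F1, F2) points given in Hz."""
    d_f1 = end[0] - start[0]
    d_f2 = end[1] - start[1]
    return {
        "delta_f1_hz": d_f1,
        "delta_f2_hz": d_f2,
        # Straight-line distance between the two points in the F1-F2 plane
        "length_hz": math.hypot(d_f1, d_f2),
        "f1_direction": direction(d_f1),
        "f2_direction": direction(d_f2),
    }

# Approximate start and end points for the diphthong in "boys" (from the text)
oy = describe_path(start=(550, 800), end=(500, 1900))
print(oy)
```

For this diphthong the summary shows a small fall in F1 and a rise of about 1100 Hz in F2. The same function could be applied to the other paths once their start and end values are read from the figure.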
Perhaps the difficulty of representing diphthongs as two sequenced vowels should have been obvious by examining spectrograms of natural productions of the sounds. As noted earlier, initial and final steady states can be identified for /ↄɪ/ and /ɑɪ/ (see Figure 11–11), but in the case of /ɑʊ/ F1 and F2 are changing at the beginning and end of the diphthong. A similar absence of initial and final steady states is seen for /eɪ/ and /oʊ/ (see Figure 11–12). The absence of steady states in diphthongs has been noted by previous scientists (Lehiste & Peterson, 1961), and cited as a potential complication in classifying diphthongs as a sequence of two vowels.
Figure 11–13. Diphthong paths in the F1-F2 plane, and F1-F2 points for four monophthongs. Each diphthong formant path is represented by five, equally spaced measurement points throughout the diphthong duration. The first point (at the beginning of the diphthong) is the one preceding the arrowhead, and the last point is the one terminating the path, in the direction indicated by the arrow (see text for a worked example). Diphthong path symbols, clockwise from /ɑʊ/, lightly shaded diamonds; /ↄɪ/, filled triangles; /oʊ/, unfilled triangles; /eɪ/, filled circles; and /ɑɪ/, unfilled circles. Diphthong data from Holbrook and Fairbanks (1962), vowel data from Hillenbrand et al. (1995).
If steady states are not a reliable characteristic of diphthongs, what is? A close look at Figures 11–11 and 11–12 suggests that each of the diphthongs has an identifiable, and in some cases substantial, transitional segment. The transitional segments, usually most pronounced in F2 but also seen in F1 and F3, reflect the rapid change in vocal tract shape between the initial and final parts of the diphthong.
Gay (1968) performed an experiment to determine which aspects of diphthong production varied and which remained constant across changes in speaking rate. Gay asked speakers to produce diphthongs at slow, conversational, and fast speaking rates, and measured F1 and F2 steady states at the beginning and end of the diphthongs as well as the slopes (speeds) of F2 transitions. The variation of speaking rate resulted in substantial changes in the duration of the diphthongs (see below), but the F1 and F2 onset measurements remained stable. In contrast, the F1 and F2 offset measures (at the hypothetical “second vowel” of the diphthong) varied substantially. Of special interest was the finding that the slope of the F2 transition was essentially constant across the rate changes: “The second formant rate of change [that is, the slope] for each diphthong remains relatively constant across changes in duration and distinct from the rates of change of the other diphthongs” (Gay, 1968, p. 1571, emphasis added). For Gay, the slope of the F2 transition was a constant and distinguishing characteristic of diphthongs. His findings argued against the notion of diphthongs as merely sequences of two vowels connected by a transition. Rather, diphthongs appeared to be a sound class separate from vowels (see Watson & Harrington, 1999, for similar comments on the difference between vowels and diphthongs in Australian English).
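The pattern Gay described (stable onset, constant F2 slope, variable offset) can be sketched numerically. The values below are invented for illustration and are not Gay's data; `toy_transition` and `f2_transition_slope` are hypothetical helper names, and the slope is estimated with a simple least-squares line fit.

```python
import numpy as np

def f2_transition_slope(times_ms, f2_hz):
    """Least-squares slope of an F2 transition, in Hz per ms."""
    slope, _intercept = np.polyfit(times_ms, f2_hz, 1)
    return slope

def toy_transition(onset_hz, slope_hz_per_ms, duration_ms, step_ms=5):
    """Hypothetical linear F2 transition sampled every step_ms."""
    times = np.arange(0, duration_ms + step_ms, step_ms)
    return times, onset_hz + slope_hz_per_ms * times

# Same onset and same slope; only the transition duration differs with rate.
slow_t, slow_f2 = toy_transition(onset_hz=1000, slope_hz_per_ms=10, duration_ms=80)
fast_t, fast_f2 = toy_transition(onset_hz=1000, slope_hz_per_ms=10, duration_ms=50)

print("slow: slope", f2_transition_slope(slow_t, slow_f2), "offset", slow_f2[-1])
print("fast: slope", f2_transition_slope(fast_t, fast_f2), "offset", fast_f2[-1])
```

In this toy case the two productions share an onset and a slope, but the shorter transition ends at a lower F2. This is the kind of offset variation that makes the "second vowel" an unreliable target.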
Figure 11–13 makes the case that direction of movement in F1-F2 space, when included with starting and ending frequencies, separates the American English diphthongs nicely. It is as if inherent vocal tract movement characteristics must be accounted for to distinguish between diphthongs. This conceptual strategy for distinguishing between diphthongs apparently applies to some vowels as well. Like diphthongs, vowels such as /ɪ/, /ε/, and /ʊ/ have inherent movement characteristics (Nearey & Assmann, 1986) important for their identification. These movements are expressed as formant transitions throughout the vowel. These vowels (the lax vowels, mostly) are also quite variable across dialects, and their inherent movement characteristics may be specific to particular dialects. Recently, Jacewicz and Fox (2012, see their Figures 1 and 2) plotted data for lax vowels produced by speakers from southern Wisconsin and western North Carolina, in a way similar to the diphthong data in Figure 11–13. The plot shows that, when formant transitions are taken into consideration for these vowels, almost always identified as monophthongs, much of the confusion among the vowels disappears. So, are /ɪ/, /ε/, and /ʊ/ vowels or diphthongs?
The duration characteristics of diphthongs have not been well studied, but most speech scientists, if asked, would expect diphthongs to be somewhat longer than monophthong vowels in equivalent environments and speaking conditions. Data published by Umeda (1975) for a single speaker suggest that /ɑɪ/, /ɑʊ/, and possibly /eɪ/ have greater duration than monophthongs. The diphthong duration data reported for adults by Lee et al. (2014) and Tasko and Greilick (2010) are 30 to 70 ms greater than durations reported for monophthong vowels (Gopal, 1996; Klatt, 1976; Umeda, 1975) produced at conversational speaking rates. The intuition of speech scientists, at least in the case of diphthong duration in contrast to monophthong duration, appears to be correct.
Data on the acoustic characteristics of nasals are more limited than those on vowels. There are no large-scale studies on formant and antiresonance frequencies (Chapter 9) in nasals, or their variation across speakers due to age and sex. The relatively small body of work on nasals has been concerned with the acoustic characteristics associated with nasal manner and place of production.
As discussed in Chapter 9, nasal articulations are described acoustically in two broad categories. One, the nasal murmur, concerns acoustic characteristics during the interval of complete oral cavity closure with an open velopharyngeal port. For example, the nasal murmur for /m/ occurs during the interval when the lips are sealed and sound waves travel through the open velopharyngeal port and radiate from the nostrils. During the nasal murmur, the speech spectrum includes resonances of the combined pharyngeal and nasal cavities, as well as antiresonances originating in the closed oral cavity and the sinus cavities. The second category of nasal articulation is nasalization, or the articulation of vowels (that is, with an open oral cavity) with a velopharyngeal port sufficiently open to “add” nasal resonances and antiresonances into the oral vowel resonances.
Figure 11–14 shows spectrographic characteristics of the nasal murmur interval for /m/ in stressed CVC syllables surrounding an /i/ (left side of upper panel) and /ɑ/ (right side of upper panel). The boundaries of the prestressed (first /m/) murmur intervals are marked below the spectrogram baseline by short, vertical bars. These murmur intervals have durations of slightly greater than 100 ms (left spectrogram) and just under 100 ms (right spectrogram). These murmur durations are likely to be typical for other speakers (see Umeda, 1977) and, like other segment durations, vary with speaking rate, stress, and other factors.
The nasal murmur intervals of both utterances shown in the upper panel of Figure 11–14 are clearly less intense than the surrounding vowels. Using the marked boundaries of the murmur intervals as reference points, note the dramatic change in intensity from vowel to murmur or murmur to vowel. The intensity difference is reflected in the relative darkness of the vowel and nasal murmur traces. As discussed in Chapter 9, nasals tend to be less intense than vowels and other sonorant sounds (such as liquids, glides, rhotics) for two reasons. One is the presence of antiresonances in the spectrum, which result in a substantial reduction of energy at the exact frequency of the antiresonance and at frequencies in the immediate vicinity of the antiresonance. An antiresonance from the middle 50 ms of the first nasal murmur of /həꞌmim/ is indicated at 750 Hz by a downward-pointing arrow in the Fast Fourier Transform (FFT) spectrum shown below the spectrogram. Note the general depression of spectral energy around 1000 Hz, as well as the “white space” from 500 to 1100 Hz in the spectrogram, reflecting the broad influence of the antiresonance. Because a speech sound’s total energy is the sum of all energies at all frequencies, the presence of antiresonances in nasal murmur spectra makes their overall intensity relatively low compared with vowels. The second reason for the relatively weak intensities of nasal murmurs is the greater absorption, and therefore loss, of sound energy when acoustic waves propagate through the nasal cavities (see Chapter 9). The greater absorption of sound results in wider formant bandwidths and lower peak amplitudes of resonances.
Figure 11–14. Spectrographic characteristics of the nasal murmur interval for /m/ in stressed CVC syllables surrounding an /i/ (left side of top panel) and /ɑ/ (right side of top panel). Short, vertical bars immediately below the baseline of the spectrogram mark the onsets and offsets of the nasal murmur intervals. An FFT spectrum from the middle 50 ms of the /m/ preceding the /i/ is shown in the lower part of the figure; the downward-pointing arrow indicates the approximate location of an antiresonance.
The first formant of the nasal murmur is marked in both spectrograms as F1n. The subscript “n” indicates that the resonance is from the combined pharyngeal and nasal cavities (hereafter, nasal cavities). This lowest resonance of the nasal cavities is a nearly constant characteristic of nasal murmurs, with the greatest intensity among nasal formants and a frequency around 300 to 400 Hz (Fujimura, 1962). The nasal murmur spectrum contains a fair number of resonances in addition to F1n, as well as antiresonances. The spectrograms in Figure 11–14 illustrate this well, with the prestressed murmur of /həꞌmim/ showing an F2n around 1500 Hz (at least for the first part of the murmur), a possible pair of formants (F3n, F4n) around 2000 Hz, and another around 3000 Hz. In the prestressed murmur of /həꞌmɑm/ (Figure 11–14, upper right panel), there is a similar pattern of resonances above F1n but of much weaker intensity compared with /həꞌmim/. Fujimura described the substantial variability in patterns of nasal resonances and antiresonances, both across and within speakers. Presumably, the within-speaker variation is primarily a result of changing phonetic contexts. The differences in the nasal resonances and major antiresonance for the /i/ and /ɑ/ contexts can be seen in Figure 11–14 (antiresonance location indicated by the different white spaces in the two nasal murmurs).
It may seem odd to see spectral evidence of different vowels during a nasal murmur, for which sound transmission is fully (or nearly so) through the nasal cavities. Such vowel-context effects on nasal murmur spectra have been demonstrated in modeling studies based on human vocal tract and nasal tract cavity measurements. In these studies the vocal tract is represented with different vowel shapes with the velopharyngeal port set to “wide open” to see what happens to nasal spectra as the vocal tract configuration is varied. Serrurier and Badin (2008, Figure 23) present a beautiful plot showing how their model reveals subtle vowel effects on nasal murmur spectra. Such model data are consistent with the spectrographic differences shown in Figure 11–14 for /mim/ versus /mɑm/.
Perhaps one explanation for the absence of an acoustic data base for nasals comparable to vowels is the difficulty of identifying formants and antiresonances during the nasal murmur. Many nasal formants above F1n are challenging to locate because of their weak intensity, and antiresonances are often inferred from the absence of energy, rather than by something definitive in the spectrogram or spectrum.5 An alternative approach to identifying important acoustic characteristics of nasal murmurs (or any speech sound) is to make several measures of the murmur spectrum and use those measures in an automatic classification analysis. This work is usually done by scientists interested in computer recognition of speech. They want to know which acoustic features allow the most rapid and accurate identification of individual speech sounds.
A good example of this work is found in Pruthi and Espy-Wilson (2004). These investigators were interested in machine recognition of segments having a nasal manner of articulation. Pruthi and Espy-Wilson (2004) noted that nasals are often confused with liquids (/l/, /r/) and glides (/w/, /j/) when computers classify speech sounds using extracted acoustic measures (that is, a computer algorithm that identifies segments through temporal and spectral measures). They designed four acoustic measures, based on consideration of the acoustic characteristics of nasal murmurs relative to those associated with the constriction interval of liquids and glides (the relatively constant formant frequencies of liquids and glides—see below), as classification parameters for nasal manner of articulation. In a sense, the selection of the four measures was a hypothesis concerning the acoustic characteristics required to identify the nasal manner of production. These measures, with a brief explanation of why they were chosen, are listed in Table 11–1.
The point of this discussion is not to consider in detail the four measures selected by Pruthi and Espy-Wilson (2004) but to demonstrate the value of their experiment for understanding critical acoustic features of speech sounds. Pruthi and Espy-Wilson, using a combination of these measures, correctly classified nasal manner of production for 94% of the more than 1000 nasal murmurs in a large database of sentences spoken by many different speakers. These measures were chosen carefully, to reflect aspects of nasal murmur acoustics previously described in the research literature. The measures also made sense in terms of the theory of vocal and nasal tract acoustics. The very successful classification performance suggests that these measures are strong candidates for further acoustic studies of nasal production and perception.6
Table 11–1. Four Acoustic Measures Used by Pruthi and Espy-Wilson (2004) for Automatic Classification of Nasal Manner of Articulation
Note. The measures were chosen as most likely to separate nasal manner from liquid and glide manner, because nasals are often mistaken as liquids or glides by speech recognition devices. Descriptions and explanations of the measures have been modified slightly from the original presentation.
There is a long history of documenting the acoustic correlates of place of articulation for consonants, including nasals. Much of this history is concerned with the acoustic cues used by listeners, not computers, to identify place of articulation. Although speech perception is considered in greater depth in the next chapter, the case of nasal place of articulation provides a good introduction to the interplay between studies of acoustic characteristics of speech sounds and their role as cues in speech perception.
Consider the following thought experiment. Imagine a CV syllable, where C = /m/ or /n/ and V = /i/, /ε/, /ӕ/, /ɑ/, /o/, or /u/. The nasals are chosen to represent the two (/m/ and /n/) that occur in syllable-initial position of English words. The vowels are chosen to sample different locations around the vowel quadrilateral, to maximize potential coarticulatory influences on murmur acoustics, that is, to create variability in murmur acoustics due to phonetic context. The complete set of 12 syllables (2 nasal consonants × 6 vowels) is spoken by several speakers and saved as computer wave files. Speech analysis programs are used to make certain measurements as well as to “pull out” from each syllable selected temporal pieces from the wave files for presentation to listeners. The waveform pieces are (1) an interval from the murmur, (2) an interval that straddles the murmur-vowel boundary (therefore containing formant transitions from the murmur release into the vowel), and (3) an interval that includes only the transitions after the murmur release. Figure 11–15 illustrates these waveform pieces with a spectrogram of the utterance /əꞌmε/. The boundaries of the nasal murmur are indicated by the short, upward-pointing arrows at the baseline. The three waveform pieces are shown by narrow rectangles (“windows”) superimposed on the spectrogram. Each window is approximately 25 ms in duration. The leftmost window is the murmur piece, the middle window the piece straddling the boundary between murmur offset and vowel onset (murmur + transition piece), and the right-hand window the transition piece, that is, the piece of the vowel containing transitions to the vowel, immediately after the release of the murmur.
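The "pulling out" of the three waveform pieces can be sketched as simple array slicing. This is a minimal illustration, not the procedure used in any of the cited studies: the function name `extract_pieces`, the 5-ms gap between adjacent windows, and the exact placement of each window relative to the murmur-vowel boundary are assumptions; only the 25-ms window duration comes from the text.

```python
import numpy as np

def extract_pieces(wave, fs, boundary_s, window_s=0.025, gap_s=0.005):
    """Slice three equal-length pieces around the murmur-vowel boundary.

    wave       : 1-D array of samples
    fs         : sampling rate in Hz
    boundary_s : time of murmur offset / vowel onset, in seconds
    gap_s      : assumed spacing between adjacent windows (hypothetical)
    """
    w = int(round(window_s * fs))
    gap = int(round(gap_s * fs))
    b = int(round(boundary_s * fs))

    straddle_start = b - w // 2            # centered on the boundary
    murmur_start = straddle_start - gap - w  # entirely before the boundary
    transition_start = straddle_start + w + gap  # entirely after the boundary

    if murmur_start < 0 or transition_start + w > len(wave):
        raise ValueError("windows fall outside the waveform")

    return {
        "murmur": wave[murmur_start : murmur_start + w],
        "murmur_plus_transition": wave[straddle_start : straddle_start + w],
        "transition": wave[transition_start : transition_start + w],
    }

# Toy waveform whose sample values equal their own indices, for easy checking.
fs = 16_000
wave = np.arange(fs)                      # 1 s of "audio"
pieces = extract_pieces(wave, fs, boundary_s=0.5)
for name, piece in pieces.items():
    print(name, piece[0], piece[-1], len(piece))
```

In a real experiment the boundary time would come from inspection of the spectrogram or waveform. Each piece would then be presented to listeners or passed to a classifier on its own or in combination with the others.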
These waveform pieces can be used to address the question “Which piece(s) of the waveform, or which acoustic characteristics within each piece, allow computer and human classification of nasal place of articulation?” Stated in a different way: “Do the three pieces allow equivalent accuracy in classifying place of articulation for nasals, or is one piece ‘better’ than another?” For example, if only the nasal murmur piece (the leftmost box) is presented to listeners, can they use this acoustic information to make accurate identification of place of articulation? Alternatively, if certain acoustic characteristics are extracted from the murmur windows for /m/ and /n/, can these be used by an automatic classification algorithm to separate the two places of articulation? These types of experiments appear several times in the speech acoustics literature (e.g., Harrington, 1994; Kurowski & Blumstein, 1984, 1987; Repp, 1986). The results of the experiments, although not always precisely consistent, suggest that any of these three pieces, when presented to human listeners or classified statistically, allow fairly accurate identification of place of articulation for syllable-initial nasals. Moreover, when two or more of the pieces are presented (or classified) together, the accuracy of place identification improves.
Figure 11–15. Spectrogram of the utterance /əꞌmε/, shown with three 25-ms “windows” centered at different times. The leftmost window (light green shading, “murmur piece”) is within the nasal murmur, the center window (light blue shading, “murmur + transition piece”) straddles the murmur-vowel boundary, and the rightmost window (light yellow shading, “transition piece”) is completely within the vowel, during the formant transitions from the murmur to the vowel.
First consider the results for isolated murmurs. If nasal place of articulation can be identified (classified) accurately from just the murmur, something about its acoustic characteristics must be systematically different for labials versus lingua-alveolars. The duration of the murmur can be ruled out as a measure that separates /m/ from /n/, but there are almost certainly spectral differences between the two nasal murmurs. The spectral differences are a result of different resonances and antiresonances of the coupled pharyngeal and oral tracts. In Chapter 9, the frequency locations of the antiresonances for /m/ versus /n/ were discussed, the former having a lower region (according to Fujimura, roughly 750–1250 Hz), the latter a higher region (1450–2200 Hz). The unique resonances for /m/ in Fujimura’s study included a two-formant cluster in the vicinity of 1000 Hz, and for /n/ a similar cluster above 2000 Hz. Qi and Fox (1992), in an analysis of the first two resonances of nasal murmurs for /m/ and /n/ produced by six speakers, reported average second resonances for /m/ and /n/ of 1742 and 2062 Hz, respectively. Careful examination of the /m/ and /n/ murmurs in Figure 11–16 shows formant patterns, and inferred antiresonance locations, generally consistent with the observations of Fujimura and of Qi and Fox. Both /m/ and /n/ have the expected first formant around 300 Hz, but in the frequency range above this formant, the spectra are different. Between the F1 and the next evidence of a higher resonance, both murmurs have an obvious “white space.” For /m/, the white space extends roughly from 500 to 1150 Hz, and for /n/ it is located between 500 and 1600 Hz. The different antiresonance locations for the /m/ and /n/ are expected from the different cavity volumes behind the respective places of articulation (Fant, 1960).
Under the assumption that the exact antiresonance frequencies are approximately in the middle of these white space ranges, their center frequencies are 850 Hz for /m/ and 1050 Hz for /n/. Immediately above the antiresonance, the /m/ murmur has a second and perhaps third resonance around 1500 Hz, and above that a resonance around 2300 Hz. In comparison, the /n/ murmur has a second resonance around 1800 Hz, and what appears to be a cluster of two resonances around 2500 Hz. Clearly, the two murmurs have different spectral characteristics. It is reasonable to expect that listeners can use these differences to identify place of articulation.7 In Repp (1986), listeners made accurate place judgments for /m/ and /n/ when given only the murmur piece of CV syllables, and Harrington (1994) obtained excellent statistical classification for these two nasals when using acoustic information from single spectrum “slices” taken from the murmur.
The spectrograms in Figure 11–16 also show different patterns of formant transitions as the murmur is released into the following /ε/. There is a long history of considering formant transitions at CV boundaries as strong cues to consonant place of articulation. This history originated in the late 1940s and early 1950s, at the Haskins Laboratories, where experiments showed that synthesized formant transitions cued place of articulation in the absence of consonant spectra (Liberman, Delattre, Cooper, & Gerstman, 1954). Figure 11–16 shows spectrograms of /əꞌmε/ and /əꞌnε/, in which the differences between the F2 and F3 transitions following release of the murmur can be seen. Figure 11–17 shows a spectrogram of the utterance /əꞌŋε/ to contrast its transitions with those in Figure 11–16. The patterns of F2 and F3 transitions over the first 40 or 50 ms following release of the murmur are unique to the different places of nasal articulation. Transitions coming out of the /m/ murmur have rising F2 and F3; transitions out of /n/ are more or less flat; and out of /ŋ/, F2 falls and F3 rises. This latter pattern (see Figure 11–17) shows F2 and F3 starting at the murmur-vowel boundary nearly at the same frequency and separating throughout the transition. Details of these F2-F3 transition patterns depend on the identity of the vowel following the murmur, but in most cases the patterns are different for the three places of articulation. In English the velar nasal /ŋ/ does not appear in the prestressed position shown in Figure 11–17, and this is why the examples discussed above have been restricted (until now) to /m/ and /n/. The absence of /ŋ/ in word-initial position is not a physiological limitation (there are languages in which the sound can appear in this position), so the point concerning place-specific transition patterns is relevant and, as shown below, applies to transition characteristics of stop place of articulation.
Figure 11–16. Spectrograms of the utterances /əꞌmε/ and /əꞌnε/, produced by a 57-year-old healthy male. Note between-place differences in the murmur spectra and the pattern of formant transitions as the nasal is released into the vowel.
Figure 11–17. Spectrogram of the VCV utterance /əꞌŋε/, produced by a 57-year-old male. Note the F2-F3 transition pattern immediately following release of the nasal murmur.
Place of articulation information is present in these unique transition patterns, as demonstrated for listeners (Repp, 1986) and statistical classification (Harrington, 1994). The accuracy of place identification from the transitions (i.e., the “transition piece” of the spectrogram shown in Figure 11–15) is similar to the accuracy from the “murmur piece.” Place information for nasals, therefore, seems to have acoustic correlates in at least two different locations throughout a CV syllable, and these correlates are sufficiently stable to be reliable for listeners and statistical classification of nasals.
Finally, in Figure 11–15 the middle “piece” straddles the boundary where the murmur ends and the transition begins. This interval has a very rapid change from the low-energy, unique resonance patterns associated with murmurs to the high-energy formant patterns for vowels. Several scientists (Kurowski & Blumstein, 1984, 1987; Seitz, McCormick, Watson, & Bladon, 1990) have argued that each place of articulation is associated with a unique frequency pattern for this rapid change. Whether or not this particular “piece” is more important in the classification of nasal place of articulation compared with the murmur or transitions alone is a matter of considerable debate.
Nasalization, as reviewed in Chapter 9, is a concept with broad application in general and in clinical phonetics. Nasalization of vowels, which is of concern here, involves complex acoustics resulting from the mix of oral and nasal tract formants with antiresonances originating in the sinus cavities.8 Only a few studies have reported data on the spectra of nasalization. Theoretical treatments of vowel nasalization can be found in Feng and Castelli (1996), Pruthi, Espy-Wilson, and Story (2007), Rong and Kuehn (2010), Stevens, Fant, and Hawkins (1987), and Serrurier and Badin (2008).
Chen (1995, 1997) developed two acoustic measures of nasalization, one of which is described here. Recall that a Fourier spectrum of a vowel shows the amplitude of the consecutive harmonics (where the first harmonic = F0) produced by the vibrating vocal folds. Furthermore, the varying amplitude of the harmonics reflects, in part, the resonance characteristics of the vocal tract. When a vowel is produced with an open velopharyngeal port and is nasalized, some glottal harmonics in the region of the nasal resonances have increased amplitude (relative to the amplitudes when the vowel is not nasalized), and some harmonics in the region of oral formants have reduced amplitude, due to nearby antiresonances and damping from increased absorption of sound energy in the nasal cavities (see Chapter 9). Chen (1995) took advantage of these facts and constructed a spectral measure of nasalization in which the amplitude of the harmonic closest to the first formant was compared with the amplitude of a harmonic close to the location of the second nasal resonance. The technique is illustrated in Figure 11–18.
Figure 11–18 shows two FFT spectra, both for the middle 30 ms of the vowel /i/, in the CVCs [bib] (left spectrum) and [min] (right spectrum). In both spectra, two harmonics are of interest. One, labeled “A1,” is the harmonic in the immediate vicinity of the low-frequency F1 of /i/. As expected, the A1 harmonic in the two spectra has relatively high amplitude compared with the other harmonics. This makes sense because the amplitude of this harmonic is “boosted” by the typical first resonant frequency (F1) of /i/, which is approximately 300 Hz in adult males. The other harmonic of interest is the one labeled “P1.” Chen (1995) identified this harmonic as “boosted” when the oral and nasal cavities are coupled via opening of the velopharyngeal port. The P1 harmonic is close to the second resonance of the nasal cavity. In theory, the amplitude of the P1 harmonic is quite low when the velopharyngeal port is closed and relatively greater when the port is open. Based on theoretical and experimental work, Chen recommended identifying the P1 harmonic by locating the FFT peak closest to 950 Hz.
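The selection of the two harmonics can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Chen's own procedure: the function name `pick_a1_p1`, the 150-Hz search window around F1, and the harmonic levels are all invented for the example, and P1 is simplified to the harmonic nearest 950 Hz rather than a peak located in a full FFT spectrum.

```python
def pick_a1_p1(harmonics, f1_hz, f1_window_hz=150.0, p1_target_hz=950.0):
    """Pick the A1 and P1 harmonics from (frequency_hz, level_db) pairs.

    f1_window_hz is an illustrative assumption about how far from F1 a
    harmonic may lie and still count as "in the vicinity" of F1.
    """
    # A1: the highest-amplitude harmonic within the window around F1.
    near_f1 = [h for h in harmonics if abs(h[0] - f1_hz) <= f1_window_hz]
    if not near_f1:
        raise ValueError("no harmonic lies within the F1 window")
    a1 = max(near_f1, key=lambda h: h[1])
    # P1: the harmonic closest in frequency to the 950-Hz target.
    p1 = min(harmonics, key=lambda h: abs(h[0] - p1_target_hz))
    return a1, p1


# Invented harmonic levels (dB relative to the strongest harmonic), F0 = 120 Hz.
harmonics = [
    (120.0, -12.0), (240.0, 0.0), (360.0, -6.0), (480.0, -20.0),
    (600.0, -30.0), (720.0, -38.0), (840.0, -44.0), (960.0, -45.0),
    (1080.0, -47.0),
]

a1, p1 = pick_a1_p1(harmonics, f1_hz=300.0)
index_db = a1[1] - p1[1]  # A1-P1 amplitude difference in dB
print(a1, p1, index_db)
```

With these made-up levels the second harmonic (240 Hz) is chosen as A1 and the eighth harmonic (960 Hz) as P1. In practice the harmonic frequencies and levels would come from an FFT of the vowel segment, and the F1 estimate from a formant analysis.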
Chen’s (1995) index of nasalization requires measurement of the A1-P1 amplitude difference within a vowel spectrum. In Figure 11–18, the A1 and P1 relative amplitudes for [i] surrounded by [b] are roughly 0 and −45 dB, respectively, yielding an A1-P1 index of 45 dB. When [i] is surrounded by nasal consonants (right spectrum), and likely to be partially nasalized due to coarticulatory influences, A1 = −4 dB and P1 = −38 dB, giving an A1-P1 index of 34 dB. As expected, the A1-P1 index for the vowel in a nasal environment is smaller than the index when the vowel is between non-nasal consonants. In Figure 11–18 this is due to both a decrease in the amplitude of A1 and an increase in the amplitude of P1, as expected from Chen’s analysis strategy.
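The arithmetic behind the index can be checked with a short Python sketch that uses the rounded readings quoted above (A1 = 0 dB and P1 = −45 dB in the non-nasal context; A1 = −4 dB and P1 = −38 dB in the nasal context). The function name `a1_p1_index` is introduced here for illustration only. The last lines show how much of the change in the index comes from each harmonic.

```python
def a1_p1_index(a1_db, p1_db):
    """Return the A1-P1 amplitude difference in dB."""
    return a1_db - p1_db


# Rounded readings quoted in the text for Figure 11-18.
oral_index = a1_p1_index(0.0, -45.0)    # [i] between [b] consonants
nasal_index = a1_p1_index(-4.0, -38.0)  # [i] between nasal consonants

# Split the change in the index into its two sources.
a1_drop = 0.0 - (-4.0)      # A1 is lower in the nasal context
p1_rise = -38.0 - (-45.0)   # P1 is higher in the nasal context
index_change = oral_index - nasal_index

print(oral_index, nasal_index, a1_drop, p1_rise, index_change)
```

On these readings the subtraction gives 45 dB for the non-nasal context and 34 dB for the nasal context. The 11-dB reduction is the sum of a 4-dB drop in A1 and a 7-dB rise in P1.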
Figure 11–18. Two /i/ spectra showing the application of Chen’s (1995) acoustic technique to the quantification of the degree of vowel nasalization. The left spectrum is for /i/ in a non-nasal context; the right spectrum is for /i/ in a nasal context. “A1” is the highest-amplitude harmonic in the vicinity of the F1 for /i/; “P1” is a harmonic around 950 Hz, close to the second resonance of the nasal tract. The acoustic measure is the amplitude difference, in dB, between A1 and P1.