Speech Perception


Speech perception is a multifaceted and complicated topic that depends in important ways on information presented in Chapters 7 to 11. The study of speech perception, however, has its own jargon and theoretical content. The complications occur partly because there are several competing theories of speech perception, and partly because it is not always clear how experimental data bear on a choice between those theories. The same data, or at least the same general form of data, can be used to support two theories that provide opposing accounts of speech perception. This is somewhat different from speech acoustics, where a single theory seems to explain the major phenomena, and the relation of data to theory is fairly straightforward.

The goal of this chapter is to present a review of speech perception that has potential relevance to clinical applications. Because history is very important to a proper understanding of the scientific study of speech perception, the early days of speech perception research are reviewed to show how these beginnings dictated the course of thinking in the area for the past half-century, indeed to the present day. The major theories of speech perception are reviewed, together with selected experiments whose results are consistent (or inconsistent) with these theories. We discuss the differences in traditional approaches to speech perception, which focus on segment identity (or identity of several segments in sequence), and the recognition of spoken words. The idea of spoken word recognition as a key part of speech perception is introduced. The chapter concludes with a review of speech intelligibility and its direct relevance to clinical practice in speech and hearing diagnosis and management.


As reviewed by Cole and Rudnicky (1983), scientists in the late nineteenth and early twentieth centuries were interested in speech perception. Systematic work in this area, and the creation of speech perception research as a flourishing scientific discipline, began in the late 1940s and early 1950s. Scientists at Haskins Laboratories—initially in New York City, later and currently in New Haven, Connecticut—employed an early kind of speech synthesizer to perform experiments on the perception of speech. The scientists who developed the synthesizer—Franklin Cooper, Alvin Liberman, and John Borst—were already familiar with the spectrogram, at that time a relatively new way to display speech acoustic events (Chapter 10). Cooper, Liberman, and Borst (1951) described a machine that allowed them to use spectrographic features of natural speech, such as vowel formant frequencies, formant transition characteristics, and stop burst spectra, and manipulate them in small steps for presentation to listeners. Such a device, they reasoned, would allow them to identify the important acoustic cues for the identification of specific speech sounds.

A diagram of this speech synthesizer, called a pattern-playback machine, is shown in Figure 12–1. The machine used the principle that fluctuations in light can be transformed into sound waves, under the proper conditions. Figure 12–1 shows how a light source was directed at a rotating wheel that was a circular film negative. On this negative a series of rings were arranged to produce frequencies from 120 Hz to 6000 Hz, in a consecutive-integer harmonic series. The rotating circle was aptly called a “tone wheel.” A spectrographic pattern was painted on a transparent sheet (such as the overhead transparencies used in classroom presentations before the advent of computer-based presentations) and mounted on a movable surface exposed to light coming through the tone wheel. As light passed through the harmonic rings on the film negative, the painted spectrogram was transported by a moving belt through the machine. The dark parts of the spectrogram reflected certain of the “light frequencies” generated by the wheel, whereas the clear parts of the spectrogram generated no reflection. The movement of the belt simulated the time dimension of speech. The reflected light frequencies, changing over time as the belt moved, were transmitted to a device that converted the reflections to sound (the light collector in Figure 12–1). These sound waves were amplified and output by a loudspeaker, producing the speech-like sounds resulting from the time-varying patterns of light reflection.
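The tone wheel's frequency layout can be checked with a little arithmetic. The sketch below assumes that the lowest ring (120 Hz) is the fundamental of the series, which is what a consecutive-integer harmonic series running from 120 Hz to 6000 Hz implies; the variable names are illustrative only.

```python
# Assumption: the lowest ring (120 Hz) is the fundamental of the series.
FUNDAMENTAL_HZ = 120
TOP_HZ = 6000

# Consecutive-integer harmonics: 1 x 120, 2 x 120, ..., up to 6000 Hz.
harmonics = [FUNDAMENTAL_HZ * k for k in range(1, TOP_HZ // FUNDAMENTAL_HZ + 1)]

print(len(harmonics))   # 50
print(harmonics[:3])    # [120, 240, 360]
print(harmonics[-1])    # 6000
```

Under this assumption the wheel supplies 50 tones spaced 120 Hz apart, which would set the frequency resolution available to any painted pattern.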

The key to understanding how the device was used to create small changes in spectrographic patterns is this: the clear sheet on which the spectrographic representation was painted reflected whatever pattern the investigator desired; any type of “speech signal” could be painted. The Haskins scientists asked the question, how are these different patterns heard? Cooper and colleagues (1951) performed a set of experiments with the pattern-playback machine in which they discovered that a two-formant pattern with proper F1 and F2 transitions elicited perception of stop-vowel syllables, even though stop bursts were not included in the painted pattern. Such patterns—like stick figure representations of real speech signals—are shown in Figure 12–2 for the voiced (top) and voiceless (bottom) set of English stop consonants. The portion of the signals between the dotted, vertical lines shows the transitions, or changing formant frequencies as a function of time. The steady states of F1 and F2 were painted to elicit perception of the vowel /ɑ/. For the voiceless stops, the F1 transition was “cut back” in time relative to the F2 transition. More is said about this later.

When the top patterns in Figure 12–2 were transported by the moving belt and their reflected light converted over time to sound, most listeners heard the leftmost pattern as /bɑ/, the middle pattern as /dɑ/, and the rightmost pattern as /gɑ/. Cooper and colleagues (1951) selected these patterns to mimic the ones they had seen in spectrograms of natural speech. They already knew the F2 transition was different for the three places of stop consonant articulation in English, at least when the stops were followed by /ɑ/. The finding that a burstless pattern elicited the perception of stop consonants, with the correct place of articulation dependent on the pattern of transitions shown in Figure 12–2, was somewhat of a surprise.

The /bɑ/-/dɑ/-/gɑ/ Experiment

Cooper and his colleagues (1951) asked the question: What happens to listeners’ perception when the starting frequency of F2 is changed in small and systematic steps over a large range of frequencies? Using the pattern-playback machine, the scientists painted a series of two-formant patterns varying only in the F2 starting frequency.1 The result was a series, or continuum, of stimuli, like those shown in Figure 12–3.

Each of these two-formant stimuli is labeled with a number, ranging from −6 to +6. The stimulus labeled 0 had no F2 transition, and the two endpoint stimuli had the most extreme F2 transitions—that is, transitions covering the widest range of frequencies but moving in opposite directions. Stimulus −6 had the lowest starting frequency and, therefore, an extensive, rising F2 transition (where the transition starts at a lower frequency and moves to a higher one). As an example, the measurement point for the F2 starting frequency is shown for stimulus −6; F2 starting frequency is located at the beginning of the painted F2 for all stimuli. Stimulus +6 had the highest starting frequency and, therefore, a very extensive, falling F2 transition. The other features of these stimuli were constant across the entire continuum, including the rising F1 transition and the voice bar preceding it. The voice bar was a low frequency part of the painted patterns, seen as the flat bar before the F1 transitions; this ensured that all phonologically voiced stops were heard as voiced. The previous discussion of the three patterns in Figure 12–2 suggests that endpoint stimulus −6 was heard as /bɑ/ and endpoint stimulus +6 as /gɑ/. This series of stimuli tested the listeners’ response to systematic variations in F2 starting frequency between these two extremes. How did listeners respond to the other stimuli along the continuum?
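The structure of the continuum can be sketched in a few lines. The steady-state frequency and the step size below are hypothetical values chosen only for illustration, since the text does not state them; the point is simply that negative stimulus numbers give rising F2 transitions, stimulus 0 gives none, and positive numbers give falling transitions.

```python
# Hypothetical values, chosen only for illustration; the text does not give them.
F2_STEADY_STATE_HZ = 1200   # assumed F2 steady state
STEP_HZ = 120               # assumed spacing between adjacent stimuli

def f2_start(stimulus):
    """F2 starting frequency (Hz) for a stimulus numbered -6..+6."""
    return F2_STEADY_STATE_HZ + stimulus * STEP_HZ

def transition_direction(stimulus):
    """Direction of the F2 transition from its start to the steady state."""
    start = f2_start(stimulus)
    if start < F2_STEADY_STATE_HZ:
        return "rising"
    if start > F2_STEADY_STATE_HZ:
        return "falling"
    return "flat"

for s in range(-6, 7):
    print(s, f2_start(s), transition_direction(s))
```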

Cooper and colleagues (1951) first obtained identification data from listeners. In this experiment, listeners were asked to label the presented stimuli. Listeners had three labels from which to choose: /b/, /d/, or /g/. Each stimulus was presented several times to each member of a crew of listeners, and the results were plotted as the percentage of /b/, /d/, or /g/ responses across the series of stimuli. A typical plot of the results is shown in Figure 12–4.
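The conversion from raw labels to plotted percentages is a simple tally. The following is a minimal sketch with made-up responses for a single stimulus; the function name and the trial list are illustrative, not taken from the original study.

```python
from collections import Counter

def label_percentages(responses, labels=("b", "d", "g")):
    """Percentage of trials on which each label was chosen for one stimulus."""
    counts = Counter(responses)
    total = len(responses)
    return {label: 100.0 * counts[label] / total for label in labels}

# Hypothetical responses to one ambiguous stimulus, pooled over listeners.
trials = ["b", "b", "d", "b", "d", "d", "b", "d", "b", "d"]
print(label_percentages(trials))   # {'b': 50.0, 'd': 50.0, 'g': 0.0}
```

Repeating the tally for every stimulus number gives one point per label per stimulus, which is the form of the plot in Figure 12–4.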

In this plot, stimulus number (see Figure 12–3) is on the x-axis, which is also labeled “F2 starting frequency” because the different stimulus numbers indicate different starting frequencies. The percentage of /b/ (pink boxes), /d/ (white boxes), and /g/ (blue boxes) labels for each stimulus number is on the y-axis. A quick glance at these identification results shows that stimuli from −6 to −3 were heard almost exclusively as /b/, stimuli from −1 to +2 as /d/, and stimuli from +4 to +6 as /g/. A few stimuli, such as −2 and +3, were ambiguous, but −2 was only ambiguous for /b/ and /d/ responses (no /g/ responses) and +3 was only ambiguous for /d/ and /g/ responses (no /b/ responses). The boundary of the /d/ and /g/ categories is labeled in the figure; the boundary for the /b/-/d/ categories follows the same principle (50% /b/ responses, 50% /d/ responses).

The take-home message from this experiment was that relatively continuous variation of the physical stimulus (the starting frequency of the F2 transition) did not result in a continuous change in the perceptual response. Rather, place of articulation seemed to be perceived categorically, with a series of adjacent stimuli yielding one response, as in the case of stimuli −6 through −3 producing a percept of /b/, followed by a sudden change in response pattern to /d/ at the next step along the continuum (e.g., at stimulus −2). The same pattern was observed between /d/ and /g/. When the labeling functions for two adjacent phonemes (like /b/ and /d/, or /d/ and /g/) changed suddenly, they crossed at a point where 50% of the responses were for one label, and 50% for the adjacent label (see Figure 12–4). For this stimulus, the labeling appeared to be at chance levels: the listeners responded to the stimulus as if making a guess between the two adjacent labels, such as /b/ or /d/, or /d/ or /g/. This 50% point was called the phoneme boundary and was taken to indicate the stimulus defining the categorical distinction between two sounds.
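One way to locate a phoneme boundary from labeling data is to find where two adjacent labeling functions cross, interpolating linearly between neighboring stimuli. The sketch below takes that approach as an assumption (it is not described in the original experiments), and the percentages are invented to match the verbal description of the results, not read from Figure 12–4.

```python
def crossover(stimuli, pct_a, pct_b):
    """Stimulus value at which label A's percentage drops to meet label B's.

    Uses linear interpolation between adjacent stimuli; returns None if the
    two labeling functions never cross in that direction.
    """
    for i in range(len(stimuli) - 1):
        d0 = pct_a[i] - pct_b[i]
        d1 = pct_a[i + 1] - pct_b[i + 1]
        if d0 > 0 and d1 <= 0:
            frac = d0 / (d0 - d1)
            return stimuli[i] + frac * (stimuli[i + 1] - stimuli[i])
    return None

# Hypothetical labeling percentages for stimuli -6..+6.
stimuli = list(range(-6, 7))
pct_b = [100, 100, 100, 98, 50, 2, 0, 0, 0, 0, 0, 0, 0]
pct_d = [0, 0, 0, 2, 50, 98, 100, 100, 98, 50, 2, 0, 0]
pct_g = [0, 0, 0, 0, 0, 0, 0, 0, 2, 50, 98, 100, 100]

print(crossover(stimuli, pct_b, pct_d))   # -2.0
print(crossover(stimuli, pct_d, pct_g))   # 3.0
```

With these invented data the two boundaries fall at the ambiguous stimuli, where responses are split 50/50 between adjacent labels.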

Categorical Perception: General Considerations

The finding that stop place of articulation was perceived categorically, not continuously, has had a profound effect on speech perception research and theory. Categorical perception is discussed here more thoroughly before returning to the interpretation of the /bɑ/-/dɑ/-/gɑ/ experiment described above.

Categorical perception is demonstrated when continuous variation in a physical stimulus is perceived in a discontinuous (i.e., categorical) way. The study of psychological reactions to variations in physical stimuli is called psychophysics (see Chapter 14). Categorical perception is an example of a psychophysical phenomenon. A schematic illustration of categorical perception, and how it contrasts with the psychophysical phenomenon of continuous perception, is shown in Figure 12–5.

Both graphs in Figure 12–5 show hypothetical relationships between a continuously varying physical variable (x-axis) and a perceptual (psychological) response (y-axis). The perceptual response in this case is a number assigned by the perceiver to the magnitude or quality of each stimulus along the physically varying continuum. In the left-hand graph, each change of the physical variable from a lower to higher value elicits a corresponding change in the perceiver’s mind and, therefore, on the number scale. This results in the straight-line, 45-degree function relating the changes in the physical stimulus to changes on the psychological scale. The function in the left-hand graph is labeled “continuous perception” because a given increment along the physical scale always results in the same increment along the perceptual scale.

The right-hand graph shows a different relationship. Here the continuous variation of the physical stimulus is the same as in the left-hand graph, but the perceptual response is much different. The initial changes in the physical stimulus, beginning at the lowest values, produce no change in the psychological scale value. The listener treats the different stimuli as belonging to a single psychological event. When the stimulus reaches the value indicated by the first vertical dotted line, the perceptual scale “jumps” to a higher number. The new psychological number remains the same even as the physical stimulus continues to increase. The same thing happens at the second dotted line, where the perceptual scale value jumps to yet another higher value and remains at that value as the physical stimulus increases in magnitude.

The vertical dotted lines in the right-hand graph are labeled “Boundary 1” and “Boundary 2.” The lines indicate locations along the physical stimulus continuum where a small change in the value of the stimulus results in a sudden, large change in the perceptual response. Alternatively, as noted above, a boundary is the stimulus at which 50% of the responses are for one category and 50% for the other, as if the responses were based on chance (guessing). The function in the right-hand graph shows categorical perception because the two boundaries separate the psychological reaction to a continuously varying stimulus into three categories, labeled CAT1, CAT2, and CAT3. In categorical perception, the same increment along the physical stimulus produces very different psychological responses, depending on whether the increment falls within a category or straddles a category boundary. Within a category, a small change along the physical continuum leads to little or no difference in the psychological response. The same change in a physical stimulus across a category boundary results in a major change in the psychological response.
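The contrast between the two graphs in Figure 12–5 can be expressed as two idealized functions. In this sketch the boundary positions, the units of the physical scale, and the function names are all hypothetical; the code only illustrates that an identical physical increment yields a constant perceptual change in the continuous case but either no change or a jump in the categorical case.

```python
BOUNDARIES = (3.0, 7.0)   # hypothetical positions of Boundary 1 and Boundary 2

def continuous_response(x):
    """Idealized continuous perception: the 45-degree line."""
    return x

def categorical_response(x, boundaries=BOUNDARIES):
    """Idealized categorical perception: 1, 2, or 3 for CAT1, CAT2, CAT3."""
    return 1 + sum(1 for b in boundaries if x >= b)

def change(response, x, step=1.0):
    """Perceptual change produced by a fixed physical increment starting at x."""
    return response(x + step) - response(x)

print(change(continuous_response, 1.0))    # 1.0 (within CAT1 range)
print(change(continuous_response, 2.5))    # 1.0 (straddles Boundary 1)
print(change(categorical_response, 1.0))   # 0   (within CAT1)
print(change(categorical_response, 2.5))   # 1   (straddles Boundary 1)
```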

This simple description of the difference between continuous and categorical perception applies directly to the pattern-playback results for stop consonant place of production, shown in Figure 12–4. The F2 starting frequency was changed continuously from low to high values, but listeners heard only categories, not smoothly changing phonetic events. The categorical labeling functions shown in Figure 12–4 are consistent with the schematic illustration of categorical perception shown in Figure 12–5. Small changes in the F2 starting frequency, beginning from the lowest value, all yielded /b/ responses. At a certain point along the physical continuum of F2 starting frequencies the same change resulted in a sudden shift to /d/ responses. The perception of place of articulation for stop consonants appeared to be categorical even though the starting frequency of F2 was changed continuously.

Labeling Versus Discrimination

One more experiment was required to verify the categorical perception of stop consonant place of articulation. The categorical perception functions shown in Figure 12–4 were obtained in an experiment in which listeners heard a stimulus and labeled it as either /b/, /d/, or /g/. The resulting categorical perception functions may have reflected nothing more than the listeners’ restriction to just three response categories. For example, listeners were not permitted to respond, “This stimulus sounds as if it is midway between a /b/ and /d/ (or between a /d/ and /g/),” even though it was possible some stimuli sounded this way. If such responses were available to the listeners, the functions may not have looked as categorical as those shown in Figure 12–4.

To address this potential problem, the labeling experiment was followed by a discrimination experiment. As indicated above in the discussion of Figure 12–5, in a true categorical perception function, an increment of fixed magnitude in the physical stimulus may result in either no psychological change or a large psychological change, depending on where the increment is located along the entire physical continuum. A fixed physical increment between two stimuli, located within a category determined by a labeling experiment, should produce little or no psychological change. The same increment located across a category boundary should produce a large psychological change. In a discrimination experiment, a true categorical perception function is determined when listeners cannot discriminate two different stimuli chosen from within a category, but easily discriminate two stimuli chosen across a category boundary (one chosen from one category, the other from the adjacent category). The physical difference between the two stimuli is the same in both cases, but the psychological reaction to the difference between the stimuli is radically different.

Following the labeling experiments, Cooper and colleagues (1951) performed these discrimination experiments. The discrimination experiments produced the expected results. When listeners were asked if two stimuli (presented one after the other) were the same or different, they said “same” for stimuli chosen within a category, and “different” when stimuli were chosen from adjacent categories (i.e., when one stimulus was immediately to the left of a category boundary determined in the labeling experiments, and the other to the right). This result was obtained even though the physical difference between the two judged stimuli was the same in both cases. The categorical labeling functions (see Figure 12–4) were confirmed by the discrimination experiment.
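The logic linking labeling to discrimination can be made concrete under a deliberately strict assumption: a listener says “different” only when the two stimuli would receive different labels, and labels are assigned independently on each presentation. This is an illustrative simplification, not the analysis used in the original experiments, and the labeling proportions below are invented.

```python
def p_different(labels_x, labels_y):
    """Probability that two stimuli receive different labels.

    Each argument maps a label to the proportion of trials on which that
    label was chosen. Labels are assumed to be assigned independently.
    """
    p_same = sum(p * labels_y.get(label, 0.0) for label, p in labels_x.items())
    return 1.0 - p_same

# Hypothetical labeling proportions (not data from the experiments).
stim_minus5 = {"b": 1.00, "d": 0.00}
stim_minus3 = {"b": 0.98, "d": 0.02}
stim_minus1 = {"b": 0.02, "d": 0.98}

within = p_different(stim_minus5, stim_minus3)   # both inside the /b/ category
across = p_different(stim_minus3, stim_minus1)   # straddles the /b/-/d/ boundary
print(round(within, 4), round(across, 4))        # 0.02 0.9608
```

Both pairs are two steps apart on the continuum, yet under this assumption the predicted rate of “different” responses is near zero within the category and near one across the boundary.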

Categorical Perception: So What?

What was important about the demonstration of categorical perception for place of articulation? Liberman, Cooper, Shankweiler, and Studdert-Kennedy (1967), in their famous paper “Perception of the Speech Code,” pointed to categorical perception as a cornerstone of the motor theory of speech perception. Listeners do not hear the continuous changes in F2 starting frequency, at least until a category boundary is reached, because they cannot produce continuous changes in place of articulation. Consideration of the different places of articulation for English stops shows why Liberman and his colleagues reasoned this way. How do you produce a stop between a bilabial /b/ and a lingua-alveolar /d/? Or between a /d/ and dorsal /g/? The places of articulation for stops are essentially categorical, allowing no “in-between” articulatory placements.2

The motor theory was built on the idea that speech perception was constrained by speech production. In this view, categorical production of a speech feature, such as place of articulation for stops, limits speech perception to the same categories. Detection of acoustic differences within categories is therefore not possible. Liberman et al.’s (1967) focus on the role of speech production in speech perception, however, extended beyond the demonstration of categorical perception. Recall from Chapter 11 the discussion of pattern-playback experiments in which very different F2 transition patterns cued the perception of a single stop consonant (/d/, in the case covered in Chapter 11; see Figure 11–34). Liberman et al. reviewed several experiments in which a great deal of acoustic variability, primarily due to varying phonetic context, was associated with perception of a single stop consonant. In Chapter 11, this was described as the “no acoustic invariance” problem. Liberman et al. regarded the lack of acoustic invariance for a given stop consonant as a problem for a theory of speech perception in which listeners based their phonetic decisions on information in the acoustic signal. Instead, the constant factor in speech perception, at least for stop consonants, was thought to be the articulatory characteristics of a stop consonant. A stop such as /d/ may have varying acoustic characteristics—especially as seen in the F2 transition, and possibly also in the burst—depending on the identity of a preceding or following vowel, but the lingua-alveolar place of articulation remains constant across all phonetic contexts. For Liberman et al., it made more sense for listeners to base their phonetic decisions on these constant articulatory characteristics, rather than the highly variable speech acoustic signal.

It is one thing to claim that speech is perceived by reference to articulation; it is another to say exactly how this is done. Liberman et al. (1967) argued for a species-specific mechanism in the brain of humans—a specialized and dedicated module for the perception of speech. An important component of this claim was the link between speech production and perception, specifically the “match” between the capabilities of the speech production and speech perception mechanisms. The match was proposed as an evolutionary, encoded form of communication. The encoding is on the speech production side of communication; the decoding is provided by the special perceptual mechanism in the brain of humans. For Liberman et al. the tight link between speech perception and production was part of the evolutionary history of Homo sapiens.

Mirror, Mirror, in the Brain

When the motor theory was first proposed, the brain mechanisms for the (hypothesized) special module were unknown. Experiments were done to show that the likely location of the module was in the left hemisphere (Studdert-Kennedy & Shankweiler, 1970), but these were perception experiments and the inference to actual brain mechanisms involved a long interpretative leap. Fast forward to the twenty-first century and the use of imaging and stimulation techniques to uncover brain function for complex behavior (like speech), and we have the concept of “mirror neurons.” These are neurons that appear to be active during both the production and perception of action. When someone produces a gesture (such as a speech gesture), the perceiver of that gesture has greater activity in the neurons that are involved in producing the gesture. The motor neurons are said to “mirror” the perception of the gesture. Perhaps this is the brain basis of the species-specific speech module proposed in motor theory (see Watkins & Paus, 2004).

There is more to the motor theory. Specifics are provided by Liberman et al. (1967) on how speech production is encoded in the acoustic signal emerging from a speaker’s mouth, and how this signal is decoded by the human brain to recover articulatory behaviors. The original motor theory was later revised in an important way as described by Liberman and Mattingly (1985). In the original motor theory, the focus was on the encoding and decoding of place of articulation for stops (and by extension, to other obstruents and possibly nasals as well). The revised motor theory changed the articulatory invariant to gestures, rather than positions. In the revised motor theory articulatory gestures, such as the tongue gesture from a stop to a following vowel, are encoded by production and then decoded by the species-specific perceptual module.

For the purposes of this chapter, details of the differences between the original and revised motor theories are not critical. The phenomenon of trading relations in phonetic identification, and how it fits into the revised motor theory, is taken up later in the chapter. Both the original and revised motor theories share the idea of a special speech module, and both have faced strong scientific challenges. What is critical for both versions of the theory are two general claims: (a) speech perception is a species-specific human endowment; and (b) the speech acoustic signal associated with a given sound is far too variable to be useful for speech perception, but the underlying articulatory behavior is not, hence the claim that speech is perceived by reference to articulation.

Speech Perception Is Species Specific

The ability to speak and form millions of novel sentences is exclusive to humans. It therefore makes sense that many scientists regard speech perception, viewed as a capability “matched” to speech production, as exclusively human. The notion of co-evolved mechanisms for production and perception of vocalizations, and especially of dedicated perceptual mechanisms “tuned” to species-specific vocalizations, is not limited to humans, however. There is evidence in monkeys, bats, and birds (and other animals) of perceptual mechanisms matched to the specific vocalizations produced by each of these animals (Andoni, Li, & Pollak, 2007; Davies, Madden, & Butchart, 2004; Miller & Jusczyk, 1989). The possible existence of such a match in humans is consistent, in principle, with evolutionary principles derived from the study of vocal communication in other animals.

“In principle” evaluations of a theory are fine, but they do not go far enough. A theory should be testable, either by natural observation or experimentation. Karl Popper (2002a, 2002b), a famous philosopher of science, argued that a theory can only be considered “scientific” if it can be disproved by a proper experiment. According to Popper, these experiment-based “falsifications” of a theory are the basis of scientific progress. Popper first published these ideas in the 1935 German edition of The Logic of Scientific Discovery. The book was published in English translation in 1959. Popper’s ideas have had a profound influence on modern science in general, and specifically on the motor theory of speech perception.

The Motor Theory of Speech Perception: Proofs and Falsifications

Can the motor theory be falsified? The scope of the present chapter does not allow a detailed answer to this question, but there have been attempts to falsify the claims of the motor theory. Some of these experiments are described below. For a detailed and challenging article on the issue of falsifying the motor theory of speech perception, see Galantucci, Fowler, and Turvey (2006).

Categorical Perception of Stop Place of Articulation Shows the “Match” to Speech Production

The motor theory was criticized for failing to explain why certain individuals who could not speak (as in some cases of cerebral palsy, or other neurological diseases) were able to perceive speech in a normal way. This complaint was misguided, however, because the motor theorists never argued for the ability to produce speech as a requirement for normal speech perception abilities. To the contrary, the species-specific module for speech perception was thought to be innate (Liberman & Mattingly, 1985)—a property of the human brain at birth. The demonstration of categorical perception in infants as young as 1 month of age was taken as evidence for this innate mechanism, and hence as strong support for the motor theory of speech perception. The infant categorical perception functions were very much like those obtained from adult listeners, even though infants do not produce speech. The categorical perception functions were obtained in infants by taking advantage of something infants do quite well, which is to suck for long periods of time. Over time, as babies suck, the strength of the suck varies with the degree of novelty in their environment. If the environment remains the same, sucking becomes less intense. The introduction of a novel stimulus (e.g., something seen, heard, smelled) results in a sudden increase in suck strength, frequency, or both. Early studies of infant speech perception used sucking behavior to assess babies’ reactions to speech stimuli within and across speech sound categories. Categorical perception functions, closely resembling those obtained from adults, were obtained in infants with the sucking paradigm. An excellent review of early infant speech perception research and methods is found in Eimas, Miller, and Jusczyk (1990).

What About Talking Birds?

The motor theory posits a perceptual mechanism that is specific to humans. The link between articulatory and speech perception capabilities is special because humans are the only species who produce speech sounds for communication. But wait. What about talking birds, such as mynahs, crows, budgerigars (small parrots, often called parakeets), and African Greys? Talking birds produce speech using a very different apparatus from humans: they have no lips, and they do not produce a sound source at the larynx but rather have a sound-producing mechanism deep in their chests called a syrinx. Yet the major question is not, “Can these birds articulate?” (because they obviously produce intelligible speech, even if mimicked), but “When they articulate, can they make voluntary adjustments to produce different speech sounds?” If the answer is “Yes, they make such adjustments,” then the species-specific claim of motor theory runs into some difficulty. Patterson and Pepperberg (1998) made just this claim for an African Grey parrot named Alex (1977–2007, https://en.wikipedia.org/wiki/Alex_(parrot)). An opposing view, that the speech produced by talking birds is nothing more than non-inventive mimicry, was expressed by Lieberman (1996).

An apparent falsification of the motor theory was, ironically, a logical outgrowth of the findings in infants. If nonspeaking infants had categorical perception of sound contrasts, perhaps the same would be true of animals. Perhaps the reason infants showed the effect had little to do with a species-specific speech perception mechanism, but instead reflected some general property of mammalian auditory systems. In fact, work by Kuhl and her colleagues (Kuhl, 1986; Kuhl & Miller, 1975; Kuhl & Padden, 1983) and others demonstrated categorical perception for voice onset time (VOT) and stop place of articulation in chinchillas and monkeys, respectively. If categorical perception is the result of a special linkage between human speech production and perception, as claimed by Liberman et al. (1967), the finding of categorical speech perception in animals is a falsification of the linkage specifically, and the motor theory in general. A kinder interpretation of the animal data is that they raise questions about the motor theory but do not falsify the theory in an absolute way. Miller and Jusczyk (1989) summarized this position in this way: “In principle, there are many ways to arrive at the same classification of a set of objects. Hence, the fact that the animals can achieve the same classification does not prove that they use the same means to do so as humans” (pp. 124–125, emphasis added). The data may be the same in adult humans, human infants, and animals, but not necessarily because of a common mechanism. This is an example of the same findings (categorical perception of speech signals) supporting opposing theoretical views.

Duplex Perception

The possibility of a special speech-perception module in the brains of humans does not, of course, eliminate the need for general auditory function. A host of everyday auditory perceptions requires analysis by mechanisms external to the speech module. Presumably, a speech signal “automatically” engages the speech module; the option is not available to “turn it off” for short periods of time, even if this would occasionally be a nice idea. Other auditory signals engage “general” (non-modular) auditory mechanisms for analysis and perception. The idea of a separation3 between mechanisms for speech perception versus general auditory perception was explored experimentally to prove the existence of the speech module.

The schematic spectrograms in Figure 12–6 (adapted from Whalen & Liberman, 1987) illustrate the physical basis of duplex perception, the phenomenon in which the speech module and general auditory mechanisms seem to be activated simultaneously by one signal (Mann & Liberman, 1983; Whalen & Liberman, 1987). The upper graph shows a schematic spectrogram, with fixed patterns for F1 and F2, and two different transitions for F3. The steady-state formant frequencies are appropriate for the vowel /ɑ/, and the F1 and F2 transitions convey the impression of a stop consonant preceding the vowel. Synthesis of this F1-F2-F3 pattern with a rising F3 transition causes listeners to hear /gɑ/. A falling F3 transition cues the perception of /dɑ/. The different perceptual effects of the rising versus falling F3 transition are consistent with naturally produced /gɑ/ and /dɑ/ syllables.

If the F3 transition portion (either the rising or falling one) is edited out from the schematic signal in the upper part of the figure and played to listeners, the brief signal (~50 ms in duration) sounds something like a bird chirp or whistle glide. In the case of the transition for /g/, the pitch of this “chirp” rises quickly, and for /d/ it falls quickly. The isolated transitions are shown in the lower right-hand graph in Figure 12–6. Regardless of exactly how people hear these isolated transitions—as “chirps,” quick frequency glides (glissandi, in musical terms), or outer space noises—they are not heard as phonetic events.
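For readers who want to hear what an isolated frequency glide of this kind is like, the signal can be approximated with a few lines of code. The following is a minimal sketch, not the synthesis procedure used in the original experiments. The sampling rate and the start and end frequencies are assumptions chosen for illustration; only the ~50 ms duration and the rising versus falling direction come from the text.

```python
import math

SAMPLE_RATE_HZ = 22_000  # assumed sampling rate, not taken from the text
DURATION_S = 0.050       # ~50 ms, the duration given in the text


def frequency_glide(start_hz, end_hz, duration_s=DURATION_S,
                    sample_rate_hz=SAMPLE_RATE_HZ):
    """Return samples of a sinusoid whose frequency moves linearly
    from start_hz to end_hz over duration_s seconds."""
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate_hz
        # Phase is the time integral of the instantaneous frequency.
        phase = 2 * math.pi * (
            start_hz * t + (end_hz - start_hz) * t * t / (2 * duration_s)
        )
        samples.append(math.sin(phase))
    return samples


# Hypothetical frequencies in Hz; the text does not report the actual F3 values.
rising_glide = frequency_glide(2000, 2600)   # direction of the /g/ transition
falling_glide = frequency_glide(2600, 2000)  # direction of the /d/ transition
print(len(rising_glide), "samples per glide")
```

Written to a sound file and played, each list should sound like a brief whistle glide rather than anything phonetic, which is the point of the isolated-transition demonstration.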

This situation suggests something of a perceptual mystery. Listeners hear the three-formant pattern at the top of Figure 12–6 as either /g/ or /d/, depending on whether the F3 transition is rising (/g/) or falling (/d/). But when that brief, apparently critical F3 transition is isolated from the spectrographic pattern and played to listeners, they hear something with absolutely no phonetic quality.

What do people hear when presented with the spectrographic pattern minus an F3 transition? In Figure 12–6, this pattern is referred to as the “base,” which listeners hear as a not-very-clear /d/, different from the unambiguous /d/ heard when the falling F3 transition is in place.

The phenomenon of duplex perception, and its relationship to the concept of a special speech perception module in the brains of humans, can now be explained. Consider the base and either one of the isolated transitions shown in Figure 12–6 as two separate signals and imagine one of these signals (the base) delivered to one ear, and the other (an isolated transition) delivered to the other ear. This experimental arrangement is shown in the cartoon of Figure 12–7, where the “base” is sent to a listener’s right ear and one of the “isolated transitions”—the one appropriate to either /g/ or /d/—is sent to the left ear. The isolated F3 transition is delivered to the left ear with proper timing relative to the F1 and F2 transitions in the base, meaning the transition is sequenced properly in time relative to the base (as in the top graph of Figure 12–6, showing the “complete” patterns for either /d/ or /g/).

What did listeners hear when the experiment depicted in Figure 12–7 was performed? Mann and Liberman (1983), among others, showed that listeners heard a “good” /dɑ/ or /gɑ/ (depending on which F3 transition was played) plus a chirp. The simultaneous perception of two events from the same signal, one speech and one nonspeech, suggested the term “duplex perception.” Duplex perception seemed to show the human listener operating simultaneously in the special speech mode and in the general auditory mode. The perception of the “good” /dɑ/ or /gɑ/ was the result of the “base” and the “isolated transition” combining somewhere in the nervous system, automatically engaging the speech mode of perception. At the same time, the isolated F3 transition was processed as a chirp by general auditory mechanisms. The F3 transition did double duty, engaging two different kinds of hearing mechanism at the same time. One of those mechanisms, according to Mann and Liberman (1983) and Whalen and Liberman (1987), must be the specialized speech perception module proposed as the centerpiece of the motor theory. As Whalen (1997) pointed out, the perception was not “triplex,” which would have included three percepts: the ambiguous syllable heard when the F3 transition is missing (the percept when only the base was presented to listeners), the clear /dɑ/ or /gɑ/ (depending on which F3 transition was used), and the chirp. Listeners heard only two events, the clear syllable and the chirp. The ambiguous phonetic percept elicited by the base was gone.

This experimental finding seemed to be a strong endorsement of a species-specific module for perceiving speech. What other explanation could account for the same signal (the F3 transition) evoking two simultaneous perceptions, one clearly phonetic and consistent with previous studies on the role of the transition in cuing /g/ or /d/, the other clearly non-phonetic and consistent with the direction of the frequency glide (rising versus falling)? Duplex perception seemed like ironclad evidence for a special mode for perceiving speech, distinct from general auditory processes. However, an experiment reported by Fowler and Rosenblum (1991) cast doubt on this interpretation.

Fowler and Rosenblum (1991) recorded the acoustic signal produced by the closing of a metal door. They computed a spectrum of this acoustic event, like the one shown in the upper graph of Figure 12–8. Then they separated the spectrum into two parts, a lower frequency part (from 0–3.0 kHz) and a higher frequency part (from 3.0–11.0 kHz; see lower left and lower right graphs in Figure 12–8). This separation was accomplished by filtering the original signal (0–11.0 kHz) to obtain the 0 to 3.0 kHz and 3.0 to 11.0 kHz parts. Fowler and Rosenblum presented these three signals (the original, “full” signal; the 0–3.0 kHz part; and the 3.0–11.0 kHz part) to listeners for separate identification. Listeners reported hearing a metal door closing or some “hard collision” for the full signal (upper graph of Figure 12–8), a duller, wooden door-closing sound for the 0 to 3.0 kHz signal (lower left graph, Figure 12–8), and a shaking can of rice, a tambourine, or jangling keys for the 3.0 to 11.0 kHz signal (lower right graph, Figure 12–8).
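The separation of a signal into a low-frequency part and a high-frequency part can be illustrated with a short program. The sketch below is only an illustration of the idea: it splits the spectrum at a sharp 3.0 kHz cutoff using a discrete Fourier transform, which is not necessarily the kind of filter Fowler and Rosenblum used. The sampling rate, the signal length, and the two-tone test signal (standing in for the door recording) are all assumptions.

```python
import cmath
import math

SAMPLE_RATE_HZ = 22_000  # assumed; gives an upper frequency limit of 11 kHz
CUTOFF_HZ = 3_000        # boundary between the two parts, as in the text


def dft(signal):
    """Naive discrete Fourier transform (adequate for short signals)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def split_bands(signal, sample_rate_hz, cutoff_hz):
    """Split a signal into components below and at-or-above cutoff_hz."""
    n = len(signal)
    low, high = [], []
    for k, value in enumerate(dft(signal)):
        # Bins k and n - k mirror each other, so both map to one frequency.
        freq_hz = min(k, n - k) * sample_rate_hz / n
        if freq_hz < cutoff_hz:
            low.append(value)
            high.append(0j)
        else:
            low.append(0j)
            high.append(value)
    return idft(low), idft(high)


# Hypothetical test signal: a 1 kHz tone plus a weaker 6 kHz tone.
N_SAMPLES = 220  # 10 ms at 22 kHz, so the analysis bins fall every 100 Hz
low_tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE_HZ)
            for t in range(N_SAMPLES)]
high_tone = [0.5 * math.sin(2 * math.pi * 6000 * t / SAMPLE_RATE_HZ)
             for t in range(N_SAMPLES)]
full_signal = [a + b for a, b in zip(low_tone, high_tone)]

low_part, high_part = split_bands(full_signal, SAMPLE_RATE_HZ, CUTOFF_HZ)
```

In this sketch the two parts sum back to the full signal, which corresponds to the idea that the low-frequency and high-frequency signals together make up the original door-closing sound.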

Fowler and Rosenblum (1991) took advantage of a variation of the duplex perception finding. Whalen and Liberman (1987) discovered that a duplex perception was obtainable when the base and isolated F3 transition were delivered to the same ear, provided the isolated F3 transition was increased in intensity relative to the base. When the “chirp” intensity was relatively low in comparison with the “base,” listeners heard a good /dɑ/ or /gɑ/ depending on which F3 transition was used. As the F3 “chirp” was increased in intensity, a threshold was reached at which listeners heard both a good /dɑ/ or /gɑ/ plus a “chirp.” Listeners perceived the signal as duplex, as in the earlier experiments in which the “base” and “chirp” were in opposite ears. Fowler and Rosenblum repeated this same-ear experiment, but using signals lacking phonetic content, in this case the slamming metal door signal split into a base and chirp, as described above. Relatively low “chirp” intensities (the 3.0–11.0 kHz part of the signal) in combination with the “base” (the 0–3.0 kHz signal) produced a percept of a slamming metal door, consistent with the percept elicited by the original, intact signal (top spectrum in Figure 12–8). As the “chirp” intensity was raised, a threshold was reached at which listeners heard the slamming metal door plus the shaking can of rice/tambourine/jangling keys. Fowler and Rosenblum thus evoked a duplex percept exactly parallel to the one described above for /dɑ/ and /gɑ/, except in this case for nonspeech sounds.

If the original duplex perception findings (Liberman & Mattingly, 1985; Mann & Liberman, 1983; Rand, 1974; Whalen & Liberman, 1987) provided compelling evidence for a special speech perception module in humans, the demonstration of duplex perception for a slamming metal door is strong evidence against the idea of the speech module. As Fowler and Rosenblum (1991) pointed out, if the phonetic part of duplex perception (i.e., the “good,” unambiguous signal) was regarded as the output of a special speech module, the perception of a slamming metal door plus the shaking can of rice when the 3.0 to 11.0 kHz signal was raised to a sufficient intensity was evidence of a special human module for the perception of door-slamming, and metal door slamming at that. It is hardly absurd to imagine a human biological endowment for perceiving speech signals as an evolutionary adaptation, but the existence of a brain module dedicated to the perception of a slamming metal door is a non-starter. Perhaps there is a special biological endowment for perceiving speech, but duplex perception is not the critical test for its existence.

Acoustic Invariance

The lack of acoustic invariance for speech sounds was an important catalyst for the development of the motor theory of speech perception. As discussed in Chapter 11, Blumstein and Stevens (1979) performed an acoustic analysis of stop burst acoustics that led them to reject this central claim of the motor theorists. Blumstein and Stevens used the stop burst spectrum to classify correctly 85% of word-initial stop consonants in a variety of vowel contexts. They regarded this finding as a falsification of the motor theorists’ claim that there was too much context-conditioned acoustic variability to allow listeners to establish a consistent and reliable link between speech acoustic characteristics and phonetic categories. For the motor theorist, consistency associated with speech sound categories was found in the underlying articulatory behavior for specific sounds, even when the context of a sound was changed. The “underlying articulatory gesture” was the neural code for generating the gesture. This code was assumed to be “fixed” regardless of its phonetic context. Whatever changes occurred to the actual articulatory gestures—the observed, collective movements of the lips, tongue, mandible, and so forth—were not relevant to the motor theory. In the motor theory, perception of speech depended on these more abstract neural commands, higher up in the process, that were not coded for phonetic context. The phonetic context effects (see Chapter 11, Figure 11–33) were stripped away by the special speech module, leaving the abstract commands for just the phoneme (original motor theory) and/or gesture (revised motor theory). These invariant commands were assumed to be part of the special speech module.

Blumstein and Stevens’ (1979) apparent falsification of the lack of acoustic invariance for sound categories is a bit more involved than a simple demonstration of consistency between a selected acoustic measure (such as the shape of a burst spectrum) and a particular sound. Liberman and Mattingly (1985), in a very fine review of why they believed the acoustic signal was not sufficiently consistent to establish and maintain speech sound categories in perception (where “categories” = “phonemes”), identified complications with so-called auditory theories of speech perception. Auditory theories claim that information in the speech acoustic signal is sufficient, and sufficiently consistent, to support speech perception. These theories regard the auditory mechanisms for speech perception to be the same as mechanisms for the perception of environmental sounds, music, or any acoustic signal. One specific auditory perspective on speech perception (Diehl, Lotto, & Holt, 2004; Kingston & Diehl, 1994) claims that speakers control their speech acoustic output to produce speech signals well matched to auditory processing capabilities. Another auditory perspective, described in Chapter 11, is based on locus equations which represent formant transitions that provide reliable information on place of articulation for stops and fricatives.

Why were Liberman and Mattingly (1985) so adamant in rejecting auditory theories of speech perception? First, Liberman and Mattingly pointed to what they termed “extraphonetic” factors that cause variation in the acoustic characteristics of speech sounds. These factors include (among others) speaking rate and speaker sex and age. The speaker sex/age issue is particularly interesting because the same vowel has widely varying formant frequencies depending on the size of a speaker’s vocal tract (Chapter 11). An auditory theory of speech perception either requires listeners to learn and store all these different formant patterns or must employ some sort of cognitive process to place all formant patterns on a single, “master” scale. This issue, of how one hears the same vowel (or consonant) when so many different-sized vocal tracts produce it with different formant frequencies, is called the “speaker (or talker) normalization” problem (interesting papers on speaker normalization are found in Johnson & Mullenix, 1997; see also Adank, Smits, & van Hout, 2004; and see discussion below). The motor theory finesses this problem by arguing that the perception of different formant transition patterns is mediated by a special mechanism that extracts intended articulatory gestures and “outputs” these gestures as the percepts. For example, the motor theory assumes that the intended gestures (the neural code for the gestures) for the vowel in the word “bad” are roughly equivalent for men, women, and children, even if the outputs of their different-sized vocal tracts are different. The special speech perception module registers the same intended gesture for all three speakers, and hence the same vowel perception (or the same consonant perception). The motor theory makes the speaker normalization problem go away.

A second reason to reject auditory theories, according to Liberman and Mattingly (1985), is the interesting case of trading relations in the acoustic cues for a given sound category. For any given sound, there are at least several different acoustic cues that can contribute to the proper identification of the sound. As Liberman and Mattingly pointed out, none of these individual values are necessarily critical to the proper identification of a sound segment, but the collection of the several values may be. Among these several cues, the acoustic value of one can be “offset” by the acoustic value of another to yield the same phonetic percept. For example, Figure 12–9 shows spectrograms of a single speaker’s production of the words “say” and “stay.” These two words can be described as a minimal-pair opposition defined by the presence or absence of the stop consonant /t/. In “stay” (but not “say”) there is the obvious silent closure interval of approximately 60 to 90 ms, but the “say”-“stay” opposition also involves a subtle difference in the starting frequency of the F1 transition for /eɪ/. Figure 12–9 shows the F1 starting frequency in “stay” to be somewhat lower than the starting frequency in “say” (compare frequencies labeled “F1 onset”). The lower starting frequency in “stay” is consistent with theoretical and laboratory findings of F1 transitions pointing toward the spectrographic baseline—that is, 0 Hz—at the boundary of a stop consonant and vowel (Fant, 1960). The difference is subtle but measurable.

In a frequently cited experiment, Best, Morrongiello, and Robson (1981) performed a clever manipulation of these two cues to the difference between “say” and “stay.” The two cues—the closure interval and the lower F1 starting frequency following the stop closure—were used to demonstrate the “trading relations” phenomenon. Figure 12–10 shows one version of the stimuli used by Best and her colleagues. The gray, stippled interval represents the voiceless fricative /s/, the narrow rectangles the closure intervals for /t/, and the two solid lines the F1-F2 trajectories for /eɪ/. Best et al. synthesized these sequences in two ways, one with an F1 starting frequency of 230 Hz (Figure 12–10, left side), the other with an F1 starting frequency of 430 Hz (Figure 12–10, right side). Best et al. changed the duration of the stop closure interval between 0 (no closure interval) and 136 ms, sometimes with the lower F1 starting frequency, and sometimes with the higher F1 starting frequency. When the pattern was synthesized with one of the longer closure intervals, close to 136 ms, listeners clearly heard the sequence as “stay.” When the pattern was synthesized with a very short or nonexistent closure interval, “say” was heard. None of this is surprising and is consistent with the spectrographic patterns shown in Figure 12–9.

The interesting findings occurred when the length of the closure interval between the /s/ and /eɪ/ was rather short (~30–50 ms) and resulted in roughly equal “say” and “stay” responses. For these stimuli, the presence or absence of a /t/ closure was ambiguous. Best et al. (1981) determined that when the F1 starting frequency was the higher one (430 Hz in Figure 12–10), a longer closure interval was required for listeners to hear “stay.” When the F1 starting frequency was the lower one (230 Hz in Figure 12–10), a shorter closure interval allowed the listeners to hear “stay.” In other words, the two cues to the presence of a /t/ between the /s/ and /eɪ/ seemed to “trade off” against each other to produce the same percept—a clear /t/ between the fricative and the following vowel. “Trading relations” is the term used for any set of speech cues that can be manipulated in opposite directions to yield a constant phonetic percept.
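The logic of a trading relation can be made concrete with a toy model. The sketch below assumes that the probability of a “stay” response follows an S-shaped (logistic) function of closure duration, and that the F1 starting frequency shifts the location of that function. The boundary values and the slope are invented for illustration and fall within the ambiguous range mentioned above; they are not the values reported by Best et al. (1981).

```python
import math

# Hypothetical closure durations (ms) at which "say" and "stay" are equally
# likely, keyed by F1 starting frequency (Hz). Invented for illustration.
BOUNDARY_MS = {230: 35.0, 430: 45.0}
SLOPE_PER_MS = 0.3  # hypothetical steepness of the labeling function


def prob_stay(closure_ms, f1_onset_hz):
    """Toy probability of a "stay" response for one stimulus."""
    boundary_ms = BOUNDARY_MS[f1_onset_hz]
    return 1.0 / (1.0 + math.exp(-SLOPE_PER_MS * (closure_ms - boundary_ms)))


# Closure durations spanning the 0 to 136 ms range described in the text.
for closure_ms in (0, 20, 40, 60, 136):
    low_f1 = prob_stay(closure_ms, 230)
    high_f1 = prob_stay(closure_ms, 430)
    print(f"{closure_ms:>3} ms  F1=230 Hz: {low_f1:.2f}  F1=430 Hz: {high_f1:.2f}")
```

At an ambiguous closure duration such as 40 ms, the toy model gives mostly “stay” responses with the lower F1 starting frequency and mostly “say” responses with the higher one, whereas at the extremes of the continuum the F1 starting frequency makes almost no difference.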

Speech Synthesis and Speech Perception

The pattern-playback machine allowed speech scientists to synthesize speech signals, but the quality of these signals was—let’s be gracious—not particularly good. Developments in computer technology, knowledge of acoustic phonetics, and the sophistication of software codes have greatly improved speech synthesis. Current synthetic speech signals are so good they sometimes cannot be distinguished from natural speech. These developments have allowed speech perception researchers to make very fine adjustments in signals to learn about speech perception while avoiding the problem of the fuzzy speech signals produced by the pattern-playback machine. In 1987, Dennis Klatt from the Massachusetts Institute of Technology published a wonderful history of speech synthesis and provided audio examples of synthetic speech signals from 1939 to 1987 (Klatt, 1987). You can hear these examples at http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html

For Liberman and Mattingly (1985), the trading relations phenomenon proved the point about the inability to connect a specific acoustic characteristic with a specific sound category. The potential, multiple acoustic cues to a given phonetic category were simply too numerous to be used by a listener to develop and maintain the category identification. In the case of “say”-“stay,” the lower versus higher F1 starting frequency, or the precise duration of the closure interval, was not by itself sufficient to serve as an acoustic constant of a phonetic category. The collection of these several cues, however, reflected the underlying gesture for the sound category. Small variations in one cue could be compensated for by variations in a different cue, but in the end the sum of these various cues yielded a single percept. In support of Liberman and Mattingly’s theoretical cause, trading relations have been demonstrated for many different phonetic distinctions. It is not just a “say”-“stay” phenomenon (see Repp [1982] and Repp and Liberman [1987] for reviews of trading relations in phonetic perception).

The trading relations experiment takes advantage of the categorical perception method by varying a stimulus continuum (in the case above, between “say” and “stay”) and finding a specific stimulus along the continuum where 50% of the responses are “say” and 50% “stay.” This is the category boundary for the “say”-“stay” contrast, and modifications of parts of the signal cause changes in the boundary as described above. Stimulus continua for many phonetic contrasts have been used in speech perception research, including /s/-/ʃ/, /s/-/z/, and /r/-/w/, among others. The stimuli are synthesized to construct a signal continuum with endpoints that are excellent exemplars of the two sounds (e.g., clear /s/, clear /ʃ/ for an /s/-/ʃ/ continuum). As the stimuli are adjusted to be less like an endpoint (less /s/-like, less /ʃ/-like), the respective signals are ultimately adjusted to produce a middle stimulus that is ambiguous, as if halfway between the endpoints. For this stimulus, 50% of the responses are one of the endpoint percepts, and 50% the other endpoint percept. The 50-50 percept defines the phoneme boundary.
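Locating the 50-50 point from labeling data is a simple calculation. The sketch below finds the boundary by linear interpolation between the two neighboring stimuli whose labeling proportions fall on either side of 0.5. Interpolation is one common choice, adopted here as an assumption, and the labeling proportions are invented for a hypothetical seven-step continuum.

```python
def category_boundary(stimulus_values, proportion_first_label):
    """Return the stimulus value at which the first label is chosen 50% of
    the time, using linear interpolation; return None if there is no crossing."""
    pairs = list(zip(stimulus_values, proportion_first_label))
    for (x0, p0), (x1, p1) in zip(pairs, pairs[1:]):
        if p0 == 0.5:
            return float(x0)
        # A sign change around 0.5 means the boundary lies between x0 and x1.
        if (p0 - 0.5) * (p1 - 0.5) < 0:
            return x0 + (0.5 - p0) * (x1 - x0) / (p1 - p0)
    if pairs and pairs[-1][1] == 0.5:
        return float(pairs[-1][0])
    return None


# Hypothetical data: proportion of /s/ responses along an /s/-/ʃ/ continuum.
steps = [1, 2, 3, 4, 5, 6, 7]
proportion_s = [1.0, 1.0, 0.9, 0.7, 0.3, 0.1, 0.0]

boundary = category_boundary(steps, proportion_s)
print(f"Estimated boundary at step {boundary:.2f}")
```

With these invented proportions the boundary falls midway between steps 4 and 5. A manipulation that shifts the labeling function, such as the change in F1 starting frequency in the “say”-“stay” experiment, would show up as a change in this value.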

Categorical perception as a method to study speech perception is sometimes thought of as outdated. Some people think of it as a method associated with the motor theory of speech perception. As reviewed below in the section “Speech Perception and Word Recognition,” the method is used in contemporary research to identify the influence of the lexicon on phonetic perception.

The Competition: General Auditory Explanations of Speech Perception

An obvious approach to understanding speech perception is to regard the speech acoustic signal as a sufficiently rich source of information for a listener’s needs. In this view, the speech acoustic signal contains reliable, learnable information for the identification of sounds, words, and phrases intended by a speaker. A general auditory explanation of speech perception seems simple, and perhaps the logical starting point—like a default perspective—for scientists who study speech perception. Contrary to this apparent logic, general auditory explanations have fought an uphill scientific battle since the motor theory was formulated in the 1950s.

The information presented above describes the reasons for the development of the original and revised motor theories of speech perception. Those reasons led to one overarching assumption concerning speech perception: a special perceptual processor is required because general auditory mechanisms are not up to the task of perceiving speech. In contrast, a central theme of general auditory explanations of speech perception is that special perceptual mechanisms are not required.

As in the published work on the motor theory, a set of reasons in support of a general auditory account of speech perception has been carefully articulated in the scientific literature. These are summarized below.

Sufficient Acoustic Invariance

As noted earlier, Blumstein and Stevens (1979) demonstrated a fair degree of acoustic consistency for stop consonant place of articulation, and many of the successful automatic classification experiments described in Chapter 11 imply consistency in the acoustic signal for vowels, diphthongs, nasals, fricatives, and semivowels. Recall from Chapter 11 that Lindblom (1990) argued for a more flexible view of speech acoustic variability. In this view, listeners do not need absolute acoustic invariance for a speech sound, but only enough to maintain discriminability from neighboring sound classes.

General auditory accounts of speech perception rely on this more flexible view of acoustic distinctiveness for the perception of speech sounds. Presumably, an initial front-end acoustic analysis of the speech signal by general auditory mechanisms is supplemented by higher-level processing which resolves any ambiguities in sound identity. The front-end analysis is like a hypothesis concerning the identity of the sequence of incoming sounds, based on initial processing of the incoming acoustic signal. The higher-level processes include knowledge of the context in which each sound is produced, plus syntactic, semantic, and pragmatic constraints on the message. Listeners bring more to the speech perception process than a capability for acoustic analysis. These additional sources of knowledge considerably loosen the demand for strict acoustic invariance for each sound segment.

Scientists often refer to the front-end part of this process as “bottom-up” processing, and the higher-level knowledge affecting perception as “top-down” processing. The top-down processes influence the bottom-up analyses, taking advantage of the rich source of information in the auditory signal. Stevens (2005) has proposed a speech perception model in which bottom-up auditory mechanisms analyze the incoming speech signal for segment identity and top-down processes resolve ambiguities emerging from this front-end analysis. When an account of speech perception is framed within the general cognitive abilities of humans, including top-down processes, a role for general auditory analysis in the perception of speech becomes much more plausible (Lotto & Holt, 2006). In this view, the lack of strict acoustic invariance for speech sounds cannot be used as an argument against a primary role of general auditory mechanisms in speech perception.

Recall from Chapter 11 the discussion of locus equations. Chapter 11 presents the equations as linear functions relating acoustic measurements of F2 onset and F2 target, with slopes and y-intercepts that are unique for each of the three stop places of articulation in English. Fruchter and Sussman (1997) explored the perceptual value of locus equations by varying F2 onset and F2 target in small steps and presenting the resulting stimuli to listeners for identification of /b/, /d/, and /g/. This approach is similar to the early experiments performed by the Haskins Laboratories scientists when they varied F2 onset in small steps and determined that listeners responded to these variations with categorical labels.

Fruchter and Sussman (1997) found that the varying combinations of F2 onset and F2 target were not heard as continuous variations, but were clustered in the categories /b/, /d/, and /g/, consistent with the acoustic measurements that separated the three stops. This is an auditory explanation of the perception of place of articulation, but with a twist. The twist, described by Sussman, Fruchter, Hilbert, and Sirosh (1998), is that many animals have special neural mechanisms for connecting two acoustic events and using those connections to establish categories (see Andoni & Pollak, 2007). In the case of locus equations, the two events are the F2 onset and F2 target and the connection between them forms the linear functions shown in Chapter 11. Sussman et al. argued that these connections may be species-specific and matched to vocalization characteristics of the different species. The perceptual basis of locus equations is like a combination of auditory and special speech module theories: processing of auditory information (the combinations of F2 onset and F2 target) and special, specific mechanisms to use the information to establish categories such as stop place of articulation.
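The linear function at the heart of a locus equation can be obtained with an ordinary least-squares fit. The sketch below assumes the usual convention of plotting F2 onset (y-axis) against F2 target (x-axis) across several vowel contexts for one stop consonant. The frequency values are invented and were chosen to fall exactly on a line, so the fitted slope and y-intercept are easy to check; real measurements would scatter around the line.

```python
def fit_line(x_values, y_values):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements (Hz) for one stop in four vowel contexts.
f2_target_hz = [1000, 1500, 2000, 2500]
f2_onset_hz = [1000, 1400, 1800, 2200]

slope, intercept_hz = fit_line(f2_target_hz, f2_onset_hz)
print(f"F2 onset = {slope:.2f} * F2 target + {intercept_hz:.0f} Hz")
```

Repeating the fit for each place of articulation would give the set of slopes and y-intercepts that, according to the locus equation account, distinguish the three stop places.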

Replication of Speech Perception Effects Using Nonspeech Signals

Many categorical perception effects have been demonstrated using synthetic speech stimuli. Several of these experiments are reviewed above. There are also demonstrations of similar effects using nonspeech signals. Categorical perception of speech signals has been a centerpiece of the original and revised motor theory. The demonstration of the same effects with nonspeech signals, however, seems to damage the proposed link between speech production and speech perception implied by findings of categorical perception for speech signals.

The approach employed by scientists interested in a general auditory theory of speech perception is to reproduce a categorical perception effect for speech signals using nonspeech signals. If the results of a nonspeech experiment are the same as those of a speech experiment, the categorical perception effect can be attributed to general auditory mechanisms, not to perceptual mechanisms specialized for speech. A well-known example of such an experiment was published by Pisoni (1977), who showed that categorical perception functions for the voiced-voiceless contrast were probably due to auditory, not speech-special, mechanisms.

Pisoni (1977) reviewed speech perception experiments in which labeling and discrimination data suggested categorical perception of voiced and voiceless stops. A typical set of data for this experiment is shown in Figure 12–11, where VOT is plotted on the x-axis and labeling (red points, left axis) and discrimination (blue points, right axis) are plotted as two y-axes. In this experiment, the stimuli are synthesized to sound like stop-vowel syllables, with all features held constant except for VOT. Assume the transitions are synthesized to evoke perception of a bilabial stop and that VOT is varied in 5-ms steps from lead values (negative VOT values) to long-lag (positive) values. Figure 12–11 shows a VOT continuum ranging from −10 ms to +50 ms. For each VOT value, listeners are asked to label the stimulus as either /b/ or /p/. In the discrimination experiments, two stimuli from the VOT continuum are presented, always separated by 20 ms along the continuum (e.g., one stimulus with a VOT of 0 ms, the other with a VOT of 20 ms). In the discrimination experiment, listeners are asked if the stimulus pairs are the same or different. The results in Figure 12–11 show only /b/ labels for stimuli with VOTs from −10 ms to +15 ms, half /b/ and half /p/ labels for the stimulus with VOT = 20 ms, and all /p/ labels for stimuli with VOT = 30 ms or more. The change from /b/ to /p/ labels occurs quickly in the 20 to 30 ms range of VOTs, as required for an interpretation of categorical perception. The discrimination data (blue points) show perfect discrimination when the two VOT stimuli cross the labeling “boundary,” but much poorer discrimination when the two stimuli are within a category. These labeling and discrimination VOT data meet the requirements for categorical perception as reviewed above. The interpretation of categorical perception of VOT was consistent with the motor theory of speech perception. Speakers cannot produce continuous changes in VOT, so they cannot perceive them.
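The link between labeling and discrimination in a categorical perception experiment can be sketched with a short calculation. The example below uses the labeling pattern described for the VOT continuum and a simple rule, adopted here as an assumption, in which listeners can tell two stimuli apart only to the extent that they would label them differently: predicted proportion correct = 0.5 + 0.5(p1 − p2)², where p1 and p2 are the proportions of /b/ labels for the two stimuli. The rule applies to a task with chance performance at 0.5, which is not identical to the same-different task described above. The labeling proportion for the 25-ms stimulus is not given in the text and is an assumed value.

```python
VOT_STEPS_MS = list(range(-10, 55, 5))  # -10 ms to +50 ms in 5-ms steps
PAIR_SEPARATION_MS = 20                 # spacing of discrimination pairs


def proportion_b(vot_ms):
    """Proportion of /b/ labels for a stimulus, following the text."""
    if vot_ms <= 15:
        return 1.0
    if vot_ms == 20:
        return 0.5
    if vot_ms == 25:
        return 0.25  # assumed; the text gives no value for this stimulus
    return 0.0


def predicted_discrimination(vot_a_ms, vot_b_ms):
    """Predicted proportion correct if discrimination relies only on labels."""
    difference = proportion_b(vot_a_ms) - proportion_b(vot_b_ms)
    return 0.5 + 0.5 * difference ** 2


predictions = {
    (vot, vot + PAIR_SEPARATION_MS):
        predicted_discrimination(vot, vot + PAIR_SEPARATION_MS)
    for vot in VOT_STEPS_MS
    if vot + PAIR_SEPARATION_MS <= VOT_STEPS_MS[-1]
}

for (vot_a, vot_b), p_correct in predictions.items():
    print(f"{vot_a:>4} ms vs {vot_b:>3} ms: predicted {p_correct:.2f}")
```

Under these assumptions, the pairs that straddle the labeling boundary are predicted to be discriminated perfectly, and pairs drawn from within a single category are predicted to be at chance, which is the pattern expected for categorical perception.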
