Audiovisual Speech Perception and Speech Perception Training

last night” will impart an entirely different meaning than a talker who is grinning and disentangling a refrigerator magnet from her hair.


In addition to imbuing a spoken message with nuance and subtle semantic meaning, the visual speech signal conveys basic information, such as when a talker is about to speak and the rhythm and cadence of speech. It also conveys phonemic information, such as indicants of place of production (e.g., /p/ versus /f/ versus /g/) and manner (e.g., /l/ versus /d/). Interestingly, fMRI studies have shown that silent lipreading activates core regions of the temporal auditory cortex (Calvert, Bullmore, Brammer, Campbell, Williams, McGuire, Woodruff, Iversen, & David, 1997; Pekkola, Ojanen, Autti, Jääskeläinen, Möttönen, Tarkiainen, & Sams, 2005), just as if the individual were listening to auditory speech. The visual speech signal also intensifies the auditory cortex’s response when it accompanies the auditory speech signal, as during speechreading (Okada, Venezia, Matchin, Saberi, & Hickok, 2013). Such cortical activation is “speech-specific” and does not occur when a person watches simple mouth movements, such as chewing.


Lipreading Performance


The visual speech signal is very difficult to recognize when it is presented apart from the auditory speech signal. For example, on a test of isolated word recognition in which the talker’s head and upper body were clearly displayed on the monitor, a group of young adults recognized only about 35% of the words correctly when they could see but not hear the talker. Performance was even worse for sentences: participants recognized only about 16% of the words (Sommers, Tye-Murray, & Spehar, 2005). With this said, lipreading shows remarkable variability across people. In this study, for example, performance on the same stimuli ranged from 5% to 80% words correct! This variability has baffled scientists for decades, and there is no consensus as to which variables predict performance, although an individual’s general intelligence, degree of education, sex, duration of acquired hearing loss, and gaze behavior appear not to distinguish talented lipreaders from poor ones (Hygge, Rönnberg, Larsby, & Arlinger, 1992; Rönnberg, 1995; Tye-Murray, Sommers, & Spehar, 2007a). Certain cognitive skills, such as spatial working memory and processing speed, are somewhat predictive (Feld & Sommers, 2009), with better skills being associated with better lipreading ability, but even so, these skills tend to account for little of the variance found in group studies. Adults who have prelingual hearing loss (that is, hearing loss acquired before the acquisition of spoken language) also tend to be better lipreaders (Pimperton, Ralph-Lewis, & MacSweeney, 2017). The most reliable predictor, and the one of particular relevance to audiologic rehabilitation for adults, is age; indeed, age is the most robust predictor of lipreading performance yet reported in the literature.


The Effects of Age on Lipreading Performance


It is a sad twist of fate that just as hearing sensitivity and word discrimination begin to decline because of aging, so too does lipreading ability (Sommers et al., 2005; Tye-Murray, Sommers, & Spehar, 2007b; Tye-Murray, Sommers, & Spehar, 2008). For example, Figure 14–1 shows results from the author’s study in which the team presented words in a closed-set matrix sentence test (called the “build-a-sentence test”) to a group of adults who ranged in age from 22 to 92 years (Tye-Murray, Spehar, Myerson, Hale, & Sommers, 2016). Matrix-style tests are useful for testing vision-only speech perception because they avoid the “floor effects” often associated with open-set tests of words or sentences. In the build-a-sentence test, the target words for each test sentence are selected randomly without replacement from a closed set of 36 nouns and placed in one of several possible sentence contexts (e.g., “The boys and the dog watched the mouse” or “The snail watched the girls and the whale”). A response screen for one version of the build-a-sentence test appears in Figure 14–2.
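To make the construction of such stimuli concrete, the following is a minimal sketch, in Python, of how matrix-style test sentences of this kind could be generated. The noun list and sentence frames shown here are illustrative stand-ins, not the actual build-a-sentence materials, which draw on a closed set of 36 nouns.

```python
import random

# Illustrative stand-ins: the real build-a-sentence test uses a closed
# set of 36 nouns; only a handful are listed here.
NOUNS = ["boys", "dog", "mouse", "snail", "girls", "whale", "cat", "cows"]
FRAMES = [
    "The {} and the {} watched the {}.",
    "The {} watched the {} and the {}.",
]

def make_trial(rng):
    """Build one test sentence: draw the target nouns randomly without
    replacement and slot them into a randomly chosen sentence frame."""
    frame = rng.choice(FRAMES)
    n_slots = frame.count("{}")
    targets = rng.sample(NOUNS, n_slots)  # sampling without replacement
    return frame.format(*targets)

rng = random.Random(14)  # fixed seed so the same list can be regenerated
for _ in range(3):
    print(make_trial(rng))
```

Because the same closed set of response alternatives is used on every trial, lists generated this way are of comparable difficulty, a property that matters later in this chapter when scores must be compared across test conditions.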


As Figure 14–1 reveals, lipreading performance decreased across the lifespan, falling by an average of 30 percentage points from young adulthood to old age. The reasons for this decline are not well understood, although changes in peripheral visual acuity can be ruled out, since participants had to have visual acuity within normal limits (with corrective lenses, when appropriate) to be included in the study.


Factors That Influence Lipreading Performance


As indicated in Table 14–1, a number of factors can affect lipreading performance, including the talker, the message, and the environment (see Tye-Murray, 2020, for a review). For example, a talker who speaks very rapidly with minimal mouth movement will be extremely difficult to lipread. A word like basketball, which has few words in the English language that look similar on the mouth, will be easier to lipread than a word such as can, which not only has many visually similar words (e.g., car, guard, hen) but also is not produced with very visible articulatory movements (i.e., speaking the word entails little lip movement, and the initial consonant is produced at the back of the mouth). Background noise can diminish lipreading itself, quite apart from its adverse effects on audiovisual and auditory-only speech recognition. For example, young and old adults alike lipread speech more accurately in a quiet test condition than in a condition in which white noise is presented simultaneously with the visual speech signal, and performance declines even more if the background noise is multitalker babble (Myerson et al., 2016).





Many of these factors can be externally controlled and might be targeted during family-centered care in an adult audiologic rehabilitation plan (Lind et al., 2019). For example, during a discussion of “where and when communication goes well and where it is a problem” (p. 35), a family member might note that the patient never seems to respond when spoken to while water is running from the kitchen faucet, which in turn could lead to a discussion about the effects of noise on speechreading and listening. For a family member who “talks a mile a minute,” training in the art of clear speech (Picheny, Durlach, & Braida, 1985) might be included in the audiologic rehabilitation plan. Patients themselves can learn to optimize their use of the visual speech signal, for example, by seeking favorable seating at a lecture or podium presentation.


Audiovisual Speech Perception


The contribution of the visual signal to face-to-face communication is best demonstrated during speech perception in noisy environments. When a conversational partner can both see and hear the talker, speech perception is substantially more accurate than when listening alone (Tye-Murray, Sommers, & Spehar, 2007b). Conversational partners can also tolerate more unfavorable signal-to-noise ratios without a loss in accuracy when speechreading as compared to listening (Grant & Seitz, 2000; MacLeod & Summerfield, 1987; Sumby & Pollack, 1954).


Being able to see the talker confers benefits beyond improved accuracy. An important advantage of speechreading over listening alone is a reduction in perceptual effort (Gosselin & Gagné, 2011). As the name implies, perceptual effort is the effort someone expends to perceive speech, sometimes at the expense of other cognitive resources such as working memory and attention. It is perhaps for this reason that audiovisual presentations, as compared to auditory-only presentations, also lead to better comprehension of short stories and dense philosophical texts (Arnold & Hill, 2001; Reisberg, McLean, & Goldfield, 1987).


Assessment of Audiovisual Speech Perception Performance


Sadly, routine audiologic examinations seldom include assessment of audiovisual speech perception in either quiet or noise, even though most everyday conversations occur face-to-face and people are more likely to engage in speechreading than in listening alone.


Scores of performance in an audiovisual condition could serve as a powerful counseling tool for both the patient and family members. For example, knowing how well a patient can speechread could set the stage for developing realistic expectations about the kind of benefit a hearing aid might provide, or for communication strategies training, which might include a discussion about the importance of making sure the patient can see the talker’s face. Tests are available for this purpose. The test formats may include any of the following:


Matrix style sentences such as that shown in Figure 14–2;


Word lists that have been constructed so as to avoid floor effects and that include words like basketball (Tye-Murray & Geers, 2002);


Topically related sentences for which the patient learns the topic beforehand, such as summertime (Boothroyd, Hanin, & Hnath-Chisolm, 1985);


Sentences spoken by a variety of talkers (Tyler, Preece, & Tye-Murray, 1986);


Sentences that either are supported by a picture illustration (i.e., the patient sees a picture of a flower in a window sill and then is asked to lipread the sentence, “The flowers were placed on the window ledge.”) or that require a “gist response” (i.e., the patient sees a sentence spoken and then selects a corresponding illustration from a choice of four) (Tye-Murray, Hale, Spehar, Myerson, & Sommers, 2014).


Perhaps one of the easiest and most effective ways an audiologic practice can distinguish itself in today’s competitive market is to incorporate comprehensive speech perception assessment into its diagnostic battery, one that includes the ecologically valid measure of audiovisual speech perception.


The Audiovisual Speech Advantage and Analyzing Test Results


The advantage of seeing and hearing the talker, as opposed to only hearing the talker, is often referred to as the audiovisual speech advantage, and it is often discussed using two related terms: visual enhancement and auditory enhancement. The term visual enhancement is used when we talk about how the visual signal enhances the ability, typically of someone who has hearing loss, to recognize speech. In this case, we might measure how well a person with a new hearing aid performs on a speech test in an auditory-only condition, without the visual signal, and then repeat the testing in an audiovisual condition, presenting both the talker’s voice and speaking image. The boost in the patient’s performance when the visual signal is added is the visual enhancement; it indicates how much the visual signal enhances the patient’s listening-alone performance.


The term auditory enhancement is used when we talk about how the auditory signal enhances speech perception performance. For example, in the early days of cochlear implants, a primary goal of providing a patient with a cochlear implant was simply to improve audiovisual speech perception; auditory-only word recognition was a dream for the future (since realized) (Tyler, 1993). In this case, how well a person with a new cochlear implant performed on a speech test in a vision-only condition was measured without the auditory signal, and then the test was repeated in an audiovisual condition. The boost in performance that occurred when the auditory signal was added was the amount of auditory enhancement, and it was considered an indicant of the extent to which the electrical stimulation provided by the cochlear implant was improving face-to-face speech perception.


To compute either visual enhancement or auditory enhancement, patients must be tested in an audiovisual condition and in either an auditory-only or a vision-only condition, respectively. Comparable test stimuli must be used in each condition so that any gains (or losses) reflect audiovisual speech perception rather than differences in test list difficulty. For this reason, the type of matrix test presented in Figure 14–2 is often used, because the test items are the same across test conditions and learning effects with repeated exposure to test items are not a problem, as they would be with open-set test lists. The computation might be either a simple difference score, computed by subtracting the percent words correct in the unimodal condition from that in the multimodal condition, or a normalized ratio, whereby you first determine how much room is available for improvement in the unimodal condition and then reference the improvement gained in the multimodal condition to that amount. For example, if you are interested in measuring visual enhancement, you would use the formula:


(AV% correct − A% correct) / (100% − A% correct),


where AV% correct refers to the percentage of words identified correctly in the audiovisual condition and A% correct refers to the percentage of words identified correctly in the auditory-only condition (and 100% represents perfect performance).
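As a concrete illustration, here is a minimal sketch, in Python, of both computations. The function name and interface are invented for illustration; the same function serves for auditory enhancement if the vision-only score is passed as the unimodal score.

```python
def enhancement(multimodal_pct, unimodal_pct):
    """Return the audiovisual advantage as both a simple difference
    score and a normalized ratio.

    multimodal_pct -- percent words correct in the audiovisual condition
    unimodal_pct   -- percent words correct in the auditory-only condition
                      (visual enhancement) or the vision-only condition
                      (auditory enhancement)
    """
    if not 0 <= unimodal_pct < 100:
        raise ValueError("unimodal score must be at least 0 and below 100")
    difference = multimodal_pct - unimodal_pct
    # Reference the improvement gained to the room available for improvement.
    normalized = difference / (100.0 - unimodal_pct)
    return difference, normalized

# Visual enhancement for the counseling example given below:
# 50% words correct listening alone, 90% words correct audiovisually.
diff, ratio = enhancement(multimodal_pct=90, unimodal_pct=50)
print(diff, ratio)  # 40 percentage points; normalized ratio of 0.8
```

The normalized ratio is often preferable when comparing patients, because a 10-point gain means something different for a listener starting at 40% correct than for one starting at 85% correct.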


Imagine the power of knowing a patient’s visual enhancement score when counseling family members. For example, you might say to a patient’s wife, “When your husband can only hear you, he recognizes 50% of what you have to say. However, if you are sure to get his attention before you begin to speak, and if you make sure that he can see your face clearly, he’ll be able to recognize 90%.” Most likely, a light bulb will go off in the wife’s head, as she realizes the potential for reducing both the occurrence of communication breakdowns and the need for using expressive repair strategies.


A patient’s auditory enhancement score can also be useful, especially as an outcome measure of hearing aid benefit. For example, you might be able to demonstrate to a patient that without amplification, his or her ability to recognize speech during everyday communication situations is 50% but when he or she is wearing the new hearing aids, performance improves by 25%.


Theoretical Underpinnings of the Audiovisual Speech Advantage


A great deal of research has been directed toward understanding the audiovisual speech advantage (Blamey, Cowan, Alcantara, Whitford, & Clark, 1989; Braida, 1991; Grant, Walden, & Seitz, 1998; Massaro, 1996; Tye-Murray et al., 2008), yet there is no consensus as to why seeing and hearing the talker is “super-additive,” meaning that a patient will perform significantly better in an audiovisual condition than would be predicted by adding together the patient’s percent correct scores from an auditory-only test condition and a vision-only condition. For example (and similar to the point made at the outset of this chapter), a patient may score 30% in an auditory-only condition and 16% in a vision-only condition, and yet, in an audiovisual condition, recognize not 46% of the words, as one might predict if performance were simply additive, but rather 70%. Researchers also have attempted to understand the wide variability of visual enhancement found within groups of research participants, such as that reported by Sommers et al. (2005).
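The arithmetic behind this example can be made explicit with a short sketch. The probability-summation baseline shown here is a common point of comparison in this literature, included as an illustration; it assumes the two modalities capture words independently.

```python
# Scores from the example in the text (percent words correct).
a_only, v_only, observed_av = 30, 16, 70

# Simple additive prediction: 30 + 16 = 46% correct.
additive = a_only + v_only

# Probability summation: a word is counted correct if either modality
# alone would have captured it, assuming the two are independent.
prob_sum = 100 * (1 - (1 - a_only / 100) * (1 - v_only / 100))  # about 41.2%

print(observed_av - additive)            # 24 points beyond the additive sum
print(round(observed_av - prob_sum, 1))  # 28.8 points beyond probability summation
```

That the observed audiovisual score exceeds even the more generous additive prediction is what makes the advantage “super-additive,” and why simple summation models cannot fully explain it.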
