Speech Acoustic Measurement and Analysis
Chapters 8 and 9 are devoted to the theoretical bases of speech acoustics, with acoustic patterns of selected speech sounds presented to illustrate the theory. A variety of techniques are available for generating the speech acoustic displays shown in Chapters 8 and 9, and for using the displays to obtain speech acoustic measurements. These displays and measurements are the subject matter of the current chapter.
The current use of the term “techniques” goes beyond consideration of the instruments used to store, analyze, and represent (i.e., graph) the speech acoustic signal. In this chapter, the term includes the conceptual tools that have been developed to make sense of vocal tract output. When a speech signal—the acoustic output of the vocal tract—is displayed in the several ways discussed below, a large amount of information is available, not all of which is relevant to each of the many reasons for studying speech acoustics. For example, some individuals study the speech signal to make inferences about the articulatory behavior that produced the signal (as discussed in Chapters 8 and 9). Others may be interested in the characteristics of the signal used by listeners to understand speech. Still others may be interested in which parts of the speech signal are the best candidates for computer recognition of speech. And, of course, many scientists have studied the speech acoustic signal to develop computer programs for high-quality speech synthesis. Sometimes, what is important about the speech signal is relevant to all four of these areas of interest, but this is not always the case. Thus, the conceptual tools that make sense of the speech signal may be specific to a particular purpose.
Computers Are Not Smarter Than Humans
Speech recognizers are computer programs that analyze a speech signal to figure out what was said. The programs use acoustic analysis and other data (such as stored information on the probability of one sound following another) to produce a set of words that represents a best “guess” about the true nature of the input signal. Many of these programs learn the patterns of a single talker’s speech, and in doing so improve their recognition performance over time for that talker. Unfortunately, the improved performance does not always transfer to a new talker, whose acoustic-phonetic patterns are different enough from the original talker’s to confuse the speech recognition program. Humans, it should be noted, typically have no trouble transferring their speech recognition skills from one talker to another (but—see Chapter 12 for why this statement may not be completely true).
A brief history of the technology of speech acoustics research illustrates how much progress has been made in a relatively short time. A starting point is the multitalented German scientist Hermann von Helmholtz (1821–1894), who in the 1850s was very much interested in explaining the acoustical basis of vowel quality. Like most phoneticians who puzzled over the relationship between “mouth positions” and different vowel qualities, Helmholtz used an ancient piece of equipment—the ear—as a spectral analyzer to determine the rules linking vocal tract shape and vocal tract output. Helmholtz’s innovation in the study of vowels was to insert between the mouth of the speaker and his own ear a kind of spectral analysis tool—in some cases a series of tuning forks, in others a series of Helmholtz resonators (Figure 10–1). When he used tuning forks (see Figure 10–1, top), Helmholtz asked a laboratory assistant to configure his vocal tract in the position of a specific vowel, and then struck the tuning fork and held it close to the assistant’s lips. If the tuning fork produced a very loud tone, Helmholtz assumed that the cavity inside the vocal tract was excited by the sound waves and was therefore “tuned” to the frequency of the fork. If the sound was weak, the natural frequency of the mouth cavity was assumed to be far from the frequency of the fork. Helmholtz was using the resonance principle discussed in Chapter 7, with the tuning fork serving as input, the vocal tract as resonator, and his ear as the detector of the output (see Figure 7–18 for a model of an input-resonator-output system). When a tuning fork produced a tone that resonated in the assistant’s vocal tract configured for a specific vowel, Helmholtz assumed that the frequency of the fork was a “natural” frequency of that vowel. By using a whole series of tuning forks, ranging from low frequencies to high frequencies, Helmholtz constructed a diagram of the important frequencies for different vowels, or stated differently, the frequencies that distinguish the vowels from one another. In another series of experiments, Helmholtz’s assistants phonated different vowels into the neck of a resonator while he listened at the other end (see Figure 10–1, bottom). By changing resonator size to sample a wide range of frequencies, Helmholtz identified which resonators, and hence which frequencies, seemed to produce the loudest sound at his ear. These frequencies were taken as the “natural” frequencies of the vowels. Both types of experiments show how the principle of resonance was applied to the problem of frequency analysis for vowels.
Helmholtz tried to circumvent a strictly subjective ear analysis by objective measurements, using physical instruments (i.e., tuning forks and Helmholtz resonators) with known frequency characteristics. After all, a sufficient number of tuning forks or resonators provides something like a low-tech Fourier analysis. In Helmholtz’s case, however, the perceptual magnitude (i.e., the loudness), rather than a physical amplitude, was identified for selected frequency components in the spectrum. The selected components were limited to the number of tuning forks or Helmholtz resonators available for the analysis. Even with these limitations, Helmholtz’s approach was innovative and creative for the middle of the nineteenth century, long before electronic instruments were available to analyze and quantify energy at different signal frequencies.
Figure 10–1. Helmholtz’s experimental arrangements for determining the important frequencies of vowels. Top panel shows the tuning fork approach. Bottom panel shows the resonator approach.
There were other attempts, prior to the electronics age, to obtain objective data from the speech signal. For example, in the latter part of the nineteenth century, W. König knew that the acoustic output of the vocal tract was in the form of pressure waves (Chapter 7, Figure 7–2). These rapidly varying regions of high and low pressures, König reasoned, could be studied by looking at the effect they produced on an observable event. König found such an event in the form of a gas-fed flame like that used in chemistry labs. Figure 10–2 shows a schematic diagram of the König apparatus. Gas was fed to a chamber beneath the burner (“inflow of gas” in Figure 10–2). With a constant gas volume in the chamber, the flame had a constant height. One side of the chamber consisted of a distensible (flexible) membrane, so movements of the membrane into the chamber compressed the gas and raised the height of the flame, whereas outward movements of the membrane expanded the volume of the chamber, which rarefied the gas and lowered the height of the flame. Participants in König’s experiment phonated vowels into a pipe terminated by this distensible membrane, and the compressions and rarefactions of the speech wave within the pipe caused rapid pushes and pulls on the membrane. These movements made the flame dance up and down as the gas volume was alternately compressed and expanded. By filming the motions of the dancing flame during phonation of vowels, König was able to record a crude version of speech waves.
The conceptual basis of König’s technique, of having sound waves create movement in a structure that is transformed into an observable event, is the basis of modern electronic recording of speech signals. For example, many microphones transform acoustic to electrical energy by means of a very thin membrane whose vibratory movement creates varying voltages in response to air pressure fluctuations of a sound wave. These varying voltages are recorded on digital tape or directly to disk, where they provide an electronic replica of the speech wave.
Devices for the transformation of aeromechanical (in this case, pressure) to electrical energy did not become readily available until the early part of the twentieth century. The great advantage of electronic devices is the ability to record and analyze a wider range of frequencies than possible using strictly mechanical devices, and the related ability to capture very accurate details of sound waves. Devices such as König’s were not able to record rapidly changing details of sound waves.
Why do electronic devices extend the range of detailed frequency and amplitude analysis past that of a device like König’s? As described in Chapter 7, there is an inverse relationship between frequency and period. When sound waves contain energy at higher frequencies, the motion of the air particles is very rapid (i.e., they have relatively short periods). In a device like König’s, the ideal situation is for these very rapid motions of air particles to strike the membrane and set it into vibration at exactly the same high frequencies. The high-frequency motions of the membrane would be reflected in high-frequency fluctuations in flame height, and the flame would provide a precise representation of the energy in the sound wave.
Figure 10–2. The König device for visualizing and recording pressure fluctuations in the speech wave. Compressions and expansions of the gas volume produced rises and falls in the flame height, which were filmed to obtain primitive speech waveforms.
Unfortunately, this is not always the case with a mechanical vibrator such as the membrane in König’s device. Because the membrane has mass and therefore demonstrates inertia, it resists being accelerated by certain forces—especially those that last only a short time. The rapidly vibrating air molecules associated with high-frequency energy, and therefore short periods, result in short-lasting compressions and rarefactions. Because the membrane demonstrates inertia, it is likely to move in response to these high-frequency vibrations only a little bit, or perhaps not at all, because high-frequency forces are not applied over a long enough time interval to overcome the opposition to acceleration. Think of it this way: a short-lasting compression is applied to the membrane to compress the gas in the chamber, but the membrane does not respond immediately to the applied force because it has mass and opposes being accelerated. By the time the membrane begins to respond with inward motion, the rapidly varying pressure applied to the membrane has changed to a rarefaction phase, pulling the membrane outward, away from the gas-containing chamber and potentially enlarging the volume of gas within the chamber. In a sense, the membrane never gets a chance to respond accurately to the applied forces because of its inertial properties. The membrane acts like a filter, responding accurately with motion to lower frequency signals whose energy can be applied to it over a long enough time interval, but being insensitive to, and hence filtering out, the energy associated with higher-frequency vibration. The vibration of the membrane does not represent accurately all details of the pressure wave applied to it. The vibration of the membrane distorts the details of the pressure wave, and this distortion is passed on to the variation in flame height. Thus, the fluctuation in flame height over time is not an entirely accurate representation of the acoustic event.
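For readers who like to see this idea in numbers, the brief Python sketch below treats a massive membrane as a simple first-order low-pass system. The time constant and the test frequencies are assumed values chosen only for illustration; they are not measurements of König’s apparatus. The output shows that a low-frequency pressure variation is passed at nearly full amplitude while a high-frequency variation is strongly attenuated, which is the filtering behavior just described.

```python
import numpy as np

# A minimal sketch: model a massive membrane as a first-order low-pass system.
# The time constant below is an assumed value chosen only for illustration.
fs = 44100                      # sampling rate (Hz)
t = np.arange(0, 0.1, 1 / fs)   # 100 ms of signal
tau = 0.0005                    # assumed membrane time constant (s)
alpha = (1 / fs) / (tau + 1 / fs)

def membrane_response(x):
    """First-order low-pass: the output cannot follow rapid input changes."""
    y = np.zeros_like(x)
    for n in range(1, len(x)):
        y[n] = y[n - 1] + alpha * (x[n] - y[n - 1])
    return y

for freq in (100, 3000):        # low- vs. high-frequency pressure variation
    x = np.sin(2 * np.pi * freq * t)
    y = membrane_response(x)
    # Measure the steady-state amplitude after the filter has settled.
    gain = np.max(np.abs(y[len(y) // 2:]))
    print(f"{freq:5d} Hz component passed with relative amplitude {gain:.2f}")
```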
In the case of the König device, the relatively massive membrane transduced the pressure waves via compression or rarefaction of the gas in the chamber. König’s membrane had to be relatively massive to produce effective compressions and rarefactions of the gas molecules. When electronic recording and storage of acoustical signals became available, this problem was more or less eliminated. Electronic transducers, which form the heart of a microphone, are still membrane-like but are extremely delicate and have minimal mass. The vibrations of these transducers produce changes in the motion of electrons in electrical circuits. The ease of electron acceleration and deceleration (due to negligible mass) allows these thin, minimal-mass membranes to respond faithfully to very high frequencies in sound waves.
Early electronic recordings and analyses of vocal tract output resulted in waveforms such as those shown in Figure 10–3. A 70-ms waveform “piece” is shown for each of the four corner vowels of American English (/ɑ/, /i/, /æ/, /u/, clockwise from upper left-hand panel). These vowels were produced as sustained phonations by an adult male. The 70-ms pieces shown in the figure were extracted from the sustained sounds using computer-editing techniques. These waveforms are faithful representations of the pressure waves associated with the vocal tract outputs (they are “clipped” a bit at the bottom, by the picture, not the recording process) for the corner vowels, with some obvious similarities across the four waveforms. First, each waveform has a repeating period, one of which is marked in each panel. The repeating period is expected because all vowels in English are produced with nearly periodic vibration of the vocal folds. Second, each waveform shows smaller amplitude vibrations between the largest amplitude peaks. The largest amplitude peaks reflect the energy produced at the instant of vocal fold closure (see Chapter 8), and the smaller energy peaks are produced by resonances in the vocal tract. From the vowel theory presented in Chapter 8, it is known that the vocal tract resonances are different for the four corner vowels; these different resonances explain the different appearance of the four waveforms in Figure 10–3. For example, the /ɑ/ and /æ/ waveforms seem to have more complex energy patterns within a given period compared with the /i/ and /u/ waveforms. Note the many amplitude fluctuations within the periods of the first two vowels, compared with the smaller number of amplitude peaks within the periods of /i/ and /u/.
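The repeating period T marked in Figure 10–3 can also be estimated by computation. The short Python sketch below uses a synthetic vowel-like waveform, because the original recordings are not available; the sampling rate, fundamental frequency, and harmonic amplitudes are assumptions. It finds the period with a simple autocorrelation and converts the period to a fundamental frequency, much as one might measure T by hand on the printed waveform.

```python
import numpy as np

fs = 16000                           # assumed sampling rate (Hz)
f0_true = 120.0                      # assumed fundamental of a synthetic "vowel"
t = np.arange(0, 0.070, 1 / fs)      # a 70-ms piece, as in Figure 10-3

# Build a crude vowel-like waveform: a harmonic series with falling amplitudes.
x = sum((1.0 / k) * np.sin(2 * np.pi * k * f0_true * t) for k in range(1, 11))

# Autocorrelation: the first strong peak (beyond lag 0) marks the period T.
ac = np.correlate(x, x, mode="full")[len(x) - 1:]
min_lag = int(fs / 400)              # ignore lags shorter than a 400-Hz period
peak_lag = min_lag + np.argmax(ac[min_lag:])

T = peak_lag / fs                    # period in seconds
print(f"estimated period T = {1000 * T:.2f} ms, F0 = {1 / T:.1f} Hz")
```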
The earliest microphone was developed in 1876 or 1877. It used a membrane that was displaced by sound waves, and the motion of the membrane was transmitted to a metal pin sitting in an acid solution. The motion of the membrane moved the pin to various depths of the acid solution, which changed the electrical characteristics of the system. Happily, this “liquid transmitter” system didn’t catch on, because Alexander Graham Bell found he had to shout at the membrane to produce even a barely audible sound at the other end of the device (three miles away!).
Figure 10–3. Seventy-millisecond waveform pieces from each of the four corner vowels of English. Pieces were extracted from sustained vowels. The waveforms are “clipped” a bit at the bottom, by the picture, not the recording process. Note the individual glottal pulses, and the period (T) marked on each vowel waveform. Note also the different waveform appearance depending on which vowel was produced.
Although the resonance patterns are reflected in the different waveform patterns displayed in Figure 10–3, the differing resonant frequencies of the vowels cannot be determined merely by looking at the waveforms. The resonant frequencies can be determined using the technique of Fourier analysis, in which a complex waveform is decomposed into the frequencies and amplitudes of its component sinusoids (see Chapter 7). Originally, this may have been done with paper and pencil, applying the correct formulas to the measurements, or with a mechanical device called a Henrici analyzer. In either case the process was tedious. As electronic devices became more sophisticated, specialized instruments were developed for automatic computation of a Fourier spectrum. These devices, called spectrum analyzers, took a waveform as input and stored some part of it in an electronic memory. The stored part might be a 70-ms piece such as the ones shown in Figure 10–3. The spectrum analyzer performed a Fourier analysis of that piece and showed the computed spectrum on a display screen. These analyses were quite accurate, but required a fair amount of computation time. Moreover, the spectrum results were more accurate when the stored piece was longer, rather than shorter.
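In modern digital terms, the job of such a spectrum analyzer is a Fourier transform of the stored piece. The minimal Python sketch below computes the magnitude spectrum of a 70-ms piece; the waveform is synthetic, and the sampling rate and duration are assumptions. In a plot of this spectrum, harmonics and formant regions would appear as peaks.

```python
import numpy as np

fs = 16000                                # assumed sampling rate (Hz)
dur = 0.070                               # 70-ms piece, as in Figure 10-3
t = np.arange(0, dur, 1 / fs)

# Synthetic stand-in for a stored vowel piece: 120-Hz harmonics, first 20 partials.
x = sum((1.0 / k) * np.sin(2 * np.pi * k * 120 * t) for k in range(1, 21))

# Fourier analysis of the whole piece (Hann window reduces spectral leakage).
spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), 1 / fs)

peak_freq = freqs[np.argmax(spectrum)]
print(f"strongest component at about {peak_freq:.0f} Hz (near the 120-Hz fundamental)")
```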
The duration of the waveform undergoing analysis was of great concern for speech scientists, who even at the dawn of the electronic age had a good idea that the articulators, and hence the shape of the vocal tract, were in nearly constant motion. A discussion of the relationship between vocal tract resonances and the shape of the vocal tract is presented in Chapter 8. When the shape of the vocal tract changes, so do the resonant frequencies. During speech production, many phonetically relevant changes in vocal tract configuration are quite rapid, in some cases occurring within intervals as brief as 40 to 50 ms. For example, when a lingual stop consonant such as /d/ is released into a low vowel such as /æ/, the vocal tract changes from the consonant to vowel configuration in about 40 ms. This fast change in the shape of the vocal tract results in rapidly changing vocal tract resonances. These time-varying resonances of the vocal tract are called formant transitions.
Spectral analysis of rapidly changing formant transitions is often performed to determine the frequency range covered by a formant transition, or the rate of the frequency change as a function of time (that is, the slope of the frequency change). A schematic second formant (F2) transition for the syllable /bæ/, produced by an adult male speaker, is shown in Figure 10–4. The upward-pointing arrow at 0 ms indicates the beginning of the formant transition, which occurs at the first glottal pulse (vocal fold vibration) following release of the /b/. The downward-pointing arrow at 50 ms marks the end of the formant transition, or the point at which the frequency ceases changing as a function of time (note the relatively constant F2 value after the downward-pointing arrow, indicating a steady vocal tract shape for the vowel “target”). This F2 transition, which covers a frequency range of 600 Hz in 50 ms, presents problems for the “static” spectrum analyzer described above. If the entire 50-ms piece of the waveform is submitted for spectral analysis, the resulting spectrum is an average of all the changing frequencies along the transition. This analysis clearly misrepresents the true vocal tract output. The “smeared” spectrum resulting from this analysis is virtually useless, especially with respect to interpretation of the articulatory behavior that produced the transition. The analysis can better match the true frequency event by picking brief temporal pieces along the transition, analyzing the spectrum of each piece, and combining the successive spectra to reconstruct the transition. This is shown by the successive short-time pieces along the bottom edge of the transition, each having a duration of 10 ms. As shown in Figure 10–4, a complete analysis of the transition using these 10-ms pieces requires five consecutive spectral analyses (5 pieces × 10 ms per piece = 50 ms, or the duration of the entire transition). This approach minimizes the spectral smear that occurs when the 50-ms piece is analyzed as a single interval, but even so it is not particularly attractive because: (a) the amount of work involved in separately isolating and analyzing each 10-ms piece is substantial; (b) the short-time pieces, as noted above, result in a loss of accuracy in the analysis of the frequency components; and (c) the spectral smear problem is not entirely eliminated, because each 10-ms piece covers a 120-Hz change (120 Hz for each of five 10-ms pieces = 120 × 5 = 600 Hz); a formant frequency change of 120 Hz over a 10-ms interval is a significant change in vocal tract output.
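The piecewise strategy illustrated in Figure 10–4 can be written out directly. In the Python sketch below, a synthetic 50-ms signal whose frequency rises by 600 Hz stands in for the F2 transition (all signal parameters are assumed values); the signal is cut into five 10-ms pieces and the strongest frequency in each piece is reported. Notice that the short pieces also limit frequency resolution to coarse steps, echoing point (b) above.

```python
import numpy as np

fs = 16000                                   # assumed sampling rate (Hz)
dur = 0.050                                  # 50-ms transition, as in Figure 10-4
t = np.arange(0, dur, 1 / fs)

# Synthetic F2-like transition: frequency rises linearly from 1000 to 1600 Hz.
f_inst = 1000 + (600 / dur) * t              # instantaneous frequency (Hz)
x = np.sin(2 * np.pi * np.cumsum(f_inst) / fs)

frame_len = int(0.010 * fs)                  # 10-ms analysis pieces
for i in range(5):
    piece = x[i * frame_len:(i + 1) * frame_len]
    spectrum = np.abs(np.fft.rfft(piece * np.hanning(len(piece))))
    freqs = np.fft.rfftfreq(len(piece), 1 / fs)
    print(f"piece {i + 1}: strongest energy near {freqs[np.argmax(spectrum)]:6.0f} Hz")
```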
Figure 10–4. Schematic drawing showing an F2 transition for /bæ/. Upward-pointing arrow at 0 ms shows beginning of the transition. The end of the transition is shown by the downward-pointing arrow at 50 ms. The pieces spanning the duration of the transition, labeled 1 through 5, show individual 10-ms intervals that can be extracted from the transition for analysis. See text for details.
The electronically based spectrum analyzers available to early speech researchers, therefore, had notable limitations. The analyzers allowed a previously unknown precision of frequency analysis, but the requirements of the analysis (e.g., a long “piece” of the signal for precise analysis) were not well matched to many of the important features of a rapidly changing vocal tract output. Spectral analyses of long-duration speech signals can be used effectively for waveforms extracted from sustained vowels (see Figure 10–3) or some target intervals for a vowel in connected speech. In these cases, the vocal tract shape remains constant over a long time (sustained vowels) or a long enough time (targets in connected speech) to permit a reasonable, fast spectral analysis. But constancy of resonant frequencies over time is not typical of speech production. Rather, rapid frequency change over time is the rule in speech acoustics for both vowels and consonants, and almost certainly plays an important role in speech perception as well.
In summary, electronics allowed the recording and analysis of acoustic waveforms associated with vocal tract output. These electronic waveforms represented the actual fluctuations in pressure waves with a high degree of accuracy but did not provide immediate (i.e., visible) access to the formants, the information critical to an understanding of vocal tract shape (see Chapter 8). Inspection of the waveform revealed the presence of formants, reflected in the complex vibratory features within each waveform period (see Figure 10–3), but not the actual frequencies of the resonances. Early spectral analyzers took pieces of waveforms and performed very accurate measures of formant (resonant) frequencies, but only for relatively long duration waveforms. Because speech production involves rapidly changing vocal tract shapes over relatively short time intervals, these were not ideal analyses. What was needed was an analyzer that was able to display formants as a function of time. Such an analyzer would perform a spectral analysis as a (nearly) continuous function of time and display the spectral peaks (formants) in such a way that a changing shape of the vocal tract could be inferred from a mere glance at the physical record. The development of the instrument capable of doing this kind of analysis was, in part, an ironic by-product of the human race’s oldest failure of communication—war.
THE SOUND SPECTROGRAPH: HISTORY AND TECHNIQUE
Throughout the course of World War II (1939–1945), there was an increasing use of encoded messages sent between different command posts, or from central locations to troop locations on the battlefield. This encoding, or encryption, was necessary because the warring parties were constantly monitoring communications sent by the other side. One side employed the talents of many different people to develop codes for effective encryption of a message, and the other side employed a large staff to figure out how to break these codes. The code breakers typically worked with paper and pencil, laboring over an encoded message and working through possible decoding solutions. At some point during the war, the allies (the United States, Soviet Union, Great Britain, and France) assembled a team of linguists, puzzle and code experts, mathematicians, engineers, psychologists, and other specialists to develop an automated decoding device. The theory behind such a device was relatively simple. An encoded message, such as a radio voice transmission, would be fed into a machine that performed various types of analysis to decode the message (as in the 2014 film The Imitation Game). The project of the allies failed to solve the problem of automated decoding of encoded messages, but one of its products was a device called the sound spectrograph.
The ideal spectrum analyzer for speech, as suggested above, can display formants as a continuous function of time. This is precisely what was produced when the sound spectrograph was used to analyze speech. Scientists in the Soviet Union—prisoner scientists, forced to work on projects ordered by the state—developed a sound spectrograph in the late 1940s, as related by the renowned writer A. I. Solzhenitsyn (1969) in his documentary novel The First Circle. In the novel, Solzhenitsyn’s character Major Roitman explains how the spectrograph displays the speech signal in what he calls a voice print:
In these voice prints speech is measured three ways at once: frequency, across the tape; time, along the tape; and amplitude—by the density of the picture. Therefore, each sound is depicted so uniquely that it can be recognized easily, and everything that has been said can be read on the tape. (p. 217)
A sample spectrogram, with a time-synchronized waveform immediately above, is shown in Figure 10–5. When Major Roitman described frequency across the tape, he was referring to the y-axis of the spectrogram, which extends in this example from 0 Hz (the baseline of the spectrogram) to just below 8.0 kHz. The x-axis in this spectrogram is time, corresponding to Roitman’s along the tape dimension. In the present case, the time axis is marked off in 100-ms increments. The spectrogram shows a number of dark bands, all of which seem to vary in height (i.e., along the y-axis) across time. When Major Roitman described amplitude in terms of the density of the picture, he was referring to the varying darkness of different locations on the spectrogram. The darker the spectrogram at any given point in time and frequency, the greater the energy at that point. In Chapter 8, the formants of vowels were described as peaks in the spectrum, the locations in the spectrum where energy is at a maximum. Because the dark bands in the spectrogram shown in Figure 10–5 indicate locations of very high energy, they are the formants of the vowels. These dark bands vary in height (along the y-axis) across time, showing that the formant frequencies change substantially throughout an utterance.
Today, spectrum analyzers are digital and are programmed to meet the needs of speech analysis. These software packages produce spectrograms and other displays of a host of speech analyses. Figure 10–6 illustrates how the early versions of the spectrograph (the instrument) generated spectrograms (the pictures produced by the spectrograph). The operator of the spectrograph flipped a switch that caused rotation of a turntable platter as shown in Figure 10–6. Wrapped around the edge of the platter was a magnetic strip that served as an electronic recording medium. A speech signal was recorded onto the rotating magnetic band, which functioned as a closed tape loop permitting storage of no more than about 2.5 seconds of continuous speech. The frequency and amplitude information in the speech signal, as a function of time, was stored as magnetic patterns on this loop. With the recorded signal in place on the magnetic loop, the turntable was then rotated much faster than the recording speed, and the magnetic patterns served as the input to a spectrum analyzer.
Figure 10–5. A sample spectrogram, showing time along the x-axis, frequency on the y-axis, and intensity as the darkness of the trace. The interval between each vertical tick on the time axis is equal to 100 ms, and horizontal lines running across the spectrogram from bottom to top (y-axis) mark 1000 Hz increments. The darkness of the tracing at any location indicates the relative intensity at that time-frequency coordinate. The speech waveform is above the spectrogram.
Figure 10–6. Schematic diagram of the components of the classic sound spectrograph. Speech is recorded onto the continuous magnetic tape around the turntable platter, and the magnetic fluctuations corresponding to the speech signal are passed to a spectrum analyzer. The voltage output of the spectrum analyzer is sent to a heated stylus that burns patterns onto special paper mounted on the rotating drum. See text for details.
When the world-famous writer Alexander Solzhenitsyn (1918–2008) wrote The First Circle, he drew on firsthand experience of the sharashka, the Russian word for a prison camp for scientists. In the Soviet Union, many writers, artists, and politically active people were imprisoned simply for their beliefs. Most were sent to forced-labor camps, described in Solzhenitsyn’s monumental work, The Gulag Archipelago. A very few, like Solzhenitsyn, were more fortunate and were sent to a sharashka to work on a scientific project. Solzhenitsyn’s role in the development of the spectrograph was as a linguist. Life at a sharashka was infinitely better than in the regular prison camps, but Solzhenitsyn’s title reveals his feelings about his time there. It is taken from the “first circle of hell” in Dante’s The Divine Comedy.
The spectrum analyzer performed its analysis by moving a fixed-width analysis band, or filter, across the entire frequency range. This is illustrated in Figure 10–7, which shows the hypothetical results of a Fourier analysis of the vowel /i/. Frequency is shown on the y-axis, increasing from bottom to top (i.e., the presentation of the spectrum is rotated 90° counterclockwise and then flipped for this example), and amplitude is shown on the x-axis. Each of the lines in this Fourier spectrum is a harmonic (the first three are indicated on the spectrum), and the length of each line extending to the right from the baseline indicates the relative amplitude of the harmonic (e.g., the third harmonic has greater amplitude than the second harmonic). At the bottom of the frequency scale is a small bracket that extends over a frequency range of 300 Hz. This bracket is the fixed-width (width = 300 Hz) analysis band mentioned at the beginning of this paragraph. The arrow pointing down and to the left from this analysis band is labeled voltage output, which is the key to the spectrum analysis performed by the spectrograph. The analysis band senses the overall energy on the magnetic tape, which it transforms into a voltage. Higher voltages are associated with greater energy, lower voltages with lesser energy. Because the analysis band covers a range of only 300 Hz, the voltage output is only for the frequencies within the band and is an average of all energy within the band. If the analysis band is swept continuously across the entire frequency range, it provides overall voltage outputs as a continuous function of frequency. In Figure 10–7, the arrow pointing up from the 300-Hz analysis band indicates the direction in which the band is moved continuously across the frequency range. The band starts at the lowest frequencies, where the output voltage reflects the overall energy from 0 to 300 Hz. As it moves continuously upward in frequency, the analysis band produces output voltages for the 5- to 305-Hz interval, the 10- to 310-Hz interval, and so forth. The band is always 300 Hz wide; as it moves across the frequency scale it records the average energy for successive 300-Hz bands at progressively higher frequencies, and sends these voltages as output to the stylus shown in Figure 10–6.
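The swept analysis band can be imitated digitally. The Python sketch below is only a rough analogue of the spectrograph’s analog circuitry: it slides a 300-Hz-wide band across a synthetic line spectrum (the fundamental frequency, the harmonic amplitudes, and the 5-Hz sweep step are all assumptions) and records the total energy inside the band at each position, which corresponds to the voltage sent to the stylus.

```python
import numpy as np

# Synthetic line spectrum: harmonics of a 125-Hz voice, with amplitudes shaped so
# that energy concentrations appear in the low and mid frequencies (assumed values).
f0 = 125
harmonics = np.arange(f0, 5000, f0)
amps = 1.0 / (1 + ((harmonics - 300) / 200) ** 2) + \
       0.5 / (1 + ((harmonics - 2300) / 250) ** 2)

band_width = 300      # width of the analysis band (Hz)
step = 5              # sweep step (Hz), an assumed value

centers, band_energy = [], []
low = 0
while low + band_width <= 5000:
    inside = (harmonics >= low) & (harmonics < low + band_width)
    band_energy.append(np.sum(amps[inside] ** 2))   # total energy inside the band
    centers.append(low + band_width / 2)
    low += step

centers = np.array(centers)
band_energy = np.array(band_energy)

# Sample the "voltage vs. frequency" curve at a few band positions.
for c in range(250, 5000, 750):
    idx = np.argmin(np.abs(centers - c))
    print(f"band centered near {centers[idx]:4.0f} Hz -> relative energy {band_energy[idx]:.2f}")
```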
Figure 10–7. Schematic diagram showing how the spectrograph performs spectral analysis by sweeping an analysis band of fixed width (e.g., 300 Hz) across the frequency range of interest, and recording the average voltages from the analysis band as a continuous function of time and frequency. Frequency is on the y-axis, intensity on the x-axis. Each harmonic of a vowel is shown as a line extending to the right; the length of the line indicates the relative intensity of that harmonic (H1 = first harmonic [F0], H2 = second harmonic, H3 = third harmonic, and so forth).
Recall that the magnetic patterns on the tape are fed into (i.e., serve as input to) the spectrum analyzer while the turntable is rotating. The rotation of the turntable means that the spectrum analysis is conducted as a function of time because different “pieces” of the utterance pass the analysis band at different points in time. Thus, at any instant in time, the 300 Hz analysis band provides a voltage output for the frequencies it is covering. When the analysis band is swept across the frequency range continuously, and the turntable rotates enough times to allow the analyzer to sample every instant of an utterance, the result is a set of voltages at all frequencies (i.e., within some predetermined frequency range) and at every instant in time around the tape loop. Thus, the example of Figure 10–7 is for a single instant in time—a spectral “slice in time”—and the sum of all such slice analyses results in a spectrogram of the type presented in Figure 10–5.
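In digital form, the “sum of all such slice analyses” is simply a matrix: one spectral slice per time step, stacked side by side. The Python sketch below, using a synthetic gliding tone and assumed frame settings, builds that matrix directly; each column is the spectrum of one short piece, and the full array is what a spectrogram displays as darkness over time and frequency.

```python
import numpy as np

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
# Synthetic test signal: a tone that glides from 500 Hz to 2500 Hz over 1 second.
f_inst = 500 + 2000 * t
x = np.sin(2 * np.pi * np.cumsum(f_inst) / fs)

frame_len, hop = 256, 64                         # assumed analysis settings
window = np.hanning(frame_len)

slices = []
for start in range(0, len(x) - frame_len, hop):
    piece = x[start:start + frame_len] * window
    slices.append(np.abs(np.fft.rfft(piece)))    # one spectral slice in time

spectrogram = np.column_stack(slices)            # frequency (rows) by time (columns)
print("spectrogram matrix shape (freq bins, time steps):", spectrogram.shape)
```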
How does the voltage from each analysis band result in marks of varying darkness on the spectrogram? The schematic drawing of the spectrograph (see Figure 10–6) shows a drum attached to the turntable, and a stylus marking the drum. When the turntable rotates for the spectral analysis of the recorded speech signal, the attached drum rotates at the same speed. A piece of special heat-sensitive paper is wrapped around the drum, and the stylus is applied to the paper and heated in proportion to the voltage output from the analysis. As the analysis band is swept slowly across the frequency range, the stylus is synchronously transported up the vertical dimension of the spectrogram (see frequency dimension along the drum in Figure 10–6). The varying voltages from the analysis band are burned onto the special paper, with darker regions representing areas of relatively greater acoustic energy, and lighter regions representing areas of relatively lower acoustic energy. The entire process of recording a speech signal onto the tape loop, mounting the paper around the drum, and burning a complete frequency-by-time pattern onto the paper takes place over about 100 seconds. All this to obtain acoustic knowledge of no more than 2.5 seconds of speech.
The Original Sound Spectrograph: Summary
A detailed discussion has been devoted to the origins and function of the sound spectrograph for several reasons. Most importantly, the invention of this instrument initiated a scientific revolution in the study of speech production. For the first time, and with relative ease, study of the time-varying acoustic results of articulatory processes became possible. These time-varying characteristics were revealed most prominently by the always-changing formant frequencies. If articulator movements are changing as a function of time, and therefore changing vocal tract configuration as a function of time, the changes are reflected in formant transitions—formant frequencies that change over time. The discovery of these changes led to new ideas and insights about the behavior of the articulators in speech production.
Another reason for the detailed discussion is to (hopefully) demystify the general engineering concepts of speech acoustic analysis. An engineering degree is not necessary to understand the conceptual basis of spectrographic analysis. The amplitude and frequency characteristics of a speech signal are stored as a function of time on magnetic tape. The time-varying patterns of electromagnetic strength are submitted to a spectrum analyzer in the form of time-varying voltages (corresponding to the time-varying intensity of the magnetic fields on the tape), where voltage is proportional to sound intensity (greater voltage = greater intensity) and the speed with which the voltage changes is proportional to frequency (faster voltage changes [shorter periods] = higher frequencies). The energy in the spectrum is sampled using an analysis band, or filter, with a bandwidth of 300 Hz, which is swept continuously across the entire frequency range of interest. Because the voltage output from the analysis band is available for all frequencies and at every point in time, the spectrograph creates a total picture of the speech spectrum as a function of time. This picture is created by burning the energy patterns onto a piece of heat-sensitive paper, which results in a spectrogram. This process is summarized in Figure 10–6, by following the arrow from the turntable (the magnetic tape) all the way around to the stylus at the spectrograph drum.
Today, when scientists and clinicians make spectrograms to study speech, they do so digitally, using a desktop or laptop computer, a tablet, or a smartphone. These digital spectrograms are displayed on screens and look like the one shown in Figure 10–5. Digital spectrograms are produced almost instantaneously after an utterance has been recorded. The computer allows a spectrogram to be generated in just a fraction of the time required to produce the burned records described above. The principles discussed above for the original spectrograph are basically the same in digital spectrograms; the frequency and amplitude analyses are performed by moving a digital filter from low to high frequencies, with the output being a digital magnitude that is proportional to amplitude as a function of frequency. The spectrograms shown in this textbook were produced using digital techniques, as are other forms of speech acoustic analyses. The rapid development of computer-based analysis of speech has resulted in a host of new analyses for speech acoustics, but the spectrogram remains the gold standard because it is such an immediate and rich source of information about speech production, and speech perception (see Chapter 12).
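As an illustration of how quickly this can now be done, a digital spectrogram similar in layout to the one in Figure 10–5 can be produced in a few lines with widely used scientific Python tools. The sketch below uses scipy and matplotlib; the recording is a synthetic gliding tone (no audio file accompanies this text), and the analysis settings are assumptions rather than the settings used for the figures in this chapter.

```python
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 16000
t = np.arange(0, 2.0, 1 / fs)
# Stand-in for a recorded utterance: a tone sweeping upward over 2 seconds.
x = signal.chirp(t, f0=300, f1=3000, t1=2.0, method="linear")

# Short analysis windows trade frequency precision for good time resolution,
# the same trade-off faced by the original spectrograph.
f, tt, Sxx = signal.spectrogram(x, fs=fs, nperseg=256, noverlap=192)

plt.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12), shading="auto", cmap="gray_r")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Digital spectrogram (darker = more energy)")
plt.show()
```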
A detailed presentation of spectrograms and their interpretation is now provided. Selected information on the application of spectrographic analysis to the understanding of speech disorders is presented in Chapter 11.
Speech Acoustics as a Health Hazard?
Those of us of a certain age (your authors included), who made and analyzed spectrograms before digital spectrograms became a reality, may read the title of this sidetrack and find ourselves smiling and sniffing nostalgically. As the pattern was burned onto the special paper, carbon smoke floated away from the spinning drum and filled the room with the smell of speech acoustics. That smell was something like the exhaust of a car with a corroded, burned-out muffler, seasoned with fumes from an electrical fire. The wearing of light-colored clothes to the lab was discouraged—one would find fine black specks on a nice white sweater after a few hours of spectrogram-making. Some people—especially graduate students assigned to prepare spectrograms—took to wearing surgical masks in the lab.
Interpretation of Spectrograms: Specific Features
Figure 10–8 shows a spectrogram of the utterance Peter shouldn’t speak about the mugs. A broad phonetic transcription of the sounds in the utterance is provided at the bottom of the display. Immediately above the spectrogram, on the same time scale, is the waveform of the utterance. The utterance was produced by an adult male aged 52 years, at a normal rate of speech and without special emphasis on a particular word. The utterance was chosen for its ability to showcase certain spectrographic patterns, not because it has special meaning (as far as we know, Peter does not plan on ruining a surprise birthday present of nice coffee mugs by telling the intended gift-receiver about them before the package is opened). This spectrogram was produced with the computer program TF32, written by Professor Paul Milenkovic of the Department of Electrical and Computer Engineering at the University of Wisconsin–Madison. TF32 is a complete speech analysis program that includes algorithms for recording, editing, and analyzing speech waveforms, as well as displaying the speech signal as a spectrogram. Most of the speech analysis displays shown in this text were produced with TF32 (http://userpages.chorus.net/cspeech/). Another very popular speech analysis program is Praat (http://www.fon.hum.uva.nl/praat/), which performs and displays many of the same analyses performed by TF32, as well as some unique analysis features. Praat is a free download on the Internet.
The important features of the spectrographic display in Figure 10–8 include the x-, y-, and z-axes; glottal pulses; formant frequencies; silent intervals; stop bursts; and aperiodic intervals. Each of these features is discussed below. It is important to point out that a casual glance at the spectrogram suggests a series of chunks, or segments, as the pattern is inspected from left to right. An individual with no training in speech acoustics, shown this spectrogram and asked to find natural “breaks” in the pattern along the time axis, probably would be able to do this easily (try it!). The chunks, or segments, are important because they often correspond roughly to speech sounds. Chapter 11 presents detailed information on the specific acoustic characteristics of the sound segments of English, and in some cases of other languages as well.
The x-axis is time, marked off in successive 100 ms intervals by the short vertical lines occurring at regular intervals along the baseline of the spectrogram. These 100-ms calibration intervals are similar in length to many of the segments in this spectrogram, suggesting a relatively short time span for important events of speech production. Using these calibration intervals, it is possible to estimate the entire duration of the utterance at just under 2000 ms, or a little less than 2 s. This does not seem like a particularly long time, but it is typical for utterance durations. An utterance of 2 s duration contains many distinct segments.
Figure 10–8. Spectrogram showing important features of a spectrographic display. Follow text description for information on axes, glottal pulses, formant frequencies, silent intervals, stop bursts, and aperiodic intervals.