Auditory psychophysics (also called psychoacoustics) is a branch of science that is concerned with the effects of physical stimuli (in this case, sound waves) on the psychological responses of humans. The term “psychophysics” is relevant to any biological system that transduces and processes physical attributes of events in the environment and transforms them into sensations and perceptions. Such systems include visual, olfactory, tactile, taste, and, of course, auditory systems.
The “physics” part of “psychophysics” refers to characteristics of the physical stimulus, the “psycho” part to the psychological responses to the composition and change of the physical stimulus. On the psychological side of the term, there has been a tendency to equate “sensation” with peripheral (outside the central nervous system) and perhaps brainstem responses to a physical stimulus. The term “perception” has been reserved for the more elaborated psychological processes taking place at subcortical and cortical levels.
Why is a chapter on auditory psychophysics necessary, separate from a chapter on characteristics of acoustic signals (Chapter 7)? If acoustic signals were simply conveyed by the peripheral and central structures of the auditory system in their physical form of frequencies, amplitudes, durations, and other features, knowledge of acoustic signal characteristics would be sufficient. However, the auditory system, like other sensory-perceptual systems, is not passive; it not only transmits, but also transduces, transforms, and codes signal characteristics to produce psychological responses to sound that are not mere reflections of the signal. The pinnacle of these complicated processes is the way in which speech acoustic signals are transformed to speech and language perception. In the current chapter, only non-speech auditory psychophysics is discussed (Chapter 12 focuses on the perception of speech).
The topics covered in this chapter include the psychophysics of thresholds and loudness, pitch, timbre, time, and sound source localization. The presentation is tailored to students with no background in auditory psychophysics, with the assumption that the previous chapter on auditory anatomy and physiology has been studied carefully. An excellent source of in-depth information on auditory psychophysics is Moore (2013).
Loudness is a psychological phenomenon that is related to the physical magnitude of sound energy but is not the same as the physical quantity. Whereas the magnitude of sound energy can be measured by an instrument, such as a sound level meter, loudness cannot. Loudness is a sensation or perception that can only be measured by listener responses. Students are encouraged to read the Appendix to Chapter 7, which explains the use of the term “decibels” (dB) to express the sound pressure level (SPL) of a sound wave. SPL is a measure of the physical magnitude of sound.
The determination of the lowest sound energy that humans can detect has a long history, extending at least back to 1933 when Sivian and White reported results of careful measurements on 14 participants with normal hearing (see Sivian & White’s review of work prior to their publication, pp. 299–304 and pp. 308–312). Sivian and White (1933) used sinusoids of relatively long duration (at least 1 second) to obtain data on a quantity they called minimal audible field (MAF). By this they meant the minimal sound intensity audible to listeners with normal hearing as they sat 1 meter in front of a loudspeaker that delivered the tones. In other words, they wanted to determine the threshold for sound detection in a sound field, for multiple frequencies.
Determination of minimal audible field, or its nonidentical twin minimal audible pressure (where the sound is delivered by a headphone covering the pinna or via a tube in the ear canal), is a highly technical business. Many variables affect the measurement of thresholds, ranging from the equipment used to generate and deliver tones to the individual characteristics of a person’s head—in particular, the pinna. Another variable is the room in which a participant is tested. For the six lowest frequencies they tested, Sivian and White used sinusoids to establish minimal audible field. In the case of frequencies at 1100 Hz and higher, the signal used to establish minimal audible field was a “warble tone.” These are nearly sinusoidal signals whose frequency varies subtly around a single frequency at a constant rate (e.g., five warbles per second). A warble tone is a bit like a sinusoid enhanced with a small degree of vibrato. Why use warble tones at the higher test frequencies? First, Sivian and White thought the warbles were more interesting to listeners and less likely to contribute to the substantial fatigue so common in psychoacoustic experiments; listening carefully for a long time is hard work. Second, sinusoids presented in more or less closed rooms result in standing pressure waves where multiple peaks and valleys of pressure are distributed across the listening space in a “frozen” pattern. The sound waves reflect off walls in a way that places the peaks and valleys of pressures in precisely the same location as the original wave. This results in “standing” waves of high and low pressures across the room. For higher frequencies, the precise placement of a listener’s head might locate the test ear at a peak or valley of the standing pressure wave, which affects the threshold estimate. Warbling a sinusoid does not allow standing waves to be “caged” in the test room.
A sound field is a listening environment in which the ears are not covered by earphones. For this experiment, Sivian and White (1933) placed their listeners in a heavily sound-absorbent enclosure to reduce extraneous environmental noise as much as possible. The purpose of this was to establish minimal detectable sound energy as if the limits of hearing detection were being estimated. The collection of auditory threshold data under nearly optimal listening conditions is, in fact, a standard approach to one aspect of the clinical assessment of hearing capabilities.
Listeners signaled with a button when they heard a tone. The experimenter controlled the frequency, intensity, and duration of the tones. Thresholds were obtained for 21 discrete frequencies between 100 Hz and 15,000 Hz. At each frequency, the threshold was determined by first presenting the signal at a level (intensity) that was clearly audible; this established the pattern of the listener pressing the button when he or she heard the tone. The experimenter then lowered the intensity in systematic steps, in separate presentations, where each tone had a duration of roughly two seconds. The intensity was decreased until the listener stopped responding, apparently because the intensity was insufficient to elicit a response. The tone intensity was then raised slightly until a response was again obtained, lowered slightly to eliminate the response, raised again to elicit a response, and so on, until the experimenter recorded the signal intensity corresponding to the “just audible” responses. The intensity at this “just audible” level was recorded as the threshold for that frequency.
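The descend-then-bracket procedure described above can be sketched in code. This is only an illustration, not Sivian and White's actual protocol: the starting level, the step sizes, the number of ascending runs, the averaging of the "just audible" levels, and the simulated listener function `hears` are all assumptions made for the example.

```python
def find_threshold(hears, start_db=60.0, big_step=5.0, small_step=1.0,
                   ascending_runs=4, floor_db=-20.0):
    """Estimate a detection threshold (hypothetical procedure).

    hears: function taking a level in dB and returning True if the
           listener presses the button (an assumed, simulated listener).
    """
    level = start_db
    # Phase 1: start clearly audible, lower the level in large steps
    # until the listener stops responding.
    while hears(level) and level > floor_db:
        level -= big_step

    # Phase 2: raise the level slightly until a response returns, note
    # that "just audible" level, lower it again, and repeat.
    just_audible = []
    while len(just_audible) < ascending_runs:
        while not hears(level):
            level += small_step
        just_audible.append(level)
        while hears(level) and level > floor_db:
            level -= small_step

    # Assumption: the threshold is taken as the mean just-audible level.
    return sum(just_audible) / len(just_audible)


# Simulated listener whose true threshold is 12.3 dB (made-up value).
estimate = find_threshold(lambda level_db: level_db >= 12.3)
print(estimate)
```

With 1 dB steps the estimate can only land on the step grid, so the recorded value slightly overestimates the simulated listener's true threshold; finer steps would narrow that gap at the cost of more presentations.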
The minimal audible field curve in Figure 14–1, labeled “Threshold”, is not taken directly from Sivian and White (1933), but rather is a composite of their results and those of other scientists (Killion, 1978). The x-axis is frequency, scaled logarithmically (base 10) to reflect the large range of tested frequencies on an axis of reasonable length, and the y-axis is SPL in dB with a reference pressure of 20 μPa (micropascal, or the smallest average pressure that humans can hear in their most sensitive frequency range; see Appendix to Chapter 7). The “0 dB” level on the y-axis does not indicate the absence of sound energy, but rather a measured sound pressure equivalent to the standard reference pressure (20 μPa). If $\mathrm{SPL} = 20 \log_{10}(p_1/p_0)$, where $p_1$ = the measured pressure and $p_0$ = the reference pressure, then when $p_1 = p_0$ the pressure ratio is 1 (log of 1 = 0), which gives “0 dB.” Thus, it is possible for a threshold to be negative when the sound pressure measured at threshold is less than the reference pressure. In clinical practice, negative thresholds are found frequently among young adults with normal hearing. The threshold curve shown in Figure 14–1 is the binaural, sound-field sensitivity curve of the normal human auditory system as a function of frequency.
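The decibel arithmetic in the preceding paragraph can be checked with a few lines of code. The sketch below assumes pressures are given in pascals; the function name `spl_db` and the sample pressures are chosen only for illustration.

```python
import math

P_REF = 20e-6  # standard reference pressure: 20 micropascals, in pascals


def spl_db(pressure_pa, p_ref=P_REF):
    """Sound pressure level in dB re p_ref: 20 * log10(p1 / p0)."""
    if pressure_pa <= 0:
        raise ValueError("pressure must be positive")
    return 20.0 * math.log10(pressure_pa / p_ref)


print(spl_db(20e-6))   # measured pressure equals the reference: 0 dB
print(spl_db(10e-6))   # half the reference pressure: about -6 dB
print(spl_db(200e-6))  # ten times the reference pressure: about 20 dB
```

The second call shows why a threshold can be negative: any pressure smaller than 20 μPa gives a ratio below 1, and the logarithm of a number below 1 is negative.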
Figure 14–1. Minimal audible field (lower curve) as determined in Sivian and White’s (1933) experiment and similar experiments performed by other scientists. The upper curve (labeled ~90 dB above threshold) is discussed in the text, as is the term “phon.”
Two features of the minimal audible field (threshold) curve in Figure 14–1 are critical to the present chapter. First, the SPL at auditory threshold varies across signal frequency. Specifically, a much greater SPL is required to reach threshold at low and high frequencies compared with the SPL at thresholds for midrange frequencies (roughly 1000–5000 Hz). For example, threshold at 100 Hz requires a sound pressure of approximately 28 dB re: 20 μPa (reference pressure), whereas the sound pressure required for threshold at 3000 Hz is just a little below the reference pressure of 20 μPa (that is, the threshold at 3000 Hz is just below 0 dB).
The fact that the SPL required to reach auditory threshold varies across frequencies has direct application to clinical evaluation of hearing. An audiometer is the instrument used in audiology clinics to establish sensitivity to sinusoids. The audiometer has the same calibrated steps of SPL (usually ranging from 0 to 110 dB) for each frequency. When an examiner delivers a sinusoid (pure tone) to a listener at a dial reading of 0 dB, the SPL generated by the audiometer corresponds to the average threshold level at the selected frequency, as determined from a large population of normal-hearing young adults. This means that the 0 dB setting on an audiometer is associated with varying output SPL for different frequencies. The varying SPLs at 0 dB, across frequencies, are very similar to the minimal audible field thresholds shown in Figure 14–1. This is why the dB levels on the intensity controls of an audiometer, and the levels shown on an audiogram (a record of a patient’s thresholds as a function of frequency) are referred to as dB hearing level (HL). Two frequencies with a common HL setting (same number on the audiometer dial for level; for example, 25 dB HL) deliver signals with different SPLs. These different SPLs are most prominent across frequency when sound levels are near threshold (0 dB HL). As the sound level increases, the differences across frequency become less dramatic (see below, discussion of the 90 dB curve in Figure 14–1).
The second important aspect of the threshold curves in Figure 14–1 is that the most sensitive region for hearing is between 1000 and 5000 Hz (Moore, 2013). The resonant frequencies of the external auditory meatus (~3300 Hz) and the ossicular chain (~1400 Hz) make a significant contribution to this frequency region of maximum sensitivity (Chapter 13).
Is it possible to determine just how much the conductive resonances (of the external auditory meatus and the ossicles) contribute to auditory thresholds? The answer is “yes.” In clinical testing, pure-tone thresholds are most frequently determined under headphones or with insert loudspeakers (where the sound output is delivered to the tympanic membrane by a narrow tube extending into the external auditory meatus from a firm seal at the entrance to the canal). Whether by earphone or tube, the open end of the external auditory meatus is closed and the resonant characteristics of the meatus are modified. The 1/4-wavelength rule no longer applies to the resonance because the tube is now closed at both ends. In theory, the threshold around 3500 Hz under an earphone should be higher (less sensitive) compared with the 3500 Hz threshold in a sound field. This is precisely what has been found: in the region between 3000 and 4000 Hz minimal audible field thresholds are 10 to 15 dB lower (more sensitive) than thresholds determined under earphones or with a probe tube.
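The effect of closing the canal on its resonance can be illustrated with the standard tube formulas. The sketch below treats the external auditory meatus as a uniform tube, which is a simplification; the canal length (2.6 cm) and the speed of sound (343 m/s) are assumed values chosen for illustration, not measurements from this chapter.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly room temperature
CANAL_LENGTH_M = 0.026      # assumed illustrative canal length (2.6 cm)


def quarter_wave_resonance_hz(length_m, c=SPEED_OF_SOUND_M_S):
    """Lowest resonance of a tube open at one end and closed at the other."""
    return c / (4.0 * length_m)


def half_wave_resonance_hz(length_m, c=SPEED_OF_SOUND_M_S):
    """Lowest resonance of a tube closed at both ends."""
    return c / (2.0 * length_m)


print(round(quarter_wave_resonance_hz(CANAL_LENGTH_M)))  # 3298 (open canal)
print(round(half_wave_resonance_hz(CANAL_LENGTH_M)))     # 6596 (occluded canal)
```

Under these assumptions the open canal resonates near 3300 Hz, while the same tube closed at both ends has its lowest resonance an octave higher, so the boost near 3000 to 4000 Hz is no longer available.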
Thresholds obtained under headphones or with insert tubes are higher (worse) at virtually all frequencies when compared with thresholds obtained in the sound field, but the differences are most notable in the region centered around the resonance of the external auditory meatus. Even when thresholds are obtained in the sound field (no headphones or insert loudspeakers) with the external auditory meatus of only one ear occluded with cotton, the minimal audible field threshold is worse than when both ear canals are open.
A general conclusion concerning auditory thresholds is that many factors influence the threshold measured at any frequency. Courses on audiometry provide additional information on how, in the clinical setting, these factors are controlled. The information on auditory thresholds provided above is necessary for discussion of the more advanced psychoacoustic phenomena presented below.
Another way to think about the threshold curve in Figure 14–1 is as an equal-loudness contour. The contour shows the SPLs required to just reach audibility across different frequencies, so it follows (as an example) that equal loudness of tones at 3000 Hz and 100 Hz is achieved at threshold when the intensity of the 100 Hz tone is roughly 28 dB greater than that of the 3000 Hz tone. This underscores an interesting psychophysical fact, that equal sound energy does not imply equal loudness. Note the distinction between the terms “sound energy” and “loudness.” “Sound energy” is a physical quantity, measured for magnitude in dB by an instrument such as a sound level meter. “Loudness” is a psychological quantity, reported by a listener in response to a physical stimulus or inferred from a loudness match between two tones. Sound energy and loudness are not the same.
Details of equal-loudness contours differ across sound energy levels. The contour labeled “~90 dB above threshold” in Figure 14–1 shows the SPLs across frequency that result in equal loudness when the signal level is roughly 90 dB above threshold at 1000 Hz. This curve was generated by using the 90 dB SPL signal at 1000 Hz as a standard and determining the SPL required at other frequencies to match the loudness of the 1000 Hz tone. The use of 1000 Hz as a standard for loudness matches between two frequencies, at any level of presentation, is common.
The 90 dB equal loudness contour shows that there is equal loudness of the 100 Hz and 3000 Hz tones when the former is roughly 13 dB greater in intensity (or sound energy, or level) than the intensity of the 3000 Hz tone. As noted earlier, at threshold the sound level difference for equal loudness of the two frequencies is about 28 dB, which is 15 dB greater than the difference required for equal loudness of the same frequencies on the 90 dB contour. An eyeball comparison of the “~90 dB above threshold” and “Threshold” contours shows the higher-intensity contour to be relatively compressed across frequency, compared with the threshold contour. In general, SPL differences for equal loudness between frequencies become progressively diminished (the contour becomes flatter) as the overall level of presentation is increased.
How is the perceptual phenomenon of loudness related to the physical energy of a sound wave? This problem has occupied hearing scientists for a long time, and the history of research on this question points to the difficulty of answering it in a simple way. In this section a summary of loudness perception of sinusoids is provided; Marks and Florentine (2011) published a detailed, technical account of the history and contemporary status of research on loudness.
Loudness, in lay terms, is the perceived magnitude of a sound. People with normal hearing usually can agree on the difference between a soft and a loud sound—the loud sound has greater magnitude than the soft sound. These same people can also probably agree on the difference between a very soft and a soft sound, or between a loud and an extremely loud sound, using the concept of different magnitudes.
Scientists prefer to refine these lay comparisons by developing a simple formula that relates changes in perceived loudness of sinusoids to changes in the amount of energy in an acoustic signal. Measurement of the physical energy in an acoustic signal is not controversial; a proper instrument (such as a sound level meter) provides a value in decibels that, with careful calibration and the use of a standard reference pressure (20 μPa), gives reliable, reproducible results (because the standard reference pressure of 20 μPa is almost always used for sound levels in air, dB SPL values are given without the reference pressure). In contrast, the measurement of loudness, in which numbers are assigned to different SPLs or loudness matches are used to generate scales for the perception of sound magnitude, is complicated and controversial. Two well-established loudness scales, phons and sones, are described here.
A loudness scale is inherently subjective; there is no definitive way to know if the numbers generated by a listener reflect her perception of the loudness of a sound. Moreover, perceived loudness may be affected by factors other than variation in SPL. For example, loudness perception may be influenced by a sound’s frequency, its spectral complexity (the pattern of amplitudes as a function of frequency in a multifrequency acoustic signal), and even the way in which a series of sounds is presented to the listener (Marks & Florentine, 2011). Efforts to investigate how loudness varies with SPL have simplified the problem to understand the relation of SPL to loudness.
An early approach to the scaling of loudness used a physical standard to serve as a reference point for perceptual judgments. The concept of “phons” was used to develop one type of loudness scale. In a sense, phons have been discussed above, but the term and its definition have not. A phon is a unit of loudness level that is tied to a specific frequency presented at a particular SPL. For example, a 1000 Hz tone presented at 40 dB SPL is said to have a loudness level of 40 phons. Similarly, a 1000 Hz tone presented at 60 dB SPL is said to have a loudness level of 60 phons. Recall the minimal audible field (threshold) and 90 dB loudness curves shown in Figure 14–1. Both are “equal loudness contours.” As noted above, the substantial variation of SPL at threshold, across frequency, shows that the same loudness requires a different SPL depending on the signal frequency. For the ~90 dB equal loudness contour, it appears that a 3000 Hz sinusoid presented at 80 dB SPL sounds equally loud as a 1000 Hz sinusoid presented at 90 dB SPL. The 1000 Hz sinusoid at 90 dB SPL has a loudness level of 90 phons (following the logic above), as does the 3000 Hz sinusoid presented at 80 dB SPL.
Another way to think about equal loudness contours and phons as a loudness level is to imagine an experiment in which a 3000 Hz tone is presented at 80 dB SPL and the listener is asked to adjust the loudness of a reference tone of 1000 Hz until its loudness matches the 3000 Hz tone. The dB SPL for the 1000 Hz sinusoid selected by the listener as the loudness match for the 3000 Hz tone is the loudness level in phons for the 3000 Hz signal. In this example, a typical listener with normal hearing adjusts the 1000 Hz sinusoid to 90 dB SPL to match the loudness of the 3000 Hz sinusoid presented at 80 dB SPL. Using the 1000 Hz tone as the reference, the loudness level of the 3000 Hz sinusoid is 90 phons. Note that phons are expressed as a loudness level (not as loudness), in dB units. A set of equal loudness contours in 10 dB increments from threshold (0 dB) to 120 dB at 1000 Hz is shown in Figure 14–2 (adapted from Fletcher & Munson, 1933).
The phon scale is not a “direct” scale of loudness, because the loudness of a tone is dependent on a loudness match to another tone. Nor are phons just like dB on an intensity level (IL) or SPL scale: they are expressed as loudness levels. Phons are a perceptual phenomenon, whereas IL or SPL are physical properties. Phons played a role in the development of a more direct scaling of loudness, described in the next section.
An influential loudness scale, called the sone scale, was developed by the famous psychophysicist S. S. Stevens (1906–1973), who based his work on research from the early twentieth century. A sone is a unit of loudness. The foundation of this scale was the simple idea that listeners were likely to judge the relative loudness of two sinusoids as a ratio, as in the simple case of perceiving one sinusoid to be twice as loud as another sinusoid (or one sinusoid to be half as loud as another sinusoid). Consider an experiment in which a reference sinusoid at a known SPL is presented to a listener and assigned a number such as “1” or “10.” Next, tones of the same frequency but having different SPLs than the reference tone are presented to the listener, who is asked to assign numbers to them as ratios relative to the loudness of the reference. Under these conditions, the SPL that produces a sound perceived to be twice as loud as the reference tone should be assigned a number twice that of the reference number of “1” or “10” (whichever one is chosen for the experiment—any number can be used as the reference). An extension of this simple idea was the presumed ability of listeners to assign numbers to a tone presented at many different SPLs, and to assign the numbers as ratios to the standard stimulus, whatever the ratio may be (i.e., 1.3:1, 1.8:1, 2.0:1, and so forth). The SPLs of the presented stimuli were known, of course, so combining the known physical (SPL) variation of the tone with the numbers assigned to them allowed Stevens to construct a mathematical function relating perceived loudness (according to the ratio scaling) to SPL.
Figure 14–2. Equal loudness contours across frequency (in kHz) according to presentation level in dB. The equal loudness contours are referenced to the SPL at 1000 Hz (vertical red line) and are presented in 10 dB steps from threshold (dashed line) to 120 dB.
Why was Stevens interested in doing such experiments? One compelling reason was his knowledge, from previous work and laboratory experience, that ratios of the perceptual experience of loudness differed from ratios along the SPL scale. As described in an influential paper, Stevens (1955) knew that if a “standard” 1000 Hz sinusoidal tone was presented at 50 dB SPL, listeners (on average) would report a comparison tone presented at 58 to 60 dB SPL as twice as loud as the standard; a comparison tone presented at 38 to 40 dB SPL would be reported as half as loud as the standard tone. These 2:1 and 1:2 loudness ratios did not correspond to 2:1 and 1:2 ratios of sound energy (i.e., a 100 dB SPL sinusoidal tone was much more than twice the loudness of a 50 dB SPL tone). The SPL and loudness scales were not interchangeable.
In many of his experiments, Stevens arbitrarily assigned a value of 1 sone to the loudness of a 1000 Hz tone presented to a normal-hearing listener at 40 dB SPL re: 20 μPa. Additional experiments were conducted with standards at different SPLs (e.g., 20 dB SPL). Stevens collected number assignments of the loudness of tones of different presentation levels relative to the loudness of the standards. He then plotted and “fit” the data, which means he applied mathematical functions to the bivariate (two-variable) data that provided the best fit. When the number estimates of loudness (the sones) were plotted against SPL, the function was curved (Figure 14–3, left panel). When the sone scale was converted to a logarithmic (base 10) scale, the log-log plot (dB is already a log scale; see Appendix on decibels in Chapter 7) revealed a straight line relating perceived loudness to SPL. This straight line suggested that the perception of loudness was related to dB SPL by a constant proportion between any two stimuli along the SPL range. Figure 14–3 shows some sample data in a sone-SPL plot (left) and with sones transformed to logs (right panel).
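The two anchor facts given so far (1 sone at 40 dB SPL for a 1000 Hz tone, and a doubling of loudness for roughly every 10 dB) can be combined into a simple conversion. The sketch below uses a widely quoted approximation, sones = 2^((level − 40)/10); it is an idealization of the midrange behavior, not Stevens' fitted data, and it is assumed here to apply only to a 1000 Hz tone at moderate levels (about 40 dB SPL and above).

```python
def sones_from_level(level_db_spl_1khz):
    """Approximate loudness in sones for a 1000 Hz tone.

    Assumes 1 sone at 40 dB SPL and a doubling of loudness per 10 dB;
    an idealized midrange rule, not valid near threshold.
    """
    return 2.0 ** ((level_db_spl_1khz - 40.0) / 10.0)


for level in (40, 50, 60, 70):
    print(level, sones_from_level(level))
# 40 -> 1.0, 50 -> 2.0, 60 -> 4.0, 70 -> 8.0
```

Plotted against level, these values trace the curved function described above; taking the logarithm of the sone values turns the curve into a straight line.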
Figure 14–4, adapted from Stevens (1955, Figure 4), and Buus and Florentine (2001), shows an example of data obtained with an 80 dB SPL standard. The data shown are based on magnitude estimates—the assignment of numbers corresponding to perceived loudness, relative to the 80 dB SPL standard—plotted as a function of SPL for a 1000 Hz tone. Listeners were told to assign a value of “10” to the standard and to scale loudness of all stimuli relative to this standard value. Thus, a tone that was perceived as half the loudness of the standard was to be assigned a value of “5,” a tone twice as loud as the standard a value of “20,” and so forth. For this plot the magnitude estimates (sones) have been log-transformed, as in the right plot in Figure 14–3. The straight black line, the mathematical best-fit function, appears to fit the 10 plotted points quite well and is consistent with a power function relating perceived loudness to SPL. Functions such as the one in Figure 14–4 are the basis of Stevens’ famous and very influential power law. Stevens presented a simple equation to summarize the law (here presented in terms of sound pressure level, not sound intensity):
Figure 14–3. Sones plotted as a function of sound pressure level, with sones on a linear scale (left plot) and sones converted to logarithms (right plot). The reference tone is shown by the arrow in the left-hand plot.
Figure 14–4. Sones plotted on a logarithmic scale as a function of sound pressure level for an 80 dB SPL, 1000 Hz tone standard. The black line in midrange SPLs is based on Stevens (1955), while the red lines near the threshold and at very high SPLs are based on Buus and Florentine (2001).
$$L = k \cdot \mathrm{SPL}^{0.6}$$
where L = perceived loudness, SPL = the sound pressure of the tone (the linear pressure that a dB SPL value expresses on a logarithmic scale; it is this pressure, not the dB number, that is raised to the power), and “k” is a constant determined by the units used for loudness estimation and other experimental variables. For the remaining discussion “k” is ignored, so for the present purpose $L = \mathrm{SPL}^{0.6}$.
The exponent of 0.6 in this equation, the power to which sound pressure is raised to obtain perceived loudness, is the slope of the log-log function relating units of perceived loudness to units of sound pressure. The slope of the black line is less than “1”, as indicated by the exponent, so loudness ratios do not track sound pressure ratios one for one: doubling the sound pressure (an increase of about 6 dB) yields roughly a 1.5-fold increase in loudness, not “twice the loudness.” As noted above, along the middle part of the SPL range, an increase of roughly 10 dB (about a 3.2-fold increase in sound pressure) results in a twofold increase in loudness. Nor does a twofold increase in loudness require anything close to a twofold increase of the dB SPL value; doubling the dB value (for example, from 50 to 100 dB SPL) produces a huge increase in loudness, far more than “twice the loudness.” Stevens’ power law was based on carefully designed research and exerted a monumental influence on hearing science and audiology. An important conclusion reached from this work is that perceived loudness does not “grow” in an additive way with SPL.
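A short calculation shows how the 0.6 exponent produces the “10 dB for twice the loudness” rule. The sketch assumes that the quantity raised to the power is linear sound pressure (the usual way Stevens' law is stated for pressure), so a level change in dB is first converted to a pressure ratio; the function name `loudness_ratio` is hypothetical.

```python
def loudness_ratio(delta_db, exponent=0.6):
    """Predicted loudness ratio for a level change of delta_db decibels.

    Assumes loudness = k * pressure ** exponent, so k cancels in the ratio.
    """
    pressure_ratio = 10.0 ** (delta_db / 20.0)  # dB change -> pressure ratio
    return pressure_ratio ** exponent


print(loudness_ratio(10))  # about 2.0: 10 dB roughly doubles loudness
print(loudness_ratio(6))   # about 1.5: doubling pressure falls short of doubling loudness
print(loudness_ratio(50))  # about 31.6: 50 dB SPL versus 100 dB SPL
```

The last line corresponds to the 50 dB versus 100 dB SPL comparison made earlier: under this idealized law the 100 dB tone is predicted to be roughly thirty times as loud, not twice as loud.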
There are questions and concerns with Stevens’ power function, ranging from the degree to which the fitted log-log function reflects loudness judgments of individual subjects (versus the “group” data on which the functions were based), the extent to which the functions are determined by the specific methods used to elicit loudness judgments, and of course the ability to generalize from loudness judgment of sinusoids to more complex sounds such as noise, multitone acoustic events, and even speech signals. Clearly, loudness functions change depending on these factors.
Recent research shows that even for simple tones Stevens’ power law is not precisely correct (Florentine & Epstein, 2006). Across the midrange of sound pressure levels (e.g., 20–70 dB) the function resembles the one described by Stevens, with perhaps a slightly shallower slope (i.e., a smaller exponent). Close to threshold, however, the log-log function has a much steeper slope compared with midrange SPLs, and at very high SPLs the function is also steeper but not to the same degree as the function at threshold (Buus & Florentine, 2001). This means that near threshold, small SPL differences yield more rapid (greater) increases in loudness than the same small differences in the 20 to 70 dB range (the same for the SPL-loudness relationship at very high SPLs). The fact that the “growth rate” of loudness near threshold and at very high SPLs is different from loudness growth across midrange SPLs suggests that for the entire range of SPL, a more complicated formula than Stevens’ power law is required to relate SPL to perceived loudness of tones. The short red lines in Figure 14–4 show approximate slopes of the loudness-SPL function close to threshold (lower left of plot) and at very high SPLs (upper right of plot).
Complex sounds, defined in Chapter 7, have energy at more than one frequency, and typically at many frequencies. Although pure-tone (single-frequency) signals are used extensively in auditory research and as an important part of the clinical evaluation of hearing, sounds in the world are complex. In this brief introduction to the loudness of complex sounds, the concept of auditory filters is introduced followed by comments on the loudness of two different types of complex sound. Readers interested in a more advanced understanding of this topic are referred to Florentine, Buus, and Bonding (1978) and Moore (2013).
A general discussion of filters is presented in Chapter 7; here they are summarized and related to auditory psychophysics. A filter is any device or computer program that permits some objects (which may include vibrating air molecules or time-varying voltages) to pass through it while blocking others. For example, the filter in your home furnace permits passage of air molecules but blocks passage of larger particles, such as dust and plant pollen. In acoustics a filter is a physical device, software code, or anatomical/physiological process that allows certain frequencies to pass through while blocking the passage of other frequencies. Acoustic filters can be configured in many ways. Some may allow low frequencies to pass through while blocking high frequencies, whereas others do the opposite. Or, the acoustic filter can be one that allows a range, or “band,” of frequencies to pass while blocking all frequencies below and above this band. Many scientists believe that the peripheral auditory system processes the human range of audible frequencies through a series of bandpass filters arranged across the basilar membrane from base to apex.
The concept of auditory bandpass filters was introduced in the preceding chapter, although the term “auditory filter” was not used. When neural tuning curves were described and illustrated (Figure 13–23), a neuron attached to a hair cell was said to have a characteristic frequency—the one with the lowest firing threshold. Frequencies above or below the characteristic frequency required more sound energy to make the neuron fire. As frequencies moved farther away from the characteristic frequency, the neuron-firing threshold became increasingly higher, until a frequency was reached which failed to elicit an action potential from the neuron. The neural tuning curve shown in Figure 13–23 is basically a bandpass filter turned upside down, with the “best” passage of energy at the characteristic frequency, and increasingly reduced passage of energy—for the neuron under study—as frequency becomes more distant from the characteristic frequency.
Neural tuning curves are derived from animal experiments. Very clever, non-invasive experiments in humans have demonstrated the same types of auditory filters by using auditory masking. Imagine an experiment in which a threshold is determined in a human for a tone of 2000 Hz. This threshold is obtained in the standard way, which is to say that the listener only hears the test tone; otherwise the listening channel is quiet. After the threshold is determined, a noise signal is played simultaneously with the 2000 Hz tone. The noise signal is composed of many frequencies which are not related harmonically and have time-varying amplitudes and phases. The noise is designed such that the average energy in a small range of frequencies is equivalent to the average energy in any other similarly sized range. For example, if a noise signal includes frequencies from 1800 to 2200 Hz, the average energy within the frequency range 1800 to 1805 Hz is equivalent to the energy within the frequency range 2100 to 2105 Hz (or 2000–2005 Hz, 1900–1905 Hz, and so forth). This description is consistent with the discussion of white noise presented in Chapter 7: the average energy is the same within any equally wide range of frequencies, across the entire frequency band of the noise. The important difference in the current case is the band-limited nature of the 1800 to 2200 Hz noise; white noise typically has (in theory) an infinitely wide bandwidth. Think of this 400 Hz noise band, ranging from 1800 to 2200 Hz, as derived from a white noise with energy filtered out below 1800 Hz and above 2200 Hz.
Now imagine an even narrower band of noise, one centered at 2000 Hz and having a width of 50 Hz. The 50 Hz band ranges from 1975 to 2025 Hz. This noise band is presented simultaneously with the 2000 Hz pure tone, and an experimenter determines the noise presentation level that just makes the 2000 Hz tone inaudible. In this example, the 2000 Hz tone is masked by the narrowband noise, requiring additional energy in the 2000 Hz tone, relative to its energy at threshold in quiet, to make it just audible again. Assume the level of the tone must be raised 1 dB relative to the quiet (unmasked) threshold to make it audible in the presence of this noise signal. The new threshold is called a masked threshold. The narrowband noise signal presented simultaneously with the 2000 Hz tone makes the previously audible tone inaudible by “hiding” it until its level is increased to make it audible again.
The interesting part of this experiment comes with the next step, when the bandwidth of the noise signal is increased (made wider). Because the average energy is the same and constant for all frequencies within the noise signal and the overall noise level is the sum of all these individual component energies, widening the bandwidth increases the overall level of the noise. If the noise signal is widened to 100 Hz, still centered at the 2000 Hz frequency (the noise extends from 1950–2050 Hz), the auditory system is stimulated by greater noise energy than it was by the 50 Hz wide noise. The tone is once again masked by the wider noise bandwidth. To make the tone audible again, its level must be increased, perhaps by another 1 dB. This new masked threshold is roughly 2 dB greater than the threshold in quiet. As the width of the noise band is progressively increased, so is the SPL required to reach threshold for tone detection. This makes sense because a greater bandwidth is associated with a greater noise level.
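The link between noise bandwidth and overall noise level can be made concrete with a little arithmetic. The sketch below assumes an idealized noise with the same power in every 1 Hz slice, so that total power is proportional to bandwidth. The function name and the spectrum level of 20 dB are illustrative choices, not values taken from the experiment described here.

```python
import math

def overall_noise_level_db(spectrum_level_db, bandwidth_hz):
    """Overall level of a flat-spectrum noise band.

    spectrum_level_db is the level in a 1 Hz wide slice of the noise.
    Total power grows in proportion to bandwidth, which adds
    10*log10(bandwidth) dB to the spectrum level.
    """
    return spectrum_level_db + 10 * math.log10(bandwidth_hz)

# Hypothetical spectrum level, chosen only for illustration.
SPECTRUM_LEVEL_DB = 20.0

for bandwidth in (50, 100, 200, 400):
    level = overall_noise_level_db(SPECTRUM_LEVEL_DB, bandwidth)
    print(f"{bandwidth:4d} Hz wide band: overall level {level:.1f} dB")
```

Under this assumption, each doubling of bandwidth raises the overall noise level by about 3 dB, regardless of the spectrum level chosen.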
When the noise bandwidth reaches a certain value the threshold does not increase even though the bandwidth is increased. In fact, even further increases in noise bandwidth do not affect the masked threshold. In one such experiment, the increase in threshold for a 2000 Hz tone was roughly 4 dB for increases in noise bandwidth up to 400 Hz, centered around the 2000 Hz tone. Noise bandwidths wider than 400 Hz failed to produce further increases in the masked threshold of the 2000 Hz tone (Schooneveldt & Moore, 1989).
The interpretation of this result, and results from many other similar experiments, is that the tone threshold does not increase past a certain noise bandwidth, even though the level of noise continues to increase, because the bandwidth of the noise exceeds the frequency range of an auditory filter. The auditory filter around (in this case) 2000 Hz has a frequency range that rejects energy relatively distant from its center frequency. The center frequency is analogous to the “characteristic frequency” of the neural tuning curves discussed in Chapter 13. Schooneveldt and Moore (1989) estimated the width of the filter centered around 2000 Hz to be about 400 Hz. When the noise bandwidth was increased beyond 400 Hz, the energy of the whole noise band affected the motions of the basilar membrane and the associated neural responses, but did not affect the output of the auditory filter with a center frequency of 2000 Hz.
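The plateau in this interpretation can be sketched with a deliberately simplified model. The code below treats the auditory filter as a perfectly rectangular passband 400 Hz wide, centered on the same frequency as the noise, and assumes a flat noise spectrum. Real auditory filters have sloping skirts, so the rectangular shape, the function name, and the spectrum level of 20 dB are all assumptions made for illustration.

```python
import math

def masker_level_in_filter_db(spectrum_level_db, noise_bandwidth_hz,
                              filter_bandwidth_hz=400.0):
    """Level of the noise that falls inside an idealized rectangular filter.

    Noise and filter are assumed to share the same center frequency, so
    only the narrower of the two bandwidths contributes to the output.
    """
    effective_bandwidth = min(noise_bandwidth_hz, filter_bandwidth_hz)
    return spectrum_level_db + 10 * math.log10(effective_bandwidth)

# Hypothetical spectrum level, chosen only for illustration.
SPECTRUM_LEVEL_DB = 20.0

for noise_bandwidth in (50, 100, 400, 800, 1600):
    level = masker_level_in_filter_db(SPECTRUM_LEVEL_DB, noise_bandwidth)
    print(f"noise {noise_bandwidth:4d} Hz wide: "
          f"level inside the filter {level:.1f} dB")
```

In this sketch the noise level inside the filter grows until the noise is 400 Hz wide and then stays constant, which mirrors the way the masked threshold stops rising once the noise band exceeds the filter width.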
The change in threshold of the 2000 Hz tone with changes in the width of a noise band centered at 2000 Hz is shown schematically in Figure 14–5, panels (a) through (e). Panel (a) shows the threshold in quiet to be roughly 2 dB SPL re: 20 μPa (see Figure 14–1 for minimal audible field at 2000 Hz). In panels (b), (c), and (d) the width of the noise band is increased symmetrically around 2000 Hz, which increases the overall energy of the noise band and causes the threshold of the 2000 Hz tone to increase. When the noise bandwidth is increased past 400 Hz, the threshold of the 2000 Hz tone remains the same, as can be seen by comparing the heights of the vertical red lines in panels (d) and (e). Energy within the 400 Hz band around 2000 Hz seems to be processed “together” (that is, the energy of the target signal and the energy of the masker are combined), whereas energy outside this band has little or no effect on the output of this auditory filter.
Figure 14–5. Schematic illustration of how increases in noise bandwidth around a center frequency result in increasingly higher detection thresholds for a pure tone at the center of the noise band (2000 Hz in this case), but only up to a specific bandwidth. Further increases in the width of the noise band do not cause higher pure tone (sinusoid) detection thresholds. The height of the vertical red line shows the SPL required for detection of the 2000 Hz tone; the black rectangle shows the width of the noise band centered symmetrically around 2000 Hz. A. Threshold of the 2000 Hz tone in quiet. B. Masked threshold for a noise band with a width of 50 Hz. C. Masked threshold for a noise band with a width of 100 Hz. D. Masked threshold for a noise band with a width of 400 Hz. E. Masked threshold for noise bandwidths greater than 400 Hz.
Another way to demonstrate auditory filters in the peripheral auditory system is to obtain psychophysical tuning curves (PTCs) from human listeners. PTCs are obtained using masking of one signal by another but in a way somewhat different from the method described above. In a typical experiment (see Kluk & Moore, 2004 for an example), the signal is a pure tone and the masker is a narrow band of noise. The tone is fixed at a single frequency and a low level of presentation (often 10 dB above the quiet threshold for the tone). The level of a fixed-width, symmetrical narrow band of noise, centered at the frequency of the test tone, is adjusted to determine the threshold of the test tone in noise. Under this condition, the level of the noise band required to make the signal inaudible should be fairly low: the center frequency of the noise is the same as the frequency of the test tone, so the levels of both signals are very close to the quiet threshold.
The threshold of the test tone is then determined, repeatedly, in the presence of the noise band centered at each of many frequencies, below and above the frequency of the test signal. For example, assume the frequency of the test tone is 1000 Hz and the noise band has a width of 160 Hz. When the noise band is centered at the test frequency, the noise covers the frequency range 920 to 1080 Hz (a range of 80 Hz on either side of the 1000 Hz test signal). The noise band is then moved to align its center at 950 Hz (a noise band range of 870–1030 Hz). The question is, what level of the noise masker centered at 950 Hz is required to just make the 1000 Hz test tone inaudible, and then barely audible (that is, to determine the detection of the test tone with the noise band in its new frequency location)? The noise band is moved again, this time to a center frequency of 750 Hz (a noise band range of 670–830 Hz). Now, what is the level of the noise masker required to make the test tone just barely audible? When this experiment is done for many center frequencies of the constant-bandwidth noise band, a threshold curve for the 1000 Hz test tone is determined. As might be expected, the SPL of the noise band required to mask the 1000 Hz test tone increases as the center frequency of the fixed-width noise band moves away from the test tone frequency.
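The band positions in this example follow from simple arithmetic on the center frequency and the fixed bandwidth. The sketch below reproduces the three masker positions mentioned above and reports whether the 1000 Hz test tone lies inside each noise band. The function names are illustrative, and treating the band as having sharp edges is a simplification.

```python
def noise_band_edges(center_hz, width_hz=160.0):
    """Lower and upper edge of a noise band centered at center_hz."""
    half_width = width_hz / 2
    return center_hz - half_width, center_hz + half_width

def tone_inside_band(tone_hz, center_hz, width_hz=160.0):
    """True if the tone frequency falls between the edges of the band."""
    low, high = noise_band_edges(center_hz, width_hz)
    return low <= tone_hz <= high

TEST_TONE_HZ = 1000.0

for center in (1000.0, 950.0, 750.0):
    low, high = noise_band_edges(center)
    inside = tone_inside_band(TEST_TONE_HZ, center)
    print(f"masker centered at {center:.0f} Hz covers "
          f"{low:.0f} to {high:.0f} Hz; test tone inside band: {inside}")
```

With the masker centered at 750 Hz the test tone lies entirely outside the noise band, which is consistent with the higher masker level needed to mask the tone at that position.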
Figure 14–6 shows psychophysical tuning curves (PTCs) from Kluk and Moore (2004) for a 1000 Hz tone obtained from three listeners using the methods described immediately above. The x-axis is the center frequency of a noise band having a width of 160 Hz; the y-axis is the level of noise required to mask the 1000 Hz tone presented 10 dB above threshold. Note that the lowest point on all three graphs (each for a separate listener) is located at 1000 Hz on the x-axis. This follows from the idea discussed above that the lowest level of noise energy required to make the 1000 Hz tone just inaudible is found when the noise band is centered at 1000 Hz. When the noise band is centered at frequencies below or above the 1000 Hz test tone, a higher level is required to make the 1000 Hz test tone just inaudible. For example, the level of the 160 Hz noise band centered at 500 Hz (the second plotted point from the left in each of the three panels) that just masks the 1000 Hz test tone is about 50 dB higher than the test signal. When the noise band is centered at 1000 Hz, the noise level required to mask the tone is nearly 50 dB lower, roughly equivalent or even a bit less than the level of the test tone.
Figure 14–6. Psychophysical tuning curves (PTCs) from three listeners adapted from Kluk and Moore (2004). Each listener’s data are plotted in separate panels (top, middle, bottom). The y-axis shows the masker level required to make the tone just detectable. Center frequency of the noise band is on the x-axis, and the test tone is set at a constant frequency (in this case, 1000 Hz) and level approximately 10 dB above the level at quiet threshold. Even though the width of the masker noise is constant, as the center frequency of the noise band moves away from the sinusoid frequency, there is an increase in the level of the noise required to mask the signal.
The tuning curves described in the text are generated in humans using masked thresholds. Tuning curves can be generated in other ways, as well. In animal models of audition, scientists have inserted microelectrodes into individual nerve fibers attached to inner hair cells, and monitored their rate of firing in response to different input frequencies. Because nerve fibers generate action potentials spontaneously, with no input, a firing-rate threshold must be set to find the lowest level at which the nerve fiber is first responsive to whatever frequency and intensity is used as input. The theory of this experiment is that the characteristic frequency of a nerve fiber (that is, a frequency that can be predicted well, if not exactly, by the position of the nerve-bearing hair cell along the basilar membrane) will evoke the fastest rate of firing above this threshold. As input frequencies move away from the characteristic frequency, firing rates decline until a frequency is reached where the nerve fires at the spontaneous (i.e., no input signal) rate. A graph of firing rate on the y-axis and frequency on the x-axis shows a tuning curve for the selected auditory nerve fiber. Happily, the tuning curves obtained from animals appear to be similar to the ones obtained from humans.
Like neural tuning curves, these PTCs are upside-down representations of auditory filters. The “tip” of these curves occurs where the test tone matches the center frequency of the noise band; this is the frequency at which the filter allows the most energy through. The tails of the curves (the parts of the curves that rise from the characteristic frequencies) show how the filter blocks, to various degrees, the passage of acoustic energy at frequencies different from the “tip” frequencies. The “tip” frequencies are analogous to the “characteristic frequencies” discussed above and in Chapter 13 for neural tuning curves. Note in Figure 14–6 the consistency of the PTCs across the three listeners, including the asymmetry of the curves. At frequencies higher than the “tip” frequency the slope is quite steep, meaning that the amount of noise energy required to mask the test tone increases quickly as the distance from the “tip” frequency increases. In contrast, at frequencies lower than the “tip” frequency the slope is steep in the frequency range closest to the tip and then becomes shallower in the lower frequencies, with only small increases in level of the masking noise as the distance from the “tip” frequency becomes greater. What this means for the auditory filter centered at 1000 Hz is that energy at and immediately around 1000 Hz is passed with greatest amplitude, but energy at frequencies above 1000 Hz is blocked by rapidly increasing amounts as frequency becomes increasingly different from 1000 Hz. For frequencies immediately below 1000 Hz (to about 800 Hz) there is also a rapid decrease in energy passed by the filter, but at even lower frequencies the blocking of energy changes more gradually (see Figure 14–6).
PTCs for four “tip” frequencies are shown for a single listener in Figure 14–7. This plot is adapted from Carney and Nelson (1983), who used a slightly different method from the one presented above, but it serves to make two points. First, for each of the four test tones (500, 1000, 2000, and 4000 Hz, indicated by arrows in Figure 14–7), the shape of the PTCs is generally consistent with the PTCs shown in Figure 14–6 for 1000 Hz. For all four test frequencies, PTC shape is asymmetric in the same way: rapid and substantial blocking of energy above the “tip” frequency, shallower slopes below the “tip.” By inference the shape of the auditory filters is consistent across these frequencies, at least between 500 and 4000 Hz. Second, parts of the PTCs for different frequencies overlap, suggesting that auditory filters are not sequenced from low to high frequency as independent passbands. This is an even more compelling observation given the sparse sampling in Figure 14–7 of all possible PTCs that could be determined, for every frequency, if an experimenter had sufficient time and listeners with enough patience and stamina.
Even though the general shape of PTCs shown in Figure 14–7 is similar across frequencies, the width of the filters is not (note for the PTC at 4000 Hz the greater difference between frequencies just above the tip frequency). Considerable research effort has been devoted to the identification of a mathematical expression that reflects the differing width of these filters as a function of frequency. Evidence from animals and humans shows convincingly that the filter widths increase as frequency increases (a technical review is available in Greenwood, 1990). Filter widths are much greater at, for example, 8000 Hz (width ~600 Hz) compared with 800 Hz (width ~90 Hz). These bands are often referred to as critical bandwidths (the concept is also referred to as equivalent rectangular bands). Although the mathematical function that relates filter bandwidth to frequency is a continuous one, the frequency analysis capabilities of the basilar membrane are often partitioned into 35 critical bands. Each of the 35 bands is thought to correspond to a distance along the basilar membrane of roughly 1 mm (the human basilar membrane is 35 mm in length). The correspondence of a length along the basilar membrane with a critical band is not a linear function: a length of 1 mm of basilar membrane at the base of the cochlea covers a much greater range of frequencies than 1 mm of basilar membrane length at the apex.
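The nonlinear relation between distance along the basilar membrane and frequency can be illustrated with the place-frequency function reviewed by Greenwood (1990). The sketch below uses the constants commonly quoted for the human cochlea (A = 165.4, a = 2.1 with position expressed as a proportion of total length, k = 0.88) together with the 35 mm length given above. The constants and function names should be read as assumptions for illustration, not as the source of the bandwidth values cited in this section.

```python
def greenwood_frequency_hz(distance_from_apex_mm, length_mm=35.0):
    """Approximate characteristic frequency at a place on the basilar membrane.

    Uses the form f = A * (10**(a * x) - k), where x is the distance from
    the apex expressed as a proportion of total membrane length.
    The constants are commonly quoted human values (assumed here).
    """
    A, a, k = 165.4, 2.1, 0.88
    x = distance_from_apex_mm / length_mm
    return A * (10 ** (a * x) - k)

def frequency_span_hz(start_mm, end_mm, length_mm=35.0):
    """Range of frequencies covered between two places on the membrane."""
    low = greenwood_frequency_hz(start_mm, length_mm)
    high = greenwood_frequency_hz(end_mm, length_mm)
    return high - low

apex_span = frequency_span_hz(0.0, 1.0)    # the 1 mm nearest the apex
base_span = frequency_span_hz(34.0, 35.0)  # the 1 mm nearest the base
print(f"1 mm at the apex covers about {apex_span:.0f} Hz")
print(f"1 mm at the base covers about {base_span:.0f} Hz")
```

Under these assumptions, the millimeter nearest the base spans a frequency range many times wider than the millimeter nearest the apex, in line with the nonlinear correspondence described above.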
Figure 14–7. Psychophysical tuning curves (PTCs) for four different “tip” frequencies, adapted from human data reported by Carney and Nelson (1983). These tuning curves are from a single listener and were obtained with a slightly different method from the one used by Kluk and Moore (2004).
Auditory Filters: Do They Matter?
Is the concept of auditory filters relevant to the way we hear real-life signals, or simply a laboratory invention? This is a reasonable question that brings us back to the concept of the spectrum. Recall that a spectrum is a representation of amplitude variation across frequency. A spectrum is computed by Fourier analysis and displayed on a screen as a plot, with frequency (in Hz) on the x-axis and amplitude (in dB) on the y-axis. For many years scientists studied the acoustic characteristics of speech sounds by inspecting such spectra to understand the acoustic cues used by listeners to identify speech sounds. However, several researchers considered the possibility that the filtering characteristics of the auditory system meant that the speech-sound spectrum, transformed into neural signals after analysis by auditory filters, was not the same as the spectrum entering the ear canal. In the case of vowels, for example, the formant structure seen in a so-called linear-frequency spectrum (before auditory filtering) was likely to be substantially different from the auditory representation of the signal. Thus, the term “auditory spectrum” was coined to refer to the spectrum as processed by the auditory system. Scientists still debate whether the differences between linear and auditory spectra are critical to listeners’ perception of vowels. See Syrdal and Gopal (1986); Adank, Smits, and van Hout (2004); and Moore (2008) for reviews of this issue.
What happens when the loudness of a sinusoid (pure tone) is matched to the loudness of a complex acoustic event having frequency components falling outside the critical bandwidth whose center (tip) frequency corresponds to the frequency of the sinusoid? The relevant experiments have been performed and published by several scientists, but the discussion here is confined to work reported by Florentine, Buus, and Bonding (1978). Florentine et al. asked listeners to adjust a control knob until the loudness of two signals was equivalent. The knob controlled the intensity of one of the signals while the other was held constant. We are specifically interested in two of their comparisons, one in which the loudness of a 1000 Hz tone was matched to the loudness of a two-tone (that is, two-sinusoid) complex, the other in which the loudness of a 1000 Hz tone was matched to a noise signal whose bandwidth had lower and upper frequency limits equivalent to the tone frequencies used in the two-tone match.
For the loudness match of the 1000 Hz tone to a two-tone (two-sinusoids) signal separated by 1592 Hz (one tone at 468 Hz, the other at 2060 Hz), the 1000 Hz signal had to be about 3 dB greater than the two-tone signal. Stated otherwise, when the intensities of the single tone and the two-tone signal were equal, the two-tone signal sounded louder than the single tone.
For the match of the 1000 Hz tone to a noise with the same bandwidth of 1592 Hz, the 1000 Hz tone had to be roughly 14 dB greater than the noise to achieve a loudness match. This finding led Florentine et al. (1978) to wonder if the difference between the two-tone and noise match was due to the noise signal’s many frequency components (energy at every frequency from 468 to 2060 Hz) compared with the energy of the two sinusoids of the two-tone signal. They did a further experiment in which they added tones to the two-tone signal and found that additional tones made the tonal complex louder relative to the 1000 Hz sinusoid. Loudness seemed to be determined, at least to a significant degree, by the number of sinusoidal components in a signal. The limit of adding sinusoids to a multitone signal is adding equal energy at every frequency in the band, which is equivalent to creating a bandpass-filtered white noise with a lower cutoff at 468 Hz and an upper cutoff at 2060 Hz.
What is the relationship between the findings of Florentine et al. (1978) and the concept of auditory filters? Hearing scientists estimate the size of the critical band centered around 1000 Hz to be roughly 150 to 160 Hz (see, for example, Figure 1 in Moore & Glasberg, 1983). This means that the target signal (the 1000 Hz sinusoid) is “competing” for loudness with a two-tone signal. The two-tone signal in this case has a frequency range that is far greater than the filter width around 1000 Hz. Because all signals are presented at the same SPL, it makes sense that the two-tone signal causes more widespread motion of the basilar membrane, and therefore more activation of nerve fibers attached to hair cells, compared with the motion and neural activation of the single component at 1000 Hz. It is as if the two-tone signal creates outputs of two auditory filters (where “output” means the energy “delivered” to the auditory nerve fibers), compared with the single output of the 1000 Hz tone. With greater overall neural activation, it makes sense that the two-tone signal is louder even when the SPLs of the target and comparison signals are equivalent. Similarly, a noise band centered at 1000 Hz but extending far below and above the critical bandwidth at 1000 Hz produces even more motion and neural stimulation (more filter outputs) compared with a single sinusoid at 1000 Hz. Even when the noise signal has the same SPL as the 1000 Hz sinusoid, the former has much greater loudness than the latter. The energy of the noise signal is spread across several critical bands and produces a neural excitation pattern in excess of the excitation produced by a single tone.
Comparing loudness across different signals is notoriously tricky because loudness is affected by many factors. The description given here provides the basics of the relationship between sound energy, perceived loudness, and the concept of the peripheral auditory system as a series of overlapping filters. The science of loudness is more than a sterile laboratory exercise: fitting of hearing aids, adjustments of the processors of cochlear implants, and design of speech recognition and synthesis programs all depend on an understanding of variables that determine the loudness of complex sounds.
In the study of the senses (hearing, vision, taste, smell, touch) there has always been interest in the smallest change in a physical stimulus that can be detected reliably. These perceptual distinctions are known as difference limens (DLs) or just noticeable differences (JNDs). For this discussion, the term loudness DL is used.
According to the Acoustical Society of America Standards (ASA standard 11.35, June 2016, asastandards.org/Terms/difference-limen-for-loudness/), the loudness DL is defined as follows:
For an individual listener and a sound of specified frequency under specified conditions, the minimum change of sound pressure level and frequency that is just noticed as a change in loudness. Unit, decibel (dB).
Note how this definition includes variables likely to affect the DL for loudness. Some variables that can affect the loudness DL are not specified explicitly in the ASA standard (“under specified conditions”), but point to the complex nature of even a simple perceptual phenomenon such as detecting loudness change.
Loudness DLs have been investigated extensively. Most students get a first exposure to loudness DLs (and other aspects of audition) in an introductory course on experimental psychology, where Weber’s law is introduced (the law is sometimes called the Weber–Fechner law, Fechner’s law being derived from Weber’s law). Even though Weber’s experiments concerned perceived heaviness and brightness of light, the results were generalized as a rule for any sensory distinction. Simply stated, the loudness DL was thought to be a constant proportion of the reference SPL. The greater the reference SPL (and therefore loudness), the greater the change in SPL required for a listener to detect a change in loudness. The amount of this required change is assumed to be a constant fraction of the reference SPL. Listeners can therefore detect smaller changes in SPL as changes in loudness when the reference sound has low SPL. Increasingly larger SPL changes are required to detect a loudness change with increases in the reference SPL. Weber’s law can be expressed this way:
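\[
\mathrm{DL}_{\text{loudness}} = \frac{\Delta \mathrm{SPL}}{\mathrm{SPL}_{\text{standard}}}
\]
\[
\frac{\Delta \mathrm{SPL}}{\mathrm{SPL}_{\text{standard}}} = k
\]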
where ΔSPL (Δ being the Greek symbol indicating “change”) is the change in SPL of a stimulus that allows the detection of loudness change relative to the SPL of the standard. SPLstandard is the SPL of the standard, or reference stimulus. In other words, k is a constant regardless of SPLstandard.
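A short numerical sketch may help show what a constant k implies. The code below applies Weber’s law as it is stated here, with the fraction taken relative to the SPL of the standard. The value k = 0.05 and the function name are hypothetical and chosen only for illustration; they are not measured values.

```python
def weber_delta_spl(standard_spl_db, weber_fraction):
    """Change in SPL needed for a just noticeable loudness change.

    Follows Weber's law as stated in the text: the required change is a
    constant fraction (k) of the SPL of the standard.
    """
    return weber_fraction * standard_spl_db

# Hypothetical Weber fraction, for illustration only.
K = 0.05

for standard in (20.0, 40.0, 80.0):
    delta = weber_delta_spl(standard, K)
    print(f"standard {standard:.0f} dB SPL: required change {delta:.1f} dB, "
          f"ratio {delta / standard:.2f}")
```

The required change grows with the SPL of the standard while the ratio stays fixed at k, which is the pattern the law predicts.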
A schematic illustration of a typical experiment designed to determine the loudness DL is shown in Figure 14–8 (after Florentine, Buus, & Mason, 1987). A listener hears two sinusoids in succession, each tone having a duration of 500 ms (half a second) and separated from each other by a silent interval of 250 ms (a quarter of a second, called the interstimulus interval). The black horizontal lines indicate the “on” interval for the standard tones (SPLstandard), whereas the red dashed lines show the level to which one or the other tone must be increased (the ΔSPL + SPLstandard) to determine whether a listener can detect a loudness difference. Both sequenced tones are shown as having a standard and increased level because on any given trial, either the first or second tone may have the increased level. In this way, potential order effects are controlled. During the experiment, the listener presses one of two buttons to indicate which of the two tones is louder.
Figure 14–8. A schematic diagram of one version of an experiment investigating the difference limen for loudness. Two tones are shown, separated by a quiet, interstimulus interval. One tone is the standard SPL (SPLstandard) (indicated by the height [y-axis] of the horizontal black line), the other tone has a slightly higher intensity. The order of the standard tone and the tone with an intensity increment added to the standard (indicated by the horizontal, red dashed line) is randomized across the trials (one trial being the presentation of the two tones in the “observation interval” shown in this figure). The listener presses a button indicating whether the tone with greater loudness was the first or second of the two tones.