Introduction
The preceding chapters have described how visual signals pass from the retina to the visual regions of the brain. In this chapter we discuss how processing of these signals results in our ability to perceive spatial form, such as the outlines of objects, textures, and other patterns. This ability is crucial for mobile species (including humans) to interact with the world around them, and our understanding of the underlying mechanisms has increased rapidly over the past century.
The chapter begins by considering how the tuning of visual cortical neurons performs a local analysis of the content of an image, and how the combined responses of many such neurons determine the limits of our visual abilities. A key method supporting our understanding of low-level vision is the adaptation paradigm, in which neurons temporarily change their sensitivity and tuning following repeated stimulation. Experiments using this technique indicate that many visual features, including orientation, size, shape, and motion, are represented by populations of neurons tuned along these dimensions.
Next we consider how object boundaries are identified, and introduce the concept of second-order vision, which detects changes in contrast and texture rather than in luminance. We discuss how our sensitivity varies across the visual field, becoming poorer away from the fovea. Computational analyses of functional magnetic resonance imaging (fMRI) data allow us to quantify how receptive field sizes increase with distance from the fovea, and also along the visual hierarchy. These changes have implications both for our ability to detect faint patterns and for our perception of higher contrast textures. This latter effect, known as contrast constancy, interacts differently with visual disorders such as amblyopia and optic neuritis.
Finally, we discuss how information from low-level units is combined to represent important image features such as edges and extended textures, and ultimately more specific categories of object such as faces. Recent fMRI work has demonstrated how a spatially local representation of basic image features at early stages transitions to more spatially invariant object representations in higher visual areas. Overall, this chapter aims to give the reader an understanding of how the early stages of vision contribute to our perceptual experience of the world and allow us to localize and recognize the objects we interact with and the environment we navigate.
Early visual mechanisms as feature detectors
Neurons in primary visual cortex (area V1, at the very back of the brain) have receptive fields that are orientation selective and also tuned to particular spatial frequencies. Spatial frequency describes how coarse or fine a pattern is. It is the number of cycles (pairs of light and dark bars) of a grating per degree of visual angle, so wide bars correspond to low spatial frequencies and narrow bars to high ones. For example, a neuron with a vertically oriented receptive field will respond strongly to vertical stripes but remain silent when shown horizontal stripes. Fig. 32.1 shows example receptive field profiles for some simulated neurons (middle column), as well as an image (left) filtered by each receptive field (right). These filtered images simulate what the world looks like to a population of neurons that tile the image (technically, the image has been convolved with each filter). One way to think about these cells is as feature detectors: they respond only when the part of an image they are centered on contains orientation and spatial frequency information close to the neuron’s preferred values. In the example images, you can see how sections of the border of the banana are picked out by neurons with different orientation preferences.
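The idea of a receptive field acting as a filter can be sketched in a few lines of Python. This is a minimal sketch, not the model used to make Fig. 32.1. It assumes a Gabor function (a cosine grating under a Gaussian window), which is a common textbook model of V1 simple-cell receptive fields. The names `gabor`, `grating`, and `response` and all parameter values are hypothetical. Summing the product of a filter and an image patch gives one sample of the filtered image. Repeating this at every image position gives the convolution described above (for this symmetric filter, correlation and convolution give the same result).

```python
import math

def gabor(size, wavelength, theta, sigma, phase=0.0):
    """Return a size x size Gabor receptive field: a cosine carrier under a Gaussian window.

    wavelength is in pixels (spatial frequency = 1 / wavelength cycles per pixel).
    theta = 0 gives a carrier that varies along x, i.e. a preference for vertical bars.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier varies along the preferred axis.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength + phase))
        kernel.append(row)
    return kernel

def grating(size, wavelength, theta):
    """Full-contrast sinusoidal grating patch with the mean luminance removed."""
    half = size // 2
    return [[math.cos(2 * math.pi * (x * math.cos(theta) + y * math.sin(theta)) / wavelength)
             for x in range(-half, half + 1)]
            for y in range(-half, half + 1)]

def response(kernel, patch):
    """Sum of filter x image over the patch: one sample of the filtered image."""
    return sum(k * p for k_row, p_row in zip(kernel, patch) for k, p in zip(k_row, p_row))

vertical_cell = gabor(size=21, wavelength=8, theta=0.0, sigma=4)
print("vertical grating:   ", round(response(vertical_cell, grating(21, 8, 0.0)), 2))
print("horizontal grating: ", round(response(vertical_cell, grating(21, 8, math.pi / 2)), 2))
print("finer vertical:     ", round(response(vertical_cell, grating(21, 4, 0.0)), 2))
```

The vertically tuned unit responds strongly to a vertical grating of its preferred wavelength. It responds only weakly to a horizontal grating, or to a vertical grating twice as fine. This is the orientation and spatial frequency selectivity described above.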
Why do neurons have such specific characteristics? One very likely explanation is that their tuning allows them to efficiently represent the information they encounter in the natural world. For example, Field showed that the statistics of natural images were well captured by banks of filters with properties similar to those of early visual neurons, in a way that reduces redundancy in the signal. Furthermore, Olshausen and Field showed that an algorithm trained on sets of natural images spontaneously develops receptive field properties similar to those found in biological visual systems. This general approach of applying a bank of orientation- and spatial frequency-selective filters to an image is also a critical first step in contemporary computer vision models. In particular, deep convolutional neural networks are a class of machine learning methods that has been very successful at categorization tasks such as object identification and image labeling, and they typically learn filters of this kind in their earliest layers. V1 neurons therefore probably behave in this way because filtering allows the visual cortex to simplify the incoming information from the eyes in an efficient and principled way.
The combined response of populations of tuned neurons determines the limits of our perceptual abilities. These limits can be summarized by the contrast sensitivity function (CSF), which describes how sensitive an individual is to stimuli of different spatial frequencies. In Fig. 32.2A, an example CSF (thick line) is shown to be the envelope of many mechanisms with narrower tuning (e.g., populations of neurons, given by the thinner lines). At any given spatial frequency, our sensitivity will be governed by the most responsive neurons, assuming an appropriate read-out rule. The individual mechanisms typically have bandwidths of just over an octave, whereas the CSF covers several octaves of spatial frequency (an octave is a factor of two in spatial frequency).
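The envelope idea can be made concrete with a toy model. The sketch below rests on several assumptions:

- Each mechanism has log-Gaussian tuning on a log2 frequency axis.
- The mechanisms are spaced one octave apart.
- Each has a full bandwidth at half height of 1.4 octaves.
- Sensitivity is read out as the maximum across mechanisms.

The peak frequencies and gains are made-up values, not data from Fig. 32.2A. The names `mechanism` and `csf` are hypothetical.

```python
import math

# Hypothetical mechanism peak frequencies (c/deg) and peak sensitivities, chosen only
# to give a plausible band-pass envelope; they are not fitted to any data set.
PEAKS = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
GAINS = [30.0, 80.0, 150.0, 150.0, 60.0, 10.0]
BANDWIDTH_OCTAVES = 1.4  # full bandwidth at half height (assumed: "just over an octave")

# Convert the half-height bandwidth into the standard deviation of a log2-Gaussian.
SIGMA = BANDWIDTH_OCTAVES / (2 * math.sqrt(2 * math.log(2)))

def mechanism(f, peak, gain):
    """Sensitivity of one narrowly tuned mechanism: a Gaussian on a log2 frequency axis."""
    return gain * math.exp(-math.log2(f / peak) ** 2 / (2 * SIGMA ** 2))

def csf(f):
    """Envelope read-out: sensitivity is set by the most sensitive mechanism."""
    return max(mechanism(f, p, g) for p, g in zip(PEAKS, GAINS))

for f in [0.5, 1, 2, 4, 8, 16, 32]:
    print(f"{f:5.1f} c/deg  sensitivity {csf(f):7.2f}")
```

Each mechanism covers only a narrow band, but the envelope spans several octaves. With these illustrative gains it is highest at 2 and 4 c/deg. Because there are only six mechanisms, the envelope is scalloped between their peaks; more densely spaced mechanisms would give a smoother function.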
The CSF is normally measured in psychophysical experiments, in which participants are asked to detect faint grating patterns of different spatial frequencies. Typical CSFs, such as the one shown in Fig. 32.2A, have a characteristic peak at frequencies around 2–4 cycles per degree (c/deg), and a fall-off to either side. Any visual signals falling outside of the envelope of the CSF (i.e., outside of the gray shaded region) will be invisible. This means that by knowing an individual’s contrast sensitivity, we can predict what they will be able to see. Notice the point at which the rightmost limb of the function falls to a sensitivity of 1, where even a grating of 100% contrast is only just detectable. This point determines the highest spatial frequency (i.e., the finest detail) that the observer can resolve, which sets the limits of visual acuity (discussed further in Chapter 33). Beyond acuity, however, the CSF provides much more detailed information on visual health and function, which can be useful for understanding and diagnosing visual disorders, as we discuss later in the chapter.
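The sketch below shows how a CSF predicts what is visible and where the acuity limit falls. It assumes a log-parabola, one commonly used functional form for the CSF, with hypothetical parameters. Sensitivity is the reciprocal of threshold contrast. Setting sensitivity to 1 and solving therefore gives the frequency at which even 100% contrast is only just detectable.

```python
import math

# Hypothetical parameters for a log-parabola CSF (illustrative, not fitted to data).
PEAK_SENSITIVITY = 200.0   # i.e. a threshold of 0.5% contrast at the peak
PEAK_FREQUENCY = 3.0       # c/deg, within the typical 2-4 c/deg range
WIDTH = 0.7                # width parameter, in log10 units of spatial frequency

def sensitivity(f):
    """Contrast sensitivity (1 / threshold contrast) at spatial frequency f in c/deg."""
    return 10 ** (math.log10(PEAK_SENSITIVITY)
                  - ((math.log10(f) - math.log10(PEAK_FREQUENCY)) / WIDTH) ** 2)

def is_visible(f, contrast):
    """A grating is predicted to be visible if its contrast exceeds threshold (1 / sensitivity)."""
    return contrast > 1.0 / sensitivity(f)

def acuity_cutoff():
    """High-frequency point where sensitivity falls to 1 (threshold = 100% contrast).

    Setting log10(sensitivity) = 0 gives log10(f) = log10(peak) + WIDTH * sqrt(log10(max)).
    """
    return PEAK_FREQUENCY * 10 ** (WIDTH * math.sqrt(math.log10(PEAK_SENSITIVITY)))

print(f"predicted acuity limit: {acuity_cutoff():.1f} c/deg")
print(is_visible(3.0, 0.01), is_visible(40.0, 1.0))
```

With these made-up parameters the predicted cutoff is about 35 c/deg. A log-parabola is symmetric on logarithmic axes. Measured CSFs often fall off less steeply at low frequencies, so this form is only an approximation there.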
It is possible to visualize one’s own CSF by looking at the image in Fig. 32.2B. In this pattern, spatial frequency increases smoothly along the (logarithmic) x-axis, and contrast decreases smoothly along the (logarithmic) y-axis. At each frequency, the perceived height of the bars is governed by the viewer’s contrast sensitivity, and a “hump” peaking at midrange spatial frequencies is apparent (although note that the placement of the hump will depend on viewing distance). Of course, different people have different contrast sensitivities. If you wear glasses, you might notice that removing them shifts the peak of the hump to the left, as the high spatial frequencies are blurred by the eye’s optics.
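A chart of the kind shown in Fig. 32.2B is often called a Campbell-Robson chart. It can be generated from a grating whose frequency increases exponentially across the image while its contrast decreases exponentially down it, so that both axes are logarithmic. A minimal sketch follows. The image size, frequency range, and contrast range are arbitrary choices, and the function name `campbell_robson` is hypothetical.

```python
import math

def campbell_robson(width=512, height=256, f_lo=0.005, f_hi=0.25,
                    c_top=1.0, c_bottom=0.001):
    """Luminance image (rows of values in 0-1) for a Campbell-Robson style chart.

    Spatial frequency (cycles per pixel) rises exponentially from f_lo to f_hi
    across x, and contrast falls exponentially from c_top to c_bottom down y.
    f_hi should stay below 0.5 cycles per pixel (the Nyquist limit).
    """
    k = math.log(f_hi / f_lo) / width   # exponential growth rate of frequency per pixel
    # Phase is 2*pi times the integral of the instantaneous frequency f_lo * exp(k * x).
    phase = [2 * math.pi * f_lo * (math.exp(k * x) - 1) / k for x in range(width)]
    image = []
    for y in range(height):
        contrast = c_top * (c_bottom / c_top) ** (y / (height - 1))
        image.append([0.5 * (1 + contrast * math.sin(p)) for p in phase])
    return image

chart = campbell_robson()
```

The rows can be written to any grayscale image format. The frequencies are in cycles per pixel, so the corresponding frequency in c/deg depends on display size and viewing distance. This is why the hump moves when you change your viewing distance.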
Interestingly, different animal species have different contrast sensitivities from humans (for examples of experimental paradigms see ). Birds of prey like eagles have a CSF tuned to quite high spatial frequencies, presumably so they can spot small prey animals far away on the ground. Ground-dwelling animals like mice are better adapted to low spatial frequencies, for seeing nearby food and large predators! In all species studied so far, however, the CSF has the same inverted U shape found in humans, albeit shifted to higher or lower frequencies.
Adaptation as a tool for understanding vision
How do we know that the CSF is the envelope of many mechanisms with narrower tuning? Early evidence for this idea came from studies using adaptation paradigms. In an adaptation experiment, a high-contrast stimulus is shown repeatedly for a long time (usually several minutes). This desensitizes the neurons that represent that stimulus but has little or no effect on neurons with different tuning (see Fig. 32.3). For example, adapting to a vertical grating of 4 c/deg will produce a “notch” of lower sensitivity in the CSF centered at the adapting frequency. It will not, however, affect sensitivity to spatial frequencies that are much higher or lower, or to stimuli with horizontal orientations. This specificity indicates that an apparently continuous function such as the CSF is really the envelope of several more narrowly tuned mechanisms. If instead our sensitivity were determined by a single broadband mechanism, adaptation would reduce sensitivity equally at all frequencies.
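The logic of this argument can be sketched with a toy multi-channel model. For illustration only, it assumes six equally sensitive log-Gaussian mechanisms at octave spacing and a maximum read-out rule. It also assumes a hypothetical adaptation rule in which each mechanism loses gain in proportion to how strongly it responds to the adaptor.

```python
import math

PEAKS = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]          # hypothetical peak frequencies, c/deg
SIGMA = 1.4 / (2 * math.sqrt(2 * math.log(2)))   # log2-Gaussian s.d. for a 1.4-octave bandwidth

def tuning(f, peak):
    """Normalized tuning curve of one mechanism (1.0 at its peak frequency)."""
    return math.exp(-math.log2(f / peak) ** 2 / (2 * SIGMA ** 2))

def sensitivity(f, gains):
    """Maximum read-out: sensitivity is set by the most sensitive mechanism."""
    return max(g * tuning(f, p) for g, p in zip(gains, PEAKS))

def adapt(gains, adaptor_freq, strength=0.6):
    """Hypothetical rule: each gain drops in proportion to that mechanism's response to the adaptor."""
    return [g * (1 - strength * tuning(adaptor_freq, p)) for g, p in zip(gains, PEAKS)]

before = [100.0] * len(PEAKS)   # equal gains, so only the effect of adaptation is visible
after = adapt(before, adaptor_freq=4.0)

for f in [0.5, 1, 2, 4, 8, 16]:
    ratio = sensitivity(f, after) / sensitivity(f, before)
    print(f"{f:5.1f} c/deg: sensitivity after/before adaptation = {ratio:.2f}")
```

In this model the sensitivity ratio falls to 0.40 at the 4 c/deg adapting frequency and is about 0.85 one octave away. Two or more octaves away it stays close to 1, which produces the notch described above. If the six mechanisms were replaced by a single broadband one, the same rule would scale the whole function by one common factor.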
Adaptation has other effects on our perception besides reducing sensitivity. Stimuli close to the adaptor (e.g., with slightly higher or lower spatial frequencies) appear shifted to even higher or lower spatial frequencies, respectively, as though they were “repelled” by the adaptor. This is thought to happen because adaptation differentially alters the tuning of nearby mechanisms, shifting their peaks away from the adapting frequency. Such after-effects occur not just for spatial frequency, but also for orientation, motion, and higher-order stimulus properties including object size, depth, aspect ratio, numerosity, and even facial expression. The prevalence of repulsive after-effects suggests that many sensory dimensions are represented by an underlying population code comprising multiple narrowly tuned units, and that this is a general organizing principle for sensory systems.
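A simple population-coding sketch shows how repulsion can arise. It assumes densely spaced log-Gaussian channels. Perceived frequency is read out as the response-weighted average of the channels' preferred (log2) frequencies. Adaptation is modeled only as a loss of gain near the adaptor. This gain-only model is a simplification swapped in for the tuning shifts described above, but it is enough to push the decoded value away from the adapting frequency. All names and parameters are hypothetical.

```python
import math

SIGMA = 1.4 / (2 * math.sqrt(2 * math.log(2)))   # channel s.d. in octaves (assumed)
PREFERRED = [-2 + 0.25 * k for k in range(33)]   # preferred log2 frequencies: 0.25 to 64 c/deg

def tuning(log2_f, preferred):
    """Normalized response of a channel to a grating at log2_f (log2 of c/deg)."""
    return math.exp(-(log2_f - preferred) ** 2 / (2 * SIGMA ** 2))

def adapted_gains(adaptor_log2_f, strength=0.6):
    """Gain-reduction model: channels that respond most to the adaptor lose most gain."""
    return [1 - strength * tuning(adaptor_log2_f, p) for p in PREFERRED]

def decode(log2_f, gains):
    """Perceived frequency (log2) as the response-weighted mean of preferred frequencies."""
    responses = [g * tuning(log2_f, p) for g, p in zip(gains, PREFERRED)]
    return sum(r * p for r, p in zip(responses, PREFERRED)) / sum(responses)

unadapted = [1.0] * len(PREFERRED)
adapted = adapted_gains(adaptor_log2_f=2.0)      # adapt to 4 c/deg

for test in [1.5, 2.5]:                          # half an octave below / above the adaptor
    before = 2 ** decode(test, unadapted)
    after = 2 ** decode(test, adapted)
    print(f"test {2 ** test:.2f} c/deg: perceived {before:.2f} before, {after:.2f} after adaptation")
```

A test grating half an octave above the 4 c/deg adaptor is decoded as higher in frequency than before adaptation. One half an octave below is decoded as lower.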
Objects are defined by spatial changes in luminance, color, contrast, and texture
Visual mechanisms that can detect changes in luminance are critical for perceiving the form and location of objects in the environment. This is primarily because most objects have different surface and reflectance properties from their backgrounds, and so under natural light appear brighter or darker than the background. An object like the circle shown in Fig. 32.4A is brighter than its surround, and will therefore produce strong responses in mechanisms matched to the object’s approximate spatial frequency. For example, consider a filter such as those shown in Fig. 32.1 whose excitatory central region (shown in white) is approximately the same width as the circle in Fig. 32.4A. Such a filter would be expected to respond strongly, signaling the presence of an object. Additionally, mechanisms tuned to higher spatial frequencies will respond strongly to the edges of the object, indicating the location of the border between it and the surround.
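The size-matching argument can be checked with a closed-form calculation. This sketch does not use the oriented filters of Fig. 32.1. Instead it swaps in a circularly symmetric difference-of-Gaussians (DoG), a standard model of center-surround receptive fields, because its response at the center of a uniform bright disk has a simple formula. The surround-to-center size ratio of 2 and the function names are hypothetical choices.

```python
import math

def dog_response_to_disk(radius, sigma_center, ratio=2.0):
    """Response of a balanced DoG filter centered on a bright disk (disk = 1, background = 0).

    Each Gaussian is normalized to unit volume, so the filter ignores uniform illumination.
    A unit-volume 2D Gaussian integrates to 1 - exp(-R^2 / (2 sigma^2)) over a concentric
    disk of radius R, which gives this closed form.
    """
    sigma_surround = ratio * sigma_center
    center = 1 - math.exp(-radius ** 2 / (2 * sigma_center ** 2))
    surround = 1 - math.exp(-radius ** 2 / (2 * sigma_surround ** 2))
    return center - surround

def excitatory_radius(sigma_center, ratio=2.0):
    """Radius at which the DoG profile crosses zero, i.e. the edge of the excitatory center."""
    k2 = ratio ** 2
    return sigma_center * math.sqrt(2 * k2 * math.log(k2) / (k2 - 1))

radius = 10.0                                        # disk radius in arbitrary units
sizes = [0.5 * 2 ** (i / 4) for i in range(30)]      # center sigmas on a log scale
responses = [dog_response_to_disk(radius, s) for s in sizes]
best = sizes[responses.index(max(responses))]
optimum = radius / excitatory_radius(1.0)            # sigma whose excitatory center matches the disk

print(f"best sampled center sigma: {best:.2f}; analytic optimum: {optimum:.2f}")
```

The response is near zero when the filter is much smaller than the disk, because center and surround both fall inside it. It is also small when the filter is much larger, because the disk covers only a small part of the center. The response peaks when the filter's excitatory center (the region inside the DoG's zero crossing) has the same radius as the disk.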
Changes in luminance across space are referred to as first-order information, and neurons that can detect them are first-order mechanisms (see Fig. 32.1 for an illustration of how first-order edges are detected). First-order cues are the most critical aspect of our spatial visual abilities. They demonstrate why the concept of contrast sensitivity is so important: luminance modulations outside of the envelope of the CSF (Fig. 32.2) are literally invisible to us. In evolutionary terms, the ability to detect luminance variation has obvious survival benefits for identifying predators, food, and other beneficial or harmful aspects of the environment.
Sometimes objects do not differ much in luminance from their background. However, there are other cues that might reveal them, such as differences in color (Fig. 32.4B), contrast (Fig. 32.4C), or texture (Fig. 32.4D). Color vision processes are explained further in Chapter 34, so they are not discussed in detail here. We note only that the spatial resolution of color channels is much coarser than that of the achromatic (luminance) system. This means that we are less able to resolve fine spatial details when they are defined by color than by luminance. The chromatic CSF is shifted downwards and to the left (Fig. 32.5A), but plateaus rather than falling off at low spatial frequencies (based on data from ).
Cues such as contrast and texture are referred to as second-order information. These are invisible to first-order mechanisms, which respond only to changes in luminance. Instead, second-order mechanisms are constructed from the outputs of multiple first-order mechanisms. The classic circuit for second-order vision is the filter-rectify-filter (FRF) arrangement. In this arrangement, the outputs of first-order filters are rectified and then passed to a second stage of filtering at a coarser spatial scale. This places a fundamental limit on second-order vision, in that the carrier texture must be detectable by first-order mechanisms. Because of this limitation, and the additional processing stages required, sensitivity to second-order modulations is much lower than for achromatic first-order vision (see Fig. 32.5B, based on data from Schofield and Georgeson). In ecological terms, sensitivity to second-order information allows us to break camouflage, for example to detect animals such as moths and cuttlefish that have evolved to blend in with their backgrounds. It also allows us to distinguish changes in lighting conditions from changes in material properties.
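The filter-rectify-filter idea can be illustrated in one dimension. In the sketch below (all names and parameters are hypothetical), a fine carrier grating has its contrast modulated by a coarse envelope while mean luminance stays constant. A linear mechanism tuned to the envelope frequency gives essentially no response. An FRF chain does respond. It consists of three stages:

- a first filter tuned to the carrier;
- full-wave rectification by squaring (half-wave rectification is another common choice);
- a second stage, represented for simplicity by a template matched to the envelope frequency.

```python
import math

N = 512                 # number of samples across the 1D stimulus
CARRIER_CYCLES = 64     # fine texture (carrier): cycles across the stimulus
ENVELOPE_CYCLES = 4     # coarse contrast modulation: cycles across the stimulus

def stimulus(contrast=0.5, depth=0.8):
    """Contrast-modulated grating: mean luminance is constant, only local contrast varies."""
    return [contrast * (1 + depth * math.cos(2 * math.pi * ENVELOPE_CYCLES * n / N))
            * math.cos(2 * math.pi * CARRIER_CYCLES * n / N) for n in range(N)]

def gabor_1d(cycles_per_sample, sigma=4.0, half=16):
    """Even-symmetric 1D Gabor filter tuned to the given frequency."""
    return [math.exp(-k * k / (2 * sigma ** 2)) * math.cos(2 * math.pi * cycles_per_sample * k)
            for k in range(-half, half + 1)]

def circular_filter(signal, kernel):
    """Circular convolution, so that the stimulus wraps around at its ends."""
    half = len(kernel) // 2
    n = len(signal)
    return [sum(kernel[j] * signal[(i - (j - half)) % n] for j in range(len(kernel)))
            for i in range(n)]

def amplitude_at(signal, cycles):
    """Amplitude of the cosine component at a given frequency (a coarse template match)."""
    return 2 / len(signal) * sum(v * math.cos(2 * math.pi * cycles * n / len(signal))
                                 for n, v in enumerate(signal))

s = stimulus()
first_order = amplitude_at(s, ENVELOPE_CYCLES)                 # linear filter at envelope frequency
filtered = circular_filter(s, gabor_1d(CARRIER_CYCLES / N))    # first filter: tuned to the carrier
rectified = [v * v for v in filtered]                          # rectification: full-wave (squaring)
second_order = amplitude_at(rectified, ENVELOPE_CYCLES)        # second filter: tuned to the envelope
print(f"first-order response:        {first_order:.6f}")
print(f"FRF (second-order) response: {second_order:.3f}")
```

The linear response is zero up to rounding error, because the stimulus contains no energy at the envelope frequency. Rectification creates that energy, which is why a second-order mechanism needs a nonlinearity between its two filters. The sketch also illustrates the limit noted above: if the first filter cannot pass the carrier, there is nothing to rectify.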
Sensitivity and receptive field size versus eccentricity
A key limit on our visual abilities is the fall-off in sensitivity across the visual field. Foveal vision is generally more sensitive than vision in the periphery, yet this decline is not uniform across spatial frequencies or types of cue. Pointer and Hess found that contrast sensitivity declined most rapidly at high spatial frequencies when data were plotted as a function of eccentricity in degrees. However, these differences become much less marked when the data are replotted against eccentricity expressed in grating cycles (eccentricity in degrees multiplied by spatial frequency in c/deg). Baldwin, Meese, and Baker found that the decline in sensitivity was steepest over about the first eight cycles of the stimulus, and somewhat shallower at greater eccentricities. Sensitivity across the visual field can therefore be characterized by an approximately bilinear function of eccentricity expressed in stimulus cycles, as shown in Fig. 32.6. Notice, however, that there are asymmetries, particularly about the horizontal meridian, with sensitivity declining more gradually in the lower visual field than in the upper visual field.
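The conversion to stimulus cycles and the bilinear fall-off can be written as a small function. The knee at about eight cycles comes from the text. The two slopes are illustrative, not values from Baldwin, Meese, and Baker. So is the assumption that the loss is linear in log sensitivity. The sketch also ignores the visual field asymmetries, and its function names are hypothetical.

```python
# Only the ~8-cycle knee comes from the text; the slopes (log10 units of
# sensitivity lost per stimulus cycle) are hypothetical, illustrative values.
KNEE_CYCLES = 8.0
STEEP_SLOPE = 0.05
SHALLOW_SLOPE = 0.02

def eccentricity_in_cycles(ecc_deg, sf_cpd):
    """Eccentricity in grating cycles = eccentricity (deg) x spatial frequency (c/deg)."""
    return ecc_deg * sf_cpd

def log_sensitivity_loss(ecc_deg, sf_cpd):
    """Bilinear loss of log10 sensitivity relative to the fovea."""
    e = eccentricity_in_cycles(ecc_deg, sf_cpd)
    if e <= KNEE_CYCLES:
        return STEEP_SLOPE * e
    return STEEP_SLOPE * KNEE_CYCLES + SHALLOW_SLOPE * (e - KNEE_CYCLES)

for sf in [1.0, 4.0]:
    print(f"{sf} c/deg:", [round(log_sensitivity_loss(ecc, sf), 2) for ecc in [0, 2, 5, 10]])
```

At the same eccentricity in degrees, a 4 c/deg grating lies more cycles from the fovea than a 1 c/deg grating, so it loses more sensitivity. Two stimuli at the same eccentricity in cycles are predicted to lose the same amount.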
Sensitivity also declines at different rates for different cues. For example, Hess et al. compared sensitivity as a function of eccentricity for first- and second-order stimuli. The decline was much more rapid for second-order (contrast-modulated) stimuli than for first-order stimuli. This appeared to be mostly governed by the fall-off in carrier sensitivity (i.e., sensitivity to the texture whose contrast is modulated). In other words, when high-frequency carriers become harder to see, this limits the detectability of second-order modulations. Sensitivity to chromatic information is also poor away from the fovea, and declines more rapidly than sensitivity to luminance modulations, largely because of the low density of color-sensitive cones (see Chapter 34) in the peripheral regions of the retina.
Sensitivity may also decline away from the fovea partly because receptive field sizes grow larger with eccentricity. This finding is well established neurophysiologically (e.g., ), and can also be demonstrated in humans using fMRI. The population receptive field (pRF) technique typically involves presenting observers with textured bars that drift across the visual field in different directions. The time course of fMRI activity is then fitted using a computational model to determine, for each voxel in the cortex, the region of the visual field to which it is most responsive. Many studies have used this technique to show that receptive field sizes (i.e., the area over which a voxel is responsive) increase with distance from the foveal representation. Figs 32.7A and B show example pRF data for one participant from a study by Lygo et al. Note that the regions of cortex representing more peripheral locations in Fig. 32.7A (dark red) also correspond to larger pRF diameters in Fig. 32.7B (yellow and green). The figure also illustrates that a far greater proportion of visual cortex is responsive to stimuli in the central few degrees around the fovea than to stimuli in the periphery. It is likely that this reduced cortical representation of peripheral locations also contributes to our poorer sensitivity away from the fovea.
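The pRF fitting procedure can be sketched as a grid search. This simplified version uses hypothetical names and a coarse 1 deg grid. It assumes a circular 2D Gaussian pRF and models each time point as the overlap between the bar aperture and that Gaussian. Candidate pRFs are scored by their correlation with the measured time course. Real pRF analyses also convolve the prediction with a hemodynamic response function and fit noisy data; both steps are omitted here.

```python
import math

COORDS = range(-6, 7)   # visual field sampled every 1 deg from -6 to +6 deg (assumed grid)

def bar_apertures():
    """Hypothetical stimulus: 1-deg-wide bars sweeping across the field in four directions.

    Each frame is the list of (x, y) grid points covered by the bar at that time point.
    """
    frames = []
    for order in (list(COORDS), list(reversed(COORDS))):
        frames += [[(pos, y) for y in COORDS] for pos in order]   # vertical bar, moving in x
        frames += [[(x, pos) for x in COORDS] for pos in order]   # horizontal bar, moving in y
    return frames

def predict(frames, x0, y0, sigma):
    """Predicted time course: overlap of each aperture with a 2D Gaussian pRF (no HRF)."""
    def g(x, y):
        return math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
    return [sum(g(x, y) for x, y in frame) for frame in frames]

def correlation(a, b):
    """Pearson correlation between two time courses (0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def fit_prf(frames, measured, sigmas=(0.5, 1.0, 2.0, 3.0)):
    """Grid search over pRF position and size; returns (r, x0, y0, sigma) of the best fit."""
    best = None
    for x0 in COORDS:
        for y0 in COORDS:
            for s in sigmas:
                r = correlation(predict(frames, x0, y0, s), measured)
                if best is None or r > best[0]:
                    best = (r, x0, y0, s)
    return best

frames = bar_apertures()
measured = predict(frames, 3, -2, 2.0)     # synthetic voxel with a known pRF
r, x0, y0, size = fit_prf(frames, measured)
print(f"best fit: x={x0}, y={y0}, sigma={size}, r={r:.3f}, "
      f"eccentricity={math.hypot(x0, y0):.1f} deg")
```

For this noise-free synthetic voxel, the search recovers the pRF that generated the data. Repeating the fit for every voxel and plotting eccentricity against the fitted sigma would give maps like those in Fig. 32.7.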