Fig. 7.1
An example of two merge: A lexical item (“the”) is merged with a set previously formed by merge (black circle: sudden macromutation). Abbreviations: N, noun; Adj, adjective; Det, determiner
However, there is considerable disagreement in the linguistic community regarding the relevance of merge on both theoretical and empirical grounds. For example, linguists studying the diversity of the world’s roughly 6,000 languages continue to emphasize the sheer diversity of structures at every level of linguistic organization with no evidence for any universals in language (Evans and Levinson 2009). Consequently, at this time it appears worthwhile to continue exploring the nonhuman roots of human language despite the proposals of human uniqueness by Hauser et al. (2014).
7.2.2 Primate Models
Perhaps more relevant is the evaluation of the empirical animal literature by Hauser et al. (2014), which led them to conclude that “… the gap between us and them is simply too great to provide any understanding of evolutionary precursors or the evolutionary processes (e.g., selection) that led to change over time.” The gap between nonhuman animal and human communication is clearly great, but the purpose of this chapter is to demonstrate that this cannot be a serious argument against comparative research. The following sections are devoted to exploring the chasms between human and nonhuman primate communication, including differences in vocal control and learning, sound-meaning linkages, combinations of signal units, and the social cognition underlying human and nonhuman primate communication.
7.3 Evolution of Vocal Control
7.3.1 Primate Vocal Tracts
Human and nonhuman primate sounds are produced by a specialized vocal tract consisting of a sound-producing source and an acoustic filter apparatus (Fant 1960). During sound production, the vocal folds of the larynx oscillate in response to airflow from the lungs, and this creates a basic acoustic signal, which then travels through the supralaryngeal vocal tract. The acoustic properties of the signal emitted into the environment thus are determined not only by the activity of the larynx but also by the spatial configuration of the vocal tract (the shape of the nasal and oral cavities), which determines the resonance properties and acoustic quality of the emitted sounds (Fitch and Hauser 1995).
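The filtering role of the vocal tract can be illustrated with a deliberately simplified calculation. The sketch below is not taken from the studies cited above; it assumes the textbook idealization of the vocal tract as a uniform tube closed at the larynx and open at the lips, whose resonances fall at odd multiples of c/4L. The function name, the speed of sound (350 m/s), and the two tract lengths are illustrative assumptions, not measurements from any primate.

```python
# Assumed speed of sound in warm, humid air inside the vocal tract (illustrative value).
SPEED_OF_SOUND_M_PER_S = 350.0


def tube_resonances(tract_length_m, count=3):
    """Return the first `count` resonances (Hz) of a uniform tube that is
    closed at one end and open at the other: F_n = (2n - 1) * c / (4 * L)."""
    if tract_length_m <= 0:
        raise ValueError("tract length must be positive")
    return [
        (2 * n - 1) * SPEED_OF_SOUND_M_PER_S / (4.0 * tract_length_m)
        for n in range(1, count + 1)
    ]


# Hypothetical tract lengths: a shorter and a longer tube.
for length_m in (0.10, 0.17):
    resonances = [round(f) for f in tube_resonances(length_m)]
    print(f"{length_m:.2f} m tract: {resonances} Hz")
```

Under this idealization a longer tube yields lower, more closely spaced resonances, which is the sense in which vocal tract configuration shapes the acoustic quality of a call. Real vocal tracts are not uniform tubes, so the numbers are indicative only.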
Although only limited comparative data are available, the evidence suggests that there is a fundamental similarity in the morphological structures of the sound-producing apparatus across primates, including humans, and many other mammals (Fig. 7.2) (Riede et al. 2005; Taylor and Reby 2010; Fitch et al. 2016). One main difference is that, in adult humans, the larynx is in a permanently low position, which gives the human vocal tract a characteristic, perpendicular, two-tube shape. Whether or not this anatomical specialization is crucial for speech production has been the topic of much ongoing debate (Fitch and Reby 2001; Lieberman 2012), but arguably it is unlikely to be the key prerequisite for the evolution of vocal control and, by extension, the production of intelligible speech (Fitch et al. 2016; Quam, Martínez, Rosa, and Arsuaga, Chap. 8).
Fig. 7.2
Schematic drawing of the head-neck region of a Diana monkey with details from dissection and lateral x-ray. Abbreviations: L, larynx; lL, lower lip; P, palate; T, tongue; Tr, trachea; uL, upper lip; dashed line 1, oral vocal tract length; dashed line 2, nasal vocal tract length; arrows indicate the dorsoventral distances of the oral vocal tract. (Reprinted with permission from Riede et al. 2005)
7.3.2 Vocal Flexibility
Humans are undoubtedly unusual in their extraordinarily high degree of motor control of both larynx and vocal tract (Ackermann et al. 2014). Nonhuman primates lack this degree of control, which prevents them from acquiring new sound patterns through vocal learning (but see Snowdon, Chap. 6). One manifestation of this is that chimpanzees all over Africa possess the same basic vocal repertoire regardless of habitat and social upbringing (Goodall 1986; Slocombe and Zuberbühler 2010). This finding is often contrasted with evidence for dialects in some marine mammals, such as killer whales (Orcinus orca) and sperm whales (Physeter macrocephalus), which serve as acoustical “badges” to secure group cohesion (Ford 2009).
Another manifestation of low vocal control in primates is that it has been nearly impossible to get chimpanzees and other primates to mimic human speech sounds even after extensive training, and learning to produce sounds on command has turned out to be a very difficult task for them. In a classic study, Hayes and Hayes (1951) describe the vocal abilities of their home-raised and speech-trained chimpanzee “Viki” as follows: “…we began a speech training program when she was five months old. The first step was aimed at teaching her merely to vocalize on command, in order to obtain a reward. … Although she seemed to learn what was required quickly, she had serious trouble with the motor skill of voluntary vocalization. It took her five months to learn to produce a hoarse, staccato grunt, quite unlike her normal spontaneous sounds. She could do this quickly and dependably, when told to ‘speak,’ but only with much grimacing and straining. This phase of the training was also given to several laboratory chimpanzees, with similar results” (p. 107 in Hayes and Hayes 1951).
Nonhuman primate natural vocal communication is characterized by species-specific repertoires, which consist of a limited number of basic call types that are produced in relatively specific situations to serve distinct biological and social functions. In our closest relatives, the chimpanzees, the vocal repertoire consists of a few basic call types, many of them blending into each other, which makes classification difficult (Slocombe and Zuberbühler 2010). This limited flexibility in nonhuman primate vocal behavior is also striking when considering the fact that vocal learning is not uncommon in the animal kingdom, although usually restricted to courtship behavior or contact, which often involves sound-producing mechanisms other than the larynx (Janik and Slater 1997; Janik 2014).
Why are humans the only primates that have evolved such a high degree of vocal control? Although the differences are vast, there is evidence for limited vocal flexibility in some primate species (Snowdon, Chap. 6). In adult Campbell’s monkeys (Cercopithecus campbelli), for example, contact calls of closely affiliated pairs of females are more similar than calls of socially less close individuals (Lemasson and Hausberger 2004). In chimpanzees, pant hoot vocalizations, a long-distance contact and display signal, are affected in similar ways, with several studies showing acoustic convergence of calls between closely affiliated males (Marshall et al. 1999; Crockford et al. 2004).
External events can further influence the acoustic variation of primate calls. For example, chimpanzee rough grunts, given when discovering food, vary in their acoustic structure depending on the caller’s perception of the quality of the food, which is something that other group members can discriminate (Slocombe and Zuberbühler 2005). Some of this variation may be subject to social learning. According to one study, a group of chimpanzees brought in from a Dutch facility to Edinburgh Zoo gradually adjusted the acoustic structure of rough grunts to match the calls given by resident group members, as if adapting to the local communicative convention (Watson et al. 2015).
Another way in which primates can create acoustic variation is by combining acoustic units within calls. One example is the alarm call system of Campbell’s monkeys. Males produce three basic alarm calls, krak, hok, and wak calls, all of which can be combined with an acoustically invariable vocal suffix (oo) to generate a combined call (krak-oo, hok-oo, wak-oo) (Fig. 7.3) (Ouattara et al. 2009a, b). Unsuffixed calls are typically given in response to dangerous predators, while suffixed calls are associated with less dangerous situations. In playback experiments, monkeys gave significantly stronger responses to unsuffixed (leopard) than suffixed (unspecific danger) calls, which suggested that suffixation is an evolved function in primate communication (Coye et al. 2015).
Fig. 7.3
Spectrographic illustrations of the different loud call types produced by male Campbell’s monkeys in different contexts. (a) Boom call [B]: a low-pitched loud call produced by the vocal sac with no frequency modulation; (b) Krak call [K]: a single loud tonal utterance of ø = 0.176 s duration with a decreasing main frequency band, starting at about 2.2 kHz; (c) Hok call [H]: a single loud tonal utterance of ø = 0.070 s with no frequency modulation at about 1.0 kHz; (d) Wak-oo call [W+]: a suffixed loud tonal utterance of 0.330 s consisting of a call stem with an increasing main frequency band, rising from about 1.0 to 1.3 kHz, followed by a compulsory oo suffix; (e) Krak-oo call [K+]: a krak call followed by the oo suffix; (f) Hok-oo call [H+]: a hok call followed by the oo suffix. Dashed red arrow indicates direction of frequency transition; dashed red oval indicates the oo suffix. (Reproduced with permission from Ouattara et al. 2009b)
Another example is contact calls by the Diana monkey (Cercopithecus diana) that consist of an individually distinct, arched structure that can be combined with three other call types that are linked with specific events (Fig. 7.4) (Candiotti et al. 2012). Importantly, R, L, and A call units can be given either singly or merged as RA or LA combinations. While R and L units refer to information about external events, the A units convey information about caller identity. In playback experiments, subjects responded in ways that suggested that both event type and identity information were perceived by listeners, which was an empirical demonstration of morphosemantic properties in primate social calls (Coye et al. 2016).
Fig. 7.4
Spectrographic representations of female Diana monkey contact calls, which consist of an optional introductory unit (High-pitched trill, H; Low-pitched trill, L; Repeated unit, R) followed by a broken (b) or full (f) arch (A). Introductory units and arches can also be produced on their own. (Reprinted with permission from Candiotti et al. 2012)
Despite these findings, human speech goes far beyond such phenomena, so what structures enable it? As mentioned earlier, initial explanations have highlighted differences in vocal tract anatomy, in particular the fact that humans have a permanently lowered larynx (Lieberman 2012). However, it now seems unlikely that this is sufficient to explain differences in vocal behavior between human and nonhuman primates (Quam, Martínez, Rosa, and Arsuaga, Chap. 8).
A more plausible hypothesis is that humans possess a direct cortical innervation of the nucleus ambiguus in the brain, the site of laryngeal motor control, which yields a high degree of laryngeal control during phonation (Jürgens 2002). Motor control of the filter, the supralaryngeal vocal tract, is evolutionarily more ancient since it is shared with at least the great apes. Various lines of evidence suggest that great apes have good motor control over the facial musculature, including those muscles involved in producing speech (Lameira et al. 2014). For example, captive orangutans (Pongo pygmaeus) can learn to mimic a caretaker’s whistles by controlling the airflow passing through their lips (Lameira et al. 2013), although it is less clear whether the control of the tongue is equally advanced. However, the main point here is that parts of the speech apparatus appear to have been in place prior to the evolution of speech in humans.
Comparative ontogenetic research has also contributed to this discussion. In humans, the larynx descends during early infant development, and this process is related to the onset of speech production. However, in infant chimpanzees, the larynx also descends during early development, suggesting that relevant anatomical changes of the vocal tract during development are phylogenetically ancient (Nishimura et al. 2003). Of course, the adult vocal tract anatomy of humans and chimpanzees still differs considerably. In chimpanzees, the horizontal part of the vocal tract grows relatively more than the vertical part, while the pattern is the opposite for humans, with the larynx descending more rapidly in human infants. The human-specific laryngeal descent thus may simply be a by-product of more general differences in the facial development of humans and chimpanzees (Nishimura 2005).
Another line of argument has been that the human FOXP2 gene, which plays a role in speech production in humans, is structurally different from the gene in all other primates. This is due to two relatively recent mutations during human evolution that became stabilized around 200 Ka, approximately coinciding with when modern humans evolved in Africa (Enard et al. 2002). In modern humans, deleterious mutations in the FOXP2 gene lead to severe speech disorders, apparently by affecting orofacial control during speech production (Fisher and Scharff 2009). Control of the larynx, however, does not seem to be impaired in affected patients, suggesting that FOXP2 evolution has little to add to the basic problem of how and why humans evolved the capacity to control sound production. The human-specific FOXP2 gene also has been found in two Neandertal specimens (Krause et al. 2007), suggesting that the key mutations occurred before the advent of modern humans.
In sum, like all other primates, humans possess a repertoire of species-specific vocalizations – the possible remnants of an ancestral, nonhuman primate-like communication system. But humans also have evolved an additional layer of vocal control that is characterized by highly coordinated movements of the jaws, lips, and tongue in unison with highly controlled sound production. While motor control of parts of the supralaryngeal vocal tract appears to be phylogenetically older and shared at least with the great apes, motor control of the larynx appears to be a recent human invention. How brain evolution and the associated laryngeal innervation changed to foster the transition from nonhuman primate to human vocal behavior is unclear. A potentially relevant point is the loss of laryngeal air sacs, present in nonhuman primates but absent in humans, which may have further facilitated the production of fine-grained vocalizations in humans (Quam, Martínez, Rosa, and Arsuaga, Chap. 8).
7.4 Reference, Inference, and Meaning in Communication
7.4.1 Information About External Entities
Much research has been devoted to the question of whether primate calls are meaningful (i.e., have an informational content), similar to how human words are meaningful (Fedurek and Slocombe 2011). This line of work has been inspired by early results from East African vervet monkeys (Chlorocebus aethiops), which produce acoustically distinct alarm calls to their main predators: pythons, leopards, and predatory eagles (Seyfarth et al. 1980). With playback experiments it was possible to demonstrate that vervet monkeys responded to the different calls as if the corresponding predators were present (e.g., standing bipedally to visually search the ground in response to a snake alarm). Comparable findings have been reported from other primate species, including Campbell’s monkeys (Zuberbühler 2001), black-and-white Colobus monkeys (Colobus guereza) (Schel et al. 2010), and several lemur species (Pereira and Macedonia 1991; Fichtel and Kappeler 2002), suggesting that predator-specific alarm calls are a general feature of primate communication (Zuberbühler 2001). It is also relevant that primates (and other groups of animals) recognize alarm calls of other species, a demonstration that call recognition and comprehension are not based on some innate capacity but are acquired by observing behavioral interactions of other individuals (Zuberbühler 2000; Rainey et al. 2004) (Snowdon, Chap. 6).
Alarm calls are not the only class of signals that refer to external entities. Some animals also produce acoustically distinct calls when finding food, with acoustic variations that sometimes convey something about the perceived value of the food (Fig. 7.5) (Scarantino and Clay 2015). Similar to alarm calls, food calls thus refer to distinct external events, probably mediated by specific internal emotional/psychological states. On the surface, both call types appear to have negative consequences for the caller, since food calls are likely to increase feeding competition and alarm calls are likely to attract a predator’s attention. However, observations and field experiments with chimpanzees have shown that callers are very selective in when they produce alarm or food calls, ensuring that social allies and other important group members are the main beneficiaries (Crockford et al. 2012; Fedurek and Slocombe 2013; Schel et al. 2013a).
Fig. 7.5
Time-frequency spectrograms of chimpanzee food calls (rough grunts) given by an adult male at Edinburgh Zoo to bread and apples. Bread is the more preferred food, and the corresponding grunts have more energy (depicted by the darkness of the image) at higher frequencies and a clearer harmonic structure in comparison to the lower-pitched, noisier grunts to apples. (Reproduced with permission from Slocombe and Zuberbühler 2005)
These examples go to the heart of the difficulties in deciding whether primate calls reflect the emotional/psychological state of a caller or whether they have an informational content. In this and many other cases both aspects seem to matter, suggesting that primate calls have a dual nature.
Primate vocal responses to external events, such as to foods or predators, are part of a more general pattern seen across nonhuman primate signaling systems. Most nonhuman primate calls serve relatively specific biological functions: they are given in very specific social situations or given to specific external events to the effect that recipients can draw inferences about the event experienced by the caller almost by default. For example, primates, including humans, produce specific vocalizations during aggressive interactions with aversive effects on opponents, probably to facilitate rapid learning by operant conditioning (Gouzoules et al. 1984; Owren and Rendall 2001). At the same time, any such tight signal-event link allows nearby listeners to draw inferences about the nature of the ongoing event. Calls come to convey information about an external entity or social event (Slocombe et al. 2010a).
Interestingly, during fights, chimpanzees sometimes produce sequences consisting of two different types of calls: barks directed at the aggressor to signal readiness to retaliate and screams directed at allies to solicit their help (Fedurek et al. 2015). Screams also show event-related acoustic variation that roughly encodes the severity of the attack, and field experiments have shown that listeners can discriminate this information readily (Slocombe and Zuberbühler 2007). Chimpanzees that are victims of aggression, in other words, appear to address two different audiences with their calls with two different intentions.
7.4.2 Symbolic Information
Are primate calls symbolic? Most definitions of “symbol” are based on notions of signal arbitrariness and reference to something else, either by association or by convention. A symbol thus represents, stands for, or suggests something else, usually an idea or an object. Since it is clear that primate alarm calls can refer to relatively specific predator classes (Marler 1998; Zuberbühler et al. 1999), discussions about the symbolic nature of primate calls usually center around the notion of signal arbitrariness. From a signaler’s point of view, alarm calls are not really arbitrary because nonhuman primates are predisposed from birth to produce alarm calls to some classes of events, such as “flying things,” and not others (Seyfarth and Cheney 1986). From a recipient’s point of view, however, alarm calls are entirely arbitrary, as demonstrated by research on interspecies alarm call recognition. Black-casqued hornbills (Ceratogymna atrata), for instance, discriminate between eagle and leopard alarm calls given by Diana monkeys, although there is nothing in the signal structure of the monkey alarm calls that implies the predator referred to by the calls (Rainey et al. 2004).
However, the one call-one meaning model of nonhuman primate communication is not always accurate. Similar calls are often given to a range of different and sometimes seemingly unrelated events, suggesting that recipients need to interpret the meaning of a call by making pragmatic decisions (Wheeler and Fischer 2012). For example, the most common call type in bonobos, the peep, is given by individuals in response to a wide range of social situations, as if to comment on the high significance of an event, rather than its nature (Clay et al. 2015), similar to human pointing. Also, many primates have unspecific alert calls that are given to a range of disturbances, including intraspecies conflicts, and terrestrial alarms are usually given to a range of animals, which can include nonpredators, suggesting that listeners need to rely on context to extract the exact meaning of a call (Arnold and Zuberbühler 2013).
7.4.3 Information About Caller Identity
Primate vocalizations are meaningful at multiple levels. For example, many call types carry individual acoustic signatures that enable receivers to identify the caller (Lemasson et al. 2005; Clay and Zuberbühler 2012). In chimpanzees, individuals recognize each other by their loud pant hoot vocalizations (Fig. 7.6) and can discriminate the calls of neighboring males from the calls of unknown stranger males (Herbinger et al. 2009). Pant hoots are different from all other calls within the chimpanzee repertoire, including rough grunts (given to food), in that they consist of four distinct units, at least one of which (the climax) carries over very large distances. Different units contain different information, including caller identity, age, rank, and behavioral context (e.g., arriving at a food tree versus traveling) (Fedurek et al. 2016).
Fig. 7.6
Chimpanzee pant hoots are acoustically complex, long-distance calls, mainly produced by the adult males. They consist of four acoustically distinct units: Introduction, Build up, Climax, and Let down. Each unit contains distinct information, including caller identity, social rank, age, and activity (travel versus food), as indicated with the checked pink boxes. (Modified from Fedurek et al. 2016)
Although there is widespread evidence for individually distinct calls in almost all primate communication systems that have been analyzed, it is important to point out that there are exceptions. For example, male gelada baboons (Theropithecus gelada) do not react more strongly to experimentally presented grunts of rival males (simulating their approach) compared to nonrival males, suggesting that they do not use these vocalizations to recognize other group members (Bergman 2010).
7.4.4 Call Sequences
Another line of research has found that the relevant information units in animal communication are sometimes not at the level of individual calls but can reside in sequences of calls (Kershenbaum et al. 2016). Primate examples of meaningful call sequences include the alarm call system of black-and-white Colobus monkeys in which sequence length correlates with predator type (Schel et al. 2009); putty-nosed monkey (Cercopithecus nictitans) alarm calls (Fig. 7.7) in which different call combinations encode predator class and travel intention (Arnold and Zuberbühler 2006, 2008); Campbell’s monkey alarm calls in which call combinations discriminate between predatory and nonpredatory dangers and also predator type (Ouattara et al. 2009a, b); and black-fronted titi monkeys (Callicebus nigrifrons) in which different call combinations encode predator class and location (Cäsar et al. 2013).
Fig. 7.7
Male putty-nosed monkeys produce two basic types of alarm calls: pyows (P) and hacks (H). Males produce different sequences of calls, including series of hacks (to eagles), series of pyows (to ground predators), and short pyow-hack combinations, consisting of one or a few pyows followed by a few hacks. The distance traveled refers to the group movement once a call sequence has been emitted. Call sequences can (yes) or cannot (no) contain pyow-hack combinations. Sequences with pyow-hack combinations consistently lead to more group travel than sequences without pyow-hack combinations, suggesting that males produce them to initiate group travel. (Reprinted with permission from Arnold and Zuberbühler 2006)
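The sequence categories described in this caption can be made concrete with a small sketch. It is an illustration only, not the coding scheme used by Arnold and Zuberbühler: it assumes that a call sequence is written as a string of "P" (pyow) and "H" (hack) symbols and that a pyow-hack combination is present whenever a pyow is immediately followed by a hack. The function names and example strings are hypothetical.

```python
def contains_pyow_hack(sequence):
    """Return True if any pyow ("P") is immediately followed by a hack ("H")."""
    return any(a == "P" and b == "H" for a, b in zip(sequence, sequence[1:]))


def describe(sequence):
    """Assign a call sequence to one of the broad categories named in the caption."""
    if contains_pyow_hack(sequence):
        return "pyow-hack combination"
    calls = set(sequence)
    if calls == {"H"}:
        return "hack series"
    if calls == {"P"}:
        return "pyow series"
    return "other"


# Invented example sequences, not recorded data.
for seq in ("HHHHH", "PPPP", "PPHHH"):
    print(seq, "->", describe(seq))
```

A rule this simple ignores timing and the number of calls in each part of the sequence, both of which would matter in a real analysis.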
Apart from the putty-nosed monkeys, it is still largely unclear whether these sequences have evolved specifically to convey meaning or whether they are a by-product of a caller’s changing perceptions as an event unfolds, something that needs to be addressed with targeted experiments (Schlenker et al. 2014). For apes, the songs of gibbons are of special interest, representing a vocal behavior with complex sequential structure by which the mated pair advertises social information relevant to neighboring individuals (Geissmann and Orgeldinger 2000; Geissmann 2002). Lar gibbons (Hylobates lar) also sing when encountering predators, and acoustic analyses have demonstrated that predator-induced songs and duet songs are assembled from the same song unit repertoire but with different syntactic structures (Clarke et al. 2006). Bonobos are another example of primates that produce acoustically variable calls when finding food. The different call variants are given in combinations, and the value of the food source determines the composition of the sequence, which is perceived and discriminated by others (Clay and Zuberbühler 2011). These empirical data have generated the hypothesis that some basic linguistic principles also apply to animal communication (Kershenbaum et al. 2016). Human language follows a number of linguistic laws that may also explain patterns in nonlinguistic animal communication (Schlenker et al. 2016). There is a considerable literature on this problem, already recognized by the pioneers of animal communication research (e.g., Sebeok 1977) and linguists interested in evolutionary questions (e.g., Hockett 1960).
One interesting problem is whether the patterns found in animal sound combinations are more similar to the notions of phonology or of syntax. In language, phonology refers to the process of forming meaningful units from meaningless sounds, an arguably simpler layer of combination than syntax, which refers to the combination of meaningful units. However, Collier and colleagues have reviewed examples of sound combinations in animal communication and concluded, surprisingly, that they are better explained as syntactic rather than phonological systems, suggesting that syntax evolved before phonology (Collier et al. 2014). Another linguistic principle, Menzerath’s Law (Cramer 2005), states that in linguistic structures there is a negative relationship between the number of syllables per word and the size of individual syllables. In an empirical study on male gelada baboons, the vocal sequence length negatively correlated with the duration of the composite calls, partly because call types were more abbreviated in longer sequences, suggesting that this principle is not restricted to linguistic constituents (Gustison et al. 2016).
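The kind of relationship predicted by Menzerath’s Law can be sketched with a short calculation. The example below is a hedged illustration, not the analysis of Gustison et al. (2016): the call durations are invented, the helper name is hypothetical, and a plain Pearson correlation between sequence length and mean call duration stands in for whatever statistical model a real study would use.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Invented data: each inner list holds the call durations (s) of one vocal sequence.
sequences = [
    [0.42],
    [0.38, 0.36],
    [0.33, 0.31, 0.30],
    [0.27, 0.29, 0.25, 0.26],
    [0.22, 0.24, 0.21, 0.23, 0.20],
]

lengths = [len(seq) for seq in sequences]
mean_durations = [sum(seq) / len(seq) for seq in sequences]
r = pearson_r(lengths, mean_durations)
print(f"r = {r:.2f}")  # a negative r is the pattern Menzerath's Law predicts
```

Because the made-up durations shrink as sequences get longer, the coefficient is negative here by construction; with real recordings the sign and strength of the relationship are empirical questions.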