
Monkeys can identify pictures from words

  • Elizabeth Cabrera-Ruiz ,

    Contributed equally to this work with: Elizabeth Cabrera-Ruiz, Marlen Alva

    Roles Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – review & editing

    Affiliations Department of Cognitive Neuroscience, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, México; Basic Neurosciences, Instituto Nacional de Rehabilitacion “Luis Guillermo Ibarra Ibarra”, Mexico City, México

  • Marlen Alva ,

    Contributed equally to this work with: Elizabeth Cabrera-Ruiz, Marlen Alva

    Roles Data curation, Formal analysis, Software, Visualization, Writing – review & editing

    Affiliation Department of Cognitive Neuroscience, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, México

  • Mario Treviño,

    Roles Formal analysis, Software, Validation, Writing – review & editing

    Affiliation Laboratorio de Plasticidad Cortical y Aprendizaje Perceptual, Instituto de Neurociencias, Universidad de Guadalajara, Guadalajara, Jalisco, México

  • Miguel Mata-Herrera,

    Roles Data curation, Formal analysis, Investigation

    Affiliation Department of Cognitive Neuroscience, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, México

  • José Vergara,

    Roles Formal analysis, Validation, Writing – review & editing

    Affiliation Department of Neuroscience, Baylor College of Medicine, Houston, TX, United States of America

  • Tonatiuh Figueroa,

    Roles Data curation, Software

    Affiliation Department of Cognitive Neuroscience, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, México

  • Javier Perez-Orive,

    Roles Validation, Writing – review & editing

    Affiliation Basic Neurosciences, Instituto Nacional de Rehabilitacion “Luis Guillermo Ibarra Ibarra”, Mexico City, México

  • Luis Lemus

    Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Software, Validation, Visualization, Writing – original draft

    lemus@ifc.unam.mx

    Affiliation Department of Cognitive Neuroscience, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, México

Abstract

Humans learn and incorporate cross-modal associations between auditory and visual objects (e.g., between a spoken word and a picture) into language. However, whether nonhuman primates can learn cross-modal associations between words and pictures remains uncertain. We trained two rhesus macaques in a delayed cross-modal match-to-sample task to determine whether they could learn associations between sounds and pictures of different types. In each trial, the monkeys listened to a brief sound (e.g., a monkey vocalization or a human word), and retained information about the sound to match it with one of 2–4 pictures presented on a touchscreen after a 3-second delay. We found that the monkeys learned and performed proficiently in over a dozen associations. In addition, to test their ability to generalize, we exposed them to sounds uttered by different individuals. We found that their hit rate remained high but more variable, suggesting that they perceived the new sounds as equivalent, though not identical. We conclude that rhesus monkeys can learn cross-modal associations between objects of different types, retain information in working memory, and generalize the learned associations to new objects. These findings position rhesus monkeys as an ideal model for future research on the brain pathways of cross-modal associations between auditory and visual objects.

Introduction

Humans form cross-modal associations (CMAs) between sounds and images, which play a vital role in integrating semantic representations within language [1]. Supporting this, fMRI studies have shown that the temporal lobe of the human brain is actively involved in CMAs [2, 3] between words and visual objects [4]. It is believed that CMAs between phonological "templates"—developed in human infants by listening to caretakers—and observed objects are essential for creating semantic representations and aiding the production of a child’s first words [5–8]. Similarly, auditory templates have been proposed as a mechanism for vocal production in birds [9–13] and marmoset monkeys [14]. Recent studies, such as those by Carouso-Peck and Goldstein [15, 16], have also shown that visual signals during social interactions can influence vocal production in birds. However, only a few ethological studies have suggested the existence of CMAs between vocal sounds and visual cues for semantic communication [17]. For instance, research has observed that vervet monkeys respond to calls signaling the presence of predators by looking upwards, downwards, or climbing into trees [18].

Neurophysiological recordings in monkeys have shown that the prefrontal cortex (PFC)—a brain area homologous to that in humans—utilizes working memory (WM) circuits [19] to perform CMAs between voices and faces [20–32], receiving inputs from various sensory regions [33–36]. CMAs have also been observed in the auditory and visual areas of the temporal lobe [37–45]. Notably, trained macaques have demonstrated the ability to perform cross-modal discriminations between visual and tactile objects [46, 47], between stimuli that could be considered non-ethologically relevant (NER), such as pitch and color [48], and between amodal information (i.e., information that does not belong to a particular modality) [49], such as numerosity [50] and flutter frequencies [51–54]. However, it remains to be explored whether non-human primates can establish CMAs between NER stimuli that are important for human language, such as words, which monkeys can discriminate phonetically [55, 56], and pictures.

Therefore, to assess whether monkeys can form CMAs between NER stimuli, we trained two rhesus macaques in a delayed crossmodal match-to-sample task (DCMMS). We specifically designed the task to temporally separate the auditory and visual stimuli, thus engaging WM circuits to retain one modality in mind while awaiting the corresponding cross-modal stimulus. Unlike prior studies, this task required the monkeys to retain auditory information during a 3-second WM period and then use this information to select the matching picture from a set of 2–4 displayed simultaneously on a screen after the delay.

Our results show that rhesus monkeys can accurately identify sounds produced by various emitters and match them with images despite the temporal gap, highlighting the crucial role of WM circuits not only for storing information but also for actively evaluating the equivalence between stimuli of different sensory modalities. This finding suggests substantial similarities with human cognitive processing in analogous tasks [57, 58] and paves the way for future neurophysiological studies focused on identifying the specific brain pathways and mechanisms involved in these cross-modal processes.

Materials and methods

Ethics statement

Animal welfare was a priority throughout the study, which was conducted in strict accordance with the recommendations of the Official Mexican Norm for the Care and Use of Laboratory Animals (NOM-062-ZOO-1999). The protocol was approved by UNAM’s IACUC (i.e., Comité Institucional para el Cuidado y Uso de Animales de Laboratorio; CICUAL; Protocol number: LLS200-22). Descriptions comply with the ARRIVE recommended guidelines [59]. The portrait of one of the authors of this manuscript was used in the experiments, and this author gave written informed consent to publish these case details.

Subjects

Two adult rhesus monkeys (Macaca mulatta), a 10-year-old female (monkey G, 7 kg) and a 12-year-old male (monkey M, 12 kg), participated in the experiments. The animals had no previous training in any other task and were not subjected to any surgery or head restraint for this behavioral study. We adhered to the 3R principles (Replacement, Reduction, Refinement) [60]; therefore, statistical significance was achieved through the number of trials each monkey performed rather than through the number of animals employed. The monkeys were housed in cages in a temperature-controlled room (22°C) with filtered air and day/night light cycles. They had free access to a balanced diet of dry food (pellets) supplemented with nuts, fresh fruits, and vegetables. Regular weight monitoring and veterinary check-ups ensured their health and well-being. The monkeys also had access to an enriched environment with toys, a recreation area for climbing and socializing with other monkeys four days a week, and opportunities for grooming through mesh sliding doors. In addition, cartoons and wildlife videos with content unrelated to the experiments were presented on TV for no more than four hours a day. Note, however, that the face and voice of one of the researchers with whom the monkeys interacted were used during the experiments. To motivate participation in the experiments, the monkeys followed a water restriction protocol for 12–15 hours before experimental sessions (Monday to Friday, with water intake of 20–30 ml/kg achieved during the experimental sessions and ad libitum on weekends). After the 2–3-hour experimental sessions, they received 150 g rations of fruits and vegetables.

Experimental setup

The monkeys were trained to leave their cages and sit in a primate chair (Crist Instrument, Inc.) for transfer to a soundproof booth adjacent to the vivarium for the experiments. The chair faced a touchscreen (ELO 2201L LED Display E107766, HD wide-aspect ratio 22 in LCD) positioned 30 cm in front. A spring lever below the touchscreen (ENV-610M, Med Associates) allowed the monkeys to initiate the trials. Two speakers were mounted above the touchscreen: a Yamaha MSP5 Studio (40 W, 0.050–40 kHz) and a Logitech speaker (12 W, 0.01–20 kHz). These speakers delivered the sounds and background noise at 45 and 55 dB SPL, respectively. The monkeys received liquid rewards through a stainless-steel mouthpiece attached to the chair (Reward delivery system 5-RLD-E2-C Gravity feed dispenser, Crist Instrument, Inc.).

Acoustic stimuli

The experiment utilized a variety of sounds, including laboratory recordings of words and monkey vocalizations, as well as free online recordings of cow vocalizations (https://freesound.org/). The sounds were edited to a duration of 500 ms, resampled to 44.1 kHz (with cutoff frequencies of 0.1–20 kHz), and finally RMS-normalized with Adobe Audition® 6.0 software. The phonetic labels of the Spanish words in the text and figure legends were created using the Automatic Phonetic Transcription tool by Xavier López Morras (http://aucel.com/pln/transbase.html).
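
For readers who prefer a scripted version of this preprocessing, the sketch below reproduces the same steps in Python. It is a sketch only, not the pipeline used in the study (which relied on Adobe Audition); the file names and the target RMS level are hypothetical.

```python
# Illustrative only: the study performed these steps in Adobe Audition 6.0. This
# Python sketch mirrors them (trim to 500 ms, resample to 44.1 kHz, 0.1-20 kHz
# band limit, RMS normalization); file names and target RMS are hypothetical.
import numpy as np
import soundfile as sf
from scipy.signal import butter, resample_poly, sosfiltfilt

def preprocess(in_path, out_path, fs_out=44100, dur_s=0.5, target_rms=0.1):
    x, fs_in = sf.read(in_path)                    # load the raw recording
    if x.ndim > 1:
        x = x.mean(axis=1)                         # collapse stereo to mono
    x = resample_poly(x, fs_out, fs_in)            # resample to 44.1 kHz
    x = x[: int(dur_s * fs_out)]                   # edit to a 500-ms duration
    sos = butter(4, [100, 20000], btype="bandpass", fs=fs_out, output="sos")
    x = sosfiltfilt(sos, x)                        # 0.1-20 kHz cutoff frequencies
    x *= target_rms / np.sqrt(np.mean(x ** 2))     # RMS normalization
    sf.write(out_path, x, fs_out)

# preprocess("coo_raw.wav", "coo_500ms.wav")       # example call, hypothetical files
```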

Visual stimuli

The visual stimuli consisted of a red oval, grayscale cartoons of cows and monkeys, and pictures of human, monkey, and cow faces circumscribed in ovals, at a resolution of 200 px per square inch. Animal pictures used in the experiment were downloaded from free online sites and customized. However, the pictures shown in the figures and supplementary information are similar but not identical to the original images used in the study; they were created for illustrative purposes only using an online AI image generator (https://www.fotor.com/ai-art-generator).

Delayed crossmodal match-to-sample task

We trained two rhesus macaques in a DCMMS task to assess their ability to establish CMAs between temporally decoupled sounds and images. Each trial began with a 1° white cross appearing in the center of the touchscreen. In response to the cross, the monkeys had to press and hold down a lever so that a 0.5-second reference sound could be delivered. After hearing the sound, the animals had to wait through a 3-second delay until 2–4 pictures were presented simultaneously at random but equidistant positions on a 4° radius from the center of the touchscreen. The monkeys were then allowed to release the lever and select, within a 3-second response window, the picture that matched the sound (S1 Video). Correct selections were rewarded with a drop of liquid.
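
To make the trial timeline concrete, the following minimal simulation steps through one trial with the timings given above (0.5-second sound, 3-second delay, 2–4 pictures, 3-second response window). It is illustrative only: the actual task was implemented in LabVIEW and drove a lever, speakers, a touchscreen, and a reward dispenser, and the simulated choice, reaction time, and picture angles here are placeholders.

```python
# Schematic, fully simulated walk-through of one trial's timeline; the real task ran
# in LabVIEW on dedicated hardware. Choice, reaction time, and angles are stand-ins.
import random
import time

def simulate_trial(n_pictures=4, p_correct=0.85):
    print("cross on -> monkey presses and holds the lever")
    time.sleep(0.5)                               # 0.5-s reference sound
    print("reference sound played")
    time.sleep(3.0)                               # 3-s working-memory delay
    angles = random.sample(range(0, 360, 90), n_pictures)
    print(f"{n_pictures} pictures shown at angles {angles} (4-deg radius)")
    rt = random.uniform(0.4, 1.2)                 # stand-in reaction time (s)
    time.sleep(rt)                                # response window is 3 s
    hit = random.random() < p_correct             # stand-in for the monkey's choice
    print("hit: reward delivered" if hit else "false alarm: no reward")
    return hit, rt

simulate_trial()
```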

After the monkeys learned the task (see the monkeys’ training section below), they were able to perform different CMAs. Each CMA was established by associating a sound with a picture representing the same category of external stimulus (e.g., both corresponding to a human). For example, a CMA of the type ‘human’ consisted of the association between the word [si] and a human face. In this way, CMAs of different types were created (e.g., monkey, cow, human, and color). In some cases, the monkeys associated a single sound with several pictures of the same type; for example, four monkey faces were associated with one ‘coo’, resulting in four ‘monkey’ CMAs (S1 Table). Each CMA in S1 Table was established by the monkeys after many sessions of practice (see the following methods sections). However, in an experimental condition that we designated the ‘perceptual invariance experiment’, we explored the monkeys’ ability to recognize sounds uttered by individuals they had not heard before the experiment. For example, a ‘monkey’ CMA substitution set comprised ten different coos (i.e., auditory versions uttered by different individuals) delivered randomly in different trials, but all those trials presented the same monkey picture as the match. Finally, all experimental sessions consisted of blocks of ~300 trials of intermixed CMAs. The hit rate (HR) corresponds to the proportion of correct responses (i.e., audio-visual matches) in a session; false alarms (FA) indicate the proportion of incorrect responses. Reaction times (RT) are the times to release the lever in response to the appearance of the pictures on the touchscreen. Motor times are the intervals between the lever release and the touching of the screen. The task was programmed using LabVIEW 2014 (64-bit SP1, National Instruments®). The artwork in the task description was created using a free online platform (https://www.fotor.com/ai-art-generator).
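
As a compact illustration of how these measures relate, the sketch below computes per-CMA hit rate, false-alarm rate, and median reaction and motor times from a hypothetical per-trial table; the column names and the pandas-based layout are assumptions for illustration, not the study’s analysis code.

```python
# Minimal sketch of the performance measures defined above, computed from a
# hypothetical per-trial log; the column names are illustrative, not the study's.
import pandas as pd

def session_measures(trials: pd.DataFrame) -> dict:
    """Assumed columns: 'cma', 'correct' (bool), and the event times (in seconds)
    't_pictures_on', 't_lever_release', 't_screen_touch'."""
    out = {}
    for cma, g in trials.groupby("cma"):
        hits = g["correct"].astype(bool)
        rt = g["t_lever_release"] - g["t_pictures_on"]    # reaction time
        mt = g["t_screen_touch"] - g["t_lever_release"]   # motor time
        out[cma] = {
            "hit_rate": hits.mean(),                      # proportion of correct matches
            "false_alarm_rate": 1.0 - hits.mean(),        # proportion of wrong pictures
            "median_rt_hits": rt[hits].median(),
            "median_mt_hits": mt[hits].median(),
            "n_trials": len(g),
        }
    return out
```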

Monkeys’ training

To enhance the monkeys’ speed and efficiency in learning the DCMMS task, we tailored stimuli, durations, and rewards according to their ongoing performance. Initially, the animals were trained to produce the motor responses necessary for the task, such as pressing and releasing a lever and consistently activating the touchscreen. Rewards were given for holding down the lever when the cross appeared at the center of the touchscreen and for releasing the lever and touching the screen upon its disappearance. After the subjects completed more than 90% of the trials in consecutive sessions, we introduced a gray filled circle on the touchscreen that appeared at random positions, requiring the monkey to touch it to receive a reward. Within one or two weeks, the animals consistently reacted to the cross within a 500 ms window of appearance, maintained the lever pressed for 5–7 seconds, and released it upon the disappearance of the cross to touch the visual target.

In the subsequent training phase, the monkeys were required to respond to a tone (a 0.5-second, 440 Hz tone at 55 dB SPL) randomly emitted from speakers on either side of the screen. The goal was to indicate the direction of the sound by touching a right or left circle on the screen, which appeared first simultaneously with the tone and later after a gradually increasing delay (from 1 to 3 seconds). Here, the objective was for the monkeys to associate the auditory and visual locations. However, after more than 35,000 trials (i.e., ~117 sessions), performance remained at chance level. Consequently, we adopted a new approach that involved helping the monkeys to directly associate audio cues with specific images.

We replaced one circle with a cartoon image of a cow and added a 0.5-second broadband noise, so each trial featured either the tone or the noise. Rewards were given for correctly associating the cow cartoon with the broadband noise and the gray circle with the 440 Hz tone. From then on, sounds were delivered exclusively from a central speaker above the screen, and pictures appeared at different positions but were consistently separated by 180° of visual angle from each other. With this new training method, it took only a few sessions for Monkey G to begin performing above chance in associating the broadband noise with the cow cartoon (S1 Fig, upper leftmost panel). With many additional practice sessions, performance improved well above chance, prompting us to gradually introduce new sounds and images to establish various CMAs. The initial CMAs involved only two different pictures on the touchscreen, while more complex associations involved the simultaneous presentation of three or four pictures.

Learning measurements

Although the primary goal of our experiments was not to explore the learning process of macaques, we noted behavioral improvements across sessions, which we aimed to document. To quantify this, we fitted learning curves to the performance for each CMA across sessions, thereby assessing the monkeys’ learning progress. For this analysis, we applied the Rescorla-Wagner model, a well-established framework in associative learning [61], which explains learning as the formation of associations between conditioned and unconditioned stimuli. Deriving the learning curves required solving the following ordinary differential equation:

dV/dt = αβ(λ − V)     Eq (1)

This equation describes the progression of associative strength (V) in response to the trained conditioned stimuli as a function of the number of training trials (t). Fitting the model provided the parameters of this equation: αβ, the product of the salience of the conditioned stimuli and the strength of the unconditioned stimuli (assumed constant during training, though modifications are possible [62]), and λ, the maximum possible associative strength towards the unconditioned stimulus. From the learning curves derived from this model, we extracted three additional parameters. Y0 measured initial performance, representing the starting point of the curve along the Y-axis. The parameter γ, indicating statistical learning onset, was determined as the first session in which performance reliably exceeded chance, defined as surpassing the mean probability of a correct response under a binomial distribution by two standard deviations (where p = chance level and n = average number of trials per session). Finally, the derivatives of these learning curves, coupled with predefined thresholds, allowed us to determine the ‘trend-to-optimal’ experimental session for each CMA (δ), marking the session at which changes in performance from one session to the next no longer exceeded a designated minimal rate of improvement of y’ = 0.01, indicating an approach towards a learning plateau.
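
The sketch below illustrates this procedure under stated assumptions: the study’s analyses were run in MATLAB, the per-session hit-rate array and trial count are hypothetical, the fit uses the closed-form solution of Eq (1), and γ and δ are computed as defined above (chance plus two binomial standard deviations, and a fitted improvement rate below y’ = 0.01).

```python
# Illustrative Python sketch of the learning-curve analysis described above.
# The study's analyses were run in MATLAB; the inputs here are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def rescorla_wagner(t, y0, ab, lam):
    """Closed-form solution of Eq (1), dV/dt = ab*(lam - V), with V(0) = y0."""
    return lam + (y0 - lam) * np.exp(-ab * t)

def learning_parameters(hit_rate, n_trials_per_session, chance=0.5):
    """hit_rate: per-session proportion correct for one CMA."""
    hit_rate = np.asarray(hit_rate, dtype=float)
    sessions = np.arange(len(hit_rate))
    (y0, ab, lam), _ = curve_fit(rescorla_wagner, sessions, hit_rate,
                                 p0=[chance, 0.1, 0.9],
                                 bounds=([0, 0, 0], [1, 5, 1]))
    fitted = rescorla_wagner(sessions, y0, ab, lam)

    # gamma: first session whose hit rate exceeds chance by 2 binomial SDs
    sd = np.sqrt(chance * (1 - chance) / n_trials_per_session)
    above = np.where(hit_rate > chance + 2 * sd)[0]
    gamma = int(above[0]) if above.size else None

    # delta ('trend-to-optimal'): first session where the fitted session-to-session
    # improvement falls below the minimal rate y' = 0.01
    dv = ab * (lam - fitted)                 # analytic derivative of Eq (1)
    below = np.where(dv < 0.01)[0]
    delta = int(below[0]) if below.size else None
    return {"Y0": y0, "alpha_beta": ab, "lambda": lam, "gamma": gamma, "delta": delta}
```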

Statistical analysis

We focused most of our analyses on data collected post-training, after the monkeys’ performance reached an asymptotic level, with their choices consistently exceeding the chance level. We used various statistical tests to compare RTs across different conditions. These included Spearman rank correlations to test the relationship between reaction time distributions and the number of pictures on the touchscreen, and a Kruskal-Wallis test for differences between CMAs. If the Kruskal-Wallis test indicated a significant difference, we followed up with Mann-Whitney tests to compare conditions such as trials having 2 or 3 pictures when the same sound was presented. Finally, Bonferroni corrections were applied to the post hoc multiple comparisons. The monkeys’ chance performance threshold depended on the number of pictures displayed: for Monkey M, chance was 0.5 since it performed only with two-picture sets, while for Monkey G it was 0.25 with four-picture sets. Analyses were performed using MATLAB R2022 (MathWorks).
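
A minimal sketch of this comparison pipeline is shown below, using SciPy equivalents of the tests named above (the study used MATLAB); the input format and variable names are assumptions for illustration.

```python
# Sketch of the RT comparisons named above using SciPy equivalents of the MATLAB
# tests; the dictionary-of-arrays input format is an assumption for illustration.
from itertools import combinations
from scipy import stats

def compare_rts(rts_by_cma, alpha=0.05):
    """rts_by_cma: {cma_label: array of hit RTs} for trials sharing the same sound."""
    h, p_kw = stats.kruskal(*rts_by_cma.values())      # omnibus test across CMAs
    results = {"kruskal": (h, p_kw), "pairwise": {}}
    if p_kw < alpha:                                   # follow up only if significant
        pairs = list(combinations(rts_by_cma, 2))
        bonf = alpha / len(pairs)                      # Bonferroni-corrected threshold
        for a, b in pairs:
            u, p = stats.mannwhitneyu(rts_by_cma[a], rts_by_cma[b],
                                      alternative="two-sided")
            results["pairwise"][(a, b)] = (u, p, p < bonf)
    return results

def rt_vs_set_size(rts, n_pictures):
    """Spearman rank correlation between single-trial RTs and the picture count (2-4)."""
    return stats.spearmanr(rts, n_pictures)
```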

Results

To investigate the ability of two rhesus monkeys to form CMAs between auditory and visual stimuli, we engaged them in a DCMMS task. Each trial commenced with the monkeys hearing a reference sound, followed by a 3-second delay, after which 2–4 pictures were displayed on the screen. Their task was to identify on the touchscreen the picture that corresponded to the sound (Fig 1A). The monkeys mastered fourteen CMAs after associating six distinct sounds—including broadband noise, animal vocalizations like a coo and a moo, and words such as [‘tsan. gi], [si], and [‘ro. xo]—with fourteen images (S1 Table). Trials presented 2–4 pictures for Monkey G and consistently 2 pictures for Monkey M. Illustrative examples of four CMAs are depicted in Fig 1B. Fig 1C and 1D show the monkeys’ hit rate (HR) and false alarm rate (FA) across these CMAs (S2 Table). For example, when a coo sound was used as the reference, Monkey G correctly matched it with the monkey face 87.43% of the time, while its most frequent incorrect choice was the cow face, selected 5.94% of the time (Fig 1C, open boxplots). Overall, Monkey G exhibited an HR of 85.12% ± 9.11 (mean ± SD), and Monkey M achieved an HR of 87.07% ± 5.71. Statistical analysis showed no bias in their selection of specific positions on the touchscreen (one-way ANOVA with multiple pairwise comparisons; Tukey’s HSD, p < 0.05) (S2 Fig). These outcomes indicate that both monkeys proficiently learned to discriminate each sound against 2 to 4 pictures.

Fig 1. Delayed crossmodal match-to-sample task.

(A) Task Events. A trial begins with the monkey pressing a lever in response to a cross appearing in the center of the touchscreen. This is followed by a 0.5-second reference sound, succeeded by a 3-second delay. After the delay, 2–4 pictures are simultaneously presented on the touchscreen. The monkey must then release the lever and touch the picture that matches the sample sound to receive a reward. LP indicates lever press. (B) Examples of Crossmodal Associations. Each column displays a CMA between a sound, represented visually by its sonogram and spectrogram, and a picture. The sounds, marked in black, include two Spanish words (in IPA notation) and vocalizations of a monkey and a cow. (C) HR (closed boxplots) and FAs (open boxplots) during the presentations of the CMAs shown in B. The dashed line indicates performance at chance level (i.e., 25% for sounds discriminated against four pictures). The reference sound is labeled in red at the top of the graph. (D) Same as in C, but for Monkey M. The dashed line is set at the 50% chance level (i.e., two pictures on the screen). The pictures are similar but not identical to the original images used in the study and are therefore for illustrative purposes only.

https://doi.org/10.1371/journal.pone.0317183.g001

Rhesus monkeys can learn cross-modal associations between stimuli of different types

The monkeys successfully established each CMA after several sessions of engaging in the DCMMS task, during which we initially presented two pictures simultaneously, only one of which corresponded to the played sound. To investigate the learning dynamics, we measured four learning parameters derived from fitting simple associative learning curves to the performance data across sessions. These parameters included the HR in the first session (y0) and the sessions marking statistical learning (γ), increasing learning (δ), and the asymptote of learning (λ) (refer to Methods for detailed descriptions). The left panel in Fig 2A illustrates Monkey G’s performance for the CMA between the coo sound and the monkey cartoon across sessions. Initially, performance before learning was at chance level (~300 trials; see the Methods sections on monkeys’ training and learning measurements), corresponding to the intersection of the learning curve (black line) with the Y-axis, termed Y0. Subsequently, the γ performance level was reached eight sessions after Y0 (~2700 trials); this level is defined as the session when the HR was above chance, marked by the intersection between the left edge of the gray box and the learning curve. A consistent increase in HR continued until the 15th session, reaching the δ performance level (right edge of the gray box), and by approximately the 40th session, performance stabilized at the λ level, where changes in performance from one session to the next were insignificant. Similarly, the middle and right panels in Fig 2A show two CMAs learned in trials ending with 3 and 4 pictures, respectively.

Fig 2. Learning CMAs in monkeys.

(A) Monkey G’s learning progress for three CMAs across sessions with trials presenting 2, 3, or 4 pictures simultaneously on the screen. The black line represents the average performance across sessions, while the blue line maps the first derivative of performance over training sessions (y’ values), illustrating the rate of change at each session. The initial HR (Y0) was near chance level (indicated by the black line at the ordinates), followed by γ (the left edge of the gray box), where the HR statistically exceeded chance. The learning parameter δ marks a period when HR increased consistently above chance, culminating in a performance plateau at the session denoted by the asymptote of learning, λ. (B) Sessions before δ for each CMA. (C) Average performance of Monkey G across all CMAs over the sessions. (D) Same as in C, but for Monkey M. The pictures are similar but not identical to the original images used in the study and are therefore for illustrative purposes only.

https://doi.org/10.1371/journal.pone.0317183.g002

S1 Fig shows how performance evolved for each CMA across sessions in Monkey G. In addition, the number of sessions needed to reach sustained performance (i.e., the δ parameter) decreased for most new CMAs as the monkeys learned the aim of the task (Fig 2B). However, for the ‘color’ CMA formed by the word [ro. xo] (Spanish for ’red’) and the red oval, Monkey G required ~14 sessions to reach δ in the condition where four pictures appeared on the screen. We interpret this increase in learning sessions as the result of introducing those stimuli for the first time in trials that presented four pictures on the screen. Finally, Fig 2C and 2D present the mean HR for all CMAs across sessions for both monkeys. We interpret the reduction in γ and δ as reflecting the monkeys’ mastery of the cognitive control of the motor behavior required for the task (procedural memory), e.g., pressing and releasing the lever and interacting correctly with the touchscreen, so that once this was accomplished, the animals could focus solely on learning the CMA associations.

The RTs increased as a function of the selected picture and number of pictures on the touchscreen

To explore how different sounds and pictures influenced the monkeys’ ability to find a cross-modal match, we analyzed the RTs during hits across various CMAs. Fig 3A displays Monkey G’s RT and motor time (MT) distributions across four CMAs. Notable differences are observed between the RT distributions, which pertain to the decision-making period (i.e., the time taken to decide which picture on the touchscreen matches the sound before releasing the lever). In contrast, the MT distributions, which relate to the stereotyped arm movement toward the chosen picture, showed no differences.

Fig 3. Crossmodal associations influenced the monkeys’ reaction times.

(A) Cumulative probabilities of reaction and motor times across four CMAs. (B) Left panel, pie charts displaying hit rates in sets presenting three CMAs. In all trials, the reference sound was consistently a "coo," but the match in each session was one of the four monkey pictures. Hits are depicted in colors, while false alarms (FAs), which occurred when the monkey chose a non-matching picture, are shown in gray or white. Right panel, reaction time (RT) distributions of hits are illustrated with the same color coding as in the left panel. Inset, FA distributions produced in trials where one of the four monkey pictures was presented as a match, but a picture of a ‘human’ or a ‘cow’ was selected. (C) Same format as B but for ’cow’ CMAs. (D) The standard deviations (STDs) of the RT distributions increased as a function of their means during hits, false alarms (FAs), and in trials with two, three, or four pictures on the screen. (E) Plot of the monkeys’ HRs as a function of the mean RTs of the hit distributions in D.

https://doi.org/10.1371/journal.pone.0317183.g003

To assess whether acoustic or visual information primarily influenced the monkeys’ RT distributions, we analyzed RTs to different pictures associated with a single sound (S3 Table) in trials presenting 3 pictures simultaneously. For instance, Fig 3B shows Monkey G’s RT distributions (right panel) during correct responses to various pictures of the type ‘monkey’ (left panel) associated with a single ‘coo’. Fig 3C shows the same for pictures of the type ‘cow’ associated with a single ‘moo’ sound. The RT distributions differed significantly in both instances (p < 0.001, Kruskal-Wallis test), indicating that since the sounds were constant, the differences in RTs must have stemmed from variations among the pictures. This trend continued across all CMAs where different pictures were associated with the same sound (p < 0.001 for all comparisons, post hoc Mann-Whitney U tests with Bonferroni correction); pairwise comparisons between all pictures for each sound revealed significant differences in RT distributions (p < 0.01 for 71.43% of coo comparisons, 76.19% of moo comparisons, and 82.14% of [si] comparisons; Mann-Whitney U tests with Bonferroni correction). A similar effect is observed for FAs, as shown in Fig 3B (insets), where the differences in RTs resulted from incorrect matches (p < 0.001, Kruskal-Wallis test).

Furthermore, Fig 3D shows that both the mean and the standard deviation (STD) of the RT distributions increased with the number of pictures displayed on the screen (2–4 pictures), indicating that locating the crossmodal match took longer as the number of distractor pictures increased. This tendency aligns with Weber’s law and with studies of time processing [63]. We interpret the variation in STDs as suggesting that the faster RTs likely occurred when the matching picture was found first among the pool of pictures on the screen, and the longer RTs when the match was found last. Notably, these variations in RTs did not affect accuracy across the different CMAs (Fig 3E). These findings imply that RT was more heavily influenced by the amount of visual information processed than by differences in sounds.
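
As a concrete illustration of this scalar relationship, the following minimal sketch regresses the standard deviation of hit RTs on their mean across set-size conditions; the input dictionary is a hypothetical stand-in, not the study’s data or analysis code.

```python
# Minimal illustration of the scalar relationship noted above: regress RT variability
# on mean RT across conditions. The input (hit RTs keyed by the number of pictures
# on the screen) is a hypothetical stand-in for the study's data.
import numpy as np

def rt_scalar_property(rts_by_set_size):
    """rts_by_set_size: {2: array of hit RTs, 3: ..., 4: ...}."""
    means = np.array([np.mean(v) for v in rts_by_set_size.values()])
    stds = np.array([np.std(v, ddof=1) for v in rts_by_set_size.values()])
    slope, intercept = np.polyfit(means, stds, 1)   # linear STD-vs-mean relation
    return slope, intercept                         # slope approximates a Weber fraction
```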

The monkeys recognized sounds uttered by different speakers

We explored whether the monkeys could recognize sounds of the types they had learned but uttered by different individuals they had not heard before (Fig 4A, S4 Table). Fig 4B shows that Monkey G performed above the 25% chance level in 98.33% of cases (paired-sample t-test, p < 0.05). Notably, the RTs during correct responses grouped by the four CMAs (i.e., pictures) used in this experiment rather than by the new sounds (Fig 4C). Fig 4D shows that Monkey M presented a similar effect in 3 CMAs, performing above the 50% chance level in trials with only two pictures on the touchscreen (i.e., in 72.22% of the versions; paired-sample t-test, p < 0.05) and distributing RTs by picture category (Fig 4E), further supporting the notion of auditory invariance. In other words, regardless of variations in the sounds, the animals could recognize them. Altogether, our findings suggest that monkeys can perform CMAs based on the ability to perceive equivalences among different sounds of the same type.

Fig 4. Monkeys recognized sounds uttered by different individuals.

(A) Spectrograms from various speakers depicting the Spanish word [’ro. xo] (red). The spectrogram of the learned sound is on the left. (B) Hit rate of Monkey G for all sound versions. Closed boxes on the left represent the HR for the learned sounds (L). Open boxes, the HR for the different versions. Closed boxes on the right of each group correspond to the HR for version sets that included double repetitions of some sounds, including L. (C) Cumulative density functions of the RTs for the learned sounds (bold lines) of Monkey G and their versions. Notice how the distributions group by picture category rather than by sound. (D) Same as in B, but for Monkey M. (E) Same as C, but for Monkey M. The pictures are similar but not identical to the original images used in the study and are therefore for illustrative purposes only.

https://doi.org/10.1371/journal.pone.0317183.g004

Discussion

To investigate if rhesus monkeys can associate sounds with images regardless of their ethological relevance, we engaged two of these primates in a DCMMS task. To solve the task, monkeys had to retain in WM either an auditory replay or a crossmodal equivalent of the sounds (i.e., a face) and compare the memory against different pictures to find the match. Evaluation of their performance across various tests yielded two main outcomes: 1) the monkeys adeptly formed associations between sounds (e.g., animal vocalizations, words) and pictures (e.g., faces, cartoons), demonstrating human-like word-object associations that form the basis of language (Figs 1 and 2, S1 Table), and 2) these associations generalized even when the vocalizations and words they learned were uttered by different voices (Fig 4, S4 Table). Subsequent sections will detail these findings and explore the potential mechanisms to establish CMAs.

Rhesus macaques create crossmodal associations between sounds and images of different types

Previous studies demonstrated that monkeys could perform crossmodal discriminations of supramodal information such as numerosity and flutter frequencies [50–54] and learn and group numerous sounds into categories irrelevant to their ethology [55, 56]. However, establishing cross-modal associations between NER categories in monkeys has proved to be challenging [37, 46–48]. In training two rhesus monkeys in the DCMMS task, we initially encountered hurdles as the monkeys tended to disregard sounds [64, 65]. To counter this, training began with sound detection and progressively moved to crossmodal associations. We obtained different learning parameters from the monkeys’ performances in each CMA across sessions (Fig 2).

During the initial training phase, the monkeys learned to interact with the task’s apparatus (i.e., pressing the lever and touching the screen), achieving controlled motor responses within one or two weeks. Learning the first CMA (i.e., a broadband noise paired with a cow cartoon) required many sessions. Subsequent CMAs reached statistically significant performance in just a few sessions; however, the animals only excelled at the task after many practice sessions. We found no clear evidence that learning CMAs that included potentially ethologically relevant stimuli, such as human and monkey faces or coos [20–31], was facilitated more than learning other CMAs to which the monkeys had no previous exposure. In other words, the animals learned all CMAs at similar rates, providing behavioral data that could be highly informative regarding the brain responses underlying CMAs. Future neurophysiological evidence could build on these behavioral findings.

Three of our results align with the idea that CMAs could be created from templates [5–12]: 1) the monkeys learned each new CMA faster than the previous ones; 2) mastering a CMA required a prolonged period, akin to learning to speak in humans; and 3) the animals’ performance remained consistently high when the same vocalizations or words were presented in different voices, suggesting that the acoustic variations activated auditory templates, similar to how formants in words trigger acoustic recognition in monkeys [55]. Similarly, our results suggest that visual templates could create perceptual equivalence among different faces of the same type (Fig 3B). This is the strongest evidence to date supporting the possibility that monkeys can connect auditory and visual templates as humans do.

The formation of supramodal circuits linking vocalizations with other motor behaviors [12, 13] has suggested that the integration process in nonhuman primates (NHPs) might similarly involve motor and spatial associations across sensory modalities [66, 67]. In our task, such associations were unnecessary since the animals had to match a sound with the corresponding picture, which was presented at different locations on every trial. Moreover, studies exploring the convergence of crossmodal information in WM [22–27, 48] indicate that while motor or spatial associations may facilitate initial learning, more abstract associations such as numerosity or flutter [50, 54], extending beyond immediate and innate categories, can be developed through direct CMAs. Therefore, the monkeys performing our task could have created direct connections between auditory and visual templates.

Working memory mechanisms for crossmodal matching

In contrast to other tasks [21, 37], our monkeys had to retain information about the sounds over a 3-second delay and use it to compare against different pictures until they found a match, similar to previous work on the intra- and cross-modal discrimination of flutter [51–54]. Given that the animals performed above chance in all CMAs, and strategies such as selecting a particular picture or location cannot explain their performance (S2 Fig), we conclude that the most parsimonious explanation is the cross-modal matching of sounds and pictures. In other words, the monkeys must have retained information about the sounds in WM to find the cross-modal match presented 3 seconds later. A candidate brain region for the type of WM involved in our task is the PFC [19], which participates in retaining parametric and nonparametric information of different sensory modalities compared intra- or cross-modally [2, 3, 20–31]. Notably, the PFC is also responsible for intramodal associations of stimuli separated in space and time [50]. Therefore, it is probably capable of translating information cross-modally; in our task, this could involve invoking visual representations after hearing the sounds and retaining that visual information in WM for later comparison with the pictures, rather than keeping the reference sound in WM until the pictures appeared.

On the other hand, it is well documented that, in the context of CMAs, the PFC is activated by ethologically relevant stimuli such as conspecific faces and voices even in monkeys not engaged in their active recognition [26, 42]. This suggests that ethologically relevant circuits could be established there from birth [26, 31]. Therefore, active cross-modal discrimination and the learning of CMAs between non-ethological stimuli may occur in other areas of the temporal lobe, known to represent and integrate auditory and visual objects [37–45] and to show activations to superimposed audiovisual stimuli [37], perhaps to facilitate the recognition of individuals within their social group [26]. However, only future neurophysiological experiments in monkeys trained in the DCMMS task will reveal not only how and where in the brain non-ethological auditory and visual categories are learned, stored, and associated cross-modally, but also whether auditory or visual images invoked by sounds are retained in WM during the resolution of the task.

Supporting information

S1 Video. Monkey G performing the DCMMS task.

https://doi.org/10.1371/journal.pone.0317183.s001

(MP4)

S1 Table. Monkeys’ learning parameters and hit rate.

https://doi.org/10.1371/journal.pone.0317183.s002

(PDF)

S2 Table. Overall hit rate (mean ± STD) in four CMAs.

https://doi.org/10.1371/journal.pone.0317183.s003

(PDF)

S3 Table. The proportion (mean ± STD) of pictures selected.

Selections of pictures during hits and FAs in the condition when one sound was associated with different pictures of the same type.

https://doi.org/10.1371/journal.pone.0317183.s004

(PDF)

S4 Table. Hit rate (mean ± STD) in different versions of the learned sounds.

https://doi.org/10.1371/journal.pone.0317183.s005

(PDF)

S2 Fig. Hit rate and reaction times at different picture locations.

To analyze biases toward selecting a picture at any angle from the center of the touchscreen, we performed a one-way ANOVA, False Discovery Rate corrected for multiple pairwise comparisons. Monkey M showed no location bias (p-values > 0.034). Monkey G, however, exhibited a significant effect for the monkey face position (F [15, 160.67] = 1.97; p = 0.014) and the cow face (F [15, 150.619] = 2.51; p = 0.001), but not for the human face (p = 0.988). Post-hoc analysis (Tukey’s HSD) revealed that these differences occurred at angles < 90° within each screen quadrant. In other words, while there were biases in selecting pictures at particular angles, there was no consistent preference for a specific quadrant. Based on these findings, the behavioral results presented here correspond to subsequent experiments presenting pictures only in the four quadrants.

https://doi.org/10.1371/journal.pone.0317183.s007

(PDF)

Acknowledgments

We extend our gratitude to Vani Rajendran for valuable feedback; Francisco Pérez, Gerardo Coello, and Ana María Escalante from the Computing Department of the IFC; Aurey Galván and Manuel Ortínez of the IFC workshop; and Claudia Rivera for veterinary assistance. Additionally, we thank Centenario 107 for their hospitality.

References

  1. Bowerman M, Choi S. Shaping meanings for language: universal and language–specific in the acquisition of spatial semantic categories. In: Bowerman M, Levinson S, editors. Language acquisition and conceptual development. Cambridge, UK; 2001. p. 475–511.
  2. Beauchamp MS, Lee KE, Argall BD, Martin A. Integration of auditory and visual information about objects in superior temporal sulcus. Neuron. 2004;41(5):809–23. pmid:15003179
  3. Noesselt T, Rieger JW, Schoenfeld MA, Kanowski M, Hinrichs H, Heinze HJ, et al. Audiovisual temporal correspondence modulates human multisensory superior temporal sulcus plus primary sensory cortices. Journal of Neuroscience. 2007 Oct 17;27(42):11431–41. pmid:17942738
  4. Mesulam MM, Wieneke C, Hurley R, Rademaker A, Thompson CK, Weintraub S, Rogalski EJ. Words and objects at the tip of the left temporal lobe in primary progressive aphasia. Brain. 2013 Feb;136(Pt 2):601–18. Epub 2013 Jan 29. pmid:23361063; PMCID: PMC3572925.
  5. Vihman M, Croft W. Phonological development: Toward a “radical” templatic phonology. Linguistics. 2007 Jul 20;45(4):683–725.
  6. Coffey JR, Shafto CL, Geren JC, Snedeker J. The effects of maternal input on language in the absence of genetic confounds: Vocabulary development in internationally adopted children. Child Dev. 2022 Jan 1;93(1):237–53. pmid:34882780
  7. Bloom L, Tinker E. The intentionality model and language acquisition: engagement, effort, and the essential tension in development. Monogr Soc Res Child Dev. 2001;66(4):1–91. pmid:11799833
  8. Locke JL. Movement patterns in spoken language. Science. 2000 Apr 21;288(5465):449–51. pmid:10798981
  9. Goldstein MH, King AP, West MJ. Social interaction shapes babbling: testing parallels between birdsong and speech. Proc Natl Acad Sci U S A. 2003 Jun 24;100(13):8030–5. pmid:12808137
  10. Mooney R. Neurobiology of song learning. Curr Opin Neurobiol. 2009 Dec;19(6):654. pmid:19892546
  11. Margoliash D. Evaluating theories of bird song learning: implications for future directions. J Comp Physiol A Neuroethol Sens Neural Behav Physiol. 2002 Dec 1;188(11–12):851–66. pmid:12471486
  12. Chen Y, Matheson LE, Sakata JT. Mechanisms underlying the social enhancement of vocal learning in songbirds. Proc Natl Acad Sci U S A. 2016 Jun 14;113(24):6641–6. pmid:27247385
  13. Hisey E, Kearney MG, Mooney R. A common neural circuit mechanism for internally guided and externally reinforced forms of motor learning. Nat Neurosci. 2018 Apr 1;21(4):589–97. pmid:29483664
  14. Takahashi DY, Fenley AR, Teramoto Y, Narayanan DZ, Borjon JI, Holmes P, et al. Language Development. The developmental dynamics of marmoset monkey vocal production. Science. 2015 Aug 14;349(6249):734–8. pmid:26273055
  15. Carouso-Peck S, Goldstein MH. Female Social Feedback Reveals Non-imitative Mechanisms of Vocal Learning in Zebra Finches. Curr Biol. 2019 Feb 18;29(4):631–636.e3. pmid:30713105
  16. Takahashi DY. Vocal Learning: Shaping by Social Reinforcement. Curr Biol. 2019 Feb 18;29(4): R125–7. pmid:30779900
  17. Ratcliffe VF, Taylor AM, Reby D. Cross-modal correspondences in non-human mammal communication. Multisens Res. 2016;74(5657):49–91. pmid:27311291
  18. Seyfarth RM, Cheney DL, Marler P. Vervet monkey alarm calls: Semantic communication in a free-ranging primate. Anim Behav. 1980;28(4):1070–94.
  19. Romo R, Brody CD, Hernández A, Lemus L. Neuronal correlates of parametric working memory in the prefrontal cortex. Nature. 1999;399(June):470–3. pmid:10365959
  20. Plakke B, Hwang J, Romanski LM. Inactivation of Primate Prefrontal Cortex Impairs Auditory and Audiovisual Working Memory. 2015;35(26):9666–75.
  21. Diehl MM, Plakke BA, Albuquerque R, Romanski LM. Representation of expression and identity by ventral prefrontal neurons. Neuroscience. 2022;496(2022):243–60. pmid:35654293
  22. Hwang J, Romanski LM. Prefrontal neuronal responses during audiovisual mnemonic processing. J Neurosci. 2015 Jan 21;35(3):960–71. pmid:25609614
  23. Sugihara T, Diltz MD, Averbeck BB, Romanski LM. Integration of auditory and visual communication information in the primate ventrolateral prefrontal cortex. J Neurosci. 2006;26(43):11138–47. pmid:17065454
  24. Romanski LM, Sharma KK. Multisensory interactions of face and vocal information during perception and memory in ventrolateral prefrontal cortex. Philos Trans R Soc Lond B Biol Sci. 2023 Sep 25;378(1886). pmid:37545305
  25. Romanski LM. Representation and integration of auditory and visual stimuli in the primate ventral lateral prefrontal cortex. Cereb Cortex. 2007 Sep;17. pmid:17634387
  26. Adachi I, Hampton RR. Rhesus monkeys see who they hear: Spontaneous cross-modal memory for familiar conspecifics. PLoS One. 2011;6(8). pmid:21887244
  27. Diehl MM, Romanski LM. Responses of Prefrontal Multisensory Neurons to Mismatching Faces and Vocalizations. 2014;34(34):11233–43.
  28. Romanski LM, Averbeck BB, Diltz M. Neural representation of vocalizations in the primate ventrolateral prefrontal cortex. J Neurophysiol. 2005 Feb;93(2):734–47. pmid:15371495
  29. Romanski LM, Goldman-Rakic PS. An auditory domain in primate prefrontal cortex. Nat Neurosci. 2002;5(1):15–6. pmid:11753413
  30. Cohen YE, Theunissen F, Russ BE, Gill P. Acoustic features of rhesus vocalizations and their representation in the ventrolateral prefrontal cortex. J Neurophysiol. 2007;97(2):1470–84. pmid:17135477
  31. Gifford GW, MacLean KA, Hauser MD, Cohen YE. The neurophysiology of functionally meaningful categories: macaque ventrolateral prefrontal cortex plays a critical role in spontaneous categorization of species-specific vocalizations. J Cogn Neurosci. 2005;17(9):1471–82. pmid:16197700
  32. Huang Y, Brosch M. Neuronal activity in primate prefrontal cortex related to goal-directed behavior during auditory working memory tasks. Brain Res. 2016 Jun 1;1640(Pt B):314–27. pmid:26874071
  33. Eacott MJ, Gaffan D. Inferotemporal-frontal Disconnection: The Uncinate Fascicle and Visual Associative Learning in Monkeys. Eur J Neurosci. 1992;4(12):1320–32. pmid:12106395
  34. Romanski LM, Bates JF, Goldman-Rakic PS. Auditory belt and parabelt projections to the prefrontal cortex in the rhesus monkey. Journal of Comparative Neurology. 1999;403(April 1998):141–57. pmid:9886040
  35. Romanski LM, Tian B, Fritz J, Mishkin M, Goldman-Rakic PS, Rauschecker JP. Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nat Neurosci. 1999; 2:1131–6. pmid:10570492
  36. Gaffan D, Harrison S. Auditory-visual associations, hemispheric specialization and temporal-frontal interaction in the rhesus monkey. Brain. 1991; 114:2133–44. pmid:1933238
  37. Chandrasekaran C, Lemus L, Ghazanfar AA. Dynamic faces speed up the onset of auditory cortical spiking responses during vocal detection. Proc Natl Acad Sci U S A. 2013;110: E4668–77. pmid:24218574
  38. Ghazanfar AA, Chandrasekaran C, Logothetis NK. Interactions between the superior temporal sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. J Neurosci. 2008;28(17):4457–69. pmid:18434524
  39. Foxe JJ, Schroeder CE. The case for feedforward multisensory convergence during early cortical processing. Neuroreport. 2005 Apr 4;16(5):419–23. pmid:15770144
  40. Ghazanfar AA, Schroeder CE. Is neocortex essentially multisensory? 2006;10(6). pmid:16713325
  41. Huang Y, Brosch M. Behavior-related visual activations in the auditory cortex of nonhuman primates. Prog Neurobiol. 2024 Sep 1;240. pmid:38879074
  42. Perrodin C, Kayser C, Logothetis NK, Petkov CI. Auditory and visual modulation of temporal lobe neurons in voice-sensitive and association cortices. Journal of Neuroscience. 2014;34(7):2524–37. pmid:24523543
  43. Chandrasekaran C, Ghazanfar AA. Different neural frequency bands integrate faces and voices differently in the superior temporal sulcus. J Neurophysiol. 2009 Feb;101(2):773–88. pmid:19036867
  44. Dahl CD, Logothetis NK, Kayser C. Modulation of visual responses in the superior temporal sulcus by audio-visual congruency. Front Integr Neurosci. 2010 Apr;4(APRIL 2010). pmid:20428507
  45. Tyree TJ, Metke M, Miller CT. Cross-modal representation of identity in the primate hippocampus. Science. 2023;382(6669):417–23. pmid:37883535
  46. Zhou YD, Fuster JM. Visuo-tactile cross-modal associations in cortical somatosensory cells. Proc Natl Acad Sci U S A. 2000;97(17):9777–82. pmid:10944237
  47. Weiskrantz L, Cowey A. Cross modal matching in the rhesus monkey using a single pair of stimuli. Neuropsychologia. 1975;13(3):257–61. pmid:808744
  48. Fuster JM, Bodner M, Kroger JK. Cross-modal and cross-temporal association in neurons of frontal cortex. Nature. 2000;405(6784):347–51. pmid:10830963
  49. Stein BE, Meredith MA. The merging of the senses. Cambridge, Massachusetts: MIT Press; 1993. https://doi.org/10.1162/jocn.1993.5.3.373
  50. Nieder A. Supramodal numerosity selectivity of neurons in primate prefrontal and posterior parietal cortices. Proceedings of the National Academy of Sciences. 2012;109(29):11860–5. pmid:22761312
  51. Lemus L, Hernández A, Luna R, Zainos A, Romo R. Do sensory cortices process more than one sensory modality during perceptual judgments? Neuron. 2010; 67:335–48. pmid:20670839
  52. Lemus L, Hernandez A, Romo R. Neural encoding of auditory discrimination in ventral premotor cortex. Proc Natl Acad Sci U S A. 2009; 106:14640–5. pmid:19667191
  53. Lemus L, Hernández A, Romo R. Neural codes for perceptual discrimination of acoustic flutter in the primate auditory cortex. Proc Natl Acad Sci U S A. 2009;106(23):9471–6. pmid:19458263
  54. Vergara J, Rivera N, Rossi-Pool R, Romo R. A Neural Parametric Code for Storing Information of More than One Sensory Modality in Working Memory. Neuron. 2016;89(1):54–62. pmid:26711117
  55. Melchor J, Vergara J, Figueroa T, Morán I, Lemus L. Formant-based recognition of words and other naturalistic sounds in rhesus monkeys. Front Neurosci. 2021; 15:1–10. pmid:34776842
  56. Morán I, Perez-Orive J, Melchor J, Figueroa T, Lemus L. Auditory decisions in the supplementary motor area. Prog Neurobiol. 2021;202:1–11. pmid:33957182
  57. Chandrasekaran C, Lemus L, Trubanova A, Gondan M, Ghazanfar AA. Monkeys and humans share a common computation for face/voice integration. PLoS Comput Biol. 2011;7(9).
  58. Majerus S, Cowan N, Péters F, Van Calster L, Phillips C, Schrouff J. Cross-Modal Decoding of Neural Patterns Associated with Working Memory: Evidence for Attention-Based Accounts of Working Memory. Cerebral Cortex. 2016 Jan 1;26(1):166–79. pmid:25146374
  59. du Sert NP, Hurst V, Ahluwalia A, Alam S, Avey MT, Baker M, et al. The ARRIVE guidelines 2.0: Updated guidelines for reporting animal research. PLoS Biol. 2020 Jul 1;18(7).
  60. Russell WMS, Burch RL. The Principles of Humane Experimental Technique. London: The Universities Federation for Animal Welfare; 1959.
  61. Rescorla RA, Wagner AR. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In: Black A, Prokasy W, editors. Classical conditioning II. New York: Appleton-Century Crofts; 1972. p. 64–99.
  62. Treviño M. Associative learning through acquired salience. Front Behav Neurosci. 2016 Jan 11; 9:168673. pmid:26793078
  63. Merchant H, Zarco W, Prado L. Do we have a common mechanism for measuring time in the hundreds of millisecond range? Evidence from multiple-interval timing tasks. J Neurophysiol. 2008 Feb;99(2):939–49. pmid:18094101
  64. Ng CW, Plakke B, Poremba A. Primate auditory recognition memory performance varies with sound type. Hear Res. 2009 Oct 1;256(1–2):64–74. pmid:19567264
  65. Scott BH, Mishkin M, Yin P. Monkeys have a limited form of short-term memory in audition. Proc Natl Acad Sci U S A. 2012 Jul 24;109(30):12237–41. pmid:22778411
  66. Shushruth S, Zylberberg A, Shadlen MN. Sequential sampling from memory underlies action selection during abstract decision-making. Curr Biol. 2022 May 9;32(9):1949–1960.e5. pmid:35354066
  67. Bennur S, Gold JI. Distinct representations of a perceptual decision and the associated oculomotor plan in the monkey lateral intraparietal area. J Neurosci. 2011 Jan 19;31(3):913–21. pmid:21248116