
Investigating Auditory–Visual Perception Using Multi-Modal Neural Networks with the SoundActions Dataset
Abstract
Musicologists, psychologists, and computer scientists study relationships between auditory and visual stimuli from very different perspectives, using different terminologies and methodologies. This article aims to bridge the gap between phenomenological sound theory, auditory–visual theory, and audio–video processing and machine learning. We introduce the SoundActions dataset, a collection of 365 audio–video recordings of (primarily) short sound actions. Each recording has been human‑labeled and annotated according to Pierre Schaeffer’s theory of reduced listening, which describes properties of the sound itself (e.g., ‘an impulsive sound’) rather than of its source (e.g., ‘a bird sound’). Using these reduced‑type labels, we conducted two experiments: (1) fine‑tuning the latest audio–video transformer model on the reduced‑type labels in SoundActions, demonstrating that the model can recognize reduced‑type labels and observing a modality‑imbalance phenomenon that parallels Michel Chion’s theory of added value; and (2) proposing the Ensemble of Perception Mode Adapters method, inspired by Pierre Schaeffer’s three listening modes, which further improves the audio–video model on reduced‑type tasks.
© 2026 Jinyue Guo, Jim Tørresen, Alexander Refsum Jensenius, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.