
Figure 1
A sketch map of the three main sound types of Pierre Schaeffer’s typology (impulsive, sustained, and iterative) with related action profiles (Jensenius, 2022). Dashed vertical lines show the perceived onset/end points.
Table 1
Popular labeled audio‑visual datasets. In the ‘Label modality’ column, ‘audio&video’ means some labels are based on audio information and others on video information, while ‘combined’ means labels are based on the combined information of both modalities. In the ‘Perception mode’ and ‘Label ontology’ columns, colors serve as an additional indicator of the perception types: red stands for causal labels, blue for reduced labels, and green for semantic labels. While ‘emotion’ was not part of Schaeffer’s listening-mode framework, we loosely regard it as a mixture of causal and semantic information.
| Dataset | Year | # of Clips | Total Duration | Source | Label Modality | Perception Mode | Label Ontology |
|---|---|---|---|---|---|---|---|
| AudioSet | 2017 | 2M+ | 5,800+ h | YouTube | audio | causal | events |
| Kinetics‑400 | 2017 | 306k+ | 850+ h | YouTube | combined | causal | actions |
| EPIC‑KITCHENS | 2018 | 39,594 | 55 h | original | audio&video | causal | objects/actions |
| CMU‑MOSEI | 2018 | 2,199 | 2+ h | YouTube | combined | causal/semantic | emotion |
| AVE | 2018 | 4,143 | 11+ h | YouTube | combined | causal | events |
| LLP | 2020 | 11,849 | 32.9 h | YouTube | audio&video | causal | events |
| VGGSound | 2020 | 200k+ | 550+ h | YouTube | combined | causal | events |
| SSW60 | 2022 | 9.2k | 25.7 h | original | combined | causal | events |
| SoundActions | ‑ | 365 | 1 h | original | combined | causal+reduced | events, objects, actions, environment, enjoyability, perception type |

Figure 2
Thumbnails of the SoundActions dataset.

Figure 3
Histogram of durations of SoundActions samples. Sample counts are shown on a logarithmic scale to show the long tail.

Figure 4
Recording a sound action using a lightweight setup with a mobile phone equipped with a USB microphone.
Table 2
Statistics of the reduced type labels in SoundActions dataset.
| | PerceptionType | Enjoyability |
|---|---|---|
| Class counts | Impulsive: 124 | Yes: 49 |
| | Sustained: 84 | Neutral: 203 |
| | Iterative: 125 | No: 72 |
| | No majority: 32 | No majority: 41 |
| Multirater | 0.46 | 0.20 |
| Agreed ≥ 2 | 91.5% | 88.5% |
| Agreed = 3 | 43.3% | 25.4% |
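The ‘Multirater’ row of Table 2 reports a chance-corrected agreement coefficient across the three annotators; the table does not state which coefficient is used, but a minimal sketch assuming Fleiss’ kappa over three raters per clip would look like this:

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of per-item rating tuples.

    ratings: one tuple of category labels per item (here, 3 raters per clip).
    categories: the full set of possible labels.
    """
    n = len(ratings[0])   # raters per item
    N = len(ratings)      # number of items
    totals = Counter()    # overall counts per category, for chance agreement
    P_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Per-item observed agreement: pairs of raters that agree.
        P_sum += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    P_bar = P_sum / N                                      # mean observed agreement
    P_e = sum((totals[c] / (N * n)) ** 2 for c in categories)  # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

With this convention, full three-way agreement on every clip gives a kappa of 1, and values like 0.46 versus 0.20 indicate that raters agreed on PerceptionType more consistently than on Enjoyability.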

Figure 5
Three factors of the audio–video fine‑tuning experiment: (1) fine‑tuning range, (2) modality combination, and (3) label type. All combinations of the three factors were tested on the SoundActions dataset with a fivefold cross‑validation.
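The factor grid in Figure 5 maps directly onto the group labels (A1–D5) in Table 3. A sketch of enumerating every fine‑tuning run follows; the factor names and values are inferred from the figure and table, not taken from the paper’s code:

```python
from itertools import product

# Factor levels inferred from Figure 5 and Table 3 (names are assumptions).
FINE_TUNE_RANGES = ["cls", "all"]       # classifier only vs. all trainable params
MODALITY_COMBOS = [                      # (fine-tune modality, validation modality)
    ("av", "av"), ("a", "a"), ("v", "v"), ("av", "a"), ("av", "v"),
]
LABEL_TYPES = ["PerceptionType", "Enjoyability"]
N_FOLDS = 5

def experiment_grid():
    """Yield one config dict per (range, modality combo, label, fold) run."""
    for ft_range, (ft_mod, val_mod), label in product(
        FINE_TUNE_RANGES, MODALITY_COMBOS, LABEL_TYPES
    ):
        for fold in range(N_FOLDS):
            yield {"range": ft_range, "ft_modality": ft_mod,
                   "val_modality": val_mod, "label": label, "fold": fold}

runs = list(experiment_grid())
# 2 ranges x 5 modality combos x 2 labels x 5 folds = 100 fine-tuning runs
```

Each of the 20 (range, modality, label) cells then reports a mean over its five folds, matching the “Validation Accuracy Mean (Each Fold)” columns of Table 3.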

Figure 6
Left: Classic audio‑visual adapter structure used by LAVisH (Lin et al., 2023) and DG‑SCT (Duan et al., 2023). Right: Ensemble of Perception Mode Adapters (EoPMA model). The causal adapters are trained on the AVE dataset (Tian et al., 2018); the reduced adapters are trained on reduced labels in the SoundActions dataset.
Table 3
Five‑fold fine‑tuning results of DG‑SCT with EoPMA on SoundActions.
| Fine‑Tune Type | Fine‑Tune Modality | Validation Modality | PerceptionType Group No. | PerceptionType Validation Accuracy Mean (Each Fold) | Enjoyability Group No. | Enjoyability Validation Accuracy Mean (Each Fold) |
|---|---|---|---|---|---|---|
| cls | av | av | A1 | | B1 | |
| cls | a | a | A2 | | B2 | |
| cls | v | v | A3 | | B3 | |
| cls | av | a | A4 | | B4 | |
| cls | av | v | A5 | | B5 | |
| all | av | av | C1 | | D1 | |
| all | a | a | C2 | | D2 | |
| all | v | v | C3 | | D3 | |
| all | av | a | C4 | | D4 | |
| all | av | v | C5 | | D5 | |

Figure 7
Qualitative principal component analysis visualization of the embedding spaces of different modality setups and tasks.
Table 4
Comparison of AVE validation accuracy of different ensemble methods, data, and labels. EoPMA: Ensemble of Perception Mode Adapters. ∗Our re‑evaluation of the officially provided DG‑SCT AVE checkpoint.
| Ensemble Method | Ensemble Data | Fine‑Tuned SoundActions Label | Validation Accuracy on AVE (Mean and Each Run) |
|---|---|---|---|
| (Original DG‑SCT) | ‑ | ‑ | |
| Ensemble of adapters | AVE (different random seed) | ‑ | |
| Ensemble of final embedding | SoundActions | PerceptionType | |
| Ensemble of final embedding | SoundActions | Enjoyability | |
| EoPMA | SoundActions | PerceptionType | |
| EoPMA | SoundActions | Enjoyability | |
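Equations 2 and 4 are not reproduced here, but one plausible reading of the EoPMA hyperparameter swept in Figure 8 is a convex weight that blends the causal-adapter and reduced-adapter outputs before classification. A hypothetical sketch (the function name and the exact blending form are assumptions, not the paper’s definition):

```python
def eopma_blend(causal_feats, reduced_feats, lam):
    """Convex blend of causal- and reduced-adapter features.

    lam = 1.0 recovers the causal (AVE-trained) pathway alone;
    lam = 0.0 recovers the reduced (SoundActions-trained) pathway alone.
    This is a hypothetical reading of the hyperparameter in Eqs. 2 and 4.
    """
    return [lam * c + (1.0 - lam) * r
            for c, r in zip(causal_feats, reduced_feats)]
```

Under this reading, sweeping the hyperparameter as in Figure 8 traces a path between the baseline DG‑SCT behavior and a fully reduced-label-informed ensemble.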

Figure 8
Accuracy of the original DG‑SCT model (baseline), an ensemble of two DG‑SCT models with different random seeds (ensemble of baselines), and the EoPMA models, labeled as (PT for PerceptionType / EJ for Enjoyability, fine‑tuned modality, ensembling modality). The varied quantity is the hyperparameter of EoPMA defined in Equations 2 and 4. Accuracies are calculated on the validation set of AVE.
