Table 1
Comparison of widely used datasets for instrument recognition in polyphonic music, where 'Predominant' refers to the main instrument, 'Partial' to a subset of the active instruments, and 'Full' to all instruments present in the audio.
| Dataset | #Examples | #Instruments | Duration | Labeling |
|---|---|---|---|---|
| OpenMIC‑2018 | 20,000 | 20 | 10 s | Partial |
| IRMAS | 6,705 | 11 | 3 s | Predominant |
| Slakh2100 | 1,405 | 35 | Track | Full |
| Cerberus4 | 1,327 | 4 | Track | Full |
| MusicNet | 330 | 11 | Track | Full |
| MedleyDB | 122 | 80 | Track | Full |
| URMP | 44 | 14 | Track | Full |
| HamNava | 6,000 | 9 | 5 s | Full |

Figure 1
Screenshots of the crowd‑sourcing app used for annotations: the Persian interface (right) and its English translation (left).

Figure 2
Distribution of instruments across the annotated excerpts, with labels binarized at a 0.5 threshold.

Figure 3
Distribution of excerpts by difficulty level, and annotation quality across instruments as measured by inter-annotator agreement and annotator confidence.

Figure 4
Normalized instrument co-occurrence matrix (in percent) across annotated excerpts, with labels binarized at a 0.5 threshold.
Table 2
Instrument recognition performance on the validation and test sets under different training-label schemes.
(a) Trained on soft-labeled data and evaluated using MSE and R² metrics.
| Model | MSE (Val) | MSE (Test) | R² (Val) | R² (Test) |
|---|---|---|---|---|
| CNN Baseline | 9.07 | 9.28 | 50.78 | 51.08 |
| Music2vec | 5.09 | **5.10** | 72.74 | **73.13** |
| MusicHuBERT | 6.85 | 7.30 | 64.40 | 61.54 |
| MERT | 5.15 | 5.54 | 72.33 | 70.83 |
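For concreteness, subtable (a) treats recognition as regression against the soft labels. A minimal sketch of that evaluation, assuming labels and model outputs are per-instrument confidences in [0, 1] and that scores are reported as percentages (array shapes and data below are placeholders, not the paper's pipeline):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical placeholders: soft labels are per-instrument annotator
# confidences in [0, 1], shape (n_excerpts, n_instruments).
rng = np.random.default_rng(0)
y_true = rng.random((1000, 9))  # ground-truth soft labels (assumption)
y_pred = rng.random((1000, 9))  # model outputs (assumption)

# Reporting both metrics as percentages is inferred from the scale
# of the values in the table.
mse = mean_squared_error(y_true, y_pred) * 100
r2 = r2_score(y_true, y_pred) * 100
print(f"MSE: {mse:.2f}  R2: {r2:.2f}")
```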
(b) Trained on soft-labeled data and evaluated using accuracy and F1 metrics, treating model outputs greater than 0.5 as 1.
| Model | Acc (Val) | Acc (Test) | F1 (Val) | F1 (Test) |
|---|---|---|---|---|
| CNN Baseline | 84.48 | 82.50 | 67.76 | 65.81 |
| Music2vec | 91.31 | 91.59 | 84.69 | 86.06 |
| MusicHuBERT | 89.45 | 88.21 | 81.73 | 81.02 |
| MERT | 91.38 | **90.85** | 84.91 | **84.98** |
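Subtable (b) scores the same soft-output models after binarizing both sides at 0.5. A sketch of that step, assuming label-wise accuracy and macro-averaged F1 (the exact averaging scheme is not stated in this section):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true_soft = rng.random((1000, 9))  # aggregated annotator confidences (assumption)
y_pred_soft = rng.random((1000, 9))  # model outputs in [0, 1] (assumption)

# Binarize both sides with the 0.5 threshold from the caption.
y_true = (y_true_soft > 0.5).astype(int)
y_pred = (y_pred_soft > 0.5).astype(int)

# Label-wise accuracy: mean over all excerpt-instrument pairs; macro F1
# over instruments. Both averaging choices are assumptions.
acc = (y_true == y_pred).mean() * 100
f1 = f1_score(y_true, y_pred, average="macro") * 100
print(f"Acc: {acc:.2f}  F1: {f1:.2f}")
```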
(c) Trained on hard labels, obtained by treating aggregated elicited confidences greater than 0.5 as 1, and evaluated using accuracy and F1 metrics.
| Model | Acc (Val) | Acc (Test) | F1 (Val) | F1 (Test) |
|---|---|---|---|---|
| CNN Baseline | 85.49 | 84.47 | 69.96 | 68.62 |
| Music2vec | 91.49 | **91.36** | 82.11 | **81.39** |
| MusicHuBERT | 88.97 | 87.76 | 75.84 | 72.47 |
| MERT | 91.14 | 90.39 | 80.85 | 78.49 |
[i] The validation set is used for hyperparameter tuning, and the test score of the best‑performing model on the validation set is highlighted in bold.
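The hard labels of subtable (c) come from aggregating the elicited annotator confidences per excerpt and thresholding the aggregate at 0.5. A minimal sketch, assuming the aggregation is a plain mean over annotators (shapes and names are hypothetical):

```python
import numpy as np

# Hypothetical shape: (n_excerpts, n_annotators, n_instruments), each entry
# an elicited confidence in [0, 1]; mean aggregation is an assumption.
rng = np.random.default_rng(0)
confidences = rng.random((1000, 3, 9))

soft_labels = confidences.mean(axis=1)         # aggregate over annotators
hard_labels = (soft_labels > 0.5).astype(int)  # threshold at 0.5 per the caption
```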
Table 3
Instrument-wise performance of Music2vec on the test set.
| Label Type* | Metric | Tonbak | Singer | Tar | Kamancheh | Santour | Oud | Ney | Setar | Daf |
|---|---|---|---|---|---|---|---|---|---|---|
| Soft | MSE | 5.97 | 3.28 | 6.19 | 7.05 | 6.06 | 3.08 | 5.78 | 6.03 | 2.45 |
| Soft | R² | 70.69 | 86.19 | 69.23 | 64.56 | 62.68 | 77.67 | 63.98 | 56.08 | 51.37 |
| Soft | Acc | 91.87 | **96.22** | 89.69 | 86.48 | **91.41** | **93.36** | **90.61** | **88.66** | **95.99** |
| Soft | F1 | 93.47 | **95.45** | 88.19 | 82.23 | **79.22** | **83.52** | **78.65** | **65.26** | **72.87** |
| Hard | Acc | **92.55** | 95.99 | **89.92** | **86.71** | 91.18 | 93.01 | 90.15 | 87.17 | 95.53 |
| Hard | F1 | **94.05** | 95.16 | **88.60** | **83.04** | 79.02 | 83.20 | 78.39 | 60.28 | 70.68 |
[i] The best performance for each instrument is highlighted in bold. The asterisk (*) refers to the labels used during training; 'Hard' means an aggregated elicited confidence greater than 0.5 is treated as 1 during training.
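Table 3's instrument-wise scores correspond to computing each metric per output column rather than averaging across instruments. A sketch using scikit-learn's per-label F1 (the instrument ordering and the placeholder data are assumptions):

```python
import numpy as np
from sklearn.metrics import f1_score

instruments = ["Tonbak", "Singer", "Tar", "Kamancheh", "Santour",
               "Oud", "Ney", "Setar", "Daf"]

rng = np.random.default_rng(0)
y_true = (rng.random((1000, 9)) > 0.5).astype(int)  # binarized labels (placeholder)
y_pred = (rng.random((1000, 9)) > 0.5).astype(int)  # binarized outputs (placeholder)

# average=None yields one F1 score per instrument column.
per_class_f1 = f1_score(y_true, y_pred, average=None) * 100
for name, score in zip(instruments, per_class_f1):
    print(f"{name}: F1 = {score:.2f}")
```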
