HamNava: A Dataset for Multi‑Label Instrument Classification
Open Access | Jul 2025


1 Introduction

As the amount of music data continues to increase, the need for automated systems to retrieve and extract musical information has become more critical. Over the past two decades, music information retrieval (MIR) has developed into a prominent research field, resulting in a variety of techniques and tools (Ji et al., 2023; Jamshidi et al., 2024). The progress and comprehensive evaluation of these methods depend on datasets that are specifically annotated for MIR tasks, such as genre classification and emotion and mood recognition.

Data collection, technical solutions, and philosophical inquiry are crucial for promoting cultural diversity in MIR research (Duan et al., 2023). The development of large, diverse datasets in Western musical traditions has significantly advanced algorithmic progress and facilitated methodological comparisons, driving research in the field. Recently, the MIR community has been increasingly advocating for expanding MIR tasks to a broader range of musical cultures (Serra, 2017; Lidy et al., 2010), and many researchers have started to focus on creating datasets for traditional and underrepresented musical cultures. In this study, we present a novel multi‑label instrument classification dataset collected from Iranian classical music, consisting of monophonic and heterophonic recordings, making the task intrinsically more challenging. A baseline is established for multi‑label instrument classification, and then we proceed with cross‑cultural research on foundation models.

1.1 Iranian classical music

Though unique, the musical culture of Persia shares strong connections with the musical traditions of the Middle East and Central Asia. It also has ties to the music of the Indian subcontinent and, to a lesser extent, to African musical cultures and, in particular, from the 1800s onward, it shows influences from European music (Nettl, 2012). Iranian classical music is based on a system of ‘Dastgahs’: each Dastgah is a multi-modal cycle or framework consisting of a set of melodic modal structures organized in a cyclic pattern on a modal foundation (Asadi, 2004). There are seven primary Dastgahs, each encompassing distinct melodic patterns, motifs, and specific musical movements that evoke particular emotions.

The orchestration of Iranian classical music is mainly monophonic in solo performance and heterophonic when played in ensembles. Songs feature a single melodic line that is either sung or played, often accompanied by multiple instruments playing variations of it by adding ornamentation (Massoudieh, 2016). In contrast to polyphonic music, where multiple melodies occur simultaneously, Iranian classical music focuses on the ornamentation and expression of a single melody. Despite its monophonic nature, the performance typically involves a variety of instruments, each contributing to the music’s texture by interpreting the central melody in subtly different ways (Zaker Jafari, 2019). This approach adds depth and richness to the music while preserving its monophonic essence, presenting a challenge for recognizing individual instruments in this style compared to other polyphonic forms.

1.2 Our contribution

Our main contribution is the creation of a fully annotated multi‑instrument dataset with soft labels, designed not only for segment‑level multi‑label instrument classification in Iranian classical music but also for cross‑cultural research. This paper provides a detailed exploration of the dataset’s development process, including its crowd‑sourced annotation methodology. In summary, our contributions are threefold:

  • Collection of a fully labeled, multi-instrument Iranian classical music dataset with crowd-sourced soft labels and a dynamic number of annotations;

  • Introduction of an innovative crowd-sourcing methodology that relies on social incentives and community involvement, rather than per-annotation financial compensation, using a common messaging platform; and

  • Establishment of a baseline for cross-cultural multi-label instrument classification as an MIR task through the use of pre-trained music foundation models.

1.3 Related works


Monophonic and polyphonic music instrument recognition are well‑established challenges in MIR, driving the creation of numerous datasets in this field. Table 1 provides an overview of widely used polyphonic music datasets and their key characteristics. These datasets are typically categorized as either partially labeled or fully labeled datasets according to the extent of their labeling.

Table 1

Comparison of widely used instrument recognition datasets in polyphonic music, where ‘Predominant’ refers to the main instrument, ‘Partial’ refers to some active instruments, and ‘Full’ refers to all instruments present in the audio.

Dataset      | #Examples | #Instruments | Duration | Labeling
-------------|-----------|--------------|----------|------------
OpenMIC-2018 | 20,000    | 20           | 10 s     | Partial
IRMAS        | 6,705     | 11           | 3 s      | Predominant
Slakh2100    | 1,405     | 35           | Track    | Full
Cerberus4    | 1,327     | 4            | Track    | Full
MusicNet     | 330       | 11           | Track    | Full
MedleyDB     | 122       | 80           | Track    | Full
URMP         | 44        | 14           | Track    | Full
HamNava      | 6,000     | 9            | 5 s      | Full

Partially labeled multi‑instrument datasets feature labels for only a subset of the instruments present in each excerpt. This approach enables these datasets to encompass a wider variety of samples and a broader range of instruments. For example, some datasets may label only the predominant (most salient) instrument, e.g., IRMAS (Bosch et al., 2012), or they may include labels indicating the most or least likely instruments, e.g., OpenMIC‑2018 (Humphrey et al., 2018).

Fully labeled datasets, in contrast, typically provide both audio recordings and MIDI transcriptions, allowing for instrument identification at the note level. In some cases, the audio is synthesized to match the MIDI, e.g., Slakh2100 (Manilow et al., 2019) and Cerberus4 (Manilow et al., 2020), making these datasets less suitable for real‑world retrieval tasks. Other fully labeled datasets consist of live audio recordings, e.g., MedleyDB (Bittner et al., 2014), MusicNet (Thickstun et al., 2016), and URMP (Li et al., 2019). These real‑world datasets are generally smaller and less diverse, presenting challenges such as MIDI–audio alignment.

Several datasets focusing on musical traditions outside Western classical and popular music are increasingly used in computational musicology and the broader study of music through computational methods. The CompMusic Project (Serra, 2014) focuses on the computational study of five music traditions of the world; in addition to audio recordings, music scores, and lyrics, it emphasizes annotations of culture-specific aspects. The Arab–Andalusian music dataset (Caro Repetto et al., 2018) includes annotations for Mīzāns, Ṭubūʿs, and Nawbas, capturing the rhythmic patterns and musical modes of this tradition. The Jingju corpus (Caro Repetto, 2018) contains more than 1,244 audio recordings across eight role types for the study of traditional Jingju singing; it is labeled for melodic analysis, with annotations for Shengqiang (modal systems) and Banshi (metrical patterns), and is complemented by Jingju score and lyrics collections. The OTMM dataset (Şentürk, 2016) contains more than 420 hours of stereo recordings in different Makams, Forms, and Usuls, as well as score collections in handwritten and printed book formats. The Carnatic and Hindustani corpora (Srinivasamurthy, 2016; Gulati, 2016) contain more than 800 hours of music recordings in different Rāgas, Tālas, and forms, providing a comprehensive foundation for research in Indian art music.

In addition, there are other newly proposed datasets that have not yet gained widespread recognition within the MIR community but have significant potential to advance cross-cultural music research. These emerging collections focus on lesser-explored traditions. Examples include SAMBASET (Maia et al., 2019), which documents historical and modern Brazilian samba de enredo recordings, and the Hardanger Fiddle Dataset (Lartillot et al., 2023), which contains annotated note and beat onsets for Norwegian Hardanger fiddle music. Another contribution is the Teach Yourself Georgian Folk Songs Dataset (Gillman et al., 2022), an annotated collection of traditional vocal polyphony from Georgia, as well as the Erkomaishvili Dataset (Rosenzweig et al., 2020), which includes Georgian sacred three-voice tracks with score–audio annotations. Similarly, the CCOM-HuQin dataset (Zhang et al., 2022) features multimodal annotations for Chinese fiddle music, documenting its playing techniques. CCMusic is a consolidated database of Chinese music for MIR research, containing datasets for instrument recognition, playing-technique classification, mode classification, and singing-style classification, openly available in standardized formats (Zhou et al., 2025). The Lyra dataset (Papaioannou et al., 2022) for Greek traditional and folk music captures tags for musical genre, instrumentation, and regional aspects, while the Choralbücher dataset (Gerhardt and Kirsch, 2024) offers a historical collection of hymn arrangements from Northern Germany with a focus on figured-bass analysis. Annotated for beats and downbeats, the Brazilian Rhythmic Instruments Dataset (Maia et al., 2018) contains both solo- and multi-instrument recordings across 10 instrument settings and five rhythm classes, and, similarly, the Candombe dataset (Nunes et al., 2015) features ensembles of three to five drums. In the context of Iranian classical music, the Heydarian mode-recognition dataset (Heydarian and Bainbridge, 2019) includes 5,706 seconds of performances in five Dastgahs, played on the Santur. Additionally, the KDC dataset (Nikzat and Repetto, 2022) offers recordings of Ney, Santur, and voice across seven Dastgahs and five Avazes, totaling more than two hours of music, including performances and Radifs.

These datasets allow MIR models to capture regional, rhythmic, and modal subtleties across various musical traditions. However, they are often limited by a small number of instruments or excerpts, or they fail to capture the diversity of the culture they represent. This restricts their applicability to more diverse and extensive cross-cultural MIR tasks.

2 Construction Methodology

This section outlines the step‑by‑step methodology for constructing HamNava. It details the selection of core instruments, the process of gathering excerpts, the confidence elicitation and soft‑labeling techniques used for flexible annotation, and the creative use of a common messaging app to effectively crowd‑source annotations within the community. This comprehensive approach enabled the generation of high‑quality, consensus‑driven labels while actively involving a diverse group of contributors in the annotation process.

2.1 Instrument selection

The selection of instruments for the HamNava dataset was carried out through a combination of expert consultation and empirical analysis. Professors specializing in Iranian classical music were consulted, and their expertise guided the selection of core instruments commonly featured in traditional ensembles. To complement this process, we surveyed 300 albums spanning 1975–2005, focusing on recordings by prominent Iranian artists. Naturally, we observed that the sound quality of older and live recordings tends to be lower than that of more recent releases, though never to an extent that hinders listening or annotation. The booklet and metadata of each album were examined to determine the specific instruments involved in the orchestration.

Through these processes, we consistently identified nine musical elements (eight instruments and a vocalist) across the majority of the albums. These elements include five string instruments (Tar, Setar, Santur, Oud, and Kamancheh), a wind instrument (Ney), two percussion instruments (Tonbak and Daf), and also voice. Our findings confirm that these nine elements establish a solid foundation for a balanced and representative dataset. This selection captures the diversity of instruments in Iranian classical music while remaining practical for full labeling and annotation.

The excerpts used for crowd‑sourcing were collected by investigating the aforementioned 300 albums, whose licenses could be secured for academic sharing. From these, we selected only the albums that featured the eight target instruments and voice, which reduced the number of albums to almost a quarter of the original set. Then, we extracted the tracks from these selected albums where two or more instruments were played together, resulting in 236 pieces. The instruments present in each track were identified collaboratively by the two musically‑trained authors, narrowing down the album‑level instrumentation to track‑level labels. These track‑level labels were later used to guide contributors during the annotation process for each five‑second excerpt extracted from each track.

Prior to extracting five‑second excerpts from the recordings, the tracks were categorized into two groups based on their complexity: easy and hard. The ‘easy’ category included tracks where all instruments were played with different techniques, e.g., plucking, striking, and bowing, making them more distinguishable for less‑experienced annotators. The ‘hard’ category, in contrast, contained excerpts that required a higher level of expertise for accurate annotation.

Ultimately, each track was segmented into non-overlapping five-second excerpts, resulting in over 17,000 candidates. To reduce this number while maintaining musical diversity, we applied a heuristic function that ranked and selected excerpts based on several criteria, sketched below. First, the function ensured that each musical element appeared in the track-level labels of at least 1,000 selected excerpts. Second, to promote diversity, no consecutive excerpts from the same track were selected. Third, to balance task difficulty, approximately twice as many excerpts were drawn from the ‘easy’ category as from the ‘hard’ category. Finally, excerpts with the highest number of simultaneous musical elements were deprioritized, as such dense, heterophonic passages tend to be more challenging for accurate human annotation in the absence of a visual reference.
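The exact ranking function is not published with the paper; the following is one minimal greedy reading of the four criteria, in which the `Excerpt` fields, the density penalty of 0.5, and the selection order are our own assumptions rather than the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Excerpt:
    track_id: str
    index: int        # position of the five-second window within its track
    elements: list    # track-level labels narrowed down for this track
    difficulty: str   # 'easy' or 'hard'

def select_excerpts(candidates, target=6000, min_per_element=1000, easy_ratio=2/3):
    """Greedy sketch of the four selection criteria described above."""
    counts = Counter()           # element coverage among chosen excerpts
    chosen, taken = [], set()    # taken: (track_id, index) pairs already selected
    easy_budget = int(target * easy_ratio)
    remaining = list(candidates)

    def score(ex):
        # Criterion 1: favor excerpts covering under-represented elements;
        # Criterion 4: deprioritize dense passages (many simultaneous elements).
        need = sum(counts[e] < min_per_element for e in ex.elements)
        return need - 0.5 * len(ex.elements)   # 0.5 is an arbitrary density penalty

    while remaining and len(chosen) < target:
        remaining.sort(key=score, reverse=True)
        ex = remaining.pop(0)
        # Criterion 2: never take an excerpt adjacent to one already chosen
        # from the same track.
        if (ex.track_id, ex.index - 1) in taken or (ex.track_id, ex.index + 1) in taken:
            continue
        # Criterion 3: keep roughly a 2:1 easy-to-hard balance.
        if ex.difficulty == 'easy' and sum(c.difficulty == 'easy' for c in chosen) >= easy_budget:
            continue
        chosen.append(ex)
        taken.add((ex.track_id, ex.index))
        counts.update(ex.elements)
    return chosen
```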

This heuristic process resulted in a curated set of 6,000 five‑second excerpts for the HamNava dataset, with around 67% labeled ‘easy’ and the remaining ones labeled ‘hard’. The chosen balance between difficulty levels reflects the distribution of contributors who had joined the platform before the start of crowd‑sourcing. The final dataset size was determined by both the anticipated level of community engagement and the volume of available data. The five‑second duration was selected as a compromise, being long enough for annotators to make reliable judgments, while also being informative for models. This choice also aligns with recent practices in transformer‑based architectures and follows the current trend in foundation models for music. To preserve the authenticity of the source material, no post‑processing, such as silence removal or dynamic filtering, was applied to the excerpts, and each segment remains as it originally appeared in the licensed recordings.

2.2 Soft labeling and aggregation methodology

To efficiently gather reliable annotations for the HamNava dataset, a soft labeling approach was used. Soft labeling is especially beneficial in crowd-sourcing environments as it enables faster annotations by lowering the threshold for precise decisions, leading to higher agreement rates among annotators (Martín-Morató et al., 2023). Incorporating confidence elicitation into the annotation process, we aimed to collect probabilistic labels that capture the inherent ambiguity in the presence of instruments, especially in complex multi-instrument contexts.

For each instrument, annotators provided a confidence score for its presence using a simple confidence-elicitation method with four discrete levels. These confidence scores were then averaged across annotators to generate soft aggregate labels. For vocals, however, we departed from the confidence-scoring approach and directly asked annotators to specify whether the vocal contained Chah-chah (also known as Tahrir, a rhythmic, wordless vocalization characterized by wide, rapid, vibrato-like ornamentation over a fundamental note), Kalam (melodic singing with lyrics), or both. This categorical approach was chosen over confidence scoring because behavioral evidence suggests that the human voice is special for humans (Liberman and Mattingly, 1989) and can be recognized more rapidly and accurately than instrumental sounds, owing to neural mechanisms attuned to its complex spectro-temporal features (Agus et al., 2012).

Eliciting confidence scores from annotators enhances label quality by encouraging them to consider their certainty. It also improves agreement in ambiguous cases and is cost‑effective, requiring fewer participants to achieve high‑quality annotations. These continuous labels help improve model training by incorporating uncertainty, reducing data requirements, and maintaining performance (Méndez Méndez et al., 2022). By using soft labeling and confidence elicitation, the accuracy and consistency of the annotations and the resulting labels for HamNava are significantly improved.

2.3 Crowd‑Sourcing platform and workflow

Our crowd‑sourcing session was conducted on Telegram, a widely‑used communication platform in Iran (Akbari and Gabdulhakov, 2019) and globally. We used Telegram’s bot feature, which enabled us to streamline and automate the crowd‑sourcing process for our dataset. The bot acted as a virtual assistant, guiding users through the annotation process without requiring manual intervention.

To maximize participation, the project was promoted within relevant Iranian classical music communities through advertisements and announcements on popular Telegram channels, encouraging users to engage and contribute. This outreach strategy aimed to attract individuals with an interest or background in Iranian classical music, motivating them to share their expertise through the social dynamics of the data-collection effort. The annotation project was launched on June 21, 2024, and ran for six months.

When users started the bot, they received a message outlining the process and requirements. To ensure high-quality annotations and match users with appropriately leveled excerpts, each user’s ability to identify instruments was first evaluated. This assessment involved both self-evaluation questions and practical instrument-recognition tasks. Users could select their ear-training level from ‘low,’ ‘medium,’ and ‘high,’ each accompanied by a brief descriptive sentence. They then listened to a set of predefined musical sounds: a Santur, staccato playing on a Kamancheh, and a low-quality Tar recording. Based on their responses, users were categorized into three proficiency levels according to their ability to distinguish instruments. Annotators who selected the correct answers and rated themselves the highest were assigned to the top category.

The lowest proficiency level was excluded from contributing to the dataset, as participants at this level generally lacked sufficient familiarity with the instruments. For the remaining two levels, the bot assigned specific types of audio excerpts: one group received excerpts containing multiple instruments playing the same technique, while the other group worked with simpler excerpts featuring clearer separations. This approach guaranteed that participants were assigned tasks suitable for their skill level, ultimately enhancing the quality of the annotations.
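As a rough illustration, the rule below combines the self-assessment with the recognition quiz. Only the top-category criterion (all answers correct plus the highest self-rating) is stated in the text; the remaining thresholds are assumptions.

```python
def proficiency_level(self_rating, quiz_correct):
    """Assign an annotator to a proficiency category.

    `self_rating` is the self-reported ear-training level ('low', 'medium',
    or 'high'); `quiz_correct` is the number of correct answers (0-3) on the
    three recognition tasks. The middle/excluded thresholds are assumptions.
    """
    if quiz_correct == 3 and self_rating == 'high':
        return 'top'       # assigned the harder excerpts (same-technique ensembles)
    if quiz_correct >= 2 and self_rating in ('medium', 'high'):
        return 'middle'    # assigned simpler excerpts with clearer separation
    return 'excluded'      # insufficient familiarity; not invited to annotate
```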

Once a user’s proficiency level is assessed, the annotation workflow begins. Annotators are provided with detailed instructions that ensure consistency, including clear definitions of and distinctions between Chah-chah and Kalam, as well as guidance on how to report their confidence levels. For each excerpt, annotators indicate whether they did not hear the instrument (confidence level 0) or, if they did hear it, select a level of 1, 2, or 3 according to how confident they are in their judgment. These responses are later mapped to numerical values: 0.0 for ‘not heard,’ and 0.5, 0.75, and 1.0 for confidence levels 1, 2, and 3, equally spaced to reflect increasing confidence in hearing the instrument. Each time the user enters the ‘/annotate’ command, a function is triggered that sends a randomly selected five-second mono MP3 excerpt with a sample rate of 22,050 Hz, along with a predetermined list of potential instruments (based on the determined track-level labels) to focus the user’s listening. This follows a predefined sequence of questions, where each response reflects the user’s judgment about whether an instrument is absent or how confidently it is perceived to be present in the audio excerpt.
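A minimal sketch of the response mapping and aggregation just described; the function and variable names are ours, but the numeric mapping and the averaging step follow the text.

```python
import numpy as np

# Mapping from the four discrete responses to numeric values, as described above:
# 'not heard' -> 0.0; confidence levels 1, 2, 3 -> 0.5, 0.75, 1.0.
RESPONSE_VALUE = {0: 0.0, 1: 0.5, 2: 0.75, 3: 1.0}

def soft_label(responses):
    """Aggregate one instrument's responses for an excerpt into a soft label.

    `responses` is a list of discrete levels (0-3), one per annotator.
    """
    values = [RESPONSE_VALUE[r] for r in responses]
    return float(np.mean(values))

# Example: two annotators heard the Tar with confidence 3 and 2,
# while a third did not hear it at all.
print(soft_label([3, 2, 0]))  # (1.0 + 0.75 + 0.0) / 3 ≈ 0.583
```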

Throughout the annotation process, the total number of annotated excerpts was displayed to the user, which, in our pilot study, was shown to increase the average number of annotations per user. Users can pause or resume the annotation process at any time, offering flexibility in their participation. Every two weeks, participants received a message and a push notification (if enabled) reminding them about the process and stating the expected average number of annotations per user needed for dataset construction. The number of judgments collected for each excerpt was determined dynamically: annotation continued until two annotators agreed on the presence or absence of each instrument in the excerpt. Figure 1 shows the interface of the app and the annotation workflow.
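The stopping rule can be read as the following check, run after each new judgment; how presence and absence are counted at the response level is our interpretation of the rule.

```python
def needs_more_annotations(responses_by_instrument, min_agreeing=2):
    """Check whether an excerpt still needs further annotations.

    `responses_by_instrument` maps each candidate instrument to the list of
    discrete responses (0-3) collected so far. Annotation stops once, for
    every instrument, at least two annotators agree on presence
    (level >= 1) or on absence (level 0).
    """
    for responses in responses_by_instrument.values():
        present = sum(1 for r in responses if r >= 1)
        absent = sum(1 for r in responses if r == 0)
        if max(present, absent) < min_agreeing:
            return True   # no two annotators agree yet for this instrument
    return False
```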

Figure 1

Screenshots of the crowd‑sourcing app used for annotations: the Persian interface (right) and its English translation (left).

3 HamNava Analysis

Overall, HamNava consists of 6,000 five-second excerpts, with over 60,000 individual judgments collected through crowd-sourced annotation from more than 1,800 contributors. A total of 91% of contributors met the proficiency level required for annotation, of whom 60% were assigned to the easier task and 40% to the more difficult task. Each excerpt is annotated against a predefined list of instruments present in the full track, referred to as track-level labels, and the aggregated judgments of annotators yield the excerpt-level annotations.

Figure 2b shows the distribution of instruments per excerpt for both track-level labels and excerpt-level annotations, revealing that, while each instrument appears in at least 1,000 track-level labels, certain instruments are more dominant in the selected excerpts of Iranian classical music recordings. For the vocal class, the categories Chah-chah and Kalam are grouped under the ‘singer’ tag in the figures, as discussed in the following paragraphs. Additionally, Figure 2a compares the instrument-frequency distributions between track-level labels and excerpt-level annotations across the ensembles, with both distributions resembling a normal curve. Figure 4 summarizes the co-occurrence patterns of selected instruments within HamNava, providing insights into music modeling and instrument pairings; for instance, instruments like the Setar and Kamancheh are not typically played together in a traditional context. Finally, Figure 3 illustrates the distribution of instrument difficulty levels in the excerpts.

Figure 2

Analysis of instrument distributions in annotated excerpts, binarized with a 0.5 threshold.

Figure 3

Distribution of excerpts by difficulty level and annotation quality across instruments, with inter‑annotator agreement and annotator confidence factors.

Figure 4

Normalized percentage matrix of instrument co‑occurrence across annotated excerpts, binarized with a 0.5 threshold.

To evaluate annotation consistency, we applied an inter-annotator agreement criterion using binarized annotations. For each (instrument, excerpt) pair, one annotation was selected as the ground truth, and the accuracy of each remaining annotation was computed against it, treating those annotations as predictions. This process was repeated with each annotation serving, in turn, as the ground truth. The agreement, denoted as Ag., is computed according to Equation 1:

$$\mathrm{Ag.}_I = \frac{1}{|E_I|} \sum_{E \in E_I} \frac{1}{|A_E|} \sum_{a \in A_E} \mathrm{Acc}\!\left( \big[\mathbb{1}[a]\big]_{|A_E|-1},\; \mathbb{1}\big[A_E \setminus \{a\}\big] \right), \tag{1}$$

where $E_I$ represents the set of excerpts in which the presence of instrument $I$ is possible and $A_E$ denotes the set of annotations for excerpt $E$. The notation $\mathbb{1}[\cdot]$ refers to the binarized form of an annotation at the threshold of 0.5, and $[a]_n$ indicates a sequence consisting of $a$ repeated $n$ times. Additionally, annotators’ confidence factors were computed across all instruments. This factor reflects the degree of certainty with which annotators indicated the presence of instruments. The confidence factor is defined in Equation 2:

$$\mathrm{Confidence}_I = \frac{1}{|E_I|} \sum_{E \in E_I} \frac{1}{|A_E|} \sum_{a \in A_E} 2\,\big|a - 0.5\big|. \tag{2}$$
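Both statistics translate directly into code. The sketch below assumes annotations are grouped per excerpt for a given instrument and uses the absolute-value reading of Equation 2 adopted above.

```python
import numpy as np

def binarize(a, threshold=0.5):
    # The 1[.] operator of Equation 1: present if the value exceeds 0.5.
    return int(a > threshold)

def agreement(annotations_per_excerpt):
    """Equation 1: leave-one-out accuracy between binarized annotations.

    `annotations_per_excerpt` is a list of lists; each inner list holds the
    numeric annotation values (0.0-1.0) for instrument I on one excerpt.
    The dynamic collection guarantees at least two annotations per excerpt.
    """
    excerpt_scores = []
    for ann in annotations_per_excerpt:
        b = [binarize(a) for a in ann]
        per_gt = []
        for i, gt in enumerate(b):
            others = b[:i] + b[i + 1:]
            # Accuracy of the remaining annotations against the held-out one.
            per_gt.append(np.mean([o == gt for o in others]))
        excerpt_scores.append(np.mean(per_gt))
    return float(np.mean(excerpt_scores))

def confidence(annotations_per_excerpt):
    """Equation 2: mean of 2*|a - 0.5| over all annotations."""
    return float(np.mean([np.mean([2 * abs(a - 0.5) for a in ann])
                          for ann in annotations_per_excerpt]))
```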

Figure 3 presents the inter-annotator agreement and the confidence factors of the annotators for all instruments. Overall, both agreement and confidence levels are high across the board, reflecting the quality of the crowd-sourced annotations. Inter-annotator agreement is particularly high for easily distinguishable musical elements, such as vocals, which reached an agreement rate of 98.16%. However, some instruments exhibit lower confidence and agreement scores. When cross-referenced with the distribution of excerpts by difficulty level in Figure 3, it becomes apparent that these instruments are more prevalent in the ‘hard’ excerpts. This pattern corresponds to their increased presence in larger ensembles, which complicates annotators’ task of confirming their presence and leads to reduced agreement rates and confidence scores for these instruments.

Moreover, the inter‑annotator agreement for the classification task (Kalam, Chah‑chah) was calculated based on the excerpts in which the aggregated tags indicated the presence of a singer. The results showed that, although the agreement for Kalam was relatively high at 91.06%, the agreement for Chah‑chah was considerably lower at 71.73%. This suggests that, while annotators generally agree on identifying singing with lyrics, there is less consensus when it comes to vocal ornamentation and its extent. As a result, this categorical annotation is not included in the subsequent sections of the paper, and the Kalam and Chah‑chah categories are unified under a new label ‘singer.’

3.1 Baseline model

For evaluations of HamNava, a 70:15:15 data split was proposed for training, testing, and validation, ensuring a consistent distribution of instruments and difficulty levels across all splits. The mean squared error (MSE) on the validation set was used to tune hyperparameters and select the best model configuration. Initially, a multi-label regression task was performed, where both training and test samples were softly labeled, with performance assessed using MSE and $R^2$ metrics. Next, while maintaining soft-label regression for training, the testing and validation tasks were framed as multi-label binary classification with a threshold of 0.5, evaluated with accuracy and F1 scores. Lastly, the labels of both the training and testing sets were binarized, and accuracy and F1 metrics were reported for this setup.
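A sketch of the three settings on (n_samples, 9) label arrays; how accuracy and F1 are averaged over the nine labels is not specified in the paper, so element-wise accuracy and macro-averaged F1 are assumptions here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, r2_score

def evaluate(y_true_soft, y_pred, setting):
    """Sketch of the three evaluation settings described above."""
    y_true_soft = np.asarray(y_true_soft)
    y_pred = np.asarray(y_pred)
    if setting == 'soft-regression':
        # Setting 1: soft labels on both sides, scored with MSE and R^2.
        return {'MSE': mean_squared_error(y_true_soft, y_pred),
                'R2': r2_score(y_true_soft, y_pred)}
    # Settings 2 and 3: binarize targets and predictions at the 0.5 threshold.
    y_true = (y_true_soft > 0.5).astype(int)
    y_hat = (y_pred > 0.5).astype(int)
    return {'Acc': accuracy_score(y_true.ravel(), y_hat.ravel()),
            'F1': f1_score(y_true, y_hat, average='macro')}
```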

In this section, we establish a performance baseline for multi‑label instrument classification in Iranian classical music. This baseline experiment helps to evaluate the effectiveness of simple classification and regression models before exploring more complex approaches, particularly those using pre‑trained models for cross‑cultural music information retrieval. By starting with a straightforward convolutional neural network (CNN) architecture, we can better understand the challenges and limitations of the task and provide a reference point for subsequent methods.

As a baseline, we train a simple CNN on HamNava; it serves as a reference point for the more complex approaches evaluated later. The CNN architecture consists of two convolutional layers, each followed by batch normalization, and four fully connected layers applied to the flattened feature representation, with a ReLU activation after each layer. The network is optimized using the Adam optimizer with a learning rate of 0.001 and a weight decay of 1e−4. For the soft-label setting, the loss function is the MSE, while, for the hard-label setting, binary cross-entropy is used. As input features, we extract Mel-frequency cepstral coefficients (MFCCs) from the audio excerpts with an FFT size of 1024, a hop length of 512 samples, and 40 Mel filterbanks. The results of the CNN baseline across the three tasks are summarized in Table 2.
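A minimal PyTorch/librosa sketch of this baseline follows. The stated hyperparameters (FFT size, hop length, Mel bands, optimizer settings, losses) come from the text; the number of cepstral coefficients, channel and layer widths, and the sigmoid used in the soft-label loss are assumptions.

```python
import librosa
import torch
import torch.nn as nn

def mfcc_features(path):
    """MFCCs with the stated configuration (FFT 1024, hop 512, 40 Mel bands).
    The number of cepstral coefficients is not stated; 20 is assumed."""
    y, sr = librosa.load(path, sr=22050)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=1024,
                             hop_length=512, n_mels=40)
    return torch.from_numpy(m).float().unsqueeze(0)  # (1, 20, 216) for 5 s

class BaselineCNN(nn.Module):
    """Two conv+BN blocks and four FC layers; kernel sizes, widths,
    and the absence of pooling are assumptions."""
    def __init__(self, n_labels=9, n_mfcc=20, n_frames=216):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * n_mfcc * n_frames, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_labels),
        )

    def forward(self, x):                 # x: (batch, 1, n_mfcc, n_frames)
        return self.fc(self.conv(x).flatten(1))

model = BaselineCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Soft-label setting: MSE between sigmoid-squashed outputs and soft labels;
# hard-label setting: binary cross-entropy on the logits.
soft_loss = lambda logits, y: nn.functional.mse_loss(torch.sigmoid(logits), y)
hard_loss = nn.BCEWithLogitsLoss()
```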

Table 2

The performance of the CNN baseline and the foundation models on the test and validation sets is presented.

(a) Trained and evaluated on soft-labeled data, with performance reported using MSE and R² metrics.

Model        | MSE (Val) | MSE (Test) | R² (Val) | R² (Test)
-------------|-----------|------------|----------|----------
CNN Baseline | 9.07      | 9.28       | 50.78    | 51.08
Music2vec    | 5.09      | 5.10       | 72.74    | 73.13
MusicHuBERT  | 6.85      | 7.30       | 64.40    | 61.54
MERT         | 5.15      | 5.54       | 72.33    | 70.83

(b) Trained on soft‑labeled data and evaluated using accuracy and F1 metrics, considering model outputs greater than 0.5 as 1.

Model        | Acc (Val) | Acc (Test) | F1 (Val) | F1 (Test)
-------------|-----------|------------|----------|----------
CNN Baseline | 84.48     | 82.5       | 67.76    | 65.81
Music2vec    | 91.31     | 91.59      | 84.69    | 86.06
MusicHuBERT  | 89.45     | 88.21      | 81.73    | 81.02
MERT         | 91.38     | 90.85      | 84.91    | 84.98

(c) Trained on hard‑labeled data, with the aggregation of elicited confidence greater than 0.5 considered as 1, and evaluated using accuracy and F1 metrics.

Model        | Acc (Val) | Acc (Test) | F1 (Val) | F1 (Test)
-------------|-----------|------------|----------|----------
CNN Baseline | 85.49     | 84.47      | 69.96    | 68.62
Music2vec    | 91.49     | 91.36      | 82.11    | 81.39
MusicHuBERT  | 88.97     | 87.76      | 75.84    | 72.47
MERT         | 91.14     | 90.39      | 80.85    | 78.49

Note: The validation set is used for hyperparameter tuning, and the test score of the best-performing model on the validation set is highlighted in bold.

While the CNN baseline provides a reasonable starting point and a reference for understanding the challenges and limitations of the task, its shallow architecture may not fully capture the complex relationships between instruments given relatively few samples, especially in the multi-label case with strong co-occurrence patterns. In the subsequent section, therefore, more advanced models, predominantly pre-trained on Western classical and popular music, are explored to harness their effectiveness for cross-cultural research.

3.2 Cross‑Cultural multi‑label instrument classification

Foundation models, such as large language and diffusion models, are versatile systems designed to perform a wide range of tasks rather than being restricted to specific ones (Bommasani et al., 2021). These models are primarily trained through self-supervision on large volumes of unlabeled data, allowing them to learn useful representations that can be applied effectively to various downstream tasks (Ericsson et al., 2022).

In the domain of music processing, foundation models offer the potential to reduce annotation costs and enhance generalization. Their use could significantly improve music information retrieval and generation across various music-related tasks (Ma et al., 2024). While these models show promise for advancing cross-cultural and culture-preservation efforts (Ma et al., 2024), research on foundation models pre-trained on large datasets, which are typically dominated by Western or Western-influenced music, remains limited in the context of non-Western and folk music (BabaAli and Mohseni, 2024).

This subsection examines the cross‑cultural potential of foundation models on HamNava, highlighting notable state‑of‑the‑art models from the literature, such as Music2vec (Li et al., 2022), MusicHuBERT (Ma et al., 2023), and MERT (Li et al., 2023). Music2vec, inspired by Data2vec (Baevski et al., 2022), is an efficient model for learning representations from raw music waveforms through a teacher–student architecture with masked input prediction. MusicHuBERT adapts the HuBERT (Hsu et al., 2021) framework to music by predicting discrete pseudo‑labels for masked audio segments, using pre‑trained K‑means clustering on MFCC features, which contrasts with Music2vec’s continuous latent predictions. MERT, on the other hand, integrates acoustic and musical representation learning by combining RVQ‑VAE (Défossez et al., 2022) and CQT‑based teacher models in a multi‑task framework.

Considering the previous evaluation settings, a fully connected decision layer was added to each of the three models, with all layers except the final one frozen during training. The learning rate was set to 0.01, with 100 epochs and the Adam optimizer used for training. Table 2 summarizes the results, showing Music2vec’s superior performance across all settings. Using p-values to assess statistical significance, we find that the pre-trained Music2vec model represents excerpts containing folk elements significantly better than the CNN-based architecture (p < 0.005). The improvement is not statistically significant when comparing Music2vec with MERT (p > 0.1). The representations learned by MusicHuBERT, however, fail to generalize to the selected musical elements of Iranian classical music, as reflected by its lower performance in these evaluations.
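A linear-probing sketch under these settings, assuming the publicly released MERT checkpoint on the Hugging Face Hub (the other backbones would be swapped in analogously); the temporal mean pooling and the checkpoint id are our assumptions, while the frozen backbone, single trainable decision layer, learning rate, and optimizer follow the text.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# Assumed checkpoint id for the released MERT model.
CKPT = "m-a-p/MERT-v1-95M"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT, trust_remote_code=True)
backbone = AutoModel.from_pretrained(CKPT, trust_remote_code=True)

for p in backbone.parameters():      # freeze everything except the new head
    p.requires_grad = False

head = nn.Linear(backbone.config.hidden_size, 9)      # nine musical elements
optimizer = torch.optim.Adam(head.parameters(), lr=0.01)  # 100 epochs in the paper

def logits_for(waveform):
    """Compute decision-layer logits for one excerpt. The excerpt should be
    resampled to the backbone's expected rate beforehand (omitted here)."""
    inputs = extractor(waveform, sampling_rate=extractor.sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = backbone(**inputs).last_hidden_state  # (1, frames, dim)
    return head(hidden.mean(dim=1))   # temporal mean pooling (an assumption)
```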

Additionally, Table 3 presents the instrument‑wise metric values for the best‑performing model. Among all the musical elements, the singer consistently achieved the highest performance across almost all settings, which may be due to the presence of vocal samples in the training data of the foundation models. Tonbak (percussion) and Oud (string) also ranked among the top‑performing instruments after ‘singer.’

Table 3

Instrument-wise performance of Music2vec on the test set.

Label Type* | Metric | Tonbak | Singer | Tar   | Kamancheh | Santur | Oud   | Ney   | Setar | Daf
------------|--------|--------|--------|-------|-----------|--------|-------|-------|-------|------
Soft        | MSE    | 5.97   | 3.28   | 6.19  | 7.05      | 6.06   | 3.08  | 5.78  | 6.03  | 2.45
Soft        | R²     | 70.69  | 86.19  | 69.23 | 64.56     | 62.68  | 77.67 | 63.98 | 56.08 | 51.37
Soft        | Acc    | 91.87  | 96.22  | 89.69 | 86.48     | 91.41  | 93.36 | 90.61 | 88.66 | 95.99
Soft        | F1     | 93.47  | 95.45  | 88.19 | 82.23     | 79.22  | 83.52 | 78.65 | 65.26 | 72.87
Hard        | Acc    | 92.55  | 95.99  | 89.92 | 86.71     | 91.18  | 93.01 | 90.15 | 87.17 | 95.53
Hard        | F1     | 94.05  | 95.16  | 88.60 | 83.04     | 79.02  | 83.20 | 78.39 | 60.28 | 70.68

Note: The best performance for each instrument is highlighted in bold. The asterisk (*) refers to the labels used during training, where ‘Hard’ means that aggregated elicited confidence scores greater than 0.5 are treated as 1 during training.

4 Conclusions

In this paper, we introduced HamNava, a novel, crowd‑sourced dataset designed to address segment‑level multi‑label instrument classification in Iranian classical music. The dataset provides 6,000 audio excerpts, each annotated for the presence or absence of nine musical elements, including instruments and vocals, with a soft‑labeling approach that captures annotator confidence. Our methodology leveraged everyday messaging platforms to streamline the crowd‑sourcing process, relying on community engagement and social incentives to encourage participation.

HamNava fills a gap in cross-cultural MIR, offering a fully labeled dataset suitable for evaluating and developing robust and transferable systems. Additionally, we performed cross-cultural research on foundation models that were primarily trained on the Western popular and classical canon. This approach highlights the potential for transferring knowledge from Western-centric models to improve MIR systems for lesser-explored music traditions, emphasizing the opportunities to adapt these models for cross-cultural tasks. By detailing the construction, annotation processes, and baseline performance of the dataset, we provide a resource that encourages further exploration of this field.

Future work will expand the dataset with additional instruments and vocal techniques, integrate multi‑modal data, and explore advanced learning approaches tailored to cross‑cultural research in music. We hope HamNava serves as a foundational resource, encouraging the MIR community to explore a wider spectrum of musical cultures and bridge the gap between historically dominant and underrepresented traditions. By using HamNava, researchers can evaluate the cross‑cultural performance of pre‑trained models, examining whether their training strategies and architectures can generalize to musical elements that are underrepresented in their original training data.

5 Reproducibility

NAVAAK provided the data for this research, for which the authors express their gratitude. The dataset, which includes the audio files and aggregated soft annotations, is available for non-commercial research purposes only, upon request through the project webpage. The provided dataset consists of 6,000 MP3 audio files at 22.05 kHz, along with annotations in CSV format split into three files: train, validation, and test. Each file contains a ‘sample_id’ column as well as columns for the aforementioned eight instruments and vocals. The Telegram bot code used for collecting the annotations is also provided on the same webpage.
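Assuming the hypothetical file layout and column names below (only the ‘sample_id’ column is confirmed in the text; the element column names and audio paths are illustrative), the splits can be loaded as follows:

```python
import pandas as pd

# Hypothetical file names; the release provides train/validation/test CSVs.
splits = {name: pd.read_csv(f"hamnava/{name}.csv")
          for name in ("train", "validation", "test")}

# Assumed column names for the nine musical elements.
ELEMENTS = ["Tar", "Setar", "Santur", "Oud", "Kamancheh",
            "Ney", "Tonbak", "Daf", "Singer"]

train = splits["train"]
y_soft = train[ELEMENTS].to_numpy()        # aggregated soft labels in [0, 1]
y_hard = (y_soft > 0.5).astype(int)        # binarized at the 0.5 threshold
audio_paths = "hamnava/audio/" + train["sample_id"].astype(str) + ".mp3"
```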

Acknowledgments

The second author, Bagher BabaAli, gratefully acknowledges Dr. Jung‑Woo Ha and NAVER Corporation for their crucial support and inspiration in initiating his research in the field of music information retrieval during his visit to NAVER in Seoul, South Korea, in the summer of 2018.

Competing Interests

The authors have no competing interests to declare.

Ethical Considerations

No humans were harmed or placed at risk in the course of this research. All contributors provided their annotations voluntarily with full knowledge that the data would be used for the construction of a research dataset. No personal information beyond the annotations was collected from participants. The dataset was compiled in compliance with copyright and data protection legislation, and all material was sourced from recordings for which academic-use licenses could be obtained.

Funding Information

The authors declare that no funds, grants, or other support were received during the preparation of this project or the dataset.

DOI: https://doi.org/10.5334/tismir.257 | Journal eISSN: 2514-3298
Language: English
Submitted on: Feb 15, 2025
Accepted on: Jun 15, 2025
Published on: Jul 28, 2025
Published by: Ubiquity Press

© 2025 Pouya Mohseni, Bagher BabaAli, Hooman Asadi, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.