RWC Revisited: Towards a Community‑Driven MIR Corpus

Stefan Balke; Johannes Zeitler; Vlora Arifi-Müller; Brian McFee; Tomoyasu Nakano; Masataka Goto; Meinard Müller

doi:10.5334/tismir.326

1 Introduction

The Real World Computing (RWC) Music Database has played a foundational role in Music Information Retrieval (MIR) research for over two decades (Goto, 2004; Goto et al., 2002, 2003). Its carefully curated audio recordings—covering popular, classical, jazz, and other genres—together with aligned Musical Instrument Digital Interface (MIDI) files and a variety of annotations (e.g., beat, structure, and chords (Goto, 2006)), have supported a broad range of tasks, including structure analysis (Paulus and Klapuri, 2009; Wang et al., 2022), beat tracking (Böck et al., 2016), chord recognition (Korzeniowski and Widmer, 2016), automatic music transcription (Dessein et al., 2010), and synchronization (Ewert et al., 2009).

Released at a time when freely accessible music data were scarce due to copyright constraints, the RWC Music Database addressed a critical need by providing royalty‑free, original content under clear legal conditions. Distributed on physical compact discs (CDs) at moderate cost, it enabled reproducible research and quickly became a widely used benchmark in the MIR community. The dataset includes five main sub‑collections for music—Pop (RWC‑P), Royalty‑Free (RWC‑R), Classical (RWC‑C), Jazz (RWC‑J), and Genre (RWC‑G)—which together include 328 tracks totaling approximately 23 hours of audio, as illustrated in Figure 1.^¹ A sixth sub‑collection, the Instrument Database (RWC‑I), features isolated recordings of 50 instruments but is not within the scope of this work. Over time, RWC evolved into a de facto standard and became integral to evaluation campaigns such as MIREX,^² with over 1,700 citations to its core publications (see Figure 2).

Overview of the RWC Music Database. Illustration includes a comic‑style portrait of Masataka Goto, generated using OpenAI’s DALL⋅E model via ChatGPT.

Citation counts from 2003–2024 for Goto et al. (2002), Goto et al. (2003), and Goto (2006), respectively.

This paper marks a pivotal moment in the history of RWC: for the first time, the audio recordings are being made openly available under a Creative Commons license. Hosted on Zenodo in lossless quality and paired with a GitHub repository for collaborative development, the RWC dataset is transitioning into a fully open and community‑supported resource.

Beyond the dataset release itself, this paper has two main objectives: it documents and contextualizes the transition of the RWC dataset into the open research domain, and it offers a broader reflection on the role of open, community‑driven data practices within the MIR field. By revisiting the origins and impact of RWC—including insights from an interview with its original creator, Masataka Goto—we examine what it takes to sustain a dataset over decades and how collaborative engagement can help renew and extend its relevance. In doing so, this paper aims to serve both as a practical roadmap for similar initiatives and as a case study demonstrating how legacy resources can be revitalized through open and inclusive collaboration.

In the remainder of this paper, we first provide historical background on the creation and evolution of the RWC Music Database in Section 2. We then situate RWC within the broader landscape of open‑source and open‑data practices in MIR in Section 3. Sections 4 and 5 offer detailed insights into the structure of the dataset’s audio and MIDI components, respectively. In Section 6, we explore strategies for collaborative dataset development and outline opportunities for future community contributions. Finally, in Section 7, we summarize our key findings and reflect on broader implications for sustaining and advancing community‑driven MIR resources.

2 RWC Background

This section outlines the historical foundations of the RWC Music Database, drawing on insights from an interview conducted by Meinard Müller and Stefan Balke with Masataka Goto on February 18, 2025. We highlight the key developments and motivations behind the creation of the RWC dataset, as well as its evolving role within the MIR community since its initial release. For a more personal and in‑depth account of the motivations, design choices, and early development of the RWC dataset, we refer readers to the edited transcript of the interview (Müller et al., 2025).

In the 1990s, Masataka Goto recognized a pressing need for annotated musical data—first to evaluate his work on beat tracking as a PhD student, and later for melody extraction and fundamental frequency (F0) estimation as a researcher at the Electrotechnical Laboratory (later the National Institute of Advanced Industrial Science and Technology (AIST)). Faced with limited resources, he manually annotated beats and F0 contours of commercial recordings using custom tools he had developed. These early challenges—especially data scarcity and the difficulty of reproducible evaluation—led to the idea of creating a publicly accessible, copyright‑cleared dataset for computer‑driven music research. This vision aligned with Japan’s RWC program, a national effort aimed at advancing intelligent information processing. Supported by this program, Goto and his team developed the RWC Music Database, designed to provide high‑quality, professionally produced audio recordings across diverse musical genres. The dataset was created to support the emerging field of Music Information Processing, a precursor to what is now known as MIR.

Significant milestones of the RWC database and related publications are listed in Table 1. In 2000 and 2001, the RWC dataset was designed mainly by Masataka Goto. For certain genres, such as jazz and classical music, the external experts Keiji Hirata and Yuzuru Hiraga, respectively, were consulted to help select suitable pieces. Rather than licensing commercial music, Goto decided to commission original recordings to ensure full copyright clearance. With the help of a music production company, he created a dataset of recordings by professional musicians and ensembles (including the Tokyo City Philharmonic Orchestra), covering a broad repertoire ranging from pop and jazz to classical music and instrumental sounds. Following the production phase, the first RWC collections (Pop, Royalty‑Free, Jazz, and Classical) were introduced at ISMIR 2002 (Goto et al., 2002). The Genre and Instruments collections were subsequently presented at ISMIR 2003 (Goto et al., 2003).

Table 1

Key developments in the evolution of the RWC Music Database.

Year	Comment	Sub.‑Coll.
2000/01	Dataset design and recordings
2002	RWC presentation (Goto et al., 2002)	{P,R,C,J}
2003	RWC extension (Goto et al., 2003)	{G,I}
2006	AIST Annotations (Goto, 2006)
2009	Music alignment (Ewert et al., 2009)^³	C
2011	Chord annotations (Cho and Bello, 2011)^⁴	P

[i] Abbreviations: P = Pop, R = Royalty‑Free, C = Classical, J = Jazz, G = Genre, I = Instruments.

Beyond audio, Goto recognized the importance of annotations for advancing research and prepared MIDI files that were created after music production by professional music transcribers from the karaoke industry. Although the audio recordings were distributed on CDs, these MIDI files and lyric text files have been distributed online from the beginning. However, the MIDI files are not necessarily aligned with the audio, and the original RWC dataset does not contain annotations other than lyrics and track‑level metadata. Goto therefore developed custom annotation tools and collaborated with a music college graduate to manually align the MIDI files with the audio and to annotate beats, structural boundaries, and F0 contours. These efforts culminated in the AIST Annotations, which complemented the RWC recordings and became widely used benchmarks for MIR evaluation (Goto, 2006). In subsequent years, the MIR community continued to enhance the dataset through additional contributions—providing, for example, high‑resolution audio–MIDI alignments for the classical collection (Ewert et al., 2009) and chord annotations for popular music (Cho and Bello, 2011).

The RWC dataset had a profound impact on the MIR field, providing one of the first large‑scale resources that combined professionally produced audio with rich, structured annotations. At a time when internet‑based distribution was not yet feasible, the dataset was made available on physical CDs, ensuring global accessibility. Goto’s efforts not only laid the groundwork for modern MIR research practices but also inspired the development of new tools, methods, and user interfaces. The strong and immediate response from the international research community—especially following its presentation at ISMIR 2002—underscored the significance of RWC and encouraged Goto to continue contributing additional annotations and tools in support of open, reproducible music research. As shown in Figure 2, the three core publications describing the RWC collections, along with the AIST Annotations, have been cited in 1,774 publications between 2003 and 2024.^⁵

A selection of key publications that build upon and make extensive use of the RWC dataset is summarized in Table 2. These works illustrate the dataset’s broad impact across the MIR field. First, the RWC database has been applied to a wide variety of core MIR tasks, demonstrating its versatility. Second, the publications span leading venues, including conferences such as ISMIR, ICASSP, and WASPAA, as well as prominent journals. Third, the Popular Music collection (P) emerges as the most frequently used subset—likely due to its extensive set of available annotations. Finally, the selection reflects the progression of MIR research itself, from early signal processing techniques like melody extraction to source separation methods based on Non‑Negative Matrix Factorization (NMF) and more recent applications of deep neural networks, such as for beat tracking. The continued use of RWC in current studies—including work on lyrics processing and music generation—underscores its lasting relevance within the community.

Table 2

Selection of key works building upon the RWC Music Database.

Task @ Venue	Sub‑Collections	Citation
Chorus‑Section Detection @ ICASSP	P	(Goto, 2003)
Drum Sound Detection @ ISMIR	{P,I}	(Yoshii et al., 2004)
Musical Interfaces @ ISMIR	{P,R,C,J,G}	(Goto and Goto, 2005)
Singer Identification @ ISMIR	P	(Fujihara et al., 2005)
Instrument Identification @ ISMIR	{C,I}	(Kitahara et al., 2005)
Drum Loop Retrieval @ CBMI	P	(Gillet and Richard, 2005)
Audio Retrieval @ ISMIR	{P,R,C,J}	(Bertin and de Cheveigné, 2005)
Music Source Separation @ GRETSI	C	(Vincent and Gribonval, 2005)
Audio Melody Extraction @ WASPAA	G	(Ryynänen and Klapuri, 2005)
Automatic Mixing @ AXMEDIS	P	(Katayose et al., 2005)
Drum Sound Detection @ ICASSP	P	(Yoshii et al., 2006)
Lyrics Alignment @ ISM	P	(Fujihara et al., 2006)
Pitch Estimation @ ICASSP	P	(Fujihara et al., 2006)
Music Structure Analysis @ ISMIR	{P,J}	(Bruderer et al., 2006)
Musical Interfaces @ ISMIR	J	(Hamanaka, 2006)
Audio Coding @ TASLP	{C,J}	(Derrien et al., 2006)
Automatic Music Transcription @ ISMIR	P	(Ryynänen and Klapuri, 2006)
Genre Classification @ ISMIR	RWC	(Reed and Lee, 2006)
Music Structure Analysis @ AMCMM	P	(Paulus and Klapuri, 2006)
Pitch Estimation @ TASLP	{C,J}	(Kameoka et al., 2007)
Singing Voice Retrieval @ ISMIR	P	(Fujihara and Goto, 2007)
Thumbnail Image Generation @ ISMIR	G	(Yoshii and Goto, 2008)
Singing Synthesis @ SMC	P	(Nakano and Goto, 2009)
Beat Tracking @ TASLP	{P,C,J}	(Grosche and Müller, 2011)
Singer Identification @ ISMIR	P	(Lagrange et al., 2012)
Music Source Separation @ ICML	P	(Yoshii et al., 2013)
Singing Voice Detection @ ISMIR	P	(Lehner et al., 2013)
Pitch Estimation (pYIN) @ ICASSP	P (synth.)	(Mauch and Dixon, 2014)
Music Mashup @ TASLP	P	(Davies et al., 2014)
Chord Estimation @ ISMIR	P	(Zhou and Lerch, 2015)
Beat Tracking @ ISMIR	P	(Böck et al., 2016)
Melody Harmonization @ ISMIR	P	(Tsushima et al., 2017)
Lyrics Transcription @ ICASSP	P	(Nishikimi et al., 2019)
Singing Voice Separation @ WASPAA	P	(Nakano et al., 2019)
Audio Declipping @ TASLP	{P,J,C}	(Gaultier et al., 2021)
Singing Voice Extraction @ Electronics	P	(Gao et al., 2021)
Lyrics Generation @ ISMIR	P	(Watanabe and Goto, 2023)
Music Generation @ IJCAI	P	(Lin et al., 2024)

3 Open Music Data

The MIR field has made remarkable strides toward greater transparency—driven not only by broader trends in open‑source software and open data, but also through its own community‑led efforts to develop, share, and maintain tools and datasets. On the software side, Python libraries such as librosa (McFee et al., 2015) and madmom (Böck et al., 2016) have provided researchers with reliable, well‑documented implementations that serve as foundations for a wide range of applications. Educational resources like the fmp‑notebooks (Müller and Zalkow, 2019), with their extensive explanations and visualizations, have further lowered the entry barrier and deepened understanding across the community. Additionally, the mir‑eval library filled a critical gap by offering standardized evaluation metrics to support reproducible experiments (Raffel et al., 2014).

Open datasets are equally essential. While open‑source code supports reproducibility, true replicability—particularly in machine learning—requires access to the original data. Without it, researchers can only approximate prior results, which undermines both transparency and the reliability of scientific findings.

In its early years, the MIR field lagged behind other research domains in terms of openly accessible datasets. Copyright restrictions often prevented the distribution of audio recordings, posing a major challenge for tasks that require direct access to sound—such as transcription, recommendation, or music generation. In contrast, other fields benefited early on from public data repositories such as the UCI Machine Learning Repository,^⁶ which supported benchmarking and empirical research from the outset. Offering hundreds of datasets across diverse domains—including biology, finance, healthcare, social science, and games—the UCI repository provided researchers with immediate access to structured, ready‑to‑use data for algorithm development and evaluation.

The release of the RWC Music Database marked a pivotal moment for the MIR community. Although not fully open—since the audio was not in the public domain and had to be purchased on physical media—it was licensed for research use and remained easily accessible, making it a widely adopted benchmark for tasks such as beat tracking, structural segmentation, and chord recognition. This balance of legal clarity and practical usability enabled consistent evaluation and reproducibility, establishing RWC as a foundational resource. It also served as a model and inspiration for later datasets.

The Million Song Dataset, for example, supported large‑scale recommendation research without providing audio, instead offering precomputed features and listening histories (Bertin‑Mahieux et al., 2011). In contrast, MedleyDB included full audio access, focusing on multitrack recordings and melody annotations (Bittner et al., 2014), parts of which were later incorporated into MUSDB18 for source separation (Rafii et al., 2017). Other specialized datasets soon followed, including MAPS and MAESTRO, which support research in piano transcription (Emiya et al., 2010; Hawthorne et al., 2019). Genre‑specific collections such as the Wagner Ring Dataset (Weiß et al., 2023) and Schubert’s Winterreise (Weiß et al., 2021) either provide links to commercial CD recordings or include carefully curated older recordings for which copyright has expired, making them freely available for research purposes.

Several of these datasets gained prominence through open challenges, where research teams submitted and evaluated their systems on shared tasks. These challenges often culminated in workshops—either as standalone events or integrated into larger conferences such as ISMIR. Within the MIR community, MIREX (as previously mentioned) has played a central role by offering standardized benchmarking tasks, including melody extraction and beat tracking (Downie, 2008). Following a period of reduced activity, MIREX was recently revitalized with updated tasks, such as music generation. Other focused competitions—like the Sound/Music Demixing Challenge—have also contributed significantly to progress, particularly in music source separation (Fabbro et al., 2024).

4 RWC Dataset: Audio

Audio recordings are the core component of the RWC Music Database. As part of this work, we announce the official release of the original RWC audio material under a Creative Commons license (CC BY‑NC 4.0^⁷).^⁸ All recordings are now freely available for research purposes and hosted on Zenodo.^⁹ Crucially, the distributed files stem from the original master tracks used in the CD production—not from consumer‑ripped copies. This distinction is essential for ensuring high fidelity and timing accuracy. Ripping software may introduce inconsistencies such as compression artifacts and time shifts, which can compromise tasks involving synchronization, alignment, or precise timing analysis. By providing the original master‑quality recordings, this release serves as an authoritative audio reference that enables reproducible MIR research and promotes consistent, high‑quality dataset usage across the community.

4.1 Structure and design

The RWC Music Database is organized into six sub‑collections: Pop (P), Royalty‑Free (R), Classical (C), Jazz (J), Genre (G), and Instruments (I). Table 3 provides an overview of the first five sub‑collections, which form the core focus of our current initiative. The sixth sub‑collection, Instruments, consists of a range of isolated notes from 50 instruments. Since this sub‑collection does not contain complete performances, we do not consider it in this work.^⁹ Collectively, the first five sub‑collections include 328 tracks, totaling approximately 23 hours of audio. The largest among them are Pop and Genre, with 100 and 102 tracks, respectively, each contributing nearly 7 hours of material. Classical and Jazz follow, with 61 and 50 tracks, while Royalty‑Free includes 15 tracks totaling around 30 minutes. The discrepancy between the number of tracks and the number of pieces in Table 3 arises from defining a ‘piece’ as a distinct musical work. This distinction is particularly relevant in the Classical collection, where a single work, such as a symphony, may span multiple tracks corresponding to its individual movements. These special cases are covered by our naming convention: each file name begins with a prefix indicating its sub‑collection, followed by a three‑digit number corresponding to the piece number (e.g., RWC_P014). If a piece consists of multiple parts, these are indicated by a capital letter appended to the piece number (e.g., RWC_C024A, RWC_C024B, and RWC_C024C). Audio files, MIDI files, and extra annotation files, such as lyrics, follow these patterns to allow for easy and structured access to corresponding files.
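The naming convention can be handled programmatically. The following is a minimal sketch, assuming that file stems look exactly like the examples given above (RWC_P014, RWC_C024A); the regular expression, the `RwcId` type, and the `parse_rwc_id` helper are hypothetical names introduced here for illustration and are not part of the released dataset tooling.

```python
import re
from typing import NamedTuple, Optional

# Pattern assumed from the examples in the text (RWC_P014, RWC_C024A).
# The letters P, R, C, J, G follow the abbreviations used in Table 3.
_ID_PATTERN = re.compile(r"^RWC_([PRCJG])(\d{3})([A-Z]?)$")


class RwcId(NamedTuple):
    collection: str       # one of P, R, C, J, G
    piece: int            # three-digit piece number
    part: Optional[str]   # capital letter for multi-part pieces, else None


def parse_rwc_id(stem: str) -> RwcId:
    """Split a file stem such as 'RWC_C024B' into its components."""
    match = _ID_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"not an RWC identifier: {stem!r}")
    collection, piece, part = match.groups()
    return RwcId(collection, int(piece), part or None)


print(parse_rwc_id("RWC_P014"))
print(parse_rwc_id("RWC_C024B"))
```

Grouping files by `(collection, piece)` would then collect all parts of a multi-part work, such as the three tracks of RWC_C024.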

Table 3

Overview of the five sub‑collections of the RWC Music Database.

ID	Sub‑Collection	#Pieces	#CDs	#Tracks	Dur.
RWC‑P	Popular Music	100	7	100	6:43:36
RWC‑R	Royalty‑Free Music	15	1	15	0:32:23
RWC‑C	Classical Music	50	6	61	5:27:08
RWC‑J	Jazz Music	50	4	50	3:42:20
RWC‑G	Genre Mix	100	9	102	6:58:26
$\sum$		315	27	328	23:23:55

[i] Abbreviations: P = Pop, R = Royalty‑Free, C = Classical, J = Jazz, G = Genre.

All tracks in the RWC Music Database were specifically performed and recorded for inclusion in the collection. To avoid copyright restrictions, the dataset includes only public domain works or newly commissioned compositions created exclusively for the RWC project. The RWC‑P subset includes entirely original works, featuring contributions from 25 composers, 30 lyricists, and 23 arrangers. In RWC‑G, 73 pieces were newly composed, while RWC‑J contains 46 original compositions by 4 composers and 1 lyricist.

The RWC sub‑collections were curated with distinct design objectives that reflect their respective musical domains. RWC‑P aims to capture a range of mainstream pop styles. Of its 100 pieces, 80 are sung in Japanese and emulate the sound of 1990s Japanese pop charts, while the remaining 20 are in English, inspired by American pop music from the 1980s. In contrast with RWC‑P, which consists of 100 newly composed songs, RWC‑R includes 15 well‑known public‑domain songs, including 10 traditional songs with English lyrics and 5 Japanese children’s songs. The RWC‑C collection was designed to offer variety across instrumentation, historical periods, compositional styles, and composers. RWC‑J is structured into three parts: the first features five jazz pieces, each recorded with seven different instrumentations (for a total of 35 tracks); the second includes nine pieces illustrating stylistic diversity, such as modal and free jazz; and the third presents six fusion‑style tracks. Finally, RWC‑G spans 10 musical genres (popular, rock, dance, jazz, Latin, classical, marches, world, vocals, and traditional Japanese music), further divided into 33 sub‑genres,^¹⁰ each represented by at least three pieces, with one additional a cappella piece rounding out the collection.

4.2 Statistical overview

Figure 3 shows the distribution of track durations across the five RWC sub‑collections. RWC‑P tracks have a mean duration of 4:02 (mm:ss), ranging from 0:52 to 10:08. RWC‑J exhibits a slightly higher average of 4:26, with durations between 2:23 and 7:39. RWC‑G tracks average 4:06, with a similar range to RWC‑P tracks. RWC‑R contains the shortest tracks overall, with a mean of 2:09 and durations from 1:46 to 2:56. RWC‑C spans the widest range, from 0:50 to 17:59, and has the highest average duration of about 5:21 (5:27:08 over 61 tracks; see Table 3). To support vocal‑specific research tasks such as melody extraction, singing voice detection, source separation, or lyrics alignment, Figure 3 highlights the tracks containing singing. Vocal content varies considerably across the RWC sub‑collections: all 100 tracks in RWC‑P include vocals, compared to 52 in RWC‑G, 15 in RWC‑R, 6 in RWC‑C, and only 2 in RWC‑J. These differences reflect the distinct stylistic and functional goals that guided the design of each collection.

Track duration distributions for the five RWC sub‑collections. Tracks containing singing voice are highlighted with black overlays.

As a final statistic, Figure 4 presents the tempo distribution for the RWC‑P collection. Tempos range from 62 to 200 beats per minute (BPM), with an average of approximately 110 BPM. The distribution reveals three prominent clusters centered around 80 BPM, 100 BPM, and 130 BPM. While the other sub‑collections do not include explicit tempo annotations in their metadata, approximate tempo estimates can be derived from the beat annotations available in the AIST annotation set.
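Deriving a tempo estimate from beat annotations can be sketched as follows. This is an illustrative example only: it assumes that the beat positions have already been loaded as a list of times in seconds (the file format of the AIST annotations is not covered here), and it uses the median inter‑beat interval, which is one common choice rather than a method prescribed by the dataset.

```python
import numpy as np


def estimate_tempo_bpm(beat_times_sec):
    """Approximate a global tempo (BPM) from beat positions in seconds."""
    beats = np.asarray(beat_times_sec, dtype=float)
    if beats.size < 2:
        raise ValueError("need at least two beats")
    inter_beat_intervals = np.diff(beats)
    # The median is less sensitive to a few irregular intervals than the mean.
    return 60.0 / float(np.median(inter_beat_intervals))


# Synthetic example: one beat every 0.5 s corresponds to 120 BPM.
beats = np.arange(0.0, 10.0, 0.5)
print(estimate_tempo_bpm(beats))
```

A single global value is only meaningful for pieces with a roughly stable tempo; for expressive performances, a local tempo curve computed from the same inter‑beat intervals would be more informative.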

Tempo distribution of the RWC Popular Music sub‑collection, showing the tempo range in beats per minute (BPM) across all tracks.

5 RWC Dataset: MIDI

From its inception, the RWC Music Database was designed to include both audio and symbolic representations, with each track complemented by a corresponding MIDI file. These MIDI files were not created during the original recording sessions but were transcribed afterwards by professionals from Japan’s karaoke industry, following the standard practices of the time. While this approach ensured broad symbolic coverage, it also led to temporal misalignments between audio and MIDI, limiting their suitability for tasks requiring precise timing.

To address this limitation, Masataka Goto later released the AIST Annotations, which include a set of manually aligned MIDI files created using dedicated editing tools.^¹¹ While the alignment process was labor‑intensive, it substantially improved temporal accuracy and enhanced the dataset’s usefulness for a broad range of audio–symbolic MIR tasks, such as score‑informed source separation, automatic transcription, and performance analysis. Beyond these applications, the MIDI data also enables statistical analysis of instrument usage across the dataset. For example, Figure 5 shows the number of note events for the 10 most common instrument groups, based on the instrument labels provided in the MIDI files.

Number of note events per collection and instrument group (top 10).

It has long been known that alignment issues persisted—particularly in the Classical sub‑collection (RWC‑C), where audio and MIDI often exhibit significant temporal discrepancies. Moreover, our recent inspection revealed that similar misalignment problems exist across other sub‑collections as well. In response, we initiated a systematic effort to refine and realign all MIDI files in the RWC dataset, combining automated synchronization algorithms with careful manual verification to enhance the correspondence between audio and symbolic timelines.

In the following subsections, we detail this alignment initiative. Beyond improving the dataset’s accuracy and utility for downstream MIR applications, our work reflects a broader vision: establishing a transparent, community‑driven corpus that supports collaborative contributions—such as corrections, validations, and extended annotations—through reproducible workflows and open infrastructure.

As part of this re‑release, we provide both the original MIDI files from AIST and the temporally adjusted versions. Together, they constitute a core component of the RWC dataset and offer a solid foundation for future annotation and research efforts.

5.1 Alignment issues

During our inspection of the RWC Music Database, we identified three main types of alignment issues between the audio recordings and their corresponding MIDI files, which we describe below in order of increasing complexity.

The first issue involves a global offset, where the MIDI file is uniformly shifted in time relative to the audio. Such offsets can result from differences in audio sources (e.g., ripped vs. master files), decoding artifacts (e.g., MP3 compression), or inconsistencies introduced during processing. Although relatively easy to correct, even small offsets can compromise accuracy in time‑sensitive applications.

The second issue arises when, in addition to a global offset, the overall performance speed of the audio and MIDI differs by a constant factor. That is, the playback tempo in the MIDI file is consistently slower or faster than in the audio, leading to an increasing temporal drift over time. While a perfect alignment corresponds to a time‑scaling factor of 1.0, we observed deviations typically ranging from 0.97 to 1.03, resulting in misalignments of several seconds in longer recordings. We refer to this as the linear time scaling (LTS) case, which necessitates correcting both the global offset and the constant tempo mismatch.
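As a rough worked example of how such a factor accumulates, consider a linear relation between MIDI time and audio time. The symbols $a$ (time‑scaling factor), $b$ (global offset), and $D$ (track duration) are introduced here for illustration and are not notation from the original text.

```latex
t_{\text{audio}} = a \, t_{\text{MIDI}} + b,
\qquad
\Delta(D) = |a - 1| \cdot D
% Example: a = 1.03 and D = 240 s give \Delta = 0.03 \cdot 240 s = 7.2 s.
```

Under this assumed model, a four‑minute track with a factor at the edge of the observed range drifts by about 7 seconds beyond the global offset, which is consistent with the misalignments of several seconds mentioned above.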

The third and most complex issue involves non‑linear timing deviations between the audio and MIDI, typically caused by tempo fluctuations in expressive performances. Unlike a constant tempo mismatch, these deviations change over time and require a non‑linear warping of the MIDI timeline to accurately align with the audio. We refer to this as the non‑linear time warping (NTW) case. Such timing irregularities are particularly common in the Classical collection (RWC‑C) and in Western classical pieces within the Genre collection (RWC‑G), though they may also occur in other sub‑collections.

5.2 Alignment correction

To address the three types of alignment issues, we first manually annotated the onset time of the first and the offset time of the last note event in each audio recording.^¹² These anchor points served as stable temporal references and were used to reduce boundary artifacts during synchronization.

Based on these annotations, we computed an initial alignment between the audio and MIDI files using the SyncToolbox (Müller et al., 2021), a high‑resolution synchronization framework tailored to Western music. It combines multiscale dynamic time warping (MsDTW), memory‑restricted DTW, and onset‑enhanced alignment strategies (Ewert et al., 2009; Prätzlich et al., 2016). By integrating chroma‑based and onset‑based features, this approach produces alignment paths that are both robust and temporally precise. The manually annotated anchor points were used to guide and stabilize the alignment, particularly at the beginning and end of each track, where we found that automatic methods often introduce errors.
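To illustrate the core idea behind such an alignment path, the sketch below implements plain textbook DTW on chroma‑like features with NumPy. It is a simplified stand‑in and does not use the SyncToolbox API: the multiscale, memory‑restricted, and onset‑enhanced components described above are omitted, and the function names `cosine_cost_matrix` and `dtw_path` are hypothetical.

```python
import numpy as np


def cosine_cost_matrix(X, Y):
    """Cosine-distance cost matrix between feature sequences X (d x N) and Y (d x M)."""
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-12)
    return 1.0 - Xn.T @ Yn


def dtw_path(C):
    """Classic DTW with step sizes (1,0), (0,1), (1,1); returns the optimal path."""
    N, M = C.shape
    D = np.full((N + 1, M + 1), np.inf)   # accumulated cost, padded by one row/column
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            D[n, m] = C[n - 1, m - 1] + min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1])
    # Backtrack from the last cell to the first cell.
    n, m = N, M
    path = [(n - 1, m - 1)]
    while (n, m) != (1, 1):
        candidates = [
            (D[n - 1, m - 1], n - 1, m - 1),
            (D[n - 1, m], n - 1, m),
            (D[n, m - 1], n, m - 1),
        ]
        _, n, m = min(candidates)
        path.append((n - 1, m - 1))
    return np.array(path[::-1])


# Toy example: six one-hot "chroma" frames, and the same sequence at half speed.
X = np.eye(12)[:, :6]
Y = np.repeat(X, 2, axis=1)
path = dtw_path(cosine_cost_matrix(X, Y))
print(path[0], path[-1], len(path))
```

In this toy case, each frame of the first sequence is matched to two consecutive frames of the second, so the path has a slope of two. Multiplying the frame indices by the feature hop size would convert the path to seconds.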

We used the resulting alignment paths to categorize each audio–MIDI pair into either the LTS case or the NTW case. In the LTS scenario, a perfect alignment between the audio and MIDI should follow a linear function characterized by an offset and a constant slope. Based on this observation, we applied least‑squares fitting to approximate each alignment path with a linear function. To reduce the influence of boundary artifacts, we excluded the first and last 10% of the alignment path from the fitting process. Approximations with a mean squared error (MSE) below a predefined threshold (empirically set to 0.01) were classified as LTS cases, while those exceeding the threshold were considered NTW cases.
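The classification step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the alignment path is assumed to be given in seconds with columns (MIDI time, audio time), the threshold of 0.01 is interpreted in squared seconds, and `classify_alignment` is a hypothetical helper name.

```python
import numpy as np


def classify_alignment(path_sec, trim=0.10, mse_threshold=0.01):
    """Fit t_audio ~ slope * t_midi + offset and label the pair as LTS or NTW.

    path_sec: array of shape (L, 2) with columns (t_midi, t_audio) in seconds
    (the column order and unit are assumptions of this sketch).
    """
    path_sec = np.asarray(path_sec, dtype=float)
    length = len(path_sec)
    skip = int(trim * length)             # drop the first and last 10% of the path
    t_midi = path_sec[skip:length - skip, 0]
    t_audio = path_sec[skip:length - skip, 1]
    slope, offset = np.polyfit(t_midi, t_audio, deg=1)   # least-squares line
    mse = float(np.mean((slope * t_midi + offset - t_audio) ** 2))
    label = "LTS" if mse < mse_threshold else "NTW"
    return label, float(slope), float(offset), mse


# Synthetic paths: a purely linear one and one with a slow tempo fluctuation.
t_midi = np.linspace(0.0, 100.0, 501)
linear_path = np.column_stack([t_midi, 0.98 * t_midi + 1.5])
wobbly_path = np.column_stack([t_midi, t_midi + 2.0 * np.sin(2 * np.pi * t_midi / 100.0)])
print(classify_alignment(linear_path)[0])
print(classify_alignment(wobbly_path)[0])
```

The fitted slope and offset of an LTS case can then be reused directly for the timeline correction, while NTW cases keep the full alignment path.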

For LTS cases, we adjusted the MIDI timeline using the linear fit, which involved first extrapolating the global offset, then applying a uniform time scaling based on the fitted slope. For NTW cases, we used the initial alignment path obtained from the SyncToolbox to non‑linearly warp the MIDI file. In practice, applying the linear adjustment in LTS cases proved more robust and accurate than relying on the more general NTW strategy, which is better suited for performances with expressive tempo fluctuations.
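Both corrections amount to mapping MIDI event times onto the audio timeline. The sketch below operates on plain arrays of note times in seconds rather than on MIDI files, so reading and writing the files is left out; the function names `apply_lts` and `apply_ntw` and the (MIDI time, audio time) column order are assumptions made for illustration.

```python
import numpy as np


def apply_lts(note_times_sec, slope, offset):
    """Linear time scaling: shift by the offset and scale by the fitted slope."""
    return slope * np.asarray(note_times_sec, dtype=float) + offset


def apply_ntw(note_times_sec, path_sec):
    """Non-linear warping: piecewise-linear interpolation along the alignment path.

    path_sec: array of shape (L, 2) with columns (t_midi, t_audio) in seconds.
    """
    path_sec = np.asarray(path_sec, dtype=float)
    # np.interp expects increasing x values, so keep the first audio time
    # for each distinct MIDI time (vertical path segments are collapsed).
    t_midi, first_index = np.unique(path_sec[:, 0], return_index=True)
    t_audio = path_sec[first_index, 1]
    # Times outside the path range are clamped to its end points by np.interp.
    return np.interp(np.asarray(note_times_sec, dtype=float), t_midi, t_audio)


print(apply_lts([0.0, 10.0], slope=0.98, offset=1.5))
print(apply_ntw([5.0], [[0.0, 1.0], [10.0, 12.0]]))
```

Note offsets and other time‑stamped events (e.g., tempo or pedal changes) would need to be mapped with the same function so that note durations remain consistent.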

Figure 6a illustrates a representative LTS case, where the alignment path closely follows the linear fit. The offset at the beginning reflects a delayed initial MIDI onset, while the slope below one suggests that the audio performance was slightly faster than the MIDI. Figure 6b shows an NTW case, where the initial alignment deviates substantially from the linear fit, reflecting local tempo fluctuations that cannot be captured by a global scaling. Finally, Table 4 summarizes the number of MIDI files classified as LTS or NTW cases according to our heuristic criteria. For example, in the RWC‑P subset, 94 MIDI files are identified as LTS cases, of which 65 required almost no adjustment and 29 involved a time stretch. In contrast, all 61 MIDI files in the RWC‑C subset are NTW cases.

Two audio‑MIDI alignment examples illustrating linear time scaling (LTS) and non‑linear time warping (NTW) cases. The background shows the chroma‑based cost matrix; the red line indicates the alignment path computed with the SyncToolbox (Müller et al., 2021), while the dotted cyan line shows the corresponding linear fit. (a) LTS case: Musette in D major by Bach, performed on harpsichord (RWC_C024C). (b) NTW case: Prelude and Liebestod from Wagner’s Tristan und Isolde, performed by a symphony orchestra (RWC_C009).

Table 4

Classification of MIDI files into LTS (Linear Time Scaling) and NTW (Non‑linear Time Warping) cases based on heuristic criteria, summarizing the number of files per category. The two LTS columns indicate whether applying a time‑scaling factor of 1.0 results in an accumulated alignment error of $\leq 50$ ms (indicating tempo agreement between audio and MIDI) or $> 50$ ms (indicating a tempo mismatch requiring time scaling to adapt the MIDI). The 50 ms threshold reflects a typical tolerance used in time‑critical tasks such as onset detection.

ID	#Tracks	LTS (≤ 50 ms)	LTS (> 50 ms)	NTW
RWC‑P	100	65	29	6
RWC‑R	15	13	2	–
RWC‑C	61	–	–	61
RWC‑J	50	6	5	39
RWC‑G	102	27	16	59
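The split of the LTS cases into the two columns of Table 4 can be sketched as follows. The exact definition of the accumulated alignment error is not spelled out in the text; this example assumes it is the drift that remains at the end of the MIDI file when the fitted slope is replaced by 1.0, and `table4_column` is a hypothetical helper.

```python
def table4_column(label, slope, midi_duration_sec, tolerance_sec=0.050):
    """Return the Table 4 column for one audio-MIDI pair.

    Assumption of this sketch: leaving the tempo untouched (factor 1.0)
    accumulates an error of |slope - 1| * duration by the end of the file.
    """
    if label == "NTW":
        return "NTW"
    accumulated_error = abs(slope - 1.0) * midi_duration_sec
    if accumulated_error > tolerance_sec:
        return "LTS (> 50 ms)"
    return "LTS (<= 50 ms)"


print(table4_column("LTS", slope=1.0001, midi_duration_sec=240.0))
print(table4_column("LTS", slope=0.98, midi_duration_sec=240.0))
print(table4_column("NTW", slope=1.0, midi_duration_sec=240.0))
```

For a four‑minute track, the 50 ms tolerance corresponds to a slope within roughly 0.02% of 1.0 under this assumption, which suggests why files in the first column needed almost no adjustment.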

5.3 Alignment verification

To verify the accuracy of the MIDI alignments, we employed sonification using the libsoni toolbox (Özer et al., 2024). MIDI events were rendered and played alongside the corresponding audio, enabling direct auditory comparison. Verification of all tracks through listening was carried out by the authors of this article, with particular attention given to the beginning and ending passages of each recording.^¹³
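The principle of such a listening check can be illustrated with a simple click‑track sonification. The sketch below uses NumPy only and does not reproduce the libsoni API; the function names, the click design, and the default sampling rate of 22050 Hz are assumptions chosen for illustration.

```python
import numpy as np


def click_track(onsets_sec, duration_sec, sr=22050, click_freq=1000.0, click_dur=0.03):
    """Render short decaying sine clicks at the given onset times (in seconds)."""
    num_samples = int(round(duration_sec * sr))
    out = np.zeros(num_samples)
    t = np.arange(int(click_dur * sr)) / sr
    click = np.sin(2 * np.pi * click_freq * t) * np.exp(-t / (click_dur / 5))
    for onset in onsets_sec:
        start = int(round(onset * sr))
        if start < 0 or start >= num_samples:
            continue                      # ignore onsets outside the signal
        end = min(start + len(click), num_samples)
        out[start:end] += click[: end - start]
    return out


def mix(audio, clicks, click_gain=0.5):
    """Overlay the click track on a mono audio signal of the same sampling rate."""
    n = min(len(audio), len(clicks))
    return audio[:n] + click_gain * clicks[:n]


# Synthetic example: two aligned MIDI onsets overlaid on two seconds of silence.
clicks = click_track([0.5, 1.0], duration_sec=2.0)
mixed = mix(np.zeros(len(clicks)), clicks)
print(len(mixed), float(np.abs(mixed).max()) > 0.0)
```

Playing the mixed signal makes timing errors audible as clicks that precede or lag the corresponding note attacks in the recording.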

Overall, the alignment corrections yielded reliable results, especially for tracks classified as LTS cases. However, for a number of tracks, particularly those involving soft‑onset instruments such as strings, the alignment produced by the SyncToolbox proved less accurate (e.g., the solo cello recording RWC_C041). In such cases, we manually annotated additional anchor points between audio and MIDI to stabilize and improve the alignment. Human intervention was especially valuable in performances featuring expressive timing variations or ambiguous onset cues.
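
One simple way to use manually annotated anchor points is piecewise-linear interpolation between them. The sketch below illustrates this idea only; the source does not state how the anchors were integrated, and they may equally have served as constraints for the automatic alignment. The function name `warp_time` and the anchor values in the usage note are hypothetical.

```python
from bisect import bisect_right


def warp_time(t_midi, anchors):
    """Map a MIDI time (s) to audio time (s) by piecewise-linear interpolation.

    anchors: at least two (midi_time, audio_time) pairs, strictly increasing in both.
    Times outside the anchored range are extrapolated from the nearest segment.
    """
    midi_times = [m for m, _ in anchors]
    # Index of the segment [i, i + 1] containing t_midi, clamped to the valid range.
    i = min(max(bisect_right(midi_times, t_midi) - 1, 0), len(anchors) - 2)
    (m0, a0), (m1, a1) = anchors[i], anchors[i + 1]
    return a0 + (t_midi - m0) * (a1 - a0) / (m1 - m0)
```

For example, with the anchors `[(0.0, 0.5), (10.0, 10.0), (20.0, 22.0)]`, a MIDI event at 15 s falls in the second segment and is mapped to 16 s in the audio.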

Despite our best efforts, some limitations persist. In particular, tracks from the Classical (RWC‑C) and Genre (RWC‑G) collections may still contain transcription inaccuracies or musically ambiguous passages. In contrast, the Popular Music collection (RWC‑P) generally offers the highest synchronization quality, and the Royalty‑Free collection (RWC‑R) also proved to be highly reliable.

While the listening‑based verification was carried out with care, it remains informal and inherently subjective. In the process, various note‑level discrepancies between the MIDI and corresponding audio were identified, including missing, incorrect, or additional notes. Many of these issues stem from limitations in the original MIDI annotations or transcription artifacts. Correcting them requires meticulous manual effort and lies beyond the scope of this initiative. A more systematic and quantitative evaluation will be necessary in future work, ideally tailored to the specific needs and constraints of individual application contexts.

To facilitate ongoing refinement, we encourage community contributions such as corrections, improvements, and extended annotations. These can be integrated into the versioning infrastructure described in the next section. For instance, the recording RWC_P009 begins with a drum pick‑up beat that contains transcription errors in the MIDI—an issue that could be addressed through community engagement. Further discussion of the community’s role follows below.

6 Towards Community‑Driven Curation

Early MIR datasets were typically developed by individual labs or through institutional funding. Today, improved tooling and infrastructure—such as version‑controlled repositories, standardized annotation formats, and open contribution models—make collaborative, community‑driven maintenance feasible. Instead of static releases, modern datasets can evolve with input from a broad research community. In this context, the concept of an ‘RWC 2.0’ has emerged: a unified effort to consolidate disparate RWC versions, improve alignment between symbolic and audio content, and standardize annotations—all under a clear, permissive license that supports sustainable community involvement.

Community‑maintained datasets offer both opportunities and challenges. Open collaboration broadens the contributor base, improving data quality, enabling new annotations, and fostering tool development across the ecosystem. However, such distributed efforts require careful coordination: data curation must follow shared standards, and all changes should be versioned and reviewed to prevent inconsistencies or regressions. Projects like mirdata already support this approach by offering a unified interface to many datasets, along with tools for download management and data validation (Bittner et al., 2019).

Inspired by software engineering and open‑source development, the MIR field is now well‑equipped for robust, transparent, and community‑driven data curation. At the core of the updated RWC dataset are the original master recordings, which were used for the original compact disc production and are now hosted on Zenodo within a dedicated community collection.^⁹ To preserve historical context and ensure continued access, a copy of the original RWC website is archived alongside the audio. Complementing this, a GitHub‑based infrastructure manages annotations, including MIDI files, beat and melody annotations, and other metadata.^¹⁴ This setup supports long‑term storage, version control, collaborative contributions, and transparent change tracking, aligning the dataset with best practices in open and sustainable research (McFee et al., 2019).

The GitHub infrastructure, serving as the main hub for community contributions, begins with the rwc‑annotations‑archive repository.^¹⁵ This repository mirrors annotation files from the original RWC website, including MIDI and other data. However, these legacy materials vary in format and are often poorly documented, making them difficult to interpret and reuse. New community contributions, such as time‑aligned lyrics, are first added to the rwc‑annotations‑archive, which acts as a central collection point for both original and newly created resources.^¹⁶

A second repository, rwc‑annotations, hosts cleaned and standardized versions of the annotations.^¹⁴ Here, files from the rwc‑annotations‑archive are converted into a consistent, well‑documented format to support long‑term accessibility and usability. For pragmatic reasons, and with awareness of potential limitations, we adopt a simple comma‑separated values (CSV) format, chosen for its human‑readability, ease of parsing, and broad compatibility with text editors and spreadsheet tools. The structure and semantics of each CSV file are carefully documented to ensure sustainability and transparency.
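
To illustrate the ease of parsing, the following sketch reads a small beat annotation with Python's standard `csv` module. The column names `time_s` and `beat_in_bar` and the example values are hypothetical; the actual structure of each file is defined by the documentation in the repository.

```python
import csv
import io

# Hypothetical beat annotation; real column names are given in the repository docs.
EXAMPLE_CSV = """time_s,beat_in_bar
0.52,1
1.01,2
1.49,3
1.98,4
"""


def read_beats(handle):
    """Parse rows into (time in seconds, beat position) tuples."""
    reader = csv.DictReader(handle)
    return [(float(row["time_s"]), int(row["beat_in_bar"])) for row in reader]


beats = read_beats(io.StringIO(EXAMPLE_CSV))
```

The same function accepts an open file handle, so annotations on disk can be read without further changes.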

The conversion from the rwc‑annotations‑archive to the cleaned annotations is intended as a one‑time process. Dedicated scripts transform the data into the standardized format before inclusion in rwc‑annotations. Once migrated, backward compatibility with the archive is no longer maintained—encouraging future work to build on the cleaned format. This workflow balances openness to community contributions with the need for a consistent, well‑documented annotation set.

Community‑suggested corrections to the annotations are tracked in the rwc‑annotations repository via GitHub features such as issues and pull requests. Figure 7 illustrates this workflow: a user forks the repository, applies changes, and submits a pull request, which is then reviewed by maintainers. Revisions may be requested before approval and merging. Drawing on experience from building the ChoraleBricks dataset (Balke et al., 2025), continuous integration tests help prevent format errors (e.g., invalid values) and annotation mistakes (e.g., duplicate fundamental frequency entries). Following software development best practices, a textual changelog summarizes all changes. To support reproducibility and citation, we recommend regular versioned releases, using a semantic versioning scheme.^¹⁷

Illustration of the workflow for incorporating community‑provided corrections into the centralized annotation repository.
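
The kind of check that a continuous integration test might run on a pull request can be sketched as follows. The example validates a fundamental frequency annotation for non-numeric or negative values and for duplicate time stamps. The function name and the columns `time_s` and `f0_hz` are assumptions for illustration and do not describe the tests actually used in the repository.

```python
import csv
import io


def validate_f0_rows(handle):
    """Return a list of error messages for a hypothetical 'time_s,f0_hz' CSV."""
    errors = []
    seen_times = set()
    # The header occupies line 1, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        try:
            time_s, f0_hz = float(row["time_s"]), float(row["f0_hz"])
        except (KeyError, TypeError, ValueError):
            errors.append(f"line {line_no}: missing or non-numeric value")
            continue
        if time_s < 0 or f0_hz < 0:
            errors.append(f"line {line_no}: negative value")
        if time_s in seen_times:
            errors.append(f"line {line_no}: duplicate time stamp {time_s}")
        seen_times.add(time_s)
    return errors
```

A test suite would call such a function on every annotation file and fail the build if the returned list is not empty.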

Beyond infrastructure and versioning, the sustainability of a dataset ultimately depends on active community participation. It is this collective effort that keeps the resource relevant, accurate, and broadly useful. Engagement fosters shared ownership, encourages diverse contributions, and helps uncover issues that might otherwise go unnoticed. A living dataset relies not just on well‑designed systems but also on a committed, collaborative user base that continuously uses, critiques, and improves it.

7 Conclusions

The re‑release of the RWC Music Database under an open license marks a significant milestone in the evolution of community‑driven research infrastructure in MIR. By making the original master‑quality audio freely available and establishing a modern, version‑controlled annotation ecosystem, we aim to transform RWC from a static historical resource into a living, community‑driven MIR corpus.

Our efforts not only preserve the legacy and continued relevance of the RWC dataset but also serve as a practical case study in revitalizing research assets through collaborative, open‑source principles. The GitHub‑based infrastructure supports transparent development workflows, facilitates community contributions, and enables reproducible research by linking data, code, and changelogs in a cohesive ecosystem. The decision to separate archival data from cleaned, standardized annotations helps to ensure long‑term maintainability without sacrificing openness to experimentation and iteration.

More broadly, this initiative reflects a growing maturity in the MIR field, one where open data, shared infrastructure, and sustained community engagement are increasingly recognized as essential pillars of scientific progress. It also underscores the importance of historical awareness: by revisiting and improving upon foundational datasets like RWC, we not only enhance current research practices but also honor the work of those who laid the groundwork for the field. Whether the RWC dataset will ever be truly complete remains an open question. Initial ideas for an ‘Everything Corpus’ were outlined by Gotham et al. (2025). Even without a formal evaluation, the original RWC release already followed many of the principles described there. With the re‑release, data availability in particular is further improved, making the corpus much more accessible to future generations of researchers.

Looking ahead, we invite the MIR community to actively shape the continued development of the RWC dataset. Our initiative marks only the beginning. Inconsistencies in the MIDI alignments have already been identified and corrected, which suggests that further issues remain to be found and offers an opportunity for community engagement. Whether by proposing corrections, contributing new annotations, improving tools, or exploring novel use cases, researchers and developers can help ensure that RWC evolves into a robust, versatile, and inclusive resource for years to come.

Acknowledgements

The International Audio Laboratories Erlangen are a joint institution of the Friedrich–Alexander–Universität Erlangen–Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS. Part of this work was initiated through conversations during the Dagstuhl Seminar 24302 ‘Learning with Music Signals: Technology Meets Education.’

Funding Information

SB and MM were funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant number 500643750 (MU 2686/15‑1).

Competing Interests

The authors have no competing interests to declare. MM is a member of the editorial team and has recused himself from any editorial involvement with this article.

Authors’ Contributions

The re‑release of the dataset was led by SB and MM, encompassing conception, coordination, execution, and documentation. MG contributed the original data and provided valuable historical context. TN supported its long‑term dissemination. The refinement of the audio–MIDI alignment was carried out by VAM, JZ, and SB. BM provided insights into open‑source and open‑data best practices and into community‑led initiatives. All co‑authors actively participated in the preparation of the final manuscript.

Data Accessibility

To foster reproducibility and further research, we provide access to the RWC dataset and related resources.

Dataset: The RWC dataset is available under a CC BY‑NC 4.0 license on Zenodo: https://zenodo.org/communities/rwc-music/.
Annotations: Structured and cleaned annotations are available on GitHub: https://github.com/rwc-music/rwc-annotations.

Notes

[1] A detailed description is given in Section 4.

[2] MIREX (Music Information Retrieval Evaluation eXchange) is an annual community‑driven benchmarking initiative that evaluates state‑of‑the‑art algorithms across a wide range of MIR tasks.

[3] https://audiolabs-erlangen.de/resources/MIR/SyncRWC60/, accessed May 27, 2025.

[4] https://github.com/tmc323/Chord-Annotations, accessed May 27, 2025.

[5] Sources: https://scholar.google.com/citations?view_op=view_citation&user=BkZggZkAAAAJ&citation_for_view=BkZggZkAAAAJ:u5HHmVD_uO8C, https://scholar.google.com/citations?view_op=view_citation&user=BkZggZkAAAAJ&citation_for_view=BkZggZkAAAAJ:d1gkVwhDpl0C, https://scholar.google.com/citations?view_op=view_citation&user=BkZggZkAAAAJ&citation_for_view=BkZggZkAAAAJ:LkGwnXOMwfcC, accessed February 10, 2025.

[6] https://archive.ics.uci.edu/, accessed May 27, 2025.

[7] https://creativecommons.org/licenses/by-nc/4.0/.

[8] The choice of the ‘CC BY‑NC 4.0’ license best aligns with the original licensing approach and most accurately reflects the intended purpose of the RWC dataset as a resource for research, including use within corporate research settings. More detailed information is available in the corresponding Zenodo data repository.

[9] The data are available online at https://zenodo.org/communities/rwc-music/, accessed October 2025.

[10] For details, we kindly refer to the original publication (Goto et al., 2003).

[11] We kindly refer the reader to Table 1 in Goto (2006) for an overview on the included annotations.

[12] Note that we manually checked the offset times. However, this point in time can only be approximated.

[13] During verification, we found that the MIDI files for RWC_G025 and RWC_G026 were accidentally swapped in the original RWC Music Database and need to be corrected.

[14] https://github.com/rwc-music/rwc-annotations, accessed October 2025.

[15] https://github.com/rwc-music/rwc-annotations-archive, accessed October 2025.

[16] Keeping the original annotations in a separate repository is based on practical experience. As the archive may grow in size, storing these files separately helps keep the cleaned annotations lightweight and maintain a small memory footprint. However, changing this design is up to the community.

[17] https://semver.org/, accessed May 2025.
