Improving Motif Discovery of Symbolic Polyphonic Music with Motif Note Identification

Jun-You Wang; Yu-Chia Kuo; Li Su

doi:10.5334/tismir.250

1 Introduction

According to Grove Music Online, a motif is ‘a short musical idea, melodic, harmonic, rhythmic, or any combination of these three’ (Drabkin, 2001). The Oxford Companion to Music further states that motif ‘brings unity, relationship, coherence, logic, comprehensibility, and fluency to a composition by means of its repetition and varied recurrence’ (Whittall, 2011). In a nutshell, a motif refers to a subset of notes that are 1) short, 2) musically important, and 3) repetitive in a music piece. To process such a foundational unit of music, motif discovery, the task of finding motifs and their occurrences in music, serves a key role in music information retrieval (MIR), with particular applications such as music analysis (Velarde et al., 2016), music classification (Boot et al., 2016; Louboutin and Meredith, 2016), music retrieval (Frieler et al., 2018), and music generation (Hu et al., 2022; Shih et al., 2023).^¹

The motif discovery of polyphonic symbolic music data is a challenging task. The musical importance of a motif is identified in a context‑dependent manner, subject to experts’ analysis and interpretation. The strong repetition of a pattern does not imply its musical importance. On the other hand, a pattern being important in one music piece (e.g., the famous G‑G‑G‑E♭ motif in Beethoven’s Fifth Symphony) might not be as important as in another. Besides, the repetition behavior of a motif is non‑static; it usually develops into highly variant occurrences throughout the music piece. Such characteristics not only complicate the data‑annotation process (Finkensiep et al., 2020; Krause et al., 2020; Srinivasamurthy et al., 2021; van Kranenburg et al., 2016) but also hinder the formulation of a generalizable objective to learn the motif events. Under such circumstances, previous research usually adopts certain relaxed problem scenarios to discuss the motif‑discovery task. For example, in the JKU Patterns Development Database (JKU‑PDD) (Meredith, 2013), for the Discovery of Repeated Themes and Sections task of MIREX, motifs, themes, and sections are treated as equivalent in the annotations, although their lengths may differ quite a lot (Collins, 2013). Wu et al. (2023) considered a simplified scenario of finding the time interval of motif occurrence instead of the exact note group of musical importance. Finally, current repeated pattern discovery (RPD) algorithms, which are mostly rule‑based, usually produce numbers of redundant detections and have to rely on heuristics to remove them (Benammar et al., 2017; Hsiao et al., 2023; Lartillot, 2014; Meredith, 2013; Pinto, 2010). To our knowledge, data‑driven approaches to the motif discovery problem are rarely explored, likely because of the limited availability of annotated data.

In this paper, we discuss the motif discovery problem at the level of individual motif notes. As the main contribution of this paper, we propose a data‑driven approach to predict the musical importance of each note to be considered as in a motif, then integrate this knowledge into a rule‑based RPD algorithm, resulting in a hybrid motif discovery method. Directly learning to segment motif occurrences is challenging, particularly when available training data are scarce. To address this, we propose a new and simplified task, which is coined as motif note identification (MNID), to identify whether each note belongs to an occurrence of a motif (referred to as a motif note) or not (referred to as a non‑motif note). Subsequently, RPD algorithms are applied exclusively to the identified motif notes, ensuring that the output consists only of relevant motif notes.

Specifically, we discuss the motif annotations from two datasets, JKU‑PDD (Collins, 2013) and BPS‑Motif (Hsiao et al., 2023). For our purposes, we define motives as in the BPS‑Motif dataset: monophonic, repeated patterns of notes (ignoring rests) which are ‘considered as motifs’ by an annotator. Different motifs may occur at the same time, whereas a note may belong to multiple motifs. Occurrences of the same motif may differ in their exact number of notes as well as in pitches and interval. Such a definition applies to JKU‑PDD in general, but with two notable differences: first, JKU‑PDD is annotated with thematic repeated patterns,^² typically longer than the motivic ones in BPS‑Motif. To facilitate the discussion, we use the terms of motif discovery and MNID to discuss the two datasets altogether, while recognizing the difference in average pattern lengths and the derived musical meanings between them. Second, JKU‑PDD is available with two sets of annotations, one for polyphonic motives and one for monophonic ones; we follow the definition and use the monophonic annotations as our targets. Also, with such a definition, the MNID task can be viewed as a form of horizontal component (e.g., voice, melody, staff) extraction in symbolic music data. Its objectives therefore align closely with related horizontal component‑extraction tasks, such as melody identification (MeloID) (Kosta et al., 2022; Simonetta et al., 2019). Given such a similarity, the MNID task can benefit from the labels of related tasks (e.g., MeloID) to overcome the issue of data scarcity. Moreover, the MNID task can also be seamlessly integrated as a downstream task of a pretrained self‑supervised learning (SSL) model (Chou et al., 2021). Specifically, one can fine‑tune a pretrained SSL model to MNID through transfer learning on a small amount of training data.

We explore training two types of models for MNID: 1) a convolutional neural network (CNN) model originally designed for MeloID (Simonetta et al., 2019) and 2) a transformer model fine‑tuned from the pretrained MidiBERT model (Chou et al., 2021). Experiments demonstrate that using MNID models to remove non‑motif notes consistently yields significant improvements for the subsequent task of RPD. Additionally, we show that leveraging pseudo‑labels and MeloID results from external datasets offers a promising strategy to further enhance MNID performance. The source code of this work is available at https://github.com/york135/TISMIR2025_motif_discovery.

2 Related Work

2.1 Pattern matching and RPD

There are two primary interrelated task scenarios for the processing of musical patterns. First, the pattern matching or the pattern‑retrieval task aims to identify pattern occurrences that match or are musically relevant to a given query pattern (Frieler et al., 2018; Janssen et al., 2014; Janssen et al., 2017; Wu et al., 2023). Both rule‑based (Frieler et al., 2018; Janssen et al., 2017) and data‑driven (Wu et al., 2023) methods have been developed for pattern matching. Second, the RPD task aims to find all possible repeated patterns along with their occurrences without any given query (Meredith, 2006). The required output of the pattern matching/discovery tasks can be simply the time interval (Frieler et al., 2018; Wu et al., 2023) or the exact note group of the pattern (Hsiao et al., 2023; Meredith et al., 2002). These tasks have also been discussed in many specific scopes, e.g., within single pieces (intra‑opus) or corpora (inter‑opus), monophonic music or polyphonic music, or audio data or symbolic data (Hsiao et al., 2023; Krause et al., 2020). In this paper, we scope our motif‑discovery task as a kind of intra‑opus RPD task on polyphonic symbolic music data.

Hsiao et al. (2023) categorized RPD approaches into three main types: string‑based, geometry‑based, and feature‑based methods. First, string‑based methods represent music data as a one‑dimensional pitch sequence and find repeated patterns with sub‑string matching (Cambouropoulos et al., 2002; Hsu et al., 1998). This approach works well for monophonic music but is less applicable in polyphonic music. To solve this, the geometry‑based approach represents music notes as a multidimensional point set (Meredith, 2016). The translatable equivalent classes (TECs), which are sets of patterns that can match each other by way of a translation operation (i.e., shifting) in the multidimensional feature space, are then identified within the point set (Björklund, 2022; Meredith et al., 2002; Meredith, 2013; Meredith, 2019; Meredith, 2006). These TECs are considered to be the discovered repeated patterns. Starting from the Structure Induction Algorithm with Translational Equivalence Classes (SIATEC) algorithm (Meredith et al., 2002) that finds all TECs by exact matching, several subsequent works have been proposed to improve its efficiency and effectiveness. These improvements can be categorized into four types: 1) retaining only musically important patterns produced by SIATEC, such as Forth’s algorithm (Forth and Wiggins, 2009), COSIATEC, or SIATECCompress (Meredith, 2013); 2) discovering more plausible/important patterns than the original SIATEC, such as the use of compactness trawler in SIACT (Collins et al., 2010) and the segment heuristic discussed in Meredith (2013); 3) speeding up the computation efficiency with heuristics, such as SIAR (Collins, 2011) and SIATEC‑C (Björklund, 2022); and 4) discovering inexact patterns, such as SIARCT‑CFP (Collins et al., 2013). Apparently, one algorithm may achieve multiple types of improvement. We refer the interested readers to Björklund (2022) and Meredith (2016) for a comprehensive survey of pattern‑discovery algorithms. The performance of geometry‑based methods often requires additional heuristics to keep the musically important patterns and remove the unimportant ones. The most often mentioned heuristics are compactness (for example, suppressing isolated notes, filling significant gaps in detected patterns) (Collins et al., 2010) and compression (for example, discarding highly overlapped patterns or patterns with few repetitions) (Meredith et al., 2002; Meredith, 2013; Meredith, 2019). A widely noticed issue with the geometric‑based method is its high computational complexity (Björklund, 2022). Empirically, such computational complexity may even increase when applying the compression heuristics, as it usually requires additional recursive operations to sort all the pattern candidates.

While geometry‑based methods operate directly on the notes as a multi‑dimensional point set, feature‑based methods first transform the note data into a latent space using feature engineering or representation learning techniques, and then measure pattern‑level similarity within that transformed space. For example, Velarde et al. (2016) use the wavelet transform to obtain feature representations of the melody, then performed segmentation and agglomerative clustering to accomplish motif discovery. Recently, Wu et al. (2023) used the Siamese network architecture with contrastive learning to learn motif‑level representations for pattern retrieval.

2.2 Horizontal component extraction

Horizontal component extraction tasks have been extensively discussed in symbolic MIR. The definition of ‘horizontal component’ is, however, quite flexible and context‑dependent. Relevant tasks may incorporate MeloID for homophonic texture (Kosta et al., 2022; Simonetta et al., 2019), voice segregation for polyphony texture (Hsiao and Su, 2021; Karystinaios et al., 2023), and staff prediction for music score engraving (Foscarin et al., 2024). Early developments for these tasks are mostly heuristic‑based algorithms derived from the principles of melody perception (e.g., proximity, continuity) (Deutsch, 2013; Huron, 2001). Examples include the skyline method for MeloID (Uitdenbogerd and Zobel, 1999), contig mapping for voice separation (Chew and Wu, 2004; Guiomard‑Kagan et al., 2016), and some local optimization approaches (Kilian and Hoos, 2002). On the other hand, modern solutions are mostly based on deep learning architectures, including CNNs (Hsiao and Su, 2021; Simonetta et al., 2019), recurrent neural networks (Kosta et al., 2022; Lu and Su, 2018), graph neural networks (Foscarin et al., 2024; Karystinaios et al., 2023), and transformers (Chou et al., 2021; Zhao et al., 2023). Recently, MeloID was further considered as a downstream task of pre‑trained SSL models for symbolic music processing (Chou et al., 2021; Shen et al., 2023; Zhao, 2024).

3 Proposed Method

We follow the problem formulation of RPD proposed in MIREX (Collins, 2013). Take a music piece $D$ composed of $N$ distinct notes, i.e., $D : = {d_{n}}_{n = 1}^{N}$ , where $d_{n} : = (o_{n}, p_{n})$ represents the $n$ ‑th note, with its onset and pitch in $D$ . In particular, the onset value is a floating point number in crotchet beats, while the pitch value is an integer‑valued MIDI number. The objective of the motif‑discovery task is to find a list of motifs $M : = {m_{i}}_{i = 1}^{K}$ , $m_{i} \in D$ , where $K$ is the number of distinctive motifs. Each motif $m_{i}$ contains a list of occurrences $m_{i} : = {o_{i j}}_{j = 1}^{R_{i}}$ , $o_{i j} \in m_{i}$ , where $R_{i}$ is the number of occurrences of $m_{i}$ . Loosely speaking, we hope that $o_{i p}$ and $o_{i q}$ are musically similar for all $i \in [1, K]$ and for all $p$ , $q \in [1, R_{i}]$ such that $m_{i}$ is musically important in $D$ . In other words, unlike a general RPD system, a motif‑discovery system has to understand whether a motif candidate represents an important musical idea. Figure 1 shows an example to explain the idea of the motif‑discovery task. Given the input music data in Figure 1a, the objective of the motif‑discovery task is to color the notes and make the result look like Figure 1b: we need to extract the motif notes and identify each occurrence of each motif (red or blue notes) while ignoring the non‑motif (gray) notes. Note that some of the gray notes here can form repeated patterns, such as the chord Fmin (F3, A♭3, and C4) in the second bar, which repeats three times. However, they are unlikely to be a motif given the current context. Therefore, they are still considered as non‑motif notes.

A visualization of the motif‑discovery task. The excerpt comes from the first four bars of Beethoven’s *Sonata No. 1 in F Minor*, Op. 2, No. 1. Red and blue notes represent two different motifs, while gray notes are non‑motif notes that do not belong to an occurrence of any motif.

The right‑hand side of Figure 2 shows the proposed motif discovery framework. It consists of two steps—namely, motif note identification (MNID) and RPD. For MNID, a binary classification model $f (\cdot)$ is applied to remove the notes that are less likely to construct a motif, i.e.,

An overview of the proposed motif discovery framework, which divides the motif‑discovery task into motif note identification (MNID) and repeated pattern discovery (RPD). During training (left‑hand side), pseudo‑labeling is adopted to address the low‑resource issue for MNID, where the intersection of a pre‑trained melody identification (denoted as MeloID) model’s prediction and the MNID model’s prediction is treated as the pseudo‑label. During testing (right‑hand side), we first use an MNID model to identify motif notes from the input musical score. Then, we apply an RPD algorithm (such as SIATEC or CSA (Hsiao et al., 2023; Meredith et al., 2002) to discover repeated patterns from the identified motif notes.

1

\tilde{M} = f (D)

where $\tilde{M}$ represents the candidate set of motif notes. Then, in the RPD step, the conventional rule‑based repeated pattern‑discovery methods are applied on $\tilde{M}$ to obtain the discovered motifs ${\tilde{m}}_{i}$ and their occurrences ${\tilde{o}}_{i j}$ . The training of the MNID model, as shown on the left‑hand side of Figure 2, as well as the adopted RPD methods, are both described below (Section 3.1 for MNID; Section 3.2 for RPD).

3.1 MNID

A motif note is a note that belongs to any occurrence of a motif, and the MNID task is to identify motif notes from an input music piece. As mentioned in Section 1, we follow the annotation principles of motif in the BPS‑Motif dataset (Hsiao et al., 2023) to construct the MNID task. This task can then be considered as a modification of the melody‑identification task, with the following differences: 1) the input data can be of any musical texture, while the melody‑identification task is usually limited to homophony (Kosta et al., 2022); 2) not all the melody notes in classical music belong to a motif^³; and 3) motif notes do not always have to be monophonic because two different motives may occur at the same time. Finally, it should be noted that the length of a repeated pattern is highly subject to the annotator’s view of music analysis.

A motif note identification model (the ‘MNID model’ in Figure 2) takes the note sequence $D$ (musical score) as input and produces a binary prediction for each note $d_{n}$ , which indicates whether $d_{n}$ is a motif note or not. With these note‑level predictions, we can then remove non‑motif notes from the musical score (the ‘remove non‑motif notes’ block in Figure 2) and focus on discovering motifs from these motif notes. This step has the potential of reducing the false‑positive motifs discovered by the downstream RPD algorithm.

3.1.1 Model

We consider reusing previous melody‑identification studies to construct our MNID model $f (\cdot)$ . For our interest, we selected two relevant models with different levels of model complexity:

CNN: we first employ a CNN‑based MeloID model as our MNID model (Simonetta et al., 2019). The input of the model is a pianoroll with a temporal resolution of a 32nd note and a pitch resolution of one semitone. In our case, the model outputs a binary value (1 for motif note activation and 0 otherwise) for each pixel in the pianoroll. Note‑level results are obtained by aggregating pixel‑level predictions within a note; we refer to the original paper for details (Simonetta et al., 2019).
Fine‑tuned MidiBERT (Chou et al., 2021): MidiBERT is a pre‑trained model for symbolic music classification. It has been applied to several downstream tasks, including MeloID. MidiBERT consists of 12 bi‑directional transformer layers resembling the ${BERT}_{BASE}$ model (Devlin et al., 2019; Vaswani et al., 2017). The input of the model is the compound word representation (Hsiao et al., 2021). In this work, we directly fine‑tune the pre‑trained model for the MNID task; this can also be regarded as a way to address the data scarcity issue in MNID.

3.1.2 Pseudo‑labeling for MNID

Regarding the issue of data scarcity in MNID, we further consider pseudo‑labeling, a simple semi‑supervised learning method, to improve the performance of MNID. First, we train an MNID model on labeled data. This MNID model is taken as a pseudo‑labeler to predict motif notes for unlabeled data. These data with predictions (called pseudo‑labels) together with the original labeled data are then employed to train a new MNID model.

A major issue of pseudo‑labeling is that the predicted pseudo‑labels could be noisy. To address this, we propose to introduce MeloID (the ‘pre‑trained MeloID model’ in Figure 2) as an auxiliary task to refine the predicted pseudo‑labels. More specifically, by assuming that all motif notes are also melody notes, we assign only the notes that are 1) predicted as motif notes by the MNID pseudo‑labeler and also 2) predicted as melody notes by a MeloID model as the pseudo‑label (see the ‘Intersection’ block shown in Figure 2).^⁴ This MeloID‑assisted method effectively reduces false‑positive pseudo‑labels.

To obtain a MeloID model, we reproduce the model proposed by Simonetta et al. (2019), which achieved an F1‑score of 0.890 in MeloID under cross‑validation on the Mozart Piano Sonata (MPS) dataset. The resulting models are then used to extract melody from the MPS dataset. Note that, in the pseudo‑labeling process, ground‑truth melody annotation is not required. This ensures the scalability of the proposed method. In practice, both the pseudo‑labeling method and the ‘pseudo‑labeling with intersection with melody’ method can be combined with either CNN or MidiBERT.

Figure 3 shows two examples from the MPS dataset to demonstrate the effect of melody intersection in refining the pseudo‑labels. From the two examples, we observe that the MNID model may produce false‑positive predictions on the left‑hand accompaniment part, when they share certain similarities with the patterns in the right‑hand melody part. For example, the one pattern (i.e., purple lines) in the red dotted block of Figure 3c are similar to the inversion of the motif notes in the red block and are considered as motif notes by the pseudo‑labeler. Such excessive false positives may negatively affect subsequent RPD, ultimately limiting the improvement in overall motif discovery performance. In this case, taking the intersection of the MeloID results (Figures 3b and 3d) and MNID results can effectively suppress these excessive false positives. However, this does not imply that the MeloID model alone can serve as a pseudo‑labeler, as it may also output melodic but non‑motif notes (e.g., the first note in Figure 3b). There, the MNID model (Figure 3a) is still needed to filter out such melodic but non‑motif notes.

Comparison of the melody identification (MeloID) and motif note identification (MNID) results on two examples of the Mozart Piano Sonata dataset. **(a)** and **(b)** are the first 10 beats of the the first movement of Mozart’s *Piano Sonata No. 6 in D Major, KV284*. **(c)** and **(d)** are the first 10 beats of the second movement of Mozart’s *Piano Sonata No. 1 in C Major, KV279*.

3.2 RPD

Two types of RPD methods are discussed in this study. The first is SIATEC (Meredith et al., 2002; Meredith, 2013), one of the most widely discussed RPD algorithms to our knowledge. The second is the common structure algorithm (CSA), which has been proposed recently together with the BPS‑Motif dataset (Hsiao et al., 2023). Both algorithms use the vectorized representation of a point set (i.e., difference of two notes on the 2D onset‑pitch space) as the input feature, aiming at finding all the compact sets of the vectors that repeat throughout the music piece. Hsiao et al. (2023) reported that SIATEC takes some advantages in finding thematic patterns that are typically longer, sometimes covering the whole section, while CSA is generally better in finding short motifs. The right‑hand side of Figure 2 (Step 2) illustrates the concepts of the two algorithms, which are further described as follows.

3.2.1 SIATEC and SIATECCompressSegment

In this paper, we consider SIATEC and one of its variants, called SIATECCompressSegment (Meredith, 2013). As shown in Figure 2, the core concept of SIATEC is TEC. Two subsets (i.e., two patterns), $m$ and $n$ , exist, where $m, n \in D$ are translatable (denoted as $m \equiv n$ ) if there exists a vector $v$ such that the translation function $f (d, v) : m \to n; d \to d + v$ is bijective, where $d + v$ means shifting $d$ by the vector $v$ . All the patterns translatable with respect to $m$ form a TEC of $m$ in $D$ :

2

TEC (m, D) : = {n | n \equiv m, n \subseteq D} .

A maximal translatable pattern (MTP) is the longest pattern (measured by the number of notes) that can be translated by a single translatable vector $v$ [20]:

3

MTP (v, D) : = \max_{| d |} {d | d \in D and d + v \in D},

where $| d |$ is the number of notes in $d$ . SIATEC is then an algorithm that finds all the TECs of the MTPs available in $D$ , and, for each translatable vector $v$ , SIATEC only discovers the MTPs translatable by $v$ . Briefly speaking, it first computes the difference between all the note pairs $(d_{i}, d_{j})$ in $D$ , denoted as $v_{i j} : = d_{j} - d_{i}$ for all $1 \leq i < j \leq N$ . Let $V : = {v_{i j}}_{1 \leq i < j \leq N}$ be the set of all translatable vectors. A subset in $V$ containing all $v$ such that $v \equiv \hat{v}$ is then $MTP (\hat{v}, D)$ . Finally, the TECs are derived from the discovered MTPs in $D$ . We refer interested readers to Appendix A and Meredith et al. (2002) for the details of the SIATEC algorithm.

The implementation of SIATEC is available online.^⁵ This implementation further applies a heuristic that discards patterns that achieve the compression ratio of less than a fixed hyper‑parameter $r$ (Meredith, 2013). Specifically, it treats the SIATEC algorithm as a compression algorithm. For a TEC that contains $p$ MTPs, where each MTP contains $| d |$ notes, one can use $| d | + p - 1$ data points to encode it; the first $| d |$ data points encode the first MTP, while the remaining $p - 1$ data points represent the vectors between the first MTP and the remaining $p - 1$ MTPs. In this case, suppose there are no overlapping data points; the SIATEC compresses the original data points with a compression ratio of $\frac{p | d |}{| d | + p - 1}$ . One can then consider such a compression ratio the quality of a discovered pattern (Meredith, 2013) and simply discard patterns (TECs) that achieve a compression rate of less than $r$ .

SIATECCompressSegment is proposed in the ‘discovery of repeated patterns and themes’ challenge of MIREX2013 (Meredith, 2013). Following the above‑mentioned idea of data compression, two heuristics are employed to refine the predictions of SIATEC:

The Compress heuristics: The patterns (TECs) obtained by SIATEC are sorted by the compression ratio in a descending order. Then, the patterns are selected in a greedy manner, until all notes in $D$ are covered by any occurrence of a pattern. In other words, in each selection, the pattern that achieves the best compression ratio (given the set of currently remaining notes) is ‘greedily’ selected, regardless of whether this leads to the globally optimal solution. Unselected patterns are then discarded.
The Segment heuristics: All the notes within the time segment spanned by an occurrence of a pattern are included in that occurrence. The onset timing is used to determine this condition.

Intuitively, the Compress heuristic favors long and frequently repeated patterns. The Segment heuristic assumes that a pattern never ‘skips’ any note. Results from Meredith (2013) showed that these heuristics are effective on the JKU‑PDD, in which the SIATECCompressSegment algorithm achieves state‑of‑the‑art performance. However, the runtime of SIATECCompressSegment is much higher than the original SIATEC algorithm, making it inefficient for larger datasets. Therefore, in our experiments, we utilize SIATECCompressSegment for JKU‑PDD but SIATEC for the BPS‑Motif dataset, as SIATEC performs relatively efficient on larger datasets. See Section 5.5 for the discussion.

3.2.2 CSA

The CSA finds approximate repeated patterns with a bottom‑up approach: it constructs the pattern candidates first, and then finds the prominent ones. Also, it only considers the displacement vectors of the neighboring notes. The first step of the algorithm is to find the context note group of each note $d_{i} : = (o_{i}, p_{i})$ . As shown in the bottom right of Figure 2, the context note group (denoted as $C_{i}$ ) is a compact set of notes, which are next to $d_{i}$ . Defining $Δ C_{i}$ as the set of onset and pitch differences between every two subsequent notes in $C_{i}$ and $| C_{i} |$ as the length of $C_{i}$ , we have

4

\begin{matrix} C_{i} : = {d_{j} | 0 < o_{j} - o_{i} < δ \cap \\ max Δ C_{i} < (γ_{o}, γ_{p}) \cap | C_{i} | \geq m}, \end{matrix}

where the parameter $δ$ limits the range of the context notes and thus restricts the duration of patterns. It should be noted that the idea of limiting $maxΔ C_{i} < (γ_{o}, γ_{p})$ is similar to the compactness condition discussed in Collins et al. (2010) and Meredith (2006), although they are implemented in different ways.

5

S_{i} : = {d_{j} - d_{i} | d_{j} \in C_{i} \ d_{i}},

the representative pattern for $d_{i}$ is obtained by finding the matched vectors between $S_{i}$ and every $S_{j}$ for $j > i$ . Considering that different occurrences of a motif might not be exactly the same, we introduce two tolerance values $(θ_{o}, θ_{p})$ , such that two vectors differing within $θ_{o}$ in the time dimension and with $θ_{p}$ in the pitch dimension are considered to be matched. The set of matched vectors for $d_{i}$ is denoted as $P_{i j} : = S_{i} \tilde{\cap} S_{j}$ for all $j > i$ , where $\tilde{\cap}$ represents the operation of soft matching under the tolerance values described above. Then, all $P_{i j}$ are grouped into (occurrences of) pattern candidates according to the cardinality scores $s (P_{i j}, P_{i k})$ for all $j, k > i$ :

6

s (P_{i j}, P_{i k}) : = \frac{| P_{i j} \tilde{\cap} P_{i k} |}{\max (| P_{i j} |, | P_{i k} |)} .

If $s (P_{i j}, P_{i k}) > θ$ , where $θ$ is a threshold value, then $P_{i j}$ and $P_{i k}$ are considered to be two occurrences of the same pattern candidate. Otherwise, $P_{i j}$ and $P_{i k}$ are considered to be from different pattern candidates. The most representative pattern derived from $d_{i}$ is then the one with the largest set of occurrences:

7

M_{i} := \underset{| Q_{i j} |}{\arg max} Q_{i j},

8

where Q_{i j} := {P_{i k} | s (P_{i j}, P_{i k}) > θ} .

Finally, the representative pattern sets $M_{i}$ and $M_{j}$ are merged into a single set if the cardinal scores of any of the members in $M_{i}$ and $M_{j}$ are greater than $θ$ . Otherwise, $M_{i}$ and $M_{j}$ are considered as two distinct patterns. This merging process is also performed in a greedy manner. The remaining distinct sets after the merging process then represent the distinct patterns, and the members of each set represent the vector for the occurrences of the discovered pattern. Details of the CSA algorithm can be found in Appendix B. A summary of the hyper‑parameters is given in Table 1.

Table 1

Hyperparameters of the CSA algorithm and their values used in the experiments of both datasets.

Name	Definition	Values
Name	Definition	BPS‑Motif	JKU‑PDD
$θ$	The cardinal score threshold that decides whether to merge two motifs	0.5	0.7
$m$	The minimum note number for any motif occurrence	4	4
$θ_{o}$	The onset tolerance for matching two vectors (between two note pairs)	0	0
$θ_{p}$	The pitch tolerance for matching two vectors (between two note pairs)	1	3
$γ_{o}$	The inter‑onset interval threshold for the compactness condition	2	2
$γ_{p}$	The inter‑pitch interval threshold for the compactness condition	$\infty$	$\infty$
$δ$	The maximum duration of motifs/patterns	12	$\infty$

The main reasons why we include CSA in our study are two‑fold. First, as shown in its original paper (Hsiao et al., 2023), CSA achieves state‑of‑the‑art performance in most of the evaluation metrics on the BPS‑Motif dataset, one of the datasets that we use in this work. Second, although CSA was proposed as a general RPD algorithm, it employs a hyperparameter $δ$ to limit the duration of a pattern/motif. This design fits the nature of motif (a short musical idea, as discussed in Section 1), making it a good candidate RPD algorithm for motif discovery.

4 Experiments

The goals of our experiments are three‑fold. First, we benchmark the MNID tasks against the models and pseudo‑label training strategies mentioned in Section 3.1. Second, we demonstrate the effectiveness of MNID in improving the performance of motif discovery in polyphonic symbolic music data, especially on the occurrence and three‑layer measures. Third, we show that MNID can similarly improve the discovery of longer thematic patterns, in addition to those short motivic ones.

4.1 Datasets

Three datasets are adopted in the experiments:

The BPS‑Motif dataset is used to train and evaluate the MNID models and to evaluate the motif discovery task. The BPS‑Motif dataset contains the motif annotations of the first movements of Beethoven’s 32 piano sonatas (Hsiao et al., 2023).^⁶ There are in total around 127,000 notes, with 263 manually annotated motifs and 4,944 occurrences. In our experiments, the BPS‑Motif dataset is partitioned into five folds for cross‑validation.
The JKU‑PDD is used to evaluate the discovery of repeated patterns and themes. JKU‑PDD consists of five classical musical pieces with repeated pattern annotations (Collins, 2013).^⁷ This dataset is widely used in the evaluation of RPD tasks, including Discovery of Repeated Themes and Patterns in the MIREX campaign (Björklund, 2022; Collins, 2013; Meredith, 2013). Not limited to motifs, this dataset also contains annotations at the level of theme or section. Besides, this dataset has two versions: the polyphonic version includes all the notes (i.e., melody plus accompaniment) for both the data and the labels, whereas the monophonic version only includes the skyline notes of the polyphonic version. To make this dataset fit our scenario, where only musically important notes are annotated, we take the polyphonic data as the input and the monophonic labels as the output.^⁸^,^⁹ In this scenario, there are in total 6,421 notes, with 23 annotated patterns and 91 occurrences.
The MPS dataset is used to train the MeloID model and to generate the pseudo‑labels in our experiments. The MPS dataset contains melody note annotations for the 38 movements of Mozart’s 13 piano sonatas (Simonetta et al., 2019).^¹⁰ When training the MeloID model, Sonata K. 282 (No. 4 in E‑flat Major) is excluded because it is also collected in the JKU‑PDD dataset. This leads to a dataset with a total of 95,224 notes. Similarly, five‑fold cross‑validation is conducted.

The cross‑dataset bias in annotation between BPS‑Motif and JKU‑PDD requires special attention: 1) the repeated pattern annotations in JKU‑PDD are restricted to a subset of skyline notes, whereas BPS‑Motif annotations are not; 2) the annotated patterns in BPS‑Motif are generally shorter (i.e., more segmented) than those in JKU‑PDD; and 3) the scale of JKU‑PDD is substantially smaller than that of BPS‑Motif. The SIATECCompress and CSA algorithms were evaluated on JKU‑PDD and BPS‑Motif, respectively, when they were proposed; therefore, the two algorithms might also inherit the biases of the datasets. These cross‑dataset biases must be taken into account when interpreting the results of the two datasets, as will be discussed in Section 5.

4.2 Models for comparison

For the experiments on MNID, three different training settings are compared for both CNN and MidiBERT:^¹¹

no pseudo‑labeling;
using vanilla pseudo‑labeling (denoted as PL); and
using pseudo‑labeling combined with the results of MeloID (denoted as MI).

In addition to the above six settings, we further included a baseline that employs a modified version of the Skyline algorithm (Uitdenbogerd and Zobel, 1999). Skyline is a rule‑based algorithm that always keeps the note with the highest pitch as the melody. We simply adopt this algorithm for MNID but modify it to fit the requirement of the task: if a note has the highest pitch in more than 50% of its duration, it is considered as a motif note; otherwise, it is considered as a non‑motif note. As a result, seven settings are evaluated in total.

Then, for the experiments on motif/theme discovery, the aforementioned seven settings are multiplied by the two types of RPD method (i.e., SIATEC and CSA) mentioned in Section 3.2. Considering the computational efficiency and the effectiveness of the RPD methods, we have the following arrangements, which suffice as a systematic discussion on the combination of MNID and RPD in motif/theme discovery:

For JKU‑PDD, SIATECCompressSegment (denoted as SIATEC_CS for short) and CSA are compared. Reporting the results with SIATEC_CS instead of SIATEC is preferable since SIATEC_CS was reported as a state‑of‑the‑art method on JKU‑PDD (Collins, 2013). Our pilot study also showed that using SIATEC instead of SIATEC_CS does not affect our discussion of the results.
For the BPS‑Motif dataset, SIATEC and CSA are compared. SIATEC_CS is not used here due to its long runtime on the BPS‑Motif dataset, which we will briefly discuss in Section 5.2.

In addition to the seven MNID settings, we further include the oracle setting (i.e., perfect MNID) and a baseline that simply do not use MNID. Our discussion will focus on the effectiveness of MNID and the comparison of different MNID models.

4.3 Evaluation metrics

We report the performance of the MNID and motif‑discovery tasks. First, for MNID, we follow the conventional metrics in the melody‑identification task, which are note‑level accuracy (Acc), precision (P), recall (R), and F1‑score (F) (Kosta et al., 2022). For motif discovery, we follow the evaluation metrics established in the campaign of Discovery of Repeated Themes and Sections in MIREX 2013 (Collins, 2013). There are nine metrics in total, which are precision (P), recall (R), and F1‑score (F) for establishment (est), occurrence (occ), and three‑layer (thr) measurements. These metrics are derived in detail as below.

Following the notation used in Section 3, we denote the ground‑truth motifs by $M : = {m_{i}}_{i = 1}^{K}$ ; each motif $m_{i}$ contains a list of occurrences $m_{i} : = {o_{i j}}_{j = 1}^{R_{i}}$ , where $K$ is the total number of motifs and $R_{i}$ is the number of occurrences of the $i$ th motif. Similarly, we denote the predicted motifs by $\hat{M} : = {{\hat{m}}_{i}}_{i = 1}^{\hat{K}}$ and ${\hat{m}}_{i} : = {{\hat{o}}_{i j}}_{j = 1}^{{\hat{R}}_{i}}$ . The similarity score between two occurrences, $o_{i j}$ and ${\hat{o}}_{k l}$ , denoted as $s (o_{i j}, {\hat{o}}_{k l})$ , is

9

s (o_{i j}, {\hat{o}}_{k l}) : = \frac{| o_{i j} \cap {\hat{o}}_{k l} |}{\max (| o_{i j} |, | {\hat{o}}_{k l} |)},

which is simply the cardinality score between the two occurrences. The similarity matrix for $m_{i}$ and ${\hat{m}}_{k}$ , denoted as $S (m_{i}, {\hat{m}}_{k})$ , is then an $R_{i}$ ‑by‑ ${\hat{R}}_{k}$ matrix, with its $(j, l)$ th element being $s (o_{i j}, {\hat{o}}_{k l})$ .

The establishment measure evaluated how good a motif is established by the motif‑discovery algorithm. It simply defines the establishment similarity level for $m_{i}$ and ${\hat{m}}_{k}$ as $S_{e s t} (m_{i}, {\hat{m}}_{k}) := max (S (m_{i}, {\hat{m}}_{k}))$ . The establishment scores are then defined as:

10

P_{e s t} = \frac{1}{\hat{K}} \sum_{k = 1}^{\hat{K}} max_{i = 1, \dots, K} S_{est} (m_{i}, {\hat{m}}_{k});

11

R_{e s t} = \frac{1}{K} \sum_{i = 1}^{K} max_{k = 1, \dots, \hat{K}} S_{est} (m_{i}, {\hat{m}}_{k});

12

F_{e s t} = \frac{2 P_{e s t} R_{e s t}}{P_{e s t} + R_{e s t}} .

Aside from the establishment score, the occurrence measure further evaluates how good the algorithm discovers the occurrences of a motif. The occurrence‑level similarity metrics between two motifs is defined as

13

S_{o c c, p} (m_{i}, {\hat{m}}_{k}) : = \frac{1}{{\hat{R}}_{k}} \sum_{l = 1}^{{\hat{R}}_{k}} \max_{j = 1, \dots, R_{i}} s (o_{i j}, {\hat{o}}_{k l});

14

S_{o c c, r} (m_{i}, {\hat{m}}_{k}) : = \frac{1}{R_{i}} \sum_{j = 1}^{R_{i}} \max_{l = 1, \dots, {\hat{R}}_{k}} s (o_{i j}, {\hat{o}}_{k l}) .

Then, the occurrence scores are derived from the mean–max of $S_{o c c, p}$ and $S_{o c c, r}$ :

15

P_{o c c} = \underset{i, k s . t . S_{e s t} (m_{i}, {\hat{m}}_{k}) \geq c}{mean} max_{i = 1, \dots, K} S_{occ,p} (m_{i}, {\hat{m}}_{k});

16

R_{o c c} = \underset{i, k s . t . S_{e s t} (m_{i}, {\hat{m}}_{k}) \geq c}{mean} max_{k = 1, \dots, \hat{K}} S_{occ,r} (m_{i}, {\hat{m}}_{k});

17

F_{o c c} = \frac{2 P_{o c c} R_{o c c}}{P_{o c c} + R_{o c c}},

where $c$ is set to 0.75 by default (Collins, 2013), and ‘ $s . t .$ ’ means $s u c h t h a t$ . Finally, the three‑layer measurement is defined hierarchically as follows:

18

P_{1} (o_{i j}, {\hat{o}}_{k l}) : = \frac{| o_{i j} \cap {\hat{o}}_{k l} |}{| {\hat{o}}_{k l} |};

19

R_{1} (o_{i j}, {\hat{o}}_{k l}) : = \frac{| o_{i j} \cap {\hat{o}}_{k l} |}{| o_{i j} |};

20

F_{1} (o_{i j}, {\hat{o}}_{k l}) : = \frac{2 P_{1} (o_{i j}, {\hat{o}}_{k l}) R_{1} (o_{i j}, {\hat{o}}_{k l})}{P_{1} (o_{i j}, {\hat{o}}_{k l}) + R_{1} (o_{i j}, {\hat{o}}_{k l})};

21

P_{2} (m_{i}, {\hat{m}}_{k}) : = \frac{1}{{\hat{R}}_{k}} \sum_{l = 1}^{{\hat{R}}_{k}} \max_{j = 1, \dots, R_{i}} F_{1} (o_{i j}, {\hat{o}}_{k l});

22

R_{2} (m_{i}, {\hat{m}}_{k}) : = \frac{1}{R_{i}} \sum_{j = 1}^{R_{i}} \max_{l = 1, \dots, {\hat{R}}_{k}} F_{1} (o_{i j}, {\hat{o}}_{k l});

23

F_{2} (m_{i}, {\hat{m}}_{k}) : = \frac{2 P_{2} (m_{i}, {\hat{m}}_{k}) R_{2} (m_{i}, {\hat{m}}_{k})}{P_{2} (m_{i}, {\hat{m}}_{k}) + R_{2} (m_{i}, {\hat{m}}_{k})};

24

P_{3} (M, \hat{M}) : = \frac{1}{\hat{K}} \sum_{k = 1}^{\hat{K}} \max_{i = 1, \dots, K} F_{2} (m_{i}, {\hat{m}}_{k});

25

R_{3} (M, \hat{M}) : = \frac{1}{K} \sum_{i = 1}^{K} \max_{k = 1, \dots, \hat{K}} F_{2} (m_{i}, {\hat{m}}_{k});

26

F_{3} (M, \hat{M}) : = \frac{2 P_{3} (M, \hat{M}) R_{3} (M, \hat{M})}{P_{3} (M, \hat{M}) + R_{3} (M, \hat{M})},

where $P_{3} (M, \hat{M})$ , $R_{3} (M, \hat{M})$ , and $F_{3} (M, \hat{M})$ denote the three‑layer precision ( $P_{t h r}$ ), recall ( $R_{t h r}$ ), and F1‑score ( $F_{t h r}$ ), respectively.

Intuitively, establishment measures to what extent a method can find any of the occurrences of each pattern, i.e., establishing the existence of a pattern. Occurrence (occ) measures to what extent a method can find all occurrences of each pattern it established. Three‑layer (thr) measures the overall performance of a method in a hierarchical manner (from note‑level to occurrence‑level and motif‑level, totaling three layers).

4.4 Implementation details

The MeloID model strictly follows the same CNN‑based model design used in Simonetta et al. (2019), i.e., four convolution layers, 20 kernels per layer, with a kernel size of 32 $\times$ 16 without zero‑padding. The MNID model is also based on the same model architecture, but with two modifications: the kernel number is set to 16, while the kernel size is set to 33 $\times$ 17 with zero‑padding to keep the sizes of all the layers’ outputs the same.

For MidiBERT, we generally follow the fine‑tuning settings seen in Chou et al. (2021), with a few changes: 1) a class weight of 2.0 is applied to the motif notes to address the data‑imbalance issue, 2) the batch size is set to 8, and 3) the best model checkpoint is selected based on the F1‑score for the motif notes on the validation set. All other hyperparameters are retained the same as in MidiBERT’s official implementation.^¹² We use the official pre‑trained model weights for initialization and fine‑tune them for 10 epochs using the AdamW optimizer (Loshchilov and Hutter, 2019), with a learning rate of $2 \times 10^{- 5}$ and a weight decay of $10^{- 2}$ .

For SIATEC, the compression ratio $r$ is set to 2 in our experiments. For SIATEC_CS, we follow the default settings reported in Meredith (2013). For CSA, we mainly follow the hyperparameters in Hsiao et al. (2023). On the BPS‑Motif dataset, the hyperparameters $θ$ , $m$ , $θ_{o}$ , $θ_{p}$ , $γ_{o}$ , $γ_{p}$ , and $δ$ are set to 0.5, 4 notes, 0 crotchet beats, 1 semitone, 2 crotchet beats, $\infty$ semitones, and 12 crotchet beats, respectively. This set of hyperparemeters reproduces the BPS‑Motif results reported in the original paper (i.e., Hsiao et al., 2023) the best. On JKU‑PDD, we adopt another set of hyperparameters for CSA, particularly setting the duration limitation of motifs/patterns ( $δ$ ) to $\infty$ so as to extract longer (e.g., theme‑ or section‑level) patterns, which are prevalent in JKU‑PDD (see footnote 5). The hyperparameters $θ$ , $m$ , $θ_{o}$ , $θ_{p}$ , $γ_{o}$ , $γ_{p}$ , and $δ$ are set to 0.7, 4 notes, 0 crotchet beats, 3 semitones, 2 crotchet beats, $\infty$ semitones, and $\infty$ crotchet beats (i.e., no limitation on the duration of contexts/patterns), respectively. For a short summarization of the meanings of hyper‑parameters, please refer to Table 1.

4.5 Experimental procedure

The experimental procedure for the proposed method is listed as below.

Train MeloID models on the MPS dataset with five‑fold cross‑validation. Use these models to extract the melody for the MPS dataset.
Train the first set of MNID models on the BPS‑Motif dataset with five‑fold cross‑validation.
Apply the first set of MNID models and MeloID models to obtain pseudo‑labels with the proposed melody intersection method for MPS.
Train the second set of MNID models on both BPS‑Motif and MPS (with pseudo‑labels).
Use the second set of MNID models to perform MNID for BPS‑Motif or JKU‑PDD (depending on the dataset to be evaluated). For JKU‑PDD, we use the average prediction of all the five MNID models for inference.
Then, apply RPD methods to the motif notes identified by the MNID models.

During each step, we design the experiment carefully to ensure that the ground‑truth of the test fold is not revealed to any of the models.

5 Experiment Results

5.1 Motif note identification

The left part of Table 2 demonstrates the results of MNID experiments for the BPS‑Motif dataset. We observe that the naive Skyline method achieves decent performance (0.776 accuracy). Among the three methods (i.e., Skyline, CNN, and MidiBERT), without pseudo‑labeling, MidiBERT achieves the best accuracy (0.823), F1‑score (0.706), and precision (0.669). On the other hand, the simple CNN model outperforms Skyline only in accuracy and precision, falling behind MidiBERT in all metrics. This demonstrates the strength of fine‑tuning a pre‑trained model for the downstream MNID task, given a limited amount of training data. Then, comparing the pseudo‑labeling methods, we observe that, for MidiBERT, vanilla pseudo‑labeling improved both the accuracy and F1‑score; however, for CNN, the vanilla pseudo‑labeling method degrades the accuracy and F1‑score by 0.8 and 2.3 percentage points, respectively. These results show that the vanilla pseudo‑labeling method does not necessarily lead to better performance. Given that the quality of pseudo‑labels depends on the performance of the first model (pseudo‑labeler), we suspect that the CNN model does not benefit from the vanilla pseudo‑labeling because the CNN itself is not good enough, with only a 0.652 F1‑score. However, when the proposed MI method is used, the performance of both CNN and MidiBERT is consistently improved. To validate whether such an improvement is statistically significant, we repeat the experiments of ‘MidiBERT with PL’ and ‘MidiBERT with MI’ three times and perform an independent t test. We found $p < 0.02$ for both accuracy and F1‑score, which confirms the positive effect of the proposed MI method.^¹³

Table 2

MNID results on both the BPS‑Motif dataset and the JKU‑PDD. The subscripts denote the standard deviations across five folds. Note that, strictly speaking, the results on JKU‑PDD may not reflect the actual motif note identification results, as the ground‑truths here are not always motif notes. Please refer to Section 4.1 for detailed discussion.

MNID	Setting	BPS‑Motif				JKU‑PDD
MNID	Setting	Accuracy	F1‑score	Precision	Recall	Accuracy	F1‑score	Precision	Recall
Skyline	N/A	0.776 $_{0.029}$	0.671 $_{0.055}$	0.591 $_{0.060}$	0.822 $_{0.036}$	0.869 $_{0.131}$	0.716 $_{0.300}$	0.634 $_{0.328}$	0.999 $_{0.003}$
CNN	✗	0.785 $_{0.021}$	0.652 $_{0.043}$	0.604 $_{0.061}$	0.738 $_{0.057}$	0.849 $_{0.100}$	0.681 $_{0.273}$	0.594 $_{0.294}$	0.960 $_{0.048}$
	PL	0.777 $_{0.019}$	0.629 $_{0.046}$	0.596 $_{0.075}$	0.698 $_{0.088}$	0.860 $_{0.088}$	0.678 $_{0.277}$	0.599 $_{0.303}$	0.920 $_{0.075}$
	MI	0.802 $_{0.028}$	0.660 $_{0.050}$	0.636 $_{0.059}$	0.713 $_{0.077}$	0.873 $_{0.078}$	0.693 $_{0.265}$	0.625 $_{0.313}$	0.928 $_{0.062}$
MidiBERT	✗	0.823 $_{0.028}$	0.706 $_{0.053}$	0.669 $_{0.065}$	0.776 $_{0.071}$	0.828 $_{0.128}$	0.662 $_{0.292}$	0.571 $_{0.316}$	0.981 $_{0.012}$
	PL	0.831 $_{0.026}$	0.716 $_{0.056}$	0.679 $_{0.058}$	0.782 $_{0.066}$	0.825 $_{0.131}$	0.660 $_{0.293}$	0.567 $_{0.317}$	0.982 $_{0.013}$
	MI	0.839 $_{0.029}$	0.721 $_{0.063}$	0.701 $_{0.076}$	0.763 $_{0.061}$	0.866 $_{0.088}$	0.683 $_{0.280}$	0.599 $_{0.318}$	0.944 $_{0.041}$

The right part of Table 2 shows the MNID results of JKU‑PDD. A notable behavior of JKU‑PDD being different from the results of BPS‑Motif is that Skyline outperforms CNN and MidiBERT (with MI) by 2.3 and 3.3 percentage points in F1‑score, respectively. It should be noted that such competitiveness of Skyline is largely because of the annotation bias in JKU‑PDD: as mentioned in Section 4.1, the patterns of JKU‑PDD’s monophonic version are performed with the Skyline algorithm (Collins, 2013). This makes Skyline reach nearly 100% recall and thereon outperforms all the data‑driven models in F1‑score.^¹⁴ Aside from this biased effect, we still observe that, for the pseudo‑labeling methods, MI also consistently outperforms both PL and the ‘no pseudo‑label’ setting, which is the same as in the case of BPS‑Motif.

5.2 RPD runtime

Before discussing motif discovery results, we report the runtime of the RPD algorithms to support our choice of RPD algorithms to use in the downstream motif discovery step (see Section 4.2). Table 3 shows the runtime of three RPD algorithms—namely, SIATEC, SIATEC_CS, and CSA—on the BPS‑Motif dataset using one Intel i9‑3900KF CPU. It takes more than 4,000 min (around three days) to run SIATEC_CS on BPS‑Motif a single time. CSA and SIATEC are 4.2x and 12x faster than SIATEC_CS. Though SIATEC_CS was reported as the state‑of‑the‑art on JKU‑PDD, we have to consider a more effective solution such as SIATEC to discuss BPS‑Motif, considering the need to test eight different MNID settings (as shown in Table 4). Thus, we choose to use SIATEC instead of SIATEC_CS for the experiments on BPS‑Motif. It is also worth noting that, in Table 4, we also indicate that, under the oracle setting (i.e., ideal MNID), SIATEC performs on par with SIATEC_CS in establishment measures.

Table 3

The total RPD runtime on the BPS‑Motif dataset in minutes using one Intel i9‑13900KF CPU.

RPD	Runtime (min)
SIATEC	340.9
SIATEC_CS	4076.9
CSA	978.3

Table 4

Motif discovery results on the BPS‑Motif dataset. PL denotes pseudo‑labeling; MI denotes intersection of the melody line and pseudo‑labels. The subscripts est, occ, and thr indicate the establishment, occurrence, and three‑layer measurements, respectively. The last three rows are the oracle setting that assumes an 100% accuracy of motif note identification.

MNID	Setting	RPD	P $_{est}$	R $_{est}$	F $_{est}$	P $_{occ}$	R $_{occ}$	F $_{occ}$	P $_{thr}$	R $_{thr}$	F $_{thr}$
✗	N/A	SIATEC	0.180	0.644	0.280	0.210	0.277	0.224	0.041	0.300	0.071
Skyline	N/A		0.226	0.660	0.333	0.410	0.275	0.308	0.065	0.335	0.107
CNN	✗		0.240	0.631	0.344	0.403	0.221	0.268	0.073	0.319	0.116
	PL		0.236	0.609	0.336	0.399	0.240	0.281	0.072	0.307	0.112
	MI		0.249	0.643	0.354	0.445	0.251	0.307	0.076	0.331	0.121
MidiBERT	✗		0.257	0.653	0.364	0.446	0.263	0.319	0.081	0.337	0.126
	PL		0.258	0.654	0.366	0.440	0.266	0.320	0.082	0.341	0.127
	MI		0.258	0.656	0.366	0.433	0.260	0.309	0.083	0.342	0.130
✗	N/A	CSA	0.553	0.847	0.663	0.130	0.570	0.204	0.127	0.263	0.169
Skyline	N/A		0.555	0.806	0.652	0.312	0.464	0.354	0.214	0.357	0.264
CNN	✗		0.521	0.761	0.610	0.330	0.410	0.349	0.202	0.334	0.246
	PL		0.516	0.718	0.593	0.327	0.388	0.343	0.198	0.320	0.238
	MI		0.540	0.748	0.619	0.327	0.387	0.340	0.211	0.340	0.255
MidiBERT	✗		0.576	0.811	0.668	0.332	0.439	0.362	0.224	0.363	0.272
	PL		0.580	0.818	0.673	0.336	0.436	0.363	0.228	0.359	0.275
	MI		0.591	0.803	0.676	0.350	0.430	0.372	0.231	0.358	0.276
Oracle	N/A	SIATEC	0.288	0.732	0.410	0.536	0.361	0.418	0.105	0.443	0.165
	N/A	SIATEC_CS	0.327	0.583	0.413	0.491	0.312	0.356	0.185	0.322	0.231
	N/A	CSA	0.734	0.866	0.789	0.443	0.540	0.468	0.353	0.480	0.400

5.3 Motif discovery on BPS‑Motif

Table 4 shows the motif discovery results on the BPS‑Motif dataset. First, comparing the methods with and without MNID, we observe that, no matter which MNID model is used, the motif discovery performances are substantially improved over the baseline without MNID. Take our best MNID model (MidiBERT with MI) as an example. For SIATEC, the F1‑scores are improved by 8.6 percentage points for the establishment measure (0.280 $\to$ 0.366), 8.5 percentage points for the occurrence measure (0.224 $\to$ 0.309), and 5.9 percentage points for the three‑layer measure (0.071 $\to$ 0.130). For CSA, the F1‑score improvements are 1.3 percentage points (0.663 $\to$ 0.676), 16.8 percentage points (0.204 $\to$ 0.372), and 10.7 percentage points (0.169 $\to$ 0.276) for the three metrics, respectively. These F1‑score improvements come from the substantial improvement in precision measures, where $P_{occ}$ and $P_{thr}$ are improved by 22.0 and 10.4 percentage points, respectively. Even our worst MNID model, CNN with PL, shows substantial improvements in occurrence and three‑layer F1‑scores with only a slight reduction in the establishment F1‑score for CSA (but it still improves the establishment F1‑score for SIATEC). To summarize, the improvements achieved with MNID mostly lie in the occurrence and three‑layer measures, while the improvement of the establishment F1‑score is relatively modest. A case wherein MNID benefits the occurrence and three‑layer scores but not the establishment scores will be discussed in the case study (Section 5.6).

Then, by comparing the performances of CNN and MidiBERT, we found that using MidiBERT as the MNID model achieves better F1‑scores in all metrics, no matter which RPD algorithm is used. This is expected, as MidiBERT performs better than CNN in the MNID task. In general, the use of MI slightly improves the F1‑scores, but there are few exceptions where the use of MI slightly degrades the occurrence F1‑score. For example, in comparing MidiBERT with MI with MidiBERT, when SIATEC is used as the RPD algorithm, the occurrence F1‑score degrades by around one percentage point when MI is used. However, when CSA is used as the RPD algorithm, MidiBERT with MI consistently outperforms MidiBERT in all F1‑scores. From the above discussion, we summarize that, by removing notes that are unlikely to be a part of a motif occurrence, clear improvements in precisions and F1‑scores can be achieved.

The bottom rows of Table 4 present oracle cases where the MNID is assumed to be perfect. Using CSA, the resulting F1‑scores are 0.789, 0.468, and 0.400 for the three measures, substantially outperforming our best proposed system (0.676, 0.372, and 0.276). This indicates that there is still room for improvement, and, by improving the MNID performances, the overall motif discovery results could be further improved.

5.4 Motif discovery results on JKU‑PDD

Table 5 shows the evaluation results on JKU‑PDD. Similarly, using an MNID model also improves the overall motif discovery performance (and also improves more on the occurrence and the three‑layer metrics), regardless of which algorithm is used for RPD. Furthermore, by comparing performances of different MNID models, we found that, while MidiBERT generally leads to better motif discovery performance than CNN when CSA is used for RPD, CNN prevails when SIATEC_CS is used for RPD. This trend is particularly clear in terms of occurrence F1‑score: CNN outperforms MidiBERT by 22.5 percentage points when both models are paired with SIATEC_CS (0.501 vs. 0.276); however, MidiBERT outperforms CNN by 7.4 percentage points when both models are paired with CSA (0.414 vs. 0.340). These differences mostly come from the significant performance differences in one of the five compositions; since there are only a few annotated patterns and occurrences in each of the compositions, recalling one occurrence (or not) would lead to a substantial change in the occurrence F1‑score. A similar trend can be found between MidiBERT with PL and MidiBERT with MI, where the use of different RPD algorithms changes the ranking of the occurrence F1‑score of the two models.

Table 5

Motif discovery results on JKU‑PDD. Note that, in this experiment, we use the annotations in the monophonic version as ground‑truth, as they resemble motif annotations better.

MNID	Setting	RPD	P $_{est}$	R $_{est}$	F $_{est}$	P $_{occ}$	R $_{occ}$	F $_{occ}$	P $_{thr}$	R $_{thr}$	F $_{thr}$
✗	N/A	SIATEC_CS	0.210	0.323	0.251	0.000	0.000	0.000	0.220	0.277	0.243
Skyline	N/A		0.415	0.663	0.492	0.380	0.631	0.472	0.403	0.534	0.439
CNN	✗		0.446	0.626	0.511	0.410	0.653	0.501	0.420	0.548	0.460
	PL		0.418	0.670	0.492	0.395	0.665	0.494	0.387	0.536	0.426
	MI		0.433	0.707	0.520	0.502	0.762	0.601	0.403	0.577	0.454
MidiBERT	✗		0.359	0.589	0.431	0.239	0.327	0.276	0.360	0.487	0.402
	PL		0.398	0.625	0.469	0.290	0.467	0.357	0.383	0.492	0.412
	MI		0.387	0.668	0.468	0.396	0.606	0.478	0.372	0.513	0.417
✗	N/A	CSA	0.155	0.404	0.214	0.140	0.355	0.180	0.062	0.269	0.095
Skyline	N/A		0.281	0.567	0.350	0.371	0.489	0.407	0.235	0.527	0.295
CNN	✗		0.261	0.509	0.324	0.295	0.414	0.340	0.222	0.505	0.277
	PL		0.251	0.476	0.306	0.273	0.392	0.310	0.222	0.462	0.271
	MI		0.261	0.504	0.324	0.256	0.388	0.298	0.232	0.511	0.288
MidiBERT	✗		0.245	0.568	0.321	0.383	0.490	0.414	0.200	0.535	0.263
	PL		0.267	0.576	0.340	0.398	0.477	0.418	0.220	0.539	0.277
	MI		0.253	0.496	0.315	0.249	0.412	0.309	0.212	0.480	0.263
Oracle	N/A	SIATEC	0.209	0.508	0.280	0.715	0.684	0.699	0.204	0.527	0.270
	N/A	SIATEC_CS	0.589	0.661	0.620	0.553	0.680	0.606	0.621	0.656	0.637
	N/A	CSA	0.263	0.407	0.310	0.300	0.419	0.337	0.245	0.434	0.295

On the other hand, the results on the BPS‑Motif datasets are relatively stable, where a better MNID model generally leads to better motif discovery performances. From these observations, we need to emphasize the importance of building a larger dataset for evaluating motif discovery performances, which allows for a more stable and comprehensive evaluation. The BPS‑Motif dataset marks a step in this direction, but it sacrifices the diversity of the compositions, as all the music pieces in BPS‑Motif are created by a single composer (Beethoven). A more diverse and larger dataset would be welcomed so that motif discovery systems can be evaluated more comprehensively.

5.5 Discussions on RPD algorithms

The above experiments shows that the RPD algorithms behave differently in the two datasets: CSA outperforms both SIATEC and SIATEC_CS on BPS‑Motif (Table 4), while SIATEC_CS outperforms CSA on JKU‑PDD (Table 5). This phenomenon can be explained by the cross‑dataset bias mentioned in Section 4.1: SIATEC_CS was tested on JKU‑PDD and CSA was tested on BPS‑Motif when they were proposed, respectively. However, it would be more insightful to further examine how the matching mechanism of the algorithm itself affects the performance of motif discovery. While there are multiple factors affecting the behaviors, we focus on the number of distinctive motifs produced by each RPD algorithm in the following discussion.

Table 6 compares the number of distinctive motifs produced by SIATEC, SIATEC_CS, and CSA in the oracle JKU‑PDD and BPS‑Motif datasets. In BPS‑Motif, while the average number of motifs is 8.2, SIATEC produces over $10^{4}$ motifs, most of which are obviously false and redundant outputs. A closer inspection of SIATEC’s output reveals that many of these false outputs consist of notes that are far apart in time (e.g., a pattern with only five notes but spanning more than 20 bars in duration). This issue has also been noted in previous work (Collins, 2011). The compactness conditions of CSA effectively suppress such false outputs. On BPS‑Motif, the average duration of a motif occurrence is only 4.73 crotchet beats for CSA but more than 170 crotchet beats for both SIATEC and SIATEC_CS. In JKU‑PDD, since the duration and total number of notes in a musical piece are shorter than in BPS‑Motif, this issue becomes less severe. This explains why SIATEC_CS achieves a relatively better performance than CSA on JKU‑PDD but falls behind CSA on BPS‑Motif.

Table 6

The average number of distinct motifs (per song) of the ground‑truth and RPD algorithms’ predictions on the BPS‑Motif dataset and the JKU‑PDD. The Oracle MNID is employed.

RPD	BPS‑Motif	JKU‑PDD
SIATEC	13365.5	2799.8
SIATEC_CS	34.9	15.6
CSA	27.3	43.2
Ground‑truth	8.2	4.6

However, the precision values in Table 4 and 5 are not affected that much by such false outputs exceeding the ground truth by several orders. This is because, in motif discovery evaluation, the metrics determine true positives based on the maximum similarity (see Equations [10]–[26] for the definitions of the metrics) or above‑threshold similarity (see Equations [15] and [16] for the occurrence measure) between two patterns, and there is no restriction on how many times each ground‑truth pattern can be matched as a true positive. As a result, redundant patterns may be repeatedly counted as true positives, and many false detections may be simply ignored and not be counted as false positives. This highlights a limitation of the current evaluation metrics in understanding the complex, multi‑factor behaviors of the motif discovery task, and we leave further investigation of this issue as future work.

5.6 Case study

To further present the effectiveness of the proposed motif discovery method, we present two case studies on the BPS‑Motif dataset. Together with the original score, the second to fourth rows of Figure 4 compare 1) the ground truth, 2) the result using CSA as the RPD method, and 3) the result using CSA with MNID (particularly MidiBERT with MI) for the two examples. Figure 4 presents a passage featuring rhythmic accelerations and variations (denoted as d), followed by a tonic sequence not obeying a strict sequence regarding intervallic distances (denoted as k). The last row of Figure 4 shows that the MNID model effectively contrasts the arpeggiated accompaniment with the melody, allowing the RPD algorithm to better capture the sequencing motifs k in mm. 14–18 and the very challenging variations (rhythmic accelerations) d in mm. 8–14; these correspond to the two established motifs, labeled as green B and purple A, respectively. On the other hand, when the MNID model is not applied, as shown in the third row of Figure 4, the RPD fails to isolate the melodic elements, resulting in a number of extraneous fragments in the result. Also, the dark red E (should correspond to d in the ground‑truth) occurs in both the melody and accompaniment parts, which implies that the accompaniment notes interfere with the RPD process and render the established motifs less accurate than the case using MNID. Here, our model demonstrates the benefits of integrating the MNID model with RPD analysis to navigate complex classical compositions.

Illustration of motif discovery results for Beethoven’s *Piano Sonata No. 10 in G Major*, 1st movement, mm. 8–18. From top to bottom: original score, piano roll with ground truth motif annotation, piano roll with the results using the common structure algorithm, and piano roll with results using MNID (MidiBERT with MI) plus the common structure algorithm. For the piano roll representation, a note is represented as an onset (circle dot) and a duration (horizontal line); the bar line is represented as the vertical blue dotted line. Notes with gray color are non‑motif notes. Motif notes are in the colors other than gray, and the notes belonging to the same motif are represented as the same color. The note group of each occurrence is bounded by a gray translucent box. The motif name is marked in the beginning of each occurrence. For the ground truth, the motif names are lowercase alphabets following the dataset annotation. For the motif discovery results, the motif names are uppercase and assigned in alphabetical order according to the first occurrence time.

Figure 4 also serves as an example showing how MNID improves more in response to the occurrence and three‑layer measures than to the establishment measure. Following the definition of the establishment measure, to evaluate whether a pattern is established, the established similarity between two motifs is the maximum similarity between any two occurrences from the two motifs. Therefore, in the result of CSA without MNID, as long as one of the occurrences of E matches one of the occurrences of the ground‑truth d, d is successfully established, despite the numerous false‑positive occurrences of E in the motif discovery result. On the other hand, these false positives are counted in occurrence and three‑layer measures and therefore lead to low precision and F1‑score values.

Then, Figure 5 demonstrates a more challenging case wherein the left hand predominantly carries the motifs and the motif notes cannot be extracted from the skyline algorithm, as their pitch values are not the highest among all concurrently played notes. Comparing the third and fourth rows in Figure 5, we again observe that the data‑driven MNID method plays the key role in deleting the large numbers of irrelevant patterns while retaining the motif notes in the left hand part, and thereby simplify the RPD results. Our model effectively distinguishes between the chordal accompaniment and dynamically evolving motifs, especially motif d, characterized by its syncopation and variable interval patterns. However, in this challenging case, the MNID model still fails to identify the motif e, which contains rapid notes and inexact repetition. Besides, the motif d is also identified as several different established motifs (A to E) in the result. To summarize, we qualitatively demonstrate that the pseudo‑labeling with melody intersection provides some form of regularization to the model, by making the motif notes in the pseudo‑label present only in the melody notes.

Illustration of motif discovery results for Beethoven’s *Piano Sonata No. 16 in G Major*, 1st movement, mm. 74–81. See the caption of Figure 4 for the notation in detail.

6 Conclusions

In this paper, we propose a two‑stage framework consisting of data‑driven motif note identification and rule‑based RPD techniques to better solve the motif‑discovery task. By leveraging the power of the pre‑trained MidiBERT and pseudo‑labeling, a benchmark motif note identification accuracy of 0.839 on the BPS‑Motif dataset using a limited amount of training data is realized. As a result, the motif note identification model substantially improved the performance of repeated pattern discovery on two existing datasets by effectively removing the musically unimportant notes in the input data.

A major limitation of this work is that, even if we remove the non‑motif notes, we still cannot fully prevent the redundant and highly overlapped pattern detections within the motif note set in the repeated pattern discovery stage, as can be seen in Figure 5. To solve this issue, we propose two possible directions as our future work: 1) devise a data‑driven post‑processing model to assess musical importance at pattern level and 2) perform fine‑grained music structure analysis down to the motif level. Both directions, however, need to rely on the advance of pre‑trained models and the extension of fine‑grained data annotation for symbolic music analysis. Specifically, an advanced pre‑trained model can better serve as the backbone model for various downstream tasks such as MNID with few training data, which is suitable for motif discovery and could play an important role in future work.

Acknowledgments

This work is supported in part by National Science and Technology Council under Grant NSTC 113‑2221‑E‑001‑013, the Academia Sinica Grand Challenge (GCS) Program under Grant AS–GCS–112–M07, and the Postdoctoral Scholar Program of Academia Sinica under Grant AS‑PD‑1141‑M15‑2. The first and the second authors were affiliated with the Institute of Information Science, Academia Sinica, Taiwan, during the period of this work. The authors would like to thank Tsung‑Ping Chen for providing the source code and the hyperparameter settings of CSA.

Competing Interests

The authors have no competing interests to declare.

Notes

[1] Motif discovery, repeated pattern discovery, and the MIREX task called ‘Discovery of Repeated Themes and Sections’ are highly relevant to each other and sometimes considered equivalent in the literature. In this paper, motif theme discovery refers to the task when the musical importance of the discovered patterns is emphasized, while repeated pattern discovery refers to a more general scenario in which musical importance is partly or not considered.

[2] As defined in https://doi.org/10.1093/gmo/9781561592630.article.27789, theme had a certain completeness, a roundedness, which distinguishes it from the shorter and more elemental motif.

[3] An example is the cadenza part, whose melody line is typically not a motif, as these notes are usually not repeated within the musical piece.

[4] Note that this assumption is not always correct, as there is no strict definition stating that all motif notes should also be melody notes. However, considering that motif is musically important, it is reasonable to assume that most motif notes are present in the melody line which ‘carries the most significant meaning’ (Simonetta et al., 2019).

[5] In this work, we used the implementation provided in https://github.com/wsgan001/repeated_pattern_discovery/.

[6] Available at https://github.com/Wiilly07/Beethoven_motif.

[7] The original URL of this dataset is not available anymore, but an exact copy is available at https://github.com/ns2max/jkupdd_dataset.

[8] It should be noted that previous MIREX experiments were performed on either ‘monophonic input data and monophonic output labels’ or ‘polyphonic input data and polyphonic labels’ (Collins, 2013). This is different from our goal of motif discovery on polyphonic data.

[9] Using monophonic labels still cannot fit our scenario when there are multiple voices (i.e., polyphony texture). However, we have to do so, as this is the most practical way to make use of the annotation provided in JKU‑PDD.

[10] The dataset is available at https://limunimi.github.io/Symbolic-Melody-Identification/.

[11] We did not consider the case only using the results of MeloID as pseudo‑labels. This is because our pilot study showed that even using melody ground‑truth as pseudo‑labels resulted in no improvement if there are no MNID‑derived pseudo‑labels.

[12] https://github.com/wazenmai/MIDI-BERT.

[13] We also found that the 95% confidence interval of the accuracy and F1‑score are all within 0.6% for both settings, showing that randomness is not a key factor here.

[14] This does not mean Skyline can achieve a 100% accuracy, as not all the notes in Skyline are annotated as motif notes.

Additional Files

The additional files for this article can be found as follows:

Supplementary Appendix A

Details of the SIATEC algorithm. DOI: https://doi.org/10.5334/tismir.250.s1.

Supplementary Appendix B

Details of the CSA. DOI: https://doi.org/10.5334/tismir.250.s2.