
A Lightweight Two‑Branch Architecture for Multi‑Instrument Transcription via Note‑Level Contrastive Clustering

By: Ruigang Li and Yongxu Zhu
Open Access | Apr 2026

1 Introduction

Automatic music transcription (AMT), which converts audio signals into symbolic musical notation, represents a fundamental challenge in music information retrieval (MIR) (Benetos et al., 2019). While current systems achieve remarkable accuracy in transcribing polyphonic music for specific instrument timbres (Hawthorne et al., 2018; Riley et al., 2024; Tamer et al., 2023a), we investigate a more practical scenario: processing audio mixtures containing K instrument classes and outputting K tracks of musical notes, each corresponding to a distinct timbre. We term this task ‘timbre‑separated transcription,’ which differs from conventional source separation by operating at the symbolic level to extract note representations rather than reconstructing audio waveforms. In this paper, we study the musical sounds produced by instruments. For study purposes, the term ‘timbre’ refers specifically to instrument categories (e.g., violin, viola, and flute), distinguished from individual physical sources.

Current solutions often use unified classification‑based models that treat timbre identification as a categorization task (Gardner et al., 2022; Wu et al., 2020), which limits their flexibility. They demand extensive training data, fix the maximum number of separable sources, and fail to generalize to unseen timbres, essentially functioning as a ‘timbre dictionary.’ Moreover, these models are typically large and computationally demanding, making them inaccessible to general users despite advances in transcription accuracy.

To address these issues, we propose a lightweight two‑branch architecture that decouples pitch/onset estimation from timbre representation learning and employs deep clustering (Hershey et al., 2016) to achieve timbre‑separated transcription. Our key advance over prior deep clustering transcription (Tanaka et al., 2020) is note‑level clustering (vs. frame‑level), tailored for symbolic output. The first branch performs timbre‑agnostic transcription, predicting frame‑level activations with a compact, fully convolutional network. The second branch learns direction‑aware timbre embeddings, which are clustered at the note level to enable dynamic instrument separation. Furthermore, we discuss the impact of the dataset and possible improvements. All code is open‑sourced, with the model already deployed in a web‑based assistive transcription tool1.

2 Background and Related Work

2.1 AMT fundamentals

The majority of existing AMT approaches employ deep neural networks to transform audio inputs into piano roll–like representations: two-dimensional matrices indexed by time frame and note pitch. The process generally comprises 1) time–frequency feature extraction, 2) frame-level probability estimation with neural networks, and 3) note generation through binarization or specialized output networks.

Stage 1 is typically implemented with classical signal‑processing transforms. The short‑time Fourier transform (STFT) yields linear‑spaced frequency bins, whereas mel‑scale transform and constant‑Q transform (CQT) provide logarithmic spacing that better matches musical pitch perception. However, these interpretable transforms may not always be optimal, leading to explorations of learnable encodings (Luo and Mesgarani, 2018; Ravanelli and Bengio, 2018; Zeghidour et al., 2021) and hybrid methods (Rouard et al., 2023; Su and Yang, 2015). The CQT is readily enriched into a harmonic CQT (HCQT) (Bittner et al., 2017) by vertically shifting the spectrogram according to the bin offsets of successive harmonics and concatenating the shifted copies along an additional axis (Balhar and Hajič, 2019). This prior‑driven expansion allows small convolutional kernels to focus efficiently on musically relevant frequency components.
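The harmonic stacking that turns a CQT into an HCQT can be sketched in a few lines. The harmonic set and bin resolution below are illustrative choices, not the exact configuration of any cited system:

```python
import numpy as np

def harmonic_stacking(cqt, harmonics=(1, 2, 3, 4, 5), bins_per_semitone=3):
    """Stack vertically shifted copies of a CQT magnitude matrix (F, T) so
    that the h-th harmonic of each pitch is moved onto the fundamental's
    frequency row, yielding an (H, F, T) HCQT-style tensor."""
    n_bins, _ = cqt.shape
    bins_per_octave = 12 * bins_per_semitone
    layers = []
    for h in harmonics:
        # bin offset of the h-th harmonic above the fundamental
        shift = int(round(bins_per_octave * np.log2(h)))
        layer = np.zeros_like(cqt)
        layer[: n_bins - shift, :] = cqt[shift:, :]  # shift down; zero-pad top
        layers.append(layer)
    return np.stack(layers, axis=0)
```

With this layout, a small convolutional kernel sliding over the frequency axis sees the fundamental and its first few harmonics in the same receptive field.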

Stage 2 produces a two‑dimensional posteriogram that encodes the probability of pitch activation at each time–frequency bin. Convolutional layers leverage the spectro‑temporal grid to capture local structures (Bittner et al., 2022; Wu et al., 2019), while temporal‑sequence models extend the receptive field along time, modeling long‑range dependencies (Hawthorne et al., 2018). To translate frame‑wise posteriors into discrete notes, in stage 3, most systems rely on onset detection (i.e., the initial frame of each note) and then link successive active frames within an onset‑defined segment to form complete note events (Bittner et al., 2022; Hawthorne et al., 2018; Wu et al., 2020, 2024).
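The onset-then-link decoding of stage 3 can be illustrated with a deliberately simplified sketch; real systems add smoothing, minimum-duration rules, and offset modeling, and the thresholds here are hypothetical:

```python
def frames_to_notes(frames, onsets, frame_thr=0.5, onset_thr=0.5):
    """Stage-3 sketch: start a note at each detected onset and extend it
    while the frame posterior stays active. `frames` and `onsets` are
    per-pitch sequences of activation probabilities (rows = pitches)."""
    notes = []
    for n, (f_row, o_row) in enumerate(zip(frames, onsets)):
        t, T = 0, len(f_row)
        while t < T:
            if o_row[t] >= onset_thr:
                end = t + 1
                while end < T and f_row[end] >= frame_thr:
                    end += 1
                notes.append((n, t, end))  # (pitch, onset frame, offset frame)
                t = end
            else:
                t += 1
    return notes
```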

2.2 Timbre‑Separated transcription

Multi‑instrument timbre‑separated transcription goes beyond traditional AMT by extracting instrument‑specific notes from polyphonic mixtures. Tamer et al. (2023b) demonstrated that simply cascading source separation followed by transcription leads to suboptimal results due to error propagation. Therefore, it is necessary to combine transcription and separation in a mutually reinforcing manner.

Innovative methods have emerged for this challenge. Gardner et al. (2022) introduced a sequence‑to‑sequence AMT model that outputs multiple instrument tracks with explicit instrument assignments, achieving high accuracy but relying heavily on diverse multi‑timbre data. Multitask learning frameworks like Timbre‑Trap (Cwitkowitz et al., 2024) and Cerberus (Manilow et al., 2020) show that joint optimization can lead to synergistic effects. Wu et al. (2020) proposed a self‑attention–based instance segmentation approach that performs joint note detection and instrument classification, achieving timbre‑separated transcription for a closed set of trained instruments.

For handling unseen timbres, researchers have explored different strategies. The zero‑shot learning method proposed by Lin et al. (2021) employs a query‑based mechanism: under contrastive supervision, it learns to encode a clean reference example into a timbre embedding that generalizes to unseen instruments. This embedding is then used to sequentially modulate the U‑Net layers, enabling zero‑shot separation and transcription of audio sources with novel timbres not encountered during training. The work by Tanaka et al. (2020) pioneered the application of deep clustering networks to timbre‑separated transcription. Their approach develops a generalizable timbre encoder and enables separation with a non‑fixed number of classes through clustering, showing how these tasks can complement each other when properly combined.

2.3 Deep clustering methodology

Deep clustering serves as a fundamental approach addressing two core challenges in source separation: the permutation problem and speaker‑independent separation. The methodology operates by encoding each time‑frequency bin into a high‑dimensional feature, then assigning labels through clustering algorithms to generate separation masks.

The pioneering work of Hershey et al. (2016) first applied deep clustering to separate unknown speakers. This was later extended by Luo et al. (2018), who introduced learnable attractors as speaker‑specific reference points in an embedding space that cluster time–frequency bins, enabling end‑to‑end separation of a variable number of sources. The adaptation to music transcription was first achieved by Tanaka et al. (2020), but their method uses transcription mainly to aid audio separation, lacks dedicated note‑creation postprocessing, and omits key implementation details. More fundamentally, these studies share two critical limitations: clustering at the frame level produces fragmented results, and the hard‑assignment strategy cannot handle overlapping situations. In contrast, our clustering postprocess is tailored for music transcription, aggregating frame embeddings into coherent note events with minimal computational overhead.

3 Proposed Method

3.1 Problem configuration

Let $X=\sum_{i=1}^{K} x_i \in \mathbb{R}^{l}$ be the mixed audio signal of $K$ instrument classes (timbres), where $l = 22.05\,\mathrm{kHz} \times \mathrm{Time\,[s]}$ is the signal length and $x_i$ is the single-timbre audio corresponding to the frame activation $y_F^i \in [0,1]^{N \times T}$ and onset activation $y_O^i \in [0,1]^{N \times T}$. Since we target the pitch range C1–B7, we set $N = 7 \times 12 = 84$. The goal of timbre-agnostic transcription is to map $X$ to $\sum_{i=1}^{K} y_F^i$ and $\sum_{i=1}^{K} y_O^i$, while the goal of timbre-separated transcription is to obtain $\{y_F^i\}_{i=1}^{K}$. First, we apply the CQT to obtain $Q \in \mathbb{C}^{F \times T}$, where $T$ is the number of frames and $F$ is the number of analyzed frequency bins. We extend the CQT analysis to eight octaves with three bins per semitone, ensuring each note has at least two harmonics and yielding $F = 8 \times 12 \times 3 = 288$ bins. Our proposed model outputs the timbre-agnostic transcription results $Y_F = \sum_{i=1}^{K} \hat{y}_F^i$ and $Y_O = \sum_{i=1}^{K} \hat{y}_O^i$, together with the timbre-sensitive representation $V \in \mathbb{R}^{D \times N \times T}$. Finally, $V$ is clustered to assign labels to $Y_F$, separating $\{\hat{y}_F^1, \hat{y}_F^2, \ldots, \hat{y}_F^K\}$ from $Y_F$.

3.2 Overall architecture

As shown in Figure 1, our timbre‑separated transcription model comprises two parallel branches: timbre‑agnostic transcription and timbre encoding, both taking the HCQT as input. We also design a postprocessing pipeline tailored to this task.

Figure 1

Overview of the proposed method. Top: Overall pipeline, which takes a multi‑timbral mixture audio as input and outputs note events for each constituent timbre. Bottom left: AMT branch producing timbre‑agnostic transcription outputs—frame activation posteriorgram YF and onset activation posteriorgram YO. Bottom right: Timbre‑encoding branch yielding a D‑dimensional timbre embedding V for each time–frequency bin, where N=84 denotes the target pitch range.

The timbre‑agnostic transcription branch builds upon BasicPitch (Bittner et al., 2022), a lightweight fully convolutional network that demonstrates strong performance using only local contextual information. We identify several limitations and introduce targeted optimizations, yielding a more compact architecture with comparable accuracy.

The timbre‑encoding branch adopts a similar fully convolutional design. To incorporate global context, we employ InstanceNorm and GroupNorm for adaptive normalization across the entire input. While convolutional architectures are more deployment‑friendly on low‑resource devices than recurrent neural network (RNN)‑based approaches commonly used in deep clustering (Hershey et al., 2016; Tanaka et al., 2020), they may lack long‑range modeling capacity; alternative designs are discussed in the experiments.

3.3 EnergyNorm for spectral normalization

Conventional two‑dimensional normalization schemes often involve subtractive operations and compute statistics directly over the time–frequency plane. These practices alter the relative amplitude relationships among harmonic components within the same analysis frame, thereby undermining the acoustic basis of timbre perception.

We, therefore, introduce an interpretable normalization approach based on frame‑wise energy. Leveraging Parseval’s theorem and assuming a Gaussian time‑domain signal, we standardize the sample variance of frame energy to unity, yielding a chi‑squared–distributed amplitude envelope. Given a CQT spectrum QCF×T, the process is described as:

1
$E_t=\sum_{f=1}^{F}\lvert q_{f,t}\rvert^{2}\in\mathbb{R}^{T},$
2
$\sigma=\sqrt{\operatorname{Var}(E_t)}\in\mathbb{R},$
3
$E_{\mathrm{norm}}=\frac{\lvert Q\rvert^{2}}{\sigma}\in\mathbb{R}^{F\times T}.$

This approach reduces the normalization complexity from two dimensions to one, significantly lowering computational costs. Furthermore, to address the limitations of strictly nonnegative features in neural network learning, we apply a logarithmic transformation followed by an affine transformation to enhance representational capacity. The final normalized representation is defined as:

4
$\tilde{Q}=k\left(\log\lvert Q\rvert^{2}-\log\sigma\right)+b,$

where k and b are learnable scalars.
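A minimal sketch of EnergyNorm following Eqns. (1)–(4), with k and b fixed to identity values for illustration (in the model they are learnable) and a small epsilon added for numerical stability:

```python
import numpy as np

def energy_norm(Q, k=1.0, b=0.0, eps=1e-8):
    """EnergyNorm sketch for a CQT matrix Q of shape (F, T): standardize
    the sample variance of frame energy to unity, then apply a log and an
    affine transform, per Eqns. (1)-(4)."""
    power = np.abs(Q) ** 2
    E = power.sum(axis=0)              # frame energies E_t, shape (T,)
    sigma = np.sqrt(E.var() + eps)     # scale so Var(E_t / sigma) ~= 1
    return k * (np.log(power + eps) - np.log(sigma)) + b
```

Because the statistic is a single scalar per clip, the normalization preserves the relative amplitudes of harmonics within each frame, unlike subtractive two-dimensional schemes.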

3.4 Dilated convolution for harmonic context

In BasicPitch, a large convolutional kernel spanning 39 frequency bins (one octave plus one semitone) is explicitly credited with mitigating octave errors. However, the reference implementation pads both ends, so the centered kernel effectively reaches only about half an octave above and below each bin, rather than the intended full octave in either direction. Given the explicit spatial correspondence between input and output representations, this prevents the model from directly attending to information at harmonic intervals. We adopt a more robust strategy, extending the receptive field to a full octave above and below (two octaves in total) using a dilated convolution with a frequency-axis kernel size of 25 and a dilation factor of 3, spanning (25 − 1) × 3 + 1 = 73 bins. This configuration ensures comprehensive harmonic coverage while simultaneously reducing the parameter count.
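A quick arithmetic check of the two kernel geometries (a sketch; 3 bins per semitone as in our CQT):

```python
def receptive_field(kernel_size, dilation=1):
    """Span, in frequency bins, covered by a (possibly dilated) 1-D kernel."""
    return (kernel_size - 1) * dilation + 1

# BasicPitch-style dense kernel: 39 taps covering 39 bins (13 semitones)
dense_span = receptive_field(39)
# Our dilated kernel: 25 taps, dilation 3 -> 73 bins (two octaves in total)
dilated_span = receptive_field(25, dilation=3)
```

The dilated variant covers nearly twice the frequency span with fewer taps per output channel.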

3.5 Focal loss for class imbalance

BasicPitch employs a class‑balanced cross‑entropy loss to address the sparsity of onsets, with negative and positive class weights set at 0.05 and 0.95, respectively. However, our experiments revealed that this weighting scheme exacerbates class imbalance. Figure 2 presents our controlled experimental results, demonstrating that assigning a larger weight to the positive class of onsets leads to severe false positives, indicating that sparse classes should be assigned smaller weights.

Figure 2

Training outcomes using Focal Loss with varying positive class weights. Each column represents a training session from initialization to convergence.

We directly use Focal Loss (Lin et al., 2017), given in Eqn. (5), with γ=1, setting the positive class weights for notes and onsets to 0.2 and 0.06, respectively.

5
$\mathrm{FL}(p_{t})=-\alpha_{t}\,(1-p_{t})^{\gamma}\log(p_{t}),\qquad p_{t}=\begin{cases}p & \text{if } y=1\\ 1-p & \text{if } y=0\end{cases},\qquad \alpha_{t}=\begin{cases}\alpha & \text{if } y=1\\ 1-\alpha & \text{if } y=0.\end{cases}$
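A plain NumPy sketch of Eqn. (5); `alpha` is the positive-class weight (0.2 for frames, 0.06 for onsets in our setting), and the epsilon guards the logarithm:

```python
import numpy as np

def focal_loss(p, y, alpha, gamma=1.0, eps=1e-8):
    """Focal loss sketch: p = predicted activation probabilities,
    y = binary targets, alpha = positive-class weight."""
    p_t = np.where(y == 1, p, 1.0 - p)
    a_t = np.where(y == 1, alpha, 1.0 - alpha)
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)))
```

The `(1 - p_t)**gamma` factor down-weights already-confident predictions, so the sparse positive class can still dominate the gradient without inflating its weight.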

3.6 InfoNCE for contrastive cluster formation

Hershey et al. (2016) and Tanaka et al. (2020) use the following deep clustering objective:

6
$\mathcal{L}_{\mathrm{affinity}}=\left\lVert UU^{\mathsf{T}}-ZZ^{\mathsf{T}}\right\rVert_{F}^{2}=\left\lVert U^{\mathsf{T}}U\right\rVert_{F}^{2}-2\left\lVert U^{\mathsf{T}}Z\right\rVert_{F}^{2}+\left\lVert Z^{\mathsf{T}}Z\right\rVert_{F}^{2},$

where $Z \in \{0,1\}^{M \times K}$ denotes the one-hot class labels for active bins ($K$: number of classes; $M$: number of bins with ground-truth activation) and $U^{\mathsf{T}} \in \mathbb{R}^{D \times M}$ contains the timbre encodings at the corresponding locations, extracted from $V \in \mathbb{R}^{D \times N \times T}$. The $m$-th column $u_m \in \mathbb{R}^{D}$ of $U^{\mathsf{T}}$ represents the timbre encoding of the $m$-th active time–frequency bin. Defining $\hat{Z} = ZZ^{\mathsf{T}} \in \mathbb{R}^{M \times M}$ yields:

7
$\hat{z}_{i,j}=\begin{cases}1 & \text{if } u_{i} \text{ and } u_{j} \text{ should share a class}\\ 0 & \text{otherwise.}\end{cases}$

This loss function benefits from low spatial complexity and high computational efficiency. However, it is at least three orders of magnitude larger than the AMT loss (frame and onset transcription with Focal Loss) and converges slowly, which makes joint optimization more difficult when a backbone is shared. Furthermore, this objective enforces orthogonality between encodings of different classes. We argue that strict orthogonality is an overly stringent constraint; for clustering objectives, antiparallel alignment yields larger interclass distances than orthogonality. Consequently, we adopt the contrastive InfoNCE loss (Oord et al., 2018) to facilitate cluster formation:

8
$\mathcal{L}_{\mathrm{InfoNCE}}=-\sum_{n=1}^{M}\log\frac{\sum_{\substack{m=1,\,m\neq n\\ y_{n}=y_{m}}}^{M}\exp\!\left(\frac{u_{n}\cdot u_{m}}{\tau}\right)}{\sum_{m=1,\,m\neq n}^{M}\exp\!\left(\frac{u_{n}\cdot u_{m}}{\tau}\right)},$

where ym denotes the ground‑truth class label of um. Empirically setting the temperature parameter τ=0.15 aligns the magnitude of the clustering loss with the AMT loss and synchronizes their convergence rates. Crucially, in scenarios involving a single timbre, the InfoNCE loss naturally vanishes. In contrast, the MSE‑based loss continues to force feature vectors toward specific directions even in the absence of contrasting classes, causing the mean output to deviate significantly from the origin and resulting in potential instability.
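A reference NumPy sketch of Eqn. (8) over unit-normalized embeddings; note that with a single class the positive and full sums coincide, so the loss vanishes as discussed:

```python
import numpy as np

def info_nce(U, labels, tau=0.15):
    """Supervised InfoNCE sketch: U is an (M, D) matrix of embeddings,
    labels an (M,) array of class ids, tau the temperature."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    sim = np.exp(U @ U.T / tau)            # exp(u_n . u_m / tau)
    np.fill_diagonal(sim, 0.0)             # exclude m == n
    same = labels[:, None] == labels[None, :]
    pos = (sim * same).sum(axis=1)         # same-class terms
    total = sim.sum(axis=1)                # all other terms
    return float(-np.log(pos / total + 1e-12).mean())
```

Unlike the affinity objective, the loss is minimized by pulling same-class embeddings together and pushing different classes apart, with antiparallel (not merely orthogonal) directions allowed.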

3.7 Postprocessing for timbre‑separated transcription

Existing deep clustering–based source‑separation methods are predominantly designed for spectrogram reconstruction. In the field of music transcription, however, there is a lack of discussion on methods for creating specific notes from clustering results.

Tanaka et al. (2020) proposed a frame‑level transcription method that predicts a binary mask by assigning an additional cluster to ‘silence bins’ and performing K‑means clustering over the entire time–frequency space. However, this approach suffers from several critical limitations:

  • The absence of onset information hinders accurate note creation from masks.

  • Frame‑level separation is prone to fragmenting notes into scattered pieces.

  • The large number of bins makes clustering algorithms very slow.

  • Most clustering algorithms require specifying the number of clusters, and the hard‑assignment strategy cannot handle overlapping situations, where instruments play the same pitch simultaneously.

Benefiting from the dual-branch architecture, the separation process can be conducted at the note level. We first obtain notes using BasicPitch’s note-decoding method, then take a weighted sum of the embeddings of each frame within a note to obtain its timbre encoding, which is subsequently clustered via spectral clustering. This aggregates the time–frequency bins of each note into a single sample, greatly reducing the number of samples to cluster and mitigating note fragmentation. For clustering, we construct an affinity matrix by exponentiating cosine similarities between note embeddings, matching the formulation of our clustering loss (Eqn. 8). Figure 3 contrasts frame-level and note-level postprocessing: the latter reconstructs the notes almost perfectly, whereas the former yields fragmented notes.
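The note-level aggregation and affinity construction can be sketched as follows. The note tuple format and per-frame weights are a hypothetical interface for illustration; the resulting affinity matrix can be fed to an off-the-shelf spectral clustering implementation (e.g., scikit-learn’s `SpectralClustering` with `affinity='precomputed'`):

```python
import numpy as np

def note_embeddings(V, notes):
    """Aggregate frame-level timbre embeddings into note-level embeddings.
    V: (D, N, T) embedding volume; notes: list of
    (pitch, onset_frame, offset_frame, frame_weights) tuples, where
    frame_weights (e.g. frame activations) weight each frame's embedding."""
    out = []
    for pitch, on, off, w in notes:
        frames = V[:, pitch, on:off]           # (D, note length)
        out.append(frames @ w / (w.sum() + 1e-8))  # weighted mean over frames
    return np.stack(out)                       # (n_notes, D)

def affinity(E, tau=0.15):
    """Affinity matrix from exponentiated cosine similarities between
    note embeddings, mirroring the clustering loss formulation."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return np.exp(E @ E.T / tau)
```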

Figure 3

Results of frame‑level and note‑level postprocessing for triple separation.

The last issue remains unresolved, as our attempted iterative matching‑filter approach suffers from category merging and is limited to frame‑level processing; thus, we do not discuss it further.

4 Evaluation

4.1 Experimental setup

4.1.1 Datasets

Table 1 summarizes the datasets used in our experiments. We train primarily on MusicNet (Thickstun et al., 2017) but replace its DTW‑aligned annotations, which are known to be inaccurate, with the refined labels from MusicNetEM (Maman and Bermano, 2022). All audio is resampled to 22,050 Hz and split into non‑overlapping 900‑frame (i.e., 10.4‑s) clips (frame size: 256 samples, matching the CQT hop length). Tracks sharing the same instrument are mixed to form a single timbral class per piece.

Table 1

Datasets used in experiments. ‘Our:AMT’ and ‘Our:Sep’ are our synthetic datasets created to examine whether human composition and real recordings are indispensable. The latter’s timbres are grouped into 10 classes; when training, cross‑class mixtures yield single‑sample multi‑timbre pieces. Bold datasets are used for training.

Dataset   | Dur.   | Songs | Instr. | K/Song
MusicNet  | 34 h   | 330   | 11     | 1–8
BACH10    | 334 s  | 10    | 4      | 4
PHENICX   | 637 s  | 4     | 10     | 8–10
URMP      | 1.3 h  | 44    | 14     | 2–4
Our:AMT   | 24 h   | 8316  | 33     | 1
Our:Sep   | 836 s  | 120   | 34     | 1

For evaluation, we use three real‑world polyphonic datasets with solo tracks: BACH10 (Duan et al., 2010) (fixed quartet of violin, bassoon, clarinet, and saxophone), URMP (Li et al., 2019), and PHENICX (Miron et al., 2016), with the latter featuring complex orchestral mixtures with an average of 9.5 instrument classes per piece.

Motivated by Slakh2100 (Manilow et al., 2019), a large-scale, fully synthesized dataset, we question the necessity of human-composed, real-recorded data. We therefore generate synthetic audio via a simple algorithm that randomly places notes across our 84-note pitch range and, with a set probability, adds harmonically overlapping chord tones. MIDI sequences are rendered with FluidSynth, with randomized dynamics, pitch tuning, and articulation to improve realism. Each sample uses a single timbre; an example appears in Figure 4.

Figure 4

Randomly generated piano roll (left) and the corresponding CQT spectrogram of the synthesized audio using a trumpet timbre (right).

For the timbre-agnostic model, we synthesize 252 clips of 900 frames for each of the 33 General MIDI programs, yielding 24.1 hours of training data. For timbre-separated transcription, we define 10 categories of acoustically similar instruments, each contributing 12 clips of 600 frames, totaling 13.9 minutes of base material. During training, clips are dynamically mixed by blind addition across categories with random gain scaling; all possible combinations are exhaustively enumerated across epochs, greatly increasing data diversity.

4.1.2 Evaluation metrics

We evaluate transcription performance at frame and note levels using mir_eval (Raffel et al., 2014). Since BasicPitch’s note creation requires thresholds for frames and onsets, we employ a coarse-to-fine search strategy to identify the thresholds that maximize model performance, assuming the metric is concave in the threshold. The search converges to 10⁻⁵ precision: we first optimize the frame threshold by maximizing frame-wise F1 (FF), then fix it while tuning the onset threshold to maximize note-level F1 (FN). A note is counted as correct if its pitch matches the ground truth and its onset lies within ±50 ms of the reference.
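The coarse-to-fine threshold search can be sketched as iterative grid refinement under the unimodality assumption; the grid size and starting interval here are illustrative:

```python
def coarse_to_fine_search(metric, lo=0.0, hi=1.0, tol=1e-5, n=11):
    """Find the threshold maximizing `metric`, assumed unimodal on [lo, hi].
    Each pass evaluates an n-point grid, then shrinks the bracket to the
    two neighbors of the grid maximizer, converging to `tol` precision."""
    while hi - lo > tol:
        grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
        best = max(range(n), key=lambda i: metric(grid[i]))
        lo, hi = grid[max(best - 1, 0)], grid[min(best + 1, n - 1)]
    return (lo + hi) / 2
```

Each pass shrinks the bracket by a factor of (n − 1)/2, so only a few dozen metric evaluations are needed.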

For timbre‑separated transcription, we extract note events following Section 3.7 using the optimized thresholds. The number of clusters K is set manually. To resolve the label permutation problem, we align estimated and reference piano‑roll matrices by minimizing MSE over all track permutations. The separated note‑level F1 score (FS) is then computed as the average across tracks. Furthermore, to evaluate the effectiveness of our postprocessing, we also compute the note‑level F1 score (FFS) of notes generated from frame‑level clustering results.
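The permutation alignment by exhaustive MSE minimization, feasible for the small K considered here, can be sketched as:

```python
import numpy as np
from itertools import permutations

def align_tracks(est, ref):
    """Resolve the label permutation: return the ordering p of estimated
    piano-roll tracks (K, N, T) minimizing total MSE against the
    references, so that est[p[i]] is scored against ref[i]."""
    K = len(est)
    best = min(permutations(range(K)),
               key=lambda p: sum(((est[p[i]] - ref[i]) ** 2).mean()
                                 for i in range(K)))
    return list(best)
```

Exhaustive search is O(K!), which is acceptable for the 2–4 tracks per mixture evaluated here.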

Our core metrics are FF, FN, and FS. To isolate the contribution of timbre encoding, we report the ratio FS/FN. All results are averaged over at least three independent training runs.

4.2 Implementation details

4.2.1 Parameter setting

We use the parameter-efficient downsampling CQT architecture of Schörkhuber and Klapuri (2010) with a hop size of 256 samples (≈11.6 ms at 22,050 Hz). The timbre-embedding dimension is set to D=12. Training and validation data follow a 10:1 ratio. We optimize with AdamW at an initial learning rate of 3×10⁻⁴. Training lasts 60 epochs with a batch size of 18, and the model is selected by minimum validation loss. For timbre-separated transcription on our synthetic dataset, we generate mixtures of two or three timbres per sample and add mild white noise. The final model comprises the CQT (19,944 parameters), the AMT branch (18,978), and the timbre encoder (25,983).

4.2.2 Timbre‑Agnostic transcription baseline

We re‑implement BasicPitch (Bittner et al., 2022) and Onsets&Frames (Hawthorne et al., 2018) in PyTorch as timbre‑agnostic baselines.

We remove the pitch‑prediction neck from BasicPitch, following the original observation that pitch supervision is nonessential, yielding a 56,517‑parameter model (3× the size of ours). Experiments labeled with ‘BP’ use this architecture.

For Onsets&Frames, we retain only the frame and onset branches and adapt the network to our input configuration, resulting in a 1,714,076‑parameter model. Experiments prefixed with ‘OF’ employ this architecture.

4.2.3 Timbre‑Separated transcription baseline

We reproduce the method of Tanaka et al. (2020). In the absence of official code and unspecified hyperparameters, we set the BiLSTM hidden size to 256, the STFT window length to 1024 (matching the original frequency resolution), the hop size to 256 (consistent with our setup), and the embedding size to 12. We replace the original loss with our InfoNCE loss (Eqn. 8) and use our AMT branch for multipitch estimation. The resulting timbre encoder contains 4,897,776 trainable parameters.

4.3 Overall performance comparison

Table 2 presents the results for timbre‑agnostic transcription. Comparing the rows ‘Ours,’ ‘BPfl,’ and ‘OFfl’ (all trained with focal loss) and the additional rows ‘MN,’ ‘BPMNfl,’ and ‘OFMNfl’ (trained on MusicNet), we observe that our model achieves performance on par with BasicPitch using significantly fewer parameters. Moreover, our CNN‑based architecture substantially outperforms the RNN‑based Onsets&Frames, confirming the efficiency of our design.

Table 2

Comparison of timbre‑agnostic transcription. Unless stated otherwise, models are trained on our synthetic dataset. ‘noLog’: EnergyNorm without log; ‘BN’: BatchNorm replacing EnergyNorm (BasicPitch‑style); ‘Conv39’: BasicPitch’s 39‑tap conv replacing our dilated conv; ‘BPloss’: BasicPitch’s loss; ‘MN’: trained on MusicNet; ‘lCQT’: learnable CQT; ‘lCQTMN’: learnable CQT trained on MusicNet; ‘BP’: re‑implementation of BasicPitch (Bittner et al., 2022); ‘BPfl’: BP with focal loss; ‘BPMNfl’: BPfl trained on MusicNet; ‘OF’: re‑implementation of Onsets&Frames (Hawthorne et al., 2018) with its original BCE loss; ‘OFfl’: OF with focal loss; and ‘OFMNfl’: OFfl trained on MusicNet. In headers, ‘tutti’ = full‑mix polyphonic pieces; ‘stems’ = single‑instrument samples.

Metric (%) | BACH10 tutti | BACH10 stems | PHENICX     | URMP tutti  | URMP stems
           | FF     FN    | FF     FN    | FF     FN   | FF     FN   | FF     FN
Ours       | 84.6   75.2  | 91.6   88.9  | 63.2   49.4 | 75.4   71.4 | 81.1   83.3
noLog      | 80.7   68.1  | 90.0   86.6  | 61.1   45.7 | 72.4   66.7 | 80.7   82.1
BN         | 85.9   76.4  | 91.9   88.6  | 58.5   46.0 | 72.6   67.8 | 78.5   80.3
Conv39     | 84.1   74.6  | 90.8   85.8  | 62.1   46.2 | 75.4   70.6 | 81.6   83.0
BPloss     | 84.3   63.6  | 91.2   77.0  | 44.6   46.2 | 74.5   61.7 | 81.1   71.2
MN         | 84.8   79.4  | 88.4   86.7  | 69.7   55.6 | 79.7   77.4 | 82.3   84.4
lCQT       | 79.1   63.5  | 88.4   83.5  | 63.2   47.9 | 74.8   68.1 | 81.4   81.9
lCQTMN     | 86.5   79.2  | 89.1   87.9  | 70.1   56.3 | 79.5   75.9 | 81.6   82.5
BP         | 84.8   53.7  | 92.0   26.9  | 58.0   44.9 | 70.7   60.4 | 80.9   76.8
BPfl       | 86.1   75.2  | 92.3   89.3  | 62.2   47.7 | 76.2   71.5 | 82.2   84.9
BPMNfl     | 85.7   80.5  | 89.3   87.8  | 69.8   54.0 | 79.9   77.9 | 82.8   84.9
OF         | 82.7   70.7  | 91.4   86.6  | 55.9   47.6 | 67.9   64.4 | 81.1   80.7
OFfl       | 82.1   72.3  | 91.8   90.0  | 55.5   47.0 | 70.1   66.7 | 81.2   83.7
OFMNfl     | 84.6   76.7  | 87.9   88.6  | 65.5   56.7 | 76.3   74.1 | 81.1   86.1

A similar trend appears in timbre‑separated transcription in Table 3: our method (‘Ours’) markedly surpasses the RNN‑based baseline of Tanaka et al. (2020) (‘Tanaka’). Notably, during training, both RNN baselines exhibit an earlier rise in validation loss, indicating overfitting to the training set. Given their much larger parameter counts, this suggests limited generalization under data constraints and further underscores the robustness, sample efficiency, and trainability of our lightweight architecture.

Table 3

Comparison of timbre-separated transcription. Unless stated otherwise, models are trained on MusicNet with the InfoNCE loss (Eqn. 8). ‘D16’: D=16; ‘MSE’: using the affinity loss (Eqn. 6); ‘Syn’: trained on our synthetic dataset; ‘Rescale’: forcibly scaling amplitudes using the frame prediction before InstanceNorm in the timbre-encoding branch; ‘Share’: sharing the first residual block between the two branches; and ‘Tanaka’: the baseline model (Tanaka et al., 2020) using InfoNCE. All experiments use identical pretrained AMT-branch parameters except ‘Share.’

Metric (%) | BACH10 2 mix      | BACH10 3 mix      | BACH10 4 mix      | URMP 2 mix        | URMP 3 mix
           | FFS   FS   ratio  | FFS   FS   ratio  | FFS   FS   ratio  | FFS   FS   ratio  | FFS   FS   ratio
Ours       | 83.4  84.6  99.0  | 77.8  80.1  96.7  | 66.2  72.7  89.9  | 68.9  66.8  82.6  | 58.5  60.0  77.1
D16        | 83.2  84.4  98.8  | 76.8  79.5  96.0  | 64.5  70.4  87.1  | 69.1  68.4  84.7  | 58.0  59.5  76.5
MSE        | 80.4  82.2  96.3  | 70.8  75.7  91.4  | 59.0  67.9  84.0  | 65.4  65.4  80.9  | 53.9  57.1  73.5
Syn        | 72.4  78.1  91.4  | 58.7  68.0  82.1  | 46.0  56.1  69.4  | 49.6  53.9  66.7  | 40.9  45.1  58.1
Rescale    | 83.4  84.6  99.1  | 76.8  79.6  96.2  | 63.6  70.9  87.7  | 68.4  66.8  82.7  | 58.6  59.8  76.9
Share      | 82.1  83.0  98.4  | 76.1  78.9  96.6  | 68.5  72.8  91.0  | 69.0  68.3  84.6  | 57.2  57.5  74.1
Tanaka     | 77.9  79.6  93.2  | 66.1  69.0  83.4  | 55.4  59.2  73.2  | 65.5  64.1  79.3  | 56.5  56.5  72.8

4.4 Ablation study on core components

EnergyNorm Comparing rows ‘Ours’ and ‘noLog’ in Table 2, we find that nonnegative features indeed constrain the model’s expressive capacity. We further compare against BatchNorm (row ‘BN’), which is used in BasicPitch. While ‘BN’ slightly outperforms our method on BACH10, it underperforms significantly on the other two test sets. Inspired by Esaki et al. (2024), we hypothesize that our synthetic data distribution aligns more closely with BACH10, and because BatchNorm stores learned statistics for inference, it suffers from poor generalization to more divergent domains, highlighting the advantage of our normalization strategy in cross‑dataset robustness.

Dilated convolution Replacing BasicPitch’s big kernel with our frequency‑dilated kernel (rows ‘Ours’ and ‘Conv39’) yields slightly better performance despite fewer parameters, confirming the benefit of an expanded receptive field. Moreover, non‑dilated variants consistently require higher thresholds (not shown), likely because they cannot directly model octave‑spanning context, leading to stronger harmonic ghosts (false positives) that must be suppressed post‑hoc.

Focal loss While the sophisticated postprocessing in BasicPitch may attenuate the undesirable impact of the blurred onset predictions produced by its original loss (Figure 2), such mitigation is inherently limited. Table 2 reveals that models trained with focal loss (‘Ours’, ‘BPfl’, ‘OFfl’) consistently achieve significantly higher FN than their counterparts without it, demonstrating that accurate onset estimation remains crucial for high‑quality note creation.

InfoNCE loss Comparing rows ‘Ours’ and ‘MSE’ in Table 3, which correspond to the InfoNCE loss (Eqn. 8) and the affinity loss (Eqn. 6), respectively, shows that InfoNCE consistently and significantly outperforms the affinity objective.

Postprocessing As shown in Table 3, FFS is almost always lower than FS, indicating that our postprocessing method is not only faster but also yields better performance. This is also illustrated in the first row of Figure 5, where note‑level aggregation yields more separable embeddings, significantly improving clustering robustness.

Figure 5

T‑distributed stochastic neighbor embedding (t‑SNE) visualization of timbre embeddings. (a) Frame‑level embeddings for BACH10 Piece 2, (b) note‑level aggregates of (a), (c) frame‑level for URMP Piece 18, and (d) frame‑level for URMP Piece 18 using top‑k attention.

4.5 Exploratory and negative results analysis

4.5.1 Limitations of synthetic data

Synthetic data underperforms MusicNet due to two domain gaps. First, our synthetic notes span the full 84‑note range, whereas real instruments occupy limited registers; timbre is approximately consistent only within a narrow pitch range (Duan et al., 2008), leading to fragmented clusters when scattered across octaves. Second, synthesizers produce static timbres lacking the dynamic variation of real performances. These limitations underscore the need for human‑composed and real‑recorded datasets capturing authentic timbral complexity.

4.5.2 Learnable CQT

Motivated by learnable time–frequency representations, we explore whether a learnable CQT can improve performance. To ensure stability, we first train the full network with fixed CQT parameters, then unfreeze them for joint fine‑tuning. While training loss decreases noticeably, test performance degrades on our synthetic data (‘Ours’ vs. ‘lCQT’) and yields no improvement on MusicNet (‘MN’ vs. ‘lCQTMN’).

We attribute this to overfitting: the added flexibility enhances model capacity, but, with limited training data, it leads to memorization rather than generalization. For lightweight systems, handcrafted CQT parameters thus offer better robustness. The potential benefits of learnable CQT may only emerge in large‑scale settings, which remains an open question.

4.5.3 Exploring alternative encoding architectures

We evaluate several design choices for the timbre‑encoding branch; results are summarized in Table 3.

Encoding dimension. Increasing the embedding dimension to 16 (row ‘D16’) yields no significant gain. Notably, Lin et al. (2021) achieve effective separation with only six dimensions, suggesting that even lighter encodings may suffice.

Multitask coupling. Prior work (Cwitkowitz et al., 2024; Manilow et al., 2020) suggests that multitask learning can be mutually beneficial. We test two coupling strategies: 1) sharing the first residual block between branches (‘Share’) and 2) rescaling the timbre embeddings using corresponding frame predictions before InstanceNorm (‘Rescale’). Neither improves performance and often degrades it, indicating that such coupling may not suit lightweight networks.

Attention mechanisms. In light of the contributions of Transformers to timbre modeling (Wu et al., 2020, 2024), we explore attention (Vaswani et al., 2017) for its potential to incorporate global context and enhance cluster compactness (not included in the table). Given the large N×T size, we restrict ourselves to linear‑complexity variants. All tested configurations significantly hurt performance. Figure 5(d) shows that self‑attention densifies intra‑class connections but spuriously links distinct timbre manifolds. This occurs when attention weights connect ambiguous boundary regions where embeddings from different instruments appear similar, e.g., at overlapped notes, thereby bridging otherwise separable clusters. This is particularly detrimental to spectral clustering, which relies on graph connectivity rather than metric separability. To prevent such fusion, we also evaluate a Transformer variant with residual connections on the attention output, which merely maintains baseline performance. Attention might be more effective after note‑level aggregation, but that prevents end‑to‑end training. For lightweight networks, especially when frame‑level embeddings are not well separated, such operations appear unnecessary and potentially detrimental to separability.

4.6 Efficiency

Our model is simple and lightweight enough to run directly in a web browser. On an Intel® Ultra 7 255H CPU using Microsoft Edge, timbre‑agnostic transcription of a 301‑s duet takes 17.1 s, compared to 39.1 s for BasicPitch. The postprocessing step requires 78 ms. For timbre‑separated transcription, our model inference takes 38.8 s, followed by 704 ms for clustering‑based postprocessing. In contrast, RNN‑based baselines suffer from ONNX export issues and have prohibitively large parameter counts, rendering them impractical for real‑world deployment.

5 Reproducibility

The data and code used in this paper are accessible via https://github.com/madderscientist/timbreAMT and DOI: https://doi.org/10.5281/zenodo.19229826.

6 Conclusion

We have introduced a compact and efficient architecture for timbre‑separated music transcription that overcomes major limitations of current approaches: fixed instrument vocabularies, poor generalization to unseen timbres, and high computational cost. By performing clustering on coherent note events rather than raw time–frequency bins, our method reduces fragmentation, improves separation quality, and supports flexible inference without waveform reconstruction.

Several promising directions remain for future work. First, the current reliance on empirically tuned thresholds for note creation could be replaced by learned, content‑adaptive thresholds. Second, training data could be enhanced by mixing instruments in complementary frequency bands to better reflect natural orchestration. Finally, our model still requires the number of instrument classes to be specified at inference time and is unable to handle overlapping situations. The matching‑filter approach described at the end of Section 3.7 remains a promising direction worth further exploration.

Competing Interests

The authors have no competing interests to declare.

Note

DOI: https://doi.org/10.5334/tismir.300 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jun 28, 2025
Accepted on: Mar 25, 2026
Published on: Apr 15, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Ruigang Li, Yongxu Zhu, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.