
PESTO: Real‑Time Pitch Estimation with Self‑Supervised Transposition‑Equivariant Objective

Open Access | Sep 2025


1 Introduction

Pitch, alongside loudness and timbre, is one of the primary auditory sensations and plays a central role in audio perception. Its automatic estimation is a fundamental task in audio analysis, with broad applications in music information retrieval (MIR) and speech processing.

Pitch perception is closely tied to the acoustic periodicity of the audio waveform. From a spectral perspective, it is related to the spacing between partials, which corresponds to the fundamental frequency (F0) (Oxenham, 2012; Yost, 2009). In the case of harmonic or quasi‑harmonic signals, these two characteristics coincide, so that the perceived pitch is often measured via F0 estimation. This has been a long‑standing research problem, with numerous solutions based on signal‑processing techniques being proposed (Camacho and Harris, 2008; Dubnowski et al., 1976; de Cheveigné and Kawahara, 2002; Noll, 1967).

More recently, deep learning methods such as CREPE (Kim et al., 2018) have achieved almost‑perfect results, so pitch estimation is often considered a solved problem. However, these approaches exhibit several limitations: they are not lightweight, rely heavily on annotated data that are difficult to obtain, and fail to generalize well to out‑of‑domain data, limiting their practical utility. Additionally, they rely on non‑causal convolutional neural networks with large kernels applied to downsampled audio waveforms, making them unsuitable for real‑time applications.

Real‑time estimation of fundamental frequency is, indeed, an integral component of a range of speech and music audio systems. These include staples of modern audio production, such as pitch correction and audio‑to‑MIDI conversion, as well as creative musical effects such as vocal harmonizers. Interactive digital tools for music education,1 creativity support tools, and jam‑along systems also rely heavily on real‑time inference of musical attributes, including pitch (Meier et al., 2023). As noted by Stefani and Turchet (2022), real‑time application of MIR algorithms is generally challenging due to computational limitations and the availability of only causal information. Selecting appropriate algorithms for pitch tracking thus typically requires a compromise between efficiency and performance.

Neural audio models have enabled new categories of audio processing, such as timbre transfer (Huang et al., 2018), speaking voice conversion (Bargum et al., 2024; Sisman et al., 2021), and singing voice conversion (Huang et al., 2023). These techniques have garnered interest in real‑time applications, as demonstrated by various open‑ and closed‑source software releases.2

Differentiable digital signal processing (DDSP) (Engel et al., 2020b), a hybrid of deep learning and signal processing, has been central to these advances (Hayes et al., 2023b). Due to the challenges of optimizing oscillation frequency by gradient descent (Hayes et al., 2023a), such models often rely on fundamental frequency (F0) estimates to predict synthesis parameters and drive oscillatory components.

Neural source‑filter models (Wang et al., 2019), another class of hybrid neural vocoders, employ periodic excitation signals with F0 matching the target speech and are widely used in singing voice conversion (SVC) (Huang et al., 2023). Popular real‑time SVC systems, such as So‑VITS‑SVC3 and DDSP‑SVC,4 depend on external F0 estimators, which contribute to inference latency.

For real‑time applications of such models, there is a tension between what they require from an F0 estimator at training time, i.e., high‑quality F0 annotations, and at inference time, i.e., streaming prediction with low latency. As a result, practitioners face a challenging design choice. Some opt to use an efficient estimator for both training and inference. For example, some applications (Fabbro et al., 2020; Ganis et al., 2021; Yang et al., 2024) opt for a classical method such as Yin (de Cheveigné and Kawahara, 2002), while others use a knowledge‑distilled neural network.5 This minimizes inference latency, albeit at the expense of training with sub‑optimal F0 annotations which may harm performance (Alonso and Erkut, 2021). Others opt to use a high‑quality neural pitch estimator, which will provide more reliable annotations at the expense of considerable inference latency. Alternatively, practitioners may decide to train with the high‑quality estimator and substitute in the more efficient algorithm at inference time, at the risk of introducing a mismatch between train and test conditions.

Paper proposal. This work addresses such training and inference design trade‑offs by proposing several key contributions: (i) a lightweight state‑of‑the‑art neural pitch estimator, (ii) a self‑supervised training strategy that does not require labeled data, and (iii) a real‑time causal inference model.

More precisely, our approach applies a Variable‑Q Transform (VQT) to audio signals and processes pitch‑shifted frames using a Siamese architecture (Section 3.1). We incorporate a Toeplitz fully‑connected layer into the architecture, enabling it to naturally preserve transpositions (Section 3.3). Furthermore, we formulate pitch estimation as a classification problem and introduce a novel loss function that enforces pitch‑shift equivariance without requiring a decoder (Section 3.2). To further enhance practical utility, we propose a simple yet effective modification that allows the model to process audio streams in real‑time (Section 4).

We evaluate our model both on music and speech datasets (Section 5). Our results (Section 6) indicate that it significantly outperforms prior self‑supervised baselines and achieves results competitive with supervised methods, using 68 times fewer parameters than the most lightweight supervised model. In particular, we observe better generalization to unseen datasets compared to the baselines.

Moreover, because our model operates on VQT frames rather than raw audio waveforms, it can handle arbitrary sampling rates and hop sizes during inference without requiring downsampling, regardless of the parameters used during training. These features, combined with the ability to retrain or fine‑tune the model without annotated data and minimal computational resources, make it highly practical for various real‑world scenarios.

To facilitate its usage and encourage further research in this direction, we open‑source the full training code and release a pip‑installable package, along with pretrained models.6

This paper extends our previous publication (Riou et al., 2023) in several ways. We replace the original frontend (Constant‑Q Transform [CQT]) by a VQT and show it drastically improves performance. We also introduce slight architectural changes. Moreover, we provide a deeper analysis with additional experiments, extending evaluations to speech data and comparing with the recent PENN model (Morrison et al., 2023). More detailed ablation studies are also conducted. In addition, we implement and release a real‑time implementation of our model compatible with audio streams, enhancing its practical utility. Finally, we improve the clarity of the paper with refined explanations, redesigned figures, and more detailed discussions.

2 Related Work

2.1 Pitch estimation

Monophonic pitch estimation has been a subject of interest for over 50 years (Noll, 1967). The earlier methods typically obtain a pitch curve by processing a candidate‑generating function such as cepstrum (Noll, 1967), autocorrelation function (Dubnowski et al., 1976), or average magnitude difference function (Ross et al., 1974). Other functions, such as the normalized cross‑correlation function (Boersma, 1993; Talkin, 1995) and the cumulative mean normalized difference function (de Cheveigné and Kawahara, 2002; Mauch and Dixon, 2014), have also been proposed. On the other hand, Camacho and Harris (2008) perform pitch estimation by predicting the pitch of the sawtooth waveform whose spectrum best matches the one of the input signal.
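To make these candidate‑generating functions concrete, the sketch below implements the cumulative mean normalized difference function at the core of Yin (de Cheveigné and Kawahara, 2002). It is an illustrative simplification, not the reference implementation: the full algorithm adds an absolute threshold and parabolic interpolation around the selected lag.

```python
import numpy as np

def cmndf(frame, tau_max):
    """Cumulative mean normalized difference function (Yin's core).
    Illustrative sketch only; the real algorithm adds thresholding and
    parabolic interpolation for sub-sample lag accuracy."""
    F = len(frame)
    d = np.zeros(tau_max)
    for tau in range(1, tau_max):
        diff = frame[: F - tau] - frame[tau:]
        d[tau] = np.dot(diff, diff)          # squared difference function
    out = np.ones(tau_max)
    cumsum = np.cumsum(d[1:])
    # Normalizing by the cumulative mean removes the trivial dip at lag 0.
    out[1:] = d[1:] * np.arange(1, tau_max) / np.maximum(cumsum, 1e-12)
    return out

# A 100 Hz sine sampled at 8 kHz has a period of exactly 80 samples, so the
# CMNDF dips at lag 80 and the estimated F0 is fs / 80 = 100 Hz.
fs = 8000
t = np.arange(2048) / fs
frame = np.sin(2 * np.pi * 100.0 * t)
d = cmndf(frame, 120)
tau_hat = 40 + int(np.argmin(d[40:]))   # search lags 40..119 (~67-200 Hz)
f0 = fs / tau_hat
```

The lag search range is restricted here so that only the fundamental period (and not its multiples) falls inside it, sidestepping the octave‑error handling that the full method provides.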

As in many other domains, these signal processing–based approaches have been recently outperformed by data‑driven ones. In particular, CREPE (Kim et al., 2018) is a deep neural network consisting of six convolutional blocks followed by a fully‑connected layer. Operating directly on waveform data, CREPE processes audio chunks of 64 ms (1024 samples at 16 kHz). Trained in a supervised manner on a large collection of music datasets, it achieves exceptional performance and has become a widely adopted solution for monophonic pitch estimation.

Building on CREPE, Singh et al. (2021) introduce dilated convolutions with residual connections to expand the receptive field, while Ardaillon and Roebel (2019) improve the architecture further and propose downsampling input signals to 8 kHz to reduce computational costs. Morrison et al. (2023) later propose additional training strategies, such as incorporating layer normalization, increasing batch sizes, and making minor architectural changes to enhance performance.

These advancements have led to state‑of‑the‑art results using both music and speech datasets. However, the fundamental framework—modeling pitch estimation as a supervised classification problem—remains unchanged. This dependence on large quantities of annotated data is a significant drawback, as obtaining precise F0 annotations is time‑consuming and challenging. Furthermore, these methods often perform poorly when applied to data outside the training distribution (Morrison et al., 2023), limiting their applicability in diverse scenarios.

2.2 Self‑Supervised Learning with siamese networks

Self‑Supervised Learning (SSL) has emerged as a promising paradigm to address the challenges associated with gathering large quantities of annotated data, which is often tedious, error‑prone, and biased. SSL leverages the data itself to provide a supervision signal by training a neural network to solve a pretext task. One common pretext task involves training Siamese networks (Hadsell et al., 2006) to project data points into a latent space and minimize the distance between pairs of inputs that share semantic information. These positive samples (or views) can be artificially created by applying transforms to an input data point.

In other words, the goal of Siamese networks is to learn a mapping f : X → Y that is invariant to a set of transforms T (with each t : X → X); that is, for any data point x ∈ X and any transform t ∈ T, the following holds:

1
$f(t(x)) = f(x).$

The set of transforms T is usually composed of semantic‑preserving data augmentations. For example, in the image domain, Chen et al. (2020) propose using transforms such as cropping, rotation, and color jittering, which drastically change the pixel values but preserve the content of the image itself. Later, Baevski et al. (2022) propose creating views by masking the input.

However, without additional constraints, the network might simply learn the trivial solution to Eq. (1), mapping all inputs to the same point. This phenomenon, called representation collapse, can be typically prevented by adding new loss terms computed over several data points. Successful approaches, typically requiring large batch sizes, include minimizing the similarity between negative samples through a contrastive loss (Chen et al., 2020) or regularizing over the batch statistics (Bardes et al., 2022; Wang and Isola, 2020; Zbontar et al., 2021).

Introducing asymmetry in the Siamese architecture can also be exploited to prevent collapse by changing the training dynamics, such as adding a predictor network after one branch while stopping gradients in the other (Chen and He, 2021; Grill et al., 2020). This approach relies solely on positive pairs and is less sensitive to batch size.

2.3 Equivariant Self‑Supervised Learning

Most SSL models aim to learn a mapping that is invariant to certain transformations. However, recent research has explored the idea of learning equivariant mappings instead.

Equivariance, in contrast to invariance, can be mathematically expressed as follows. Let G be a group, with group actions t : X × G → X and t′ : Y × G → Y. A mapping f : X → Y is said to be equivariant with respect to G if, for any x ∈ X and g ∈ G, the following condition holds:

2
$f(t(x, g)) = t'(f(x), g).$

In other words, t and t′ are transforms acting on the input and output spaces, respectively, with parameters from the group G, and a mapping f is equivariant with regard to G if transforming the input x by t results in a corresponding transformation of the output f(x) under t′.

Note that invariance is a special case of equivariance, where t′(·, g) acts as the identity function for all g ∈ G. However, in general, Eq. (2) does not have a trivial solution for f. As a result, Siamese networks trained to optimize an equivariance objective are not prone to collapse.
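A toy numeric check can make Eq. (2) concrete. In the sketch below (our illustration, not from the paper), G is the group of integer shifts acting on vectors by circular translation: a circular convolution commutes with that action and is therefore equivariant, while summing all bins is invariant (the special case where the output action is the identity).

```python
import numpy as np

def circular_conv(x, kernel=np.array([0.25, 0.5, 0.25])):
    """Circular convolution: each output bin mixes its circular neighbors,
    so the operation commutes with circular shifts (equivariance)."""
    n = len(x)
    return np.array([
        sum(kernel[j] * x[(i + j - 1) % n] for j in range(len(kernel)))
        for i in range(n)
    ])

x = np.array([1.0, 2.0, 3.0, 4.0, 0.0])
g = 2  # group element: shift by two bins

# Equivariance, Eq. (2): f(t(x, g)) == t'(f(x), g) with t = t' = np.roll
assert np.allclose(circular_conv(np.roll(x, g)), np.roll(circular_conv(x), g))
# Invariance as the special case t'(., g) = identity:
assert np.isclose(np.sum(np.roll(x, g)), np.sum(x))
```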

Equivariant representation learning has been explored as a means to navigate the latent space of (variational) autoencoders (Falorsi et al., 2018; Hinton et al., 2011). More recently, efforts have been made to incorporate equivariance into self‑supervised models, conditioning parts of the neural architecture on the parameters of the transforms applied (Dangovski et al., 2021; Devillers and Lefort, 2023). These approaches have shown particular promise in computer vision tasks that involve rotations (Garrido et al., 2023; Winter et al., 2022).

2.4 Self‑supervised learning for MIR

While originally developed for image representation learning, the aforementioned approaches have been successfully adapted to the audio domain (Anton et al., 2023; Niizumi et al., 2022; Saeed et al., 2021). Notably, contrastive pretraining has demonstrated impressive results for various MIR downstream tasks, such as auto‑tagging and genre classification (McCallum et al., 2022; Spijkervet and Burgoyne, 2021).

Few studies have attempted to train neural networks with equivariant properties to pitch and tempo shifts. SPICE (Gfeller et al., 2020) offers a method for pitch estimation without the need for annotated data. It creates pairs of views by pitch‑shifting an input CQT frame by a known number of semitones, k1 and k2. The model learns a mapping f that projects a CQT frame to a scalar value, and it is trained to ensure that the difference between the scalar projections is proportional to the pitch difference, k2 − k1. However, Gfeller et al. (2020) observe that this objective alone is insufficient for training, and thus the model requires an additional decoder and a reconstruction loss to prevent collapse. This approach has been extended to tempo (Gagneré et al., 2024; Morais et al., 2023) and key estimation (Kong et al., 2024).

Quinton (2022) applies a comparable approach to tempo estimation. In this case, an audio segment is time‑stretched by two factors, α1 and α2, to create a pair of segments (x1,x2). A neural network then projects these segments to scalars z1 and z2, and the model is trained to minimize the following loss function:

3
$\mathcal{L} = \left| \frac{z_2}{z_1} - \frac{\alpha_2}{\alpha_1} \right|.$

This objective encourages the scalar projections to be proportional to the actual tempo of the song. Unlike SPICE, Quinton (2022) does not require a decoder, and the model avoids collapse by using only the scalar projection loss.

Another family of self‑supervised and unsupervised methods for pitch estimation relies on analysis‑by‑synthesis. Engel et al. (2020a) train a neural network to predict the parameters of a differentiable harmonic‑plus‑noise synthesizer that approximates the input audio, relying on parameter regression on synthetically generated signals. Following a similar approach, Torres et al. (2024) instead use an optimal‑transport‑inspired spectral loss function for training a fundamental frequency estimator for harmonic signals.

3 PESTO

Our model is a neural network fθ : R^F → [0,1]^K, with parameters θ, which takes as input a truncated VQT frame and returns a pitch distribution y = (y_1, …, y_K) ∈ [0,1]^K.

3.1 Frontend

In this section, we first describe the CQT, on which the VQT is based. We then introduce the VQT, a variant of the CQT that smoothly decreases the Q factors of the analysis filters for low frequencies.

3.1.1 The constant‑Q transform

The CQT shares fundamental principles with wavelet analysis, offering a frequency‑dependent time–frequency resolution (Brown, 1991). Its defining characteristic is the constant ratio (Q) between center frequency and bandwidth across all frequency bins. The center frequencies fk follow a geometric progression:

4
$f_k = f_{\min} \cdot 2^{\frac{k-1}{12B}},$

where fmin represents the lowest analysis frequency and B denotes the number of bins per semitone in an equal‑tempered scale. This logarithmic frequency spacing creates a musically intuitive representation with the property that, in its log‑frequency domain, a frequency translation represents a chromatic transposition regardless of the absolute frequency while maintaining the distance between harmonics for a given pitch. In practical implementations, particularly in MIR, the CQT typically employs a uniform hop size across all frequency bands matched to the highest frequency's requirements. While this approach introduces redundancy in the low‑frequency bands compared to the Wavelet transform, it ensures temporal alignment of CQT coefficients across all frequency bins, facilitating subsequent analysis and processing tasks.
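A short numeric sketch of Eq. (4) (f_min = 32.7 Hz, i.e., C1, is an illustrative choice; B = 3 echoes the paper's setting) confirms the two properties used throughout: one octave corresponds to exactly 12B bins, and the ratio between adjacent bins is constant, so a translation in log‑frequency is a transposition.

```python
import numpy as np

# Center frequencies of Eq. (4): f_k = f_min * 2**((k - 1) / (12 * B)).
f_min, B, K = 32.7, 3, 384
k = np.arange(1, K + 1)
f = f_min * 2.0 ** ((k - 1) / (12 * B))

# Moving up 12 * B bins (one octave) exactly doubles the frequency:
assert np.isclose(f[12 * B] / f[0], 2.0)
# The ratio between adjacent bins is constant, so a fixed translation of
# the log-frequency axis transposes every pitch by the same interval:
ratios = f[1:] / f[:-1]
assert np.allclose(ratios, ratios[0])
```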

3.1.2 The variable‑Q transform

The VQT (Schörkhuber et al., 2014) differs from the CQT by adding a parameter γ, which smoothly decreases the Q factors of the analysis filters for low frequencies. For center frequency fk, the length of the analysis window is given by

5
$w_k = \frac{Q f_s}{f_k + \gamma / \zeta},$

where Q is the Q factor of the analysis filters; f_s is the sampling rate; and ζ = 2^{1/(12B)} − 1 is a constant that depends on B, the number of bins per semitone, following the implementations of Cheuk et al. (2020) and Schörkhuber et al. (2014). For γ = 0, the VQT is equivalent to the CQT. As in the CQT, the center frequencies {f_k}, k = 1, …, K, follow an exponential scaling. Figure 1 illustrates the analysis window size as a function of γ and center frequency.
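The effect of γ on window length can be sketched numerically, assuming the nnAudio‑style formulation w_k = Q·f_s/(f_k + γ/ζ) with ζ = 2^{1/(12B)} − 1 and Q = 1/ζ; f_min = 27.5 Hz (A0) and f_s = 48 kHz match the examples discussed later in Section 4.2.

```python
import numpy as np

# Largest analysis window of Eq. (5) for several gamma values, under the
# assumptions stated above.  Kernels are padded to a power of 2, as in
# Figure 1's caption.
fs, B = 48000, 3
zeta = 2.0 ** (1 / (12 * B)) - 1
Q = 1 / zeta                  # standard Q factor for adjacent log-spaced bins
f_min = 27.5                  # A0, the lowest (hence longest) analysis bin

padded_sizes = {}
for gamma in (0.0, 7.0):
    w_max = Q * fs / (f_min + gamma / zeta)                  # longest kernel
    padded_sizes[gamma] = 2 ** int(np.ceil(np.log2(w_max)))  # power-of-2 pad
```

Under these assumptions, γ = 0 (the CQT case) yields roughly 90k samples, padded to 131072 (about 2.7 s), while γ = 7 yields roughly 6.4k samples, padded to 8192 (171 ms), matching the figures quoted in Section 4.2.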

Figure 1

Variable‑Q Transform (VQT) analysis window size as a function of γ and analysis center frequency. During the computation of the VQT, all kernels are padded to the nearest power of 2 to the largest analysis window. The Constant‑Q Transform corresponds to the case γ=0. Both axes are log‑scaled.

Since its time–domain analysis window for low frequencies is smaller than that of the CQT, the VQT can be computed more efficiently in terms of both time and memory. As we demonstrate empirically, smaller analysis windows that do not span several frames are also beneficial for the pitch estimation task.

3.1.3 Pitch‑shift in the VQT domain

Our technique to build pitch‑shifted views from a single VQT frame, originally proposed for the CQT by Gfeller et al. (2020), is depicted in Figure 2.

Figure 2

Illustration of the pitch‑shift process in the Variable‑Q Transform (VQT) domain. From a given frame, we construct two views by cropping equally‑sized sub‑frames from it, with a shift of k between them. Since the frequency scale is logarithmic in the VQT domain, this translation corresponds to an approximate pitch shift of k bins.

More precisely, given a full VQT frame (x_1, …, x_F) and an integer k_max ≤ F/2, let F′ = F − 2·k_max. Then, given an integer k sampled from [−k_max, k_max], one can easily compute frames with approximately the same timbre and a pitch shift of k bins, i.e., k/B semitones, by extracting

$x = (x_{1+k_{\max}}, \dots, x_{F-k_{\max}}) \in \mathbb{R}^{F'}$

and

$x^{(k)} = (x_{1+k_{\max}+k}, \dots, x_{F-k_{\max}+k}) \in \mathbb{R}^{F'}$

from the original VQT frame x. We set k_max to 16.
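The cropping construction above can be sketched in a few lines (frame content is random here purely for illustration): the two sub‑frames have identical shape and are literal translations of each other along the log‑frequency axis.

```python
import numpy as np

# Sketch of the Section 3.1.3 trick: two equally-sized crops of one VQT
# frame, offset by k bins, act as pitch-shifted views of each other.
F, k_max = 100, 16
frame = np.random.randn(F)

def views(frame, k, k_max):
    Fp = len(frame) - 2 * k_max                 # F' = F - 2 * k_max
    x = frame[k_max : k_max + Fp]               # reference view
    x_k = frame[k_max + k : k_max + k + Fp]     # view shifted by k bins
    return x, x_k

x, x_k = views(frame, 5, k_max)
assert x.shape == x_k.shape == (F - 2 * k_max,)
# The shifted view is the same frame translated by k = 5 bins:
assert np.array_equal(x_k[:-5], x[5:])
```

Because the shift is a pure translation of the crop window, any k in [−k_max, k_max] can be sampled without recomputing the transform.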

3.2 Training objective

The training procedure for our model takes inspiration from previous SSL methods based on Siamese networks. It is depicted in Figure 3.

Figure 3

Overview of the PESTO model. Given a 1D Variable‑Q Transform frame (displayed horizontally, where the horizontal axis corresponds to frequency), we first crop it as described in Section 3.1.3 to create a pair of pitch‑shifted views (x,x(k)). We then obtain x~ and x~(k) by randomly applying pitch‑preserving transforms to the views. The neural network fθ predicts pitch distributions from the different views and is trained by minimizing both an invariance loss between y and y~ and an equivariance loss between y~ and y~(k).

Given a single VQT frame x, we first crop the frame with a known offset and then translate it by a random but known number of bins k. Let x(k) be the resulting pitch‑shifted frame. Then, we randomly apply a set of pitch‑preserving transformations to both x and x(k) (white noise, random gain). We denote the transformed versions as x~ and x~(k), respectively. Next, we pass x, x~, and x~(k) through the same neural network fθ : R^F → [0,1]^K, obtaining the pitch distributions y, y~, and y~(k), respectively.

We know by construction that x and x~ have the same pitch. Therefore, we want their pitch distributions y and y~ to be equal, ensuring that fθ is invariant to the transformations. Similarly, we know that the pitch difference between x~ and x~(k) is exactly k bins. Thus, we want their pitch distributions to be equal but shifted by k bins, ensuring that the network is equivariant to the pitch shift.

We train the network fθ to minimize objectives that enforce these invariance and equivariance constraints.

3.2.1 Invariance loss

We want the distributions y and y~ to be as close as possible. To achieve this, we simply minimize the cross‑entropy of y relative to y~:

6
$\mathcal{L}_{\mathrm{inv}}(y, \tilde{y}) = -\sum_{i=1}^{K} \tilde{y}_i \log y_i.$

3.2.2 Equivariance loss

By construction, there is a pitch shift of exactly k bins between x~ and x~(k). In other words, for any pitch i ∈ {1, …, K}, the probability for the pitch of x~ to be i equals the probability for the pitch of x~(k) to be i + k. For example, if we apply a pitch shift of +1 semitone to the frame, the probability that the original frame is a C4 is equal to the probability that the transposed one is a C#4, and this holds for every pitch independently of the ground‑truth pitch of the frame.

As fθ should return the pitch distribution of its input, we therefore want its outputs y~ and y~(k) to satisfy:

7
$\tilde{y}^{(k)}_{i+k} = \tilde{y}_i,$

for all i ∈ {1, …, K − k}.

To achieve this, we consider

$a = (\alpha, \alpha^2, \dots, \alpha^K) \in \mathbb{R}^K,$

where α>0 is a fixed scalar.

Then, assume that

8
$\begin{cases} \tilde{y}^{(k)}_i = 0 & \text{for all } i \le k \\ \tilde{y}_i = 0 & \text{for all } i > K - k \end{cases}$

If Eq. (7) holds, then

9
$a^\top \tilde{y}^{(k)} = \sum_{i=k+1}^{K} \alpha^i \tilde{y}^{(k)}_i = \sum_{i=1}^{K-k} \alpha^{i+k} \tilde{y}_i \;\;\text{(by Eq. (7))}\; = \alpha^k \, a^\top \tilde{y}.$

Taking inspiration from Quinton (2022), we therefore define our criterion Lequiv as

10
$\mathcal{L}_{\mathrm{equiv}}(\tilde{y}, \tilde{y}^{(k)}, k) = h_\tau\!\left( \frac{a^\top \tilde{y}^{(k)}}{a^\top \tilde{y}} - \alpha^k \right),$

where h_τ stands for the Huber loss function (Huber, 1964), defined for any scalar x ∈ R by

11
$h_\tau(x) = \begin{cases} \frac{x^2}{2} & \text{if } |x| \le \tau \\ \frac{\tau^2}{2} + \tau (|x| - \tau) & \text{otherwise.} \end{cases}$

Thanks to hτ, the gradients of our equivariance loss are proportional to the error when it is small enough (e.g., a few cents) but constant otherwise (e.g., in the case of octave errors).
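A small numeric sketch of Eqs. (10)–(11) (α and τ values are illustrative, not the paper's hyperparameters): the geometric readout aᵀy turns a shift of k bins into a multiplication by α^k, so a correctly shifted pair yields zero loss.

```python
import numpy as np

def huber(x, tau=1.0):
    # Eq. (11): quadratic near zero, linear for large errors.
    return np.where(np.abs(x) <= tau,
                    x ** 2 / 2,
                    tau ** 2 / 2 + tau * (np.abs(x) - tau))

def equiv_loss(y, y_k, k, alpha=1.03, tau=1.0):
    # Eq. (10): compare the ratio of geometric readouts to alpha**k.
    a = alpha ** np.arange(1, len(y) + 1)
    return float(huber(a @ y_k / (a @ y) - alpha ** k, tau))

K, k = 32, 4
y = np.zeros(K); y[10] = 1.0          # sharp distribution at bin 10
y_k = np.zeros(K); y_k[10 + k] = 1.0  # same distribution shifted by k bins

assert np.isclose(equiv_loss(y, y_k, k), 0.0)   # correct shift: zero loss
assert equiv_loss(y, y_k, k + 1) > 0            # wrong shift: positive loss
```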

From Eq. (9), if both conditions (7) and (8) hold, then Lequiv = 0. However, Lequiv can be null without these conditions being satisfied. Hence, a third loss term is needed.

3.2.3 Regularization loss

Since we want y~ and y~(k) to be identical up to a translation of k bins, we propose to minimize the shifted cross‑entropy (SCE) of one relative to the other. In practice, we pad both distributions with zeros on both sides, i.e., for any integer i ∈ Z, if i < 1 or i > K, then y~_i = y~(k)_i = 0.

The SCE loss LSCE is then defined as

12
$\mathcal{L}_{\mathrm{SCE}}(\tilde{y}, \tilde{y}^{(k)}, k) = -\sum_{i=1}^{K} \tilde{y}^{(k)}_{i+k} \log \tilde{y}_i.$

This loss is minimal if and only if the conditions from Eqs. (7) and (8) are met. However, contrary to Lequiv, the (shifted) cross‑entropy does not take into account the ordering of the probability bins, making the value of LSCE independent of the actual pitch error between y~ and y~(k). We, therefore, combine both losses.
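The SCE of Eq. (12) can be sketched directly (the epsilon guard is ours, added only so that log(0) does not occur in this toy setting): the loss is near zero when the distributions are compared at the true shift and large otherwise.

```python
import numpy as np

def shifted_cross_entropy(y, y_k, k, eps=1e-12):
    """Eq. (12): cross-entropy between y and y_k translated back by k bins.
    Out-of-range indices are treated as the zero-padding of Section 3.2.3;
    eps only guards log(0) in this sketch."""
    K = len(y)
    loss = 0.0
    for i in range(K):
        if 0 <= i + k < K:          # indices outside [0, K) are zero-padded
            loss -= y_k[i + k] * np.log(y[i] + eps)
    return loss

K, k = 32, 3
y = np.zeros(K); y[5] = 1.0
y_k = np.zeros(K); y_k[5 + k] = 1.0   # y translated by k bins

right = shifted_cross_entropy(y, y_k, k)       # compared at the true shift
wrong = shifted_cross_entropy(y, y_k, k - 1)   # compared at a wrong shift
assert right < 1e-9 and wrong > 1.0
```

Note that `wrong` is large regardless of whether the shift is off by one bin or one octave, which is precisely why the SCE must be combined with the order‑aware equivariance loss.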

3.2.4 Loss weighting

Conceptually, there is no reason for our final objective not to be symmetric. However, cross‑entropy is not inherently symmetric, as it is defined for one distribution relative to another; we therefore symmetrize it by swapping terms. Additionally, we observed in our preliminary experiments that stopping the gradients of the target distributions helps stabilize training. Therefore, our complete loss objective incorporates this gradient‑stopping mechanism (denoted by sg) and is written as follows:

13
$\mathcal{L} = \frac{\lambda_{\mathrm{inv}}}{2}\left(\mathcal{L}_{\mathrm{inv}}(y, \mathrm{sg}(\tilde{y})) + \mathcal{L}_{\mathrm{inv}}(\tilde{y}, \mathrm{sg}(y))\right) + \frac{\lambda_{\mathrm{equiv}}}{2}\left(\mathcal{L}_{\mathrm{equiv}}(\tilde{y}, \tilde{y}^{(k)}, k) + \mathcal{L}_{\mathrm{equiv}}(\tilde{y}^{(k)}, \tilde{y}, -k)\right) + \frac{\lambda_{\mathrm{SCE}}}{2}\left(\mathcal{L}_{\mathrm{SCE}}(\tilde{y}, \mathrm{sg}(\tilde{y}^{(k)}), k) + \mathcal{L}_{\mathrm{SCE}}(\tilde{y}^{(k)}, \mathrm{sg}(\tilde{y}), -k)\right)$

The weights λ are updated during training using the gradients of the respective losses with respect to the weights w of the last layer of the network fθ, following, e.g., Esser et al. (2021) and MacGlashan et al. (2022).

Let S = {inv, equiv, SCE}. For s ∈ S, we first compute the following quantity:

14
$g_s = \frac{\lVert \nabla_w \mathcal{L}_s \rVert}{\sum_{s' \in S} \lVert \nabla_w \mathcal{L}_{s'} \rVert},$

where ∥·∥ stands for the L2 norm.

Indeed, as studied in MacGlashan et al. (2022), the norm of the gradient of each loss can be interpreted as its contribution to the total objective to optimize. To balance the contributions of each loss, we therefore weight each loss Ls with the sum of the contributions of the other losses, i.e., 1gs.

To prevent abrupt variations of λ_s, in practice we update it as an exponential moving average of the g_s, i.e.,

15
$\begin{cases} \lambda_s^{(0)} = 1 \\ \lambda_s^{(t+1)} = \eta \, \lambda_s^{(t)} + (1 - \eta)(1 - g_s). \end{cases}$
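The balancing scheme of Eqs. (14)–(15) can be sketched with mocked gradient norms (the values below are arbitrary, chosen only to show the dynamics): a loss whose gradient dominates is progressively down‑weighted.

```python
import numpy as np

def update(lambdas, grad_norms, eta=0.9):
    """One step of Eqs. (14)-(15): each weight moves, via an EMA, toward
    1 - g_s, where g_s is the relative gradient-norm share of loss s."""
    total = sum(grad_norms.values())
    for s in lambdas:
        g_s = grad_norms[s] / total                            # Eq. (14)
        lambdas[s] = eta * lambdas[s] + (1 - eta) * (1 - g_s)  # Eq. (15)
    return lambdas

# Mocked gradient norms: the invariance loss dominates (share 0.8), so its
# weight decays toward 1 - 0.8 = 0.2 while the others rise toward 0.9.
lambdas = {s: 1.0 for s in ("inv", "equiv", "SCE")}
for _ in range(100):
    update(lambdas, {"inv": 8.0, "equiv": 1.0, "SCE": 1.0})
assert lambdas["inv"] < lambdas["equiv"]
```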

3.3 Architecture

The architecture of our pitch estimator is inspired by Bittner et al. (2022) and Weiß and Peeters (2022). Each input VQT frame is processed independently through the following sequence: first, layer normalization (Ba et al., 2016) is applied, followed by a series of 1D convolutions along the log‑frequency dimension: three with skip‑connections (He et al., 2016) and four regular ones. The kernel size is set to 13B = 39 (for B = 3 bins per semitone) so that it spans more than one octave (Bittner et al., 2022). As in Weiß and Peeters (2022), we apply a leaky‑ReLU non‑linearity with a slope of 0.3 (Xu et al., 2015) and dropout with rate 0.2 (Srivastava et al., 2014) between each convolutional layer. Importantly, the kernel size and padding of each of these layers are chosen so that the frequency dimension is never reduced. A translation of k bins of a VQT frame, therefore, leads to a translation of k bins of the output pitch distribution. The output is then flattened, fed to a final fully‑connected layer, and normalized by a softmax layer to become a probability distribution of the desired shape.

Note that all layers (convolutions,7 element‑wise non‑linearities, layer‑norm, and softmax), except the final fully‑connected layer, preserve transpositions. To make the final fully‑connected layer also transposition‑equivariant, and in line with previous works on fully‑convolutional networks (Ardaillon and Roebel, 2019; Shelhamer et al., 2017), we propose to use Toeplitz fully‑connected layers. These consist of a standard linear layer without bias but whose weight matrix A ∈ R^{F×K} is a Toeplitz matrix, i.e., each of its diagonals is constant.

16
$A = \begin{pmatrix} a_0 & a_{-1} & a_{-2} & \cdots & a_{-K+2} & a_{-K+1} \\ a_1 & a_0 & a_{-1} & \cdots & & a_{-K+2} \\ a_2 & a_1 & \ddots & & & \vdots \\ \vdots & & & & & \\ a_{F-1} & & \cdots & & & a_{F-K} \end{pmatrix}$

In practice, this matrix A is equivalent to a 1D‑convolution with kernel (a_{−K+1}, …, a_{F−1}) and padding K − 1, and therefore preserves translations while having fewer parameters than an arbitrary fully‑connected layer.
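This equivalence can be checked numerically; the small sizes and random kernel below are illustrative only (multiplying by the flipped kernel in "valid" mode is how numpy expresses the correlation that the Toeplitz product computes).

```python
import numpy as np

# Numeric check that a Toeplitz fully-connected layer (Eq. (16)) is a 1D
# convolution.  F, K, and the kernel values are illustrative.
F, K = 8, 5
rng = np.random.default_rng(0)
kernel = rng.standard_normal(F + K - 1)   # entries a_{-K+1}, ..., a_{F-1}

# Build A in R^{F x K} with constant diagonals: A[i, j] = a_{i - j}.
A = np.empty((F, K))
for i in range(F):
    for j in range(K):
        A[i, j] = kernel[(K - 1) + (i - j)]

x = rng.standard_normal(F)
y_toeplitz = x @ A                        # output of the linear layer
# "Valid" convolution with the flipped kernel gives the same K outputs:
y_conv = np.convolve(x, kernel[::-1], mode="valid")
assert np.allclose(y_toeplitz, y_conv)
```

Parameter count drops accordingly: the layer stores F + K − 1 values instead of F·K.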

3.4 Re‑centering pitch distributions

Recall that our model learns to predict pitches from VQT frames up to an additive constant and that the bins of the predictions do not correspond one‑to‑one to VQT log‑frequencies: during training, the bin associated with a given pitch is completely arbitrary and only depends on the initialization of the model.

In practice, due to random weight initialization, we sometimes observed that the performance of our model drops significantly when a low pitch value is initially mapped to a high bin of the pitch distribution (or the opposite), because the whole pitch distribution is concentrated in a few bins. This is particularly likely when training on datasets that span a wide range of frequencies (such as MDB‑stem‑synth). While this phenomenon is far from the norm, it can limit the usability of our model in real‑world scenarios.

To prevent that, we set our output probability distribution to cover more than 10 octaves (K = 384 bins with a resolution of B = 3 bins per semitone), making collapse unlikely for most real‑world data. In addition, we explicitly force the median of all the predictions to remain roughly in the center of the pitch distribution during training. If it deviates too much, we apply a circular shift to the kernel (a_{−K+1}, …, a_{F−1}) of the final Toeplitz fully‑connected layer, which moves the predicted pitch distributions accordingly.8 This simple trick enables us to rectify bad initialization during training with minimal overhead.

3.5 Pitch decoding

3.5.1 Inferring pitch values from a pitch distribution

For a given input VQT frame, let y = (y_1, …, y_K) ∈ R^K be the output of PESTO (a pitch distribution in the log‑frequency domain). Relative pitch decoding is inferred by applying argmax‑local weighted averaging (Kim et al., 2018; Morrison et al., 2023) to y:

17
$p_{\mathrm{rel}}(y) = \frac{\sum_{i=a-2B}^{a+2B} i \, y_i}{\sum_{i=a-2B}^{a+2B} y_i},$

where a = arg max_i(y_i) and B = 3 is the number of bins per semitone in the VQT. Although, in this work, pitch decoding is only used for evaluation, we note that it can easily be made differentiable by using the expected value of the full pitch distribution (Engel et al., 2020a; Torres et al., 2024).
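The decoding of Eq. (17) can be sketched as follows (the edge clipping is our addition for when the argmax sits near the distribution's boundary): probability mass split across adjacent bins decodes to a fractional bin index, giving sub‑bin resolution.

```python
import numpy as np

def decode_relative_pitch(y, B=3):
    """Eq. (17): weighted average of the bins within +/- 2B bins of the
    argmax (clipped at the distribution's edges in this sketch)."""
    a = int(np.argmax(y))
    lo, hi = max(a - 2 * B, 0), min(a + 2 * B, len(y) - 1)
    idx = np.arange(lo, hi + 1)
    return float((idx * y[idx]).sum() / y[idx].sum())

# Mass split evenly across two adjacent bins decodes halfway between them:
y = np.zeros(384)
y[100], y[101] = 0.5, 0.5
p_rel = decode_relative_pitch(y)
assert abs(p_rel - 100.5) < 1e-9
```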

3.5.2 From relative to absolute pitch

Conversion from relative (p_rel) to absolute MIDI pitch (p̂) is performed by applying the affine mapping:

18
$\hat{p}(y) = \frac{1}{B}\left(p_{\mathrm{rel}}(y) + p_0\right),$

where p_0 is a fixed shift that only depends on the trained model. As in Gfeller et al. (2020), we calibrate this shift p_0 by relying on a set of synthetic data with known pitch. We synthesize harmonic signals {s_{f0=j}}, j = 60, …, 84, with five harmonics, fundamental frequency values ranging from MIDI pitch 60 to 84, and harmonic amplitudes and overall gain drawn from a uniform distribution. p_0 is computed as the median distance between the predicted pitches and the ground‑truth fundamental across all synthetic signals.
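The calibration step reduces to a robust median offset. In the sketch below, model predictions are mocked with an arbitrary constant offset plus noise (no real PESTO network is run), which is enough to show how p_0 is recovered and applied via Eq. (18).

```python
import numpy as np

# Mocked calibration, following Section 3.5.2: p0 is the median gap between
# the model's relative predictions (in bins) and the known pitches.
B = 3
midis = np.arange(60, 85)                    # MIDI pitches 60..84
rng = np.random.default_rng(0)
# Stand-in for model outputs: B * true pitch, shifted by an unknown constant
# (-42 bins here, arbitrary) plus small prediction noise.
p_rel = B * midis - 42.0 + rng.normal(0, 0.1, midis.size)

p0 = float(np.median(B * midis - p_rel))     # recovered offset, ~= 42 bins
p_hat = (p_rel + p0) / B                     # Eq. (18): absolute MIDI pitch
assert np.all(np.abs(p_hat - midis) < 0.3)
```

Using the median rather than the mean makes the calibration robust to occasional octave errors on individual synthetic tones.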

4 Real‑Time Pitch Estimation

Our model is very lightweight and achieves a processing speed significantly faster than real‑time, which makes it theoretically well‑suited for real‑time applications (see Section 6.5). However, practical real‑time performance requires considerations beyond raw compute speed: the model must also be causal and able to process individual audio streams.

In this section, we describe a streamlined approach for implementing real‑time inference, allowing the model to process audio streams with minimal latency.

4.1 Streamable VQT

By design, our model processes individual VQT frames independently, which makes it inherently suitable for real‑time applications. However, given an audio stream s = (s_1, s_2, …), computing a VQT frame centered around s_i requires samples from s_{i−w/2} to s_{i+w/2}, where w is the largest VQT window size. This non‑causal operation is incompatible with real‑time processing, as it requires future audio samples.

We build upon nnAudio (Cheuk et al., 2020), which implements audio time‑frequency transforms as PyTorch modules using convolutional layers.9 In nnAudio's time–domain implementation, the transform's basis kernels are precomputed and stored as frozen model parameters. While nnAudio computes the CQT by convolving the input audio with CQT kernels followed by normalization, we modify this approach by replacing the CQT kernels with VQT kernels. However, the convolution kernel lengths still exceed the typical buffer size of 5–20 ms needed for real‑time applications (see Figure 1). To address this, we replace standard convolutional layers with cached convolutions, allowing our model to support streamed inputs by storing previous audio chunks in an internal circular buffer (Caillon and Esling, 2022).
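The idea of a cached convolution can be sketched in plain numpy (our simplified stand‑in, not the Caillon and Esling implementation): an internal buffer keeps the tail of each chunk so that successive chunks of a stream reproduce exactly the offline convolution of the whole signal.

```python
import numpy as np

class CachedConv:
    """Streamed 1D convolution: caches the last len(kernel) - 1 samples so
    consecutive chunks match a single offline convolution of the stream."""
    def __init__(self, kernel):
        self.kernel = kernel
        self.cache = np.zeros(len(kernel) - 1)   # circular-buffer equivalent

    def __call__(self, chunk):
        x = np.concatenate([self.cache, chunk])
        self.cache = x[-(len(self.kernel) - 1):]  # keep tail for next call
        return np.convolve(x, self.kernel, mode="valid")

kernel = np.array([0.5, 0.3, 0.2])
stream = np.arange(20, dtype=float)
conv = CachedConv(kernel)
chunks = [conv(c) for c in np.split(stream, 4)]   # four 5-sample buffers
streamed = np.concatenate(chunks)

# Offline reference: same convolution over the zero-initialized stream.
offline = np.convolve(np.concatenate([np.zeros(2), stream]),
                      kernel, mode="valid")
assert np.allclose(streamed, offline)
```

Note the operation is strictly causal: each output sample depends only on the current and past buffers, never on future audio.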

4.2 Lag minimization

Our model predicts the pitch of the center of the VQT frames given as input. In addition to the compute time τ of a forward pass of our model (typically less than 10 ms), this implies a theoretical lag equal to half of the VQT's kernel size between when a buffer is acquired and when the pitch of a frame centered on this buffer is returned:

19
δ = w / (2·fs) + τ

where w represents the VQT largest kernel size and fs is the sampling rate (see Figure 4a). Given that w is typically proportional to fs (see Eq. (5)), the main limiting factor is the window size.

Figure 4

Illustration of the latency of our model, including how to mitigate it with buffer refilling. When a new buffer is consumed, the returned prediction is the pitch of the center of the Variable‑Q Transform (VQT) frame. Therefore, there is a delay of w/2 between when a buffer of audio is obtained and its actual pitch is estimated. Buffer refilling places the most recent buffer at the center of the processed VQT frame, thus improving the reactivity of the model.

For the CQT, the kernel size w is particularly large at low frequencies (e.g., w=131072 samples, or 2.7 seconds, at 48 kHz for an A0). By using the VQT, this issue is alleviated by reducing w drastically, with the added bonus of enhancing model performance (see Section 7.1). For instance, by selecting γ=7, we have a maximum kernel size of w=8192 samples at 48 kHz (171 ms). Even with such improvements, a lag of around 70 ms remains perceptible for most applications, such as real‑time resynthesis.

Increasing γ could further decrease w (see Figure 1), but, beyond a certain point, this also compromises model accuracy, as shown in Section 7.1.

To minimize this lag, we propose a buffer refilling technique that artificially creates windows centered closer to the current buffer. By shifting the input buffer to the center of the VQT frame and filling the right side of the window (i.e., the non‑causal region) by repeating samples from the end of the buffer, we are able to predict the pitch of the current audio chunk with minimal latency. For a causal audio stream s = (s1, …, st), the frame is computed from a modified chunk:

20
( s_{t−w+m(w−h)}, …, s_{t−1}, s_t  [past samples],  s_{t−m(w−h)}, …, s_t  [refilled samples] ) ∈ ℝ^w,

where h is the hop/buffer size and m ∈ [0, 1/2] is the refill factor, with m = 0 corresponding to no refilling at all (default behavior) and m = 1/2 corresponding to the maximum possible refilling.
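The frame construction of Eq. (20) can be sketched directly. This is an illustrative NumPy version only; the exact index conventions (and the integer rounding of m(w − h)) are my own.

```python
import numpy as np

def refill_frame(stream: np.ndarray, w: int, h: int, m: float) -> np.ndarray:
    """Build the w-sample analysis frame of Eq. (20) from a causal stream
    s_1..s_t. `h` is the buffer (hop) size and m in [0, 1/2] the refill
    factor."""
    r = int(m * (w - h))                 # length of the refilled region
    past = stream[-(w - r):]             # the w - r most recent samples
    # repeat the freshest r samples in their original order: reflecting them
    # would interfere destructively with the conjugate-symmetric VQT kernels
    refill = stream[-r:] if r > 0 else stream[:0]
    return np.concatenate([past, refill])
```

With m = 0 this degenerates to the ordinary trailing window of length w; with m = 1/2 the current buffer sits at the center of the frame.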

While it is common to reflect samples when repeating data as padding at analysis window boundaries, buffer refilling keeps the repeated samples in their original order. This is because reflection in the middle of the window would cause destructive interference between the imaginary components of the CQT kernel, which is conjugate symmetric under reversal.

As illustrated in Figure 4b, this technique enables us to construct windows centered on the most recent audio buffer, leading to:

21
δ = (1/fs) · (w/2 − m(w−h)) + τ,   m ∈ [0, 1/2].

In particular, the total lag can be reduced to h/(2·fs) + τ using maximum refilling (m = 1/2), making the system extremely reactive.
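Equations (19) and (21) are easy to check numerically. In the sketch below, w = 8192 and fs = 48 kHz come from the text (γ = 7 setting), while the 10 ms buffer (h = 480) and the τ = 5 ms compute time are assumptions for illustration.

```python
def lag(w: int, h: int, fs: int, m: float, tau: float) -> float:
    """Theoretical lag of Eq. (21); m = 0 recovers Eq. (19)."""
    return (w / 2 - m * (w - h)) / fs + tau

# Assumed setting: w = 8192 samples at fs = 48 kHz, h = 480 (10 ms buffers),
# and a hypothetical tau = 5 ms forward-pass time.
no_refill = lag(8192, 480, 48000, 0.0, 0.005)   # about 90 ms
max_refill = lag(8192, 480, 48000, 0.5, 0.005)  # h/(2*fs) + tau = 10 ms
```

Maximum refilling thus removes the dependency on the VQT window size entirely, leaving only the buffer duration and compute time.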

To further reduce overall latency, we also reduce the compute time τ of the model itself. We improve the original implementation of nnAudio's VQT by computing both real and imaginary kernels with a single convolution layer, and further reduce compute time with Just‑In‑Time (JIT) compilation.

4.3 Frequency-domain Toeplitz layer

Further, we propose a faster implementation of the final Toeplitz fully‑connected layer of our architecture (see Section 3.3). This layer, necessary for PESTO's translation equivariance, can be further optimized to reduce runtime complexity. A naïve implementation of the layer as a 1D convolution incurs a cost of O((m+n−1)·n). This can be reduced to O(mn) through optimizations such as implicit padding, as implemented in optimized backends such as oneDNN and cuDNN, or even by directly realizing the m×n Toeplitz matrix. However, given the size of both the activation vector and the implicit kernel, time complexity can be further reduced by performing the convolution as a multiplication in the frequency domain. This allows the Toeplitz layer to be applied in log‑linear time and eliminates redundancy in the model's forward pass.
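The frequency-domain trick is the classical FFT convolution: applying the implicit Toeplitz matrix is a linear convolution, which becomes a pointwise product after an FFT of sufficient length. The sketch below (NumPy for brevity; the function name is mine) shows the principle only; the actual layer additionally handles the bias and crops the output to the desired size.

```python
import numpy as np

def toeplitz_conv_fft(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Apply an implicit Toeplitz matrix (linear convolution with kernel k)
    in the frequency domain: O((m+n) log(m+n)) instead of O(mn)."""
    n_fft = len(x) + len(k) - 1          # long enough to avoid circular wrap-around
    X = np.fft.rfft(x, n_fft)
    K = np.fft.rfft(k, n_fft)
    return np.fft.irfft(X * K, n_fft)
```

Since the kernel is fixed at inference time, its spectrum K can be precomputed once, leaving a single FFT, a pointwise product, and an inverse FFT per forward pass.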

5 Experimental Setup

5.1 Implementation details

Unless specified otherwise, VQT computations are performed with fmin = 27.5 Hz, which is the frequency of the lowest key of the piano (A0), B = 3 bins per semitone, and at most F = 99·B log‑frequency bins, which corresponds to the maximal number of bins respecting the Nyquist frequency for a 16‑kHz signal. The maximal pitch‑shift between frames is kmax = 16 bins, i.e., slightly more than five semitones. We set the VQT parameter γ = 7 for our main experiments.

For our equivariance loss, we fix α = 2^(1/(12B)). The exponential moving average rate for the loss weighting is η = 0.999. White noise and random gain are applied as pitch‑preserving augmentations with a probability of 0.7 to the cropped VQT frames. Standard deviation values for the white noise are drawn from a uniform distribution between 0.1 and 2, and gain values are drawn between −6 and 3 dB. For training, we use a batch size of 256 and the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 10⁻⁴ and default parameters. The model is trained for 50 epochs using a cosine annealing learning rate scheduler. We refer to our code for precise configurations and hyperparameters. As our architecture is very lightweight, training requires only 845 MB of GPU memory and can be performed on a single GTX 1080Ti.

5.2 Datasets

Three datasets are considered for this study:

  1. MIR‑1K (Hsu and Jang, 2010) contains 1000 tracks (about 2 h) of amateur singing of Chinese pop songs, with separate vocal and background music tracks provided. The isolated vocals partition is used for most experiments, apart from the one described in Section 6.7.

  2. MDB‑stem‑synth (Salamon et al., 2017) contains 15.56 h of re‑synthesized monophonic music played by 25 different instruments. Perfect fundamental frequency annotations are available.

  3. PTDB (Pirker et al., 2011) contains 4718 speech recordings from English speakers, along with the corresponding laryngograph recordings, for a total of 9.6 h.

The datasets vary in size, pitch range, and granularity of the provided ground‑truth pitch annotations. Pitch annotations are provided with a hop size of 20 ms for MIR‑1K, 2.904 ms (128/44100) for MDB‑stem‑synth, and 10 ms for PTDB. We opt not to resample the annotations, as our model is not bound to a fixed hop size or sample rate. During training, we can take advantage of the original granularity of the annotations, while, at inference, we can choose the hop size that best fits the application. This is another advantage of using SSL with time–frequency frontends over supervised learning on raw waveforms: the sample rate and hop size can be changed on the fly at inference.

As noted in previous work (Morrison et al., 2023), a crucial problem in neural pitch estimation is overfitting to the pitch characteristics in the training data. Therefore, we conduct separate model training and evaluation on all datasets and examine generalization performance through cross‑evaluation. Another practical issue is the potential misalignment of pitch annotations, which can be difficult to detect and potentially worsen the performance of a model trained on these. For instance, on PTDB, pitch annotations are provided at time steps with a 5‑ms offset (e.g., 5, 15, or 25 ms for a hop size of 10 ms) compared to MIR‑1K/MDB‑stem‑synth, where annotations are aligned with multiples of the hop size (e.g., 0 ms, 10 ms, 20 ms). Indeed, we observed that not accounting for this offset negatively impacts the evaluation metrics. Since our model does not use annotations at training time, it is not affected by this issue, and potential misalignments can be accounted for without retraining.

5.3 Baselines

We compare our model to state‑of‑the‑art SSL and supervised neural pitch estimators:

  • CREPE (Kim et al., 2018) is a supervised model trained on MDB‑stem‑synth (Salamon et al., 2017) and MIR‑1K (Hsu and Jang, 2010), as well as Bach10 (Duan et al., 2010), RWC‑Synth (Mauch and Dixon, 2014), MedleyDB (Bittner et al., 2014), and NSynth (Engel et al., 2017). We use the default pretrained model provided in the official repository and deactivate Viterbi smoothing for fair comparisons.10

  • PENN (Morrison et al., 2023) is a supervised model which improves the FCNF0 architecture. We do not retrain PENN, but we reproduce the authors' results on MDB‑stem‑synth and PTDB with the publicly available FCNF0++ model,11 which was trained on the training partitions of both datasets.

  • SPICE (Gfeller et al., 2020) is an SSL model trained to minimize the pitch difference (up to the known shift) between two cropped CQT frames. SPICE additionally reconstructs the input frame using a decoder and employs a reconstruction regularization.

  • DDSP‑inv (Engel et al., 2020a) performs self‑supervised pitch estimation by using analysis‑by‑synthesis and estimating the parameters of a harmonics plus noise synthesizer, with pretraining on synthetic data.

5.4 Evaluation metrics

Our evaluation procedure is standard and common to previous works (Gfeller et al., 2020; Kim et al., 2018; Morrison et al., 2023). In practice, estimated and reference frequencies are converted to fractional semitones using the mapping:

22
hz2mid: f ↦ 12·log2(f/440) + 69

which maps the A4 (440 Hz) to 69 (which is the MIDI standard) and the other pitches accordingly.

For any voiced frame, the pitch error e between the estimated fundamental frequency f^ and the ground‑truth f is therefore:

23
e(f̂, f) = |hz2mid(f̂) − hz2mid(f)| = 12·|log2(f̂/f)|

which exactly corresponds to the (fractional) number of semitones between the prediction and ground‑truth pitch. From this error, we compute the standard metrics introduced by Poliner et al. (2007):

  • Raw pitch accuracy (RPA), which measures the proportion of voiced frames for which the pitch error is less than half a semitone (50 cents).

  • Raw chroma accuracy (RCA), which measures the proportion of voiced frames for which |e(f̂, f) mod 12| ≤ 0.5, thus accepting octave errors.

In addition, we also report the mean pitch error (MnE) and median pitch error (MdE) in semitones over all voiced frames.
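The conversion of Eq. (22) and the four metrics can be sketched compactly. This is an illustrative implementation of the standard definitions, with my own function names; the chroma error folds octaves by taking the distance to the nearest multiple of 12 semitones.

```python
import numpy as np

def hz2mid(f):
    """Eq. (22): map frequency in Hz to fractional MIDI pitch (A4 = 69)."""
    return 12 * np.log2(f / 440.0) + 69

def pitch_metrics(f_est: np.ndarray, f_ref: np.ndarray) -> dict:
    """RPA, RCA, MnE and MdE over voiced frames, following Eq. (23)."""
    e = np.abs(hz2mid(f_est) - hz2mid(f_ref))       # error in semitones
    # chroma error: distance to the nearest multiple of 12 (octave folding)
    e_chroma = np.minimum(e % 12, 12 - e % 12)
    return {
        "RPA": np.mean(e <= 0.5),        # within 50 cents
        "RCA": np.mean(e_chroma <= 0.5), # within 50 cents, octaves allowed
        "MnE": np.mean(e),
        "MdE": np.median(e),
    }
```

For example, a prediction one octave above the reference counts as correct for RCA but not for RPA, which is exactly the behaviour the two metrics are designed to separate.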

6 Results

6.1 Comparison with self‑supervised methods

PESTO significantly advances the state‑of‑the‑art in self‑supervised pitch estimation, achieving 97.7% and 97.0% RPA on MIR‑1K and MDB‑stem‑synth respectively, outperforming previous self‑supervised approaches SPICE and DDSP‑inv by a large margin (Table 1).

Table 1

Performances of our model compared to previous self‑supervised methods. For both baselines, we report the results provided in the paper.

Raw Pitch Accuracy

Model      # params   MIR-1K   MDB-stem-synth
SPICE      2.38M      90.6     89.1
DDSP-inv   21.6M      91.8     88.5
PESTO      130k       97.7     97.0

6.2 Comparison with supervised methods

Table 2 shows experimental results comparing PESTO with supervised methods. Despite having only 130k parameters—170 times fewer than CREPE's 22.2 M—our model exceeds its performance on MIR‑1K (97.7% vs. 97.5% RPA) and PTDB (89.7% vs. 87.1% RPA), while being comparable on MDB (97% vs. 97.3%). While PENN achieves superior results when trained on matched domains (99.6% RPA on MDB, 95.1% RPA on PTDB), PESTO maintains competitive performance without requiring supervision.

Table 2

Performance comparison across different datasets and training configurations. PESTO is self‑supervised, while CREPE and PENN are state‑of‑the‑art supervised models. Colors indicate evaluation scenarios: same‑dataset, multi‑dataset, and cross‑dataset evaluation. For raw pitch accuracy (RPA) and raw chroma accuracy (RCA), higher is better, while, for mean pitch error (MnE) and median pitch error (MdE) lower is better. Best results for each metric/test dataset are in bold.

                                   MIR-1K                    MDB-stem-synth            PTDB
Model   # params  Training data    RPA   RCA   MnE   MdE     RPA   RCA   MnE   MdE     RPA   RCA   MnE   MdE
CREPE   22.2M     Many             97.5  98.0  0.23  0.07    97.3  97.4  0.13  0.02    87.1  89.9  1.24  0.11
PENN    8.9M      MDB-ss           –     –     –     –       99.7  –     –     –       63.2  –     –     –
                  PTDB             –     –     –     –       51.6  –     –     –       94.4  –     –     –
                  Both             90.6  92.4  1.27  0.11    99.6  99.6  0.05  0.03    95.1  96.0  0.30  0.06
PESTO   130k      MIR-1K           97.7  98.0  0.21  0.12    94.8  95.9  0.43  0.09    87.7  90.3  0.84  0.13
                  MDB-ss           94.6  96.1  0.57  0.14    97.0  97.1  0.18  0.08    88.3  89.9  0.62  0.13
                  PTDB             95.6  96.9  0.63  0.13    96.3  96.6  0.21  0.09    89.7  91.2  0.56  0.13
                  All              95.6  96.7  0.49  0.14    97.1  97.3  0.17  0.08    88.5  90.0  0.60  0.13

[i] Results taken from the original paper (Morrison et al., 2023)

6.3 Cross‑dataset evaluation

Table 2 shows detailed results for cross‑dataset evaluation.

Music → Speech: PESTO models trained on music data outperform CREPE and PENN (when trained on music) on speech datasets. PENN, when trained only on MDB, achieves 63.2% RPA on PTDB, while PESTO achieves 88.3% RPA. CREPE, trained on many more music datasets, achieves 87.1% RPA on PTDB.

Speech → Music: When trained on PTDB (speech), PESTO still achieves 96.3% RPA on MDB (music), while PENN's performance drops to 51.6% RPA.

Music/Speech → Singing voice: PESTO consistently outperforms PENN on MIR‑1K across training conditions. When trained on both MDB and PTDB, PENN achieves 90.6% RPA, while PESTO achieves 94.6% RPA or higher in all cross‑evaluations for MIR‑1K.

Overall, our results corroborate the findings that supervised models are heavily influenced by the training pitch distribution (Morrison et al., 2023) and show that PESTO is more robust to out‑of‑distribution data. Although training and evaluating PESTO on the same dataset yields the best results, the cross‑domain performance gap remains considerably smaller than for the supervised baselines. Notably, the MdE metrics remain stable across training sets, suggesting PESTO captures fundamental pitch characteristics, while dataset‑specific training primarily improves robustness to challenging frames.

6.4 Multi‑dataset training

When evaluating on a given dataset, we observe that models trained solely on that dataset generally have better performance than models trained on combined datasets. MDB is a notable exception, where comprehensive training yields the best results, though the difference is marginal (97.1% RPA instead of 97.0%). This effect may be attributed to MDB's larger size dominating the combined dataset distribution.

On the contrary, PENN performs better (on PTDB) or equally (on MDB) when trained on multiple datasets, and it clearly outperforms PESTO when the training and test distributions overlap.

6.5 Inference speed

We evaluated the real‑time capabilities of our PESTO model, as well as PENN (Morrison et al., 2023) and YIN (de Cheveigné and Kawahara, 2002) as baselines, without Viterbi decoding. The real‑time factor (RTF), defined as the ratio of processing time to audio input duration, was measured across CPU (Intel Core i9‑12900H, 2.9 GHz) and GPU (NVIDIA RTX A2000, 8 GB) platforms. Using 5‑s audio segments at 16 kHz with a 10‑ms hop size, we performed 100 measurements for each configuration. For PENN GPU inference, we used a batch size of 2048 frames using the publicly available API. All reported timings in Table 3 include the inference pipeline from loaded audio input to pitch estimation (or pitch candidates for YIN) output.
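The RTF measurement can be sketched as a small harness. This is an assumed setup mirroring the description above, not the actual benchmark code; `infer` stands in for any pitch estimator's forward pass.

```python
import time
import numpy as np

def real_time_factor(infer, audio: np.ndarray, fs: int, n_runs: int = 100) -> float:
    """Real-time factor: median processing time divided by input duration.
    RTF < 1 means the estimator keeps up with the incoming stream."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer(audio)
        times.append(time.perf_counter() - start)
    return float(np.median(times)) / (len(audio) / fs)
```

Reporting the median over repeated runs, as done here, reduces the influence of scheduler jitter and one-off warm-up costs (e.g., JIT compilation on the first call).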

Table 3

Real‑time factors (lower is better) for pitch‑detection models (RTF < 1 indicates real‑time capable).

Real-time Factor

Model   CPU      GPU
PESTO   0.0354   0.0032
PENN    0.1706   0.0096
YIN     0.0568   –

PESTO outperforms all other models in RTF by a significant margin. PENN achieves real‑time performance (in domain) but requires roughly five times more processing time than PESTO on CPU and three times more on GPU, while using parallel computation of frames with a batch size of 2048. Finally, PESTO is approximately 60% faster than the YIN baseline on CPU and 18 times faster when run on a GPU.

6.6 Effect of buffer refilling

To determine the effect of the lossy buffer refilling technique (Section 4.2) on PESTO's performance, we retrained the model for refill factors m{0.1,0.2,,0.5} on the MIR‑1k dataset. RPA and RCA metrics are reported on the test dataset in Table 4. We observe a minimal degradation in performance, with an RPA of 97.4% at the maximum possible refill factor (m=0.5), compared to the standard PESTO model, which achieves 97.7% (m=0.0). This suggests that PESTO is relatively robust to the removal of future context via buffer refilling and is thus suitable for deployment in real‑time and latency‑critical scenarios.

Table 4

Effect of buffer refilling on PESTO's performance on the MIR‑1k dataset. The model is trained and tested with VQT inputs modified according to the procedure detailed in Section 4.2. A refill factor m=0.0 is equivalent to the regular PESTO model, while m=0.5 indicates the maximum possible refilling.

                   Refill factor (m)
Method   Metric   0.0    0.1    0.2    0.3    0.4    0.5
Refill   RPA      97.7   97.4   97.5   97.5   97.4   97.4
         RCA      98.0   97.8   97.9   97.9   97.7   97.7
Zero     RPA      97.7   97.0   97.5   97.5   97.4   95.4
         RCA      98.0   97.4   97.9   97.9   97.8   95.7

For comparison, we also try replacing the refilled portion of the analysis window with zeros. For m ≤ 0.4, this appears to perform equivalently to buffer refilling, but, at the maximum refill factor of m = 0.5, we find that the zero‑filling technique suffers degraded performance.

6.7 Robustness to background music

In real‑world scenarios, we often do not have access to clean signals. In this section, we evaluate to what extent the predictions of our model are robust to the presence of background music in the signals. To do so, we use the MIR‑1k dataset, for which we have access to the separated vocals and background music, which allows testing various signal‑to‑noise (here, vocals‑to‑background) ratios.

We report the results in Table 5. As expected, the RPA of PESTO when trained only on clean vocals (row β = 0) drops considerably as background music is added: from 97.2% on clean signals down to 46.8%.

Table 5

Robustness of PESTO and other baselines to background music on the MIR‑1K dataset, with various signal‑to‑noise ratios.

Raw Pitch Accuracy

Model             clean   20 dB   10 dB   0 dB
PESTO
  β = 0           97.2    93.7    81.6    46.8
  β = 1/2         98.1    97.9    95.8    79.7
  β = 1           97.1    96.7    94.0    78.9
  β ~ N(0, 1/2)   98.0    97.6    94.9    79.3
  β ~ N(0, 1)     98.1    97.8    95.6    82.5
  β ~ U(0, 1/2)   98.3    98.0    95.9    79.2
  β ~ U(0, 1)     98.0    97.6    95.2    80.8
SPICE             91.4    91.2    90.0    81.6
CREPE             97.5    97.1    95.3    85.8
PENN              90.6    81.0    51.1    20.9

[i] Results taken from the original paper (Gfeller et al., 2020)

To improve the robustness to background music, we propose to modify the training pipeline slightly. Instead of the data augmentations described in Section 5.1, we create the augmented view x~ of our original vocals signal xvocals by mixing it (in the complex‑CQT domain) with its corresponding background track xbackground:

24
x̃ = (1 / (1 + |β|)) · (x_vocals + |β| · x_background),

with β ∈ ℝ.12 The model is thus trained to ignore the background music when making its predictions.
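Eq. (24) amounts to a normalized mix of the two stems. The sketch below illustrates it on plain arrays for clarity (the paper applies it to complex‑CQT frames); the function name and the sampling snippet are my own.

```python
import numpy as np

def mix_background(x_vocals: np.ndarray, x_background: np.ndarray, beta: float) -> np.ndarray:
    """Eq. (24): mix vocals with their background track at level |beta|,
    renormalized by 1 + |beta| so the overall energy stays comparable."""
    return (x_vocals + abs(beta) * x_background) / (1 + abs(beta))

# One of the sampling strategies from Table 5, beta ~ U(0, 1/2):
rng = np.random.default_rng(0)
x_aug = mix_background(np.ones(4), np.zeros(4), rng.uniform(0.0, 0.5))
```

Note that only |β| enters the mix, so symmetric distributions such as N(0, 1) effectively sample background levels from a folded (half‑normal) distribution.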

We investigate different strategies for sampling β, whose respective results are reported in Table 5.

We observe that using background music as data augmentation consistently improves performance regardless of the sampling strategy. Notably, sampling β from a distribution instead of using a fixed value yields better results. This improvement arises because the model learns to handle varying signal‑to‑noise ratios and because the background level differs between the input x̃ and its pitch‑shifted counterpart x̃(k). Consequently, not only does the invariance loss Linv help the model ignore the background, but the other losses also contribute. Sampling β from the uniform distribution U(0, 1/2) generally leads to the best performance, except when vocals and background are equally mixed (SNR = 0 dB), where β ~ N(0, 1) appears optimal, achieving 82.5% RPA. Interestingly, incorporating background music during training not only enhances performance on noisy signals but also improves results on clean ones. Our best model achieves 98.3% RPA on the clean test set of MIR‑1K, surpassing previous data‑augmentation strategies (see Table 2). This demonstrates that, when background music is available, it can significantly benefit single‑pitch estimators.

Finally, PESTO outperforms both supervised (CREPE, PENN) and self‑supervised baselines (SPICE), apart from SNR = 0 dB, where CREPE performs better. However, the drop in performance between SNR = 10 dB and SNR = 0 dB is larger for PESTO (approximately 15%) than for SPICE or CREPE (about 10%), which suggests there is still room for improvement.

7 Ablations

7.1 Impact of the input frontend

We vary the γ parameter of the VQT between 0 (CQT) and 15 and train models for each dataset. Figure 5 shows the RPA and RCA results for cross‑evaluation for different values of γ. Each reported value is the mean of the top three scores achieved out of five runs with different random seeds, for a given training dataset and γ value.

Figure 5

Comparison of pitch accuracy metrics across different datasets as a function of the Variable‑Q Transform parameter γ. Each subplot shows test performance on a specific dataset (MDB, MIR‑1K, or PTDB), with line colors and markers indicating the training dataset. Solid lines represent raw pitch accuracy, while dashed lines represent raw chroma accuracy. The points indicate the mean of the top three scores out of five runs with different random seeds.

Our experiments demonstrate that the VQT as an input to the model significantly enhances pitch‑estimation accuracy compared to the CQT. As it uses smaller kernel sizes at lower frequencies (γ>0) than the CQT, the VQT achieves superior time–frequency alignment with the ground‑truth pitch contours. As an example, we illustrate in Figure 6 a comparison of spectrogram frames for the CQT (γ=0) and the VQT (γ=7). The VQT produces sharper temporal boundaries and more precise frequency localization than the CQT.

Figure 6

Comparison between CQT (γ=0, top) and Variable‑Q Transform (γ=7, bottom) spectrograms for one example from MIR‑1K (left) and PTDB‑TUG (right) datasets. Highlighted rectangles indicate time/frequency misalignments between the spectrograms and the ground‑truth pitch contour (black line), particularly noticeable for the CQT at lower frequencies.

While increasing γ improves performance across all datasets, this benefit plateaus and eventually deteriorates when frequency resolution is compromised. This trade‑off is particularly pronounced in the speech dataset, which has rapid phonetic transitions compared to musical signals yet still requires sufficient spectral detail for accurate pitch tracking. Interestingly, the optimal γ value is not the same for all datasets, highlighting the fact that there is a domain‑dependent tradeoff between low‑frequency resolution and time resolution. Values in the range [7, 10], however, provide the best results with relatively low variation for all datasets.

In theory, under supervised training, one could leverage the temporal context to compensate for misalignments or learn the optimal time–frequency representation for pitch estimation. Our results suggest, however, that a small adjustment to the CQT already provides a more flexible and efficient solution that generalizes across diverse acoustic domains while allowing for SSL training without the need for labeled data. Furthermore, as Section 4 details, the use of the VQT significantly improves the computation speed of the model, which is critical for real‑time applications.

7.2 Design choices

In this section, we examine the impact of various design choices on the performance of our model. The results for training on MIR‑1K are shown in Table 6a. First, we observe that the final Toeplitz fully‑connected layer, which ensures the architecture is transposition‑preserving, is essential for the model to learn effectively. In contrast, data augmentations are not necessary for training but usually enhance performance.

Table 6

Respective contribution of various design choices of our model (losses, data augmentations, Toeplitz layer) to its performance.

(a) Training on MIR‑1k

                           MIR-1k        MDB-ss        PTDB
Lequiv   Linv   LSCE       RPA   RCA     RPA   RCA     RPA   RCA
  ✗       ✗      ✓         1.6   6.1     0.8   11.5    0.0   11.5
  ✗       ✓      ✗         88.2  88.5    78.7  78.9    76.0  77.4
  ✓       ✗      ✗         97.3  97.6    87.2  92.2    82.2  84.1
  ✗       ✓      ✓         97.2  97.8    95.8  96.5    88.3  90.0
  ✓       ✗      ✓         97.2  97.5    90.4  91.6    82.4  84.3
  ✓       ✓      ✗         97.1  97.5    86.9  93.6    85.9  87.7
  ✓       ✓      ✓         97.7  98.0    94.8  95.9    87.7  90.3
No data augmentations      97.2  97.6    93.8  94.6    88.0  90.1
No Toeplitz layer          1.3   7.1     1.5   5.7     0.0   12.5

(b) Training on MDB‑stem‑synth

          MIR-1k        MDB-ss        PTDB
Lequiv    RPA   RCA     RPA   RCA     RPA   RCA
  ✗       95.4  96.6    97.0  97.1    88.6  90.3
  ✓       94.6  96.1    97.0  97.1    88.3  89.9

(c) Training on PTDB

          MIR-1k        MDB-ss        PTDB
Lequiv    RPA   RCA     RPA   RCA     RPA   RCA
  ✗       95.1  96.9    95.6  96.0    89.8  91.3
  ✓       95.6  96.9    96.3  96.6    89.7  91.2

The influence of the different loss functions is more complex to interpret. When using only LSCE, the model does not learn at all, whereas Lequiv alone yields good in‑domain performance. However, combining Lequiv with the other two losses helps the model generalize to out‑of‑domain distributions. Interestingly, Linv alone produces decent performance, indicating that learning to be insensitive to white noise and gain changes enables the model to focus on the fundamental frequency of the signal, though the accuracy is lower than when Lequiv is included.

Surprisingly, discarding Lequiv and using only the other two losses achieves the best out‑of‑distribution performance. This counterintuitive result prompted further evaluation of Lequiv when training on MDB‑stem‑synth (Table 6b) and PTDB (Table 6c). These experiments, however, did not reveal a consistent pattern: Lequiv appears beneficial for training on PTDB but slightly detrimental for MDB‑stem‑synth.

This suggests that the design of our loss functions and the weighting strategy (based on gradient norms; see Section 3.2.4) may benefit from further refinement.

8 Conclusion

In this paper, we introduced a novel self‑supervised method for pitch estimation. Through evaluation on several music and speech datasets, we demonstrated that our method significantly outperforms previous self‑supervised baselines and achieves performance on par with supervised approaches, despite having only 130 k parameters.

Our method exhibits superior generalization capabilities compared to baselines, particularly in scenarios where there is a shift between the training and test distributions. Additionally, we proposed a simple solution to enable the model to process audio streams with arbitrary sampling rates or buffer sizes, without requiring retraining. With its very low latency of less than 5 ms, this makes our approach highly suitable for real‑time applications.

The self‑supervised paradigm we propose is free from specific musical assumptions, making it applicable to tasks with very limited annotated data. This enhances its relevance for non‑Western music information retrieval (Han et al., 2023; Li et al., 2022), as well as for broader applications for which collecting annotated data can be impractical, such as bioacoustics (Best et al., 2025; Hagiwara et al., 2024) and physics modeling (Bagad et al., 2024).

The versatility of our method makes it suitable for a wide range of real‑world applications. To encourage its adoption and further experimentation, we released our code and pretrained models as a pip‑installable package, along with the full training pipeline.

Furthermore, exploiting equivariance for solving classification problems is a promising direction for future research, as it allows models to directly return probability distributions using only a single (potentially synthetic) labeled element. This has potential applications beyond pitch estimation, such as tempo estimation (Gagneré et al., 2024) or key detection (Kong et al., 2024).

Finally, while our model focuses on monophonic pitch estimation, the training objective does not constrain it to a single prediction. This differs from previous self‑supervised pitch‑estimation methods that frame the task as a regression problem (Engel et al., 2020a; Gfeller et al., 2020). In particular, it paves the way towards self‑supervised multi‑pitch estimation.

Acknowledgments

This work has been funded by the ANRT CIFRE convention n°2021/1537 and Sony France.

B. Torres and G. Richard are supported by the European Union (ERC, HI‑Audio, 101052978). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.

We first thank the anonymous reviewers for their meticulous and constructive feedback, which greatly helped us improve the quality of this paper. Moreover, we would like to thank Alexis André for motivating the need for a real real‑time implementation of PESTO and for pushing the boundaries of its artistic applications with his crazy generated visuals. We also thank Cyran Aouameur and Yanis Amedjkane for implementing a PESTO plugin in Juce and for their insightful suggestions, as well as DeLaurentis for integrating PESTO on stage within her live performances. Finally, we thank Emmanuel Deruty for very inspiring discussions about the nature of pitch.

Competing Interests

The authors have no competing interests to declare.

Notes

[3] e.g. Yousician, Yuru

[4] See, for example, Mawf, DDSP-VST, and Neutone.

[7] As, for example, in the DDSP-VST plugin.

[9] Convolutions roughly preserve translations since the kernels are applied locally, meaning that, if two translated inputs are convolved by the same kernel, then the output results will be almost translations of each other as well.

[10] Precisely, we force the median of the pitch distribution to lie between bins 144 and 240 at the end of each epoch.

[14] x~(k) is obtained from x(k) similarly; however, the value β is not necessarily identical.

DOI: https://doi.org/10.5334/tismir.251 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jan 9, 2025
Accepted on: Aug 1, 2025
Published on: Sep 9, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Alain Riou, Bernardo Torres, Ben Hayes, Stefan Lattner, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.