
Figure 1
Variable‑Q Transform (VQT) analysis window size as a function of the bandwidth parameter γ and the analysis center frequency. During the computation of the VQT, all kernels are padded to the nearest power of two above the length of the largest analysis window. The Constant‑Q Transform corresponds to the case γ = 0. Both axes are log‑scaled.
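The window sizes shown in this figure follow from the usual variable‑Q bandwidth model. Below is a hedged sketch of that relation; the bandwidth constants, sampling rate, and bin spacing are illustrative assumptions, not the paper's actual settings:

```python
# A minimal sketch assuming the common VQT bandwidth model
# B_k = alpha * f_k + gamma (gamma = 0 recovers the CQT);
# the paper's exact constants may differ.

def vqt_lengths(freqs, sr=22050, bins_per_octave=12, gamma=0.0):
    """Analysis window length (in samples) for each center frequency."""
    alpha = 2 ** (1 / bins_per_octave) - 1   # relative bandwidth per bin
    q = 1.0 / alpha                          # quality factor when gamma = 0
    return [q * sr / (f + gamma / alpha) for f in freqs]

def next_pow2(n):
    """Smallest power of two >= n: the common FFT size all kernels are padded to."""
    return 1 << (int(n) - 1).bit_length()

freqs = [32.70 * 2 ** (k / 12) for k in range(84)]   # C1 upward, 7 octaves
lengths = vqt_lengths(freqs)                          # CQT case: gamma = 0
fft_size = next_pow2(max(lengths))
```

With `gamma = 0` the window length is inversely proportional to the center frequency (constant Q), which appears as a straight line on log‑scaled axes; increasing `gamma` mainly shortens the low‑frequency windows.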

Figure 2
Illustration of the pitch‑shift process in the Variable‑Q Transform (VQT) domain. From a given frame, we construct two views by cropping equally sized sub‑frames from it, offset from one another by a fixed number of bins. Since the frequency scale is logarithmic in the VQT domain, this translation corresponds to an approximate pitch shift by that number of bins.
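A minimal sketch of this two‑view cropping; the function name, bin counts, and shift range are illustrative, not the paper's hyperparameters:

```python
import random

def crop_views(frame, crop_size, max_shift=12):
    """Crop two equally sized sub-frames whose starts differ by k bins.

    On a log-frequency axis, translating a crop by k bins acts as a
    relative pitch shift of k bins between the two views.
    """
    k = random.randint(-max_shift, max_shift)
    start = random.randint(max(0, -k), len(frame) - crop_size - max(0, k))
    return frame[start:start + crop_size], frame[start + k:start + k + crop_size], k

frame = list(range(300))                    # stand-in for a 300-bin VQT frame
view_a, view_b, shift = crop_views(frame, crop_size=256)
```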

Figure 3
Overview of the PESTO model. Given a 1D Variable‑Q Transform frame (displayed horizontally, with the horizontal axis corresponding to frequency), we first crop it as described in Section 3.1.3 to create a pair of pitch‑shifted views. We then apply random pitch‑preserving transforms to each view. The neural network predicts a pitch distribution from each view and is trained by minimizing both an invariance loss between the predictions for a view and its transformed counterpart, and an equivariance loss between the predictions for the two pitch‑shifted views.
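The two objectives can be sketched as follows, assuming the network outputs discrete pitch distributions over the same logarithmic bins as the input; the paper's actual loss functions may differ in form (a squared error is used here purely for illustration):

```python
def invariance_loss(p, p_t):
    """Predictions for a view and its pitch-preserving transform should
    match (here measured as a squared error between distributions)."""
    return sum((a - b) ** 2 for a, b in zip(p, p_t))

def equivariance_loss(p1, p2, k):
    """Predictions for the two crops should differ by exactly k bins:
    compare p2 against p1 translated by k (out-of-range bins set to 0)."""
    n = len(p1)
    shifted = [p1[i - k] if 0 <= i - k < n else 0.0 for i in range(n)]
    return sum((a - b) ** 2 for a, b in zip(shifted, p2))
```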


Figure 4
Illustration of the latency of our model and how to mitigate it with buffer refilling. When a new buffer is consumed, the returned prediction corresponds to the pitch at the center of the Variable‑Q Transform (VQT) frame. There is therefore a delay between when a buffer of audio is obtained and when its pitch is actually estimated. Buffer refilling places the most recent buffer at the center of the processed VQT frame, thus improving the reactivity of the model.
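Frame assembly with refilling can be sketched as follows; the names and framing are illustrative, with `refill_factor` playing the role of the refill factor studied in Table 4 (0 means no refilling, and at most half the frame is refilled):

```python
# Hedged sketch: the model is assumed to read a fixed-size window of samples,
# with the prediction attached to the window center. "Refill" recycles the
# newest samples into the "future" half; "zero" pads with silence instead.

def make_frame(history, frame_size, refill_factor, mode="refill"):
    """history: recent audio samples, newest last; refill_factor <= 0.5."""
    pad = int(refill_factor * frame_size)    # samples placed "after" now
    past = history[-(frame_size - pad):]     # real, most recent samples
    if pad == 0:
        future = []
    elif mode == "refill":
        future = past[-pad:]                 # recycle the newest samples
    else:
        future = [0.0] * pad                 # plain zero padding
    return past + future
```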
Table 1
Performance of our model compared with previous self‑supervised methods. For both baselines, we report the results from their original papers.
| | | Raw Pitch Accuracy | |
|---|---|---|---|
| Model | # params | MIR‑1K | MDB‑stem‑synth |
| SPICE | 2.38M | 90.6 | 89.1 |
| DDSP‑inv | 21.6M | 91.8 | 88.5 |
| PESTO | 130k | 97.7 | 97.0 |
Table 2
Performance comparison across different datasets and training configurations. PESTO is self‑supervised, while CREPE and PENN are state‑of‑the‑art supervised models. Colors indicate evaluation scenarios: same‑dataset, multi‑dataset, and cross‑dataset evaluation. For raw pitch accuracy (RPA) and raw chroma accuracy (RCA), higher is better, while for mean pitch error (MnE) and median pitch error (MdE), lower is better. Best results for each metric/test dataset are in bold.
| | | | MIR‑1K | | | | MDB‑stem‑synth | | | | PTDB | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | # params | Training data | RPA | RCA | MnE | MdE | RPA | RCA | MnE | MdE | RPA | RCA | MnE | MdE |
| CREPE | 22.2M | Many | 97.5 | 98.0 | 0.23 | 0.07 | 97.3 | 97.4 | 0.13 | 0.02 | 87.1 | 89.9 | >1.24 | >0.11 |
| PENN | 8.9M | MDB‑ss | – | – | – | – | 99.7† | – | – | – | 63.2† | – | – | – |
| | | PTDB | – | – | – | – | 51.6† | – | – | – | 94.4† | – | – | – |
| | | Both | 90.6 | 92.4 | 1.27 | 0.11 | 99.6 | 99.6 | 0.05 | 0.03 | 95.1 | 96.0 | 0.30 | 0.06 |
| PESTO | 130k | MIR‑1K | 97.7 | 98.0 | 0.21 | 0.12 | 94.8 | 95.9 | 0.43 | 0.09 | 87.7 | 90.3 | 0.84 | 0.13 |
| | | MDB‑ss | 94.6 | 96.1 | 0.57 | 0.14 | 97.0 | 97.1 | 0.18 | 0.08 | 88.3 | 89.9 | 0.62 | 0.13 |
| | | PTDB | 95.6 | 96.9 | 0.63 | 0.13 | 96.3 | 96.6 | 0.21 | 0.09 | 89.7 | 91.2 | 0.56 | 0.13 |
| | | All | 95.6 | 96.7 | 0.49 | 0.14 | 97.1 | 97.3 | 0.17 | 0.08 | 88.5 | 90.0 | 0.60 | 0.13 |
[i] † Results taken from the original paper (Morrison et al., 2023)
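The accuracy metrics above follow the standard definitions (cf. mir_eval): raw pitch accuracy is the fraction of voiced frames whose predicted pitch lies within 50 cents of the reference, and raw chroma accuracy additionally forgives octave errors. A self‑contained sketch:

```python
import math

def cents(f_est, f_ref):
    """Pitch difference in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_est / f_ref)

def rpa(est, ref, tol=50.0):
    """Raw pitch accuracy: % of frames within `tol` cents of the reference."""
    hits = sum(abs(cents(e, r)) <= tol for e, r in zip(est, ref))
    return 100.0 * hits / len(ref)

def rca(est, ref, tol=50.0):
    """Raw chroma accuracy: like RPA, but octave errors are forgiven."""
    def chroma_err(e, r):
        c = cents(e, r) % 1200.0          # fold the error onto one octave
        return min(c, 1200.0 - c)
    hits = sum(chroma_err(e, r) <= tol for e, r in zip(est, ref))
    return 100.0 * hits / len(ref)
```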
Table 3
Real‑time factors (lower is better) for pitch‑detection models; an RTF below 1 indicates that the model can run in real time.
| | Real‑time Factor | |
|---|---|---|
| Model | CPU | GPU |
| PESTO | 0.0354 | 0.0032 |
| PENN | 0.1706 | 0.0096 |
| YIN | 0.0568 | – |
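The real‑time factor is the ratio of processing time to audio duration, so values below 1 mean the model keeps up with the incoming stream. An illustrative measurement helper (not the benchmarking code used for the table):

```python
import time

def real_time_factor(process, audio, sr):
    """RTF = wall-clock processing time divided by audio duration."""
    start = time.perf_counter()
    process(audio)                        # run the model on the audio
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sr)    # seconds spent per second of audio
```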
Table 4
Effect of buffer refilling on PESTO's performance on the MIR‑1K dataset. The model is trained and tested with VQT inputs modified according to the procedure detailed in Section 4.2. A refill factor of 0 is equivalent to the regular PESTO model, while 0.5 indicates the maximum possible refilling.
| | | Refill factor | | | | | |
|---|---|---|---|---|---|---|---|
| Method | Metric | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
| Refill | RPA | 97.7 | 97.4 | 97.5 | 97.5 | 97.4 | 97.4 |
| | RCA | 98.0 | 97.8 | 97.9 | 97.9 | 97.7 | 97.7 |
| Zero | RPA | 97.7 | 97.0 | 97.5 | 97.5 | 97.4 | 95.4 |
| | RCA | 98.0 | 97.4 | 97.9 | 97.9 | 97.8 | 95.7 |
Table 5
Robustness of PESTO and other baselines to background music on the MIR‑1K dataset, with various signal‑to‑noise ratios.
| | Raw Pitch Accuracy | | | |
|---|---|---|---|---|
| Model | clean | | | |
| PESTO | | | | |
| | 97.2 | 93.7 | 81.6 | 46.8 |
| | 98.1 | 97.9 | 95.8 | 79.7 |
| | 97.1 | 96.7 | 94.0 | 78.9 |
| | 98.0 | 97.6 | 94.9 | 79.3 |
| | 98.1 | 97.8 | 95.6 | 82.5 |
| | 98.3 | 98.0 | 95.9 | 79.2 |
| | 98.0 | 97.6 | 95.2 | 80.8 |
| SPICE† | 91.4 | 91.2 | 90.0 | 81.6 |
| CREPE | 97.5 | 97.1 | 95.3 | 85.8 |
| PENN | 90.6 | 81.0 | 51.1 | 20.9 |
[i] †Results taken from the original paper (Gfeller et al., 2020)

Figure 5
Comparison of pitch accuracy metrics across different datasets as a function of the Variable‑Q Transform parameter γ. Each subplot shows test performance on a specific dataset (MDB, MIR‑1K, or PTDB), with line colors and markers indicating the training dataset. Solid lines represent raw pitch accuracy, while dashed lines represent raw chroma accuracy. The points indicate the mean of the top three scores out of five runs with different random seeds.

Figure 6
Comparison between CQT (γ = 0, top) and Variable‑Q Transform (γ > 0, bottom) spectrograms for one example from the MIR‑1K (left) and PTDB‑TUG (right) datasets. Highlighted rectangles indicate time/frequency misalignments between the spectrograms and the ground‑truth pitch contour (black line), particularly noticeable for the CQT at lower frequencies.
Table 6
Respective contributions of various design choices in our model (losses, data augmentations, Toeplitz layer) to its performance.
(a) Training on MIR‑1K
| | | | MIR‑1K | | MDB‑ss | | PTDB | |
|---|---|---|---|---|---|---|---|---|
| | | | RPA | RCA | RPA | RCA | RPA | RCA |
| ✗ | ✗ | ✓ | 1.6 | 6.1 | 0.8 | 11.5 | 0.0 | 11.5 |
| ✗ | ✓ | ✗ | 88.2 | 88.5 | 78.7 | 78.9 | 76.0 | 77.4 |
| ✓ | ✗ | ✗ | 97.3 | 97.6 | 87.2 | 92.2 | 82.2 | 84.1 |
| ✗ | ✓ | ✓ | 97.2 | 97.8 | 95.8 | 96.5 | 88.3 | 90.0 |
| ✓ | ✗ | ✓ | 97.2 | 97.5 | 90.4 | 91.6 | 82.4 | 84.3 |
| ✓ | ✓ | ✗ | 97.1 | 97.5 | 86.9 | 93.6 | 85.9 | 87.7 |
| ✓ | ✓ | ✓ | 97.7 | 98.0 | 94.8 | 95.9 | 87.7 | 90.3 |
| No data augmentations | | | 97.2 | 97.6 | 93.8 | 94.6 | 88.0 | 90.1 |
| No Toeplitz layer | | | 1.3 | 7.1 | 1.5 | 5.7 | 0.0 | 12.5 |
(b) Training on MDB‑stem‑synth
| | MIR‑1K | | MDB‑ss | | PTDB | |
|---|---|---|---|---|---|---|
| | RPA | RCA | RPA | RCA | RPA | RCA |
| ✗ | 95.4 | 96.6 | 97.0 | 97.1 | 88.6 | 90.3 |
| ✓ | 94.6 | 96.1 | 97.0 | 97.1 | 88.3 | 89.9 |
(c) Training on PTDB
| | MIR‑1K | | MDB‑ss | | PTDB | |
|---|---|---|---|---|---|---|
| | RPA | RCA | RPA | RCA | RPA | RCA |
| ✗ | 95.1 | 96.9 | 95.6 | 96.0 | 89.8 | 91.3 |
| ✓ | 95.6 | 96.9 | 96.3 | 96.6 | 89.7 | 91.2 |
