
Figure 1
Variable‑Q Transform (VQT) analysis window size as a function of the bandwidth parameter γ and the analysis center frequency. During the computation of the VQT, all kernels are padded to the nearest power of two above the length of the largest analysis window. The Constant‑Q Transform corresponds to the case γ = 0. Both axes are log‑scaled.
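The window sizes shown in this figure follow from the usual variable‑Q bandwidth model. Below is a hedged sketch of that relation; the bandwidth constants, sampling rate, and bin spacing are illustrative assumptions, not the paper's actual settings:

```python
# A minimal sketch assuming the common VQT bandwidth model
# B_k = alpha * f_k + gamma (gamma = 0 recovers the CQT);
# the paper's exact constants may differ.

def vqt_lengths(freqs, sr=22050, bins_per_octave=12, gamma=0.0):
    """Analysis window length (in samples) for each center frequency."""
    alpha = 2 ** (1 / bins_per_octave) - 1   # relative bandwidth per bin
    q = 1.0 / alpha                          # quality factor when gamma = 0
    return [q * sr / (f + gamma / alpha) for f in freqs]

def next_pow2(n):
    """Smallest power of two >= n: the common FFT size all kernels are padded to."""
    return 1 << (int(n) - 1).bit_length()

freqs = [32.70 * 2 ** (k / 12) for k in range(84)]   # C1 upward, 7 octaves
lengths = vqt_lengths(freqs)                          # CQT case: gamma = 0
fft_size = next_pow2(max(lengths))
```

With `gamma = 0` the window length is inversely proportional to the center frequency (constant Q), which appears as a straight line on log‑scaled axes; increasing `gamma` mainly shortens the low‑frequency windows.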

Figure 2
Illustration of the pitch‑shift process in the Variable‑Q Transform (VQT) domain. From a given frame, we construct two views by cropping equally sized sub‑frames from it, offset from one another by a fixed number of bins. Since the frequency scale is logarithmic in the VQT domain, this translation corresponds to an approximate pitch shift by that number of bins.
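A minimal sketch of this two‑view cropping; the function name, bin counts, and shift range are illustrative, not the paper's hyperparameters:

```python
import random

def crop_views(frame, crop_size, max_shift=12):
    """Crop two equally sized sub-frames whose starts differ by k bins.

    On a log-frequency axis, translating a crop by k bins acts as a
    relative pitch shift of k bins between the two views.
    """
    k = random.randint(-max_shift, max_shift)
    start = random.randint(max(0, -k), len(frame) - crop_size - max(0, k))
    return frame[start:start + crop_size], frame[start + k:start + k + crop_size], k

frame = list(range(300))                    # stand-in for a 300-bin VQT frame
view_a, view_b, shift = crop_views(frame, crop_size=256)
```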

Figure 3
Overview of the PESTO model. Given a 1D Variable‑Q Transform frame (displayed horizontally, with the horizontal axis corresponding to frequency), we first crop it as described in Section 3.1.3 to create a pair of pitch‑shifted views. We then apply random pitch‑preserving transforms to each view. The neural network predicts a pitch distribution from each view and is trained by minimizing both an invariance loss between the predictions for a view and its transformed counterpart, and an equivariance loss between the predictions for the two pitch‑shifted views.
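The two objectives can be sketched as follows, assuming the network outputs discrete pitch distributions over the same logarithmic bins as the input; the paper's actual loss functions may differ in form (a squared error is used here purely for illustration):

```python
def invariance_loss(p, p_t):
    """Predictions for a view and its pitch-preserving transform should
    match (here measured as a squared error between distributions)."""
    return sum((a - b) ** 2 for a, b in zip(p, p_t))

def equivariance_loss(p1, p2, k):
    """Predictions for the two crops should differ by exactly k bins:
    compare p2 against p1 translated by k (out-of-range bins set to 0)."""
    n = len(p1)
    shifted = [p1[i - k] if 0 <= i - k < n else 0.0 for i in range(n)]
    return sum((a - b) ** 2 for a, b in zip(shifted, p2))
```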


Figure 4
Illustration of the latency of our model and how to mitigate it with buffer refilling. When a new buffer is consumed, the returned prediction corresponds to the pitch at the center of the Variable‑Q Transform (VQT) frame. There is therefore a delay between when a buffer of audio is obtained and when its pitch is actually estimated. Buffer refilling places the most recent buffer at the center of the processed VQT frame, thus improving the reactivity of the model.
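Frame assembly with refilling can be sketched as follows; the names and framing are illustrative, with `refill_factor` playing the role of the refill factor studied in Table 4 (0 means no refilling, and at most half the frame is refilled):

```python
# Hedged sketch: the model is assumed to read a fixed-size window of samples,
# with the prediction attached to the window center. "Refill" recycles the
# newest samples into the "future" half; "zero" pads with silence instead.

def make_frame(history, frame_size, refill_factor, mode="refill"):
    """history: recent audio samples, newest last; refill_factor <= 0.5."""
    pad = int(refill_factor * frame_size)    # samples placed "after" now
    past = history[-(frame_size - pad):]     # real, most recent samples
    if pad == 0:
        future = []
    elif mode == "refill":
        future = past[-pad:]                 # recycle the newest samples
    else:
        future = [0.0] * pad                 # plain zero padding
    return past + future
```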
Table 1
Performance of our model compared with previous self‑supervised methods. For both baselines, we report the results from their original papers.
| | | Raw Pitch Accuracy | |
|---|---|---|---|
| Model | # params | MIR‑1K | MDB‑stem‑synth |
| SPICE | 2.38M | 90.6 | 89.1 |
| DDSP‑inv | 21.6M | 91.8 | 88.5 |
| PESTO | 130k | 97.7 | 97.0 |
Table 2
Performance comparison across different datasets and training configurations. PESTO is self‑supervised, while CREPE and PENN are state‑of‑the‑art supervised models. Colors indicate evaluation scenarios: same‑dataset, multi‑dataset, and cross‑dataset evaluation. For raw pitch accuracy (RPA) and raw chroma accuracy (RCA), higher is better, while for mean pitch error (MnE) and median pitch error (MdE), lower is better. Best results for each metric/test dataset are in bold.
| | | | MIR‑1K | | | | MDB‑stem‑synth | | | | PTDB | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | # params | Training data | RPA | RCA | MnE | MdE | RPA | RCA | MnE | MdE | RPA | RCA | MnE | MdE |
| CREPE | 22.2M | Many | 97.5 | 98.0 | 0.23 | 0.07 | 97.3 | 97.4 | 0.13 | 0.02 | 87.1 | 89.9 | >1.24 | >0.11 |
| PENN | 8.9M | MDB‑ss | – | – | – | – | 99.7† | – | – | – | 63.2† | – | – | – |
| | | PTDB | – | – | – | – | 51.6† | – | – | – | 94.4† | – | – | – |
| | | Both | 90.6 | 92.4 | 1.27 | 0.11 | 99.6 | 99.6 | 0.05 | 0.03 | 95.1 | 96.0 | 0.30 | 0.06 |
| PESTO | 130k | MIR‑1K | 97.7 | 98.0 | 0.21 | 0.12 | 94.8 | 95.9 | 0.43 | 0.09 | 87.7 | 90.3 | 0.84 | 0.13 |
| | | MDB‑ss | 94.6 | 96.1 | 0.57 | 0.14 | 97.0 | 97.1 | 0.18 | 0.08 | 88.3 | 89.9 | 0.62 | 0.13 |
| | | PTDB | 95.6 | 96.9 | 0.63 | 0.13 | 96.3 | 96.6 | 0.21 | 0.09 | 89.7 | 91.2 | 0.56 | 0.13 |
| | | All | 95.6 | 96.7 | 0.49 | 0.14 | 97.1 | 97.3 | 0.17 | 0.08 | 88.5 | 90.0 | 0.60 | 0.13 |
[i] † Results taken from the original paper (Morrison et al., 2023)
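The accuracy metrics above follow the standard definitions (cf. mir_eval): raw pitch accuracy is the fraction of voiced frames whose predicted pitch lies within 50 cents of the reference, and raw chroma accuracy additionally forgives octave errors. A self‑contained sketch:

```python
import math

def cents(f_est, f_ref):
    """Pitch difference in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_est / f_ref)

def rpa(est, ref, tol=50.0):
    """Raw pitch accuracy: % of frames within `tol` cents of the reference."""
    hits = sum(abs(cents(e, r)) <= tol for e, r in zip(est, ref))
    return 100.0 * hits / len(ref)

def rca(est, ref, tol=50.0):
    """Raw chroma accuracy: like RPA, but octave errors are forgiven."""
    def chroma_err(e, r):
        c = cents(e, r) % 1200.0          # fold the error onto one octave
        return min(c, 1200.0 - c)
    hits = sum(chroma_err(e, r) <= tol for e, r in zip(est, ref))
    return 100.0 * hits / len(ref)
```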
Table 3
Real‑time factors (lower is better) for pitch‑detection models; an RTF below 1 indicates that the model can run in real time.
| | Real‑time Factor | |
|---|---|---|
| Model | CPU | GPU |
| PESTO | 0.0354 | 0.0032 |
| PENN | 0.1706 | 0.0096 |
| YIN | 0.0568 | – |
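The real‑time factor is the ratio of processing time to audio duration, so values below 1 mean the model keeps up with the incoming stream. An illustrative measurement helper (not the benchmarking code used for the table):

```python
import time

def real_time_factor(process, audio, sr):
    """RTF = wall-clock processing time divided by audio duration."""
    start = time.perf_counter()
    process(audio)                        # run the model on the audio
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sr)    # seconds spent per second of audio
```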
Table 4
Effect of buffer refilling on PESTO's performance on the MIR‑1K dataset. The model is trained and tested with VQT inputs modified according to the procedure detailed in Section 4.2. A refill factor of 0 is equivalent to the regular PESTO model, while 0.5 indicates the maximum possible refilling.
| | | Refill factor | | | | | |
|---|---|---|---|---|---|---|---|
| Method | Metric | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
| Refill | RPA | 97.7 | 97.4 | 97.5 | 97.5 | 97.4 | 97.4 |
| | RCA | 98.0 | 97.8 | 97.9 | 97.9 | 97.7 | 97.7 |
| Zero | RPA | 97.7 | 97.0 | 97.5 | 97.5 | 97.4 | 95.4 |
| | RCA | 98.0 | 97.4 | 97.9 | 97.9 | 97.8 | 95.7 |
Table 5
Robustness of PESTO and other baselines to background music on the MIR‑1K dataset, with various signal‑to‑noise ratios.
| | Raw Pitch Accuracy | | | |
|---|---|---|---|---|
| Model | clean | | | |
| PESTO | | | | |
| | 97.2 | 93.7 | 81.6 | 46.8 |
| | 98.1 | 97.9 | 95.8 | 79.7 |
| | 97.1 | 96.7 | 94.0 | 78.9 |
| | 98.0 | 97.6 | 94.9 | 79.3 |
| | 98.1 | 97.8 | 95.6 | 82.5 |
| | 98.3 | 98.0 | 95.9 | 79.2 |
| | 98.0 | 97.6 | 95.2 | 80.8 |
| SPICE† | 91.4 | 91.2 | 90.0 | 81.6 |
| CREPE | 97.5 | 97.1 | 95.3 | 85.8 |
| PENN | 90.6 | 81.0 | 51.1 | 20.9 |
[i] †Results taken from the original paper (Gfeller et al., 2020)

Figure 5
Comparison of pitch accuracy metrics across different datasets as a function of the Variable‑Q Transform parameter γ. Each subplot shows test performance on a specific dataset (MDB, MIR‑1K, or PTDB), with line colors and markers indicating the training dataset. Solid lines represent raw pitch accuracy, while dashed lines represent raw chroma accuracy. The points indicate the mean of the top three scores out of five runs with different random seeds.

Figure 6
Comparison between CQT (γ = 0, top) and Variable‑Q Transform (γ > 0, bottom) spectrograms for one example from the MIR‑1K (left) and PTDB‑TUG (right) datasets. Highlighted rectangles indicate time/frequency misalignments between the spectrograms and the ground‑truth pitch contour (black line), particularly noticeable for the CQT at lower frequencies.
Table 6
Respective contributions of various design choices in our model (losses, data augmentations, Toeplitz layer) to its performance.
(a) Training on MIR‑1K
| | | | MIR‑1K | | MDB‑ss | | PTDB | |
|---|---|---|---|---|---|---|---|---|
| | | | RPA | RCA | RPA | RCA | RPA | RCA |
| ✗ | ✗ | ✓ | 1.6 | 6.1 | 0.8 | 11.5 | 0.0 | 11.5 |
| ✗ | ✓ | ✗ | 88.2 | 88.5 | 78.7 | 78.9 | 76.0 | 77.4 |
| ✓ | ✗ | ✗ | 97.3 | 97.6 | 87.2 | 92.2 | 82.2 | 84.1 |
| ✗ | ✓ | ✓ | 97.2 | 97.8 | 95.8 | 96.5 | 88.3 | 90.0 |
| ✓ | ✗ | ✓ | 97.2 | 97.5 | 90.4 | 91.6 | 82.4 | 84.3 |
| ✓ | ✓ | ✗ | 97.1 | 97.5 | 86.9 | 93.6 | 85.9 | 87.7 |
| ✓ | ✓ | ✓ | 97.7 | 98.0 | 94.8 | 95.9 | 87.7 | 90.3 |
| No data augmentations | | | 97.2 | 97.6 | 93.8 | 94.6 | 88.0 | 90.1 |
| No Toeplitz layer | | | 1.3 | 7.1 | 1.5 | 5.7 | 0.0 | 12.5 |
(b) Training on MDB‑stem‑synth
| | MIR‑1K | | MDB‑ss | | PTDB | |
|---|---|---|---|---|---|---|
| | RPA | RCA | RPA | RCA | RPA | RCA |
| ✗ | 95.4 | 96.6 | 97.0 | 97.1 | 88.6 | 90.3 |
| ✓ | 94.6 | 96.1 | 97.0 | 97.1 | 88.3 | 89.9 |
(c) Training on PTDB
| | MIR‑1K | | MDB‑ss | | PTDB | |
|---|---|---|---|---|---|---|
| | RPA | RCA | RPA | RCA | RPA | RCA |
| ✗ | 95.1 | 96.9 | 95.6 | 96.0 | 89.8 | 91.3 |
| ✓ | 95.6 | 96.9 | 96.3 | 96.6 | 89.7 | 91.2 |
