PESTO: Real‑Time Pitch Estimation with Self‑Supervised Transposition‑Equivariant Objective

Alain Riou; Bernardo Torres; Ben Hayes; Stefan Lattner; Gaëtan Hadjeres; Gaël Richard; Geoffroy Peeters

doi:10.5334/tismir.251

Transactions of the International Society for Music Information Retrieval

Volume 8 (2025): Issue 1

Open Access

Sep 2025

Abstract

In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-Q Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performance while being very lightweight (130 k parameters).
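The two ingredients named above, pitch-shifted pairs obtained by cropping a log-frequency frame and a Toeplitz fully-connected layer, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the sizes, the random kernel, the synthetic frame, and the helper names `toeplitz_layer` and `crop_pair` are all assumptions made for illustration. It only shows why a Toeplitz weight matrix maps a translated input to a translated output.

```python
import numpy as np

def toeplitz_layer(kernel, n_in, n_out):
    """Build an (n_out, n_in) weight matrix with W[i, j] = kernel[i - j + n_in - 1].

    Every diagonal is constant, so the layer has n_in + n_out - 1 free
    parameters instead of n_in * n_out.
    """
    idx = np.arange(n_out)[:, None] - np.arange(n_in)[None, :] + (n_in - 1)
    return kernel[idx]

def crop_pair(frame, n_in, start, shift):
    """Return two crops of the same frame whose start bins differ by `shift`.

    On a log-frequency axis, moving the crop window by `shift` bins acts
    like a transposition of the cropped content by `shift` bins.
    """
    view_a = frame[start:start + n_in]
    view_b = frame[start + shift:start + shift + n_in]
    return view_a, view_b

# Hypothetical sizes, chosen only for this example.
n_total, n_in, n_out, shift = 64, 48, 48, 5

# Synthetic "frame": one narrow spectral peak, far from both crop borders.
frame = np.zeros(n_total)
frame[30:33] = [0.5, 1.0, 0.5]

rng = np.random.default_rng(0)
W = toeplitz_layer(rng.normal(size=n_in + n_out - 1), n_in, n_out)

view_a, view_b = crop_pair(frame, n_in, start=4, shift=shift)
out_a, out_b = W @ view_a, W @ view_b

# Equivariance: the output for view_b is the output for view_a moved by `shift` bins.
print(np.allclose(out_b[:n_out - shift], out_a[shift:]))
```

The equality is exact here because the peak stays inside both crops. With real spectra, energy entering or leaving at the crop borders makes the relation approximate near the edges.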

Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO's practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model's low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.
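The general idea behind cached convolutions can be sketched as follows: each call keeps the last K - 1 input samples (K being the kernel length), so that processing audio chunk by chunk gives the same result as processing the whole signal at once. This is a minimal illustration of the caching principle for a single causal 1D filter, not the streamable VQT described in the paper. The class name `CachedConv1d`, the kernel, and the chunk boundaries are assumptions.

```python
import numpy as np

class CachedConv1d:
    """Causal 1D convolution that can be fed arbitrary-sized chunks."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # The cache holds the K - 1 most recent input samples (zeros at start).
        self.cache = np.zeros(len(self.kernel) - 1)

    def __call__(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        if len(chunk) == 0:
            return np.zeros(0)
        # Prepend the cached history, then keep the new tail for the next call.
        padded = np.concatenate([self.cache, chunk])
        self.cache = padded[len(padded) - len(self.cache):]
        # "valid" mode yields exactly one output sample per new input sample.
        return np.convolve(padded, self.kernel, mode="valid")

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
kernel = rng.normal(size=7)

# Offline reference: full convolution truncated to the causal part.
offline = np.convolve(signal, kernel)[:len(signal)]

# Streaming: feed irregular chunks, including one that is a single sample long.
conv = CachedConv1d(kernel)
bounds = [0, 100, 101, 357, 1000]
streamed = np.concatenate(
    [conv(signal[a:b]) for a, b in zip(bounds[:-1], bounds[1:])]
)

print(np.allclose(streamed, offline))
```

Because the cache makes the output independent of how the input is chunked, the chunk size can be chosen to suit the latency budget of the host application.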


DOI: https://doi.org/10.5334/tismir.251 | Journal eISSN: 2514-3298

Language: English

Submitted on: Jan 9, 2025

Accepted on: Aug 1, 2025

Published on: Sep 9, 2025

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: pitch estimation, self-supervised learning, equivariance, real-time, streamable convolutions, variable-Q transform, lightweight, f0 estimation, music information retrieval, Toeplitz matrix

© 2025 Alain Riou, Bernardo Torres, Ben Hayes, Stefan Lattner, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
