
PESTO: Real‑Time Pitch Estimation with Self‑Supervised Transposition‑Equivariant Objective
References
- Alonso, J., and Erkut, C. (2021). Explorations of singing voice synthesis using DDSP. In Proceedings of the Sound and Music Computing Conference, volume 2021‑June (pp. 183–190).
- Anton, J., Coppock, H., Shukla, P., and Schuller, B. W. (2023). Audio Barlow twins: Self‑supervised audio representation learning. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, volume 2023‑June.
- Ardaillon, L., and Roebel, A. (2019). Fully‑convolutional network for pitch estimation of speech signals. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, volume 2019‑September (pp. 2005–2009).
- Ba, L. J., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. CoRR, abs/1607.06450.
- Baevski, A., Hsu, W.‑N., Xu, Q., Babu, A., Gu, J., and Auli, M. (2022). Data2vec: A general framework for self‑supervised learning in speech, vision and language. In International Conference on Machine Learning (pp. 1298–1312). PMLR.
- Bagad, P., Tapaswi, M., Snoek, C. G. M., and Zisserman, A. (2024). The sound of water: Inferring physical properties from pouring liquids. CoRR.
- Bardes, A., Ponce, J., and LeCun, Y. (2022). VICReg: Variance‑invariance‑covariance regularization for self‑supervised learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25–29, 2022. OpenReview.net.
- Bargum, A. R., Serafin, S., and Erkut, C. (2024). Reimagining speech: A scoping review of deep learning‑based methods for non‑parallel voice conversion. Frontiers in Signal Processing, 4, 1339159.
- Best, P., Marxer, R., Paris, S., and Glotin, H. (2025). Temporal evolution of the Mediterranean fin whale song. Scientific Reports, 12(1), 1–12.
- Bittner, R., Salamon, J., Tierney, M., Mauch, M., Cannam, C., and Bello, J. (2014). MedleyDB: A multitrack dataset for annotation‑intensive MIR research. In Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR 2014 (pp. 155–160).
- Bittner, R. M., Bosch, J. J., Rubinstein, D., Meseguer‑Brocal, G., and Ewert, S. (2022). A lightweight instrument‑agnostic model for polyphonic note transcription and multipitch estimation. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, volume 2022‑May (pp. 781–785). Institute of Electrical and Electronics Engineers Inc.
- Boersma, P. (1993). Accurate short‑term analysis of the fundamental frequency and the harmonics‑to‑noise ratio of a sampled sound. In IFA Proceedings 17 (pp. 97–110).
- Brown, J. C. (1991). Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America, 89(1), 425–434.
- Caillon, A., and Esling, P. (2022). Streamable neural audio synthesis with non‑causal convolutions. In Proceedings of the International Conference on Digital Audio Effects, DAFx (Vol. 3, pp. 320–327).
- Camacho, A., and Harris, J. G. (2008). A sawtooth waveform inspired pitch estimator for speech and music. The Journal of the Acoustical Society of America, 124(3), 1638–1652.
- Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In 37th International Conference on Machine Learning, ICML 2020 (Vol. PartF16814, pp. 1575–1585). International Machine Learning Society (IMLS).
- Chen, X., and He, K. (2021). Exploring simple Siamese representation learning. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 15745–15753). IEEE Computer Society.
- Cheuk, K. W., Anderson, H., Agres, K., and Herremans, D. (2020). nnAudio: An on‑the‑fly GPU audio to spectrogram conversion toolbox using 1D convolutional neural networks. IEEE Access, 8, 161981–162003.
- Dangovski, R., Jing, L., Loh, C., Han, S., Srivastava, A., Cheung, B., Agrawal, P., and Soljačić, M. (2021). Equivariant contrastive learning. In ICLR 2022 – 10th International Conference on Learning Representations.
- de Cheveigné, A., and Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
- Devillers, A., and Lefort, M. (2023). Equimod: An equivariance module to improve visual instance discrimination. In 11th International Conference on Learning Representations, ICLR 2023. OpenReview.net.
- Duan, Z., Pardo, B., and Zhang, C. (2010). Multiple fundamental frequency estimation by modeling spectral peaks and non‑peak regions. IEEE Transactions on Audio, Speech and Language Processing, 18(8), 2121–2133.
- Dubnowski, J. J., Schafer, R. W., and Rabiner, L. R. (1976). Real‑time digital hardware pitch detector. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(1), 2–8.
- Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In 34th International Conference on Machine Learning, ICML 2017 (Vol. 3, pp. 1771–1780). International Machine Learning Society (IMLS).
- Engel, J., Swavely, R., Roberts, A., Hanoi, L., Hantrakul, L., and Hawthorne, C. (2020a). Self‑supervised pitch detection by inverse audio synthesis. Workshop on Self‑Supervision in Audio and Speech at the 37th International Conference on Machine Learning (ICML 2020) (pp. 1–9).
- Engel, J. H., Hantrakul, L., Gu, C., and Roberts, A. (2020b). DDSP: Differentiable digital signal processing. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net.
- Esser, P., Rombach, R., and Ommer, B. (2021). Taming transformers for high‑resolution image synthesis. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 12868–12878). IEEE Computer Society.
- Fabbro, G., Golkov, V., Kemp, T., and Cremers, D. (2020). Speech synthesis and control using differentiable DSP.
- Falorsi, L., de Haan, P., Davidson, T. R., Cao, N. D., Weiler, M., Forré, P., and Cohen, T. S. (2018). Explorations in homeomorphic variational autoencoding. CoRR, abs/1807.04689
- Gagneré, A., Essid, S., and Peeters, G. (2024). Adapting pitch‑based self supervised learning models for tempo estimation. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings (pp. 956–960).
- Ganis, F., Knudsen, E. F., Lyster, S. V. K., Otterbein, R., Südholt, D., and Erkut, C. (2021). Real‑time timbre transfer and sound synthesis using DDSP.
- Garrido, Q., Najman, L., and LeCun, Y. (2023, July). Self‑supervised learning of split invariant equivariant representations. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett (Eds.), International Conference on Machine Learning, ICML 2023, 23‑29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research (pp. 10975–10996). PMLR.
- Gfeller, B., Frank, C., Roblek, D., Sharifi, M., Tagliasacchi, M., and Velimirovic, M. (2020). SPICE: Self‑supervised pitch estimation. IEEE/ACM Transactions on Audio Speech and Language Processing, 28, 1118–1128.
- Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., Piot, B., Kavukcuoglu, K., Munos, R., and Valko, M. (2020). Bootstrap your own latent: A new approach to self‑supervised learning. In Advances in Neural Information Processing Systems, volume 2020.
- Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2, 1735–1742.
- Hagiwara, M., Miron, M., and Liu, J.‑Y. (2024). ISPA: Inter‑species phonetic alphabet for transcribing animal sounds. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 – Workshops, Seoul, Republic of Korea. April 14–19, 2024 (pp. 828–832). IEEE.
- Han, D., Repetto, R. C., and Jeong, D. (2023). Finding Tori: Self‑supervised learning for analyzing Korean folk song. In 24th International Society for Music Information Retrieval Conference, ISMIR 2023 – Proceedings (pp. 440–447).
- Hayes, B., Saitis, C., and Fazekas, G. (2023a). Sinusoidal frequency estimation by gradient descent. In ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1–5).
- Hayes, B., Shier, J., Fazekas, G., McPherson, A., and Saitis, C. (2023b). A review of differentiable digital signal processing for music & speech synthesis. Frontiers in Signal Processing.
- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2016‑December (pp. 770–778).
- Hinton, G. E., Krizhevsky, A., and Wang, S. D. (2011). Transforming auto‑encoders. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 6791 LNCS (pp. 44–51).
- Hsu, C.‑L., and Jang, J.‑S. R. (2010). On the improvement of singing voice separation for monaural recordings using the MIR‑1K dataset. IEEE Transactions on Audio, Speech, and Language Processing, 18(2), 310–319.
- Huang, S., Li, Q., Anil, C., Bao, X., Oore, S., and Grosse, R. B. (2018). TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer. In International Conference on Learning Representations.
- Huang, W.‑C., Violeta, L. P., Liu, S., Shi, J., and Toda, T. (2023). The singing voice conversion challenge 2023.
- Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1), 73–101.
- Kim, J. W., Salamon, J., Li, P., and Bello, J. P. (2018). CREPE: A convolutional representation for pitch estimation. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, volume 2018‑April (pp. 161–165).
- Kingma, D. P., and Ba, J. L. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015 – Conference Track Proceedings.
- Kong, Y., Lostanlen, V., Meseguer‑Brocal, G., Wong, S., Lagrange, M., and Hennequin, R. (2024). STONE: Self‑supervised tonality estimator. In Proceedings of the International Society for Music Information Retrieval Conference (pp. 954–961).
- Li, D., Wu, Y., Li, Q., Zhao, J., Yu, Y., Xia, F., and Li, W. (2022). Playing technique detection by fusing note onset information in Guzheng performance. In Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022.
- MacGlashan, J., Archer, E., Devlic, A., Seno, T., Sherstan, C., Wurman, P. R., and Stone, P. (2022). Value function decomposition for iterative design of reinforcement learning agents. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA.
- Mauch, M., and Dixon, S. (2014). PYIN: A fundamental frequency estimator using probabilistic threshold distributions. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (Vol. 1, pp. 659–663).
- McCallum, M. C., Korzeniowski, F., Oramas, S., Gouyon, F., and Ehmann, A. F. (2022). Supervised and unsupervised learning of audio representations for music understanding. In Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022.
- Meier, P., Schwär, S., Krump, G., and Müller, M. (2023). Evaluating real‑time pitch estimation algorithms for creative music game interaction. In M. Klein, D. Krupka, C. Winter, and V. Wohlgemuth (Eds.), Lecture Notes in Informatics (LNI), Proceedings – Series of the Gesellschaft für Informatik (GI), volume P‑337 of LNI (pp. 875–884). Gesellschaft für Informatik, Bonn.
- Morais, G., Davies, M. E., Queiroz, M., and Fuentes, M. (2023). Tempo vs. pitch: Understanding self‑supervised tempo estimation. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, volume 2023‑June. Institute of Electrical and Electronics Engineers Inc.
- Morrison, M., Hsieh, C., Pruyne, N., and Pardo, B. (2023). Cross‑domain neural pitch and periodicity estimation. CoRR, abs/2301.1.
- Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N., and Kashino, K. (2022). BYOL for audio: Exploring pre‑trained general‑purpose audio representations. IEEE/ACM Transactions on Audio, Speech, and Language Processing (pp. 1–15).
- Noll, A. M. (1967). Cepstrum pitch determination. The Journal of the Acoustical Society of America, 41(2), 293–309.
- Oxenham, A. J. (2012). Pitch perception.
- Pirker, G., Wohlmayr, M., Petrik, S., and Pernkopf, F. (2011). A pitch tracking corpus with evaluation on multipitch tracking scenario. In INTERSPEECH (pp. 1509–1512). ISCA.
- Poliner, G. E., Ellis, D. P., Ehmann, A. F., Gómez, E., Streich, S., and Ong, B. (2007). Melody transcription from music audio: Approaches and evaluation. IEEE Transactions on Audio, Speech and Language Processing, 15(4), 1247–1256.
- Quinton, E. (2022). Equivariant self‑supervision for musical tempo estimation. Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022.
- Riou, A., Lattner, S., Hadjeres, G., and Peeters, G. (2023). PESTO: Pitch estimation with self‑supervised transposition‑equivariant objective. In Proceedings of the 24th International Society for Music Information Retrieval Conference (pp. 535–544). ISMIR.
- Ross, M. J., Shaffer, H. L., Cohen, A., Freudberg, R. L., and Manley, H. (1974). Average magnitude difference function pitch extractor. IEEE Transactions on Acoustics, Speech, and Signal Processing, 22, 353–362.
- Saeed, A., Grangier, D., and Zeghidour, N. (2021). Contrastive learning of general‑purpose audio representations. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, volume 2021‑June (pp. 3875–3879). Institute of Electrical and Electronics Engineers Inc.
- Salamon, J., Bittner, R. M., Bonada, J., Bosch, J. J., Gómez, E., and Bello, J. P. (2017). An analysis/synthesis framework for automatic f0 annotation of multitrack datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017 (pp. 71–78).
- Schörkhuber, C., Klapuri, A., Holighaus, N., and Dörfler, M. (2014). A Matlab toolbox for efficient perfect reconstruction time‑frequency transforms with log‑frequency resolution. In Semantic Audio. Audio Engineering Society.
- Shelhamer, E., Long, J., and Darrell, T. (2017). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640–651.
- Singh, S., Wang, R., and Qiu, Y. (2021). DeepF0: End‑to‑end fundamental frequency estimation for music and speech signals. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, volume 2021‑June (pp. 61–65). Institute of Electrical and Electronics Engineers Inc.
- Sisman, B., Yamagishi, J., King, S., and Li, H. (2021). An overview of voice conversion and its challenges: From statistical modeling to deep learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 132–157.
- Spijkervet, J., and Burgoyne, J. A. (2021). Contrastive learning of musical representations. Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
- Stefani, D., and Turchet, L. (2022). On the challenges of embedded real‑time music information retrieval. In International Conference on Digital Audio Effects (DAFx).
- Talkin, D. (1995). A robust algorithm for pitch tracking (RAPT).
- Torres, B., Peeters, G., and Richard, G. (2024). Unsupervised harmonic parameter estimation using differentiable DSP and spectral optimal transport. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024, Seoul, Republic of Korea. April 14–19, 2024 (pp. 1176–1180). IEEE.
- Wang, T., and Isola, P. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In 37th International Conference on Machine Learning, ICML 2020 (Vol. PartF16814, pp. 9871–9881). International Machine Learning Society (IMLS).
- Wang, X., Takaki, S., and Yamagishi, J. (2019). Neural source‑filter waveform models for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 402–415.
- Weiß, C., and Peeters, G. (2022). Deep‑learning architectures for multi‑pitch estimation: Towards reliable evaluation. CoRR, abs/2202.0.
- Winter, R., Bertolini, M., Le, T., Noé, F., and Clevert, D.‑A. (2022). Unsupervised learning of group invariant and equivariant representations. Proceedings of the International Conference on Neural Information Processing Systems.
- Xu, B., Wang, N., Chen, T., and Li, M. (2015). Empirical evaluation of rectified activations in convolutional network. CoRR, abs/1505.0.
- Yang, Y., Kartynnik, Y., Li, Y., Tang, J., Li, X., Sung, G., and Grundmann, M. (2024). StreamVC: Real‑time low‑latency voice conversion.
- Yost, W. A. (2009). Pitch perception. Attention, Perception, and Psychophysics, 71(8), 1701–1715.
- Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. (2021). Barlow twins: Self‑supervised learning via redundancy reduction. 38th International Conference on Machine Learning, ICML 2021.
DOI: https://doi.org/10.5334/tismir.251 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jan 9, 2025
Accepted on: Aug 1, 2025
Published on: Sep 9, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year
© 2025 Alain Riou, Bernardo Torres, Ben Hayes, Stefan Lattner, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.