Audiovisual Singing Voice Separation

References

  1. Afouras, T., Chung, J. S., and Zisserman, A. (2018). The conversation: Deep audio-visual speech enhancement. In Proceedings of the International Conference on Spoken Language Processing (Interspeech). DOI: 10.21437/Interspeech.2018-1400
  2. Bazzica, A., van Gemert, J., Liem, C. C., and Hanjalic, A. (2017). Vision-based detection of acoustic timed events: A case study on clarinet note onsets. arXiv preprint arXiv:1706.09556.
  3. Berenzweig, A., Ellis, D. P., and Lawrence, S. (2002). Using voice segments to improve artist classification of music. In Proceedings of the AES 22nd International Conference: Virtual Synthetic and Entertainment Audio.
  4. Cadalbert, A., Landis, T., Regard, M., and Graves, R. E. (1994). Singing with and without words: Hemispheric asymmetries in motor control. Journal of Clinical and Experimental Neuropsychology, 16(5): 664–670. DOI: 10.1080/01688639408402679
  5. Cartwright, M., Pardo, B., and Mysore, G. J. (2018). Crowdsourced pairwise-comparison for source separation evaluation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 606–610. IEEE. DOI: 10.1109/ICASSP.2018.8462153
  6. Cartwright, M., Pardo, B., Mysore, G. J., and Hoffman, M. (2016). Fast and easy crowdsourced perceptual audio evaluation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 619–623. IEEE. DOI: 10.1109/ICASSP.2016.7471749
  7. Chan, T.-S., Yeh, T.-C., Fan, Z.-C., Chen, H.-W., Su, L., Yang, Y.-H., and Jang, R. (2015). Vocal activity informed singing voice separation with the iKala dataset. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 718–722. DOI: 10.1109/ICASSP.2015.7178063
  8. Chandna, P., Miron, M., Janer, J., and Gómez, E. (2017). Monoaural audio source separation using deep convolutional neural networks. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, pages 258–266. Springer. DOI: 10.1007/978-3-319-53547-0_25
  9. Chen, L., Srivastava, S., Duan, Z., and Xu, C. (2017). Deep cross-modal audio-visual generation. In Proceedings of the Thematic Workshops of ACM Multimedia, pages 349–357. DOI: 10.1145/3126686.3126723
  10. Choi, W., Kim, M., Chung, J., Lee, D., and Jung, S. (2019). Investigating deep neural transformations for spectrogram-based musical source separation. arXiv preprint arXiv:1912.02591.
  11. Choi, W., Kim, M., Chung, J., and Jung, S. (2021). LaSAFT: Latent source attentive frequency transformation for conditioned source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 171–175. IEEE. DOI: 10.1109/ICASSP39728.2021.9413896
  12. Chung, J. S., and Zisserman, A. (2016). Lip reading in the wild. In Proceedings of the Asian Conference on Computer Vision, pages 87–103. Springer. DOI: 10.1007/978-3-319-54184-6_6
  13. Connell, L., Cai, Z. G., and Holler, J. (2013). Do you see what I’m singing? Visuospatial movement biases pitch perception. Brain and Cognition, 81(1): 124–130. DOI: 10.1016/j.bandc.2012.09.005
  14. Défossez, A., Usunier, N., Bottou, L., and Bach, F. (2019). Demucs: Deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174.
  15. Dinesh, K., Li, B., Liu, X., Duan, Z., and Sharma, G. (2017). Visually informed multi-pitch analysis of string ensembles. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 3021–3025. DOI: 10.1109/ICASSP.2017.7952711
  16. Duan, Z., Essid, S., Liem, C., Richard, G., and Sharma, G. (2019). Audiovisual analysis of music performances: Overview of an emerging field. IEEE Signal Processing Magazine, 36(1): 63–73. DOI: 10.1109/MSP.2018.2875511
  17. Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., and Rubinstein, M. (2018). Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 37(4). DOI: 10.1145/3197517.3201357
  18. Fujihara, H., and Goto, M. (2007). A music information retrieval system based on singing voice timbre. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 467470.
  19. Fujihara, H., Goto, M., Ogata, J., Komatani, K., Ogata, T., and Okuno, H. G. (2006). Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals. In Proceedings of the IEEE International Symposium on Multimedia (ISM), pages 257–264. DOI: 10.1109/ISM.2006.38
  20. Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., and Torralba, A. (2020). Music gesture for visual sound separation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10478–10487. DOI: 10.1109/CVPR42600.2020.01049
  21. Gao, R., and Grauman, K. (2019). Co-separating sounds of visual objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3879–3888. DOI: 10.1109/ICCV.2019.00398
  22. Gillet, O., and Richard, G. (2006). ENST-Drums: An extensive audio-visual database for drum signals processing. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 156–159.
  23. Grell, A., Sundberg, J., Ternström, S., Ptok, M., and Altenmüller, E. (2009). Rapid pitch correction in choir singers. The Journal of the Acoustical Society of America, 126(1): 407–413. DOI: 10.1121/1.3147508
  24. Hennequin, R., Khlif, A., Voituret, F., and Moussallam, M. (2019). Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models. Late-Breaking Demo, International Society for Music Information Retrieval Conference (ISMIR). DOI: 10.21105/joss.02154
  25. Hou, J.-C., Wang, S.-S., Lai, Y.-H., Tsao, Y., Chang, H.-W., and Wang, H.-M. (2018). Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2): 117–128. DOI: 10.1109/TETCI.2017.2784878
  26. Hsu, C.-L., Wang, D., Jang, J.-S. R., and Hu, K. (2012). A tandem algorithm for singing pitch extraction and voice separation from music accompaniment. IEEE Transactions on Audio, Speech, and Language Processing, 20(5): 1482–1491. DOI: 10.1109/TASL.2011.2182510
  27. Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708. DOI: 10.1109/CVPR.2017.243
  28. Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. (2012). Singing-voice separation from monaural recordings using robust principal component analysis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 57–60. DOI: 10.1109/ICASSP.2012.6287816
  29. Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. (2014). Singing-voice separation from monaural recordings using deep recurrent neural networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 477–482.
  30. Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., and Weyde, T. (2017). Singing voice separation with deep U-Net convolutional networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).
  31. King, D. E. (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(Jul.): 1755–1758.
  32. Li, B., Dinesh, K., Duan, Z., and Sharma, G. (2017a). See and listen: Score-informed association of sound tracks to players in chamber music performance videos. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2906–2910. DOI: 10.1109/ICASSP.2017.7952688
  33. Li, B., Dinesh, K., Sharma, G., and Duan, Z. (2017b). Video-based vibrato detection and analysis for polyphonic string music. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 123–130.
  34. Li, B., Dinesh, K., Xu, C., Sharma, G., and Duan, Z. (2019a). Online audio-visual source association for chamber music performances. Transactions of the International Society for Music Information Retrieval, 2(1). DOI: 10.5334/tismir.25
  35. Li, B., and Kumar, A. (2019). Query by video: Crossmodal music retrieval. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 604–611.
  36. Li, B., Liu, X., Dinesh, K., Duan, Z., and Sharma, G. (2019b). Creating a music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2): 522–535. DOI: 10.1109/TMM.2018.2856090
  37. Li, B., Maezawa, A., and Duan, Z. (2018). Skeleton plays piano: Online generation of pianist body movements from MIDI performance. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).
  38. Li, B., Xu, C., and Duan, Z. (2017c). Audiovisual source association for string ensembles through multi-modal vibrato analysis. In Proceedings of the Sound and Music Computing (SMC) Conference, pages 159–166.
  39. Liu, J.-Y., and Yang, Y.-H. (2018). Denoising autoencoder with recurrent skip connections and residual regression for music source separation. In Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pages 773–778. DOI: 10.1109/ICMLA.2018.00123
  40. Lluis, F., Pons, J., and Serra, X. (2019). End-to-end music source separation: Is it possible in the waveform domain? In Proceedings of the International Conference on Spoken Language Processing (Interspeech). DOI: 10.21437/Interspeech.2019-1177
  41. Lu, R., Duan, Z., and Zhang, C. (2018). Listen and look: Audio–visual matching assisted speech source separation. IEEE Signal Processing Letters, 25(9): 1315–1319. DOI: 10.1109/LSP.2018.2853566
  42. Lu, R., Duan, Z., and Zhang, C. (2019). Audio–visual deep clustering for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(11): 1697–1712. DOI: 10.1109/TASLP.2019.2928140
  43. Luo, Y., Chen, Z., Hershey, J. R., Le Roux, J., and Mesgarani, N. (2017). Deep clustering and conventional networks for music separation: Stronger together. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 61–65. DOI: 10.1109/ICASSP.2017.7952118
  44. Luo, Y., and Mesgarani, N. (2018). TasNet: Time-domain audio separation network for real-time, single-channel speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 696–700. DOI: 10.1109/ICASSP.2018.8462116
  45. Mesaros, A., and Virtanen, T. (2010). Recognition of phonemes and words in singing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2146–2149. IEEE. DOI: 10.1109/ICASSP.2010.5495585
  46. Ozerov, A., Philippe, P., Bimbot, F., and Gribonval, R. (2007). Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Transactions on Audio, Speech, and Language Processing, 15(5): 1564–1578. DOI: 10.1109/TASL.2007.899291
  47. Ozerov, A., Philippe, P., Gribonval, R., and Bimbot, F. (2005). One microphone singing voice separation using source-adapted models. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 90–93. IEEE. DOI: 10.1109/ASPAA.2005.1540176
  48. Parekh, S., Essid, S., Ozerov, A., Duong, N., Perez, P., and Richard, G. (2017). Motion informed audio source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6–10. DOI: 10.1109/ICASSP.2017.7951787
  49. Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., and Pantic, M. (2018). End-to-end audiovisual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6548–6552. DOI: 10.1109/ICASSP.2018.8461326
  50. Rafii, Z., and Pardo, B. (2011). A simple music/voice separation method based on the extraction of the repeating musical structure. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 221–224. DOI: 10.1109/ICASSP.2011.5946380
  51. Song, X., Kong, Q., Du, X., and Wang, Y. (2021). CatNet: Music source separation system with mix-audio augmentation. arXiv preprint arXiv:2102.09966.
  52. Stoller, D., Ewert, S., and Dixon, S. (2018). Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 334–340.
  53. Stöter, F.-R., Liutkus, A., and Ito, N. (2018). The 2018 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pages 293–305. Springer. DOI: 10.1007/978-3-319-93764-9_28
  54. Stöter, F.-R., Uhlich, S., Liutkus, A., and Mitsufuji, Y. (2019). Open-Unmix: A reference implementation for music source separation. Journal of Open Source Software. DOI: 10.21105/joss.01667
  55. Takahashi, N., Agrawal, P., Goswami, N., and Mitsufuji, Y. (2018a). PhaseNet: Discretized phase modeling with deep neural networks for audio source separation. In Proceedings of the International Conference on Spoken Language Processing (Interspeech), pages 2713–2717. DOI: 10.21437/Interspeech.2018-1773
  56. Takahashi, N., Goswami, N., and Mitsufuji, Y. (2018b). MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. In Proceedings of the International Workshop on Acoustic Signal Enhancement (IWAENC), pages 106–110. IEEE. DOI: 10.1109/IWAENC.2018.8521383
  57. Takahashi, N., and Mitsufuji, Y. (2017). Multi-scale multi-band DenseNets for audio source separation. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 21–25. DOI: 10.1109/WASPAA.2017.8169987
  58. Takahashi, N., and Mitsufuji, Y. (2021). D3Net: Densely connected multidilated DenseNet for music source separation. arXiv preprint arXiv:2010.01733.
  59. Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. DOI: 10.1109/ICCV.2015.510
  60. Tsai, W.-H., Ma, C.-H., and Hsu, Y.-P. (2015). Automatic singing performance evaluation using accompanied vocals as reference bases. Journal of Information Science and Engineering, 31(3): 821–838.
  61. Tzinis, E., Wisdom, S., Jansen, A., Hershey, S., Remez, T., Ellis, D. P., and Hershey, J. R. (2021a). Into the wild with AudioScope: Unsupervised audio-visual separation of on-screen sounds. In Proceedings of the International Conference on Learning Representations (ICLR).
  62. Tzinis, E., Wisdom, S., Remez, T., and Hershey, J. R. (2021b). Improving on-screen sound separation for open domain videos with audio-visual self-attention. arXiv preprint arXiv:2106.09669.
  63. Uhlich, S., Porcu, M., Giron, F., Enenkl, M., Kemp, T., Takahashi, N., and Mitsufuji, Y. (2017). Improving music source separation based on deep neural networks through data augmentation and network blending. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 261–265. DOI: 10.1109/ICASSP.2017.7952158
  64. Vembu, S., and Baumann, S. (2005). Separation of vocals from polyphonic audio recordings. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 337–344.
  65. Vincent, E., Virtanen, T., and Gannot, S. (2018). Audio Source Separation and Speech Enhancement. John Wiley & Sons. DOI: 10.1002/9781119279860
  66. Zadeh, A., Ma, T., Poria, S., and Morency, L.-P. (2019). WildMix Dataset and Spectro-Temporal Transformer model for monoaural audio source separation. arXiv preprint arXiv:1911.09783.
  67. Zhao, H., Gan, C., Ma, W.-C., and Torralba, A. (2019). The sound of motions. In Proceedings of the International Conference on Computer Vision (ICCV), pages 1735–1744. DOI: 10.1109/ICCV.2019.00182
  68. Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., and Torralba, A. (2018). The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), volume 1, pages 587–604. DOI: 10.1007/978-3-030-01246-5_35
DOI: https://doi.org/10.5334/tismir.108 | Journal eISSN: 2514-3298
Language: English
Submitted on: Apr 6, 2021
Accepted on: Sep 13, 2021
Published on: Nov 25, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2021 Bochen Li, Yuxuan Wang, Zhiyao Duan, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.