References
- Afouras, T., Chung, J. S., and Zisserman, A. (2018). The conversation: Deep audio-visual speech enhancement. In Proceedings of the International Conference on Spoken Language Processing (Interspeech). DOI: 10.21437/Interspeech.2018-1400
- Bazzica, A., van Gemert, J., Liem, C. C., and Hanjalic, A. (2017). Vision-based detection of acoustic timed events: A case study on clarinet note onsets. arXiv preprint arXiv:1706.09556.
- Berenzweig, A., Ellis, D. P., and Lawrence, S. (2002). Using voice segments to improve artist classification of music. In Proceedings of the AES 22nd International Conference: Virtual Synthetic and Entertainment Audio.
- Cadalbert, A., Landis, T., Regard, M., and Graves, R. E. (1994). Singing with and without words: Hemispheric asymmetries in motor control. Journal of Clinical and Experimental Neuropsychology, 16(5): 664–670. DOI: 10.1080/01688639408402679
- Cartwright, M., Pardo, B., and Mysore, G. J. (2018). Crowdsourced pairwise-comparison for source separation evaluation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 606–610. DOI: 10.1109/ICASSP.2018.8462153
- Cartwright, M., Pardo, B., Mysore, G. J., and Hoffman, M. (2016). Fast and easy crowdsourced perceptual audio evaluation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 619–623. DOI: 10.1109/ICASSP.2016.7471749
- Chan, T.-S., Yeh, T.-C., Fan, Z.-C., Chen, H.-W., Su, L., Yang, Y.-H., and Jang, R. (2015). Vocal activity informed singing voice separation with the iKala dataset. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 718–722. DOI: 10.1109/ICASSP.2015.7178063
- Chandna, P., Miron, M., Janer, J., and Gómez, E. (2017). Monoaural audio source separation using deep convolutional neural networks. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, pages 258–266. DOI: 10.1007/978-3-319-53547-0_25
- Chen, L., Srivastava, S., Duan, Z., and Xu, C. (2017). Deep cross-modal audio-visual generation. In Proceedings of the ACM Thematic Workshops of Multimedia, pages 349–357. DOI: 10.1145/3126686.3126723
- Choi, W., Kim, M., Chung, J., Lee, D., and Jung, S. (2019). Investigating deep neural transformations for spectrogram-based musical source separation. arXiv preprint arXiv:1912.02591.
- Choi, W., Kim, M., Chung, J., and Jung, S. (2021). LaSAFT: Latent source attentive frequency transformation for conditioned source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 171–175. DOI: 10.1109/ICASSP39728.2021.9413896
- Chung, J. S., and Zisserman, A. (2016). Lip reading in the wild. In Proceedings of the Asian Conference on Computer Vision, pages 87–103. DOI: 10.1007/978-3-319-54184-6_6
- Connell, L., Cai, Z. G., and Holler, J. (2013). Do you see what I’m singing? Visuospatial movement biases pitch perception. Brain and Cognition, 81(1): 124–130. DOI: 10.1016/j.bandc.2012.09.005
- Défossez, A., Usunier, N., Bottou, L., and Bach, F. (2019). Demucs: Deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174.
- Dinesh, K., Li, B., Liu, X., Duan, Z., and Sharma, G. (2017). Visually informed multi-pitch analysis of string ensembles. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 3021–3025. DOI: 10.1109/ICASSP.2017.7952711
- Duan, Z., Essid, S., Liem, C., Richard, G., and Sharma, G. (2019). Audiovisual analysis of music performances: Overview of an emerging field. IEEE Signal Processing Magazine, 36(1): 63–73. DOI: 10.1109/MSP.2018.2875511
- Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., and Rubinstein, M. (2018). Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 37(4). DOI: 10.1145/3197517.3201357
- Fujihara, H., and Goto, M. (2007). A music information retrieval system based on singing voice timbre. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 467–470.
- Fujihara, H., Goto, M., Ogata, J., Komatani, K., Ogata, T., and Okuno, H. G. (2006). Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals. In Proceedings of the IEEE International Symposium on Multimedia (ISM), pages 257–264. DOI: 10.1109/ISM.2006.38
- Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., and Torralba, A. (2020). Music gesture for visual sound separation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10478–10487. DOI: 10.1109/CVPR42600.2020.01049
- Gao, R., and Grauman, K. (2019). Co-separating sounds of visual objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3879–3888. DOI: 10.1109/ICCV.2019.00398
- Gillet, O., and Richard, G. (2006). ENST-Drums: An extensive audio-visual database for drum signals processing. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 156–159.
- Grell, A., Sundberg, J., Ternström, S., Ptok, M., and Altenmüller, E. (2009). Rapid pitch correction in choir singers. The Journal of the Acoustical Society of America, 126(1): 407–413. DOI: 10.1121/1.3147508
- Hennequin, R., Khlif, A., Voituret, F., and Moussallam, M. (2019). Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models. Late-Breaking Demo, International Society for Music Information Retrieval Conference (ISMIR). DOI: 10.21105/joss.02154
- Hou, J.-C., Wang, S.-S., Lai, Y.-H., Tsao, Y., Chang, H.-W., and Wang, H.-M. (2018). Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2): 117–128. DOI: 10.1109/TETCI.2017.2784878
- Hsu, C.-L., Wang, D., Jang, J.-S. R., and Hu, K. (2012). A tandem algorithm for singing pitch extraction and voice separation from music accompaniment. IEEE Transactions on Audio, Speech, and Language Processing, 20(5): 1482–1491. DOI: 10.1109/TASL.2011.2182510
- Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708. DOI: 10.1109/CVPR.2017.243
- Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. (2012). Singing-voice separation from monaural recordings using robust principal component analysis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 57–60. DOI: 10.1109/ICASSP.2012.6287816
- Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. (2014). Singing-voice separation from monaural recordings using deep recurrent neural networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 477–482.
- Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., and Weyde, T. (2017). Singing voice separation with deep U-Net convolutional networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).
- King, D. E. (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(Jul.): 1755–1758.
- Li, B., Dinesh, K., Duan, Z., and Sharma, G. (2017a). See and listen: Score-informed association of sound tracks to players in chamber music performance videos. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2906–2910. DOI: 10.1109/ICASSP.2017.7952688
- Li, B., Dinesh, K., Sharma, G., and Duan, Z. (2017b). Video-based vibrato detection and analysis for polyphonic string music. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 123–130.
- Li, B., Dinesh, K., Xu, C., Sharma, G., and Duan, Z. (2019a). Online audio-visual source association for chamber music performances. Transactions of the International Society for Music Information Retrieval, 2(1). DOI: 10.5334/tismir.25
- Li, B., and Kumar, A. (2019). Query by video: Crossmodal music retrieval. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 604–611.
- Li, B., Liu, X., Dinesh, K., Duan, Z., and Sharma, G. (2019b). Creating a music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2): 522–535. DOI: 10.1109/TMM.2018.2856090
- Li, B., Maezawa, A., and Duan, Z. (2018). Skeleton plays piano: Online generation of pianist body movements from MIDI performance. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).
- Li, B., Xu, C., and Duan, Z. (2017c). Audiovisual source association for string ensembles through multi-modal vibrato analysis. In Proceedings of the Sound and Music Computing (SMC) Conference, pages 159–166.
- Liu, J.-Y., and Yang, Y.-H. (2018). Denoising autoencoder with recurrent skip connections and residual regression for music source separation. In Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pages 773–778. DOI: 10.1109/ICMLA.2018.00123
- Lluis, F., Pons, J., and Serra, X. (2019). End-to-end music source separation: Is it possible in the waveform domain? In Proceedings of the International Conference on Spoken Language Processing (Interspeech). DOI: 10.21437/Interspeech.2019-1177
- Lu, R., Duan, Z., and Zhang, C. (2018). Listen and look: Audio–visual matching assisted speech source separation. IEEE Signal Processing Letters, 25(9): 1315–1319. DOI: 10.1109/LSP.2018.2853566
- Lu, R., Duan, Z., and Zhang, C. (2019). Audio–visual deep clustering for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(11): 1697–1712. DOI: 10.1109/TASLP.2019.2928140
- Luo, Y., Chen, Z., Hershey, J. R., Le Roux, J., and Mesgarani, N. (2017). Deep clustering and conventional networks for music separation: Stronger together. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 61–65. DOI: 10.1109/ICASSP.2017.7952118
- Luo, Y., and Mesgarani, N. (2018). TasNet: Time-domain audio separation network for real-time, single-channel speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 696–700. DOI: 10.1109/ICASSP.2018.8462116
- Mesaros, A., and Virtanen, T. (2010). Recognition of phonemes and words in singing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2146–2149. DOI: 10.1109/ICASSP.2010.5495585
- Ozerov, A., Philippe, P., Bimbot, F., and Gribonval, R. (2007). Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Transactions on Audio, Speech, and Language Processing, 15(5): 1564–1578. DOI: 10.1109/TASL.2007.899291
- Ozerov, A., Philippe, P., Gribonval, R., and Bimbot, F. (2005). One microphone singing voice separation using source-adapted models. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 90–93. DOI: 10.1109/ASPAA.2005.1540176
- Parekh, S., Essid, S., Ozerov, A., Duong, N., Perez, P., and Richard, G. (2017). Motion informed audio source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6–10. DOI: 10.1109/ICASSP.2017.7951787
- Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., and Pantic, M. (2018). End-to-end audiovisual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6548–6552. DOI: 10.1109/ICASSP.2018.8461326
- Rafii, Z., and Pardo, B. (2011). A simple music/voice separation method based on the extraction of the repeating musical structure. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 221–224. DOI: 10.1109/ICASSP.2011.5946380
- Song, X., Kong, Q., Du, X., and Wang, Y. (2021). CatNet: Music source separation system with mix-audio augmentation. arXiv preprint arXiv:2102.09966.
- Stoller, D., Ewert, S., and Dixon, S. (2018). Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 334–340.
- Stöter, F.-R., Liutkus, A., and Ito, N. (2018). The 2018 signal separation evaluation campaign. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, pages 293–305. DOI: 10.1007/978-3-319-93764-9_28
- Stöter, F.-R., Uhlich, S., Liutkus, A., and Mitsufuji, Y. (2019). Open-Unmix: A reference implementation for music source separation. Journal of Open Source Software. DOI: 10.21105/joss.01667
- Takahashi, N., Agrawal, P., Goswami, N., and Mitsufuji, Y. (2018a). PhaseNet: Discretized phase modeling with deep neural networks for audio source separation. In Proceedings of the International Conference on Spoken Language Processing (Interspeech), pages 2713–2717. DOI: 10.21437/Interspeech.2018-1773
- Takahashi, N., Goswami, N., and Mitsufuji, Y. (2018b). MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. In Proceedings of the International Workshop on Acoustic Signal Enhancement (IWAENC), pages 106–110. DOI: 10.1109/IWAENC.2018.8521383
- Takahashi, N., and Mitsufuji, Y. (2017). Multi-scale multi-band DenseNets for audio source separation. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 21–25. DOI: 10.1109/WASPAA.2017.8169987
- Takahashi, N., and Mitsufuji, Y. (2021). D3Net: Densely connected multidilated DenseNet for music source separation. arXiv preprint arXiv:2010.01733.
- Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. DOI: 10.1109/ICCV.2015.510
- Tsai, W.-H., Ma, C.-H., and Hsu, Y.-P. (2015). Automatic singing performance evaluation using accompanied vocals as reference bases. Journal of Information Science and Engineering, 31(3): 821–838.
- Tzinis, E., Wisdom, S., Jansen, A., Hershey, S., Remez, T., Ellis, D. P., and Hershey, J. R. (2021a). Into the wild with audioscope: Unsupervised audio-visual separation of on-screen sounds. In Proceedings of the International Conference on Learning Representations (ICLR).
- Tzinis, E., Wisdom, S., Remez, T., and Hershey, J. R. (2021b). Improving on-screen sound separation for open domain videos with audio-visual self-attention. arXiv preprint arXiv:2106.09669.
- Uhlich, S., Porcu, M., Giron, F., Enenkl, M., Kemp, T., Takahashi, N., and Mitsufuji, Y. (2017). Improving music source separation based on deep neural networks through data augmentation and network blending. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 261–265. DOI: 10.1109/ICASSP.2017.7952158
- Vembu, S., and Baumann, S. (2005). Separation of vocals from polyphonic audio recordings. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 337–344.
- Vincent, E., Virtanen, T., and Gannot, S. (2018). Audio Source Separation and Speech Enhancement. John Wiley & Sons. DOI: 10.1002/9781119279860
- Zadeh, A., Ma, T., Poria, S., and Morency, L.-P. (2019). WildMix Dataset and Spectro-Temporal Transformer model for monoaural audio source separation. arXiv preprint arXiv:1911.09783.
- Zhao, H., Gan, C., Ma, W.-C., and Torralba, A. (2019). The sound of motions. In Proceedings of the International Conference on Computer Vision (ICCV), pages 1735–1744. DOI: 10.1109/ICCV.2019.00182
- Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., and Torralba, A. (2018). The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), volume 1, pages 587–604. DOI: 10.1007/978-3-030-01246-5_35
DOI: https://doi.org/10.5334/tismir.108 | Journal eISSN: 2514-3298
Language: English
Submitted on: Apr 6, 2021
Accepted on: Sep 13, 2021
Published on: Nov 25, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year
© 2021 Bochen Li, Yuxuan Wang, Zhiyao Duan, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
