References
- Afouras, T., Chung, J. S., and Zisserman, A. (2018). The conversation: Deep audio-visual speech enhancement. In Proceedings of the International Conference on Spoken Language Processing (Interspeech). DOI: 10.21437/Interspeech.2018-1400
- Bazzica, A., van Gemert, J., Liem, C. C., and Hanjalic, A. (2017). Vision-based detection of acoustic timed events: A case study on clarinet note onsets. arXiv preprint arXiv:1706.09556.
- Berenzweig, A., Ellis, D. P., and Lawrence, S. (2002). Using voice segments to improve artist classification of music. In Proceedings of the AES 22nd International Conference: Virtual Synthetic and Entertainment Audio.
- Cadalbert, A., Landis, T., Regard, M., and Graves, R. E. (1994). Singing with and without words: Hemispheric asymmetries in motor control. Journal of Clinical and Experimental Neuropsychology, 16(5): 664–670. DOI: 10.1080/01688639408402679
- Cartwright, M., Pardo, B., and Mysore, G. J. (2018). Crowdsourced pairwise-comparison for source separation evaluation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 606–610. DOI: 10.1109/ICASSP.2018.8462153
- Cartwright, M., Pardo, B., Mysore, G. J., and Hoffman, M. (2016). Fast and easy crowdsourced perceptual audio evaluation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 619–623. DOI: 10.1109/ICASSP.2016.7471749
- Chan, T.-S., Yeh, T.-C., Fan, Z.-C., Chen, H.-W., Su, L., Yang, Y.-H., and Jang, R. (2015). Vocal activity informed singing voice separation with the iKala dataset. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 718–722. DOI: 10.1109/ICASSP.2015.7178063
- Chandna, P., Miron, M., Janer, J., and Gómez, E. (2017). Monoaural audio source separation using deep convolutional neural networks. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, pages 258–266. DOI: 10.1007/978-3-319-53547-0_25
- Chen, L., Srivastava, S., Duan, Z., and Xu, C. (2017). Deep cross-modal audio-visual generation. In Proceedings of the ACM Thematic Workshops of Multimedia, pages 349–357. DOI: 10.1145/3126686.3126723
- Choi, W., Kim, M., Chung, J., Lee, D., and Jung, S. (2019). Investigating deep neural transformations for spectrogram-based musical source separation. arXiv preprint arXiv:1912.02591.
- Choi, W., Kim, M., Chung, J., and Jung, S. (2021). LaSAFT: Latent source attentive frequency transformation for conditioned source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 171–175. DOI: 10.1109/ICASSP39728.2021.9413896
- Chung, J. S., and Zisserman, A. (2016). Lip reading in the wild. In Proceedings of the Asian Conference on Computer Vision, pages 87–103. DOI: 10.1007/978-3-319-54184-6_6
- Connell, L., Cai, Z. G., and Holler, J. (2013). Do you see what I’m singing? Visuospatial movement biases pitch perception. Brain and Cognition, 81(1): 124–130. DOI: 10.1016/j.bandc.2012.09.005
- Défossez, A., Usunier, N., Bottou, L., and Bach, F. (2019). Demucs: Deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174.
- Dinesh, K., Li, B., Liu, X., Duan, Z., and Sharma, G. (2017). Visually informed multi-pitch analysis of string ensembles. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 3021–3025. DOI: 10.1109/ICASSP.2017.7952711
- Duan, Z., Essid, S., Liem, C., Richard, G., and Sharma, G. (2019). Audiovisual analysis of music performances: Overview of an emerging field. IEEE Signal Processing Magazine, 36(1): 63–73. DOI: 10.1109/MSP.2018.2875511
- Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., and Rubinstein, M. (2018). Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 37(4). DOI: 10.1145/3197517.3201357
- Fujihara, H., and Goto, M. (2007). A music information retrieval system based on singing voice timbre. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 467–470.
- Fujihara, H., Goto, M., Ogata, J., Komatani, K., Ogata, T., and Okuno, H. G. (2006). Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals. In Proceedings of the IEEE International Symposium on Multimedia (ISM), pages 257–264. DOI: 10.1109/ISM.2006.38
- Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., and Torralba, A. (2020). Music gesture for visual sound separation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10478–10487. DOI: 10.1109/CVPR42600.2020.01049
- Gao, R., and Grauman, K. (2019). Co-separating sounds of visual objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3879–3888. DOI: 10.1109/ICCV.2019.00398
- Gillet, O., and Richard, G. (2006). ENST-Drums: An extensive audio-visual database for drum signals processing. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 156–159.
- Grell, A., Sundberg, J., Ternström, S., Ptok, M., and Altenmüller, E. (2009). Rapid pitch correction in choir singers. The Journal of the Acoustical Society of America, 126(1): 407–413. DOI: 10.1121/1.3147508
- Hennequin, R., Khlif, A., Voituret, F., and Moussallam, M. (2019). Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models. Late-Breaking Demo, International Society for Music Information Retrieval Conference (ISMIR). DOI: 10.21105/joss.02154
- Hou, J.-C., Wang, S.-S., Lai, Y.-H., Tsao, Y., Chang, H.-W., and Wang, H.-M. (2018). Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2): 117–128. DOI: 10.1109/TETCI.2017.2784878
- Hsu, C.-L., Wang, D., Jang, J.-S. R., and Hu, K. (2012). A tandem algorithm for singing pitch extraction and voice separation from music accompaniment. IEEE Transactions on Audio, Speech, and Language Processing, 20(5): 1482–1491. DOI: 10.1109/TASL.2011.2182510
- Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708. DOI: 10.1109/CVPR.2017.243
- Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. (2012). Singing-voice separation from monaural recordings using robust principal component analysis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 57–60. DOI: 10.1109/ICASSP.2012.6287816
- Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. (2014). Singing-voice separation from monaural recordings using deep recurrent neural networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 477–482.
- Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., and Weyde, T. (2017). Singing voice separation with deep U-Net convolutional networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).
- King, D. E. (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(Jul.): 1755–1758.
- Li, B., Dinesh, K., Duan, Z., and Sharma, G. (2017a). See and listen: Score-informed association of sound tracks to players in chamber music performance videos. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2906–2910. DOI: 10.1109/ICASSP.2017.7952688
- Li, B., Dinesh, K., Sharma, G., and Duan, Z. (2017b). Video-based vibrato detection and analysis for polyphonic string music. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 123–130.
- Li, B., Dinesh, K., Xu, C., Sharma, G., and Duan, Z. (2019a). Online audio-visual source association for chamber music performances. Transactions of the International Society for Music Information Retrieval, 2(1). DOI: 10.5334/tismir.25
- Li, B., and Kumar, A. (2019). Query by video: Crossmodal music retrieval. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 604–611.
- Li, B., Liu, X., Dinesh, K., Duan, Z., and Sharma, G. (2019b). Creating a music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2): 522–535. DOI: 10.1109/TMM.2018.2856090
- Li, B., Maezawa, A., and Duan, Z. (2018). Skeleton plays piano: Online generation of pianist body movements from MIDI performance. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).
- Li, B., Xu, C., and Duan, Z. (2017c). Audiovisual source association for string ensembles through multi-modal vibrato analysis. In Proceedings of the Sound and Music Computing (SMC) Conference, pages 159–166.
- Liu, J.-Y., and Yang, Y.-H. (2018). Denoising autoencoder with recurrent skip connections and residual regression for music source separation. In Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pages 773–778. DOI: 10.1109/ICMLA.2018.00123
- Lluis, F., Pons, J., and Serra, X. (2019). End-to-end music source separation: Is it possible in the waveform domain? In Proceedings of the International Conference on Spoken Language Processing (Interspeech). DOI: 10.21437/Interspeech.2019-1177
- Lu, R., Duan, Z., and Zhang, C. (2018). Listen and look: Audio–visual matching assisted speech source separation. IEEE Signal Processing Letters, 25(9): 1315–1319. DOI: 10.1109/LSP.2018.2853566
- Lu, R., Duan, Z., and Zhang, C. (2019). Audio–visual deep clustering for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(11): 1697–1712. DOI: 10.1109/TASLP.2019.2928140
- Luo, Y., Chen, Z., Hershey, J. R., Le Roux, J., and Mesgarani, N. (2017). Deep clustering and conventional networks for music separation: Stronger together. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 61–65. DOI: 10.1109/ICASSP.2017.7952118
- Luo, Y., and Mesgarani, N. (2018). TasNet: Time-domain audio separation network for real-time, single-channel speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 696–700. DOI: 10.1109/ICASSP.2018.8462116
- Mesaros, A., and Virtanen, T. (2010). Recognition of phonemes and words in singing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2146–2149. DOI: 10.1109/ICASSP.2010.5495585
- Ozerov, A., Philippe, P., Bimbot, F., and Gribonval, R. (2007). Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Transactions on Audio, Speech, and Language Processing, 15(5): 1564–1578. DOI: 10.1109/TASL.2007.899291
- Ozerov, A., Philippe, P., Gribonval, R., and Bimbot, F. (2005). One microphone singing voice separation using source-adapted models. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 90–93. DOI: 10.1109/ASPAA.2005.1540176
- Parekh, S., Essid, S., Ozerov, A., Duong, N., Perez, P., and Richard, G. (2017). Motion informed audio source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6–10. DOI: 10.1109/ICASSP.2017.7951787
- Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., and Pantic, M. (2018). End-to-end audiovisual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6548–6552. DOI: 10.1109/ICASSP.2018.8461326
- Rafii, Z., and Pardo, B. (2011). A simple music/voice separation method based on the extraction of the repeating musical structure. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 221–224. DOI: 10.1109/ICASSP.2011.5946380
- Song, X., Kong, Q., Du, X., and Wang, Y. (2021). CatNet: Music source separation system with mix-audio augmentation. arXiv preprint arXiv:2102.09966.
- Stoller, D., Ewert, S., and Dixon, S. (2018). Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 334–340.
- Stöter, F.-R., Liutkus, A., and Ito, N. (2018). The 2018 signal separation evaluation campaign. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, pages 293–305. DOI: 10.1007/978-3-319-93764-9_28
- Stöter, F.-R., Uhlich, S., Liutkus, A., and Mitsufuji, Y. (2019). Open-Unmix: A reference implementation for music source separation. Journal of Open Source Software. DOI: 10.21105/joss.01667
- Takahashi, N., Agrawal, P., Goswami, N., and Mitsufuji, Y. (2018a). PhaseNet: Discretized phase modeling with deep neural networks for audio source separation. In Proceedings of the International Conference on Spoken Language Processing (Interspeech), pages 2713–2717. DOI: 10.21437/Interspeech.2018-1773
- Takahashi, N., Goswami, N., and Mitsufuji, Y. (2018b). MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. In Proceedings of the International Workshop on Acoustic Signal Enhancement (IWAENC), pages 106–110. DOI: 10.1109/IWAENC.2018.8521383
- Takahashi, N., and Mitsufuji, Y. (2017). Multi-scale multi-band DenseNets for audio source separation. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 21–25. DOI: 10.1109/WASPAA.2017.8169987
- Takahashi, N., and Mitsufuji, Y. (2021). D3Net: Densely connected multidilated DenseNet for music source separation. arXiv preprint arXiv:2010.01733.
- Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. DOI: 10.1109/ICCV.2015.510
- Tsai, W.-H., Ma, C.-H., and Hsu, Y.-P. (2015). Automatic singing performance evaluation using accompanied vocals as reference bases. Journal of Information Science and Engineering, 31(3): 821–838.
- Tzinis, E., Wisdom, S., Jansen, A., Hershey, S., Remez, T., Ellis, D. P., and Hershey, J. R. (2021a). Into the wild with audioscope: Unsupervised audio-visual separation of on-screen sounds. In Proceedings of the International Conference on Learning Representations (ICLR).
- Tzinis, E., Wisdom, S., Remez, T., and Hershey, J. R. (2021b). Improving on-screen sound separation for open domain videos with audio-visual self-attention. arXiv preprint arXiv:2106.09669.
- Uhlich, S., Porcu, M., Giron, F., Enenkl, M., Kemp, T., Takahashi, N., and Mitsufuji, Y. (2017). Improving music source separation based on deep neural networks through data augmentation and network blending. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 261–265. DOI: 10.1109/ICASSP.2017.7952158
- Vembu, S., and Baumann, S. (2005). Separation of vocals from polyphonic audio recordings. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 337–344.
- Vincent, E., Virtanen, T., and Gannot, S. (2018). Audio Source Separation and Speech Enhancement. John Wiley & Sons. DOI: 10.1002/9781119279860
- Zadeh, A., Ma, T., Poria, S., and Morency, L.-P. (2019). WildMix Dataset and Spectro-Temporal Transformer model for monoaural audio source separation. arXiv preprint arXiv:1911.09783.
- Zhao, H., Gan, C., Ma, W.-C., and Torralba, A. (2019). The sound of motions. In Proceedings of the International Conference on Computer Vision (ICCV), pages 1735–1744. DOI: 10.1109/ICCV.2019.00182
- Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., and Torralba, A. (2018). The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), volume 1, pages 587–604. DOI: 10.1007/978-3-030-01246-5_35
DOI: https://doi.org/10.5334/tismir.108 | Journal eISSN: 2514-3298
Language: English
Submitted on: Apr 6, 2021
Accepted on: Sep 13, 2021
Published on: Nov 25, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year
© 2021 Bochen Li, Yuxuan Wang, Zhiyao Duan, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
