
Figure 1
(a) The audio subnetwork. Downsampling/upsampling is applied to both the time and frequency dimensions in the outer layers (marked by *), and only to the frequency dimension in the inner layers. (b) The video subnetwork. (c) The audiovisual fusion.
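As a rough illustration of this down-sampling pattern, here is a minimal PyTorch sketch (not the authors' implementation; the layer widths and the (batch, channels, frequency, time) input layout are assumptions):

```python
import torch
import torch.nn as nn

# Minimal sketch of the down-sampling pattern in Figure 1(a).
# Hypothetical layer widths; input is a spectrogram tensor of shape
# (batch, channels, frequency, time).
class AudioEncoderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Outer layer (*): stride 2 in both frequency and time.
        self.outer = nn.Conv2d(1, 16, kernel_size=3, stride=(2, 2), padding=1)
        # Inner layer: stride 2 in frequency only; time resolution is kept.
        self.inner = nn.Conv2d(16, 32, kernel_size=3, stride=(2, 1), padding=1)

    def forward(self, x):
        x = torch.relu(self.outer(x))  # halves both F and T
        x = torch.relu(self.inner(x))  # halves F only
        return x

spec = torch.randn(1, 1, 512, 256)        # (B, C, F, T)
print(AudioEncoderSketch()(spec).shape)   # torch.Size([1, 32, 128, 128])
```

The corresponding up-sampling path would mirror these strides with transposed convolutions.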
Table 1
Comparison of the model sizes of different methods.
| Method | # Parameters (×10⁶) |
|---|---|
| UMX | 8.5 |
| Spleeter | 19.7 |
| Demucs | 38 |
| MMDenseLSTM | 1.22 |
| AVDCNN | 11.3 |
| Proposed | 2.05 |
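For reference, parameter counts like those in Table 1 are commonly obtained by summing the sizes of all trainable tensors; a minimal PyTorch sketch with a hypothetical stand-in model:

```python
import torch.nn as nn

def count_params(model: nn.Module) -> float:
    """Return the number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Hypothetical stand-in; substitute any of the networks listed in Table 1.
model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 1024))
print(f"{count_params(model):.2f}M parameters")  # ~1.05M for this toy model
```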

Figure 2
A sample photo and the floor plan of the sound booth used to record the URSing dataset.

Figure 3
Example video frames from the URSing dataset and the cropped mouth-region images used as input to the video branch of the proposed method.
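A minimal sketch of such a mouth-region crop, assuming landmark coordinates are already available from a face-alignment tool (the file name, coordinates, and crop size are hypothetical; this is not the paper's exact preprocessing pipeline):

```python
from PIL import Image

# Hypothetical mouth-region crop around landmark coordinates produced by
# any face-alignment tool; values below are placeholders.
frame = Image.open("frame_0001.png")      # hypothetical frame file
cx, cy = 320, 400                         # mouth center from a landmark detector
half = 48                                 # half-width of the crop box
mouth = frame.crop((cx - half, cy - half, cx + half, cy + half))
mouth = mouth.resize((96, 96))            # fixed-size input to the video branch
```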

Figure 4
The SDR (dB) comparison of solo vocals separated by different methods on different evaluation sets. (“v+” denotes songs whose accompaniments contain vocal components.)
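SDR figures of this kind are typically computed with BSS Eval; a minimal sketch using mir_eval on synthetic waveforms (the paper's exact evaluation toolkit is an assumption, not confirmed here):

```python
import numpy as np
import mir_eval.separation

# Synthetic reference/estimate pair; arrays are (n_sources, n_samples).
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 44100))                   # ground-truth vocal
estimate = reference + 0.1 * rng.standard_normal((1, 44100))  # noisy estimate

sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(reference, estimate)
print(f"SDR = {sdr[0]:.2f} dB")
```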

Figure 5
A 10-second example comparing vocal separation results from different methods on a song excerpt with strong backing vocals from the Audition-RandMix dataset. The four spectrograms, from top to bottom, show the original mixture, the ground-truth vocal, the audio-based separation result of Takahashi et al. (2018b), and the audiovisual separation result of the proposed method. One mouth frame is shown for each second.

Figure 6
The SDR (dB) comparison of solo vocals separated by the audiovisual method using different video front-end models.

Figure 7
A sample frame of an a cappella song used in the subjective evaluation.

Figure 8
Statistics of the musical backgrounds of the 26 subjects who participated in the subjective evaluation.

Figure 9
The subjective ratings of separation quality in response to the three questions. Each error bar shows the mean ± one standard deviation.

Figure 10
The SDR (dB) comparison of solo vocals separated by the proposed audiovisual method when given non-informative visual inputs.
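Non-informative visual inputs of this kind can be simulated by, for example, zeroing, temporally shuffling, or randomizing the video stream while leaving the audio unchanged; a minimal sketch (the tensor shape and ablation choices are assumptions):

```python
import torch

# Hypothetical video-stream ablations for an experiment like Figure 10.
video = torch.randn(1, 75, 1, 112, 112)  # (B, frames, C, H, W), assumed shape

zeroed = torch.zeros_like(video)                     # blank frames
shuffled = video[:, torch.randperm(video.shape[1])]  # temporally shuffled frames
noise = torch.randn_like(video)                      # random-noise frames
```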
