Audiovisual Singing Voice Separation

Figures & Tables

Figure 1

(a) The audio subnetwork. Downsample/upsample are applied to both time and frequency dimensions in the outer layers (marked by *), while they are only applied to the frequency dimension in the inner layers. (b) The video subnetwork. (c) The audiovisual fusion.

Table 1

Comparison of model size of different methods.

Method          # Parameters (×10^6)
UMX             8.5
Spleeter        19.7
Demucs          38
MMDenseLSTM     1.22
AVDCNN          11.3
Proposed        2.05

Figure 2

A sample photo and floor plan of the sound booth for the recording process of the URSing dataset.

Figure 3

Examples of video frames of the URSing dataset and cropped mouth region pictures as the input to the video branch of the proposed method.

Figure 4

The SDR (dB) comparison on separated solo vocals with different methods on different evaluation sets. (“v+” denotes songs where accompaniments contain vocal components.)
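
For readers unfamiliar with the metric in this caption, the following is a minimal sketch of the basic energy-ratio definition of signal-to-distortion ratio, SDR = 10 log10(||s||^2 / ||s - ŝ||^2), where s is the ground-truth vocal and ŝ is the separated estimate. It is an illustration only: the evaluation behind the figure may well use a BSS Eval style toolkit, whose SDR includes additional projection steps. The function name `sdr_db`, the `eps` guard, and the toy sine-wave signals are assumptions made for this example and do not come from the article.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Basic signal-to-distortion ratio in dB for two equal-length sample sequences.

    Assumption: plain energy-ratio SDR, not the BSS Eval variant.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Energy of the ground-truth signal.
    signal_energy = sum(r * r for r in reference)
    # Energy of the residual between ground truth and estimate.
    distortion_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps (hypothetical guard value) avoids division by zero and log of zero.
    return 10.0 * math.log10((signal_energy + eps) / (distortion_energy + eps))


# Toy signals: a sine wave as the "ground truth" and a scaled copy as the "estimate".
reference = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
estimate = [0.9 * r for r in reference]

print(round(sdr_db(reference, estimate), 2))
```

A higher value means the estimate is closer to the ground truth; an estimate that is a scaled copy of the reference, as in the toy case, is penalized only for the amplitude mismatch.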

Figure 5

A 10-second example comparing vocal separation results from different methods on a song excerpt with strong backing vocals from the Audition-RandMix dataset. The four spectrograms, from top to bottom, are the original mixture, the ground-truth vocal, the audio-based vocal separation result from Takahashi et al. (2018b), and the audiovisual vocal separation result from the proposed method. One mouth frame is shown for each second.

Figure 6

The SDR (dB) comparison on the separated solo vocal from the audiovisual method using different video front-end models.

Figure 7

One sample frame of an a cappella song for subjective evaluation.

Figure 8

Statistics of the 26 subjects’ musical background related to the subjective evaluation.

Figure 9

The subjective ratings of the separation quality in response to the three questions. Each error bar shows mean ± standard deviation.

Figure 10

The SDR (dB) comparison on the separated solo vocal of the proposed audiovisual method with non-informative visual inputs.

DOI: https://doi.org/10.5334/tismir.108 | Journal eISSN: 2514-3298
Language: English
Submitted on: Apr 6, 2021
Accepted on: Sep 13, 2021
Published on: Nov 25, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2021 Bochen Li, Yuxuan Wang, Zhiyao Duan, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.