
Figure 1
(a) The audio subnetwork. Downsampling/upsampling is applied to both the time and frequency dimensions in the outer layers (marked by *), and only to the frequency dimension in the inner layers. (b) The video subnetwork. (c) The audiovisual fusion.
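As a rough illustration of this down-sampling pattern, here is a minimal PyTorch sketch (not the authors' implementation; the layer widths and the (batch, channels, frequency, time) input layout are assumptions):

```python
import torch
import torch.nn as nn

# Minimal sketch of the down-sampling pattern in Figure 1(a).
# Hypothetical layer widths; input is a spectrogram tensor of shape
# (batch, channels, frequency, time).
class AudioEncoderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Outer layer (*): stride 2 in both frequency and time.
        self.outer = nn.Conv2d(1, 16, kernel_size=3, stride=(2, 2), padding=1)
        # Inner layer: stride 2 in frequency only; time resolution is kept.
        self.inner = nn.Conv2d(16, 32, kernel_size=3, stride=(2, 1), padding=1)

    def forward(self, x):
        x = torch.relu(self.outer(x))  # halves both F and T
        x = torch.relu(self.inner(x))  # halves F only
        return x

spec = torch.randn(1, 1, 512, 256)        # (B, C, F, T)
print(AudioEncoderSketch()(spec).shape)   # torch.Size([1, 32, 128, 128])
```

The corresponding up-sampling path would mirror these strides with transposed convolutions.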
Table 1
Comparison of the model sizes of different methods.
| Method | # Parameters (×10⁶) |
|---|---|
| UMX | 8.5 |
| Spleeter | 19.7 |
| Demucs | 38 |
| MMDenseLSTM | 1.22 |
| AVDCNN | 11.3 |
| Proposed | 2.05 |
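For reference, parameter counts like those in Table 1 are commonly obtained by summing the sizes of all trainable tensors; a minimal PyTorch sketch with a hypothetical stand-in model:

```python
import torch.nn as nn

def count_params(model: nn.Module) -> float:
    """Return the number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Hypothetical stand-in; substitute any of the networks listed in Table 1.
model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 1024))
print(f"{count_params(model):.2f}M parameters")  # ~1.05M for this toy model
```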

Figure 2
A sample photo and the floor plan of the sound booth used to record the URSing dataset.

Figure 3
Example video frames from the URSing dataset and the cropped mouth-region images used as input to the video branch of the proposed method.
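A minimal sketch of such a mouth-region crop, assuming landmark coordinates are already available from a face-alignment tool (the file name, coordinates, and crop size are hypothetical; this is not the paper's exact preprocessing pipeline):

```python
from PIL import Image

# Hypothetical mouth-region crop around landmark coordinates produced by
# any face-alignment tool; values below are placeholders.
frame = Image.open("frame_0001.png")      # hypothetical frame file
cx, cy = 320, 400                         # mouth center from a landmark detector
half = 48                                 # half-width of the crop box
mouth = frame.crop((cx - half, cy - half, cx + half, cy + half))
mouth = mouth.resize((96, 96))            # fixed-size input to the video branch
```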

Figure 4
The SDR (dB) comparison of solo vocals separated by different methods on different evaluation sets. (“v+” denotes songs whose accompaniments contain vocal components.)
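SDR figures of this kind are typically computed with BSS Eval; a minimal sketch using mir_eval on synthetic waveforms (the paper's exact evaluation toolkit is an assumption, not confirmed here):

```python
import numpy as np
import mir_eval.separation

# Synthetic reference/estimate pair; arrays are (n_sources, n_samples).
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 44100))                   # ground-truth vocal
estimate = reference + 0.1 * rng.standard_normal((1, 44100))  # noisy estimate

sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(reference, estimate)
print(f"SDR = {sdr[0]:.2f} dB")
```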

Figure 5
A 10-second example comparing vocal separation results from different methods on a song excerpt with strong backing vocals from the Audition-RandMix dataset. The four spectrograms, from top to bottom, show the original mixture, the ground-truth vocal, the audio-based separation result of Takahashi et al. (2018b), and the audiovisual separation result of the proposed method. One mouth frame is shown for each second.

Figure 6
The SDR (dB) comparison of solo vocals separated by the audiovisual method using different video front-end models.

Figure 7
A sample frame of an a cappella song used in the subjective evaluation.

Figure 8
Statistics of the musical backgrounds of the 26 subjects who participated in the subjective evaluation.

Figure 9
The subjective ratings of separation quality in response to the three questions. Each error bar shows the mean ± one standard deviation.

Figure 10
The SDR (dB) comparison of solo vocals separated by the proposed audiovisual method when given non-informative visual inputs.
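Non-informative visual inputs of this kind can be simulated by, for example, zeroing, temporally shuffling, or randomizing the video stream while leaving the audio unchanged; a minimal sketch (the tensor shape and ablation choices are assumptions):

```python
import torch

# Hypothetical video-stream ablations for an experiment like Figure 10.
video = torch.randn(1, 75, 1, 112, 112)  # (B, frames, C, H, W), assumed shape

zeroed = torch.zeros_like(video)                     # blank frames
shuffled = video[:, torch.randperm(video.shape[1])]  # temporally shuffled frames
noise = torch.randn_like(video)                      # random-noise frames
```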
