
Figures & Tables

Figure 1

Comparison of validation loss when training the same model on a small dataset (red), on a large dataset with errors (purple), and on the same large dataset after the errors have been corrected (green). All experiments were evaluated on the same validation set.

Figure 2

Statistics collected during our internal data cleaning activity. Values in each row are normalized to sum to 1. For example, among all the labeling errors found in our internal data, a guitar was mislabeled as bass 32% of the time.

Figure 3

The process of cleaning the stems of one song in the noisy dataset using the proposed robust baseline model. We propose two different methods: filtered and redistributed.
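The two cleaning variants can be illustrated with a toy sketch. Everything below is our own illustration, not the authors' exact procedure: the stem's true content is guessed by running a separator over it and picking the source with the most energy; "filtered" then drops mislabeled stems, while "redistributed" moves them to the predicted label.

```python
import numpy as np

def clean_stems(stems, separate, mode="filtered"):
    """Toy cleaning pass over one song's labeled stems.

    stems    -- dict: label -> waveform (np.ndarray)
    separate -- callable: waveform -> dict of label -> estimated source
    mode     -- "filtered" drops mislabeled stems;
                "redistributed" reassigns them to the predicted label
    """
    cleaned = {}
    for label, audio in stems.items():
        estimates = separate(audio)
        # Energy-based guess of what the stem actually contains.
        predicted = max(estimates, key=lambda k: float(np.sum(estimates[k] ** 2)))
        if predicted == label:
            cleaned[label] = audio
        elif mode == "redistributed":
            cleaned[predicted] = cleaned.get(predicted, 0) + audio
        # In "filtered" mode, a mislabeled stem is simply discarded.
    return cleaned
```

A stem labeled "bass" that the separator identifies as drums is dropped under "filtered" and relabeled "drums" under "redistributed".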

Table 1

Final LabelNoise leaderboard (models trained only on SDXDB23_LabelNoise; top-5).

Submissions (all SDR values are global SDR in dB):

| Rank | Participant | Prize | Mean | Bass | Drums | Other | Vocals | 1st-phase subs. | 2nd-phase subs. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | CCOM | 1st | 7.46 | 8.12 | 7.99 | 5.34 | 8.37 | 7 | 50 |
| 2 | subatomicseer | 2nd | 6.60 | 6.67 | 7.03 | 4.61 | 8.07 | 65 | 33 |
| 3 | kuielab | 3rd | 6.51 | 6.71 | 6.71 | 4.82 | 7.82 | 99 | 25 |
| 4 | aim-less | – | 6.44 | 6.75 | 7.19 | 4.56 | 7.28 | 10 | 22 |
| 5 | yang_tong | – | 6.33 | 6.29 | 7.46 | 3.94 | 7.65 | – | 2 |

Baselines:

| Model | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- |
| UMX | 3.01 | 3.77 | 2.84 | 1.62 | 3.83 |
| Demucs | 4.84 | 5.55 | 5.68 | 2.89 | 5.23 |
| MDX-Net | 3.49 | 4.26 | 2.84 | 2.42 | 4.42 |
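All leaderboards report global SDR. Assuming the standard single log-ratio definition over the whole track (as used in the music demixing challenges, not the BSSEval variant), it can be computed as:

```python
import numpy as np

def global_sdr(reference, estimate, eps=1e-10):
    """Global SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    reference -- ground-truth source waveform
    estimate  -- separated estimate of the same source
    """
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))
```

Per-source scores are then averaged over songs; the exact aggregation into the Mean column is described in the paper.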
Table 2

Final Bleeding leaderboard (models trained only on SDXDB23_Bleeding; top-5).

Submissions (all SDR values are global SDR in dB):

| Rank | Participant | Prize | Mean | Bass | Drums | Other | Vocals | 1st-phase subs. | 2nd-phase subs. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | kuielab | 1st | 6.58 | 6.98 | 6.65 | 4.96 | 7.74 | 99 | 13 |
| 2 | ZFTurbo | 2nd | 6.38 | 6.94 | 6.86 | 4.62 | 7.12 | 3 | 24 |
| 3 | subatomicseer | 3rd | 6.31 | 6.33 | 6.86 | 4.59 | 7.47 | 65 | 11 |
| 4 | CCOM | – | 6.20 | 6.34 | 6.32 | 4.28 | 7.87 | 7 | 17 |
| 5 | alina_porechina | – | 5.87 | 6.01 | 6.10 | 4.09 | 7.30 | 99 | 118 |

Baselines:

| Model | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- |
| UMX | 3.61 | 3.90 | 3.85 | 2.50 | 4.17 |
| Demucs | 5.33 | 5.90 | 5.56 | 3.69 | 6.19 |
| MDX-Net | 3.56 | 4.00 | 2.30 | 2.65 | 5.29 |
Table 3

Final Standard leaderboard (models trained on any data; top-5).

Submissions (all SDR values are global SDR in dB):

| Rank | Participant | Prize | Mean | Bass | Drums | Other | Vocals | 1st-phase subs. | 2nd-phase subs. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | SAMI-ByteDance | – | 9.97 | 11.15 | 10.27 | 7.08 | 11.36 | 1 | 35 |
| 2 | ZFTurbo | 1st | 9.26 | 9.94 | 9.53 | 7.05 | 10.51 | 32 | 24 |
| 3 | kimberley_jensen | 2nd | 9.18 | 10.06 | 9.47 | 6.80 | 10.40 | 86 | 134 |
| 4 | kuielab | 3rd | 8.97 | 9.72 | 9.43 | 6.72 | 10.01 | 99 | 54 |
| 5 | alina_porechina | – | 8.63 | 9.92 | 9.29 | 6.23 | 9.07 | 99 | 172 |

Baselines:

| Model | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- |
| UMX-L | 6.52 | 6.62 | 6.84 | 4.89 | 7.73 |
| BSRNN | 6.14 | 5.63 | 6.53 | 4.43 | 7.98 |
| X-UMX-M | 6.30 | 5.85 | 6.87 | 4.42 | 8.04 |
Table 4

Results of our iterative refinement baseline. We use a source separation algorithm trained on corrupted data to improve the dataset: training the same model on the improved data increases the separation quality.

All values are global SDR in dB.

| Training data | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- |
| MoisesDB (203 songs) | | | | | |
| Original dataset | 4.43 | 4.65 | 5.06 | 3.02 | 5.00 |
| Improved dataset (redistributed) | 4.27 | 4.68 | 4.93 | 2.72 | 4.75 |
| Improved dataset (filtered) | 4.46 | 5.07 | 5.16 | 2.77 | 4.86 |
| SDXDB23_LabelNoise | | | | | |
| Original dataset | 3.01 | 3.76 | 2.83 | 1.62 | 3.82 |
| Improved dataset (redistributed) | 3.44 | 4.00 | 3.81 | 1.86 | 4.08 |
| Improved dataset (filtered) | 3.90 | 4.57 | 4.57 | 2.22 | 4.25 |
| SDXDB23_Bleeding | | | | | |
| Original dataset | 3.60 | 3.90 | 3.84 | 2.50 | 4.17 |
| Improved dataset (redistributed) | 3.59 | 3.73 | 4.07 | 2.40 | 4.17 |
| Improved dataset (filtered) | 4.09 | 4.65 | 4.76 | 2.52 | 4.44 |
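The iterative refinement of Table 4 reduces to a simple loop: train on the (possibly corrupted) data, use the resulting model to clean the dataset, then retrain on the cleaned data. The skeleton below is our own generic sketch; `train` and `clean` are hypothetical placeholders for the actual training and stem-cleaning steps.

```python
def iterative_refinement(dataset, train, clean, rounds=1):
    """Train a separator on corrupted data, use it to improve the
    dataset, then retrain on the improved data.

    train -- callable: dataset -> model
    clean -- callable: (model, dataset) -> improved dataset
    """
    model = train(dataset)  # baseline model on noisy data
    for _ in range(rounds):
        dataset = clean(model, dataset)  # filter or redistribute stems
        model = train(dataset)           # retrain on cleaned data
    return model, dataset
```

With toy stand-ins (e.g. `train` returning the mean of a list and `clean` dropping values above it), one round already shrinks the dataset toward its consistent core, mirroring how filtering removes corrupted stems.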
Table 5

(Team ZFTurbo) Separation performance when varying the number of shifts and the overlap ratio during HTDemucs inference. Increasing either leads to higher performance, with marginal improvements for very high parameter values.

All values are global SDR in dB.

| Shifts | Overlap ratio | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | 0.50 | 9.43 | 12.15 | 11.35 | 5.81 | 8.40 |
| 4 | 0.75 | 9.47 | 12.22 | 11.40 | 5.84 | 8.41 |
| 1 | 0.95 | 9.48 | 12.24 | 11.41 | 5.84 | 8.43 |
| 10 | 0.95 | 9.49 | 12.25 | 11.41 | 5.85 | 8.43 |
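Shift-and-overlap inference can be sketched generically: the mixture is split into overlapping windows, each window is passed through the model, the overlapping estimates are averaged, and the whole procedure is repeated for several random time shifts whose outputs are averaged again. This is a self-contained illustration of the idea, not HTDemucs's exact implementation (which in the Demucs library is exposed via the `shifts` and `overlap` arguments of `apply_model`).

```python
import numpy as np

def overlapped_inference(model, mix, chunk=4096, overlap=0.75, shifts=2, seed=0):
    """Average a chunked separator over `shifts` random time offsets.

    model -- callable mapping a 1-D chunk to a same-length estimate
    """
    rng = np.random.default_rng(seed)
    hop = max(1, int(chunk * (1 - overlap)))
    out = np.zeros_like(mix, dtype=float)
    for _ in range(max(1, shifts)):
        offset = int(rng.integers(0, hop)) if shifts > 1 else 0
        shifted = np.roll(mix, -offset)
        est = np.zeros_like(shifted, dtype=float)
        weight = np.zeros_like(shifted, dtype=float)
        for start in range(0, len(shifted), hop):
            seg = shifted[start:start + chunk]
            est[start:start + len(seg)] += model(seg)
            weight[start:start + len(seg)] += 1.0
        # Normalize overlapping contributions, then undo the shift.
        out += np.roll(est / np.maximum(weight, 1.0), offset)
    return out / max(1, shifts)
```

With an identity "model" the procedure reconstructs the input exactly, which is a useful sanity check before plugging in a real separator.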
Table 6

(Team ZFTurbo) SDR scores for the final ensemble on our MultiSong dataset (Solovyev et al., 2023) and on MDXDB21. We report separately the scores visible during the competition (only on 18 songs) and at the end (on 27 songs).

All values are global SDR in dB.

| Dataset | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- |
| MultiSong MVSep | 10.11 | 12.68 | 11.68 | 6.67 | 9.62 |
| MDXDB21 (18 songs) | 9.41 | 9.87 | 9.52 | 7.43 | 10.81 |
| MDXDB21 (27 songs) | 9.25 | 9.94 | 9.53 | 7.05 | 10.51 |
Table 7

(Team subatomicseer) Our scores on the LabelNoise leaderboard.

All values are global SDR in dB.

| Model (Mean Teacher loss) | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- |
| WHTDemucs (V1) | 5.93 | 6.41 | 5.73 | 4.42 | 7.17 |
| DTUNet (V2) | 5.93 | 5.84 | 6.71 | 4.10 | 7.08 |
| Blend | 6.60 | 6.70 | 7.03 | 4.61 | 8.07 |
Table 8

(Team subatomicseer) Our scores on the Bleeding leaderboard.

All values are global SDR in dB.

| Model (Mean Teacher) | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- |
| WHTDemucs (V1) | 5.86 | 5.90 | 5.61 | 4.68 | 7.25 |
| DTUNet (V2) | 5.62 | 5.37 | 6.18 | 3.92 | 7.00 |
| Blend | 6.31 | 6.33 | 6.86 | 4.59 | 7.47 |
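The "Blend" rows in Tables 7 and 8 combine the two models' estimates. A common way to do this is a per-source weighted average of the waveform estimates; the sketch below is our own illustration, and the default 0.5 weights are illustrative, not the team's tuned values.

```python
import numpy as np

def blend(estimates_a, estimates_b, weights=None):
    """Per-source weighted average of two models' waveform estimates.

    estimates_a, estimates_b -- dict: source -> waveform
    weights -- dict: source -> weight for model A (default 0.5 everywhere)
    """
    weights = weights or {}
    blended = {}
    for source in estimates_a:
        w = weights.get(source, 0.5)
        blended[source] = w * estimates_a[source] + (1 - w) * estimates_b[source]
    return blended
```

Per-source weights let an ensemble lean on whichever model is stronger for each instrument, which is consistent with the blend outperforming both individual models in the tables.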
Table 9

(Team subatomicseer) Performance of the individual models of our ensemble on our validation set. Please note that HTDemucs is trained with more data than our internal models.

All values are global SDR in dB.

| Model (Training songs) | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- |
| DTUNet (347) | 8.79 | 8.75 | 10.65 | 6.76 | 8.99 |
| BSRNN (347) | 8.65 | 8.06 | 10.80 | 6.38 | 9.37 |
| HTDemucs (800) | 9.19 | 9.68 | 10.76 | 7.17 | 9.15 |
Table 10

(Team CCOM) Performance of HTDemucs using our approach. The baseline is trained on SDXDB23_LabelNoise; we then train a model using loss truncation alone. We use this model to filter the dataset (denoted 1st in the table) and train a new model. Finally, we repeat the dataset filtering (denoted 2nd) and fine-tune the model to obtain the best performance.

All values are global SDR in dB.

| Training setup | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- |
| Baseline | 4.96 | 5.07 | 5.76 | 3.14 | 5.85 |
| With loss truncation | 6.26 | 6.94 | 6.62 | 4.45 | 7.09 |
| With filtered data (1st) | 6.89 | 7.34 | 7.58 | 4.88 | 7.74 |
| With filtered data (2nd) | 7.46 | 8.12 | 7.99 | 5.34 | 8.37 |
Table 11

(Team kuielab) Ablation study on loss truncation. Please note that these are the scores of an individual TFC-TDF-UNet v3 model, not of the final ensemble.

All values are global SDR in dB.

| Task | Loss truncation | Mean | Bass | Drums | Other | Vocals |
| --- | --- | --- | --- | --- | --- | --- |
| LabelNoise | No | 5.05 | 5.31 | 5.31 | 3.45 | 6.12 |
| LabelNoise | Yes | 6.26 | 6.43 | 6.38 | 4.64 | 7.58 |
| Bleeding | No | 5.80 | 6.11 | 5.86 | 4.36 | 6.87 |
| Bleeding | Yes | 6.22 | 6.58 | 6.20 | 4.69 | 7.41 |
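Loss truncation (Tables 10 and 11) discards the largest per-example losses before averaging, on the assumption that unusually high losses come from mislabeled or bleeding training examples rather than from model error. A minimal sketch (the drop fraction here is illustrative, not the teams' tuned value):

```python
import numpy as np

def truncated_loss(per_example_losses, drop_fraction=0.25):
    """Mean loss after discarding the largest `drop_fraction` of the
    per-example losses, assumed to correspond to corrupted labels."""
    losses = np.sort(np.asarray(per_example_losses, dtype=float))
    keep = max(1, int(round(len(losses) * (1 - drop_fraction))))
    return float(losses[:keep].mean())
```

During training, gradients then flow only through the kept examples, so a batch containing a mislabeled stem with an outlier loss no longer dominates the update.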
Table 12

(Team kuielab) Comparison of TFC-TDF-UNets v2 and v3 on the MUSDB18-HQ benchmark. Speed denotes the relative GPU inference speed with respect to real-time on the challenge evaluation server.

SDR values are global SDR in dB.

| Model | Mean | Bass | Drums | Other | Vocals | Speed |
| --- | --- | --- | --- | --- | --- | --- |
| v2 | 7.03 | 6.85 | 6.87 | 5.44 | 8.96 | 12.8x |
| v3 | 7.90 | 7.36 | 8.81 | 6.19 | 9.22 | 15.0x |
Figure 4

Results of the listening test.

Figure 5

Results of the listening test by assessor category.

Figure 6

Results of the listening test on bass removal and extraction.

Figure 7

Results of the listening test on drum removal and extraction.

Figure 8

Results of the listening test on other removal and extraction.

Figure 9

Results of the listening test on vocal removal and extraction.

Table 13

Final ranking obtained with TrueSkill. We used the default parameters for each player (μ = 25 and σ = 8.33). We report the average SDR score on the Standard leaderboard for reference.

| Rank | Model | μ | σ | SDR (Mean) |
| --- | --- | --- | --- | --- |
| 1 | kimberley_jensen | 24.793 | 0.779 | 9.18 |
| 2 | ZFTurbo | 24.362 | 0.779 | 9.26 |
| 3 | SAMI-ByteDance | 24.011 | 0.779 | 9.97 |
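TrueSkill models each player's skill as a Gaussian (μ, σ) and updates it after every pairwise comparison. The sketch below implements the basic win/loss update with only the standard library, using the default β = σ₀/2 and no draw handling; the `trueskill` Python package implements the full model used for the ranking.

```python
import math

MU0, SIGMA0 = 25.0, 25.0 / 3.0   # defaults reported in Table 13
BETA = SIGMA0 / 2.0              # performance noise

def _v(t):
    """N(t) / Phi(t): mean shift of a truncated Gaussian."""
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(t / math.sqrt(2)))
    return pdf / cdf

def update(winner, loser):
    """One TrueSkill update for a win/loss outcome (no draws).

    winner, loser -- (mu, sigma) tuples; returns the updated tuples.
    """
    (mw, sw), (ml, sl) = winner, loser
    c = math.sqrt(2 * BETA ** 2 + sw ** 2 + sl ** 2)
    t = (mw - ml) / c
    v = _v(t)
    w = v * (v + t)  # variance-reduction factor
    mw_new = mw + sw ** 2 / c * v
    ml_new = ml - sl ** 2 / c * v
    sw_new = sw * math.sqrt(max(1e-9, 1 - sw ** 2 / c ** 2 * w))
    sl_new = sl * math.sqrt(max(1e-9, 1 - sl ** 2 / c ** 2 * w))
    return (mw_new, sw_new), (ml_new, sl_new)
```

Starting both players at the defaults (μ = 25, σ = 8.33), a single win raises the winner's μ, lowers the loser's, and shrinks both uncertainties; repeated comparisons yield the μ and σ values in Table 13.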
Figure 10

Correlation between the SDR scores and the results of the listening test.
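A correlation like the one in Figure 10 can be quantified with, for example, Pearson's r between per-model SDR scores and listening-test ratings; the function below is a generic sketch, and any inputs shown are placeholders, not the paper's data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])
```

A rank correlation (Spearman's ρ, i.e. Pearson's r on the ranks) is often preferred when only the ordering of systems matters.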

DOI: https://doi.org/10.5334/tismir.171 | Journal eISSN: 2514-3298
Language: English
Submitted on: Aug 22, 2023
Accepted on: Feb 13, 2024
Published on: Apr 18, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Giorgio Fabbro, Stefan Uhlich, Chieh-Hsin Lai, Woosung Choi, Marco Martínez-Ramírez, Weihsiang Liao, Igor Gadelha, Geraldo Ramos, Eddie Hsu, Hugo Rodrigues, Fabian-Robert Stöter, Alexandre Défossez, Yi Luo, Jianwei Yu, Dipam Chakraborty, Sharada Mohanty, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Nabarun Goswami, Tatsuya Harada, Minseok Kim, Jun Hyung Lee, Yuanliang Dong, Xinran Zhang, Jiafeng Liu, Yuki Mitsufuji, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.