
Figure 1
Comparison of validation loss when training the same model on a small dataset (red), a large dataset with errors (purple) and the same large dataset once the errors have been corrected (green). All experiments were evaluated on the same validation set.

Figure 2
Statistics collected during our internal data cleaning. Each row is normalized to sum to 1: for example, among all the labeling errors we found in our internal data, a guitar was mislabeled as bass 32% of the time.
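
A minimal sketch of that row normalization, assuming a hypothetical error-count matrix (the values below are made up, not the statistics in the figure):

```python
import numpy as np

# Hypothetical counts of labeling errors: rows are the true instrument,
# columns the (wrong) label it received. Values are illustrative only.
error_counts = np.array([
    [0, 8, 12, 5],
    [16, 0, 21, 13],
    [3, 7, 0, 9],
    [6, 11, 4, 0],
], dtype=float)

# Normalize each row to sum to 1, as in Figure 2.
row_normalized = error_counts / error_counts.sum(axis=1, keepdims=True)
```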

Figure 3
The process of cleaning the stems of one song in the noisy dataset using the proposed robust baseline model. We propose two cleaning methods: filtered and redistributed.
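
A minimal sketch of the two strategies under stated assumptions: `separate` is a hypothetical callable returning one estimate per source, and the dominance threshold of 0.5 is illustrative rather than the baseline's exact rule:

```python
import numpy as np

SOURCES = ["bass", "drums", "other", "vocals"]

def clean_song(stems, model, separate, threshold=0.5, method="filtered"):
    """Clean the labeled stems of one song with a robust baseline model.

    stems: dict mapping a (possibly wrong) label to a waveform.
    separate: hypothetical callable (model, audio) -> dict of per-source
    estimates. The energy criterion and threshold are illustrative.
    """
    cleaned = {source: [] for source in SOURCES}
    for label, audio in stems.items():
        estimates = separate(model, audio)
        energies = {s: float(np.sum(e ** 2)) for s, e in estimates.items()}
        total = sum(energies.values()) + 1e-12
        if method == "filtered":
            # Keep the stem only if its label matches the dominant source.
            if energies[label] / total >= threshold:
                cleaned[label].append(audio)
        else:  # "redistributed"
            # Assign each separated component to its predicted source.
            for source, estimate in estimates.items():
                cleaned[source].append(estimate)
    return cleaned
```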
Table 1
Final LabelNoise leaderboard (models trained only on SDXDB23_LabelNoise; top 5). Mean, Bass, Drums, Other, and Vocals report global SDR (dB); the last two columns report the number of submissions to LabelNoise in each phase.

| Rank | Participant | Prize | Mean | Bass | Drums | Other | Vocals | 1st phase | 2nd phase |
|------|---------------|-------|------|------|-------|-------|--------|-----------|-----------|
| **Submissions** | | | | | | | | | |
| 1 | CCOM | 1st | 7.46 | 8.12 | 7.99 | 5.34 | 8.37 | 7 | 50 |
| 2 | subatomicseer | 2nd | 6.60 | 6.67 | 7.03 | 4.61 | 8.07 | 65 | 33 |
| 3 | kuielab | 3rd | 6.51 | 6.71 | 6.71 | 4.82 | 7.82 | 99 | 25 |
| 4 | aim-less | | 6.44 | 6.75 | 7.19 | 4.56 | 7.28 | 10 | 22 |
| 5 | yang_tong | | 6.33 | 6.29 | 7.46 | 3.94 | 7.65 | - | 2 |
| **Baselines** | | | | | | | | | |
| | UMX | | 3.01 | 3.77 | 2.84 | 1.62 | 3.83 | | |
| | Demucs | | 4.84 | 5.55 | 5.68 | 2.89 | 5.23 | | |
| | MDX-Net | | 3.49 | 4.26 | 2.84 | 2.42 | 4.42 | | |
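
For reference, every SDR figure in these leaderboards is the challenge's global SDR: computed per source over the whole track, then averaged over the four sources of a song and over all songs. With $s$ the reference source, $\hat{s}$ its estimate, and a small $\epsilon$ added here for numerical stability (the challenge definition may differ in this detail):

$$\mathrm{SDR}(s,\hat{s}) = 10\log_{10}\frac{\sum_{n} s(n)^{2} + \epsilon}{\sum_{n}\big(s(n)-\hat{s}(n)\big)^{2} + \epsilon}$$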
Table 2
Final Bleeding leaderboard (models trained only on SDXDB23_Bleeding; top 5). Mean, Bass, Drums, Other, and Vocals report global SDR (dB); the last two columns report the number of submissions to Bleeding in each phase.

| Rank | Participant | Prize | Mean | Bass | Drums | Other | Vocals | 1st phase | 2nd phase |
|------|-----------------|-------|------|------|-------|-------|--------|-----------|-----------|
| **Submissions** | | | | | | | | | |
| 1 | kuielab | 1st | 6.58 | 6.98 | 6.65 | 4.96 | 7.74 | 99 | 13 |
| 2 | ZFTurbo | 2nd | 6.38 | 6.94 | 6.86 | 4.62 | 7.12 | 32 | 4 |
| 3 | subatomicseer | 3rd | 6.31 | 6.33 | 6.86 | 4.59 | 7.47 | 65 | 11 |
| 4 | CCOM | | 6.20 | 6.34 | 6.32 | 4.28 | 7.87 | 7 | 17 |
| 5 | alina_porechina | | 5.87 | 6.01 | 6.10 | 4.09 | 7.30 | 99 | 118 |
| **Baselines** | | | | | | | | | |
| | UMX | | 3.61 | 3.90 | 3.85 | 2.50 | 4.17 | | |
| | Demucs | | 5.33 | 5.90 | 5.56 | 3.69 | 6.19 | | |
| | MDX-Net | | 3.56 | 4.00 | 2.30 | 2.65 | 5.29 | | |
Table 3
Final Standard leaderboard (models trained on any data; top 5). Mean, Bass, Drums, Other, and Vocals report global SDR (dB); the last two columns report the number of submissions to Standard in each phase.

| Rank | Participant | Prize | Mean | Bass | Drums | Other | Vocals | 1st phase | 2nd phase |
|------|------------------|-------|------|------|-------|-------|--------|-----------|-----------|
| **Submissions** | | | | | | | | | |
| 1 | SAMI-ByteDance | | 9.97 | 11.15 | 10.27 | 7.08 | 11.36 | 13 | 5 |
| 2 | ZFTurbo | 1st | 9.26 | 9.94 | 9.53 | 7.05 | 10.51 | 32 | 24 |
| 3 | kimberley_jensen | 2nd | 9.18 | 10.06 | 9.47 | 6.80 | 10.40 | 86 | 134 |
| 4 | kuielab | 3rd | 8.97 | 9.72 | 9.43 | 6.72 | 10.01 | 99 | 54 |
| 5 | alina_porechina | | 8.63 | 9.92 | 9.29 | 6.23 | 9.07 | 99 | 172 |
| **Baselines** | | | | | | | | | |
| | UMX-L | | 6.52 | 6.62 | 6.84 | 4.89 | 7.73 | | |
| | BSRNN | | 6.14 | 5.63 | 6.53 | 4.43 | 7.98 | | |
| | X-UMX-M | | 6.30 | 5.85 | 6.87 | 4.42 | 8.04 | | |
Table 4
Results of our iterative refinement baseline (global SDR, dB). We use a source separation algorithm trained on corrupted data to improve the dataset: training the same model on the improved data increases the separation quality.

| Training data | Mean | Bass | Drums | Other | Vocals |
|---------------|------|------|-------|-------|--------|
| **MoisesDB (203 songs)** | | | | | |
| Original dataset | 4.43 | 4.65 | 5.06 | 3.02 | 5.00 |
| Improved dataset (redistributed) | 4.27 | 4.68 | 4.93 | 2.72 | 4.75 |
| Improved dataset (filtered) | 4.46 | 5.07 | 5.16 | 2.77 | 4.86 |
| **SDXDB23_LabelNoise** | | | | | |
| Original dataset | 3.01 | 3.76 | 2.83 | 1.62 | 3.82 |
| Improved dataset (redistributed) | 3.44 | 4.00 | 3.81 | 1.86 | 4.08 |
| Improved dataset (filtered) | 3.90 | 4.57 | 4.57 | 2.22 | 4.25 |
| **SDXDB23_Bleeding** | | | | | |
| Original dataset | 3.60 | 3.90 | 3.84 | 2.50 | 4.17 |
| Improved dataset (redistributed) | 3.59 | 3.73 | 4.07 | 2.40 | 4.17 |
| Improved dataset (filtered) | 4.09 | 4.65 | 4.76 | 2.52 | 4.44 |
Table 5
(Team ZFTurbo) Separation performance when varying the number of shifts and the overlap ratio during HTDemucs inference (global SDR, dB). Increasing both leads to higher performance, with marginal improvements for very high parameter values.

| Shifts | Overlap Ratio | Mean | Bass | Drums | Other | Vocals |
|--------|---------------|------|-------|-------|-------|--------|
| 2 | 0.5 | 9.43 | 12.15 | 11.35 | 5.81 | 8.40 |
| 4 | 0.75 | 9.47 | 12.22 | 11.40 | 5.84 | 8.41 |
| 1 | 0.95 | 9.48 | 12.24 | 11.41 | 5.84 | 8.43 |
| 10 | 0.95 | 9.49 | 12.25 | 11.41 | 5.85 | 8.43 |
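
The effect of these parameters can be reproduced with the public Demucs inference API. A minimal sketch, assuming a pretrained `htdemucs` checkpoint and a 44.1 kHz stereo `mixture.wav` (neither is necessarily the team's exact setup):

```python
import torch
import torchaudio
from demucs.pretrained import get_model
from demucs.apply import apply_model

model = get_model("htdemucs")  # pretrained Hybrid Transformer Demucs
model.eval()

wav, sr = torchaudio.load("mixture.wav")  # stereo mix at model.samplerate
mix = wav.unsqueeze(0)  # shape (batch, channels, time)

with torch.no_grad():
    # shifts: average predictions over random time shifts;
    # overlap: fraction of overlap between consecutive segments.
    sources = apply_model(model, mix, shifts=10, overlap=0.95, split=True)

for name, source in zip(model.sources, sources[0]):
    torchaudio.save(f"{name}.wav", source.cpu(), sr)
```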
Table 6
(Team ZFTurbo) SDR scores (global SDR, dB) of the final ensemble on our MultiSong dataset (Solovyev et al., 2023) and on MDXDB21. We separately report the scores visible during the competition (computed on 18 songs) and the final scores (computed on all 27 songs).

| Dataset | Mean | Bass | Drums | Other | Vocals |
|---------|------|------|-------|-------|--------|
| MultiSong MVSep | 10.11 | 12.68 | 11.68 | 6.67 | 9.62 |
| MDXDB21 (18 songs) | 9.41 | 9.87 | 9.52 | 7.43 | 10.81 |
| MDXDB21 (27 songs) | 9.25 | 9.94 | 9.53 | 7.05 | 10.51 |
Table 7
(Team subatomicseer) Our scores on the LabelNoise leaderboard. All columns report global SDR (dB).

| Model (Mean Teacher loss) | Mean | Bass | Drums | Other | Vocals |
|---------------------------|------|------|-------|-------|--------|
| WHTDemucs (V1) | 5.93 | 6.41 | 5.73 | 4.42 | 7.17 |
| DTUNet (V2) | 5.93 | 5.84 | 6.71 | 4.10 | 7.08 |
| Blend | 6.60 | 6.70 | 7.03 | 4.61 | 8.07 |
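
The Mean Teacher loss named in the table pairs a student with a teacher whose weights are an exponential moving average of the student's. A minimal sketch of the EMA update (the decay value is illustrative, not necessarily the team's):

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module,
                   ema_decay: float = 0.999) -> None:
    """Mean Teacher update: teacher weights track an exponential moving
    average of the student weights (ema_decay is illustrative)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)
```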
Table 8
(Team subatomicseer) Our scores on the Bleeding leaderboard. All columns report global SDR (dB).

| Model (Mean Teacher loss) | Mean | Bass | Drums | Other | Vocals |
|---------------------------|------|------|-------|-------|--------|
| WHTDemucs (V1) | 5.86 | 5.90 | 5.61 | 4.68 | 7.25 |
| DTUNet (V2) | 5.62 | 5.37 | 6.18 | 3.92 | 7.00 |
| Blend | 6.31 | 6.33 | 6.86 | 4.59 | 7.47 |
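
The Blend rows in Tables 7 and 8 combine the outputs of the two models. A minimal sketch, assuming waveform-domain averaging with an illustrative weight (the team's actual blending weights are not given here):

```python
import numpy as np

def blend(estimate_a: np.ndarray, estimate_b: np.ndarray,
          weight_a: float = 0.5) -> np.ndarray:
    """Blend two models' waveform estimates of the same source.

    estimate_a, estimate_b: arrays of shape (channels, time).
    weight_a = 0.5 gives a plain average; per-source weights are
    common, but the value here is illustrative.
    """
    return weight_a * estimate_a + (1.0 - weight_a) * estimate_b
```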
Table 9
(Team subatomicseer) Performance of the individual models of our ensemble on our validation set (global SDR, dB). Note that HTDemucs was trained on more data than our internal models.

| Model (training songs) | Mean | Bass | Drums | Other | Vocals |
|------------------------|------|------|-------|-------|--------|
| DTUNet (347) | 8.79 | 8.75 | 10.65 | 6.76 | 8.99 |
| BSRNN (347) | 8.65 | 8.06 | 10.80 | 6.38 | 9.37 |
| HTDemucs (800) | 9.19 | 9.68 | 10.76 | 7.17 | 9.15 |
Table 10
(Team CCOM) Performance of HTDemucs with our approach (global SDR, dB). The baseline is trained on SDXDB23_LabelNoise; we then train a model using loss truncation only. We use this model to filter the dataset (denoted 1st in the table) and train a new model. Finally, we repeat the dataset filtering (denoted 2nd) and fine-tune the model to obtain the best performance.

| Training setup | Mean | Bass | Drums | Other | Vocals |
|----------------|------|------|-------|-------|--------|
| Baseline | 4.96 | 5.07 | 5.76 | 3.14 | 5.85 |
| With loss truncation | 6.26 | 6.94 | 6.62 | 4.45 | 7.09 |
| With filtered data (1st) | 6.89 | 7.34 | 7.58 | 4.88 | 7.74 |
| With filtered data (2nd) | 7.46 | 8.12 | 7.99 | 5.34 | 8.37 |
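
Loss truncation ignores the highest-loss examples in a batch, on the assumption that unusually high loss signals corrupted training targets (e.g. mislabeled stems). A minimal sketch with an L1 base loss (the drop fraction is illustrative, not the teams' exact value):

```python
import torch

def truncated_l1_loss(pred: torch.Tensor, target: torch.Tensor,
                      drop_frac: float = 0.1) -> torch.Tensor:
    """L1 loss with loss truncation: the highest-loss samples in the
    batch are excluded from the average."""
    # Per-sample loss, averaged over all non-batch dimensions.
    dims = tuple(range(1, pred.dim()))
    per_sample = (pred - target).abs().mean(dim=dims)
    keep = max(1, int(per_sample.numel() * (1.0 - drop_frac)))
    kept, _ = torch.topk(per_sample, keep, largest=False)  # lowest-loss samples
    return kept.mean()
```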
Table 11
(Team kuielab) Ablation study on loss truncation (global SDR, dB). Please note that these are the scores of an individual TFC-TDF-UNet v3 model, not of the final ensemble.

| Task | Loss truncation | Mean | Bass | Drums | Other | Vocals |
|------------|-----|------|------|------|------|------|
| LabelNoise | No | 5.05 | 5.31 | 5.31 | 3.45 | 6.12 |
| LabelNoise | Yes | 6.26 | 6.43 | 6.38 | 4.64 | 7.58 |
| Bleeding | No | 5.80 | 6.11 | 5.86 | 4.36 | 6.87 |
| Bleeding | Yes | 6.22 | 6.58 | 6.20 | 4.69 | 7.41 |
Table 12
(Team kuielab) Comparison of TFC-TDF-UNet v2 and v3 on the MUSDB18-HQ benchmark (global SDR, dB). Speed denotes the relative GPU inference speed with respect to real-time on the challenge evaluation server.

| Model | Mean | Bass | Drums | Other | Vocals | Speed |
|-------|------|------|-------|-------|--------|-------|
| v2 | 7.03 | 6.85 | 6.87 | 5.44 | 8.96 | 12.8x |
| v3 | 7.90 | 7.36 | 8.81 | 6.19 | 9.22 | 15.0x |

Figure 4
Results of the listening test.

Figure 5
Results of the listening test by assessor category.

Figure 6
Results of the listening test on bass removal and extraction.

Figure 7
Results of the listening test on drum removal and extraction.

Figure 8
Results of the listening test on other removal and extraction.

Figure 9
Results of the listening test on vocal removal and extraction.
Table 13
Final ranking obtained with TrueSkill. We used the default parameters for each player (μ = 25 and σ = 8.33). We report the average SDR score on the Standard leaderboard for reference.

| Rank | Model | μ | σ | SDR (Mean) |
|------|-------|---|---|------------|
| 1 | kimberley_jensen | 24.793 | 0.779 | 9.18 |
| 2 | ZFTurbo | 24.362 | 0.779 | 9.26 |
| 3 | SAMI-ByteDance | 24.011 | 0.779 | 9.97 |
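
The ranking can be reproduced with the `trueskill` Python package using the same priors. A minimal sketch; the single pairwise outcome shown is illustrative, not an actual result from the listening test:

```python
import trueskill

# Default TrueSkill priors, as in the paper: mu = 25, sigma = 25/3 ≈ 8.33.
env = trueskill.TrueSkill(mu=25.0, sigma=25.0 / 3)
ratings = {name: env.create_rating()
           for name in ("kimberley_jensen", "ZFTurbo", "SAMI-ByteDance")}

# One pairwise outcome (winner listed first); iterating this over all
# judged pairs yields the final mu and sigma values in Table 13.
winner, loser = "kimberley_jensen", "ZFTurbo"  # illustrative pair
ratings[winner], ratings[loser] = trueskill.rate_1vs1(
    ratings[winner], ratings[loser], env=env)
```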

Figure 10
Correlation between the SDR scores and the results of the listening test.
