Table 1
Number of instances for each genre in the train, validation, and test subsets. The last column shows each genre's percentage of the full dataset.
| Genre | Train | Val | Test | % |
|---|---|---|---|---|
| Blues | 518 | 120 | 190 | 2.68 |
| Country | 1351 | 243 | 194 | 5.78 |
| Electronic | 3434 | 725 | 733 | 15.81 |
| Folk | 858 | 164 | 136 | 3.74 |
| Jazz | 1844 | 373 | 462 | 8.66 |
| Latin | 390 | 83 | 83 | 1.80 |
| Metal | 1749 | 512 | 375 | 8.52 |
| New Age | 158 | 71 | 38 | 0.86 |
| Pop | 2333 | 644 | 466 | 11.13 |
| Punk | 487 | 132 | 96 | 2.31 |
| Rap | 1932 | 380 | 381 | 8.71 |
| Reggae | 1249 | 190 | 266 | 5.51 |
| RnB | 1223 | 222 | 396 | 5.95 |
| Rock | 3694 | 709 | 829 | 16.91 |
| World | 331 | 123 | 46 | 1.62 |
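
The % column can be recomputed from the split counts. The sketch below assumes each percentage is the genre's share of all instances across the three subsets; that reading is inferred from the numbers and is not stated in the caption. The names `counts` and `genre_share` are illustrative only.

```python
# Per-genre (train, val, test) counts copied from Table 1.
counts = {
    "Blues": (518, 120, 190),
    "Country": (1351, 243, 194),
    "Electronic": (3434, 725, 733),
    "Folk": (858, 164, 136),
    "Jazz": (1844, 373, 462),
    "Latin": (390, 83, 83),
    "Metal": (1749, 512, 375),
    "New Age": (158, 71, 38),
    "Pop": (2333, 644, 466),
    "Punk": (487, 132, 96),
    "Rap": (1932, 380, 381),
    "Reggae": (1249, 190, 266),
    "RnB": (1223, 222, 396),
    "Rock": (3694, 709, 829),
    "World": (331, 123, 46),
}

# Total number of instances over all genres and all three subsets.
total = sum(sum(split) for split in counts.values())


def genre_share(genre: str) -> float:
    """Percentage of the whole dataset that belongs to `genre` (assumed definition)."""
    return round(100 * sum(counts[genre]) / total, 2)


for genre in counts:
    print(f"{genre:<10} {genre_share(genre):5.2f}")
```

Under this assumption the recomputed shares agree with the % column, which also makes the class imbalance explicit: Rock and Electronic together account for roughly a third of the data.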

Figure 1
Scheme of two CNNs (Top: audio, Bottom: visual).

Figure 2
Scheme of the multimodal feature space network. The previously learned features from different modalities are mapped to the same space.

Table 2
Genre classification results in terms of macro precision (P), recall (R), and F-measure (F1). Each experiment was run 3 times; the mean and standard deviation of the results are reported.
| Input | Model | P | R | F1 |
|---|---|---|---|---|
| Audio | CNN_AUDIO | 0.385±0.006 | 0.341±0.001 | 0.336±0.002 |
| Audio | MM_AUDIO | 0.406±0.001 | 0.342±0.003 | 0.334±0.003 |
| Audio | CNN_AUDIO + MM_AUDIO | 0.389±0.005 | 0.350±0.002 | 0.346±0.002 |
| Video | CNN_VISUAL | 0.291±0.016 | 0.260±0.006 | 0.255±0.003 |
| Video | MM_VISUAL | 0.264±0.005 | 0.241±0.002 | 0.239±0.002 |
| Video | CNN_VISUAL + MM_VISUAL | 0.271±0.001 | 0.248±0.003 | 0.245±0.003 |
| A + V | CNN_AUDIO + CNN_VISUAL | 0.485±0.005 | 0.413±0.005 | 0.425±0.005 |
| A + V | MM_AUDIO + MM_VISUAL | 0.467±0.007 | 0.393±0.003 | 0.400±0.004 |
| A + V | ALL | 0.477±0.010 | 0.413±0.002 | 0.427±0.000 |
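
As a minimal sketch of how macro-averaged scores like those in Table 2 are typically obtained, the code below computes per-class precision, recall, and F1 from a confusion matrix and averages them with equal weight per class. The confusion matrices are toy values, not data from the experiments, and the use of the sample standard deviation is an assumption, since the caption does not say which variant was reported.

```python
from statistics import mean, stdev


def macro_scores(confusion):
    """Return (macro precision, macro recall, macro F1).

    confusion[i][j] counts items of true class i predicted as class j.
    Macro F1 is taken here as the unweighted mean of the per-class F1 values.
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return mean(precisions), mean(recalls), mean(f1s)


# Toy 3-class confusion matrices for three hypothetical runs.
runs = [
    [[8, 1, 1], [2, 6, 2], [1, 1, 8]],
    [[7, 2, 1], [2, 7, 1], [1, 2, 7]],
    [[8, 1, 1], [3, 5, 2], [0, 2, 8]],
]

f1_per_run = [macro_scores(c)[2] for c in runs]
# stdev() is the sample standard deviation (an assumption, see above).
print(f"F1 = {mean(f1_per_run):.3f} ± {stdev(f1_per_run):.3f}")
```

Because every class contributes equally to a macro average, rare genres such as New Age or World weigh as much as Rock, which is one reason macro scores on an imbalanced dataset tend to be low.
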

Table 3
Detailed results of the genre classification task. Human annotated results on the left, and our best models on the right (CNN_AUDIO + MM_AUDIO, CNN_VISUAL, and ALL respectively).
| Genre | Human: Audio | Human: Visual | Human: A + V | Neural: Audio | Neural: Visual | Neural: A + V |
|---|---|---|---|---|---|---|
| Blues | 0 | 0.50 | 0.67 | 0.05 | 0.36 | 0.42 |
| Country | 0.40 | 0.60 | 0.31 | 0.37 | 0.21 | 0.40 |
| Electronic | 0.62 | 0.44 | 0.67 | 0.64 | 0.44 | 0.68 |
| Folk | 0 | 0.33 | 0 | 0.13 | 0.23 | 0.28 |
| Jazz | 0.62 | 0.38 | 0.67 | 0.47 | 0.27 | 0.49 |
| Latin | 0.33 | 0.33 | 0.40 | 0.17 | 0.08 | 0.13 |
| Metal | 0.80 | 0.43 | 0.71 | 0.69 | 0.49 | 0.73 |
| New Age | 0 | 0 | 0 | 0 | 0.12 | 0.10 |
| Pop | 0.43 | 0.46 | 0.42 | 0.39 | 0.43 | 0.49 |
| Punk | 0.44 | 0.29 | 0.46 | 0.04 | 0 | 0.30 |
| Rap | 0.74 | 0.29 | 0.88 | 0.73 | 0.39 | 0.73 |
| Reggae | 0.67 | 0 | 0.80 | 0.51 | 0.34 | 0.55 |
| RnB | 0.55 | 0 | 0.46 | 0.45 | 0.31 | 0.51 |
| Rock | 0.58 | 0.40 | 0.40 | 0.54 | 0.20 | 0.58 |
| World | 0 | 0.33 | 0 | 0 | 0 | 0.03 |
| Average | 0.41 | 0.32 | 0.46 | 0.35 | 0.25 | 0.43 |

Figure 3
Confusion matrices for the three neural network settings (CNN_AUDIO + MM_AUDIO, CNN_VISUAL, and ALL) and for the human annotator.

Figure 4
Heavy Metal and New Age album covers.

Figure 5
Examples of heatmaps for different genre classes. The genres in the left column are the ground truth.

Figure 6
Heatmap for the Metal genre class of a Metal (top) and a New Age (bottom) album with horns.

Table 4
Top-10 most and least represented genres.
| Genre | % of albums | Genre | % of albums |
|---|---|---|---|
| Pop | 84.38 | Tributes | 0.10 |
| Rock | 55.29 | Harmonica Blues | 0.10 |
| Alternative Rock | 27.69 | Concertos | 0.10 |
| World Music | 19.31 | Bass | 0.06 |
| Jazz | 14.73 | European Jazz | 0.06 |
| Dance & Electronic | 12.23 | Piano Blues | 0.06 |
| Metal | 11.50 | Norway | 0.06 |
| Indie & Lo-Fi | 10.45 | Slide Guitar | 0.06 |
| R&B | 10.10 | East Coast Blues | 0.06 |
| Folk | 9.69 | Girl Groups | 0.06 |

Table 5
Filter and max pooling sizes applied to the different layers of the three audio CNN approaches used for multi-label classification. Each approach is labeled by its first-layer filter size.
| Approach | CNN layer | Filter | Max pooling |
|---|---|---|---|
| 3×3 | 1 | 3×3 | (4,2) |
| 3×3 | 2 | 3×3 | (4,2) |
| 3×3 | 3 | 3×3 | (4,1) |
| 3×3 | 4 | 1×1 | (4,5) |
| 4×96 | 1 | 4×96 | (4,1) |
| 4×96 | 2 | 4×1 | (4,1) |
| 4×96 | 3 | 4×1 | (4,1) |
| 4×96 | 4 | 1×1 | – |
| 4×70 | 1 | 4×70 | (4,4) |
| 4×70 | 2 | 4×6 | (4,1) |
| 4×70 | 3 | 4×1 | (4,1) |
| 4×70 | 4 | 1×1 | – |
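
To compare how aggressively each approach shrinks its input, the pooling sizes in Table 5 can be multiplied per axis. This is only a sketch: the approach names (`3x3`, `4x96`, `4x70`) are an assumption based on the first-layer filter of each group of four rows, a missing pooling entry ("–") is treated as `(1, 1)`, and any size reduction caused by the convolutions themselves is ignored because padding and strides are not given here.

```python
from math import prod

# Max-pooling sizes per layer, copied from Table 5.
# Keys are assumed approach names; "–" (no pooling) is written as (1, 1).
pooling = {
    "3x3": [(4, 2), (4, 2), (4, 1), (4, 5)],
    "4x96": [(4, 1), (4, 1), (4, 1), (1, 1)],
    "4x70": [(4, 4), (4, 1), (4, 1), (1, 1)],
}


def total_downsampling(layers):
    """Multiply the pooling factors along each of the two axes."""
    first_axis = prod(a for a, _ in layers)
    second_axis = prod(b for _, b in layers)
    return first_axis, second_axis


for name, layers in pooling.items():
    print(name, total_downsampling(layers))
# 3x3  -> (256, 20)
# 4x96 -> (64, 1)
# 4x70 -> (64, 4)
```

The `4x96` and `4x70` approaches pool little or not at all along the second axis, consistent with first-layer filters that already span most or all of that dimension.
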

Table 6
Results for multi-label music genre classification of albums. Number of network parameters, training time per epoch, AUC-ROC, and aggregated diversity (ADiv) at N = 1, 3, 5 for different settings and modalities.
| Modality | Target | Settings | Params | Time | AUC | ADiv@1 | ADiv@3 | ADiv@5 |
|---|---|---|---|---|---|---|---|---|
| AUDIO | LOGISTIC | TIMBRE-MLP | 0.01M | 1s | 0.792 | 0.04 | 0.14 | 0.22 |
| AUDIO | LOGISTIC | LOW-3×3 | 0.5M | 390s | 0.859 | 0.14 | 0.34 | 0.54 |
| AUDIO | LOGISTIC | HIGH-3×3 | 16.5M | 2280s | 0.840 | 0.20 | 0.43 | 0.69 |
| AUDIO | LOGISTIC | LOW-4×96 | 0.2M | 140s | 0.851 | 0.14 | 0.32 | 0.48 |
| AUDIO | LOGISTIC | HIGH-4×96 | 5M | 260s | 0.862 | 0.12 | 0.33 | 0.48 |
| AUDIO | LOGISTIC | LOW-4×70 | 0.35M | 200s | 0.871 | 0.05 | 0.16 | 0.34 |
| AUDIO | LOGISTIC | HIGH-4×70 | 7.5M | 600s | 0.849 | 0.08 | 0.23 | 0.38 |
| AUDIO | COSINE | LOW-3×3 | 0.33M | 400s | 0.864 | 0.26 | 0.47 | 0.65 |
| AUDIO | COSINE | HIGH-3×3 | 15.5M | 2200s | 0.881 | 0.30 | 0.54 | 0.69 |
| AUDIO | COSINE | LOW-4×96 | 0.15M | 135s | 0.860 | 0.19 | 0.40 | 0.52 |
| AUDIO | COSINE | HIGH-4×96 | 4M | 250s | 0.884 | 0.35 | 0.59 | 0.75 |
| AUDIO | COSINE | LOW-4×70 | 0.3M | 190s | 0.868 | 0.26 | 0.51 | 0.68 |
| AUDIO (A) | COSINE | HIGH-4×70 | 6.5M | 590s | 0.888 | 0.35 | 0.60 | 0.74 |
| TEXT | LOGISTIC | VSM | 25M | 11s | 0.905 | 0.08 | 0.20 | 0.37 |
| TEXT | LOGISTIC | VSM + SEM | 25M | 11s | 0.916 | 0.10 | 0.25 | 0.44 |
| TEXT | COSINE | VSM | 25M | 11s | 0.901 | 0.53 | 0.44 | 0.90 |
| TEXT (T) | COSINE | VSM + SEM | 25M | 11s | 0.917 | 0.42 | 0.70 | 0.85 |
| IMAGE (I) | LOGISTIC | RESNET | 1.7M | 4009s | 0.743 | 0.06 | 0.15 | 0.27 |
| A + T | LOGISTIC | MLP | 1.5M | 2s | 0.923 | 0.10 | 0.40 | 0.64 |
| A + I | LOGISTIC | MLP | 1.5M | 2s | 0.900 | 0.10 | 0.38 | 0.66 |
| T + I | LOGISTIC | MLP | 1.5M | 2s | 0.921 | 0.10 | 0.37 | 0.63 |
| A + T + I | LOGISTIC | MLP | 2M | 2s | 0.936 | 0.11 | 0.39 | 0.66 |
| A + T | COSINE | MLP | 0.3M | 2s | 0.930 | 0.43 | 0.74 | 0.86 |
| A + I | COSINE | MLP | 0.3M | 2s | 0.896 | 0.32 | 0.57 | 0.76 |
| T + I | COSINE | MLP | 0.3M | 2s | 0.919 | 0.43 | 0.74 | 0.85 |
| A + T + I | COSINE | MLP | 0.4M | 2s | 0.931 | 0.42 | 0.72 | 0.86 |
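
The ADiv@N columns can be read as a coverage measure. The sketch below assumes that aggregated diversity at N is the fraction of all labels that appear in the top-N predictions of at least one test item; the caption does not define the metric, so this definition and the function name `adiv_at_n` are assumptions. The scores are toy values.

```python
def adiv_at_n(scores, n):
    """Aggregated diversity at N (assumed definition).

    scores: one list of per-label scores for each test item.
    Returns the fraction of labels that occur in the top-N list
    of at least one item.
    """
    num_labels = len(scores[0])
    seen = set()
    for item_scores in scores:
        # Label indices sorted from highest to lowest score.
        ranked = sorted(range(num_labels), key=lambda j: item_scores[j], reverse=True)
        seen.update(ranked[:n])
    return len(seen) / num_labels


# Toy example: 3 items, 4 labels. Label 0 always scores highest.
scores = [
    [0.9, 0.5, 0.1, 0.0],
    [0.8, 0.6, 0.2, 0.1],
    [0.7, 0.1, 0.6, 0.2],
]

print(adiv_at_n(scores, 1))  # 0.25
print(adiv_at_n(scores, 2))  # 0.75
print(adiv_at_n(scores, 3))  # 1.0
```

In the toy data one label dominates every top-1 prediction, so ADiv@1 is low even though the rankings differ further down. A model that predicts the most frequent genres for every album would show the same pattern.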

Figure 7
t-SNE visualization of PPMI album factors from the single-parent-label subset.

Table 7
Top-3 genre predictions in albums from the test set for LOGISTIC and COSINE audio-based approaches.
| Amazon ID | LOGISTIC | COSINE |
|---|---|---|
| B00002SWJF | Pop, Dance & Electronic, Rock | Dance & Electronic, Dance Pop, Electronica |
| B00006FX4G | Pop, Rock, Alternative Rock | Rock, Alternative Rock, Pop |
| B000000PLF | Pop, Jazz, Rock | Jazz, Bebop, Modern Postbebop |
| B00005YQOV | Pop, Jazz, Bebop | Jazz, Cool Jazz, Bebop |
| B0000026BS | Jazz, Pop, Bebop | Jazz, Bebop, Cool Jazz |
| B0000006PK | Pop, Jazz, Bebop | Jazz, Bebop, Cool Jazz |
| B00005O6NI | Pop, Rock, World Music | Blues, Traditional Blues, Acoustic Blues |
| B000BPYKLY | Pop, Jazz, R&B | Smooth Jazz, Soul-Jazz & Boogaloo, Jazz |
| B000007U2R | Pop, Dance & Electronic, Dance Pop | Dance Pop, Dance & Electronic, Electronica |
| B002LSPVJO | Rock, Pop, Alternative Rock | Rock, Alternative Rock, Metal |
| B000007WE5 | Pop, Rock, Country | Rock, Pop, Singer-Songwriters |
| B001IUC4AA | Dance & Electronic, Pop, Dance Pop | Dance & Electronic, Dance Pop, Electronica |
| B000SFJW2O | Pop, Rock, World Music | Pop, Rock, World Music |
| B000002NE8 | Pop, Rock, Dance & Electronic | Dance & Electronic, Dance Pop, Electronica |
| B000002GUJ | Rock, Pop, Alternative Rock | Alternative Rock, Indie & Lo-Fi, Indie Rock |
| B00004T0QB | Pop, Rock, Alternative Rock | Rock, Alternative Rock, Pop |
| B0000520XS | Pop, Rock, Folk | Singer-Songwriters, Contemporary Folk, Folk |
| B000EQ47W2 | Pop, Rock, Alternative Rock | Metal, Pop Metal, Rock |
| B00000258F | Pop, Rock, Jazz | Smooth Jazz, Soul-Jazz & Boogaloo, Jazz |
| B000003748 | Pop, Rock, Alternative Rock | Alternative Rock, Rock, American Alternative |
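
One way to produce top-3 lists like those in the COSINE column is to rank genres by cosine similarity between a predicted vector and one vector per genre. The sketch below assumes that this is what the COSINE target implies; the captions do not confirm it. The three-dimensional genre vectors and the predicted vector are hypothetical values chosen for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_genres(predicted, genre_vectors, k=3):
    """Return the k genre names whose vectors are closest to `predicted`."""
    ranked = sorted(
        genre_vectors,
        key=lambda g: cosine(predicted, genre_vectors[g]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical genre vectors (not taken from the paper).
genre_vectors = {
    "Jazz": [1.0, 0.1, 0.0],
    "Bebop": [0.9, 0.3, 0.0],
    "Cool Jazz": [0.8, 0.0, 0.3],
    "Pop": [0.0, 1.0, 0.2],
    "Rock": [0.1, 0.8, 0.6],
}

# Hypothetical model output for one album.
predicted = [1.0, 0.15, 0.05]

print(top_k_genres(predicted, genre_vectors))
```

Because related genres lie close together in such a space, a prediction near Jazz also ranks its subgenres highly. That would be consistent with the more specific labels in the COSINE column compared with the LOGISTIC column, which is dominated by broad genres such as Pop and Rock.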

Figure 8
t-SNE visualization of image vectors from the single-parent-label subset.
