Multimodal Deep Learning for Music Genre Classification
Open Access | Sep 2018

Figures & Tables

Table 1

Number of instances for each genre on the train, validation and test subsets. The percentage of elements for each genre is also shown.

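| Genre | Train | Val | Test | % |
|---|---|---|---|---|
| Blues | 518 | 120 | 190 | 2.68 |
| Country | 1351 | 243 | 194 | 5.78 |
| Electronic | 3434 | 725 | 733 | 15.81 |
| Folk | 858 | 164 | 136 | 3.74 |
| Jazz | 1844 | 373 | 462 | 8.66 |
| Latin | 390 | 83 | 83 | 1.80 |
| Metal | 1749 | 512 | 375 | 8.52 |
| New Age | 158 | 71 | 38 | 0.86 |
| Pop | 2333 | 644 | 466 | 11.13 |
| Punk | 487 | 132 | 96 | 2.31 |
| Rap | 1932 | 380 | 381 | 8.71 |
| Reggae | 1249 | 190 | 266 | 5.51 |
| RnB | 1223 | 222 | 396 | 5.95 |
| Rock | 3694 | 709 | 829 | 16.91 |
| World | 331 | 123 | 46 | 1.62 |
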
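The percentage column in Table 1 appears to be each genre's share of all instances across the train, validation, and test subsets combined. That reading is an assumption rather than something the caption states. The short sketch below recomputes the column from the counts in the table. The variable names are illustrative only.

```python
# Per-genre (train, val, test) counts transcribed from Table 1.
counts = {
    "Blues": (518, 120, 190),
    "Country": (1351, 243, 194),
    "Electronic": (3434, 725, 733),
    "Folk": (858, 164, 136),
    "Jazz": (1844, 373, 462),
    "Latin": (390, 83, 83),
    "Metal": (1749, 512, 375),
    "New Age": (158, 71, 38),
    "Pop": (2333, 644, 466),
    "Punk": (487, 132, 96),
    "Rap": (1932, 380, 381),
    "Reggae": (1249, 190, 266),
    "RnB": (1223, 222, 396),
    "Rock": (3694, 709, 829),
    "World": (331, 123, 46),
}

# Assumption: the "%" column is the genre's share of all instances,
# summed over the three subsets.
total = sum(sum(split) for split in counts.values())
percentages = {
    genre: round(100 * sum(split) / total, 2)
    for genre, split in counts.items()
}

for genre, pct in percentages.items():
    print(f"{genre:<12}{pct:6.2f}")
```

Under this assumption the recomputed shares agree with the published column to two decimals, which also confirms how the flattened rows split into columns.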
Figure 1

Scheme of two CNNs (Top: audio, Bottom: visual).

Figure 2

Scheme of the multimodal feature space network. The previously learned features from different modalities are mapped to the same space.

Table 2

Genre classification experiments in terms of macro precision, recall, and f-measure. Every experiment was run 3 times and mean and standard deviation of the results are reported.

| Input | Model | P | R | F1 |
|---|---|---|---|---|
| Audio | CNN_AUDIO | 0.385±0.006 | 0.341±0.001 | 0.336±0.002 |
| Audio | MM_AUDIO | 0.406±0.001 | 0.342±0.003 | 0.334±0.003 |
| Audio | CNN_AUDIO + MM_AUDIO | 0.389±0.005 | 0.350±0.002 | 0.346±0.002 |
| Video | CNN_VISUAL | 0.291±0.016 | 0.260±0.006 | 0.255±0.003 |
| Video | MM_VISUAL | 0.264±0.005 | 0.241±0.002 | 0.239±0.002 |
| Video | CNN_VISUAL + MM_VISUAL | 0.271±0.001 | 0.248±0.003 | 0.245±0.003 |
| A + V | CNN_AUDIO + CNN_VISUAL | 0.485±0.005 | 0.413±0.005 | 0.425±0.005 |
| A + V | MM_AUDIO + MM_VISUAL | 0.467±0.007 | 0.393±0.003 | 0.400±0.004 |
| A + V | ALL | 0.477±0.010 | 0.413±0.002 | 0.427±0.000 |

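Table 2 reports macro-averaged scores: precision, recall, and F-measure are computed per genre and then averaged with equal weight per genre. The sketch below illustrates that computation on toy labels using only the standard library, followed by the mean and standard deviation over three runs. The function name `macro_prf`, the toy labels, and the three run scores are hypothetical. The use of the sample standard deviation is also an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, stdev


def macro_prf(y_true, y_pred, labels):
    """Return macro-averaged (precision, recall, F1) over `labels`."""
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    # Macro averaging: every class counts equally, regardless of its size.
    return mean(precisions), mean(recalls), mean(f1s)


# Toy single-label example (not data from the paper).
y_true = ["Rock", "Rock", "Jazz", "Pop"]
y_pred = ["Rock", "Pop", "Jazz", "Pop"]
p, r, f1 = macro_prf(y_true, y_pred, ["Rock", "Jazz", "Pop"])

# Hypothetical macro-F1 scores from three runs, summarised as mean±std.
run_f1 = [0.424, 0.427, 0.430]
summary = f"{mean(run_f1):.3f}±{stdev(run_f1):.3f}"
print(p, r, f1, summary)
```

Because the macro F-measure here is the mean of per-genre F-measures rather than the harmonic mean of macro precision and macro recall, it can fall below both, as it does in several rows of Table 2.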
Table 3

Detailed results of the genre classification task. Human annotated results on the left, and our best models on the right (CNN_AUDIO + MM_AUDIO, CNN_VISUAL, and ALL respectively).

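| Genre | Human Annotator: Audio | Human Annotator: Visual | Human Annotator: A + V | Neural Model: Audio | Neural Model: Visual | Neural Model: A + V |
|---|---|---|---|---|---|---|
| Blues | 0 | 0.50 | 0.67 | 0.05 | 0.36 | 0.42 |
| Country | 0.40 | 0.60 | 0.31 | 0.37 | 0.21 | 0.40 |
| Electronic | 0.62 | 0.44 | 0.67 | 0.64 | 0.44 | 0.68 |
| Folk | 0 | 0.33 | 0 | 0.13 | 0.23 | 0.28 |
| Jazz | 0.62 | 0.38 | 0.67 | 0.47 | 0.27 | 0.49 |
| Latin | 0.33 | 0.33 | 0.40 | 0.17 | 0.08 | 0.13 |
| Metal | 0.80 | 0.43 | 0.71 | 0.69 | 0.49 | 0.73 |
| New Age | 0 | 0 | 0 | 0 | 0.12 | 0.10 |
| Pop | 0.43 | 0.46 | 0.42 | 0.39 | 0.43 | 0.49 |
| Punk | 0.44 | 0.29 | 0.46 | 0.04 | 0 | 0.30 |
| Rap | 0.74 | 0.29 | 0.88 | 0.73 | 0.39 | 0.73 |
| Reggae | 0.67 | 0 | 0.80 | 0.51 | 0.34 | 0.55 |
| RnB | 0.55 | 0 | 0.46 | 0.45 | 0.31 | 0.51 |
| Rock | 0.58 | 0.40 | 0.40 | 0.54 | 0.20 | 0.58 |
| World | 0 | 0.33 | 0 | 0 | 0 | 0.03 |
| Average | 0.41 | 0.32 | 0.46 | 0.35 | 0.25 | 0.43 |
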
Figure 3

Confusion matrices for the three settings, as classified by the neural network models (CNN_AUDIO + MM_AUDIO, CNN_VISUAL, and ALL) and by the human annotator.

Figure 4

Heavy Metal and New Age album covers.

Figure 5

Examples of heatmaps for different genre classes. The genres in the left column are the ground truth.

Figure 6

Heatmap for Metal genre class of a Metal (top) and a New Age (bottom) album with horns.

Table 4

Top-10 most and least represented genres.

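| Genre (most represented) | % of albums | Genre (least represented) | % of albums |
|---|---|---|---|
| Pop | 84.38 | Tributes | 0.10 |
| Rock | 55.29 | Harmonica Blues | 0.10 |
| Alternative Rock | 27.69 | Concertos | 0.10 |
| World Music | 19.31 | Bass | 0.06 |
| Jazz | 14.73 | European Jazz | 0.06 |
| Dance & Electronic | 12.23 | Piano Blues | 0.06 |
| Metal | 11.50 | Norway | 0.06 |
| Indie & Lo-Fi | 10.45 | Slide Guitar | 0.06 |
| R&B | 10.10 | East Coast Blues | 0.06 |
| Folk | 9.69 | Girl Groups | 0.06 |
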
Table 5

Filter and max pooling sizes applied to the different layers of the three audio CNN approaches used for multi-label classification.

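| Approach | CNN layer | Filter | Max pooling |
|---|---|---|---|
| 3×3 | 1 | 3×3 | (4,2) |
| 3×3 | 2 | 3×3 | (4,2) |
| 3×3 | 3 | 3×3 | (4,1) |
| 3×3 | 4 | 1×1 | (4,5) |
| 4×96 | 1 | 4×96 | (4,1) |
| 4×96 | 2 | 4×1 | (4,1) |
| 4×96 | 3 | 4×1 | (4,1) |
| 4×96 | 4 | 1×1 | |
| 4×70 | 1 | 4×70 | (4,4) |
| 4×70 | 2 | 4×6 | (4,1) |
| 4×70 | 3 | 4×1 | (4,1) |
| 4×70 | 4 | 1×1 | |
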
Table 6

Results for Multi-label Music Genre Classification of Albums. Number of network hyperparameters, epoch training time, AUC-ROC, and aggregated diversity at N = 1, 3, 5 for different settings and modalities.

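| Modality | Target | Settings | Params | Time | AUC | ADiv@1 | ADiv@3 | ADiv@5 |
|---|---|---|---|---|---|---|---|---|
| AUDIO | LOGISTIC | TIMBRE-MLP | 0.01M | 1s | 0.792 | 0.04 | 0.14 | 0.22 |
| AUDIO | LOGISTIC | LOW-3×3 | 0.5M | 390s | 0.859 | 0.14 | 0.34 | 0.54 |
| AUDIO | LOGISTIC | HIGH-3×3 | 16.5M | 2280s | 0.840 | 0.20 | 0.43 | 0.69 |
| AUDIO | LOGISTIC | LOW-4×96 | 0.2M | 140s | 0.851 | 0.14 | 0.32 | 0.48 |
| AUDIO | LOGISTIC | HIGH-4×96 | 5M | 260s | 0.862 | 0.12 | 0.33 | 0.48 |
| AUDIO | LOGISTIC | LOW-4×70 | 0.35M | 200s | 0.871 | 0.05 | 0.16 | 0.34 |
| AUDIO | LOGISTIC | HIGH-4×70 | 7.5M | 600s | 0.849 | 0.08 | 0.23 | 0.38 |
| AUDIO | COSINE | LOW-3×3 | 0.33M | 400s | 0.864 | 0.26 | 0.47 | 0.65 |
| AUDIO | COSINE | HIGH-3×3 | 15.5M | 2200s | 0.881 | 0.30 | 0.54 | 0.69 |
| AUDIO | COSINE | LOW-4×96 | 0.15M | 135s | 0.860 | 0.19 | 0.40 | 0.52 |
| AUDIO | COSINE | HIGH-4×96 | 4M | 250s | 0.884 | 0.35 | 0.59 | 0.75 |
| AUDIO | COSINE | LOW-4×70 | 0.3M | 190s | 0.868 | 0.26 | 0.51 | 0.68 |
| AUDIO (A) | COSINE | HIGH-4×70 | 6.5M | 590s | 0.888 | 0.35 | 0.60 | 0.74 |
| TEXT | LOGISTIC | VSM | 25M | 11s | 0.905 | 0.08 | 0.20 | 0.37 |
| TEXT | LOGISTIC | VSM+SEM | 25M | 11s | 0.916 | 0.10 | 0.25 | 0.44 |
| TEXT | COSINE | VSM | 25M | 11s | 0.901 | 0.53 | 0.44 | 0.90 |
| TEXT (T) | COSINE | VSM+SEM | 25M | 11s | 0.917 | 0.42 | 0.70 | 0.85 |
| IMAGE (I) | LOGISTIC | RESNET | 1.7M | 4009s | 0.743 | 0.06 | 0.15 | 0.27 |
| A + T | LOGISTIC | MLP | 1.5M | 2s | 0.923 | 0.10 | 0.40 | 0.64 |
| A + I | LOGISTIC | MLP | 1.5M | 2s | 0.900 | 0.10 | 0.38 | 0.66 |
| T + I | LOGISTIC | MLP | 1.5M | 2s | 0.921 | 0.10 | 0.37 | 0.63 |
| A + T + I | LOGISTIC | MLP | 2M | 2s | 0.936 | 0.11 | 0.39 | 0.66 |
| A + T | COSINE | MLP | 0.3M | 2s | 0.930 | 0.43 | 0.74 | 0.86 |
| A + I | COSINE | MLP | 0.3M | 2s | 0.896 | 0.32 | 0.57 | 0.76 |
| T + I | COSINE | MLP | 0.3M | 2s | 0.919 | 0.43 | 0.74 | 0.85 |
| A + T + I | COSINE | MLP | 0.4M | 2s | 0.931 | 0.42 | 0.72 | 0.86 |
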
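The ADiv@N columns in Table 6 report aggregated diversity. A common reading, assumed here because this page does not give the definition, is the fraction of all genre labels that appear in the top-N predictions of at least one test album. The sketch below follows that assumed definition. The function name `adiv_at_n` and the toy score matrix are hypothetical, not taken from the paper.

```python
def adiv_at_n(scores, n):
    """Fraction of labels that occur in at least one item's top-n predictions.

    `scores` is a list of per-item score lists, one score per label.
    This follows an assumed definition of aggregated diversity.
    """
    num_labels = len(scores[0])
    seen = set()
    for row in scores:
        # Indices of the n highest-scoring labels for this item.
        ranked = sorted(range(num_labels), key=lambda j: row[j], reverse=True)
        seen.update(ranked[:n])
    return len(seen) / num_labels


# Toy scores for 3 albums over 4 genre labels (not data from the paper).
scores = [
    [0.9, 0.5, 0.1, 0.0],
    [0.8, 0.1, 0.6, 0.0],
    [0.7, 0.6, 0.2, 0.1],
]
adiv1 = adiv_at_n(scores, 1)  # every album's top label is label 0
adiv2 = adiv_at_n(scores, 2)  # labels 0, 1 and 2 appear at least once
print(adiv1, adiv2)
```

Under this reading, a model that always ranks the same few popular genres first scores low at small N even when its AUC is high.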
Figure 7

t-SNE of PPMI album factors from the single-parent-label subset.

Table 7

Top-3 genre predictions in albums from the test set for LOGISTIC and COSINE audio-based approaches.

| Amazon ID | LOGISTIC | COSINE |
|---|---|---|
| B00002SWJF | Pop, Dance & Electronic, Rock | Dance & Electronic, Dance Pop, Electronica |
| B00006FX4G | Pop, Rock, Alternative Rock | Rock, Alternative Rock, Pop |
| B000000PLF | Pop, Jazz, Rock | Jazz, Bebop, Modern Postbebop |
| B00005YQOV | Pop, Jazz, Bebop | Jazz, Cool Jazz, Bebop |
| B0000026BS | Jazz, Pop, Bebop | Jazz, Bebop, Cool Jazz |
| B0000006PK | Pop, Jazz, Bebop | Jazz, Bebop, Cool Jazz |
| B00005O6NI | Pop, Rock, World Music | Blues, Traditional Blues, Acoustic Blues |
| B000BPYKLY | Pop, Jazz, R&B | Smooth Jazz, Soul-Jazz & Boogaloo, Jazz |
| B000007U2R | Pop, Dance & Electronic, Dance Pop | Dance Pop, Dance & Electronic, Electronica |
| B002LSPVJO | Rock, Pop, Alternative Rock | Rock, Alternative Rock, Metal |
| B000007WE5 | Pop, Rock, Country | Rock, Pop, Singer-Songwriters |
| B001IUC4AA | Dance & Electronic, Pop, Dance Pop | Dance & Electronic, Dance Pop, Electronica |
| B000SFJW2O | Pop, Rock, World Music | Pop, Rock, World Music |
| B000002NE8 | Pop, Rock, Dance & Electronic | Dance & Electronic, Dance Pop, Electronica |
| B000002GUJ | Rock, Pop, Alternative Rock | Alternative Rock, Indie & Lo-Fi, Indie Rock |
| B00004T0QB | Pop, Rock, Alternative Rock | Rock, Alternative Rock, Pop |
| B0000520XS | Pop, Rock, Folk | Singer-Songwriters, Contemporary Folk, Folk |
| B000EQ47W2 | Pop, Rock, Alternative Rock | Metal, Pop Metal, Rock |
| B00000258F | Pop, Rock, Jazz | Smooth Jazz, Soul-Jazz & Boogaloo, Jazz |
| B000003748 | Pop, Rock, Alternative Rock | Alternative Rock, Rock, American Alternative |

Figure 8

t-SNE visualization of image vectors from the single-parent-label subset.

DOI: https://doi.org/10.5334/tismir.10 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jan 20, 2018
Accepted on: May 1, 2018
Published on: Sep 4, 2018
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2018 Sergio Oramas, Francesco Barbieri, Oriol Nieto, Xavier Serra, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.