Multimodal Deep Learning for Music Genre Classification

Table 1

Genre	Train	Val	Test	% of total
Blues	518	120	190	2.68
Country	1351	243	194	5.78
Electronic	3434	725	733	15.81
Folk	858	164	136	3.74
Jazz	1844	373	462	8.66
Latin	390	83	83	1.80
Metal	1749	512	375	8.52
New Age	158	71	38	0.86
Pop	2333	644	466	11.13
Punk	487	132	96	2.31
Rap	1932	380	381	8.71
Reggae	1249	190	266	5.51
RnB	1223	222	396	5.95
Rock	3694	709	829	16.91
World	331	123	46	1.62
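The percentage column is consistent with each genre's share of all items across the three splits. The sketch below recomputes it from the counts in the table; the names `SPLITS`, `genre_shares`, `TOTAL`, and `SHARES` are illustrative and not taken from the article, and reading the column as a share of the summed splits is an inference from the numbers.

```python
# (train, val, test) counts per genre, copied from the table above.
SPLITS = {
    "Blues": (518, 120, 190),
    "Country": (1351, 243, 194),
    "Electronic": (3434, 725, 733),
    "Folk": (858, 164, 136),
    "Jazz": (1844, 373, 462),
    "Latin": (390, 83, 83),
    "Metal": (1749, 512, 375),
    "New Age": (158, 71, 38),
    "Pop": (2333, 644, 466),
    "Punk": (487, 132, 96),
    "Rap": (1932, 380, 381),
    "Reggae": (1249, 190, 266),
    "RnB": (1223, 222, 396),
    "Rock": (3694, 709, 829),
    "World": (331, 123, 46),
}


def genre_shares(splits):
    """Return each genre's percentage of all items, rounded to two decimals."""
    total = sum(sum(counts) for counts in splits.values())
    return {g: round(100.0 * sum(c) / total, 2) for g, c in splits.items()}


TOTAL = sum(sum(counts) for counts in SPLITS.values())  # all splits, all genres
SHARES = genre_shares(SPLITS)
```

With these counts the three splits sum to 30,933 items, and the recomputed shares agree with the listed column (for example 2.68 for Blues and 16.91 for Rock).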

Scheme of two CNNs (*Top*: audio, *Bottom*: visual).

Scheme of the multimodal feature space network. The previously learned features from different modalities are mapped to the same space.

Table 2

Genre classification results in terms of macro-averaged precision (P), recall (R), and F-measure (F1). Each experiment was run three times; the mean and standard deviation are reported.

Input	Model	P	R	F1
Audio	CNN_AUDIO	0.385±0.006	0.341±0.001	0.336±0.002
	MM_AUDIO	0.406±0.001	0.342±0.003	0.334±0.003
	CNN_AUDIO + MM_AUDIO	0.389±0.005	0.350±0.002	0.346±0.002
Video	CNN_VISUAL	0.291±0.016	0.260±0.006	0.255±0.003
	MM_VISUAL	0.264±0.005	0.241±0.002	0.239±0.002
	CNN_VISUAL + MM_VISUAL	0.271±0.001	0.248±0.003	0.245±0.003
A + V	CNN_AUDIO + CNN_VISUAL	0.485±0.005	0.413±0.005	0.425±0.005
	MM_AUDIO + MM_VISUAL	0.467±0.007	0.393±0.003	0.400±0.004
	ALL	0.477±0.010	0.413±0.002	0.427±0.000
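As a reading aid for the table above, the following is a minimal sketch of macro-averaged precision, recall, and F1, plus a mean and standard deviation over repeated runs. It makes two assumptions that the caption does not settle: macro F1 is taken as the unweighted mean of per-class F1 scores, and the standard deviation is the sample standard deviation. The function names and the toy labels are hypothetical.

```python
from statistics import mean, stdev


def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    # Every class counts equally, regardless of how many items it has.
    return mean(precisions), mean(recalls), mean(f1s)


def summarize(values):
    """Mean and sample standard deviation across runs (assumed convention)."""
    return mean(values), stdev(values)


# Toy example with three classes (hypothetical data, not from the article).
y_true = ["Rock", "Rock", "Jazz", "Jazz", "Pop", "Pop"]
y_pred = ["Rock", "Jazz", "Jazz", "Jazz", "Pop", "Rock"]
P, R, F1 = macro_prf(y_true, y_pred)
```

Because every class has the same weight in a macro average, rare genres such as New Age or World influence these scores as much as Rock or Electronic.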

Table 3

Detailed results of the genre classification task. Human-annotated results are on the left and our best models on the right (CNN_AUDIO + MM_AUDIO, CNN_VISUAL, and ALL, respectively).

Genre	Human Annotator			Neural Model
Genre	Audio	Visual	A + V	Audio	Visual	A + V
Blues	0	0.50	0.67	0.05	0.36	0.42
Country	0.40	0.60	0.31	0.37	0.21	0.40
Electronic	0.62	0.44	0.67	0.64	0.44	0.68
Folk	0	0.33	0	0.13	0.23	0.28
Jazz	0.62	0.38	0.67	0.47	0.27	0.49
Latin	0.33	0.33	0.40	0.17	0.08	0.13
Metal	0.80	0.43	0.71	0.69	0.49	0.73
New Age	0	0	0	0	0.12	0.10
Pop	0.43	0.46	0.42	0.39	0.43	0.49
Punk	0.44	0.29	0.46	0.04	0	0.30
Rap	0.74	0.29	0.88	0.73	0.39	0.73
Reggae	0.67	0	0.80	0.51	0.34	0.55
RnB	0.55	0	0.46	0.45	0.31	0.51
Rock	0.58	0.40	0.40	0.54	0.20	0.58
World	0	0.33	0	0	0	0.03
Average	0.41	0.32	0.46	0.35	0.25	0.43

Confusion matrices for the three settings, obtained with the neural network models (CNN_AUDIO + MM_AUDIO, CNN_VISUAL, and ALL) and with the human annotator.

Examples of heatmaps for different genre classes. The genres in the left column are the ground truth.

Heatmap for Metal genre class of a Metal (top) and a New Age (bottom) album with horns.

Table 4

Top-10 most and least represented genres.

Genre	% of albums	Genre	% of albums
Pop	84.38	Tributes	0.10
Rock	55.29	Harmonica Blues	0.10
Alternative Rock	27.69	Concertos	0.10
World Music	19.31	Bass	0.06
Jazz	14.73	European Jazz	0.06
Dance & Electronic	12.23	Piano Blues	0.06
Metal	11.50	Norway	0.06
Indie & Lo-Fi	10.45	Slide Guitar	0.06
R&B	10.10	East Coast Blues	0.06
Folk	9.69	Girl Groups	0.06

Table 5

Filter and max pooling sizes applied to the different layers of the three audio CNN approaches used for multi-label classification.

Approach	CNN layer	Filter	Max pooling
3×3	1	3×3	(4,2)
	2	3×3	(4,2)
	3	3×3	(4,1)
	4	1×1	(4,5)
4×96	1	4×96	(4,1)
	2	4×1	(4,1)
	3	4×1	(4,1)
	4	1×1	–
4×70	1	4×70	(4,4)
	2	4×6	(4,1)
	3	4×1	(4,1)
	4	1×1	–
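One way to compare the three four-layer blocks in this table is the cumulative downsampling introduced by max pooling, which for non-overlapping pooling is the product of the pooling sizes along each axis. The sketch below encodes the table as a plain data structure, keyed by first-layer filter size, and computes that product. The table does not say which component is time and which is frequency, so the code refers only to the first and second components; `ARCHITECTURES` and `pooling_factor` are illustrative names.

```python
from math import prod

# (filter, pooling) per layer, copied from the table above.
# None stands for a layer listed without pooling.
ARCHITECTURES = {
    "3x3": [((3, 3), (4, 2)), ((3, 3), (4, 2)), ((3, 3), (4, 1)), ((1, 1), (4, 5))],
    "4x96": [((4, 96), (4, 1)), ((4, 1), (4, 1)), ((4, 1), (4, 1)), ((1, 1), None)],
    "4x70": [((4, 70), (4, 4)), ((4, 6), (4, 1)), ((4, 1), (4, 1)), ((1, 1), None)],
}


def pooling_factor(layers):
    """Cumulative pooling factor along the first and second components."""
    pools = [pool for _, pool in layers if pool is not None]
    first = prod(p[0] for p in pools)
    second = prod(p[1] for p in pools)
    return first, second


FACTORS = {name: pooling_factor(layers) for name, layers in ARCHITECTURES.items()}
```

Under this reading, the 3×3 block pools by a factor of 256 along the first component and 20 along the second, while the 4×96 and 4×70 blocks pool by 64 along the first component and by 1 and 4, respectively, along the second. Output sizes also depend on the input size and convolution padding, which this table does not give.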

Table 6

Results for multi-label music genre classification of albums: number of network parameters (Params), training time per epoch (Time), AUC-ROC (AUC), and aggregated diversity at N = 1, 3, 5 (ADiv@N) for different settings and modalities.

Modality	Target	Settings	Params	Time	AUC	ADiv@1	ADiv@3	ADiv@5
AUDIO	LOGISTIC	TIMBRE-MLP	0.01M	1s	0.792	0.04	0.14	0.22
AUDIO	LOGISTIC	LOW-3×3	0.5M	390s	0.859	0.14	0.34	0.54
AUDIO	LOGISTIC	HIGH-3×3	16.5M	2280s	0.840	0.20	0.43	0.69
AUDIO	LOGISTIC	LOW-4×96	0.2M	140s	0.851	0.14	0.32	0.48
AUDIO	LOGISTIC	HIGH-4×96	5M	260s	0.862	0.12	0.33	0.48
AUDIO	LOGISTIC	LOW-4×70	0.35M	200s	0.871	0.05	0.16	0.34
AUDIO	LOGISTIC	HIGH-4×70	7.5M	600s	0.849	0.08	0.23	0.38
AUDIO	COSINE	LOW-3×3	0.33M	400s	0.864	0.26	0.47	0.65
AUDIO	COSINE	HIGH-3×3	15.5M	2200s	0.881	0.30	0.54	0.69
AUDIO	COSINE	LOW-4×96	0.15M	135s	0.860	0.19	0.40	0.52
AUDIO	COSINE	HIGH-4×96	4M	250s	0.884	0.35	0.59	0.75
AUDIO	COSINE	LOW-4×70	0.3M	190s	0.868	0.26	0.51	0.68
AUDIO (A)	COSINE	HIGH-4×70	6.5M	590s	0.888	0.35	0.60	0.74
TEXT	LOGISTIC	VSM	25M	11s	0.905	0.08	0.20	0.37
TEXT	LOGISTIC	VSM + SEM	25M	11s	0.916	0.10	0.25	0.44
TEXT	COSINE	VSM	25M	11s	0.901	0.53	0.44	0.90
TEXT (T)	COSINE	VSM + SEM	25M	11s	0.917	0.42	0.70	0.85
IMAGE (I)	LOGISTIC	RESNET	1.7M	4009s	0.743	0.06	0.15	0.27
A + T	LOGISTIC	MLP	1.5M	2s	0.923	0.10	0.40	0.64
A + I	LOGISTIC	MLP	1.5M	2s	0.900	0.10	0.38	0.66
T + I	LOGISTIC	MLP	1.5M	2s	0.921	0.10	0.37	0.63
A + T + I	LOGISTIC	MLP	2M	2s	0.936	0.11	0.39	0.66
A + T	COSINE	MLP	0.3M	2s	0.930	0.43	0.74	0.86
A + I	COSINE	MLP	0.3M	2s	0.896	0.32	0.57	0.76
T + I	COSINE	MLP	0.3M	2s	0.919	0.43	0.74	0.85
A + T + I	COSINE	MLP	0.4M	2s	0.931	0.42	0.72	0.86
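The ADiv@N columns can be read with the following sketch, which assumes that aggregated diversity at N is the number of distinct labels appearing in the top-N predictions of any test item, divided by the total number of labels. That definition is an assumption here, since this section only names the metric; `adiv_at_n` and the toy scores are hypothetical.

```python
def adiv_at_n(scores, n):
    """Fraction of all labels that appear in at least one item's top-n list.

    scores: one list per item, holding one score per label.
    """
    num_labels = len(scores[0])
    seen = set()
    for row in scores:
        ranked = sorted(range(num_labels), key=lambda j: row[j], reverse=True)
        seen.update(ranked[:n])
    return len(seen) / num_labels


# Toy scores for 3 items over 4 labels (hypothetical data).
TOY_SCORES = [
    [0.9, 0.5, 0.1, 0.0],
    [0.8, 0.1, 0.6, 0.0],
    [0.7, 0.6, 0.2, 0.1],
]
```

In the toy data every item ranks label 0 first, so `adiv_at_n(TOY_SCORES, 1)` is 0.25, while the top-2 lists cover three of the four labels. A model that keeps predicting the same few frequent genres scores low on this measure even when its AUC is high.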

t-SNE of PPMI album factors from the *single-parent-label* subset.
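For readers unfamiliar with the term, PPMI is positive pointwise mutual information: the log ratio between an observed co-occurrence probability and the one expected under independence, clipped at zero. The sketch below applies the standard formula to a small count matrix. How the count matrix is built and how factors are derived from the PPMI matrix are not described in this section, so the toy matrices and the choice of natural logarithm are assumptions.

```python
from math import log


def ppmi(counts):
    """Positive PMI of a count matrix given as a list of rows."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, c in enumerate(row):
            if c == 0:
                # PMI is undefined (minus infinity) for zero counts; PPMI uses 0.
                out_row.append(0.0)
            else:
                pmi = log(c * total / (row_sums[i] * col_sums[j]))
                out_row.append(max(0.0, pmi))
        result.append(out_row)
    return result


# Two toy matrices (hypothetical): perfectly associated and independent.
ASSOCIATED = ppmi([[2, 0], [0, 2]])
INDEPENDENT = ppmi([[1, 1], [1, 1]])
```

In the first matrix each row co-occurs only with its own column, giving a PPMI of log 2 on the diagonal; in the second, counts match what independence predicts, so every entry is zero.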

Table 7

Top-3 genre predictions in albums from the test set for LOGISTIC and COSINE audio-based approaches.

Amazon ID	LOGISTIC	COSINE
B00002SWJF	Pop, Dance & Electronic, Rock	Dance & Electronic, Dance Pop, Electronica
B00006FX4G	Pop, Rock, Alternative Rock	Rock, Alternative Rock, Pop
B000000PLF	Pop, Jazz, Rock	Jazz, Bebop, Modern Postbebop
B00005YQOV	Pop, Jazz, Bebop	Jazz, Cool Jazz, Bebop
B0000026BS	Jazz, Pop, Bebop	Jazz, Bebop, Cool Jazz
B0000006PK	Pop, Jazz, Bebop	Jazz, Bebop, Cool Jazz
B00005O6NI	Pop, Rock, World Music	Blues, Traditional Blues, Acoustic Blues
B000BPYKLY	Pop, Jazz, R&B	Smooth Jazz, Soul-Jazz & Boogaloo, Jazz
B000007U2R	Pop, Dance & Electronic, Dance Pop	Dance Pop, Dance & Electronic, Electronica
B002LSPVJO	Rock, Pop, Alternative Rock	Rock, Alternative Rock, Metal
B000007WE5	Pop, Rock, Country	Rock, Pop, Singer-Songwriters
B001IUC4AA	Dance & Electronic, Pop, Dance Pop	Dance & Electronic, Dance Pop, Electronica
B000SFJW2O	Pop, Rock, World Music	Pop, Rock, World Music
B000002NE8	Pop, Rock, Dance & Electronic	Dance & Electronic, Dance Pop, Electronica
B000002GUJ	Rock, Pop, Alternative Rock	Alternative Rock, Indie & Lo-Fi, Indie Rock
B00004T0QB	Pop, Rock, Alternative Rock	Rock, Alternative Rock, Pop
B0000520XS	Pop, Rock, Folk	Singer-Songwriters, Contemporary Folk, Folk
B000EQ47W2	Pop, Rock, Alternative Rock	Metal, Pop Metal, Rock
B00000258F	Pop, Rock, Jazz	Smooth Jazz, Soul-Jazz & Boogaloo, Jazz
B000003748	Pop, Rock, Alternative Rock	Alternative Rock, Rock, American Alternative
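A top-3 list like those in the COSINE column can be produced by ranking genres by cosine similarity between a predicted album vector and one vector per genre. The sketch below illustrates that ranking step only. It assumes such vectors exist in a shared space, and the two-dimensional vectors and function names are hypothetical rather than taken from the article.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_genres(album_vec, genre_vecs, k=3):
    """Return the k genre names most similar to album_vec, best first."""
    ranked = sorted(
        genre_vecs,
        key=lambda g: cosine(album_vec, genre_vecs[g]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical genre vectors; related genres point in similar directions.
GENRE_VECS = {
    "Jazz": [1.0, 0.0],
    "Bebop": [0.9, 0.1],
    "Rock": [0.0, 1.0],
    "Pop": [-1.0, 0.2],
}
PREDICTED = top_k_genres([1.0, 0.0], GENRE_VECS, k=3)
```

Because related genres sit close together in such a space, a ranking of this kind tends to surface a parent genre together with its subgenres, which matches the pattern in the COSINE column (for example Jazz, Bebop, Cool Jazz).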

t-SNE visualization of image vectors from the *single-parent-label* subset.

DOI: https://doi.org/10.5334/tismir.10 | Journal eISSN: 2514-3298

Language: English

Submitted on: Jan 20, 2018

Accepted on: May 1, 2018

Published on: Sep 4, 2018

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: information retrieval, deep learning, music, multimodal, multi-label classification

© 2018 Sergio Oramas, Francesco Barbieri, Oriol Nieto, Xavier Serra, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.

Volume 1 (2018): Issue 1
