
Multimodal Deep Learning for Music Genre Classification
References
- Adomavicius, G., & Kwon, Y. (2012). Improving aggregate recommendation diversity using ranking based techniques. IEEE Transactions on Knowledge and Data Engineering, 24(5), 896–911. DOI: 10.1109/TKDE.2011.15
- Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828. DOI: 10.1109/TPAMI.2013.50
- Bertin-Mahieux, T., Eck, D., Maillet, F., & Lamere, P. (2008). Autotagger: A model for predicting social tags from acoustic features on large music databases. Journal of New Music Research, 37(2), 115–135. DOI: 10.1080/09298210802479250
- Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference.
- Bogdanov, D., Porter, A., Herrera, P., & Serra, X. (2016). Cross-collection evaluation for music classification tasks. In Proceedings of the 17th International Society for Music Information Retrieval Conference, 379–385.
- Choi, K., Fazekas, G., & Sandler, M. (2016a). Automatic tagging using deep convolutional neural networks. In Proceedings of the 17th International Society for Music Information Retrieval Conference, 805–811.
- Choi, K., Fazekas, G., Sandler, M., & Cho, K. (2016b). Convolutional recurrent neural networks for music classification. arXiv preprint arXiv:1609.04243.
- Choi, K., Lee, J.H., & Downie, J.S. (2014). What is this song about anyway?: Automatic classification of subject using user interpretations and lyrics. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 453–454. DOI: 10.1109/JCDL.2014.6970221
- Chollet, F. (2016). Information-theoretical label embeddings for large-scale image classification. arXiv preprint arXiv:1607.05691.
- Dieleman, S., Brakel, P., & Schrauwen, B. (2011). Audio-based music classification with a pretrained convolutional network. In Proceedings of the 12th International Society for Music Information Retrieval Conference, 669–674.
- Dieleman, S., & Schrauwen, B. (2014). End-to-end learning for music audio. In IEEE International Conference on Acoustics, Speech and Signal Processing, 6964–6968. DOI: 10.1109/ICASSP.2014.6854950
- Dorfer, M., Arzt, A., & Widmer, G. (2016). Towards score following in sheet music images. Proceedings of the 17th International Society for Music Information Retrieval Conference.
- Downie, J.S., & Hu, X. (2006). Review mining for music digital libraries: phase II. Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, 196–197. DOI: 10.1145/1141753.1141796
- Flexer, A. (2007). A closer look on artist filters for musical genre classification. In Proceedings of the 8th International Conference on Music Information Retrieval.
- Gouyon, F., Dixon, S., Pampalk, E., & Widmer, G. (2004). Evaluating rhythmic descriptors for musical genre classification. In Proceedings of the AES 25th International Conference, 196–204.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778. DOI: 10.1109/CVPR.2016.90
- Howard, A.G. (2013). Some improvements on deep convolutional neural network based image classification. arXiv preprint arXiv:1312.5402.
- Hu, X., & Downie, J. (2006). Stylistics in customer reviews of cultural objects. SIGIR Forum, 49–51.
- Hu, X., Downie, J., West, K., & Ehmann, A. (2005). Mining music reviews: Promising preliminary results. In Proceedings of the 6th International Conference on Music Information Retrieval.
- Jain, H., Prabhu, Y., & Varma, M. (2016). Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 935–944. DOI: 10.1145/2939672.2939756
- Kim, Y. (2014). Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1746–1751. DOI: 10.3115/v1/D14-1181
- Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Laurier, C., Grivolla, J., & Herrera, P. (2008). Multimodal music mood classification using audio and lyrics. In Seventh IEEE International Conference on Machine Learning and Applications, 688–693. DOI: 10.1109/ICMLA.2008.96
- Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, 2177–2185.
- Libeks, J., & Turnbull, D. (2011). You can judge an artist by an album cover: Using images for music annotation. IEEE MultiMedia, 18(4), 30–37. DOI: 10.1109/MMUL.2011.1
- Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740–755. DOI: 10.1007/978-3-319-10602-1_48
- Logan, B. (2000). Mel frequency cepstral coefficients for music modeling. In Proceedings of the 1st International Symposium on Music Information Retrieval.
- Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605.
- McAuley, J., Targett, C., Shi, Q., & Van Den Hengel, A. (2015). Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 43–52. DOI: 10.1145/2766462.2767755
- McFee, B., Bertin-Mahieux, T., Ellis, D. P. W., & Lanckriet, G. R. G. (2012). The Million Song Dataset challenge. In WWW’12 Companion: Proceedings of the 21st International Conference on World Wide Web, 909–916. DOI: 10.1145/2187980.2188222
- McFee, B., Raffel, C., Liang, D., Ellis, D. P. W., McVicar, M., Battenberg, E., & Nieto, O. (2015). librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, 1–7.
- McKay, C., & Fujinaga, I. (2008). Combining features extracted from audio, symbolic and cultural sources. In Proceedings of the 9th International Conference on Music Information Retrieval.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
- Moro, A., Raganato, A., & Navigli, R. (2014). Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics, 2, 231–244.
- Navigli, R., & Ponzetto, S.P. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217–250. DOI: 10.1016/j.artint.2012.07.001
- Neumayer, R., & Rauber, A. (2007). Integration of text and audio features for genre classification in music information retrieval. In European Conference on Information Retrieval, 724–727. DOI: 10.1007/978-3-540-71496-5_78
- Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A.Y. (2011). Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, 689–696.
- Oramas, S. (2017). Semantic enrichment for similarity and classification. In Knowledge Extraction and Representation Learning for Music Recommendation and Classification, chapter 6, 75–88. PhD Thesis, Universitat Pompeu Fabra.
- Oramas, S., Espinosa-Anke, L., Lawlor, A., & Serra, X. (2016a). Exploring customer reviews for music genre classification and evolutionary studies. In Proceedings of the 17th International Society for Music Information Retrieval Conference.
- Oramas, S., Espinosa-Anke, L., Sordo, M., Saggion, H., & Serra, X. (2016b). ELMD: An automatically generated entity linking gold standard dataset in the music domain. In Proceedings of the 10th International Conference on Language Resources and Evaluation.
- Oramas, S., Gómez, F., Gómez, E., & Mora, J. (2015). FlaBase: Towards the creation of a flamenco music knowledge base. In Proceedings of the 16th International Society for Music Information Retrieval Conference.
- Oramas, S., Nieto, O., Barbieri, F., & Serra, X. (2017a). Multi-label music genre classification from audio, text, and images using deep features. Proceedings of the 18th International Society for Music Information Retrieval Conference.
- Oramas, S., Nieto, O., Sordo, M., & Serra, X. (2017b). A deep multimodal approach for cold-start music recommendation. 2nd Workshop on Deep Learning for Recommender Systems, collocated with RecSys 2017.
- Pachet, F., & Cazaly, D. (2000). A taxonomy of musical genres. In Content-Based Multimedia Information Access, 2, 1238–1245.
- Pons, J., Lidy, T., & Serra, X. (2016). Experimenting with musically motivated convolutional neural networks. In 14th International Workshop on Content-Based Multimedia Indexing, 1–6. IEEE.
- Pons, J., Nieto, O., Prockup, M., Schmidt, E.M., Ehmann, A.F., & Serra, X. (2017). End-to-end learning for music audio tagging at scale. arXiv preprint arXiv:1711.02520.
- Razavian, A.S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 512–519. DOI: 10.1109/CVPRW.2014.131
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–252. DOI: 10.1007/s11263-015-0816-y
- Sanden, C., & Zhang, J.Z. (2011). Enhancing multi-label music genre classification through ensemble techniques. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, 705–714. SIGIR ’11. DOI: 10.1145/2009916.2010011
- Schedl, M., Orio, N., Liem, C., & Peeters, G. (2013). A professionally annotated and enriched multi-modal data set on popular music. In Proceedings of the 4th ACM Multimedia Systems Conference, 78–83. DOI: 10.1145/2483977.2483985
- Schindler, A., & Rauber, A. (2015). An audio-visual approach to music genre classification through affective color features. In European Conference on Information Retrieval, 61–67.
- Schörkhuber, C., & Klapuri, A. (2010). Constant-Q transform toolbox for music processing. In 7th Sound and Music Computing Conference, 3–64.
- Schreiber, H. (2015). Improving genre annotations for the Million Song Dataset. Proceedings of the 16th International Society for Music Information Retrieval Conference.
- Sermanet, P., & LeCun, Y. (2011). Traffic sign recognition with multi-scale convolutional networks. In International Joint Conference on Neural Networks, 2809–2813. IEEE.
- Seyerlehner, K., Schedl, M., Pohle, T., & Knees, P. (2010a). Using block-level features for genre classification, tag classification and music similarity estimation. Submission to Audio Music Similarity and Retrieval Task of MIREX.
- Seyerlehner, K., Widmer, G., Schedl, M., & Knees, P. (2010b). Automatic music tag classification based on block-level features. In 7th Sound and Music Computing Conference.
- Sordo, M. (2012). Semantic annotation of music collections: A computational approach. PhD thesis, Universitat Pompeu Fabra.
- Srivastava, N., & Salakhutdinov, R.R. (2012). Multi-modal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, 2222–2230.
- Sturm, B.L. (2012). A survey of evaluation in music genre recognition. In International Workshop on Adaptive Multimedia Retrieval, 29–66.
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
- Tsoumakas, G., & Katakis, I. (2006). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3).
- Turnbull, D., Barrington, L., Torres, D., & Lanckriet, G. (2008). Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech, and Language Processing, 16(2), 467–476. DOI: 10.1109/TASL.2007.913750
- Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293–302. DOI: 10.1109/TSA.2002.800560
- van den Oord, A., Dieleman, S., & Schrauwen, B. (2013). Deep content-based music recommendation. In Advances in Neural Information Processing Systems, 2643–2651.
- Wang, F., Wang, X., Shao, B., Li, T., & Ogihara, M. (2009). Tag integrated multi-label music style classification with hypergraph. In Proceedings of the 10th International Society for Music Information Retrieval Conference.
- Wu, X., Qiao, Y., Wang, X., & Tang, X. (2016). Bridging music and image via cross-modal ranking analysis. IEEE Transactions on Multimedia, 18(7), 1305–1318. DOI: 10.1109/TMM.2016.2557722
- Yan, F., & Mikolajczyk, K. (2015). Deep correlation for matching images and text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3441–3450. DOI: 10.1109/CVPR.2015.7298966
- Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 3320–3328.
- Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2921–2929.
- Zobel, J., & Moffat, A. (1998). Exploring the similarity space. ACM SIGIR Forum, 32(1), 18–34.
DOI: https://doi.org/10.5334/tismir.10 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jan 20, 2018
Accepted on: May 1, 2018
Published on: Sep 4, 2018
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year
© 2018 Sergio Oramas, Francesco Barbieri, Oriol Nieto, Xavier Serra, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.