
Multimodal Deep Learning for Music Genre Classification
References
- Adomavicius, G., & Kwon, Y. (2012). Improving aggregate recommendation diversity using ranking based techniques. IEEE Transactions on Knowledge and Data Engineering, 24(5), 896–911. DOI: 10.1109/TKDE.2011.15
- Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828. DOI: 10.1109/TPAMI.2013.50
- Bertin-Mahieux, T., Eck, D., Maillet, F., & Lamere, P. (2008). Autotagger: A model for predicting social tags from acoustic features on large music databases. Journal of New Music Research, 37(2), 115–135. DOI: 10.1080/09298210802479250
- Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference.
- Bogdanov, D., Porter, A., Herrera, P., & Serra, X. (2016). Cross-collection evaluation for music classification tasks. In Proceedings of the 17th International Society for Music Information Retrieval Conference, 379–385.
- Choi, K., Fazekas, G., & Sandler, M. (2016a). Automatic tagging using deep convolutional neural networks. In Proceedings of the 17th International Society for Music Information Retrieval Conference, 805–811.
- Choi, K., Fazekas, G., Sandler, M., & Cho, K. (2016b). Convolutional recurrent neural networks for music classification. arXiv preprint arXiv:1609.04243.
- Choi, K., Lee, J.H., & Downie, J.S. (2014). What is this song about anyway?: Automatic classification of subject using user interpretations and lyrics. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 453–454. DOI: 10.1109/JCDL.2014.6970221
- Chollet, F. (2016). Information-theoretical label embeddings for large-scale image classification. arXiv preprint arXiv:1607.05691.
- Dieleman, S., Brakel, P., & Schrauwen, B. (2011). Audio-based music classification with a pretrained convolutional network. In Proceedings of the 12th International Society for Music Information Retrieval Conference, 669–674.
- Dieleman, S., & Schrauwen, B. (2014). End-to-end learning for music audio. In IEEE International Conference on Acoustics, Speech and Signal Processing, 6964–6968. DOI: 10.1109/ICASSP.2014.6854950
- Dorfer, M., Arzt, A., & Widmer, G. (2016). Towards score following in sheet music images. Proceedings of the 17th International Society for Music Information Retrieval Conference.
- Downie, J.S., & Hu, X. (2006). Review mining for music digital libraries: phase II. Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, 196–197. DOI: 10.1145/1141753.1141796
- Flexer, A. (2007). A closer look on artist filters for musical genre classification. In Proceedings of the 8th International Conference on Music Information Retrieval.
- Gouyon, F., Dixon, S., Pampalk, E., & Widmer, G. (2004). Evaluating rhythmic descriptors for musical genre classification. In Proceedings of the AES 25th International Conference, 196–204.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778. DOI: 10.1109/CVPR.2016.90
- Howard, A.G. (2013). Some improvements on deep convolutional neural network based image classification. arXiv preprint arXiv:1312.5402.
- Hu, X., & Downie, J. (2006). Stylistics in customer reviews of cultural objects. SIGIR Forum, 49–51.
- Hu, X., Downie, J., West, K., & Ehmann, A. (2005). Mining music reviews: Promising preliminary results. In Proceedings of the 6th International Conference on Music Information Retrieval.
- Jain, H., Prabhu, Y., & Varma, M. (2016). Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 935–944. DOI: 10.1145/2939672.2939756
- Kim, Y. (2014). Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1746–1751. DOI: 10.3115/v1/D14-1181
- Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Laurier, C., Grivolla, J., & Herrera, P. (2008). Multimodal music mood classification using audio and lyrics. In Seventh IEEE International Conference on Machine Learning and Applications, 688–693. DOI: 10.1109/ICMLA.2008.96
- Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, 2177–2185.
- Libeks, J., & Turnbull, D. (2011). You can judge an artist by an album cover: Using images for music annotation. IEEE MultiMedia, 18(4), 30–37. DOI: 10.1109/MMUL.2011.1
- Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740–755. DOI: 10.1007/978-3-319-10602-1_48
- Logan, B. (2000). Mel frequency cepstral coefficients for music modeling. In Proceedings of the 1st International Symposium on Music Information Retrieval.
- Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605.
- McAuley, J., Targett, C., Shi, Q., & Van Den Hengel, A. (2015). Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 43–52. DOI: 10.1145/2766462.2767755
- McFee, B., Bertin-Mahieux, T., Ellis, D. P. W., & Lanckriet, G. R. G. (2012). The Million Song Dataset challenge. In WWW’12 Companion: Proceedings of the 21st International Conference on World Wide Web, 909–916. DOI: 10.1145/2187980.2188222
- McFee, B., Raffel, C., Liang, D., Ellis, D. P. W., McVicar, M., Battenberg, E., & Nieto, O. (2015). librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, 1–7.
- McKay, C., & Fujinaga, I. (2008). Combining features extracted from audio, symbolic and cultural sources. In Proceedings of the 9th International Conference on Music Information Retrieval.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
- Moro, A., Raganato, A., & Navigli, R. (2014). Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics, 2, 231–244.
- Navigli, R., & Ponzetto, S.P. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217–250. DOI: 10.1016/j.artint.2012.07.001
- Neumayer, R., & Rauber, A. (2007). Integration of text and audio features for genre classification in music information retrieval. In European Conference on Information Retrieval, 724–727. DOI: 10.1007/978-3-540-71496-5_78
- Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A.Y. (2011). Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, 689–696.
- Oramas, S. (2017). Semantic enrichment for similarity and classification. In Knowledge Extraction and Representation Learning for Music Recommendation and Classification, chapter 6, 75–88. PhD Thesis, Universitat Pompeu Fabra.
- Oramas, S., Espinosa-Anke, L., Lawlor, A., & Serra, X. (2016a). Exploring customer reviews for music genre classification and evolutionary studies. In Proceedings of the 17th International Society for Music Information Retrieval Conference.
- Oramas, S., Espinosa-Anke, L., Sordo, M., Saggion, H., & Serra, X. (2016b). ELMD: An automatically generated entity linking gold standard dataset in the music domain. In Proceedings of the 10th International Conference on Language Resources and Evaluation.
- Oramas, S., Gómez, F., Gómez, E., & Mora, J. (2015). FlaBase: Towards the creation of a flamenco music knowledge base. In Proceedings of the 16th International Society for Music Information Retrieval Conference.
- Oramas, S., Nieto, O., Barbieri, F., & Serra, X. (2017a). Multi-label music genre classification from audio, text, and images using deep features. Proceedings of the 18th International Society for Music Information Retrieval Conference.
- Oramas, S., Nieto, O., Sordo, M., & Serra, X. (2017b). A deep multimodal approach for cold-start music recommendation. 2nd Workshop on Deep Learning for Recommender Systems, collocated with RecSys 2017.
- Pachet, F., & Cazaly, D. (2000). A taxonomy of musical genres. In Content-Based Multimedia Information Access, 2, 1238–1245.
- Pons, J., Lidy, T., & Serra, X. (2016). Experimenting with musically motivated convolutional neural networks. In 14th International Workshop on Content-Based Multimedia Indexing, 1–6. IEEE.
- Pons, J., Nieto, O., Prockup, M., Schmidt, E.M., Ehmann, A.F., & Serra, X. (2017). End-to-end learning for music audio tagging at scale. arXiv preprint arXiv:1711.02520.
- Razavian, A.S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 512–519. DOI: 10.1109/CVPRW.2014.131
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–252. DOI: 10.1007/s11263-015-0816-y
- Sanden, C., & Zhang, J.Z. (2011). Enhancing multi-label music genre classification through ensemble techniques. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, 705–714. SIGIR ’11. DOI: 10.1145/2009916.2010011
- Schedl, M., Orio, N., Liem, C., & Peeters, G. (2013). A professionally annotated and enriched multi-modal data set on popular music. In Proceedings of the 4th ACM Multimedia Systems Conference, 78–83. DOI: 10.1145/2483977.2483985
- Schindler, A., & Rauber, A. (2015). An audio-visual approach to music genre classification through affective color features. In European Conference on Information Retrieval, 61–67.
- Schörkhuber, C., & Klapuri, A. (2010). Constant-Q transform toolbox for music processing. In 7th Sound and Music Computing Conference, 3–64.
- Schreiber, H. (2015). Improving genre annotations for the Million Song Dataset. Proceedings of the 16th International Society for Music Information Retrieval Conference.
- Sermanet, P., & LeCun, Y. (2011). Traffic sign recognition with multi-scale convolutional networks. In International Joint Conference on Neural Networks, 2809–2813. IEEE.
- Seyerlehner, K., Schedl, M., Pohle, T., & Knees, P. (2010a). Using block-level features for genre classification, tag classification and music similarity estimation. Submission to Audio Music Similarity and Retrieval Task of MIREX.
- Seyerlehner, K., Widmer, G., Schedl, M., & Knees, P. (2010b). Automatic music tag classification based on block-level features. In 7th Sound and Music Computing Conference.
- Sordo, M. (2012). Semantic annotation of music collections: A computational approach. PhD thesis, Universitat Pompeu Fabra.
- Srivastava, N., & Salakhutdinov, R.R. (2012). Multi-modal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, 2222–2230.
- Sturm, B.L. (2012). A survey of evaluation in music genre recognition. In International Workshop on Adaptive Multimedia Retrieval, 29–66.
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
- Tsoumakas, G., & Katakis, I. (2006). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3).
- Turnbull, D., Barrington, L., Torres, D., & Lanckriet, G. (2008). Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech, and Language Processing, 16(2), 467–476. DOI: 10.1109/TASL.2007.913750
- Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293–302. DOI: 10.1109/TSA.2002.800560
- van den Oord, A., Dieleman, S., & Schrauwen, B. (2013). Deep content-based music recommendation. In Advances in Neural Information Processing Systems, 2643–2651.
- Wang, F., Wang, X., Shao, B., Li, T., & Ogihara, M. (2009). Tag integrated multi-label music style classification with hypergraph. In Proceedings of the 10th International Society for Music Information Retrieval Conference.
- Wu, X., Qiao, Y., Wang, X., & Tang, X. (2016). Bridging music and image via cross-modal ranking analysis. IEEE Transactions on Multimedia, 18(7), 1305–1318. DOI: 10.1109/TMM.2016.2557722
- Yan, F., & Mikolajczyk, K. (2015). Deep correlation for matching images and text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3441–3450. DOI: 10.1109/CVPR.2015.7298966
- Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 3320–3328.
- Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2921–2929.
- Zobel, J., & Moffat, A. (1998). Exploring the similarity space. ACM SIGIR Forum, 32(1), 18–34.
DOI: https://doi.org/10.5334/tismir.10 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jan 20, 2018
Accepted on: May 1, 2018
Published on: Sep 4, 2018
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year
© 2018 Sergio Oramas, Francesco Barbieri, Oriol Nieto, Xavier Serra, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.