
Attend to Chords: Improving Harmonic Analysis of Symbolic Music Using Transformer-Based Models
By: Tsung-Ping Chen and Li Su
References
- Ba, L. J., Kiros, R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
- Belinkov, Y., Durrani, N., Dalvi, F., Sajjad, H., and Glass, J. R. (2017). What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 861–872. DOI: 10.18653/v1/P17-1080
- Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2013). Audio chord recognition with recurrent neural networks. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), pages 335–340.
- Carsault, T., Nika, J., and Esling, P. (2018). Using musical relationships between chord labels in automatic chord extraction tasks. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 18–25.
- Chen, T. and Su, L. (2018). Functional harmony recognition of symbolic music data with multi-task recurrent neural networks. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 90–97.
- Chen, T. and Su, L. (2019). Harmony Transformer: Incorporating chord segmentation into harmony recognition. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 259–267.
- Cho, T. and Bello, J. P. (2009). Real-time implementation of HMM-based chord estimation in music audio. In Proceedings of the International Computer Music Conference (ICMC).
- Chung, J., Ahn, S., and Bengio, Y. (2017). Hierarchical multiscale recurrent neural networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR).
- Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q. V., and Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), pages 2978–2988. DOI: 10.18653/v1/P19-1285
- de Haas, W. B., Magalhães, J. P., Veltkamp, R. C., and Wiering, F. (2011). HARMTRACE: Improving harmonic similarity estimation using functional harmony analysis. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), pages 67–72.
- Degani, A., Dalai, M., Leonardi, R., and Migliorati, P. (2015). Harmonic change detection for musical chords segmentation. In 2015 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. DOI: 10.1109/ICME.2015.7177404
- Degani, A., Dalai, M., Leonardi, R., and Migliorati, P. (2017). Audio chord estimation based on meter modeling and two-stage decoding. In Proceedings of the 10th International Symposium on Image and Signal Processing and Analysis (ISPA), pages 65–69. DOI: 10.1109/ISPA.2017.8073570
- Deng, J. and Kwok, Y. (2016). A hybrid Gaussian-HMM-deep learning approach for automatic chord estimation with very large vocabulary. In Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR), pages 812–818.
- Deng, J. and Kwok, Y. (2017). Large vocabulary automatic chord estimation with an even chance training scheme. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 531–536.
- Devaney, J., Arthur, C., Condit-Schultz, N., and Nisula, K. (2015). Theme and variation encodings with Roman numerals (TAVERN): A new data set for symbolic music analysis. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 728–734.
- Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186.
- Donahue, C., Mao, H. H., Li, Y. E., Cottrell, G. W., and McAuley, J. J. (2019). LakhNES: Improving multi-instrumental music generation with cross-domain pre-training. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 685–692.
- Dong, H. and Yang, Y. (2018). Convolutional generative adversarial networks with binary neurons for polyphonic music generation. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 190–196.
- Fujishima, T. (1999). Realtime chord recognition of musical sound: a system using Common Lisp Music. In Proceedings of the International Computer Music Conference (ICMC).
- Gotham, M. and Ireland, M. (2019). Taking form: A representation standard, conversion code, and example corpora for recording, visualizing, and studying analyses of musical form. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 693–699.
- Gu, J., Bradbury, J., Xiong, C., Li, V. O. K., and Socher, R. (2018). Non-autoregressive neural machine translation. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Harte, C., Sandler, M., and Gasser, M. (2006). Detecting harmonic change in musical audio. In Proceedings of the 1st ACM workshop on Audio and music computing multimedia, pages 21–26. DOI: 10.1145/1178723.1178727
- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. DOI: 10.1109/CVPR.2016.90
- Hermann, K. M., Kociský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. (2015). Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (NIPS), pages 1693–1701.
- Hori, T., Nakamura, K., and Sagayama, S. (2017). Music chord recognition from audio data using bidirectional encoder-decoder LSTMs. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 1312–1315. DOI: 10.1109/APSIPA.2017.8282235
- Hossain, M. Z., Sohel, F., Shiratuddin, M. F., Laga, H., and Bennamoun, M. (2019). Bi-SAN-CAP: Bidirectional self-attention for image captioning. In Proceedings of the Digital Image Computing: Techniques and Applications (DICTA), pages 1–7. DOI: 10.1109/DICTA47822.2019.8946003
- Hou, J., Guo, W., Song, Y., and Dai, L. (2020). Segment boundary detection directed attention for online end-to-end speech recognition. EURASIP J. Audio, Speech and Music Processing, 2020(1), 3. DOI: 10.1186/s13636-020-0170-z
- Huang, C. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. (2019). Music Transformer: Generating music with long-term structure. In Proceedings of the 7th International Conference on Learning Representations (ICLR).
- Humphrey, E. J. and Bello, J. P. (2012). Rethinking automatic chord recognition with convolutional neural networks. In Proceedings of the 11th International Conference on Machine Learning and Applications (ICMLA), pages 357–362. DOI: 10.1109/ICMLA.2012.220
- Humphrey, E. J. and Bello, J. P. (2015). Four timely insights on automatic chord estimation. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 673–679.
- Illescas, P. R., Rizo, D., and Quereda, J. M. I. (2007). Harmonic, melodic, and functional automatic analysis. In Proceedings of the International Computer Music Conference (ICMC).
- Jiang, J., Chen, K., Li, W., and Xia, G. (2019). Large-vocabulary chord transcription via chord structure decomposition. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 644–651.
- Korzeniowski, F. and Widmer, G. (2016). A fully convolutional deep auditory model for musical chord recognition. In Proceedings of the 26th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. DOI: 10.1109/MLSP.2016.7738895
- Korzeniowski, F. and Widmer, G. (2017). On the futility of learning complex frame-level language models for chord recognition. In Proceedings of the AES International Conference on Semantic Audio.
- Korzeniowski, F. and Widmer, G. (2018). Improved chord recognition by combining duration and harmonic language models. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 10–17.
- Lee, K. (2006). Automatic chord recognition from audio using enhanced pitch class profile. In Proceedings of the International Computer Music Conference (ICMC).
- Li, X. and Wu, X. (2015). Long Short-Term Memory based convolutional recurrent neural networks for large vocabulary speech recognition. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 3219–3223.
- Lim, Y., Chan, C. S., and Loo, F. Y. (2020). Style-conditioned music generation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. DOI: 10.1109/ICME46284.2020.9102870
- Masada, K. and Bunescu, R. C. (2017). Chord recognition in symbolic music using Semi-Markov Conditional Random Fields. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 272–278.
- Masada, K. and Bunescu, R. C. (2019). Chord recognition in symbolic music: A segmental CRF model, segment-level features, and comparative evaluations on classical and popular music. Trans. Int. Soc. Music. Inf. Retr., 2(1), 1–13. DOI: 10.5334/tismir.18
- Mauch, M. and Dixon, S. (2010). Simultaneous estimation of chords and musical context from audio. IEEE Trans. Audio, Speech & Language Processing (TASLP), 18(6), 1280–1289. DOI: 10.1109/TASL.2009.2032947
- McFee, B. and Bello, J. P. (2017). Structured training for large-vocabulary chord recognition. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 188–194.
- Melamud, O., Goldberger, J., and Dagan, I. (2016). Context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), pages 51–61. DOI: 10.18653/v1/K16-1006
- Micchi, G., Gotham, M., and Giraud, M. (2020). Not all roads lead to Rome: Pitch representation and model architecture for automatic harmonic analysis. Trans. Int. Soc. Music. Inf. Retr., 3(1), 42–54. DOI: 10.5334/tismir.45
- Miller, A. H., Fisch, A., Dodge, J., Karimi, A., Bordes, A., and Weston, J. (2016). Key-value memory networks for directly reading documents. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1400–1409. DOI: 10.18653/v1/D16-1147
- Neuwirth, M., Harasim, D., Moss, F. C., and Rohrmeier, M. (2018). The annotated Beethoven corpus (ABC): A dataset of harmonic analyses of all Beethoven string quartets. Front. Digital Humanities, 5. DOI: 10.3389/fdigh.2018.00016
- Ni, Y., McVicar, M., Santos-Rodríguez, R., and Bie, T. D. (2013). Understanding effects of subjectivity in measuring chord estimation accuracy. IEEE ACM Trans. Audio Speech Lang. Process., 21(12), 2607–2615. DOI: 10.1109/TASL.2013.2280218
- Oudre, L., Févotte, C., and Grenier, Y. (2011). Probabilistic template-based chord recognition. IEEE Trans. Audio, Speech & Language Processing (TASLP), 19(8), 2249–2259. DOI: 10.1109/TASL.2010.2098870
- Parikh, A. P., Täckström, O., Das, D., and Uszkoreit, J. (2016). A decomposable attention model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2249–2255. DOI: 10.18653/v1/D16-1244
- Park, J., Choi, K., Jeon, S., Kim, D., and Park, J. (2019). A bi-directional Transformer for musical chord recognition. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 620–627.
- Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. (2018). Image Transformer. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 4052–4061.
- Passos, A. T., Sampaio, M., Kröger, P., and de Cidra, G. (2009). Functional harmonic analysis and computational musicology in Rameau. In Proceedings of the 12th Brazilian Symposium on Computer Music (SBCM).
- Pauwels, J., O’Hanlon, K., Gómez, E., and Sandler, M. B. (2019). 20 years of automatic chord recognition from audio. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 54–63.
- Raphael, C. and Stoddard, J. (2004). Functional harmonic analysis using probabilistic models. Computer Music Journal, 28(3), 45–52. DOI: 10.1162/0148926041790676
- Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T. (2019). FastSpeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 3165–3174.
- Rhodes, C., Lewis, D., and Müllensiefen, D. (2009). Bayesian model selection for harmonic labelling. In Klouche, T. and Noll, T., editors, Mathematics and Computation in Music, pages 107–116. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-04579-0_11
- Rocher, T., Robine, M., Hanna, P., and Strandh, R. (2009). Dynamic chord analysis for symbolic music. In Proceedings of the International Computer Music Conference (ICMC).
- Scholz, R. E. P. and Ramalho, G. L. (2008). COCHONUT: recognizing complex chords from MIDI guitar sequences. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR), pages 27–32.
- Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). Self-attention with relative position representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 464–468. DOI: 10.18653/v1/N18-2074
- Sheh, A. and Ellis, D. P. W. (2003). Chord segmentation and recognition using EM-trained Hidden Markov Models. In Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR).
- Shen, T., Zhou, T., Long, G., Jiang, J., and Zhang, C. (2018). Bi-directional block self-attention for fast and memory-efficient sequence modeling. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Stark, A. M. and Plumbley, M. D. (2009). Real-time chord recognition for live performance. In Proceedings of the International Computer Music Conference (ICMC).
- Tsui, V. and MacLean, W. J. (2002). Harmonic analysis using neural networks. In Proceedings of the International Computer Music Conference (ICMC).
- Tymoczko, D., Gotham, M., Cuthbert, M. S., and Ariza, C. (2019). The RomanText format: A flexible and standard method for representing Roman numeral analyses. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 123–129.
- Ueda, Y., Uchiyama, Y., Nishimoto, T., Ono, N., and Sagayama, S. (2010). HMM-based approach for automatic chord detection using refined acoustic features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5518–5521. DOI: 10.1109/ICASSP.2010.5495218
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems (NIPS), pages 5998–6008.
- Wang, Y., Lee, H., and Lee, L. (2018). Segmental audio word2vec: Representing utterances as sequences of vectors with applications in spoken term detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6269–6273. DOI: 10.1109/ICASSP.2018.8462002
- Yang, M., Su, L., and Yang, Y. (2016). Highlighting root notes in chord recognition using cepstral features and multi-task learning. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 1–8. DOI: 10.1109/APSIPA.2016.7820865
- Yoshioka, T., Kitahara, T., Komatani, K., Ogata, T., and Okuno, H. G. (2004). Automatic chord transcription with concurrent recognition of chord symbols and boundaries. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR).
- Zenz, V. and Rauber, A. (2007). Automatic chord detection incorporating beat and key detection. In Proceedings of the IEEE International Conference on Signal Processing and Communications (ICSPC), pages 1175–1178. DOI: 10.1109/ICSPC.2007.4728534
- Zhou, X. and Lerch, A. (2015). Chord detection using deep learning. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 52–58.
DOI: https://doi.org/10.5334/tismir.65 | Journal eISSN: 2514-3298
Language: English
Submitted on: May 10, 2020
Accepted on: Jan 7, 2021
Published on: Feb 24, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year
© 2021 Tsung-Ping Chen, Li Su, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.