
Attend to Chords: Improving Harmonic Analysis of Symbolic Music Using Transformer-Based Models
By: Tsung-Ping Chen and Li Su
References
- Ba, L. J., Kiros, R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
- Belinkov, Y., Durrani, N., Dalvi, F., Sajjad, H., and Glass, J. R. (2017). What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 861–872. DOI: 10.18653/v1/P17-1080
- Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2013). Audio chord recognition with recurrent neural networks. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), pages 335–340.
- Carsault, T., Nika, J., and Esling, P. (2018). Using musical relationships between chord labels in automatic chord extraction tasks. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 18–25.
- Chen, T. and Su, L. (2018). Functional harmony recognition of symbolic music data with multi-task recurrent neural networks. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 90–97.
- Chen, T. and Su, L. (2019). Harmony Transformer: Incorporating chord segmentation into harmony recognition. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 259–267.
- Cho, T. and Bello, J. P. (2009). Real-time implementation of HMM-based chord estimation in music audio. In Proceedings of the International Computer Music Conference (ICMC).
- Chung, J., Ahn, S., and Bengio, Y. (2017). Hierarchical multiscale recurrent neural networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR).
- Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q. V., and Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), pages 2978–2988. DOI: 10.18653/v1/P19-1285
- de Haas, W. B., Magalhães, J. P., Veltkamp, R. C., and Wiering, F. (2011). HARMTRACE: Improving harmonic similarity estimation using functional harmony analysis. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), pages 67–72.
- Degani, A., Dalai, M., Leonardi, R., and Migliorati, P. (2015). Harmonic change detection for musical chords segmentation. In 2015 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. DOI: 10.1109/ICME.2015.7177404
- Degani, A., Dalai, M., Leonardi, R., and Migliorati, P. (2017). Audio chord estimation based on meter modeling and two-stage decoding. In Proceedings of the 10th International Symposium on Image and Signal Processing and Analysis (ISPA), pages 65–69. DOI: 10.1109/ISPA.2017.8073570
- Deng, J. and Kwok, Y. (2016). A hybrid Gaussian-HMM-deep learning approach for automatic chord estimation with very large vocabulary. In Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR), pages 812–818.
- Deng, J. and Kwok, Y. (2017). Large vocabulary automatic chord estimation with an even chance training scheme. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 531–536.
- Devaney, J., Arthur, C., Condit-Schultz, N., and Nisula, K. (2015). Theme and variation encodings with Roman numerals (TAVERN): A new data set for symbolic music analysis. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 728–734.
- Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186.
- Donahue, C., Mao, H. H., Li, Y. E., Cottrell, G. W., and McAuley, J. J. (2019). LakhNES: Improving multi-instrumental music generation with cross-domain pre-training. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 685–692.
- Dong, H. and Yang, Y. (2018). Convolutional generative adversarial networks with binary neurons for polyphonic music generation. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 190–196.
- Fujishima, T. (1999). Realtime chord recognition of musical sound: a system using Common Lisp Music. In Proceedings of the International Computer Music Conference (ICMC).
- Gotham, M. and Ireland, M. (2019). Taking form: A representation standard, conversion code, and example corpora for recording, visualizing, and studying analyses of musical form. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 693–699.
- Gu, J., Bradbury, J., Xiong, C., Li, V. O. K., and Socher, R. (2018). Non-autoregressive neural machine translation. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Harte, C., Sandler, M., and Gasser, M. (2006). Detecting harmonic change in musical audio. In Proceedings of the 1st ACM workshop on Audio and music computing multimedia, pages 21–26. DOI: 10.1145/1178723.1178727
- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. DOI: 10.1109/CVPR.2016.90
- Hermann, K. M., Kociský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. (2015). Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (NIPS), pages 1693–1701.
- Hori, T., Nakamura, K., and Sagayama, S. (2017). Music chord recognition from audio data using bidirectional encoder-decoder LSTMs. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 1312–1315. DOI: 10.1109/APSIPA.2017.8282235
- Hossain, M. Z., Sohel, F., Shiratuddin, M. F., Laga, H., and Bennamoun, M. (2019). Bi-SAN-CAP: Bidirectional self-attention for image captioning. In Proceedings of the Digital Image Computing: Techniques and Applications (DICTA), pages 1–7. DOI: 10.1109/DICTA47822.2019.8946003
- Hou, J., Guo, W., Song, Y., and Dai, L. (2020). Segment boundary detection directed attention for online end-to-end speech recognition. EURASIP J. Audio, Speech and Music Processing, 2020(1), 3. DOI: 10.1186/s13636-020-0170-z
- Huang, C. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. (2019). Music Transformer: Generating music with long-term structure. In Proceedings of the 7th International Conference on Learning Representations (ICLR).
- Humphrey, E. J. and Bello, J. P. (2012). Rethinking automatic chord recognition with convolutional neural networks. In Proceedings of the 11th International Conference on Machine Learning and Applications (ICMLA), pages 357–362. DOI: 10.1109/ICMLA.2012.220
- Humphrey, E. J. and Bello, J. P. (2015). Four timely insights on automatic chord estimation. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 673–679.
- Illescas, P. R., Rizo, D., and Quereda, J. M. I. (2007). Harmonic, melodic, and functional automatic analysis. In Proceedings of the International Computer Music Conference (ICMC).
- Jiang, J., Chen, K., Li, W., and Xia, G. (2019). Large-vocabulary chord transcription via chord structure decomposition. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 644–651.
- Korzeniowski, F. and Widmer, G. (2016). A fully convolutional deep auditory model for musical chord recognition. In Proceedings of the 26th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. DOI: 10.1109/MLSP.2016.7738895
- Korzeniowski, F. and Widmer, G. (2017). On the futility of learning complex frame-level language models for chord recognition. In Proceedings of the AES International Conference on Semantic Audio.
- Korzeniowski, F. and Widmer, G. (2018). Improved chord recognition by combining duration and harmonic language models. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 10–17.
- Lee, K. (2006). Automatic chord recognition from audio using enhanced pitch class profile. In Proceedings of the International Computer Music Conference (ICMC).
- Li, X. and Wu, X. (2015). Long Short-Term Memory based convolutional recurrent neural networks for large vocabulary speech recognition. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 3219–3223.
- Lim, Y., Chan, C. S., and Loo, F. Y. (2020). Style-conditioned music generation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. DOI: 10.1109/ICME46284.2020.9102870
- Masada, K. and Bunescu, R. C. (2017). Chord recognition in symbolic music using Semi-Markov Conditional Random Fields. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 272–278.
- Masada, K. and Bunescu, R. C. (2019). Chord recognition in symbolic music: A segmental CRF model, segment-level features, and comparative evaluations on classical and popular music. Trans. Int. Soc. Music. Inf. Retr., 2(1), 1–13. DOI: 10.5334/tismir.18
- Mauch, M. and Dixon, S. (2010). Simultaneous estimation of chords and musical context from audio. IEEE Trans. Audio, Speech & Language Processing (TASLP), 18(6), 1280–1289. DOI: 10.1109/TASL.2009.2032947
- McFee, B. and Bello, J. P. (2017). Structured training for large-vocabulary chord recognition. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 188–194.
- Melamud, O., Goldberger, J., and Dagan, I. (2016). Context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), pages 51–61. DOI: 10.18653/v1/K16-1006
- Micchi, G., Gotham, M., and Giraud, M. (2020). Not all roads lead to Rome: Pitch representation and model architecture for automatic harmonic analysis. Trans. Int. Soc. Music. Inf. Retr., 3(1), 42–54. DOI: 10.5334/tismir.45
- Miller, A. H., Fisch, A., Dodge, J., Karimi, A., Bordes, A., and Weston, J. (2016). Key-value memory networks for directly reading documents. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1400–1409. DOI: 10.18653/v1/D16-1147
- Neuwirth, M., Harasim, D., Moss, F. C., and Rohrmeier, M. (2018). The annotated Beethoven corpus (ABC): A dataset of harmonic analyses of all Beethoven string quartets. Front. Digital Humanities, 5. DOI: 10.3389/fdigh.2018.00016
- Ni, Y., McVicar, M., Santos-Rodríguez, R., and Bie, T. D. (2013). Understanding effects of subjectivity in measuring chord estimation accuracy. IEEE ACM Trans. Audio Speech Lang. Process., 21(12), 2607–2615. DOI: 10.1109/TASL.2013.2280218
- Oudre, L., Févotte, C., and Grenier, Y. (2011). Probabilistic template-based chord recognition. IEEE Trans. Audio, Speech & Language Processing (TASLP), 19(8), 2249–2259. DOI: 10.1109/TASL.2010.2098870
- Parikh, A. P., Täckström, O., Das, D., and Uszkoreit, J. (2016). A decomposable attention model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2249–2255. DOI: 10.18653/v1/D16-1244
- Park, J., Choi, K., Jeon, S., Kim, D., and Park, J. (2019). A bi-directional Transformer for musical chord recognition. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 620–627.
- Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. (2018). Image Transformer. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 4052–4061.
- Passos, A. T., Sampaio, M., Kröger, P., and de Cidra, G. (2009). Functional harmonic analysis and computational musicology in Rameau. In Proceedings of the 12th Brazilian Symposium on Computer Music (SBCM).
- Pauwels, J., O’Hanlon, K., Gómez, E., and Sandler, M. B. (2019). 20 years of automatic chord recognition from audio. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 54–63.
- Raphael, C. and Stoddard, J. (2004). Functional harmonic analysis using probabilistic models. Computer Music Journal, 28(3), 45–52. DOI: 10.1162/0148926041790676
- Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T. (2019). FastSpeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 3165–3174.
- Rhodes, C., Lewis, D., and Müllensiefen, D. (2009). Bayesian model selection for harmonic labelling. In Klouche, T. and Noll, T., editors, Mathematics and Computation in Music, pages 107–116. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-04579-0_11
- Rocher, T., Robine, M., Hanna, P., and Strandh, R. (2009). Dynamic chord analysis for symbolic music. In Proceedings of the International Computer Music Conference (ICMC).
- Scholz, R. E. P. and Ramalho, G. L. (2008). COCHONUT: recognizing complex chords from MIDI guitar sequences. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR), pages 27–32.
- Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). Self-attention with relative position representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 464–468. DOI: 10.18653/v1/N18-2074
- Sheh, A. and Ellis, D. P. W. (2003). Chord segmentation and recognition using EM-trained Hidden Markov Models. In Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR).
- Shen, T., Zhou, T., Long, G., Jiang, J., and Zhang, C. (2018). Bi-directional block self-attention for fast and memory-efficient sequence modeling. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Stark, A. M. and Plumbley, M. D. (2009). Real-time chord recognition for live performance. In Proceedings of the International Computer Music Conference (ICMC).
- Tsui, V. and MacLean, W. J. (2002). Harmonic analysis using neural networks. In Proceedings of the International Computer Music Conference (ICMC).
- Tymoczko, D., Gotham, M., Cuthbert, M. S., and Ariza, C. (2019). The RomanText format: A flexible and standard method for representing Roman numeral analyses. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pages 123–129.
- Ueda, Y., Uchiyama, Y., Nishimoto, T., Ono, N., and Sagayama, S. (2010). HMM-based approach for automatic chord detection using refined acoustic features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5518–5521. DOI: 10.1109/ICASSP.2010.5495218
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems (NIPS), pages 5998–6008.
- Wang, Y., Lee, H., and Lee, L. (2018). Segmental audio word2vec: Representing utterances as sequences of vectors with applications in spoken term detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6269–6273. DOI: 10.1109/ICASSP.2018.8462002
- Yang, M., Su, L., and Yang, Y. (2016). Highlighting root notes in chord recognition using cepstral features and multi-task learning. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 1–8. DOI: 10.1109/APSIPA.2016.7820865
- Yoshioka, T., Kitahara, T., Komatani, K., Ogata, T., and Okuno, H. G. (2004). Automatic chord transcription with concurrent recognition of chord symbols and boundaries. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR).
- Zenz, V. and Rauber, A. (2007). Automatic chord detection incorporating beat and key detection. In Proceedings of the IEEE International Conference on Signal Processing and Communications (ICSPC), pages 1175–1178. DOI: 10.1109/ICSPC.2007.4728534
- Zhou, X. and Lerch, A. (2015). Chord detection using deep learning. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 52–58.
DOI: https://doi.org/10.5334/tismir.65 | Journal eISSN: 2514-3298
Language: English
Submitted on: May 10, 2020
Accepted on: Jan 7, 2021
Published on: Feb 24, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year
© 2021 Tsung-Ping Chen, Li Su, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.