An Analysis of the Effect of Data Augmentation Methods: Experiments for a Musical Genre Classification Task

Rémi Mignot; Geoffroy Peeters

doi:10.5334/tismir.26

Transactions of the International Society for Music Information Retrieval

Volume 2 (2019): Issue 1

By: Rémi Mignot and Geoffroy Peeters

Open Access

Dec 2019

Abstract

Supervised machine learning relies on the availability of large annotated datasets. This is essential because small datasets generally lead to overfitting when training high-dimensional machine-learning models. Since manually annotating such large datasets is a long, tedious and expensive process, an alternative is to artificially increase the size of the dataset. This is known as data augmentation. In this paper we provide an in-depth analysis of two data augmentation methods: sound transformations and sound segmentation. The first transforms a music track into a set of new music tracks by applying processes such as pitch-shifting, time-stretching or filtering. The second splits a long sound signal into a set of shorter time segments. We study the effect of these two techniques, and of their parameters, on a genre classification task using public datasets. The main contribution of this work is to detail by experimentation the benefit of these methods, used alone or together, during training and/or testing. We also demonstrate their use in improving robustness to potentially unknown sound degradations. Based on an analysis of these results, we provide good-practice recommendations.
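
As an illustration of the second method, sound segmentation, the sketch below splits a track into fixed-length segments for training and combines per-segment predictions at test time. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the function names, the segment length and hop parameters, the choice to drop an incomplete final segment, and the majority-vote aggregation are all hypothetical choices made for this example. Sound transformations (pitch-shifting, time-stretching, filtering) are not shown.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def split_into_segments(
    signal: Sequence[float], segment_len: int, hop: int
) -> List[List[float]]:
    """Split a signal into fixed-length segments (assumed scheme).

    Segments start every `hop` samples; an incomplete final segment is dropped.
    """
    if segment_len <= 0 or hop <= 0:
        raise ValueError("segment_len and hop must be positive")
    last_start = len(signal) - segment_len
    return [
        list(signal[start:start + segment_len])
        for start in range(0, last_start + 1, hop)
    ]


def augment_training_set(
    tracks: Sequence[Sequence[float]],
    labels: Sequence[str],
    segment_len: int,
    hop: int,
) -> Tuple[List[List[float]], List[str]]:
    """Build a larger training set in which each segment inherits its track's label."""
    segments: List[List[float]] = []
    segment_labels: List[str] = []
    for track, label in zip(tracks, labels):
        for segment in split_into_segments(track, segment_len, hop):
            segments.append(segment)
            segment_labels.append(label)
    return segments, segment_labels


def predict_track(
    signal: Sequence[float],
    classify_segment: Callable[[List[float]], str],
    segment_len: int,
    hop: int,
) -> str:
    """Classify each segment, then take a majority vote (assumed aggregation rule)."""
    votes = Counter(
        classify_segment(segment)
        for segment in split_into_segments(signal, segment_len, hop)
    )
    if not votes:
        raise ValueError("signal is shorter than one segment")
    return votes.most_common(1)[0][0]
```

With overlapping segments (a hop smaller than the segment length), a single track yields more training examples, at the cost of stronger correlation between them. Any classifier that maps one segment to a genre label can be passed as `classify_segment`.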
