Abstract
Deep neural networks, particularly convolutional neural networks, have recently gained traction in music auto-tagging. These networks relieve engineers of the burden of handcrafting domain-specific features. However, musical features often show great temporal diversity, which traditional deep networks are unable to capture. With this in mind, we propose a convolutional neural network architecture that attempts to learn features over multiple timescales. The architecture runs multiple convolutions over several subsampled versions of the original audio spectrogram. The outputs of these convolution streams are then concatenated to make the tag predictions. We evaluate the architecture on the MagnaTagATune dataset and show that it yields results close to the state of the art and comprehensively outperforms shallow classifiers trained on handcrafted features.
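The multi-timescale idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the spectrogram is subsampled, the filter shapes, the pooling, or the output layer. The sketch assumes average pooling along time for subsampling, one bank of untrained random filters per stream, ReLU with global max pooling, and a sigmoid output layer. All function names and sizes are hypothetical.

```python
import numpy as np

def subsample_time(spec, factor):
    """Average-pool a (n_mels, n_frames) spectrogram along time by `factor`.
    Assumption: the abstract does not say how subsampling is performed."""
    n_mels, n_frames = spec.shape
    usable = (n_frames // factor) * factor  # drop trailing frames that do not fill a window
    return spec[:, :usable].reshape(n_mels, usable // factor, factor).mean(axis=2)

def conv_stream(spec, filters):
    """One convolution stream: filters span all frequency bins and slide along time
    (cross-correlation, as in most deep learning libraries), followed by ReLU and
    global max pooling over time.
    filters has shape (n_filters, n_mels, width)."""
    width = filters.shape[2]
    # windows has shape (n_mels, n_frames - width + 1, width)
    windows = np.lib.stride_tricks.sliding_window_view(spec, width, axis=1)
    activations = np.einsum("mtw,fmw->ft", windows, filters)
    activations = np.maximum(activations, 0.0)  # ReLU
    return activations.max(axis=1)              # one feature per filter

def multiscale_tag_scores(spec, stream_filters, factors, weights, bias):
    """Run one stream per subsampling factor, concatenate the stream features,
    and map them to per-tag scores with a sigmoid layer."""
    features = np.concatenate([
        conv_stream(subsample_time(spec, factor), filters)
        for factor, filters in zip(factors, stream_filters)
    ])
    logits = weights @ features + bias
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical sizes and random (untrained) parameters, for shape checking only.
rng = np.random.default_rng(0)
n_mels, n_frames, n_filters, width, n_tags = 40, 256, 8, 4, 5
factors = (1, 2, 4)
spec = rng.random((n_mels, n_frames))
stream_filters = [rng.normal(scale=0.1, size=(n_filters, n_mels, width)) for _ in factors]
weights = rng.normal(scale=0.1, size=(n_tags, n_filters * len(factors)))
bias = np.zeros(n_tags)

scores = multiscale_tag_scores(spec, stream_filters, factors, weights, bias)
print(scores.shape)
```

Because every stream uses the same filter width, a filter in the stream subsampled by a factor of 4 covers four times as much of the original signal as one in the unsubsampled stream, which is one simple way to expose the classifier to several timescales at once.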
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Dabral, T.S., Deshmukh, A.S., Malapati, A. (2019). A Multi-scale Convolutional Neural Network Architecture for Music Auto-Tagging. In: Bansal, J., Das, K., Nagar, A., Deep, K., Ojha, A. (eds) Soft Computing for Problem Solving. Advances in Intelligent Systems and Computing, vol 816. Springer, Singapore. https://doi.org/10.1007/978-981-13-1592-3_60
DOI: https://doi.org/10.1007/978-981-13-1592-3_60
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1591-6
Online ISBN: 978-981-13-1592-3
eBook Packages: Intelligent Technologies and Robotics (R0)