Deep Learning-Based Music Instrument Recognition: Exploring Learned Feature Representations

Taenzer, Michael; Mimilakis, Stylianos I.; Abeßer, Jakob

doi:10.1007/978-3-031-35382-6_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13770)

Included in the following conference series:

International Symposium on Computer Music Multidisciplinary Research

Abstract

In this work, we focus on the problem of automatic instrument recognition (AIR) using supervised learning. In particular, we follow a state-of-the-art AIR approach that combines a deep convolutional neural network (CNN) architecture with an attention mechanism. This attention mechanism is conditioned on a learned input feature representation, which itself is extracted by another CNN model acting as a feature extractor. The extractor is pre-trained on a large-scale audio dataset using discriminative objectives for sound event detection. In our experiments, we show that when using log-mel spectrograms as input features instead, the performance of the CNN-based AIR algorithm decreases significantly. Hence, our results indicate that the feature representations are the main factor that affects the performance of the AIR algorithm. Furthermore, we show that various pre-training tasks affect the AIR performance in different ways for subsets of the music instrument classes.
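The attention step described in the abstract can be illustrated with a minimal sketch. It assumes one common form of attention pooling for multi-label clip classification, in which per-frame class probabilities are averaged with learned, per-class attention weights over time. This is not necessarily the authors' exact architecture. The shapes (10 frames, 128-dimensional embeddings, 20 classes), the random weights, and the function name `attention_pool` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(frames, w_cls, b_cls, w_att, b_att):
    """Pool per-frame embeddings into clip-level class probabilities.

    frames: (T, D) array of learned feature embeddings, one row per frame.
    Returns clip-level probabilities (C,) and attention weights (T, C).
    """
    # Per-frame, per-class probabilities (multi-label, so sigmoid).
    frame_probs = sigmoid(frames @ w_cls + b_cls)
    # Per-class attention weights, normalized over the time axis.
    att = softmax(frames @ w_att + b_att, axis=0)
    # Attention-weighted average of the frame-level predictions.
    clip_probs = (att * frame_probs).sum(axis=0)
    return clip_probs, att


# Hypothetical sizes and untrained random parameters (assumptions).
rng = np.random.default_rng(0)
T, D, C = 10, 128, 20
frames = rng.standard_normal((T, D))
w_cls = 0.1 * rng.standard_normal((D, C))
w_att = 0.1 * rng.standard_normal((D, C))
b_cls = np.zeros(C)
b_att = np.zeros(C)

clip_probs, att = attention_pool(frames, w_cls, b_cls, w_att, b_att)
print(clip_probs.shape, att.shape)
```

In this sketch, swapping the input representation (for example, learned embeddings versus log-mel frames) changes only the `frames` array and its dimension `D`; the pooling step stays the same.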

M. Taenzer and S. I. Mimilakis—Equally contributing authors.

Notes

1. Publicly available at https://github.com/cosmir/openmic-2018.

Acknowledgments

This work has been supported by the German Research Foundation (AB 675/2-1).

Author information

Authors and Affiliations

Fraunhofer Institute for Digital Media Technology IDMT, Ilmenau, Germany
Michael Taenzer, Stylianos I. Mimilakis & Jakob Abeßer

Corresponding author

Correspondence to Michael Taenzer.

Editor information

Editors and Affiliations

Aix-Marseille Univ, Marseille Cedex 09, France
Mitsuko Aramaki
Future University Hakodate, Hakodate, Hokkaido, Japan
Keiji Hirata
Nihon University, Tokyo, Japan
Tetsuro Kitahara
Aix-Marseille Univ, Marseille Cedex 09, France
Richard Kronland-Martinet
Aix-Marseille Univ, Marseille Cedex 09, France
Sølvi Ystad

Cite this paper

Taenzer, M., Mimilakis, S.I., Abeßer, J. (2023). Deep Learning-Based Music Instrument Recognition: Exploring Learned Feature Representations. In: Aramaki, M., Hirata, K., Kitahara, T., Kronland-Martinet, R., Ystad, S. (eds) Music in the AI Era. CMMR 2021. Lecture Notes in Computer Science, vol 13770. Springer, Cham. https://doi.org/10.1007/978-3-031-35382-6_4

DOI: https://doi.org/10.1007/978-3-031-35382-6_4
Published: 22 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35381-9
Online ISBN: 978-3-031-35382-6
eBook Packages: Computer Science, Computer Science (R0)
