DOI: 10.1145/3616195.3616215
research-article

Supervised Contrastive Learning For Musical Onset Detection

Published: 11 October 2023

Abstract

This paper applies supervised contrastive learning to musical onset detection to alleviate the problem of noisy annotations in onset datasets. The results are compared against a state-of-the-art convolutional model trained with cross-entropy loss. Both models were trained on two datasets. The first dataset comprised a manually annotated selection of music. This data was then augmented with inaccurate labelling to produce the second dataset. When trained on the original data, the supervised contrastive model produced an F1 score of 0.878, close to the cross-entropy model's score of 0.888. This shows that supervised contrastive loss is applicable to onset detection but does not outperform cross-entropy models in an ideal training case. When trained on the augmented set, the contrastive model consistently outperformed the cross-entropy model across increasing percentages of label inaccuracy, with a difference in F1 score of 0.1 for the most inaccurate data. This demonstrates the robustness of supervised contrastive learning to inaccurate data in onset detection, and suggests that supervised contrastive loss could provide a new onset detection architecture that is robust to noise in the data and to inaccuracies in labelling.
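
For readers unfamiliar with the loss named in the abstract, the following is a minimal NumPy sketch of a supervised contrastive loss in the form given by Khosla et al. (2020). It is an illustration only and not the authors' implementation: the function name, the temperature value, and the convention that label 1 marks an onset frame and label 0 a non-onset frame are all assumptions made here, since the abstract does not describe the encoder, the batch construction, or the hyperparameters.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch (hypothetical helper).

    embeddings: (n, d) array of non-zero feature vectors, n >= 2.
    labels:     (n,) array of class labels (assumed: 1 = onset, 0 = non-onset).
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)

    # L2-normalise so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # An anchor is never contrasted with itself.
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over every other sample in the batch.
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - row_max - log_denominator

    # Positives are the other samples that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    n_positives = positives.sum(axis=1)
    has_positive = n_positives > 0
    if not has_positive.any():
        return 0.0

    # Average log-probability of the positives, negated, per anchor.
    summed = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -summed[has_positive] / n_positives[has_positive]
    return float(per_anchor.mean())


# Toy batch: two tight clusters of 2-D embeddings.
batch = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
clean_loss = supervised_contrastive_loss(batch, [1, 1, 0, 0])
mixed_loss = supervised_contrastive_loss(batch, [1, 0, 1, 0])
```

In this toy batch the loss is small when the labels agree with the clusters and large when they cut across them, which is the behaviour the loss is designed to reward: embeddings of the same class are pulled together and embeddings of different classes are pushed apart.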

Cited By

  • (2024) Future Feature-Based Supervised Contrastive Learning for Streaming Perception. IEEE Transactions on Circuits and Systems for Video Technology 34:12 (13611-13625). DOI: 10.1109/TCSVT.2024.3439692. Online publication date: Dec-2024
  • (2024) On the Acoustic-Based Recognition of Multiple Objects Using Overlapped Impact Sounds. IEEE Access 12 (135651-135666). DOI: 10.1109/ACCESS.2024.3459423. Online publication date: 2024
  • (2024) Identification of Non-Speaking and Minimal-Speaking Individuals Using Nonverbal Vocalizations. IEEE Access 12 (68954-68967). DOI: 10.1109/ACCESS.2024.3398584. Online publication date: 2024

      Published In

      AM '23: Proceedings of the 18th International Audio Mostly Conference
      August 2023
      204 pages
      ISBN:9798400708183
      DOI:10.1145/3616195

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. contrastive learning
      2. data inaccuracies
      3. onset detection

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • UKRI

      Conference

      AM '23
      AM '23: Audio Mostly 2023
      August 30 - September 1, 2023
      Edinburgh, United Kingdom

      Acceptance Rates

      Overall Acceptance Rate 177 of 275 submissions, 64%
