Supervised Contrastive Learning For Musical Onset Detection

Authors:

György Fazekas

AM '23: Proceedings of the 18th International Audio Mostly Conference

Pages 130 - 135

https://doi.org/10.1145/3616195.3616215

Published: 11 October 2023

Abstract

This paper applies supervised contrastive learning to musical onset detection to alleviate the issue of noisy annotated data in onset datasets. The results are compared against a state-of-the-art convolutional cross-entropy model. Both models were trained on two datasets. The first dataset comprised a manually annotated selection of music. This data was then augmented with inaccurate labelling to produce the second dataset. When trained on the original data, the supervised contrastive model produced an F1 score of 0.878, close to the cross-entropy model's score of 0.888. This showed that supervised contrastive loss is applicable to onset detection but does not outperform cross-entropy models in an ideal training case. When trained on the augmented set, the contrastive model consistently outperformed the cross-entropy model across increasing percentages of label inaccuracy, with a difference in F1 score of 0.1 for the most inaccurate data. This demonstrates the robustness of supervised contrastive learning to inaccurate data for onset detection, and suggests that supervised contrastive loss could provide a new onset detection architecture that is invariant to noise in the data or inaccuracies in labelling.
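
The abstract names supervised contrastive loss but does not show it. As a rough illustration only, the sketch below follows the general formulation of Khosla et al. (2020): embeddings that share a label are pulled together and all others are pushed apart. It is not the authors' implementation. The function name, the temperature of 0.1, the 2-D toy embeddings, and the convention of 1 for onset frames and 0 for non-onset frames are all assumptions made for the example.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors with at least one positive.

    Hypothetical helper for illustration; the temperature value is an assumption.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels)
    n = z.shape[0]

    # Positives: other samples in the batch that share the anchor's label.
    not_self = ~np.eye(n, dtype=bool)
    positives = (y[:, None] == y[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive, so nothing to contrast

    # Cosine similarity scaled by temperature, with self-pairs excluded.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = np.where(not_self, (z @ z.T) / temperature, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Log-softmax of each pair over all other samples in the batch.
    log_denominator = np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_prob = logits - log_denominator

    # Average log-probability of the positives, per anchor.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    mean_log_prob_pos = pos_log_prob[has_pos] / n_pos[has_pos]
    return float(-mean_log_prob_pos.mean())


# Toy batch: two tight clusters standing in for onset / non-onset frames.
embeddings = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
clean_loss = supervised_contrastive_loss(embeddings, [1, 1, 0, 0])
mislabelled_loss = supervised_contrastive_loss(embeddings, [1, 0, 1, 0])
print(clean_loss, mislabelled_loss)
```

With labels that match the clusters the loss is close to zero, and with swapped labels it is much larger. During training this gap drives same-label frames toward a shared region of the embedding space.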

Cited By

Wang T, Huang H (2024). Future Feature-Based Supervised Contrastive Learning for Streaming Perception. IEEE Transactions on Circuits and Systems for Video Technology 34:12 (13611-13625). Online publication date: Dec-2024.
https://doi.org/10.1109/TCSVT.2024.3439692
Tran V, Tsai W (2024). On the Acoustic-Based Recognition of Multiple Objects Using Overlapped Impact Sounds. IEEE Access 12 (135651-135666). Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3459423
Tran V, Tsai W (2024). Identification of Non-Speaking and Minimal-Speaking Individuals Using Nonverbal Vocalizations. IEEE Access 12 (68954-68967). Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3398584

Index Terms

Supervised Contrastive Learning For Musical Onset Detection
1. Applied computing
  1. Arts and humanities
    1. Sound and music computing
2. Computing methodologies
  1. Artificial intelligence

Published In

AM '23: Proceedings of the 18th International Audio Mostly Conference

August 2023

204 pages

ISBN:9798400708183

DOI:10.1145/3616195

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

UKRI

Conference

AM '23

AM '23: Audio Mostly 2023

August 30 - September 1, 2023

Edinburgh, United Kingdom

Acceptance Rates

Overall Acceptance Rate 177 of 275 submissions, 64%

Bibliometrics

Total Citations: 3
Total Downloads: 114
Downloads (Last 12 months): 41
Downloads (Last 6 weeks): 3

Reflects downloads up to 02 Mar 2025
