ABSTRACT
This paper presents our solution for the Requests Sub-challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge. Building on the framework of self-supervised learning, we propose an automated data augmentation technique for audio classification, together with a multi-channel fusion strategy that enhances overall performance. Specifically, to address class imbalance in complaint classification, we introduce an audio data augmentation method that searches for augmentation strategies suited to the challenge dataset. Furthermore, recognizing the distinctive characteristics of the dual-channel HC-C dataset, we individually evaluate the classification performance of the left channel, right channel, channel difference, and channel sum, and then select the optimal integration approach. Our approach yields a significant improvement over the competitive baselines, particularly on the complaint task. Moreover, our method demonstrates noteworthy cross-task transferability.
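The four single-channel views evaluated in the abstract (left, right, channel sum, channel difference) can be derived directly from a stereo recording. The sketch below is a minimal illustration, assuming the signal is a NumPy array of shape (2, num_samples); the 0.5 scaling is an illustrative normalization choice, not necessarily the one used in the paper.

```python
import numpy as np

def channel_views(stereo: np.ndarray) -> dict:
    """Derive the four single-channel views from a stereo signal
    of shape (2, num_samples): left, right, sum (mid), diff (side)."""
    left, right = stereo[0], stereo[1]
    return {
        "left": left,
        "right": right,
        "sum": 0.5 * (left + right),   # mid / mono downmix
        "diff": 0.5 * (left - right),  # side channel
    }
```

Each view can then be classified independently, and the best-performing combination selected for fusion, as described in the abstract.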