DOI: 10.1145/3581783.3612849
research-article

Automatic Audio Augmentation for Requests Sub-Challenge

Published: 27 October 2023

Abstract

This paper presents our solution for the Requests Sub-challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge. Drawing upon the framework of self-supervised learning, we put forth an automated data augmentation technique for audio classification, accompanied by a multi-channel fusion strategy aimed at enhancing overall performance. Specifically, to tackle the issue of imbalanced classes in complaint classification, we propose an audio data augmentation method that generates appropriate augmentation strategies for the challenge dataset. Furthermore, recognizing the distinctive characteristics of the dual-channel HC-C dataset, we individually evaluate the classification performance of the left channel, right channel, channel difference, and channel sum, subsequently selecting the optimal integration approach. Our approach yields a significant improvement in performance when compared to the competitive baselines, particularly in the context of the complaint task. Moreover, our method demonstrates noteworthy cross-task transferability.
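The four channel views named in the abstract (left, right, difference, sum) can be illustrated with a minimal sketch. The abstract does not say how the difference and sum are computed or scaled, so the unscaled sample-wise `left - right` and `left + right` below are assumptions, as are the helper names `channel_views` and `select_best_view` and the `(n_samples, 2)` input layout.

```python
from typing import Dict

import numpy as np


def channel_views(stereo: np.ndarray) -> Dict[str, np.ndarray]:
    """Split a stereo waveform of shape (n_samples, 2) into four mono views."""
    if stereo.ndim != 2 or stereo.shape[1] != 2:
        raise ValueError("expected an array of shape (n_samples, 2)")
    left = stereo[:, 0].astype(np.float32)
    right = stereo[:, 1].astype(np.float32)
    return {
        "left": left,
        "right": right,
        "difference": left - right,  # assumption: left minus right, unscaled
        "sum": left + right,         # assumption: plain sum, unscaled
    }


def select_best_view(scores: Dict[str, float]) -> str:
    """Return the name of the view with the highest validation score."""
    return max(scores, key=scores.get)
```

In such a setup, each view would be fed to the same classifier, the per-view validation scores collected in a dictionary, and `select_best_view` used to pick the view (or to decide which views to fuse). The scores themselves must come from actual evaluation on the challenge data.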

Cited By

  • (2025) Request and complaint recognition in call-center speech using a pointwise-convolution recurrent network. International Journal of Speech Technology. DOI: 10.1007/s10772-025-10171-7. Online publication date: 5-Feb-2025
  • (2024) Self-Supervised Learning-Based General Fine-tuning Framework For Audio Classification and Event Detection. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687821. Online publication date: 15-Jul-2024
  • (2024) Cascaded cross-modal transformer for audio–textual classification. Artificial Intelligence Review 57:9. DOI: 10.1007/s10462-024-10869-1. Online publication date: 2-Aug-2024

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. audio classification
  2. automatic audio augmentation
  3. computational paralinguistics
  4. data augmentation

Qualifiers

  • Research-article

Conference

MM '23
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 41
  • Downloads (last 6 weeks): 2

Download counts reflect downloads up to 05 Mar 2025.
