DOI: 10.1145/3581783.3612849
Research article

Automatic Audio Augmentation for Requests Sub-Challenge

Published: 27 October 2023

ABSTRACT

This paper presents our solution for the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge. Building on a self-supervised learning framework, we propose an automated data augmentation technique for audio classification, together with a multi-channel fusion strategy that further improves performance. Specifically, to address class imbalance in complaint classification, we introduce an audio data augmentation method that automatically generates augmentation strategies suited to the challenge dataset. Furthermore, exploiting the dual-channel nature of the HC-C dataset, we individually evaluate the classification performance of the left channel, right channel, channel difference, and channel sum, and then select the optimal integration approach. Our approach yields a significant performance improvement over the competitive baselines, particularly on the complaint task, and also exhibits noteworthy cross-task transferability.
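To make the two ideas in the abstract concrete, the sketch below illustrates (1) a toy search over waveform augmentation policies and (2) the four single-channel views of a dual-channel recording (left, right, channel difference, channel sum). This is a minimal illustration under stated assumptions, not the authors' implementation: the specific augmentation operators, the `evaluate_policy` callback, the 16 kHz sample rate, and the use of librosa are all assumptions introduced here for illustration.

```python
# Minimal sketch (not the paper's code) of the two components described in the
# abstract. Random search stands in for the automated augmentation-strategy
# generation; `evaluate_policy` is a hypothetical callback that would train a
# classifier with the sampled policy and return a development-set score.
import random
import numpy as np
import librosa

# -- (1) automated augmentation: search over simple waveform operations -------
def add_noise(y, mag):
    # Additive Gaussian noise scaled by the policy magnitude.
    return y + mag * np.random.randn(len(y)).astype(y.dtype)

def time_stretch(y, mag):
    # Speed the clip up or down slightly without changing pitch.
    return librosa.effects.time_stretch(y, rate=1.0 + mag)

def pitch_shift(y, mag, sr=16000):
    # Shift pitch by up to a few semitones, proportional to the magnitude.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=4 * mag)

CANDIDATE_OPS = [add_noise, time_stretch, pitch_shift]

def search_policy(evaluate_policy, n_trials=20):
    """Keep the (operation, magnitude) pair that maximises a validation score."""
    best_score, best_policy = -np.inf, None
    for _ in range(n_trials):
        policy = (random.choice(CANDIDATE_OPS), random.uniform(0.0, 0.3))
        score = evaluate_policy(policy)  # e.g. dev-set UAR of a model trained with it
        if score > best_score:
            best_score, best_policy = score, policy
    return best_policy

# -- (2) four single-channel views of a dual-channel recording ----------------
def stereo_views(path, sr=16000):
    # mono=False keeps both channels; y has shape (2, num_samples).
    y, _ = librosa.load(path, sr=sr, mono=False)
    left, right = y[0], y[1]
    return {
        "left": left,
        "right": right,
        "diff": left - right,           # channel difference
        "sum": 0.5 * (left + right),    # channel sum (scaled to avoid clipping)
    }
```

Each of the four views can be fed to the same classifier, with the best-performing channel or combination chosen on the development set, mirroring the selection procedure the abstract describes.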


Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

        Copyright © 2023 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
