ABSTRACT
This paper presents our solution for the Requests Sub-challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge. Building on the framework of self-supervised learning, we propose an automated data augmentation technique for audio classification, together with a multi-channel fusion strategy that enhances overall performance. Specifically, to address class imbalance in complaint classification, we introduce an audio data augmentation method that searches for augmentation strategies suited to the challenge dataset. Furthermore, recognizing the distinctive characteristics of the dual-channel HC-C dataset, we individually evaluate the classification performance of the left channel, right channel, channel difference, and channel sum, and then select the optimal integration approach. Our approach yields a significant improvement over the competitive baselines, particularly on the complaint task. Moreover, our method demonstrates noteworthy cross-task transferability.
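The four single-channel views evaluated in the abstract (left, right, channel sum, channel difference) can be derived directly from a stereo recording. The sketch below is a minimal illustration, assuming the signal is a NumPy array of shape (2, num_samples); the 0.5 scaling is an illustrative normalization choice, not necessarily the one used in the paper.

```python
import numpy as np

def channel_views(stereo: np.ndarray) -> dict:
    """Derive the four single-channel views from a stereo signal
    of shape (2, num_samples): left, right, sum (mid), diff (side)."""
    left, right = stereo[0], stereo[1]
    return {
        "left": left,
        "right": right,
        "sum": 0.5 * (left + right),   # mid / mono downmix
        "diff": 0.5 * (left - right),  # side channel
    }
```

Each view can then be classified independently, and the best-performing combination selected for fusion, as described in the abstract.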