DOI: 10.1145/3581783.3612851

Multi-Layer Acoustic & Linguistic Feature Fusion for ComParE-23 Emotion and Requests Challenge

Published: 27 October 2023

Abstract

The ACM Multimedia 2023 ComParE challenge comprises classification and regression tasks on spoken customer-agent conversations and emotionally rated speech. The challenge baseline systems build upon recent advances in large-scale supervised/unsupervised foundational acoustic models, which demonstrate consistently strong performance across tasks. In this work, with the aim of improving performance further, we present a novel multi-layer feature fusion method. In particular, the proposed approach leverages the hierarchical information in acoustic models via multi-layer statistics pooling, in which we compute a weighted sum of layer-wise features (means and standard deviations). We further experiment with linguistic features and their late fusion with acoustic features, particularly for subtasks involving complex conversations. Exploring various combinations of methods and features, we present four systems tailored to the individual subchallenges, demonstrating significant performance gains over the baseline on the development and test sets.
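The multi-layer statistics pooling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the (layers, frames, dims) layout of the hidden states, and the softmax normalization of the learnable layer weights are all assumptions made for the example.

```python
import numpy as np

def multilayer_stats_pooling(hidden_states, layer_logits):
    """Weighted sum of per-layer (mean, std) statistics.

    hidden_states: array of shape (L, T, D) -- hidden states from L layers
                   of an acoustic model (e.g. wav2vec 2.0 / WavLM),
                   T frames, D feature dimensions.
    layer_logits:  array of shape (L,) -- learnable layer weights
                   (pre-softmax), trained jointly with the classifier.
    Returns a (2*D,) utterance-level embedding.
    """
    mean = hidden_states.mean(axis=1)                 # (L, D) per-layer mean over time
    std = hidden_states.std(axis=1)                   # (L, D) per-layer std over time
    stats = np.concatenate([mean, std], axis=-1)      # (L, 2D) layer-wise statistics
    w = np.exp(layer_logits - np.max(layer_logits))   # numerically stable softmax
    w = w / w.sum()                                   # (L,) layer weights sum to 1
    return (w[:, None] * stats).sum(axis=0)           # (2D,) weighted sum over layers
```

In practice the layer weights would be learnable parameters optimized end-to-end with the downstream classification/regression head; here they are plain arrays for clarity.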


Cited By

  • (2025) Request and complaint recognition in call-center speech using a pointwise-convolution recurrent network. International Journal of Speech Technology. DOI: 10.1007/s10772-025-10171-7. Online publication date: 5-Feb-2025.


      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN:9798400701085
      DOI:10.1145/3581783
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. acoustic modeling
      2. computational paralinguistics
      3. emotion recognition
      4. request categorization

      Qualifiers

      • Research-article

      Conference

      MM '23
Sponsor: MM '23: The 31st ACM International Conference on Multimedia
      October 29 - November 3, 2023
      Ottawa ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
