ABSTRACT
Multi-modal emotion analysis has become an active research field. In real-world scenarios, however, emotion data must often be analyzed and recognized in the presence of noise, and effectively integrating information from different modalities to enhance the overall robustness of the model remains a challenge. To address this, we propose an improved approach that leverages modality latent information to enhance cross-modal interaction and improve the robustness of multi-modal emotion classification models. Specifically, we apply a multi-period-based preprocessing technique to the audio modality. In addition, we introduce a random modality noise injection strategy to augment the training data and improve generalization. Finally, we employ a composite fusion method to integrate features from different modalities, effectively promoting cross-modal information interaction and further enhancing the overall robustness of the model. We evaluate the proposed method on the MER-NOISE sub-challenge of MER2023. Experimental results show that our improved multi-modal emotion classification model achieves a weighted F1 score of 69.66% and an MSE of 0.92 on the MER-NOISE test set, for an overall score of 46.69%, a 5.69% improvement over the baseline. These results demonstrate the effectiveness of the proposed approach in further enhancing model robustness.
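To make the random modality noise injection strategy concrete, the following is a minimal sketch, not the paper's implementation: it assumes per-sample feature vectors for each modality and Gaussian corruption of one randomly chosen modality per sample; the function name and the parameters `p_corrupt` and `noise_std` are hypothetical.

```python
import torch

def random_modality_noise(features, p_corrupt=0.3, noise_std=0.1):
    """Randomly corrupt one modality per sample to augment training data.

    features: dict mapping a modality name ('audio', 'video', 'text')
    to a tensor of shape (batch, dim). Hypothetical interface; the
    exact corruption scheme in the paper may differ.
    """
    names = list(features.keys())
    batch = features[names[0]].size(0)
    out = {m: f.clone() for m, f in features.items()}
    for i in range(batch):
        if torch.rand(1).item() < p_corrupt:
            # Pick one modality at random and inject Gaussian noise
            # into that sample's feature vector.
            m = names[torch.randint(len(names), (1,)).item()]
            out[m][i] += noise_std * torch.randn_like(out[m][i])
    return out
```

Applied to each mini-batch before fusion, a routine of this kind would expose the model to occasionally corrupted audio, visual, or text features, encouraging representations that degrade gracefully under noise.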
- Zheng Lian, Haiyang Sun, Licai Sun, Jinming Zhao, Ye Liu, B. Liu, Jiangyan Yi, M. Wang, E. Cambria, Guoying Zhao, Björn Schuller, and Jianhua Tao. MER 2023: Multi-label learning, modality robustness, and semi-supervised learning. ArXiv, abs/2304.08981, 2023.
- Zheng Lian, Lang Chen, Licai Sun, B. Liu, and Jianhua Tao. GCNet: Graph completion network for incomplete multimodal learning in conversation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:8419--8432, 2022.
- Licai Sun, Zheng Lian, Bin Liu, and Jianhua Tao. Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis. IEEE Transactions on Affective Computing.
- Ziqi Yuan, Wei Li, Hua Xu, and Wenmeng Yu. Transformer-based feature reconstruction network for robust multimodal sentiment analysis. In Proceedings of the 29th ACM International Conference on Multimedia, 2021.
- Björn W. Schuller. Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends. Communications of the ACM, 61(5):90--99, 2018.
- Baijun Xie, Mariia Sidulova, and Chung Hyuk Park. Robust multimodal emotion recognition from conversation with transformer-based crossmodality fusion. Sensors, 21(14):4913, 2021.
- Mingli Song, Mingyu You, Na Li, and Chun Chen. A robust multimodal approach for emotion recognition. Neurocomputing, 71(10):1913--1920, 2008.
- Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. OpenFace: An open source facial behavior analysis toolkit. In IEEE Winter Conference on Applications of Computer Vision, 2016.
- Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In Interspeech, 2021.
- Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Neural Information Processing Systems, 2021.
- Chris Chatfield. The analysis of time series: An introduction. Biometrics, 52(3):1162, 1996.
- Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. arXiv e-prints, 2022.
- Haixu Wu, Teng Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2D-variation modeling for general time series analysis. ArXiv, abs/2210.02186, 2022.
- Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770--778, 2016.
- Z. Zhao, Q. Liu, and S. Wang. Learning deep global multi-scale and local attention features for facial expression recognition in the wild. IEEE Transactions on Image Processing, 30:6544--6556, 2021.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019.
- Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. Revisiting pre-trained models for Chinese natural language processing. ArXiv, abs/2004.13922, 2020.
- Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdel-rahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451--3460, 2021.
- Zheng Lian, Bin Liu, and Jianhua Tao. DECN: Dialogical emotion correction network for conversational emotion recognition. Neurocomputing, 454:483--495, 2021.
- Z. Lian, B. Liu, and J. Tao. CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:985--1000, 2021.
- D. Kollias, A. Schulc, E. Hajiyev, and S. Zafeiriou. Analysing affective behavior in the first ABAW 2020 competition. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 637--643, 2020.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ArXiv, abs/1412.6980, 2014.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929--1958, 2014.