Abstract
Humor segment prediction in video requires comprehending and analyzing humor. Humor prediction has traditionally been text-based, but with the evolution of multimedia the focus has shifted to multimodal approaches, which are now a major research trend. Determining whether a video is humorous remains a challenge within sentiment analysis, and researchers have proposed numerous data fusion methods for humor prediction and sentiment analysis. In studies of humor and emotion, the text modality plays the leading role, while the audio and video modalities serve as supplementary sources for multimodal humor prediction. However, these auxiliary modalities contain a large amount of information irrelevant to the prediction task, resulting in redundancy. Current multimodal fusion models concentrate on the fusion method itself and overlook this high redundancy in the auxiliary modalities; left unaddressed, the redundancy introduces noise, increasing training complexity and lowering predictive accuracy. Developing a humor prediction method that effectively reduces redundancy in the auxiliary modalities is therefore pivotal for advancing multimodal research. In this paper, we propose the Feature Enhanced Fusion Network (FEF-Net), which uses cross-modal attention to enhance auxiliary-modality features with knowledge from the text. The attention mechanism generates weights that indicate the relevance of each corresponding time slice in the auxiliary modality, suppressing redundant information, and Transformer encoders then extract high-level features from each modality, improving humor prediction performance. Experiments on the UR-FUNNY and MUStARD multimodal humor prediction datasets show a 3.2% improvement in ‘Acc-2’ over the best-performing baseline model.
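To make the mechanism described above concrete, the following is a minimal, hypothetical PyTorch sketch of text-guided cross-modal attention over an auxiliary modality followed by a Transformer encoder. It is not the authors' released implementation; the module names, feature dimensions, and the choice of text-as-query attention are illustrative assumptions only.

import torch
import torch.nn as nn

class CrossModalEnhancer(nn.Module):
    """Hypothetical sketch: re-weights time slices of an auxiliary modality
    (audio or video) with attention weights derived from the text modality,
    then extracts higher-level features with a Transformer encoder."""

    def __init__(self, text_dim, aux_dim, d_model=128, n_heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)
        self.aux_proj = nn.Linear(aux_dim, d_model)
        # Text provides the query; the auxiliary stream provides keys and values,
        # so each auxiliary time slice is weighted by its relevance to the text.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_seq, aux_seq):
        # text_seq: (batch, T_text, text_dim); aux_seq: (batch, T_aux, aux_dim)
        q = self.text_proj(text_seq)
        kv = self.aux_proj(aux_seq)
        enhanced, attn_weights = self.cross_attn(q, kv, kv)
        return self.encoder(enhanced), attn_weights

# Toy usage with made-up feature sizes (e.g., 768-dim text, 81-dim acoustic features).
text = torch.randn(2, 20, 768)
audio = torch.randn(2, 50, 81)
model = CrossModalEnhancer(text_dim=768, aux_dim=81)
features, weights = model(text, audio)
print(features.shape, weights.shape)  # torch.Size([2, 20, 128]) torch.Size([2, 20, 50])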
Data availability
The original datasets have been published online. The datasets generated or analyzed during the current study are available from the corresponding author upon reasonable request.
Author information
Contributions
Peng Gao (First Author): Conceptualization, Methodology, Software, Visualization, Analysis, Writing - Original Draft; Chuanqi Tao (Corresponding Author): Conceptualization, Resources, Supervision, Writing - Review & Editing; Donghai Guan: Conceptualization, Resources, Supervision, Writing - Review & Editing
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Communicated by Qianqian Xu.
About this article
Cite this article
Gao, P., Tao, C. & Guan, D. FEF-Net: feature enhanced fusion network with crossmodal attention for multimodal humor prediction. Multimedia Systems 30, 195 (2024). https://doi.org/10.1007/s00530-024-01402-z