
FEF-Net: feature enhanced fusion network with crossmodal attention for multimodal humor prediction

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Humor segment prediction in video requires the comprehension and analysis of humor. Traditional humor prediction has been text-based; with the evolution of multimedia, however, the focus has shifted to multimodal approaches, which represent a current trend in research. Determining whether a video is humorous remains a challenge within the domain of sentiment analysis, and researchers have proposed multiple data fusion methods for humor prediction and sentiment analysis. In the study of humor and emotion, the text modality assumes the leading role, while the audio and video modalities serve as supplementary data sources for multimodal humor prediction. However, these auxiliary modalities contain a large amount of information irrelevant to the prediction task, resulting in redundancy. Current multimodal fusion models primarily emphasize fusion methods and overlook the high redundancy of the auxiliary modalities; this unaddressed redundancy introduces noise, increasing the overall training complexity of models and diminishing predictive accuracy. Developing a humor prediction method that effectively reduces redundancy in the auxiliary modalities is therefore pivotal for advancing multimodal research. In this paper, we propose the Feature Enhanced Fusion Network (FEF-Net), which leverages cross-modal attention to augment the features of the auxiliary modalities with knowledge from the textual data. This mechanism generates a weight for each time slice of the auxiliary modalities, suppressing redundant content. Transformer encoders then extract high-level features for each modality, further improving the performance of the humor prediction model. Experimental comparisons on the UR-FUNNY and MUStARD multimodal humor datasets show a 3.2% improvement in 'Acc-2' over the best-performing baseline model.
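To make the mechanism described in the abstract concrete, the following is a minimal, PyTorch-style sketch (not the authors' released code) of cross-modal attention in which text features act as queries over the time slices of an auxiliary modality, so that the attention weights suppress slices that are redundant with respect to the text, and a Transformer encoder then extracts high-level features. All class names, dimensions, and layer counts here are illustrative assumptions, not details taken from FEF-Net.

```python
# Hypothetical sketch of text-guided cross-modal attention followed by a
# Transformer encoder; dimensions and module names are illustrative only.
import torch
import torch.nn as nn


class CrossModalFeatureEnhancer(nn.Module):
    def __init__(self, text_dim=768, aux_dim=128, d_model=256, n_heads=4):
        super().__init__()
        # Project both modalities into a shared space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.aux_proj = nn.Linear(aux_dim, d_model)
        # Cross-modal attention: text queries attend over auxiliary time slices.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Per-modality Transformer encoder for high-level features.
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=512, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, text_feats, aux_feats):
        # text_feats: (batch, T_text, text_dim); aux_feats: (batch, T_aux, aux_dim)
        q = self.text_proj(text_feats)
        kv = self.aux_proj(aux_feats)
        # Attention weights reflect how relevant each auxiliary time slice is
        # to the textual content; redundant slices receive low weight.
        enhanced, attn_weights = self.cross_attn(q, kv, kv)
        # Encode the text-enhanced sequence to obtain high-level features.
        return self.encoder(enhanced), attn_weights


if __name__ == "__main__":
    model = CrossModalFeatureEnhancer()
    text = torch.randn(2, 20, 768)   # e.g. token-level text features
    audio = torch.randn(2, 50, 128)  # e.g. frame-level acoustic features
    out, w = model(text, audio)
    print(out.shape, w.shape)        # (2, 20, 256), (2, 20, 50)
```

In this sketch the attention-weight matrix is what plays the role of the per-time-slice weights described above; a full model would apply the same enhancement to both audio and video streams before fusing them with the text representation for classification.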



Data availability

The original datasets have been published online. The datasets generated or analyzed during the current study are available from the corresponding author upon reasonable request.


Author information

Contributions

Peng Gao (First Author): Conceptualization, Methodology, Software, Visualization, Analysis, Writing - Original Draft; Chuanqi Tao (Corresponding Author): Conceptualization, Resources, Supervision, Writing - Review & Editing; Donghai Guan: Conceptualization, Resources, Supervision, Writing - Review & Editing.

Corresponding author

Correspondence to Chuanqi Tao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Communicated by Qianqian Xu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gao, P., Tao, C. & Guan, D. FEF-Net: feature enhanced fusion network with crossmodal attention for multimodal humor prediction. Multimedia Systems 30, 195 (2024). https://doi.org/10.1007/s00530-024-01402-z

