Abstract
Humor segment prediction in video requires comprehending and analyzing humor. Humor prediction has traditionally been text-based, but with the evolution of multimedia the focus has shifted to multimodal approaches, which are now a major research trend. Determining whether a video is humorous remains a challenge within sentiment analysis, and researchers have proposed numerous data fusion methods for humor prediction and sentiment analysis. In studies of humor and emotion, the text modality plays the leading role, while the audio and video modalities serve as supplementary sources for multimodal humor prediction. However, these auxiliary modalities contain a large amount of information irrelevant to the prediction task, resulting in redundancy. Current multimodal fusion models concentrate on the fusion method itself and overlook this high redundancy in the auxiliary modalities; left unaddressed, the redundancy introduces noise, increasing training complexity and lowering predictive accuracy. Developing a humor prediction method that effectively reduces redundancy in the auxiliary modalities is therefore pivotal for advancing multimodal research. In this paper, we propose the Feature Enhanced Fusion Network (FEF-Net), which uses cross-modal attention to enhance auxiliary-modality features with knowledge from the text. The attention mechanism generates weights that indicate the relevance of each corresponding time slice in the auxiliary modality, suppressing redundant information, and Transformer encoders then extract high-level features from each modality, improving humor prediction performance. Experiments on the UR-FUNNY and MUStARD multimodal humor prediction datasets show a 3.2% improvement in ‘Acc-2’ over the best-performing baseline model.
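To make the mechanism described above concrete, the following is a minimal, hypothetical PyTorch sketch of text-guided cross-modal attention over an auxiliary modality followed by a Transformer encoder. It is not the authors' released implementation; the module names, feature dimensions, and the choice of text-as-query attention are illustrative assumptions only.

import torch
import torch.nn as nn

class CrossModalEnhancer(nn.Module):
    """Hypothetical sketch: re-weights time slices of an auxiliary modality
    (audio or video) with attention weights derived from the text modality,
    then extracts higher-level features with a Transformer encoder."""

    def __init__(self, text_dim, aux_dim, d_model=128, n_heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)
        self.aux_proj = nn.Linear(aux_dim, d_model)
        # Text provides the query; the auxiliary stream provides keys and values,
        # so each auxiliary time slice is weighted by its relevance to the text.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_seq, aux_seq):
        # text_seq: (batch, T_text, text_dim); aux_seq: (batch, T_aux, aux_dim)
        q = self.text_proj(text_seq)
        kv = self.aux_proj(aux_seq)
        enhanced, attn_weights = self.cross_attn(q, kv, kv)
        return self.encoder(enhanced), attn_weights

# Toy usage with made-up feature sizes (e.g., 768-dim text, 81-dim acoustic features).
text = torch.randn(2, 20, 768)
audio = torch.randn(2, 50, 81)
model = CrossModalEnhancer(text_dim=768, aux_dim=81)
features, weights = model(text, audio)
print(features.shape, weights.shape)  # torch.Size([2, 20, 128]) torch.Size([2, 20, 50])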
Data availability
The original datasets have been published online. The datasets generated or analyzed during the current study are available from the corresponding author upon reasonable request.
Author information
Contributions
Peng Gao (First Author): Conceptualization, Methodology, Software, Visualization, Analysis, Writing - Original Draft; Chuanqi Tao (Corresponding Author): Conceptualization, Resources, Supervision, Writing - Review & Editing; Donghai Guan: Conceptualization, Resources, Supervision, Writing - Review & Editing
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Communicated by Qianqian Xu.
About this article
Cite this article
Gao, P., Tao, C. & Guan, D. FEF-Net: feature enhanced fusion network with crossmodal attention for multimodal humor prediction. Multimedia Systems 30, 195 (2024). https://doi.org/10.1007/s00530-024-01402-z