Abstract
Multi-modal fusion methods for movie genre classification have been shown to outperform their single-modality counterparts. However, designing a fusion strategy for real-world scenarios, where missing data and weak labeling are common, remains challenging. Because of the heterogeneity across modalities, most existing works adopt late fusion strategies that process and train a model per modality and combine the results at the decision level. A major drawback of such strategies is the potential loss of cross-modality dependencies, which are important for understanding audiovisual content. In this paper, we introduce a Shot-based Hybrid Fusion Network (SHFN) for movie genre classification. It consists of single-modal feature fusion networks for video and audio, a multi-modal feature fusion network operating on a shot basis, and a late fusion stage for video-level decisions. An ablation study indicates that video is the major contributor and that the additional audio modality yields a further performance gain. Experimental results on the LMTD-9 dataset demonstrate the effectiveness of the proposed method: our best model outperforms the state of the art by 5.7% in micro-averaged AUPRC.
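The hybrid pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: all dimensions, the mean-pooling fusion operators, and the fixed linear "classifier" are assumptions chosen only to show how single-modal fusion, shot-level multi-modal fusion, and video-level late fusion compose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- not taken from the paper.
N_SHOTS, D_VIDEO, D_AUDIO, N_GENRES = 8, 16, 8, 9

def fuse_modality(frames):
    """Single-modal feature fusion: pool frame-level features into one shot feature."""
    return frames.mean(axis=0)

def shot_fusion(v_feat, a_feat):
    """Multi-modal fusion on a shot basis: join video and audio shot features."""
    return np.concatenate([v_feat, a_feat])

def late_fusion(shot_scores):
    """Late fusion: aggregate shot-level genre scores into a video-level decision."""
    return shot_scores.mean(axis=0)

# Toy inputs: per-shot sequences of frame features for each modality.
video_shots = [rng.normal(size=(5, D_VIDEO)) for _ in range(N_SHOTS)]
audio_shots = [rng.normal(size=(5, D_AUDIO)) for _ in range(N_SHOTS)]

# Stand-in multi-label classifier: a fixed linear map followed by a sigmoid.
W = rng.normal(size=(D_VIDEO + D_AUDIO, N_GENRES))
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

shot_scores = np.stack([
    sigmoid(shot_fusion(fuse_modality(v), fuse_modality(a)) @ W)
    for v, a in zip(video_shots, audio_shots)
])
video_scores = late_fusion(shot_scores)  # one probability per genre
print(video_scores.shape)
```

Fusing features at the shot level (rather than only at the decision level) is what preserves the cross-modality dependencies that a pure late-fusion design would discard; the late-fusion stage then only aggregates already-joint shot decisions.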
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bi, T., Jarnikov, D., Lukkien, J. (2022). Shot-Based Hybrid Fusion for Movie Genre Classification. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13231. Springer, Cham. https://doi.org/10.1007/978-3-031-06427-2_22
Print ISBN: 978-3-031-06426-5
Online ISBN: 978-3-031-06427-2